Unicode line separator U+2028

Posted: Wed Jul 27, 2011 2:51 am
by DFH
How does TextPipe handle the Unicode line separator U+2028?

e.g. If the Files to be Processed have these as the EOL marker.

Assume that these are Unicode files - encoded in either UTF-16 LE or UTF-8 (with or without BOM).

Also, how is it treated in Perl pattern matching?
e.g. In the Patterns options button [...] dialog, which includes the tick option '.' matches newline.

David

PS. The attachment contains a simple TP filter to convert EOLs to U+2028.
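
For readers without the attachment, here is a rough Python sketch of what such a conversion amounts to. It is not the TextPipe filter itself. The function name convert_eols and the sample string are made up for illustration, and it assumes the existing EOLs are CRLF, lone CR, or lone LF.

```python
import re

LINE_SEPARATOR = "\u2028"       # Unicode LINE SEPARATOR
PARAGRAPH_SEPARATOR = "\u2029"  # Unicode PARAGRAPH SEPARATOR

def convert_eols(text: str, separator: str = LINE_SEPARATOR) -> str:
    """Replace each CRLF, lone CR, or lone LF with the given separator.

    Hypothetical helper for illustration; not part of TextPipe.
    """
    # CRLF is listed first so a Windows EOL becomes one separator, not two.
    return re.sub(r"\r\n|\r|\n", separator, text)

sample = "one\r\ntwo\nthree\rfour"
converted = convert_eols(sample)

# Byte forms of U+2028 in the two encodings mentioned in the question.
utf8_bytes = LINE_SEPARATOR.encode("utf-8")
utf16le_bytes = LINE_SEPARATOR.encode("utf-16-le")
```

Passing PARAGRAPH_SEPARATOR as the second argument would give the U+2029 variant instead.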

Re: Unicode line separator U+2028

Posted: Wed Jul 27, 2011 9:20 am
by DataMystic Support
Thanks David - we've included your filter in a new 'Unicode' filter subfolder.

I don't believe that PCRE (the library we use) pattern matching handles anything other than \r, \r\n and \n as line endings.

Re: Unicode line separator U+2028

Posted: Wed Jul 27, 2011 10:47 pm
by DFH
Well, well, well.

The help page entitled Unicode Pattern Reference includes this:
Definitions

Separator - any one of U+2028, U+2029, NL, CR.
So this suggests that TextPipe ought to be able to handle U+2028 and U+2029.

Something overlooked, perhaps?

David

PS. Of the various Unicode-compatible text editors (for Windows) that I use regularly, only SC Unipad handles these correctly.
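
To make the Separator definition quoted above concrete, here is a small Python sketch that splits text on that set of characters. It uses Python's re module, not PCRE or TextPipe, so it says nothing about how TextPipe behaves. It reads "NL" as line feed (U+000A) and treats CR followed by NL as a single break; both are assumptions. The name split_on_separators is made up.

```python
import re

# Separator set as quoted from the help page: U+2028, U+2029, NL, CR.
# Assumptions: NL means U+000A, and CR+NL counts as one break.
SEPARATOR = re.compile("\r\n|[\u2028\u2029\n\r]")

def split_on_separators(text: str) -> list[str]:
    """Split text at every separator; hypothetical helper for illustration."""
    return SEPARATOR.split(text)

lines = split_on_separators("a\u2028b\u2029c\r\nd\ne\rf")
```

Python's built-in str.splitlines() also treats U+2028 and U+2029 as line boundaries, which makes it a handy cross-check when testing files like these.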

Re: Unicode line separator U+2028

Posted: Wed Jul 27, 2011 10:52 pm
by DFH
FWIW, here's a similar filter to change EOLs to U+2029 Paragraph Separator.