PCRE POSIX Character Class [[:punct:]] and non-Roman scripts


Post by DFH » Mon Jan 14, 2019 7:29 pm

The PCRE POSIX Character Class [[:punct:]] does not find punctuation marks for any non-Roman scripts.
In UTF-8 mode, characters with code points greater than 255 do not match any of the POSIX character classes.
So for example, it does not find these punctuation marks in a UTF-8 file containing Farsi (Persian) content:

U+060C	،	1,679	ARABIC COMMA
U+061B	؛	50	ARABIC SEMICOLON
U+061F	؟	156	ARABIC QUESTION MARK

By comparison, the same character class does find these in Notepad++.
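
To illustrate the difference outside TextPipe, here is a minimal Python sketch. It is not TextPipe or PCRE code; it only contrasts an ASCII-only notion of punctuation with one based on the Unicode general category. The helper names `is_ascii_punct` and `is_unicode_punct` are made up for this example.

```python
import string
import unicodedata

# The three marks from the list above:
# ARABIC COMMA, ARABIC SEMICOLON, ARABIC QUESTION MARK
ARABIC_MARKS = "\u060C\u061B\u061F"


def is_ascii_punct(ch):
    """ASCII-only test, roughly what [[:punct:]] covers without Unicode support."""
    return ch in string.punctuation


def is_unicode_punct(ch):
    """Unicode-aware test: any general category starting with 'P' (Po, Pd, Ps, ...)."""
    return unicodedata.category(ch).startswith("P")


# The ASCII-only test misses all three marks; the Unicode test finds them.
ascii_hits = [ch for ch in ARABIC_MARKS if is_ascii_punct(ch)]
unicode_hits = [ch for ch in ARABIC_MARKS if is_unicode_punct(ch)]

for ch in ARABIC_MARKS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"ascii={is_ascii_punct(ch)} unicode={is_unicode_punct(ch)}")
```

Note that the two tests are not identical for ASCII either: characters such as `$` or `+` are in the ASCII punctuation set but have a Unicode symbol category (S...), so a real Unicode-aware [[:punct:]] would probably need to keep those as well.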

TextPipe should be enhanced so that this and similar character classes cover the full scope of Unicode.

e.g. [[:digit:]] should be extended to cover the digit characters of all non-Roman scripts, etc.
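
The behaviour I am asking for with digits can be sketched the same way in Python, again only as an illustration and not as TextPipe code. Python's `re` module has no POSIX classes, so `\d` stands in for [[:digit:]] here, and the `re.ASCII` flag stands in for the current ASCII-only behaviour. The sample digits are my own choice for the example.

```python
import re

# Persian digits 1, 2, 3 (EXTENDED ARABIC-INDIC DIGIT ONE..THREE)
PERSIAN_DIGITS = "\u06F1\u06F2\u06F3"

# ASCII-only matching: \d is limited to 0-9, so nothing is found.
ascii_only = re.findall(r"\d", PERSIAN_DIGITS, flags=re.ASCII)

# Unicode-aware matching (the default for str patterns): \d matches
# any character in the Unicode decimal digit category (Nd).
unicode_aware = re.findall(r"\d", PERSIAN_DIGITS)

print("ASCII-only matches:   ", len(ascii_only))
print("Unicode-aware matches:", len(unicode_aware))
```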
