"In UTF-8 mode, characters with values greater than 255 do not match any of the POSIX character classes."
So, for example, it does not find these punctuation marks in a UTF-8 file containing Farsi (Persian) content:
U+060C  ،  1,679  ARABIC COMMA
U+061B  ؛  50  ARABIC SEMICOLON
U+061F  ؟  156  ARABIC QUESTION MARK
TextPipe should be enhanced so that this and similar character classes cover the full scope of Unicode.
E.g. [[:digit:]] should be extended to cover the number characters in all non-Roman scripts, etc.
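To illustrate the behaviour being asked for, here is a minimal sketch in Python (not TextPipe, and not a claim about how TextPipe would implement it). It assumes "punctuation" means any character whose Unicode general category starts with "P", and it uses Python's own `\d`, which matches Unicode decimal digits unless `re.ASCII` is set. The sample string and the helper name `is_unicode_punct` are made up for the example.

```python
import re
import unicodedata


def is_unicode_punct(ch: str) -> bool:
    """Hypothetical helper: True if ch is in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


# Made-up Persian sample containing U+060C, U+061B, U+061F
# and the Persian digits U+06F1..U+06F3.
sample = (
    "\u0633\u0644\u0627\u0645\u060C "
    "\u062F\u0646\u06CC\u0627\u061B "
    "\u062E\u0648\u0628\u06CC\u061F "
    "\u06F1\u06F2\u06F3"
)

# Unicode-aware punctuation test: finds the three Arabic-script marks.
found = [ch for ch in sample if is_unicode_punct(ch)]

# ASCII-only digit matching, roughly what a class limited to low code points does.
ascii_digits = re.findall(r"\d", sample, flags=re.ASCII)

# Unicode-aware digit matching: also covers digits from non-Roman scripts.
unicode_digits = re.findall(r"\d", sample)

print(found)           # the Arabic comma, semicolon and question mark
print(ascii_digits)    # empty list
print(unicode_digits)  # the three Persian digits
```

The ASCII-only search finds nothing in this text, while the Unicode-aware versions pick up the punctuation and digits, which is the difference the request above is about.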