Unicode pattern reference help

DFH
Posts: 654
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Postby DFH » Thu Jul 28, 2011 12:09 am

The help for Unicode Pattern Reference mentions character property classes (in the Notes section).

Character property classes are not tabulated in the help anywhere.
I found the following list here: http://www.koders.com/delphi/fidDBC6499E937AD0723EE4EC7F01E79BA9DA6DB5FF.aspx?s=algorithm#L208

Code:

  //   Notes:
  //     o  Character property classes are \p or \P followed by a comma separated
  //        list of integers between 1 and 32.  These integers are references to
  //        the following character properties:
  //
  //          N   Character Property
  //         --------------------------
  //          1   _URE_NONSPACING
  //          2   _URE_COMBINING
  //          3   _URE_NUMDIGIT
  //          4   _URE_NUMOTHER
  //          5   _URE_SPACESEP
  //          6   _URE_LINESEP
  //          7   _URE_PARASEP
  //          8   _URE_CNTRL
  //          9   _URE_PRIVATE
  //         10   _URE_UPPER   (note: the upper, lower and title case classes need
  //         11   _URE_LOWER          case-sensitive search enabled to match correctly!)
  //         12   _URE_TITLE
  //         13   _URE_MODIFIER
  //         14   _URE_OTHERLETTER
  //         15   _URE_DASHPUNCT
  //         16   _URE_OPENPUNCT
  //         17   _URE_CLOSEPUNCT
  //         18   _URE_OTHERPUNCT
  //         19   _URE_MATHSYM
  //         20   _URE_CURRENCYSYM
  //         21   _URE_OTHERSYM
  //         22   _URE_LTR
  //         23   _URE_RTL
  //         24   _URE_EURONUM
  //         25   _URE_EURONUMSEP
  //         26   _URE_EURONUMTERM
  //         27   _URE_ARABNUM
  //         28   _URE_COMMONSEP
  //         29   _URE_BLOCKSEP
  //         30   _URE_SEGMENTSEP
  //         31   _URE_WHITESPACE
  //         32   _URE_OTHERNEUT
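
To make the syntax described in the notes concrete, here is a minimal Python sketch (not TextPipe code) that decodes a token such as \p10,11 into the property names from the table. The helper name describe_property_class is made up for illustration, and treating \P as the negated form of \p is an assumption; the notes quoted above do not spell that out.

```python
import re

# Property numbers and names copied from the table in the notes above.
URE_PROPERTIES = {
    1: "_URE_NONSPACING", 2: "_URE_COMBINING", 3: "_URE_NUMDIGIT",
    4: "_URE_NUMOTHER", 5: "_URE_SPACESEP", 6: "_URE_LINESEP",
    7: "_URE_PARASEP", 8: "_URE_CNTRL", 9: "_URE_PRIVATE",
    10: "_URE_UPPER", 11: "_URE_LOWER", 12: "_URE_TITLE",
    13: "_URE_MODIFIER", 14: "_URE_OTHERLETTER", 15: "_URE_DASHPUNCT",
    16: "_URE_OPENPUNCT", 17: "_URE_CLOSEPUNCT", 18: "_URE_OTHERPUNCT",
    19: "_URE_MATHSYM", 20: "_URE_CURRENCYSYM", 21: "_URE_OTHERSYM",
    22: "_URE_LTR", 23: "_URE_RTL", 24: "_URE_EURONUM",
    25: "_URE_EURONUMSEP", 26: "_URE_EURONUMTERM", 27: "_URE_ARABNUM",
    28: "_URE_COMMONSEP", 29: "_URE_BLOCKSEP", 30: "_URE_SEGMENTSEP",
    31: "_URE_WHITESPACE", 32: "_URE_OTHERNEUT",
}

# \p or \P followed by a comma separated list of integers.
_CLASS_RE = re.compile(r"\\([pP])(\d+(?:,\d+)*)")


def describe_property_class(token):
    """Return (negated, [property names]) for a token like r'\\p10,11'.

    Hypothetical helper: 'negated' is True for \\P, which is assumed
    (not confirmed by the notes) to mean "does not have these properties".
    """
    match = _CLASS_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a property class: {token!r}")
    numbers = [int(n) for n in match.group(2).split(",")]
    for n in numbers:
        if n not in URE_PROPERTIES:
            raise ValueError(f"property number out of range 1..32: {n}")
    return match.group(1) == "P", [URE_PROPERTIES[n] for n in numbers]


print(describe_property_class(r"\p10,11"))
# (False, ['_URE_UPPER', '_URE_LOWER'])
```

Whether TextPipe numbers its properties this way is a separate question; the sketch only mirrors the table as quoted.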


It might be sensible to document property classes in the TextPipe help file.
Whether the above definitions are the correct ones for TextPipe is not for me to say.

Also, the type name TCharacterCategory is not defined anywhere in the help, even though the notes refer to it twice.

David

jorjastandish
Posts: 1
Joined: Thu Aug 18, 2011 11:15 am

Postby jorjastandish » Thu Aug 18, 2011 11:31 am

Thank you for sharing this one. :)

DataMystic Support
Site Admin
Posts: 2164
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Postby DataMystic Support » Thu Aug 18, 2011 5:29 pm

Thanks David.

This will go into the help file for the next release (not 8.9.4).

Code:

  TCharacterCategory = (
    // normative categories
    ccLetterUppercase,            //  0
    ccLetterLowercase,            //  1
    ccLetterTitlecase,            //  2
    ccMarkNonSpacing,             //  3
    ccMarkSpacingCombining,       //  4
    ccMarkEnclosing,              //  5
    ccNumberDecimalDigit,         //  6
    ccNumberLetter,               //  7
    ccNumberOther,                //  8
    ccSeparatorSpace,             //  9
    ccSeparatorLine,              // 10
    ccSeparatorParagraph,         // 11
    ccOtherControl,               // 12
    ccOtherFormat,                // 13
    ccOtherSurrogate,             // 14
    ccOtherPrivate,               // 15
    ccOtherUnassigned,            // 16
    // informative categories
    ccLetterModifier,             // 17
    ccLetterOther,                // 18
    ccPunctuationConnector,       // 19
    ccPunctuationDash,            // 20
    ccPunctuationOpen,            // 21
    ccPunctuationClose,           // 22
    ccPunctuationInitialQuote,    // 23
    ccPunctuationFinalQuote,      // 24
    ccPunctuationOther,           // 25
    ccSymbolMath,                 // 26
    ccSymbolCurrency,             // 27
    ccSymbolModifier,             // 28
    ccSymbolOther,                // 29
    // bidirectional categories
    ccLeftToRight,                // 30
    ccLeftToRightEmbedding,       // 31
    ccLeftToRightOverride,        // 32
    ccRightToLeft,                // 33
    ccRightToLeftArabic,          // 34
    ccRightToLeftEmbedding,       // 35
    ccRightToLeftOverride,        // 36
    ccPopDirectionalFormat,       // 37
    ccEuropeanNumber,             // 38
    ccEuropeanNumberSeparator,    // 39
    ccEuropeanNumberTerminator,   // 40
    ccArabicNumber,               // 41
    ccCommonNumberSeparator,      // 42
    ccBoundaryNeutral,            // 43
    ccSegmentSeparator,           // 44  this includes tab and vertical tab
    ccWhiteSpace,                 // 45
    ccOtherNeutrals,              // 46
    // self defined categories, they do not appear in the Unicode data file
    ccComposed,                   // 47  can be decomposed
    ccNonBreaking,                // 48
    ccSymmetric,                  // 49  has left and right forms
    ccHexDigit,                   // 50
    ccQuotationMark,              // 51
    ccMirroring,                  // 52
    ccSpaceOther,                 // 53
    ccAssigned                    // 54  means there is a definition in the Unicode standard
  );
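
For anyone wanting to experiment outside TextPipe: going by the names, entries 0 to 29 appear to line up with the standard Unicode general categories (Lu, Ll, ... So). The Python sketch below uses the standard unicodedata module to report which entry a character would fall under. The pairing of two-letter codes with enum entries is an assumption based on the names only, the function name is hypothetical, and nothing here shows how TextPipe itself evaluates \p.

```python
import unicodedata

# Unicode general category codes, listed in the same order as entries
# 0..29 of TCharacterCategory. ASSUMPTION: the pairing is inferred from
# the enum names; it is not taken from TextPipe documentation.
GENERAL_CATEGORIES = [
    "Lu", "Ll", "Lt", "Mn", "Mc", "Me", "Nd", "Nl", "No", "Zs",
    "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn", "Lm", "Lo", "Pc",
    "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So",
]

# Enum names for ordinals 0..29, copied from the declaration above.
CATEGORY_NAMES = [
    "ccLetterUppercase", "ccLetterLowercase", "ccLetterTitlecase",
    "ccMarkNonSpacing", "ccMarkSpacingCombining", "ccMarkEnclosing",
    "ccNumberDecimalDigit", "ccNumberLetter", "ccNumberOther",
    "ccSeparatorSpace", "ccSeparatorLine", "ccSeparatorParagraph",
    "ccOtherControl", "ccOtherFormat", "ccOtherSurrogate",
    "ccOtherPrivate", "ccOtherUnassigned", "ccLetterModifier",
    "ccLetterOther", "ccPunctuationConnector", "ccPunctuationDash",
    "ccPunctuationOpen", "ccPunctuationClose",
    "ccPunctuationInitialQuote", "ccPunctuationFinalQuote",
    "ccPunctuationOther", "ccSymbolMath", "ccSymbolCurrency",
    "ccSymbolModifier", "ccSymbolOther",
]


def general_category_entry(ch):
    """Return (ordinal, enum name) for a single character (hypothetical helper)."""
    code = unicodedata.category(ch)  # e.g. 'Lu' for 'A'
    ordinal = GENERAL_CATEGORIES.index(code)
    return ordinal, CATEGORY_NAMES[ordinal]


for ch in ("A", "a", "5", "$"):
    print(repr(ch), general_category_entry(ch))
```

The bidirectional and self defined entries (30 to 54) are left out because they do not correspond to general category codes.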

Regards,

Simon Carter, http://DataMystic.com/forums/index.php

Postby DFH » Thu Aug 25, 2011 1:09 am

To be consistent, the CamelCase for item 36 should be ccRightToLeftOverride.

Postby DFH » Thu Aug 25, 2011 1:13 am

Please also give some thought to providing a number of examples for using TCharacterCategory in TextPipe filters.

Postby DFH » Thu Aug 25, 2011 1:17 am

This looks interesting. I found it by Googling for TCharacterCategory:

http://www.codeproject.com/KB/dotnet/UnicodeCharCatHelper.aspx

