Unicode Normalization bug

DFH
Posts: 636
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Unicode Normalization bug

Postby DFH » Mon Oct 10, 2011 7:25 pm

I recently encountered a bug in Normalization to NFC for text containing Myanmar characters.
The bug affected composite characters each of which uses the same pair of combining characters:

U+1037 ့ MYANMAR SIGN DOT BELOW
U+103A ် MYANMAR SIGN ASAT

I suspect that TextPipe uses out-of-date Normalization algorithms.

Some background.

Software that includes Normalization should be tested against the official Unicode Normalization Test, http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt (2.2MB), for the version of Unicode that the software implements.
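
As a rough illustration of what such a test run involves, here is a minimal Python sketch that checks the NFC invariants listed in the header of NormalizationTest.txt (c2 == toNFC(c1) == toNFC(c2) == toNFC(c3), and c4 == toNFC(c4) == toNFC(c5)). It exercises Python's standard unicodedata module, not TextPipe. The names parse_fields and nfc_failures are invented for this example, and the results only mean something if the test file matches the Unicode version reported by unicodedata.unidata_version.

```python
import unicodedata


def parse_fields(line):
    """Turn 'c1;c2;c3;c4;c5; # comment' into five strings.

    Returns None for blank lines, comment lines and '@Part' headers.
    """
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    columns = data.split(";")[:5]
    # Each column is a space-separated list of hexadecimal code points.
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]


def nfc_failures(lines):
    """Return the 1-based numbers of the lines that break an NFC invariant."""

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    failures = []
    for number, line in enumerate(lines, start=1):
        fields = parse_fields(line)
        if fields is None:
            continue
        c1, c2, c3, c4, c5 = fields
        ok = (c2 == nfc(c1) == nfc(c2) == nfc(c3)) and (c4 == nfc(c4) == nfc(c5))
        if not ok:
            failures.append(number)
    return failures
```

With a downloaded copy of the file, something like nfc_failures(open("NormalizationTest.txt", encoding="utf-8")) should return an empty list for a conforming implementation.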

The process of converting a string to NFC or NFD includes a stage called "canonical ordering", in which each run of combining characters (those with a non-zero canonical combining class [ccc]) is sorted into ascending ccc order, with characters of equal ccc keeping their relative order. See http://www.unicode.org/reports/tr15/?win#Description_Norm.
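
The canonical ordering stage can be sketched in a few lines of Python. This is only an illustration, not TextPipe's code: canonical_order is a hypothetical helper, it assumes its input has already been decomposed, and it takes its ccc values from Python's unicodedata module.

```python
import unicodedata


def canonical_order(text):
    """Sort each run of combining marks (ccc != 0) by ascending ccc.

    Assumes `text` is already decomposed. Python's sorted() is stable,
    so marks with the same ccc keep their original relative order.
    """
    out = []
    run = []  # pending combining marks since the last starter
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter (ccc == 0) ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# U+0301 COMBINING ACUTE ACCENT has ccc=230, U+0323 COMBINING DOT BELOW has ccc=220,
# so the dot below is moved in front of the acute accent.
assert canonical_order("a\u0301\u0323") == "a\u0323\u0301"
```

Using a stable sort matters here: two marks with the same ccc interact visually, so swapping them would change the text.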

U+103A MYANMAR SIGN ASAT has ccc=9, whereas U+1037 MYANMAR SIGN DOT BELOW has ccc=7; therefore, wherever U+103A is immediately followed by U+1037, canonical ordering moves U+1037 in front of U+103A.

The bug is that TextPipe does not reorder these two codepoints.
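
For anyone who wants to reproduce the expected result outside TextPipe, here is a small Python check. It is a sketch that assumes Python's unicodedata module is built on a Unicode version recent enough to include U+103A (the version is printed first); code_points is just a helper name made up for display.

```python
import unicodedata

KA = "\u1000"         # MYANMAR LETTER KA
DOT_BELOW = "\u1037"  # MYANMAR SIGN DOT BELOW
ASAT = "\u103a"       # MYANMAR SIGN ASAT


def code_points(text):
    """Format a string as 'U+XXXX U+XXXX ...' for display."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print("Unicode data version:", unicodedata.unidata_version)
print("ccc of U+1037:", unicodedata.combining(DOT_BELOW))
print("ccc of U+103A:", unicodedata.combining(ASAT))

source = KA + ASAT + DOT_BELOW
nfc = unicodedata.normalize("NFC", source)
print(code_points(source), "->", code_points(nfc))
# Last line printed: U+1000 U+103A U+1037 -> U+1000 U+1037 U+103A
```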

David

Re: Unicode Normalization bug

Postby DFH » Mon Oct 10, 2011 7:30 pm

Further details, comparing the current version of Unicode with an old one.

Testing the sequence U+1000 U+103A U+1037 with the ICU Normalization Browser confirms that it does indeed normalize to U+1000 U+1037 U+103A, i.e. with reordering. (The browser is built on ICU, the "International Components for Unicode" library, which is the most widely used Unicode software library.)

See http://bit.ly/nqYzQp.

However, if you run the same test for Unicode 3.2 (released March 2002, and so almost 10 years out of date), there is no reordering:

See http://bit.ly/orZ7df.

NB. I used a URL shortener so that the parameters could be passed to the test page.

Re: Unicode Normalization bug

Postby DFH » Mon Oct 10, 2011 7:34 pm

The attached ZIP file contains a small UTF-8 text file containing 8 composite characters from the Myanmar block of Unicode.

To display Myanmar characters, you may wish to download and install the SIL Padauk font from
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=Padauk

David
Attachments
Test.Myanmar.NFC.zip
Myanmar test file.
(173 Bytes) Downloaded 312 times

DataMystic Support
Site Admin
Posts: 2136
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Unicode Normalization bug

Postby DataMystic Support » Wed Oct 19, 2011 3:57 pm

Hi David,

We are working on an update for you, and once we have finished struggling through the AVL tree differences I will post a new beta for you to try.

Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

Re: Unicode Normalization bug

Postby DataMystic Support » Mon Oct 24, 2011 8:54 pm

Hi David,

TextPipe 8.9.8 has been released:

* Capture Text, Break on Value Change window now shows the length of strings and the current cursor position.
* Updated internal PCRE (Pattern Matching) engine to v8.13 and added support for Unicode 6.0.0.
* Updated internal Unicode libraries to support Unicode 4.1 for Normalization etc.
* COM callees are now notified of stack violations (e.g. during pattern execution) or other critical errors via the existing 'FilterWindow.errorText' variable.
* Only modified files are added back into Zip files such as .zip, .docx, .xlsx and .pptx.

Regards,

Simon Carter, http://DataMystic.com/forums/index.php

Re: Unicode Normalization bug

Postby DFH » Fri Oct 28, 2011 4:22 am

Just downloaded v8.9.8 and about to install it.

Will let you know how I fare.

David

Re: Unicode Normalization bug

Postby DFH » Sat Oct 29, 2011 1:04 am

Having installed v8.9.8 I am pleased to confirm that when normalizing Burmese script to NFC, TextPipe now gives identical results to BabelPad.

Well done - and thanks especially for giving this issue such a high priority.

David

tuandq
Posts: 1
Joined: Sun Mar 25, 2012 5:04 pm

Re: Unicode Normalization bug

Postby tuandq » Sun Mar 25, 2012 5:35 pm

Today I used the NFC filter in TextPipe Pro 9.1 (Evaluation) on some Vietnamese XML and Word XML files. Not only did the filter have no effect on the text, but the Word XML files were also corrupted after the filter was applied! The attachment is a zip file containing a Vietnamese Unicode text file and a Vietnamese Unicode Word XML file. Please test with it!

Regards.

Tuandq.
Attachments
VNUnicode.zip
(2.83 KiB) Downloaded 287 times

Re: Unicode Normalization bug

Postby DFH » Tue Apr 03, 2012 7:25 pm

I did a character frequency analysis for the short XML file - see attached.

In what way was the file corrupted?

Were you running a TextPipe filter on the XML file inside an MS Word .docx file?
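
One possible reason for an NFC filter appearing to do nothing is that the text is already in NFC. That is only a guess here, but it is easy to check independently of TextPipe. The Python sketch below lists the lines that NFC would change; non_nfc_lines is a name invented for this example, and the sample strings are made up rather than taken from the attached files.

```python
import unicodedata


def non_nfc_lines(text):
    """Return (line_number, line) pairs for lines that NFC would change."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if unicodedata.normalize("NFC", line) != line
    ]


# Line 1 uses the precomposed U+1EC7; line 2 spells the same letter as
# 'e' + U+0323 COMBINING DOT BELOW + U+0302 COMBINING CIRCUMFLEX ACCENT.
sample = "Vi\u1ec7t\nVie\u0323\u0302t"
print(non_nfc_lines(sample))  # only line 2 is reported
```

If this returns an empty list for a file, an NFC filter that leaves the file unchanged is behaving correctly.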

David
Attachments
VNUnicode Char Freq.zip
Character frequency analysis (BabelPad)
(1.01 KiB) Downloaded 258 times

