Speed of Unicode Normalization


Post by DFH » Thu Dec 30, 2010 8:16 pm

Is there anything you can do to improve the speed of Unicode Normalization?

I just added a Normalize to NFC filter to process a set of 27 Arabic UTF-8 text files,
and the predicted end time was over 60 minutes while it was still processing the first file!

It's much, much slower than the same function in BabelPad! See
http://www.babelstone.co.uk/Software/BabelPad.html

Maybe there's something you can learn from BabelStone?

PS. When the predicted end time reached 90 minutes, I hit the cancel button.

David
TextPipe Standard user

Re: Speed of Unicode Normalization

Post by DFH » Thu Dec 30, 2010 9:13 pm

OK - I did something wrong. I should have read the help page, which reads:
Found under Filters\Unicode (Standard and Pro)

Applies a Unicode NFC - Canonical Decomposition, followed by Canonical Composition transformation to incoming Unicode text (UTF16-LE).

Output is also Unicode UTF16-LE.


I should have used

Comment...
|  Normalize Unicode to NFC
|
|--Convert from UTF-8 to UTF-16LE
|   
|--NFC - Canonical Decomposition, followed by Canonical Composition
|   
+--Convert from UTF-16LE to UTF-8
   
This works OK and is speedy.
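
For comparison, the same three-step chain (decode UTF-8, normalize to NFC, encode back to UTF-8) can be sketched outside TextPipe in Python. This is only an illustration using the standard `unicodedata` module, not TextPipe's implementation; the function name `normalize_utf8_to_nfc` is made up for the example, and a Python string stands in for the UTF-16LE intermediate step.

```python
import unicodedata


def normalize_utf8_to_nfc(data: bytes) -> bytes:
    """Hypothetical helper: UTF-8 bytes in, NFC-normalized UTF-8 bytes out."""
    # Step 1: decode the UTF-8 bytes into Unicode text
    # (this plays the role of "Convert from UTF-8 to UTF-16LE").
    text = data.decode("utf-8")
    # Step 2: canonical decomposition followed by canonical composition.
    normalized = unicodedata.normalize("NFC", text)
    # Step 3: encode the result back to UTF-8.
    return normalized.encode("utf-8")


# Arabic ALEF (U+0627) followed by combining MADDAH ABOVE (U+0653)
# composes under NFC to ALEF WITH MADDA ABOVE (U+0622).
decomposed = "\u0627\u0653".encode("utf-8")
composed = normalize_utf8_to_nfc(decomposed)
assert composed == "\u0622".encode("utf-8")
```

The point of the sketch is the ordering: normalization is applied to decoded text, never to the raw UTF-8 bytes.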

Therefore the real gripe is that TextPipe attempted to apply NFC to UTF-8 (which it cannot do) without reporting that it's impossible.

I would therefore suggest that the Normalization filters should detect the input encoding
and report an error for any input stream other than UTF-16LE.
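
A rough sketch of what such a check might look like, written in Python rather than anything TextPipe actually does. The function name `require_utf16le` and the error messages are invented for illustration, and the test for BOM-less UTF-8 is only a heuristic: encoding detection without a BOM can never be fully reliable.

```python
import codecs


def require_utf16le(data: bytes) -> None:
    """Hypothetical guard: raise ValueError if data does not look like UTF-16LE."""
    # A byte order mark for another encoding is a clear signal.
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError("input has a UTF-8 BOM; convert to UTF-16LE first")
    if data.startswith(codecs.BOM_UTF16_BE):
        raise ValueError("input has a UTF-16BE BOM; convert to UTF-16LE first")

    # Heuristic (an assumption, not a guarantee): data that contains bytes
    # >= 0x80 and still decodes cleanly as UTF-8 is very likely UTF-8 text.
    if any(b >= 0x80 for b in data):
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            pass  # not valid UTF-8, so carry on with the UTF-16LE checks
        else:
            raise ValueError("input looks like UTF-8; convert to UTF-16LE first")

    # UTF-16 code units are two bytes each.
    if len(data) % 2 != 0:
        raise ValueError("odd byte count; input cannot be UTF-16LE")

    # A strict decode rejects unpaired surrogates.
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError as exc:
        raise ValueError("input is not valid UTF-16LE") from exc
```

Called before the normalization step, a guard like this would turn the silent 90-minute run into an immediate error for the Arabic UTF-8 files.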

David
now wearing my beta-tester hat

Re: Speed of Unicode Normalization

Post by DataMystic Support » Mon Jan 03, 2011 11:48 am

Hi David,

One of TextPipe's strengths is that it makes no assumptions about the data it is processing - it just does what it is told.

This is, at the same time, one of its weaknesses. I envisage a situation where TextPipe can detect input file types (where possible) and then allow that information to flow through the filter list, so that these kinds of issues can be detected.
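
To make the idea concrete, here is a purely hypothetical Python sketch of encoding information travelling along a filter list. None of these names (`Chunk`, `FilterError`, `run`, the filter functions) come from TextPipe; they are assumptions made up to illustrate the design.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    data: bytes
    encoding: str  # encoding label that travels with the data


class FilterError(Exception):
    """Raised when a filter receives data in an encoding it cannot handle."""


def expect(chunk: Chunk, encoding: str, filter_name: str) -> None:
    # Each filter declares what it needs and fails fast otherwise.
    if chunk.encoding != encoding:
        raise FilterError(
            f"{filter_name} needs {encoding} input, got {chunk.encoding}"
        )


def utf8_to_utf16le(chunk: Chunk) -> Chunk:
    expect(chunk, "utf-8", "Convert from UTF-8 to UTF-16LE")
    return Chunk(chunk.data.decode("utf-8").encode("utf-16-le"), "utf-16-le")


def nfc(chunk: Chunk) -> Chunk:
    expect(chunk, "utf-16-le", "NFC")
    text = unicodedata.normalize("NFC", chunk.data.decode("utf-16-le"))
    return Chunk(text.encode("utf-16-le"), "utf-16-le")


def utf16le_to_utf8(chunk: Chunk) -> Chunk:
    expect(chunk, "utf-16-le", "Convert from UTF-16LE to UTF-8")
    return Chunk(chunk.data.decode("utf-16-le").encode("utf-8"), "utf-8")


def run(filters: List[Callable[[Chunk], Chunk]], chunk: Chunk) -> Chunk:
    # Pass the chunk, together with its encoding label, through each filter.
    for apply_filter in filters:
        chunk = apply_filter(chunk)
    return chunk
```

With this shape, `run([utf8_to_utf16le, nfc, utf16le_to_utf8], chunk)` succeeds for a UTF-8 chunk, while `run([nfc], chunk)` raises `FilterError` straight away instead of grinding through the data.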

Regards,

Simon Carter, http://DataMystic.com/forums/index.php

