Normalization of a UTF-8 file?

DFH
Posts: 636
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Normalization of a UTF-8 file?

Postby DFH » Wed Mar 13, 2013 10:23 pm

Would it be feasible to apply Unicode normalization filters directly to UTF-8 encoded files? If not, why not?

It seems rather slow and inefficient to have to convert a UTF-8 file to UTF-16 LE before applying a normalize-to-NFC filter, and then convert it back to UTF-8 afterwards, especially when (say) the proportion of combining characters in the file is relatively low.

David
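For readers who want to see what "normalizing a UTF-8 file directly" could look like outside TextPipe, here is a minimal Python sketch. It is not TextPipe's implementation; the function name `normalize_utf8_bytes` is hypothetical, and `unicodedata.is_normalized` assumes Python 3.8 or later. Python still decodes the bytes into its own internal string form, so this only shows that the caller need not manage an explicit UTF-16 LE round trip, and that a quick check can skip the rewrite when the text is already in NFC.

```python
import unicodedata


def normalize_utf8_bytes(data: bytes, form: str = "NFC") -> bytes:
    """Return UTF-8 bytes normalized to the given form (hypothetical helper)."""
    # Strict decode: raises UnicodeDecodeError if the input is not valid UTF-8.
    text = data.decode("utf-8")
    # Quick check (Python 3.8+): return the input untouched when it is
    # already normalized, so unaffected files are never re-encoded.
    if unicodedata.is_normalized(form, text):
        return data
    return unicodedata.normalize(form, text).encode("utf-8")
```

For example, `normalize_utf8_bytes("e\u0301".encode("utf-8"))` composes the letter and the combining acute accent into the single code point U+00E9 and returns its UTF-8 encoding.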

DataMystic Support
Site Admin
Posts: 2138
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Normalization of a UTF-8 file?

Postby DataMystic Support » Thu Mar 14, 2013 6:21 am

Hi David,

We take advantage of functions that operate with UTF16LE only (and we don't plan to rewrite them).

Are you finding this very slow? Or is it more a question of making it transparent to the user, i.e. having this conversion done in the background?

Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

Re: Normalization of a UTF-8 file?

Postby DFH » Fri Mar 15, 2013 7:26 pm

Normalization of a UTF-8 file of length 5,521,778 bytes took almost 4 minutes.

The number of combining characters in the input file was 195,986 - which is approximately 3% of the total.

The time penalty arises from having to double the number of bytes required to represent the other 97% of the file during the conversion of UTF-8 to UTF-16 LE.

In fact the time penalty occurs twice, because these characters also have to be converted back again to UTF-8, even though they didn't need normalizing in the first place.

Hence my remark in the initial posting.

David
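The figures above can be checked on any sample with a short Python sketch. The helper name `encoding_stats` is hypothetical, and "combining" here means code points with a nonzero canonical combining class as reported by `unicodedata.combining`, which may differ slightly from how the counts above were obtained. Note that the doubling applies to ASCII characters (1 byte in UTF-8, 2 in UTF-16); code points from U+0800 to U+FFFF take 3 bytes in UTF-8 and 2 in UTF-16, so the size ratio depends on the script.

```python
import unicodedata


def encoding_stats(text: str) -> dict:
    """Count combining characters and compare encoded sizes (hypothetical helper)."""
    # unicodedata.combining returns the canonical combining class (0 = none).
    combining = sum(1 for ch in text if unicodedata.combining(ch) != 0)
    return {
        "code_points": len(text),
        "combining": combining,
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16le_bytes": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    sample = "abc e\u0301"  # five base characters plus one combining accent
    print(encoding_stats(sample))
```

Running it on a real file (read with `open(path, encoding="utf-8")`) gives the proportion of combining characters and the size growth that the UTF-16 LE round trip would incur.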

Re: Normalization of a UTF-8 file?

Postby DataMystic Support » Mon Mar 18, 2013 5:20 pm

Hi David,

Understood. Would you be able to send me a compressed sample file to benchmark against?
Regards,

Re: Normalization of a UTF-8 file?

Postby DFH » Thu Mar 21, 2013 8:23 pm

Hi Simon,

I could arrange that when I have a few moments to spare.

David

Re: Normalization of a UTF-8 file?

Postby DataMystic Support » Mon Mar 25, 2013 2:48 pm

One other question David,

TextPipe only considers files with a UTF-8 BOM to be UTF-8, so ANSI files are not considered UTF-8.

I believe that the Unicode spec says that UTF-8 files do not need a BOM. Do you think that TextPipe's Restrict to UTF-8 files should be changed to reflect this?

i.e.
1. Rename the existing filter to Restrict to UTF-8 BOM files
2. Create a new filter for Restrict to UTF-8 files, which allows any files that do not look like UTF-16 or UTF-32.

What do you think?
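To make the two proposed filters concrete, here is a rough Python sketch of the kind of check involved. It is not TextPipe code; the function name `sniff_encoding` and its return labels are made up for illustration. It is also stricter than option 2 as worded: instead of accepting anything that does not look like UTF-16 or UTF-32, it requires the bytes to decode as valid UTF-8. A BOM-less UTF-16 file would not be recognized as UTF-16 by this sketch.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Classify bytes by BOM, then by a strict UTF-8 decode (hypothetical helper)."""
    # Test UTF-32 before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
    # begins with the UTF-16 LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # For example, an ANSI code page file containing bytes >= 0x80.
        return "other"
    return "utf-8"
```

Under this sketch, "Restrict to UTF-8 BOM files" would accept only `"utf-8-bom"`, while the new "Restrict to UTF-8 files" would accept both `"utf-8-bom"` and `"utf-8"`. Pure ASCII files count as `"utf-8"`, since ASCII is a subset of UTF-8.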
Regards,

Re: Normalization of a UTF-8 file?

Postby DFH » Tue Mar 26, 2013 4:03 am

Simon,

I can only report my own experience and common practice.

Many of the files that I handle are encoded as UTF-8 without BOM.
And if the input files are not thus encoded, most of the output files from my filters are.

UTF-8 files with a BOM are seen less often, and you already have a filter to Remove BOM.

David

