TextPipe and Unicode (16LE) files

Posted: Tue Jul 18, 2006 4:24 am
by niccolo
I've heard a lot about TextPipe and decided to try it. I downloaded 7.63 t&b and tried to do simple things with Unicode files, and I can't. It seems it doesn't understand them completely. I tried to remove trailing spaces: nothing. I tried to do the same with \t+\n (these are mostly tabs): nothing again. None of the other Perl patterns work here either, although they worked without any problem in Uedit and EmEditor. Why so? Please don't suggest converting the files to ANSI, because the files contain characters from three character sets: non-standard Western, Cyrillic, and Greek.
Where the help talks about working with Unicode files, it is worse than very bad.
Maybe a Unicodepipe is needed?

Posted: Wed Jul 19, 2006 10:08 am
by DataMystic Support
Hi there,

TextPipe has specific filters to deal with Unicode (UTF16LE) data, such as the Unicode search/replace and Unicode pattern filters. For backward compatibility, the original ANSI/ASCII based filters have not been modified.

So, if you'd like to use the Remove Trailing Spaces filter (which is ASCII), first convert the file to UTF-8, apply the filters, then convert it back.

The initial conversion to UTF-8 is the key here. TextPipe is used for a lot of mainframe data files, so converting EBCDIC to Unicode for internal processing is not an option until the Mainframe record structure has been unravelled.

see no progress for multilanguage files in 8

Posted: Wed Dec 12, 2007 8:51 am
by niccolo
I have downloaded the trial version of TextPipe 8.

Task
I need to create a sorted word list from a UTF-8 (I've taken your previous recommendations into account) txt file containing German and Russian words (umlauts and Cyrillic).

I use Extract matches \w+
Sort ANSI

And the result?

In the trial output everything seems OK, but the resulting file has an unknown encoding.
Opening it as ANSI makes the Russian text completely unreadable. Opening it as UTF-8 shows that all Cyrillic words are damaged and can't be used.

What the hell??? Who is wrong here, me or the program?
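
For comparison, the task described above can be sketched in Python as follows. This is only an illustration of the intended result, not a reproduction of the TextPipe filters: the function name is hypothetical, removing duplicate words is an assumption, and the sort is by Unicode code point rather than by any language-specific collation.

```python
import re


def sorted_wordlist(utf8_data: bytes) -> bytes:
    """Hypothetical helper: extract \\w+ matches from UTF-8 text and sort them."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    text = utf8_data.decode("utf-8-sig")
    # On a str pattern, \w is Unicode-aware, so umlauts and Cyrillic letters match.
    words = set(re.findall(r"\w+", text))  # assumption: duplicates are removed
    # Plain sorted() orders by code point: Latin first, then Cyrillic.
    return "\n".join(sorted(words)).encode("utf-8")


if __name__ == "__main__":
    sample = "\ufeffÜber мир Apfel мир über".encode("utf-8")
    print(sorted_wordlist(sample).decode("utf-8"))
```

The key point is that matching happens on decoded characters. A byte-oriented \w+ applied to the same UTF-8 bytes would not treat the Cyrillic bytes as word characters.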

Posted: Wed Dec 12, 2007 3:10 pm
by DataMystic Support
You may need to add a new UTF-8 BOM to the resulting file - use
Filters\Add\File Header
with text of

\xEF\xBB\xBF
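
The same header can be sketched outside TextPipe. The Python below is an illustration only, with a hypothetical function name; it prepends the three BOM bytes above unless the data already starts with them.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: prepend the UTF-8 BOM (EF BB BF) if it is missing."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


if __name__ == "__main__":
    print(add_utf8_bom("мир".encode("utf-8")))
```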

Posted: Wed Dec 12, 2007 4:32 pm
by niccolo
The problem is not the unknown encoding, which the BOM solves. The problem is corrupted Cyrillic text in the file. What can be done about that?

Posted: Wed Dec 12, 2007 7:43 pm
by DataMystic Support
No, the problem may be that sorting moves the line with the BOM further into the file, hence a new BOM is required.

Anyway, please email us your filter and a sample file.
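
The effect described above, a sort carrying the BOM away from the start of the file, can be illustrated with a short Python sketch. It is not a diagnosis of the sample file in this thread, and the function names are hypothetical.

```python
BOM = "\ufeff"  # the BOM as a decoded character


def naive_sort(text: str) -> str:
    """Sort lines as-is; a leading BOM stays glued to the original first line."""
    return "\n".join(sorted(text.splitlines()))


def sort_keep_bom(text: str) -> str:
    """Hypothetical helper: detach the BOM, sort, then put the BOM back in front."""
    had_bom = text.startswith(BOM)
    body = text[len(BOM):] if had_bom else text
    result = "\n".join(sorted(body.splitlines()))
    return BOM + result if had_bom else result


if __name__ == "__main__":
    sample = BOM + "zebra\napple"
    print(repr(naive_sort(sample)))     # BOM ends up in the middle of the text
    print(repr(sort_keep_bom(sample)))  # BOM stays at the very start
```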

Standard or Pro ?

Posted: Wed Dec 12, 2007 9:20 pm
by DFH
Is niccolo using TextPipe Standard or TextPipe Professional?

For the task in hand, does it matter which?

Posted: Thu Dec 13, 2007 3:41 am
by niccolo
DFH - TextPipe Pro 8 trial.

Here are the sample, the filters used (the first with sorting, the second a simple word list creation), and the results. In both result files the Cyrillic words are corrupted, but everything is OK in the trial run window. It's not a BOM problem.

http://rapidshare.com/files/76100564/pack.zip.html

I've solved this problem with other software, but why the hell is there always a problem with this whenever I decide to try TextPipe? When will native Unicode support, with regexes etc., be implemented?

Posted: Fri Dec 14, 2007 2:36 am
by niccolo
I have just found that things are not so good in the trial run area either: all German words lose their umlauts.

So TextPipe may be a good tool for English, but for multilanguage files it should be used with care.

Posted: Fri Dec 14, 2007 5:37 am
by DataMystic Support
No - if you read the help, the trial run area handles either ANSI or Unicode UTF-16 text (check the box).

If you use any other format you will lose data.
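
As a general illustration of what happens when UTF-8 bytes are read as ANSI, here is a small Python sketch. It assumes ANSI means Windows-1252, and it does not model the trial run area itself; the function name is hypothetical.

```python
def read_utf8_as_ansi(text: str) -> str:
    """Hypothetical helper: encode as UTF-8, then misread the bytes as Windows-1252."""
    # errors="replace" covers the few byte values that Windows-1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    # Each non-ASCII character turns into two unrelated characters.
    print(read_utf8_as_ansi("über"))
    print(read_utf8_as_ansi("мир"))
```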

I have been using TextPipe to process lots of UTF-8 files

Posted: Fri Dec 14, 2007 8:11 am
by DFH
I have been using TextPipe Standard to process lots of UTF-8 files, all with success, including many with non-Latin characters, such as Cyrillic, Chinese, Thai, Amharic, Japanese, Hebrew.

Only the trial area has those restrictions, just as Simon already explained.

Posted: Fri Dec 14, 2007 8:50 pm
by niccolo
DFH - If everything works for you, maybe you can explain where I'm going wrong in my example?

And regarding TextPipe: in the regex line I can insert symbols that are not in the system locale encoding, but in the filter list such symbols look corrupted. When will this problem be solved?

Don't have a rapidshare account

Posted: Sat Dec 15, 2007 1:46 am
by DFH
The link you posted took me to a page wanting me to pay for an account. Please make it easier for other members to help you.

Posted: Sat Dec 15, 2007 2:21 am
by niccolo
DFH - if you don't use a proxy there should be no problem getting the file.

Copy the link into your browser and press Enter. On the screen that opens, press FREE.
Then another window appears where you are asked to enter the code shown in a small picture (No premium Please enter). Type it in the box below and press the Download via ....... button.

Downloaded it now, thanks !

Posted: Sat Dec 15, 2007 4:59 am
by DFH
I didn't see the buttons before - thanks for the help.