split text file per character

YST
Posts: 4
Joined: Mon Dec 03, 2007 7:55 pm

Postby YST » Mon Dec 03, 2007 8:11 pm

How can I split a text file by character count?
The file is in Chinese, so the split files must still be readable after splitting.
How do I do that?

Thanks in advance :)

DataMystic Support
Site Admin
Posts: 2138
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Postby DataMystic Support » Tue Dec 04, 2007 8:17 am

Convert the file to UTF-8 with TextPipe first. Then break only on characters in the range \x00-\x7f, because all other bytes belong to multi-byte characters.
Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

Postby YST » Tue Dec 04, 2007 11:41 am

What about counting the Chinese characters and splitting after every 1000 characters? As I understand it, \x00-\x7f means looking for some specific characters, but that is not what I mean.

I converted the file to UTF-8 and then split at 2100 bytes, but the files become unreadable because some characters turn into gibberish! :cry:
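The gibberish described above is what one would expect when a fixed byte offset lands in the middle of a multi-byte UTF-8 character. In UTF-8, continuation bytes always fall in the range 0x80-0xBF, so a cut can be moved back until it no longer points at one. The following Python sketch shows that idea; `safe_cut` and `split_bytes` are hypothetical helper names, and the sketch assumes the input is valid UTF-8.

```python
def safe_cut(data: bytes, limit: int) -> int:
    """Return the largest offset <= limit that does not split a UTF-8 character."""
    if limit >= len(data):
        return len(data)
    cut = limit
    # Bytes of the form 10xxxxxx (0x80-0xBF) are continuation bytes:
    # back up until the cut sits at the start of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return cut


def split_bytes(data: bytes, limit: int) -> list:
    """Split UTF-8 bytes into pieces of at most `limit` bytes, on character boundaries."""
    if limit < 4:
        raise ValueError("limit must be at least 4 bytes (the longest UTF-8 character)")
    pieces = []
    start = 0
    while start < len(data):
        end = safe_cut(data, start + limit)
        if end <= start:
            # Only reachable with malformed input; cut at the raw limit to make progress.
            end = min(start + limit, len(data))
        pieces.append(data[start:end])
        start = end
    return pieces


if __name__ == "__main__":
    raw = "你好世界".encode("utf-8")
    for piece in split_bytes(raw, 4):
        print(len(piece), piece.decode("utf-8"))
```

Pieces produced this way may be slightly shorter than the byte limit, but each one decodes on its own.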

Postby DataMystic Support » Tue Dec 04, 2007 12:43 pm

Try splitting using this pattern:
([\x00-\x7f][^\x00-\x7f]*){1000}

Postby YST » Thu Dec 06, 2007 9:31 pm

I tried to split at that pattern, but an error occurred: "regular expression is too big"! Then I split at ([\x00-\x7f][^\x00-\x7f]*) with "after this many: 1000". This time the garbage characters disappeared (although a few files still show ??? when converted back from UTF-8 to BIG5).

After splitting, the files have about 50000 characters each, which is not what I want. When I count the files, they seem irregular: some have only 1 line, some are blank, and some are around 1000 characters. Only "after this many: 30" creates files of around 1000 characters, but not exactly: some are blank, some have more or fewer characters than others, some have only 100 characters, and some have 50 to 60. With "after this many: 1000", each file contains over 50000 characters. Why is that? :(

Thanks very much for your reply :D
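One possible explanation for the irregular sizes described above: in UTF-8, every byte of a Chinese character is 0x80 or higher, so the pattern [\x00-\x7f][^\x00-\x7f]* matches one ASCII byte (such as a line break) together with the whole run of Chinese text that follows it. Counting 1000 such matches therefore counts ASCII characters, not Chinese ones. The Python sketch below illustrates this on a tiny sample and compares it with a pattern that matches one UTF-8 character at a time. It uses Python's `re` module on bytes, not TextPipe's regex engine, and the names `ASCII_LED` and `UTF8_CHAR` are made up for this example.

```python
import re

# The pattern from the thread: one ASCII byte, then any run of non-ASCII bytes.
ASCII_LED = re.compile(rb"[\x00-\x7f][^\x00-\x7f]*")

# An alternative: one byte that starts a character (ASCII or a UTF-8 lead byte),
# followed by its continuation bytes (0x80-0xBF).
UTF8_CHAR = re.compile(rb"[\x00-\x7f\xc0-\xff][\x80-\xbf]*")

sample = "中文字\n更多中文\n".encode("utf-8")

ascii_led_matches = ASCII_LED.findall(sample)
utf8_char_matches = UTF8_CHAR.findall(sample)

if __name__ == "__main__":
    # The first pattern only finds the line breaks (with trailing Chinese text).
    print(len(ascii_led_matches))
    # The second pattern finds each character separately.
    print(len(utf8_char_matches))
    print([m.decode("utf-8") for m in utf8_char_matches])
```

If TextPipe applies the pattern byte by byte to UTF-8 data, the second pattern combined with "after this many: 1000" might give pieces of 1000 real characters, but that is untested here.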

Postby DataMystic Support » Fri Dec 07, 2007 9:04 am

Please email a sample file to our support address.

Postby YST » Sun Dec 16, 2007 9:17 pm

Hi,

When a text is split every 1000 characters, how can I avoid blank files being written to the output path?
For example, take 1.txt, 2.txt and 3.txt.
Each should have 1000 characters using the regular expression pattern you mentioned above (splitting before or after), but 3.txt has only blank content. How can I split so that a blank piece is not output, leaving only 1.txt and 2.txt? (I want to compile the files into a CHM, but my application has problems when it meets a blank text file.)
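As an illustration of the behaviour asked for above, the following Python sketch splits text into pieces of N characters and simply skips any piece that contains only whitespace, numbering the output files without gaps. It is a sketch outside TextPipe, not a TextPipe feature; the function name `write_nonblank_chunks`, the `N.txt` naming and the default UTF-8 output encoding are assumptions for this example.

```python
import os


def write_nonblank_chunks(text: str, size: int, out_dir: str, encoding: str = "utf-8") -> list:
    """Write `text` to out_dir as 1.txt, 2.txt, ... in pieces of `size` characters.

    Pieces that are empty or whitespace-only are not written, so the
    numbering of the files that are written has no gaps.
    """
    if size <= 0:
        raise ValueError("size must be a positive number of characters")
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(text), size):
        chunk = text[start:start + size]
        if not chunk.strip():
            continue  # blank piece: skip it instead of creating an empty file
        path = os.path.join(out_dir, f"{len(paths) + 1}.txt")
        # newline="" keeps line endings exactly as they are in the text
        with open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(chunk)
        paths.append(path)
    return paths
```

For example, `write_nonblank_chunks(text, 1000, "out")` would write only the pieces that contain visible text, which avoids handing a blank file to the CHM compiler.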