How to remove lines that begin with EM DASH (U+2014)?

DFH
Posts: 636
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

How to remove lines that begin with EM DASH (U+2014)?

Postby DFH » Mon Mar 24, 2008 2:09 am

Using TextPipe Standard, how can I remove all lines that begin with an EM DASH ?

The text file to be processed is encoded as UTF-8 without BOM.

In Unicode, EM DASH is U+2014.

With the Remove Matching Lines filter, I have had no success matching this pattern.

DataMystic Support
Site Admin
Posts: 2138
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Postby DataMystic Support » Tue Mar 25, 2008 7:38 am

You should check the encoding of this character.

Using this filter, you can paste a UTF character into the Trial Run Input area (first ensuring that 'Treat Trial Input as Unicode' is checked) and see what its hex equivalent is:

|--Convert from UTF-16 to UTF-8
|   
|--Hex dump
|   

This shows an Em-dash pasted from MS Word as E1 90 A0.
So to find it with a search/replace, you need to use a non-Unicode search/replace with a find text of

\xE1\x90\xA0
Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

DFH
Posts: 636
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Remove matching lines not as versatile as the Replace filter

Postby DFH » Tue Mar 25, 2008 6:48 pm

The issue seems to be that the Perl matching options for the Remove Matching Lines filter are not available as a popup like they are for the Replace filter with Perl matching selected.

I later found a UTF-8-aware solution (with 'greedy matching') using the Replace filter, with

Replace [^\x{2014}.*] by []
Even so, it would be neater if the same Perl options could be made available in the Remove [Non-]Matching Lines filters.
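
For anyone who wants to see the effect of that replacement outside TextPipe, here is a minimal sketch in Python. It is an analogy only: Python's `re` module is not TextPipe's matching engine, it writes the code point as `\u2014` rather than `\x{2014}`, and the function name `blank_em_dash_lines` is made up for illustration. Replacing the match with nothing empties the line but leaves its line ending behind, which is one reason a true remove-lines filter would be neater.

```python
import re

# Assumption: the text has already been decoded from UTF-8 into a str.
# re.MULTILINE makes ^ match at the start of every line, not just the first.
EM_DASH_LINE = re.compile(r"^\u2014.*", re.MULTILINE)

def blank_em_dash_lines(text: str) -> str:
    """Replace the content of every line starting with U+2014 by nothing."""
    return EM_DASH_LINE.sub("", text)

sample = "keep me\n\u2014 drop me\nkeep me too\n"
result = blank_em_dash_lines(sample)
print(repr(result))  # the emptied line's newline is still there
```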

DataMystic Support
Site Admin
Posts: 2138
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Postby DataMystic Support » Tue Mar 25, 2008 7:07 pm

As far as I know, if \x2014 works for you, then your file is UTF16, not UTF8.

DFH
Posts: 636
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

It was definitely UTF8 (without BOM)

Postby DFH » Wed Mar 26, 2008 2:20 am

Hi Simon,

Both SC Unipad and Notepad++ confirm that my file was encoded as UTF8 without BOM.

This is the main encoding that I use for work with Go Bible Creator.

Best regards,
David Haslam

DataMystic Support
Site Admin
Posts: 2138
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: How to remove lines that begin with EM DASH (U+2014)?

Postby DataMystic Support » Mon Apr 21, 2008 2:41 pm

Hi David,

It seems my initial statement about an EM-DASH being \xE1\x90\xA0 in UTF-8 was wrong (I don't know what I was working with). I pasted this directly from MS Word into TextPipe's Trial Run Input area.

I just pasted an EM DASH from MS Word into a Notepad file with an A and a B on either side (i.e. A, then the EM DASH, then B), and saved it as UTF-8.

Using TextPipe to generate a hex dump of this file, I see
00000000 EF BB BF 41 E2 80 94 42 ...A...B

The EF BB BF is the UTF-8 Byte Order Mark (BOM).
Then we have 41 for 'A'
Then E2 80 94 for the EM-Dash
and then 42 for 'B'.
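
The same byte breakdown can be cross-checked without TextPipe. The following is a small sketch in Python, assuming the test string is "A", U+2014, "B" and that Notepad's "UTF-8" save corresponds to Python's `utf-8-sig` codec (UTF-8 with a leading BOM); the variable names are only for illustration.

```python
# Assumption: the file content is "A", EM DASH (U+2014), "B".
text = "A\u2014B"

with_bom = text.encode("utf-8-sig")   # "utf-8-sig" prepends the BOM EF BB BF
without_bom = text.encode("utf-8")    # plain UTF-8, no BOM

print(with_bom.hex(" ").upper())      # EF BB BF 41 E2 80 94 42
print(without_bom.hex(" ").upper())   # 41 E2 80 94 42
```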

Using a Perl search/replace with the UTF-8 option enabled, we can use

\x{2014}

as the matching criteria, and the matching engine replaces this with \xE2\x80\x94 internally.

However, if we specify a search term of \xE2\x80\x94 with the utf8 option ON, the matching engine will not find anything. If we use \xE2\x80\x94 with utf8 UNCHECKED, then it will find it as normal.
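
A rough way to picture the difference between the two modes is the sketch below, written in Python. It is not TextPipe's engine; it only assumes that "UTF-8 option on" behaves like matching against decoded text and "UTF-8 option off" behaves like matching against raw bytes. The variable names are hypothetical.

```python
import re

raw = "A\u2014B".encode("utf-8")   # the bytes as stored in the file
decoded = raw.decode("utf-8")      # the text as a Unicode-aware engine sees it

# Unicode-aware matching: search for the code point itself.
codepoint_on_text = re.search("\u2014", decoded) is not None

# Byte-wise matching: search for the three UTF-8 bytes.
bytes_on_raw = re.search(rb"\xE2\x80\x94", raw) is not None

# Mixing the two: the escapes are read as three separate characters
# (U+00E2, U+0080, U+0094), none of which occurs in the decoded text.
bytes_on_text = re.search(r"\xE2\x80\x94", decoded) is not None

print(codepoint_on_text, bytes_on_raw, bytes_on_text)  # True True False
```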

To remove lines beginning with an EM-Dash from a UTF-8 file, we use the pattern:
^\xE2\x80\x94
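
As a sanity check of that byte-level pattern, here is a minimal Python sketch that drops whole lines starting with the UTF-8 bytes of EM DASH. The function name `remove_em_dash_lines` and the sample data are assumptions for illustration, and this is not how the Remove Matching Lines filter is implemented.

```python
import re

# Byte-level pattern: start of line followed by the UTF-8 encoding of U+2014.
EM_DASH_START = re.compile(rb"^\xE2\x80\x94")

def remove_em_dash_lines(data: bytes) -> bytes:
    """Drop every line (including its line ending) that starts with EM DASH."""
    lines = data.splitlines(keepends=True)
    return b"".join(line for line in lines if not EM_DASH_START.match(line))

sample = "first\n\u2014 second\nthird \u2014 kept\n".encode("utf-8")
cleaned = remove_em_dash_lines(sample)
print(cleaned.decode("utf-8"))
```

Only lines that begin with the dash are removed; a dash in the middle of a line is left alone. If the file starts with a BOM, the first line begins with EF BB BF rather than E2 80 94, so this sketch would not match a dash at the very start of the file.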

DFH
Posts: 636
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: How to remove lines that begin with EM DASH (U+2014)?

Postby DFH » Wed Apr 23, 2008 5:42 pm

Thanks Simon,

As I see, it was indeed more complicated than it seemed at first glance. The concept that TextPipe changes its search pattern to what gets processed internally by the search engine is something that I had not imagined previously. It would be useful if this was better described in the help file.

Best regards,
David Haslam

DFH
Posts: 636
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: How to remove lines that begin with EM DASH (U+2014)?

Postby DFH » Sat Oct 16, 2010 5:19 am

Simon,

Can the substance of your answers be added to the help file, please?

David

DataMystic Support
Site Admin
Posts: 2138
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: How to remove lines that begin with EM DASH (U+2014)?

Postby DataMystic Support » Mon Oct 18, 2010 7:45 am

Done.

