Bug in Find whole words only option for replace list filter

DFH
Posts: 791
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Bug in Find whole words only option for replace list filter

Post by DFH » Fri May 04, 2018 6:35 am

This is in the context of using the Replace list filter with Pattern (perl) as find type.

My tab-delimited external Replace list contains this as one of the many lines:

\xF3	o

It's designed to replace the accented letter "ó" with the unaccented letter "o", and is set to apply with
Match case, Find whole words only, and UTF-8 support.

I just found that, instead of replacing only the two single-letter words that were intended,
it also replaced the "ó" at the end of 55 words ending in the letters "ñó", viz.

enseñó soñó riñó engañó ciñó constriñó apañó dañó

The letter U+00F1 LATIN SMALL LETTER N WITH TILDE "ñ" seems to be treated in this context as if it were a non-word character!
How else can one interpret this very unexpected result?

This is surely a software bug!

Aside: The input file contains Spanish text. My Windows locale is English (UK).

Best regards,

David

DataMystic Support
Site Admin
Posts: 2275
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Bug in Find whole words only option for replace list filter

Post by DataMystic Support » Wed May 09, 2018 8:18 am

Hi David,

The same issue occurs for a Perl pattern on its own, outside of the search/replace list.

Currently, TextPipe determines which characters are word characters on startup, and retains this throughout. It does this for (ANSI) characters 0..255, and hence has no UTF-8 view of the world. That is why it fails for the character in your example: Windows must be telling TextPipe that it is not a word character.

The best approach I can see right now is to stop relying on this Word Characters array being checked at the start and end of each potential match, and instead to prefix and append the \b regex to each pattern, letting the regex engine use its internal Unicode tables to decide what is and is not a word character.

If you could disable 'Whole Word' and instead add \b around your pattern, it would be interesting to see if it gives correct output for other use cases. It works fine for the case you've given here.
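
The \b suggestion above can be tried outside TextPipe. The following is a minimal sketch in Python, whose `re` module is Unicode-aware for `str` patterns by default; it is not the regex engine TextPipe uses, so it only illustrates the idea. The helper name `replace_whole_word` and the sample text are made up for this example.

```python
import re

def replace_whole_word(text: str, find: str, replacement: str) -> str:
    """Replace `find` only where it stands between word boundaries.

    The pattern is wrapped as \\b(?:...)\\b so that the boundaries apply to
    the whole pattern even if it contains alternation.
    """
    pattern = re.compile(r"\b(?:" + find + r")\b")
    return pattern.sub(replacement, text)

# Hypothetical sample: two standalone "ó" words mixed with words ending in "ñó".
sample = "ó enseñó soñó ó riñó"
print(replace_whole_word(sample, r"\xF3", "o"))
# prints: o enseñó soñó o riñó
```

Because "ñ" counts as a word character in the engine's Unicode tables, there is no boundary between "ñ" and "ó", so the words ending in "ñó" are left alone.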

Simon
Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

DFH
Posts: 791
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Bug in Find whole words only option for replace list filter

Post by DFH » Mon May 14, 2018 1:22 am

Hi Simon,

Then even the Windows view of which ANSI characters are word characters is faulty, seeing as ñ (U+00F1 aka \xF1) is within the decimal range 0-255 and it's not a punctuation mark!
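
One possible reading of this result, which is an assumption and not confirmed anywhere in this thread: if the input is UTF-8 and the word-character table is indexed by single byte values, then "ñ" is not seen as \xF1 at all. In UTF-8 it is the two bytes C3 B1, and byte B1 read as a Windows-1252 character is "±", which is not a word character. The sketch below builds such a table in Python from the cp1252 codec; the table and the function names are illustrative, not TextPipe's actual code.

```python
def build_word_table(codepage: str = "cp1252") -> list:
    """Return 256 booleans: is byte value i a word character in `codepage`?"""
    table = []
    for value in range(256):
        try:
            ch = bytes([value]).decode(codepage)
        except UnicodeDecodeError:
            table.append(False)  # byte value undefined in this code page
            continue
        table.append(ch.isalnum() or ch == "_")
    return table

WORD = build_word_table()

def standalone_by_bytes(word: str, letter: str = "ó") -> bool:
    """Byte-level 'whole word' check on the UTF-8 form of `word`:
    look only at the single byte before and after the matched letter."""
    data = word.encode("utf-8")
    needle = letter.encode("utf-8")
    start = data.find(needle)
    if start < 0:
        return False
    end = start + len(needle)
    before_ok = start == 0 or not WORD[data[start - 1]]
    after_ok = end == len(data) or not WORD[data[end]]
    return before_ok and after_ok

print(WORD[0xF1], WORD[0xB1])   # True False  ("ñ" vs "±" in cp1252)
for w in ["ó", "enseñó", "comió"]:
    print(w, standalone_by_bytes(w))
# ó True
# enseñó True
# comió False
```

Under this assumption the table itself is right about \xF1, yet "enseñó" is still treated as ending in a standalone "ó", because the byte just before the match is B1 rather than F1.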

I'll try the \b suggestion when I get time, now that I've read what it does; I've not made use of it before.

  \b     matches at a word boundary
  \B     matches when not at a word boundary

Though this might work for the corner case reported, it seems a tedious chore to have to wrap each search word in the external replace list.
It would be simpler to remove the corner case from the list and deal with it using a separate replace list filter.
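
If wrapping every entry by hand is the obstacle, the list could in principle be rewritten by a small script. This is only a sketch in Python, assuming the list is plain text with one tab-separated find/replace pair per line, as in the example at the top of the thread; `wrap_find_pattern` and the `foo`/`bar` entry are hypothetical, and the check for an already wrapped pattern is deliberately naive.

```python
def wrap_find_pattern(line: str) -> str:
    """Wrap the find column of a 'find<TAB>replace' line in \\b(?:...)\\b."""
    find, tab, replacement = line.partition("\t")
    if not tab or not find:
        return line  # not a find/replace pair; leave untouched
    if find.startswith(r"\b") and find.endswith(r"\b"):
        return line  # naive check: looks wrapped already
    return r"\b(?:" + find + r")\b" + tab + replacement

# Hypothetical replace list: the entry from this thread plus a pre-wrapped one.
replace_list = [r"\xF3" + "\to", r"\bfoo\b" + "\tbar"]
for entry in replace_list:
    print(wrap_find_pattern(entry))
```

The non-capturing group keeps the boundaries attached to the whole pattern, so entries that contain alternation are not changed in meaning.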

David
