Pattern matching \s or \S

DFH
Posts: 716
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Post by DFH » Thu Oct 29, 2015 6:39 am

The pattern matching reference describes \s as:

"any white space character: space, formfeed, newline, carriage return, horizontal tab, and vertical tab"

In Notepad++, however, the pattern \s also matches the no-break space character \xA0.

Why is TextPipe different?

Which program conforms to external standards in this respect?

Best regards,

David

DataMystic Support
Site Admin
Posts: 2220
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Pattern matching \s or \S

Post by DataMystic Support » Thu Oct 29, 2015 8:30 am

Hi David,

Which version of TextPipe? And is this using the perl pattern match?

According to some PCRE specs from 2012 (http://www.pcre.org/pcre.txt), \s does include \xA0.

I tested this in TextPipe 9.9.2 and it works fine. I put the hex code A0 in the trial run, then used the following filter list to convert the hex code to an actual character and match it with \s.

|
|--Hex Decode
|
|--Perl pattern [\s] with [*]
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [X] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
| Maximum text buffer size 4096
| [ ] Maximum match (greedy)
| [ ] Allow comments
| [X] '.' matches newline
| [ ] UTF-8 Support
|
Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments


Re: Pattern matching \s or \S

Post by DFH » Thu Oct 29, 2015 9:54 pm

I'm using TextPipe 9.9.2

I was merely going by the help text description for \s

It's now clear that \s does include \xA0, so could you please update the help file?

In the PCRE link you cited, this is significant:
This list may vary if locale-specific matching
is taking place. For example, in some locales the "non-breaking space"
character (\xA0) is recognized as white space, and in others the VT
character is not.
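
The mode dependence described in that passage is not unique to PCRE. As a minimal illustration in Python (chosen only for this example; the thread itself concerns TextPipe, PCRE and Notepad++), the standard `re` module treats \xA0 as white space for Unicode string patterns, but not when the ASCII flag is set or when matching bytes. The helper name `matches_whitespace` is hypothetical, and this says nothing about how TextPipe itself is configured.

```python
import re

NBSP = "\xa0"  # no-break space


def matches_whitespace(text, flags=0):
    """Return True if \\s matches anywhere in text under the given flags."""
    return re.search(r"\s", text, flags) is not None


# Unicode string pattern (the Python 3 default): NBSP counts as white space.
print(matches_whitespace(NBSP))            # True

# ASCII-only mode: \s is limited to space, \t, \n, \r, \f and \v.
print(matches_whitespace(NBSP, re.ASCII))  # False

# Bytes pattern: byte 0xA0 is not ASCII white space either.
BYTES_RESULT = re.search(rb"\s", b"\xa0") is not None
print(BYTES_RESULT)                        # False
```

The same pattern therefore gives different answers depending on the matching mode, which is consistent with Notepad++ and a documentation page disagreeing about \xA0.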

Q. The Remove multiple whitespace filter only removes ordinary spaces and tabs, doesn't it?
That is, it doesn't treat \xA0 as whitespace.

Thanks.

Best regards,

David


Re: Pattern matching \s or \S

Post by DataMystic Support » Thu Oct 29, 2015 11:33 pm

Yes - that filter is very fast and doesn't use PCRE regex. As you say, it doesn't handle \xA0.

Do you think we should change it to use regex and take advantage of the extra Unicode whitespace characters?
Regards,

Simon Carter


Re: Pattern matching \s or \S

Post by DFH » Fri Oct 30, 2015 1:55 am

Hi Simon,

I'd be inclined to say "no", so that the change doesn't break existing filters, which matters to all users.

Unicode treats a number of special characters as "white space", but most users rarely come across any of them.

The few users that do can readily devise suitable replace filters.

e.g. Here's how I dealt with no break spaces:


Comment...
|  Remove redundant no break spaces
|  
|   - Except before punctuation marks :;!?
|
+--Perl pattern [\xA0[\;\:\?\!]] with []
   |  [X] Match case
   |  [ ] Whole words only
   |  [ ] Case sensitive replace
   |  [ ] Prompt on replace
   |  [ ] Skip prompt if identical
   |  [ ] First only
   |  [ ] Extract matches
   |  Maximum text buffer size 4096
   |  [ ] Maximum match (greedy)
   |  [ ] Allow comments
   |  [ ] '.' matches newline
   |  [X] UTF-8 Support
   |
   +--Perl pattern [\xA0(\.)] with [$1]
         [X] Match case
         [ ] Whole words only
         [ ] Case sensitive replace
         [ ] Prompt on replace
         [ ] Skip prompt if identical
         [ ] First only
         [ ] Extract matches
         Maximum text buffer size 4096
         [ ] Maximum match (greedy)
         [ ] Allow comments
         [ ] '.' matches newline
         [X] UTF-8 Support

         [ ] Process longest strings first
         [ ] Simultaneous search

       Further search/replace list phrases (CSV format):
       \xA0,\x20

Once the redundant no-break spaces have been replaced by ordinary spaces, the filter can be followed by Remove multiple whitespace, if required.

The above method can easily be adapted.
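
For readers working outside TextPipe, the same two-step idea can be sketched in Python. This is not a translation of the filter list above: it only follows the stated intent (replace no-break spaces with ordinary spaces except before the punctuation marks : ; ! ?, then collapse runs of spaces and tabs). The function name `tidy_nbsp` and the use of a negative lookahead are assumptions made for this sketch.

```python
import re

# NBSP that is NOT immediately followed by one of : ; ! ?
REDUNDANT_NBSP = re.compile(r"\xa0(?![:;!?])")

# Runs of ordinary spaces and tabs (NBSP is deliberately not included).
SPACE_RUN = re.compile(r"[ \t]+")


def tidy_nbsp(text):
    """Replace redundant no-break spaces, then collapse multiple whitespace."""
    # Step 1: turn redundant NBSPs into ordinary spaces.
    text = REDUNDANT_NBSP.sub(" ", text)
    # Step 2: collapse each run of spaces/tabs to a single space,
    # roughly what a "remove multiple whitespace" step would do.
    return SPACE_RUN.sub(" ", text)


sample = "Bonjour\xa0! a\xa0\xa0b  c"
print(repr(tidy_nbsp(sample)))  # 'Bonjour\xa0! a b c'
```

The no-break space before the exclamation mark survives, while the other two are converted and then collapsed together with the ordinary double space.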

Best regards,

David
