Pattern matching \s or \S


DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Postby DFH » Thu Oct 29, 2015 6:39 am

The pattern matching reference describes \s as:
any white space character:
space, formfeed, newline, carriage return, horizontal tab, and vertical tab

In Notepad++, the pattern \s also matches the no break space character \xA0.

Why is TextPipe different?

Which program conforms to external standards in this respect?

Best regards,

David

DataMystic Support
Site Admin
Posts: 2164
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Pattern matching \s or \S

Postby DataMystic Support » Thu Oct 29, 2015 8:30 am

Hi David,

Which version of TextPipe? And is this using the perl pattern match?

According to some PCRE specs from 2012 http://www.pcre.org/pcre.txt

\s does include \xA0

I tested this in TextPipe 9.9.2 and it works fine. I put

A0

in the trial run, then used the following filter list to convert the hex code to an actual character, and then matched on it with \s.
Which version of TP are you using?

|
|--Hex Decode
|
|--Perl pattern [\s] with [*]
| [ ] Match case
| [ ] Whole words only
| [ ] Case sensitive replace
| [X] Prompt on replace
| [ ] Skip prompt if identical
| [ ] First only
| [ ] Extract matches
| Maximum text buffer size 4096
| [ ] Maximum match (greedy)
| [ ] Allow comments
| [X] '.' matches newline
| [ ] UTF-8 Support
|
Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

Re: Pattern matching \s or \S

Postby DFH » Thu Oct 29, 2015 9:54 pm

I'm using TextPipe 9.9.2

I was merely going by the help text description for \s

It's now clear that \s does include \xA0 - so then please would you update the help file.

In the PCRE link you cited, this is significant:
This list may vary if locale-specific matching
is taking place. For example, in some locales the "non-breaking space"
character (\xA0) is recognized as white space, and in others the VT
character is not.
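
The same kind of mode-dependent behaviour can be seen outside PCRE. The following is only a rough illustration using Python's `re` module, which is neither PCRE nor TextPipe's engine; the names `unicode_ws`, `ascii_ws` and `matches_nbsp` are made up for this sketch. It shows \s matching \xA0 in one mode and not in the other:

```python
import re

NBSP = "\xa0"  # no-break space

# For str patterns, Python's \s is Unicode-aware by default.
unicode_ws = re.compile(r"\s")
# With re.ASCII, \s is restricted to [ \t\n\r\f\v].
ascii_ws = re.compile(r"\s", re.ASCII)


def matches_nbsp(pattern: re.Pattern) -> bool:
    """Return True if the pattern matches a lone no-break space."""
    return pattern.fullmatch(NBSP) is not None


if __name__ == "__main__":
    print("Unicode mode matches NBSP:", matches_nbsp(unicode_ws))
    print("ASCII mode matches NBSP:  ", matches_nbsp(ascii_ws))
```

So whether \xA0 counts as white space depends on how the engine is configured, which is consistent with the PCRE note quoted above.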

Q. Remove multiple whitespace only removes ordinary spaces and tabs, doesn't it?
i.e. It doesn't treat \xA0 as whitespace!

Thanks.

Best regards,

David

Re: Pattern matching \s or \S

Postby DataMystic Support » Thu Oct 29, 2015 11:33 pm

Yes - that filter is very fast and doesn't use PCRE regex. As you say, it doesn't handle \xA0.

Do you think we should change it to use regex and take advantage of the extra Unicode characters?
Regards,

Simon Carter

Re: Pattern matching \s or \S

Postby DFH » Fri Oct 30, 2015 1:55 am

Hi Simon,

I'd be inclined to say "no", so that it doesn't break existing filters, which matters to all users.

Unicode treats a number of special characters as "white space", but most users rarely come across any of them.

The few users that do can readily devise suitable replace filters.

e.g. Here's how I dealt with no break spaces:

Comment...
|  Remove redundant no break spaces

|   - Except before punctuation marks :;!?
|
+--Perl pattern [\xA0[\;\:\?\!]] with []
   |  [X] Match case
   |  [ ] Whole words only
   |  [ ] Case sensitive replace
   |  [ ] Prompt on replace
   |  [ ] Skip prompt if identical
   |  [ ] First only
   |  [ ] Extract matches
   |  Maximum text buffer size 4096
   |  [ ] Maximum match (greedy)
   |  [ ] Allow comments
   |  [ ] '.' matches newline
   |  [X] UTF-8 Support
   |
   +--Perl pattern [\xA0(\.)] with [$1]
         [X] Match case
         [ ] Whole words only
         [ ] Case sensitive replace
         [ ] Prompt on replace
         [ ] Skip prompt if identical
         [ ] First only
         [ ] Extract matches
         Maximum text buffer size 4096
         [ ] Maximum match (greedy)
         [ ] Allow comments
         [ ] '.' matches newline
         [X] UTF-8 Support

         [ ] Process longest strings first
         [ ] Simultaneous search

       Further search/replace list phrases (CSV format):
       \xA0,\x20

Having replaced the redundant nbsp by ordinary spaces, it can be followed by remove multiple whitespace, if so required.
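
For anyone who wants to try the same idea outside TextPipe, here is a minimal Python sketch of it. It is my reading of the filter list above rather than an exact translation: the function name `normalize_nbsp` is hypothetical, and the final step only approximates what the remove multiple whitespace filter does (ordinary spaces and tabs only).

```python
import re


def normalize_nbsp(text: str) -> str:
    """Replace redundant no-break spaces, then collapse multiple whitespace."""
    # Drop a no-break space that sits directly before a full stop.
    text = re.sub(r"\xa0(\.)", r"\1", text)
    # Turn every remaining no-break space into an ordinary space,
    # except one directly before the punctuation marks : ; ! ?
    text = re.sub(r"\xa0(?![;:?!])", " ", text)
    # Assumed equivalent of "remove multiple whitespace":
    # collapse runs of ordinary spaces and tabs to a single space.
    return re.sub(r"[ \t]{2,}", " ", text)


if __name__ == "__main__":
    sample = "Bonjour\xa0! One\xa0\xa0two\xa0."
    print(repr(normalize_nbsp(sample)))
```

The negative lookahead `(?![;:?!])` leaves the punctuation itself untouched, so only the no-break space is ever replaced.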

The above method can easily be adapted.

Best regards,

David
