Suggestion: Provide filter Remove all diacritics

DFH
Posts: 940
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Post by DFH » Fri Jan 04, 2019 7:13 am

A very useful enhancement to TextPipe would be a filter called Remove all diacritics.

Assume that the input would be UTF-8.

How about this?

David

Post by DFH » Tue Feb 05, 2019 10:25 pm

Hi Simon,

Have you thought of a response on this yet?

Best regards,

David

DataMystic Support
Site Admin
Posts: 2199
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Suggestion: Provide filter Remove all diacritics

Post by DataMystic Support » Thu Mar 14, 2019 1:27 pm

Hi David,

If you're happy to provide the PCRE pattern to do so, we're happy to bundle this with TextPipe.

Regards,

Simon
Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

Post by DFH » Wed Mar 20, 2019 6:54 am

It's not that simple: it depends on Unicode character properties, which is not something you can readily achieve with PCRE alone.

Clearly the first step would be to Normalise to NFD so that the diacritics become separate characters,
but there's no simple formula to match "these codepoints are diacritics" other than a complex table extracted laboriously from Unicode data.
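
For illustration, the NFD step described above can be sketched in Python with the standard unicodedata module. This is only a rough sketch, not how TextPipe or BabelPad work: the function name strip_combining_marks is made up for this example, and it assumes that "diacritic" can be approximated by the general category Mn (nonspacing mark). That approximation is too broad for some scripts and misses letters such as ø or ł, which have no canonical decomposition.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Decompose to NFD, then drop nonspacing marks (general category Mn).

    Hypothetical helper: Mn is only an approximation of "diacritic".
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    # Recompose whatever is left so untouched sequences return to NFC.
    return unicodedata.normalize("NFC", "".join(kept))


# The input is assumed to arrive as UTF-8 bytes, as in the original request.
utf8_input = "Crème brûlée, Dvořák, Ångström".encode("utf-8")
print(strip_combining_marks(utf8_input.decode("utf-8")))
# prints: Creme brulee, Dvorak, Angstrom
```

The result depends on the Unicode version built into the Python interpreter (see unicodedata.unidata_version), which is the same kind of library dependency discussed later in this thread.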

What's needed is to be able to go deeper.

David

Post by DataMystic Support » Wed Mar 20, 2019 8:08 am

What about this: https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions

Unicode character properties

Unicode defines several properties for each character. Patterns in PCRE can match these properties. e.g. \p{Ps}.*?\p{Pe} would match a string beginning with any "opening punctuation" and ending with any "close punctuation" such as "[abc]". Since version 8.10, matching of certain "normal" metacharacters can be driven by Unicode properties when the compile option PCRE_UCP is set. The option can be set for a pattern by including (*UCP) at the start of pattern. The option alters behavior of the following metacharacters: \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. For example, the set of characters matched by \w (word characters) is expanded to include letters and accented letters as defined by Unicode properties. Such matching is slower than the normal (ASCII-only) non-UCP alternative. Note that the UCP option requires the PCRE library to have been built to include UTF-8 and Unicode property support. Support for UTF-16 is included in version 8.30 while support for UTF-32 was added in version 8.32.

Post by DFH » Fri Mar 22, 2019 12:31 am

Hi Simon,

I've just read the Wikipedia article but I'm none the wiser as to how this would get us nearer to a pattern that simply matches all diacritics.

I wonder how Andrew West implemented BabelPad | Convert | Other | Strip Diacritics ?

David

Post by DataMystic Support » Mon May 06, 2019 1:55 pm

Could we just use this list? https://www.compart.com/en/unicode/combining/230

Post by DFH » Tue May 07, 2019 6:31 pm

Interesting site.

It needs to be very carefully thought out with much attention to detail.

- Not just the "Above" combining class (230), but all the various combining classes listed in https://www.compart.com/en/unicode/combining
- And we'd need to consider some of the modifier letters & modifier symbols that also behave as diacritics.
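
To illustrate the two points above, here is a small Python sketch using the standard unicodedata module. The helper name describe and the five sample characters are chosen for this example only. It shows that common combining marks sit in several combining classes, and that some spacing characters that look like diacritics are not combining marks at all.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, int]:
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


samples = [
    "\u0301",  # combining acute accent, drawn above the base letter
    "\u0323",  # combining dot below
    "\u0327",  # combining cedilla, attached below
    "\u02C7",  # caron as a spacing modifier letter
    "\u00B4",  # acute accent as a spacing modifier symbol
]

for ch in samples:
    name, category, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: category={category}, combining class={ccc}")
```

Only the first sample is in class 230. The dot below and the cedilla are in other classes, and the last two have combining class 0 with categories Lm and Sk, so a filter built from the class 230 list alone would leave all four untouched.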

And it should conform to the character properties in Unicode 11.0 or later, not those of ten years ago.

If the proposed new filter always gives the same results as the equivalent convert option in BabelPad, then I'd be very happy.
It's always nice to have something that can be scripted.

Regards,

David

Post by DataMystic Support » Tue May 07, 2019 10:26 pm

I think this is covered by the properties available in the Unicode definition:

ccDiacritic, // Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.

Post by DFH » Tue May 21, 2019 4:01 am

Are those Unicode properties accessible to TextPipe ?

David

Post by DataMystic Support » Tue May 21, 2019 6:27 am

They are definitely available, subject to the Unicode library being up to date. We could code this now, but against the older Unicode library.

Post by DFH » Wed May 22, 2019 12:01 am

Please do - if it's not too difficult.

It could then be updated to work with Unicode 12.1 when you have the relevant libraries in place.

Best regards,

David

Post by DFH » Mon Mar 02, 2020 9:52 pm

Is the suggested feature now more feasible with Unicode 12.1 being built into TextPipe v11.x ?

Best regards,

David

Post by DataMystic Support » Fri Mar 06, 2020 6:39 am

Hi David, yes, we'll look into this for the next release.

Post by DFH » Fri May 08, 2020 12:13 am

Will the Remove diacritics filter[s] handle UTF-8 or is it limited to UTF-16 LE only ?

Several Unicode filters can only handle UTF-16 LE encoding.

Would it be feasible for each of these filters to detect when any other encoding is piped into it and simply stop with an error message?

This would avoid the crazy situation where the filter progress indicator is actually displaying regress (estimated time to completion ever increasing).
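
As a rough idea of what such a check could look like, here is a Python sketch. It says nothing about how TextPipe's filters are implemented: the function name require_utf16le is invented for this example, and the checks (foreign byte order marks, odd byte count, strict decoding) are a heuristic only. UTF-8 input without a BOM and with an even byte count can still slip through.

```python
import codecs


def require_utf16le(data: bytes) -> str:
    """Return the decoded text, or raise ValueError if data is plainly not UTF-16 LE.

    Hypothetical guard: it rejects obvious mismatches, not every wrong encoding.
    """
    # The UTF-32 LE BOM begins with the UTF-16 LE BOM bytes, so test it first.
    # Edge case: UTF-16 LE text starting with BOM + U+0000 is rejected too.
    foreign_boms = [
        (codecs.BOM_UTF32_LE, "UTF-32 LE"),
        (codecs.BOM_UTF32_BE, "UTF-32 BE"),
        (codecs.BOM_UTF8, "UTF-8"),
        (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    ]
    for bom, name in foreign_boms:
        if data.startswith(bom):
            raise ValueError(f"input looks like {name}, expected UTF-16 LE")

    if len(data) % 2:
        raise ValueError("odd byte count cannot be UTF-16 LE")

    # Drop a UTF-16 LE BOM if present so it does not appear in the text.
    if data.startswith(codecs.BOM_UTF16_LE):
        data = data[len(codecs.BOM_UTF16_LE):]

    try:
        return data.decode("utf-16-le")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid UTF-16 LE: {exc}") from exc
```

A filter could call a guard like this once on the first block of input and stop with the error message, rather than processing bytes it cannot interpret.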

David
