Locale-sensitive filters and multilingual texts?

DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Postby DFH » Sat Jun 02, 2012 6:15 am

Several TextPipe filters are sensitive to the locale, especially those that involve sorting or case comparison.

Locales are currently configured as part of the regional settings of the Windows operating system.

Yet a monoglot programmer may be tasked with processing multilingual text files,
i.e. several different projects, each for a specific language.
Furthermore, the programmer may largely bring IT skills to these projects, rather than skills in each of the languages.

It makes next to no sense for the programmer to keep changing the locale at the OS level.
This just leads to incomprehensible GUIs in all his Windows applications.
It may also lead to having to put up with unfamiliar keyboard layouts for different alphabets and syllabaries, etc.

To work using TextPipe in such circumstances, it would be much better for each locale-sensitive TextPipe filter
to include options for specifying the locale to use for that filter.

e.g. If you are processing a text written in French, German or Turkish, then the chosen filter would have an option to select one
of these locales from the full list of locales that TextPipe is designed to support.
    I mention French because vowels have accents and because of the cedilla, etc.
    I mention German because vowels can have umlauts and because of the character ß (U+00DF LATIN SMALL LETTER SHARP S).
    I mention Turkish because of the dotted and dotless I in the Turkish alphabet.
Such locales are not restricted to the ANSI range of characters.
Yet even extending the available locales to cover more of the Latin-alphabet-based locales would be a good start.

Block Name           Range        Code Points   Characters   Unicode Version
Basic Latin          0000..007F   128           128          1.0.0
Latin-1 Supplement   0080..00FF   128           128          1.0.0
Latin Extended-A     0100..017F   128           128          1.0.0
Latin Extended-B     0180..024F   208           208          1.0.0

Is this something that you would be prepared to develop as an enhancement to TextPipe?

Best regards,
David

DataMystic Support
Site Admin
Posts: 2164
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Locale-sensitive filters and multilingual texts?

Postby DataMystic Support » Sun Jun 10, 2012 1:59 pm

Hi David,

In short, yes. The question is how to do it.

The earlier discussion we had on each filter knowing its own input and output encoding, and then being able to suggest the appropriate intermediate filter to match the encodings would seem a good way to solve this. You could choose to override this, or perhaps only apply it when required, like the 'Remove prompting' option.

Can you give me two filters that would benefit from this approach?

Case changing filters might be a good option, but in order to prevent an explosion of language-specific filters within TextPipe, we would most likely convert incoming text to Unicode, apply the Case Conversion change using the underlying Windows API, and then convert the text back to its original encoding. This might introduce round-trip issues.
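
To make the round-trip concern concrete, here is a minimal sketch in Python. It is an illustration only: Python's built-in str.upper() stands in for the Windows API call mentioned above, and the helper name upper_via_unicode is made up for this example. It shows two ways the decode / change case / re-encode path can fail to round-trip for ISO-8859-1 input.

```python
def upper_via_unicode(data: bytes, encoding: str) -> bytes:
    """Decode to Unicode, uppercase, then re-encode in the original encoding.

    Hypothetical helper: str.upper() is a stand-in for whatever case
    conversion routine the real filter would call.
    """
    text = data.decode(encoding)
    return text.upper().encode(encoding)  # may raise UnicodeEncodeError


# Issue 1: the length changes. 'ß' uppercases to 'SS', so the output
# has one more byte than the input.
src = "straße".encode("latin-1")
out = upper_via_unicode(src, "latin-1")
print(len(src), len(out))

# Issue 2: the result is not representable. 'ÿ' (U+00FF) uppercases to
# 'Ÿ' (U+0178), which ISO-8859-1 cannot encode.
try:
    upper_via_unicode("ÿ".encode("latin-1"), "latin-1")
    round_trip_ok = True
except UnicodeEncodeError:
    round_trip_ok = False
print(round_trip_ok)
```

Neither issue arises if the text stays in Unicode from input to output, which is one argument for treating the conversion back as optional.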

Can you give me a real-world example of how you see this working?
Regards,

Simon Carter, http://DataMystic.com/forums/index.php
http://PredictBGL.com - Insulin dose calculator for Type 1 diabetes
http://DownloadPipe.com - 250,000 free software downloads
http://DetachPipe.com - send huge email attachments

DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Locale-sensitive filters and multilingual texts?

Postby DFH » Mon Jun 11, 2012 1:54 am

Hi Simon,

You may safely assume that all the input files are already Unicode.
Generally I work on text files that are encoded as UTF-8 (without BOM).
I rarely need to work with files encoded using Windows Code Pages (or anything else for that matter).
And there are already filters to convert these to Unicode.

The relevant TextPipe filter features are therefore those that use either
    (a) Case change operations or case-sensitive selections
    (b) Sorting in the defined alphabetical (or symbol) order for each specific language
You should assume that (b) also includes Count Duplicates, as this sorts its output implicitly.

Sorting even European languages has complications when the alphabet contains letters with diacritics and/or ligatures or digraphs.
See http://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions
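
A small Python sketch of the problem, for illustration. A plain sort compares code points, so every accented letter lands after "z". The second sort uses a crude key (the helper names base_letters and sort_key are invented here) that ignores combining marks, which roughly matches languages where accented letters are not separate letters. It is not the Unicode Collation Algorithm and it is wrong for languages that treat accented letters as letters in their own right.

```python
import unicodedata


def base_letters(word: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(word: str):
    # Compare on base letters first; fall back to the original spelling
    # so that words differing only in accents still sort deterministically.
    return (base_letters(word).casefold(), word)


words = ["zèbre", "école", "eau", "été", "etage"]

code_point_order = sorted(words)
accent_insensitive_order = sorted(words, key=sort_key)

print(code_point_order)
print(accent_insensitive_order)
```

The point is only that the sort key has to come from the language of the text, not from the code point values.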

There may be some other aspects that I haven't fully thought out, such as languages that do not use spaces as word boundaries.

Best regards,
David

DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Locale-sensitive filters and multilingual texts?

Postby DFH » Mon Jun 11, 2012 2:13 am

Further information:

In many Latin scripted languages, accented letters are not counted as part of the alphabet, but in some they are!
Welsh is an example of the former. See http://en.wikipedia.org/wiki/Welsh_orthography
Azerbaijani is an example of the latter. See http://en.wikipedia.org/wiki/Azerbaijani_alphabet

David

DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Locale-sensitive filters and multilingual texts?

Postby DFH » Mon Jun 11, 2012 2:14 am


DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Locale-sensitive filters and multilingual texts?

Postby DFH » Mon Jun 11, 2012 2:15 am

And most important of all, http://en.wikipedia.org/wiki/Unicode_collation_algorithm

David

PS. I had to split my reply merely because of the number of URLs.

DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Locale-sensitive filters and multilingual texts?

Postby DFH » Mon Jun 11, 2012 2:23 am

You referred to "to prevent an explosion of language-specific filters within TextPipe".

I see these as avoidable, provided each of the existing relevant filters gains a drop-down selector to specify the language of the input Unicode text files.

However, one cannot simply apply what Windows does, unless one knows which language the input file is in.
For example, if you are filtering something written in Turkish or Azerbaijani, then your filter needs to know about the dotted and dotless letter I. See
http://en.wikipedia.org/wiki/Dotted_and_dotless_I
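
A minimal Python sketch of what a per-filter language option changes, offered as an illustration rather than a proposed implementation. The function names, the DOTTED_I_LANGS set and the use of ISO 639-1 codes are all assumptions made for this example. Python's own str.upper() and str.lower() are not language-aware, so the Turkic pairs i/İ and ı/I are mapped explicitly first. Only precomposed characters are handled.

```python
# Languages assumed (for this sketch) to use the dotted/dotless I
# distinction, identified by ISO 639-1 code.
DOTTED_I_LANGS = {"tr", "az"}


def to_upper(text: str, lang: str) -> str:
    """Uppercase text, applying the i -> İ and ı -> I rules when needed."""
    if lang in DOTTED_I_LANGS:
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


def to_lower(text: str, lang: str) -> str:
    """Lowercase text, applying the İ -> i and I -> ı rules when needed."""
    if lang in DOTTED_I_LANGS:
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(to_upper("istanbul", "tr"))  # dotted capital İ at the start
print(to_upper("istanbul", "en"))  # plain capital I at the start
print(to_lower("ISPARTA", "tr"))   # dotless ı at the start
```

The same input gives different output depending on the selected language, which is why the filter cannot guess the right behaviour from the text alone.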

David

DFH
Posts: 658
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Locale-sensitive filters and multilingual texts?

Postby DFH » Mon Jun 11, 2012 2:30 am

Example of such a task:

For any given Bible translation with the digital text available, generate a Count Duplicates style word list.
This involves implicit sorting (not to mention how to deal with the various punctuation marks that can occur within words).
The sorting should be in the required collation order for the language of the translation.
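
As a rough sketch of this task in Python, assuming Turkish text that is already lowercased: count the words, then sort the word list by an explicit alphabet instead of by code point. The alphabet string, the helper names and the sample text are all assumptions for illustration, and an explicit alphabet table is a simplification, not the Unicode Collation Algorithm. Splitting on \w+ also ignores the punctuation-within-words problem entirely.

```python
import re
from collections import Counter

# Assumed letter order for the Turkish alphabet (29 letters); check it
# against an authoritative source before relying on it.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
RANK = {ch: i for i, ch in enumerate(TURKISH_ALPHABET)}


def collation_key(word: str):
    # Characters outside the alphabet sort after all known letters,
    # ordered among themselves by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in word]


def word_counts(text: str):
    """Return (word, count) pairs sorted in the assumed alphabet order."""
    counts = Counter(re.findall(r"\w+", text))
    return sorted(counts.items(), key=lambda item: collation_key(item[0]))


sample = "su çay su ılık iyi cam"
result = word_counts(sample)
for word, count in result:
    print(word, count)
```

With a code point sort, "çay" and "ılık" would both fall after "su"; with the alphabet table they land where a Turkish reader would expect them.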

David
