
Re: [selectors-api] Selectors API I18N Review...

From: Andrew Cunningham <andrewc@vicnet.net.au>
Date: Wed, 04 Feb 2009 10:15:10 +1100
Message-ID: <4988CFFE.7020208@vicnet.net.au>
To: Richard Ishida <ishida@w3.org>
CC: 'Martin Duerst' <duerst@it.aoyama.ac.jp>, "'Phillips, Addison'" <addison@amazon.com>, public-i18n-core@w3.org, 'fantasai' <fantasai.lists@inkedblade.net>, 'Lachlan Hunt' <lachlan.hunt@lachy.id.au>, www-style@w3.org
For languages that have to use combining diacritics and that are
diacritic heavy, it is impossible with most keyboard and input frameworks
to develop a keyboard layout that reliably generates NFC or NFD output.
Doing so requires a certain degree of sequence checking, reordering and
other forms of processing: in essence, a smart input system. Most
keyboard frameworks are not smart input systems.



The general rule of thumb is that if the letters required by a
language cannot all be represented by precomposed characters, then it is
most likely that unnormalised text will be generated.
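
As a rough illustration of that rule of thumb, here is a minimal Python
sketch using the standard unicodedata module. The helper name
has_precomposed_form is hypothetical, and the two letters are only
illustrative picks: e with grave, which has a precomposed character, and
o with ogonek and grave, which as far as I know does not.

```python
import unicodedata


def has_precomposed_form(base: str, *marks: str) -> bool:
    """Return True if base plus its combining marks composes to a
    single code point under NFC (hypothetical helper for illustration)."""
    sequence = base + "".join(marks)
    return len(unicodedata.normalize("NFC", sequence)) == 1


# e + COMBINING GRAVE ACCENT: a precomposed character (U+00E8) exists.
print(has_precomposed_form("e", "\u0300"))

# o + COMBINING OGONEK + COMBINING GRAVE ACCENT: NFC can only compose
# the ogonek (U+01EB), so the grave stays a separate combining mark.
print(has_precomposed_form("o", "\u0328", "\u0300"))
```

A language whose alphabet includes letters of the second kind cannot be
typed in NFC without combining marks, so the keyboard has to get their
order right by itself.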



Richard Ishida wrote:

>> -----Original Message-----
>> From: Martin Duerst [mailto:duerst@it.aoyama.ac.jp]
>> Sent: 31 January 2009 07:11
>>     
> ...
>   
>>> And the 83 million inhabitants of Vietnam are not the only
>>> people who face this issue.  There are many languages that use combining
>>> characters, including the Latin script based languages of Africa and
>>> aboriginal North America, most scripts of Asia, etc., and one can't always
>>> guarantee that the input methods used for those languages will always create
>>> text in one given form vis a vis normalization.
>>>
>> Using combining characters isn't what's important. If no precombined
>> variant is available, and there is only one combining character, there
>> are no problems,... It would be very good to see some actual examples,
>> rather than roundaboutly mentioning whole continents.
>>     
>
> Andrew cited some examples of African languages where denormalisation or different normalised forms can appear in text.  Jonathan mentioned some Arabic.  One or two more examples, then... 
>
> I recently came across http://languagegeek.com/ which provides a fair number of keyboards (and other things) to support aboriginal (mostly North American) languages.  I didn't have to look hard for a problem. If you install the Tlicho (Tłįchǫ or Dogrib) keyboard on Windows (see a picture at http://rishida.net/scripts/pickers/tlicho/) and type the name of the language itself, it comes out in NFD.  It is also possible to incorrectly order multiple diacritics (ie. not even NFD). You could say that the keyboard *ought* to churn out NFC, but it's too late. People using those keyboards will be producing content that may look different to that created by people using other input methods.  For example the following was generated by typing the first two accented letters using the Tlicho keyboard then the same two using the US International keyboard:
>
>     e   U+0065:   LATIN SMALL LETTER E  
>
>     ̀   U+0300:   COMBINING GRAVE ACCENT   
>
>     o   U+006F:   LATIN SMALL LETTER O 
>
>     ̀   U+0300:   COMBINING GRAVE ACCENT   
>
>     é   U+00E9:   LATIN SMALL LETTER E WITH ACUTE  
>
>     ò   U+00F2:   LATIN SMALL LETTER O WITH GRAVE 
>
> Another example would be using a standard Tamil Windows keyboard, where *on the same keyboard* it is just as easy to produce a result that looks the same using
>
> 0B95:   க  TAMIL LETTER KA
>
> 0BCB:   ோ  TAMIL VOWEL SIGN OO
>
> as 
>
> 0B95:   க  TAMIL LETTER KA
>
> 0BC7:   ே  TAMIL VOWEL SIGN EE
>
> 0BBE:   ா  TAMIL VOWEL SIGN AA 
>
> The font I use doesn't complain about it.  The sequence with the single combining character is NFC, the one with two is NFD.  This of course applies to a number of Indic scripts.
>
> Another two examples, Khmer (http://rishida.net/scripts/khmer/#cporder) and Myanmar (http://rishida.net/scripts/myanmar/#cporder) are almost as sensitive to combining character order as Vietnamese.  Some fonts only work with one order for multiple diacritics, other fonts allow different ordering.  Again, a single keyboard tends to allow a user to input the same text in different ways.
>
> These are just a few examples, and by no means exhaustive.
>
> RI
>
>
>
>   
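
The sequences quoted above can be checked with a short Python sketch,
again using only the standard unicodedata module. The helper name
same_text and the variable names are my own, and the o with ogonek and
grave pair is an assumed example of misordered diacritics rather than
output captured from any particular keyboard.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalising both to the same form
    (hypothetical helper for illustration)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Tamil: KA + VOWEL SIGN OO versus KA + VOWEL SIGN EE + VOWEL SIGN AA.
tamil_one_sign = "\u0B95\u0BCB"
tamil_two_signs = "\u0B95\u0BC7\u0BBE"

# o with grave: decomposed (as from the Tlicho keyboard) versus precomposed.
decomposed_o_grave = "o\u0300"
precomposed_o_grave = "\u00F2"

# Two diacritics typed in different orders: ogonek then grave, grave then ogonek.
ogonek_first = "o\u0328\u0300"
grave_first = "o\u0300\u0328"

pairs = [
    (tamil_one_sign, tamil_two_signs),
    (decomposed_o_grave, precomposed_o_grave),
    (ogonek_first, grave_first),
]
for a, b in pairs:
    # A raw code point comparison fails; a normalised comparison succeeds.
    print(a == b, same_text(a, b))
```

Each pair differs code point for code point, yet compares equal once
both sides are normalised to the same form, whether NFC or NFD.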

-- 
Andrew Cunningham
Senior Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andrewc@vicnet.net.au
Alt email: lang.support@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au


Received on Tuesday, 3 February 2009 23:16:42 GMT
