
RE: [selectors-api] Selectors API I18N Review...

From: Richard Ishida <ishida@w3.org>
Date: Tue, 3 Feb 2009 13:10:28 -0000
To: "'Martin Duerst'" <duerst@it.aoyama.ac.jp>, "'Phillips, Addison'" <addison@amazon.com>, <public-i18n-core@w3.org>
Cc: "'fantasai'" <fantasai.lists@inkedblade.net>, "'Lachlan Hunt'" <lachlan.hunt@lachy.id.au>, <www-style@w3.org>
Message-ID: <001301c98600$c9902980$5cb07c80$@org>

> -----Original Message-----
> From: Martin Duerst [mailto:duerst@it.aoyama.ac.jp]
> Sent: 31 January 2009 07:11
...
> >And the 83 million inhabitants of Vietnam are not the only
> >people who face this issue.  There are many languages that use combining
> >characters, including the Latin script based languages of Africa and
> >aboriginal North America, most scripts of Asia, etc., and one can't always
> >guarantee that the input methods used for those languages will always
> create
> >text in one given form vis a vis normalization.
> 
> Using combinging characters isn't what's important. If no precombined
> variant is available, and there is only one combining character, there
> are no problems,... It would be very good to see some actual examples,
> rather than roundaboutly mentioning whole continents.

Andrew cited some examples of African languages where denormalisation or different normalised forms can appear in text.  Jonathan mentioned some Arabic.  One or two more examples, then... 

I recently came across http://languagegeek.com/ which provides a fair number of keyboards (and other things) to support aboriginal (mostly North American) languages.  I didn't have to look hard for a problem. If you install the Tlicho (Tłįchǫ or Dogrib) keyboard on Windows (see a picture at http://rishida.net/scripts/pickers/tlicho/) and type the name of the language itself, it comes out in NFD.  It is also possible to order multiple diacritics incorrectly (i.e. the result is not even NFD). You could say that the keyboard *ought* to churn out NFC, but it's too late. People using those keyboards will be producing content whose underlying code points may differ from those in content created by people using other input methods.  For example, the following was generated by typing the first two accented letters using the Tlicho keyboard, then the same two using the US International keyboard:

    e   U+0065:   LATIN SMALL LETTER E  

    ̀   U+0300:   COMBINING GRAVE ACCENT   

    o   U+006F:   LATIN SMALL LETTER O 

    ̀   U+0300:   COMBINING GRAVE ACCENT   

    é   U+00E9:   LATIN SMALL LETTER E WITH ACUTE  

    ò   U+00F2:   LATIN SMALL LETTER O WITH GRAVE 
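To make the mismatch concrete, here is a minimal sketch in Python using the standard unicodedata module. It compares the decomposed and precomposed forms of o with grave from the listing above, then two orderings of a grave and an ogonek on the same base letter. The helper name same_text and the ogonek example are my own illustration, not output captured from either keyboard.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


decomposed = "o\u0300"   # LATIN SMALL LETTER O + COMBINING GRAVE ACCENT
precomposed = "\u00f2"   # LATIN SMALL LETTER O WITH GRAVE

# A raw comparison sees two different code point sequences.
print(decomposed == precomposed)

# After normalizing both sides, they compare as equal.
print(same_text(decomposed, precomposed))

# Hypothetical example of two diacritics typed in different orders.
grave_first = "o\u0300\u0328"    # o + grave + COMBINING OGONEK
ogonek_first = "o\u0328\u0300"   # o + ogonek + grave (canonical order)

# Normalization applies canonical ordering, so both orders converge.
print(same_text(grave_first, ogonek_first, "NFD"))
```

Any process that compares such strings code point by code point, without normalizing first, would treat the two spellings as different text.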

Another example would be using a standard Tamil Windows keyboard, where *on the same keyboard* it is just as easy to produce a result that looks the same using

0B95:   க  TAMIL LETTER KA

0BCB:   ோ  TAMIL VOWEL SIGN OO

as 

0B95:   க  TAMIL LETTER KA

0BC7:   ே  TAMIL VOWEL SIGN EE

0BBE:   ா  TAMIL VOWEL SIGN AA 

The font I use doesn't complain about it.  The sequence with a single combining character is NFC; the one with two is NFD.  This of course applies to a number of Indic scripts.
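The relationship between the two Tamil sequences can be checked with a short sketch, again assuming Python's standard unicodedata module. The variable names and the code_points helper are only for illustration.

```python
import unicodedata

one_sign = "\u0b95\u0bcb"          # TAMIL LETTER KA + VOWEL SIGN OO
two_signs = "\u0b95\u0bc7\u0bbe"   # TAMIL LETTER KA + VOWEL SIGN EE + VOWEL SIGN AA


def code_points(s: str) -> str:
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# NFC composes EE + AA into the single vowel sign OO.
print(code_points(unicodedata.normalize("NFC", two_signs)))

# NFD decomposes OO back into EE + AA.
print(code_points(unicodedata.normalize("NFD", one_sign)))
```

Without normalization the two strings are unequal, even though they were typed on the same keyboard and render identically.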

Another two examples, Khmer (http://rishida.net/scripts/khmer/#cporder) and Myanmar (http://rishida.net/scripts/myanmar/#cporder), are almost as sensitive to combining character order as Vietnamese.  Some fonts only work with one order for multiple diacritics; other fonts allow different orderings.  Again, a single keyboard tends to allow a user to input the same text in different ways.

These are just a few examples, and by no means exhaustive.

RI
Received on Tuesday, 3 February 2009 13:10:37 GMT
