
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Philip TAYLOR <P.Taylor@Rhul.Ac.Uk>
Date: Fri, 30 Jan 2009 16:45:23 +0000
Message-ID: <49832EA3.9020306@Rhul.Ac.Uk>
To: Jonathan Kew <jonathan@jfkew.plus.com>
CC: Anne van Kesteren <annevk@opera.com>, Richard Ishida <ishida@w3.org>, "'L. David Baron'" <dbaron@dbaron.org>, public-i18n-core@w3.org, www-style@w3.org



Jonathan Kew wrote:

> If anyone has created content that relies on a distinction between 
> names/IDs that would be erased by normalization -- i.e., names that are 
> canonically equivalent, as defined by Unicode -- then that data and/or 
> the processes using it are not compliant with the Unicode standard, and 
> they're liable to break at some undefined point in the future when they 
> attempt to interoperate with products or data that *are* 
> Unicode-compliant. That's really a bad idea. Better to face the issue 
> now, define appropriate and robust standards, and encourage anyone who 
> has currently got such ill-designed data to fix it. (I doubt it actually 
> exists, though.)

I agree with Jonathan and Richard: normalising ("normalization",
if you prefer the longer form) will result in predictable
(and logical, and justifiable) behaviour, even if -- in the short
term -- it causes a small number of problems.
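
To make the point about canonically equivalent names concrete, here is a
minimal sketch in Python using the standard unicodedata module. The helper
name same_identifier and the sample strings are illustrative assumptions,
not anything proposed in this thread or in the CSS drafts; the sketch
assumes NFC as the normalization form.

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two spellings of the same name that are canonically equivalent.
precomposed = "caf\u00e9"   # e-acute as a single code point (U+00E9)
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# A raw code point comparison treats them as different names.
raw_equal = precomposed == decomposed

# Comparing after normalization treats them as the same name.
normalized_equal = same_identifier(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

Under these assumptions, a selector written with one spelling would match
an ID written with the other only if the comparison normalizes first.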

The issue with regard to Vietnamese interests me in particular,
because it is a language to which I have considerable exposure
(my wife is Vietnamese) and the behaviour reported for Windows XP

> Microsoft keyboards under XP produce unnormalized output where
> tone marks are separate combining characters but diacritics that
> differentiate letters are composed with their base character.

closely models that of Vietnamese handwriting, where a character
and its diacritic (horn, etc.) are written as a single entity
(i.e., before moving on to the next letter) whereas the tone
marker (which applies to the whole word, not to a single
character, even though it is normally positioned over a
vowel that is physically near the centre of the word) is
written as an "afterthought" (that is, after the word is
otherwise complete).
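
The mixed form described above can be sketched in Python with the standard
unicodedata module. The sample word "lời" and the variable names are
illustrative assumptions; the exact sequences produced by any particular
keyboard driver are not verified here.

```python
import unicodedata

# Mixed form as described for the XP keyboards: the horn is precomposed
# with its base letter (U+01A1, o with horn), while the tone mark is a
# separate combining character (U+0300, combining grave accent).
mixed = "l\u01a1\u0300i"

# Fully composed (NFC) and fully decomposed (NFD) forms of the same word.
nfc = unicodedata.normalize("NFC", mixed)
nfd = unicodedata.normalize("NFD", mixed)

# The mixed form is neither NFC nor NFD, hence "unnormalized".
is_nfc = mixed == nfc
is_nfd = mixed == nfd

# Show the code points of each form for comparison.
for label, text in (("mixed", mixed), ("NFC", nfc), ("NFD", nfd)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

All three strings are canonically equivalent, so any of the Unicode
normalization forms maps them to a single common representation.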

Philip TAYLOR
Received on Friday, 30 January 2009 16:46:10 GMT
