
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Philip TAYLOR <P.Taylor@Rhul.Ac.Uk>
Date: Fri, 30 Jan 2009 16:45:23 +0000
Message-ID: <49832EA3.9020306@Rhul.Ac.Uk>
To: Jonathan Kew <jonathan@jfkew.plus.com>
CC: Anne van Kesteren <annevk@opera.com>, Richard Ishida <ishida@w3.org>, "'L. David Baron'" <dbaron@dbaron.org>, public-i18n-core@w3.org, www-style@w3.org

Jonathan Kew wrote:

> If anyone has created content that relies on a distinction between 
> names/IDs that would be erased by normalization -- i.e., names that are 
> canonically equivalent, as defined by Unicode -- then that data and/or 
> the processes using it are not compliant with the Unicode standard, and 
> they're liable to break at some undefined point in the future when they 
> attempt to interoperate with products or data that *are* 
> Unicode-compliant. That's really a bad idea. Better to face the issue 
> now, define appropriate and robust standards, and encourage anyone who 
> has currently got such ill-designed data to fix it. (I doubt it actually 
> exists, though.)

I agree with Jonathan and Richard: normalising ("normalization",
if you prefer the longer form) will result in predictable
(and logical, and justifiable) behaviour, even if -- in the short
term -- it causes a small number of problems.
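
To make the point concrete, here is a minimal sketch (not taken from any of the specifications under discussion) using Python's standard `unicodedata` module. It shows two canonically equivalent spellings of a name comparing unequal as raw strings and equal once both are normalised to NFC. The name "café" and the helper `same_name` are hypothetical, chosen purely for illustration.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# A plain code-point comparison treats the two spellings as different names.
raw_equal = precomposed == decomposed

# After normalisation the two spellings are the same name.
normalised_equal = same_name(precomposed, decomposed)
```

Content that depends on `raw_equal` being false is exactly the content Jonathan describes as relying on a distinction that normalisation would erase.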

The issue with regard to Vietnamese interests me in particular,
because it is a language to which I have considerable exposure
(my wife is Vietnamese) and the behaviour reported for Windows XP

> Microsoft keyboards under XP produce unnormalized output where
> tone marks are separate combining characters but diacritics that
> differentiate letters are composed with their base character.

closely models that of Vietnamese handwriting, where a character
and its diacritic (horn, etc) are written as a single entity
(i.e., before moving on to the next letter) whereas the tone
marker (which applies to the whole word, not to a single
character, even though it is normally positioned over a
vowel that is physically near the centre of the word) is
written as an "afterthought" (that is, after the word is
otherwise complete).
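
As an illustration of why such output is "unnormalized", the sketch below takes one syllable nucleus in the form the quoted report describes (this exact sequence is my assumption, based on that description): the circumflex precomposed with its base letter and the tone mark as a separate combining character. The result matches neither NFC nor NFD. It uses only Python's standard `unicodedata` module; `code_points` is a hypothetical helper for display.

```python
import unicodedata

def code_points(s: str) -> list:
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]

# Assumed XP-style sequence for the letter e with circumflex and acute tone:
# the letter-forming diacritic is precomposed, the tone mark is separate.
xp_style = "\u00ea\u0301"   # e-circumflex + COMBINING ACUTE ACCENT

nfc = unicodedata.normalize("NFC", xp_style)   # fully composed form
nfd = unicodedata.normalize("NFD", xp_style)   # fully decomposed form

# The mixed sequence is canonically equivalent to both, yet identical to neither.
matches_nfc = xp_style == nfc
matches_nfd = xp_style == nfd
```

Here `code_points(nfc)` is a single code point, U+1EBF, while `code_points(nfd)` is U+0065, U+0302, U+0301; the handwriting-like mixed form sits between the two.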

Received on Friday, 30 January 2009 16:46:10 UTC
