Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Jonathan Kew on 2009-02-01 (public-i18n-core@w3.org from January to March 2009)

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Sun, 1 Feb 2009 18:17:52 +0000
To: "Anne van Kesteren" <annevk@opera.com>
Cc: "Andrew Cunningham" <andrewc@vicnet.net.au>, "Richard Ishida" <ishida@w3.org>, "'L. David Baron'" <dbaron@dbaron.org>, public-i18n-core@w3.org, www-style@w3.org
Message-Id: <20FE7543-350C-4A66-9D20-5D3101321D31@jfkew.plus.com>

On 1 Feb 2009, at 16:02, Anne van Kesteren wrote:

> On Sun, 01 Feb 2009 03:17:04 +0100, Andrew Cunningham
> <andrewc@vicnet.net.au> wrote:
>> but developers have to type code, sometimes more than one developer
>> needs to work on the code. And if they are using different input tools,
>> and those tools are generating different codepoints, when identical
>> codepoints are required ... then there is a problem.
>
> I can definitely see that problems might arise. And I can also see  
> that putting complexity on the user agent side is better than  
> putting it on the developer side. However, there are several things  
> to take into consideration here.
>
> 1. How many developers are actually facing this problem? We know  
> that theoretically there is an issue here, but I do not believe  
> research has shown that this is a problem in practice. E.g. as I  
> understand things this could occur with the character ë, but has it?

Probably not, as virtually all keyboard layouts used for typical  
European languages generate precomposed characters.

The situation is different in some other parts of the world. For  
example, Arabic script uses a number of diacritics, both for vowels  
and other functions. It happens (unfortunately) that the conventional  
order in which some of these are usually entered by typists does not  
match the canonical order in Unicode, and therefore the exact code  
sequence differs depending on whether some process between the original  
typist's fingers and the document that gets delivered to the browser  
has normalized the data (either as NFC or NFD).
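
To illustrate the reordering, here is a minimal Python sketch using the
standard unicodedata module. The sample sequence (beh carrying shadda and
fatha) and the assumed typing order are chosen purely for illustration and
are not taken from any real document:

```python
import unicodedata

BEH, FATHA, SHADDA = "\u0628", "\u064E", "\u0651"

# Assumed typing order for illustration: shadda first, then the vowel mark.
typed = BEH + SHADDA + FATHA

# Canonical ordering sorts adjacent marks by combining class,
# so the mark with the lower class has to come first.
classes = [unicodedata.combining(ch) for ch in (FATHA, SHADDA)]

nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", typed)

def code_points(s: str) -> list[str]:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in s]

print("combining classes (fatha, shadda):", classes)
print("as typed:", code_points(typed))
print("NFC:     ", code_points(nfc))
print("NFD:     ", code_points(nfd))
print("typed == NFC?", typed == nfc)
```

If the typed order does not already match the canonical order, any
normalizing step along the way silently changes the code sequence, so a
naive code-point comparison against the un-normalized original fails.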

> 2. What is the performance impact on processing? That is, is the  
> impact so negligible that browser vendors can add it? (FWIW, we care  
> about microseconds.)

A definitive answer to this can best be provided by tests on an actual  
implementation, of course. My guess -- based on experience with  
Unicode, normalization, etc, but nevertheless just a guess -- is that  
the impact can be made negligible at least for (the vast majority of)  
cases where the selectors or other identifiers in question are in fact  
"simple" NFC-encoded Western (Latin script) strings. An implementation  
can verify this very cheaply at the same time as performing a naive  
comparison, and only go to a more expensive path in the rare cases  
where combining marks or other "complicating" factors are actually  
present.
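
A rough sketch of that two-tier approach, in Python, might look like the
following. The function name canonically_equal and the U+0300 threshold are
assumptions made for this illustration (every code point below U+0300 is
unaffected by NFC, so a string made only of such code points is already in
NFC). A real implementation would presumably fold the check into the same
pass as the comparison rather than scanning separately:

```python
import unicodedata

# Assumption for this sketch: code points below U+0300 (the first
# combining mark) never change under NFC, so strings built only from
# them are already NFC and can be compared naively.
NFC_INERT_LIMIT = 0x0300

def canonically_equal(a: str, b: str) -> bool:
    """Compare two identifiers up to canonical equivalence."""
    # Fast path 1: identical code point sequences are trivially equal.
    if a == b:
        return True
    # Fast path 2: both strings are "simple", so the naive result stands.
    if all(ord(ch) < NFC_INERT_LIMIT for ch in a + b):
        return False
    # Slow path: only reached when combining marks or other
    # "complicating" characters are actually present.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonically_equal("header", "header"))
print(canonically_equal("header", "footer"))
print(canonically_equal("caf\u00E9", "cafe\u0301"))
```

Only the third call reaches the normalization step; the first two are
settled by the cheap checks.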

> 3. How likely is that XML will change to require doing NFC  
> normalization on input? Currently XML does reference Unicode  
> Normalization normatively, but it does only do so from a  
> non-normative section on guidelines for designing XML names.

The section on "Normalization Checking" at
http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-normalization-checking
seems to strongly encourage the use of NFC throughout, although it  
does not mandate it: "All XML parsed entities (including document  
entities) should be fully normalized...." (note "should" rather than  
"must").

> If XML does not change it does not make a whole lot of sense to  
> change e.g. CSS selector matching because that would mean some XML  
> element names that are not in NFC could no longer be selected.
>
> The last one is quite important. If Unicode Normalization is so  
> important it has to happen everywhere, otherwise the platform  
> becomes inconsistent. This means XML will have to change, HTML will  
> have to change, CSS will have to change, DOM APIs will have to change,  
> etc. That's a lot of tedious work with great potential for bugs and  
> performance issues. Without very clear evidence that such a major  
> overhaul is needed, I doubt you'll convince many vendors.

The Unicode standard defines conformance requirements that already  
include: "(C6) A process shall not assume that the interpretations of  
two canonical-equivalent character sequences are distinct." If XML,  
for example, uses Unicode as its underlying character encoding, then  
it ought to follow from Unicode conformance requirements that XML  
processors should not be treating <e-dieresis> and <e, combining  
dieresis> as distinct.
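
For concreteness, the two sequences in question can be compared with a
short Python sketch (standard unicodedata module; the variable names are
just for illustration):

```python
import unicodedata

precomposed = "\u00EB"  # LATIN SMALL LETTER E WITH DIAERESIS
decomposed = "e\u0308"  # LATIN SMALL LETTER E + COMBINING DIAERESIS

# A naive comparison sees two different code point sequences...
code_points_differ = precomposed != decomposed

# ...but both normalization forms map them to a single sequence.
same_under_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))
same_under_nfd = (unicodedata.normalize("NFD", precomposed)
                  == unicodedata.normalize("NFD", decomposed))

print(code_points_differ, same_under_nfc, same_under_nfd)
```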

JK

Received on Sunday, 1 February 2009 18:18:40 UTC