Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

At 03:17 09/02/02, Jonathan Kew wrote:
>
>On 1 Feb 2009, at 16:02, Anne van Kesteren wrote:

>> 1. How many developers are actually facing this problem? We know  
>> that theoretically there is an issue here, but I do not believe  
>> research has shown that this is a problem in practice. E.g. as I  
>> understand things this could occur with the character ü but has it?
>
>Probably not, as virtually all keyboard layouts used for typical  
>European languages generate precomposed characters.
>
>The situation is different in some other parts of the world. For  
>example, Arabic script uses a number of diacritics, both for vowels  
>and other functions. It happens (unfortunately) that the conventional  
>order in which some of these are usually entered by typists does not  
>match the canonical order in Unicode, and therefore the exact code  
>sequence differs depending whether some process between the original  
>typist's fingers and the document that gets delivered to the browser  
>has normalized the data (either as NFC or NFD).

That's a good concrete example (the first one after Vietnamese),
but how many developers are actually going to use multiple diacritics
in identifiers such as class names?
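To make the ordering issue concrete, here is a small Python sketch
(the letter beh with shadda and kasra is just one illustrative
combination): the conventional typing order puts the shadda
(combining class 33) before the kasra (combining class 32), which is
the reverse of Unicode canonical order, so the raw sequences differ
even though they are canonically equivalent.

  import unicodedata

  # ARABIC LETTER BEH with shadda + kasra in the conventional typing
  # order, and the same marks in canonical order (kasra, ccc 32,
  # before shadda, ccc 33).
  typed     = "\u0628\u0651\u0650"
  canonical = "\u0628\u0650\u0651"

  print(typed == canonical)                          # False
  print(unicodedata.normalize("NFC", typed)
        == unicodedata.normalize("NFC", canonical))  # True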

>> 2. What is the performance impact on processing? That is, is the  
>> impact so negligible that browser vendors can add it? (FWIW, we care  
>> about microseconds.)
>
>A definitive answer to this can best be provided by tests on an actual  
>implementation, of course. My guess -- based on experience with  
>Unicode, normalization, etc, but nevertheless just a guess -- is that  
>the impact can be made negligible at least for (the vast majority of)  
>cases where the selectors or other identifiers in question are in fact  
>"simple" NFC-encoded Western (Latin script) strings. An implementation  
>can verify this very cheaply at the same time as performing a naive  
>comparison, and only go to a more expensive path in the rare cases  
>where combining marks or other "complicating" factors are actually  
>present.

My guess is that the impact can be very low. As said above, it's easy
to check quickly whether the relevant data is in NFC and only do extra
work when it isn't. In addition, it is easily possible to pre-normalize
every class/element/whatever identifier (or even the whole file,
although in some cases that might have some interactions with fonts)
rather than checking/normalizing for each comparison separately.
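As a rough sketch of that strategy in Python (the function names are
just for illustration; a real implementation would do the cheap check
inline during the comparison):

  import unicodedata

  def identifiers_equal(a, b):
      # Fast path: pure-ASCII strings are already in NFC, so a plain
      # comparison is all that is needed; the check itself is cheap.
      if a.isascii() and b.isascii():
          return a == b
      # Rare path: normalize both sides, then compare.
      return (unicodedata.normalize("NFC", a)
              == unicodedata.normalize("NFC", b))

  def intern_identifier(ident):
      # Alternative: pre-normalize once when the identifier is parsed
      # and stored, so later comparisons are plain string comparisons.
      return unicodedata.normalize("NFC", ident)

The fast path only has to recognize the easy cases correctly; anything
it is unsure about can simply fall through to normalization.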


>> 3. How likely is it that XML will change to require doing NFC  
>> normalization on input? Currently XML does reference Unicode  
>> Normalization normatively, but it only does so from a non-normative  
>> section on guidelines for designing XML names.
>
>The section on "Normalization Checking" at http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-normalization-checking  seems to strongly encourage the use of NFC throughout, although it  
>does not mandate it: "All XML parsed entities (including document  
>entities) should be fully normalized...." (note "should" rather than  
>"must").

Please note that this is XML 1.1, not XML 1.0. While XML 1.0 is
alive and very, very strong, some people would call XML 1.1
"essentially dead", and they would not be far from wrong.

>The Unicode standard defines conformance requirements that already  
>include: "(C6) A process shall not assume that the interpretations of  
>two canonical-equivalent character sequences are distinct." If XML,  
>for example, uses Unicode as its underlying character encoding, then  
>it ought to follow from Unicode conformance requirements that XML  
>processors should not be treating <e-dieresis> and <e, combining  
>dieresis> as distinct.

It may look that way, but the situation is actually a bit different.

For your reference, here are some of the explanatory
notes after (C6):

o The implications of this conformance clause are twofold. First,
  a process is never required to give different interpretations to
  two different, but canonical-equivalent character sequences.
  Second, no process can assume that another process will make a
  distinction between two different, but canonical-equivalent
  character sequences.

o Ideally, an implementation would always interpret two canonical-
  equivalent character sequences identically. There are practical
  circumstances under which implementations may reasonably
  distinguish them.
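
To illustrate (C6) with the pair Jonathan mentioned, here is a small
Python sketch: <e-dieresis> (U+00EB) and <e, combining dieresis>
(U+0065 U+0308) are different code point sequences but canonically
equivalent, and normalization makes them compare equal:

  import unicodedata

  precomposed = "\u00EB"    # LATIN SMALL LETTER E WITH DIAERESIS
  decomposed  = "e\u0308"   # 'e' followed by COMBINING DIAERESIS

  print(precomposed == decomposed)                      # False
  print(unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", decomposed))    # True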


Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Monday, 2 February 2009 05:38:09 UTC