- From: Jonathan Kew <jonathan@jfkew.plus.com>
- Date: Sun, 1 Feb 2009 18:17:52 +0000
- To: "Anne van Kesteren" <annevk@opera.com>
- Cc: "Andrew Cunningham" <andrewc@vicnet.net.au>, "Richard Ishida" <ishida@w3.org>, "'L. David Baron'" <dbaron@dbaron.org>, public-i18n-core@w3.org, www-style@w3.org
On 1 Feb 2009, at 16:02, Anne van Kesteren wrote:

> On Sun, 01 Feb 2009 03:17:04 +0100, Andrew Cunningham
> <andrewc@vicnet.net.au> wrote:
>> but developers have to type code, sometimes more than one developer
>> needs to work on the code. And if they are using different input tools,
>> and those tools are generating different codepoints, when identical
>> codepoints are required ... then there is a problem.
>
> I can definitely see that problems might arise. And I can also see
> that putting complexity on the user agent side is better than
> putting it on the developer side. However, there are several things
> to take into consideration here.
>
> 1. How many developers are actually facing this problem? We know
> that theoretically there is an issue here, but I do not believe
> research has shown that this is a problem in practice. E.g. as I
> understand things this could occur with the character ë, but has it?

Probably not, as virtually all keyboard layouts used for typical European
languages generate precomposed characters.

The situation is different in some other parts of the world. For example,
the Arabic script uses a number of diacritics, both for vowels and for
other functions. It happens (unfortunately) that the conventional order in
which some of these are usually entered by typists does not match the
canonical order in Unicode, and therefore the exact code sequence differs
depending on whether some process between the original typist's fingers
and the document that gets delivered to the browser has normalized the
data (as either NFC or NFD).

> 2. What is the performance impact on processing? That is, is the
> impact so negligible that browser vendors can add it? (FWIW, we care
> about microseconds.)

A definitive answer to this can best be provided by tests on an actual
implementation, of course. My guess -- based on experience with Unicode,
normalization, etc., but nevertheless just a guess -- is that the impact
can be made negligible, at least for (the vast majority of) cases where
the selectors or other identifiers in question are in fact "simple"
NFC-encoded Western (Latin-script) strings. An implementation can verify
this very cheaply at the same time as performing a naive comparison, and
only go to a more expensive path in the rare cases where combining marks
or other "complicating" factors are actually present.
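As a rough Python sketch of that fast path (my illustration only; the
helper names are hypothetical, and a real implementation would fold the
cheap scan into the comparison loop itself rather than making separate
passes):

    import unicodedata

    def _is_simple(s):
        """Cheap scan: True if every character is plain ASCII.

        ASCII text is already in NFC and contains no combining marks,
        so a straight code-point comparison is also a
        canonical-equivalence comparison.
        """
        return all(ord(ch) < 0x80 for ch in s)

    def identifiers_match(a, b):
        """Compare two identifiers up to canonical equivalence."""
        if a == b:                        # naive comparison; the common case
            return True
        if _is_simple(a) and _is_simple(b):
            return False                  # both "simple" and unequal: done
        # Rare, complicated path: normalize before comparing.
        return (unicodedata.normalize("NFC", a) ==
                unicodedata.normalize("NFC", b))

For example, identifiers_match("caf\u00e9", "cafe\u0301") is True, while
two unequal ASCII selectors never reach the normalization step at all.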
> 3. How likely is it that XML will change to require doing NFC
> normalization on input? Currently XML does reference Unicode
> Normalization normatively, but it only does so from a non-normative
> section on guidelines for designing XML names.

The section on "Normalization Checking" at
http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-normalization-checking
seems to strongly encourage the use of NFC throughout, although it does
not mandate it: "All XML parsed entities (including document entities)
should be fully normalized...." (note "should" rather than "must").

> If XML does not change it does not make a whole lot of sense to
> change e.g. CSS selector matching because that would mean some XML
> element names that are not in NFC could no longer be selected.
>
> The last one is quite important. If Unicode Normalization is so
> important it has to happen everywhere, otherwise the platform
> becomes inconsistent. This means XML will have to change, HTML will
> have to change, CSS will have to change, DOM APIs will have to change,
> etc. That's a lot of tedious work with great potential for bugs and
> performance issues. Without very clear evidence that such a major
> overhaul is needed, I doubt you'll convince many vendors.

The Unicode standard defines conformance requirements that already
include:

  "(C6) A process shall not assume that the interpretations of two
  canonical-equivalent character sequences are distinct."

If XML, for example, uses Unicode as its underlying character encoding,
then it ought to follow from Unicode's conformance requirements that XML
processors should not be treating <e-dieresis> and <e, combining
dieresis> as distinct.
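As a small concrete illustration of C6 (my own sketch, not a quote from
the standard), Python's unicodedata module shows the two spellings
differing as code-point sequences but comparing equal once canonically
normalized:

    import unicodedata

    precomposed = "\u00EB"     # LATIN SMALL LETTER E WITH DIAERESIS
    decomposed = "e\u0308"     # "e" followed by COMBINING DIAERESIS

    print(precomposed == decomposed)    # False: different code points
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))
                                        # True: canonically equivalent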
JK

Received on Sunday, 1 February 2009 18:18:39 UTC