- From: L. David Baron <dbaron@dbaron.org>
- Date: Fri, 6 Feb 2009 14:58:32 -0800
- To: www-style@w3.org, public-i18n-core@w3.org
I think these threads on Unicode Normalization have gotten a little out of hand; the messages are coming too quickly for all but a small number of people to keep up with them. Since this is a discussion that needs to involve other groups, I think what we really need to do is prepare a summary of the reasons for wanting normalization, the possible solutions, and their known advantages and disadvantages. We can then share this with other groups (HTML, ECMA) and other relevant content producers and consumers. (A huge thread with lots of back-and-forth isn't useful to share.)

Here is a start at such a summary, but it's far from complete, since I haven't had time to read anywhere near all the messages in the thread or, for that matter, to incorporate information from many of the ones I did read. It might be worth putting this on a wiki somewhere and trying to complete it (but I don't think it should get to more than, say, twice as long as it is now; anything very substantial should be linked).

-David

Statement of Problem:

Many things that people perceive as "a character" can be represented in multiple ways in Unicode. To take a simple example, a small "a" with an acute accent can be represented either as:

    U+00E1 LATIN SMALL LETTER A WITH ACUTE

or as the sequence:

    U+0061 LATIN SMALL LETTER A
    U+0301 COMBINING ACUTE ACCENT

The tools that users use to input text may vary as to which of these forms they produce, depending on the programs used (operating systems, input methods, editors) and perhaps on how the user enters the text. There may be more of this variation in some languages than in others (LINK TO EVIDENCE NEEDED, IF TRUE).

Unicode normalization is the process of converting strings to a form in which these differences are not present. NFC normalization is a set of rules for converting strings containing characters such as those above to the most-combined (composed) form (e.g., U+00E1 above), and NFD normalization is a set of rules for converting everything to the most-separated (decomposed) form (e.g., U+0061 U+0301 above). (NFKC and NFKD are analogous normalization forms that eliminate even more differences, including some that are perceivable; normalization to them does not appear to be under consideration.)

Various Web technologies depend on string matching. For example, CSS selectors allow matching of author-chosen classes and IDs, and the document.getElementById() method allows retrieving an element by its ID. When authors use strings in their own language, those strings should match when the author perceives them to be the same, whether or not different tools were used to produce, e.g., the markup and the style sheet. This author expectation is not met when the string match fails because of differences in Unicode normalization.
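To make the failure concrete, here is a minimal sketch in TypeScript, assuming an ECMAScript environment that provides the standard String.prototype.normalize() method (added to the language well after this message was written); the variable names are illustrative only:

    // Two representations of "á" that an author perceives as the same:
    const composed = "\u00E1";     // U+00E1 LATIN SMALL LETTER A WITH ACUTE
    const decomposed = "a\u0301";  // U+0061 followed by U+0301 COMBINING ACUTE ACCENT

    // The code-point-for-code-point comparison that Web APIs perform fails:
    console.log(composed === decomposed);  // false

    // Normalizing both sides to the same form makes the match succeed:
    console.log(composed.normalize("NFC") === decomposed.normalize("NFC"));  // true
    console.log(composed.normalize("NFD") === decomposed.normalize("NFD"));  // true

    // So document.getElementById(decomposed) misses an element whose id
    // attribute was typed in the composed form, and vice versa.

In these terms, solution (4) below builds the normalize() step into every comparison, while solutions (2) and (3) perform it once, at parse time or on all internal strings, so that plain comparisons suffice thereafter.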
Possible solutions:

(1) State that authors producing content for the Web should use tools that always produce one normalization. The preferred normalization would need to be defined (NFC appears to be preferred by a majority). Authors who do not follow this recommendation risk the problems described above.

    Advantages:
      Does not require changes to Web standards or to software that
      consumes Web documents. (CITATION NEEDED)
      (MORE HERE)
    Disadvantages:
      Lots of possible points of failure. (CITATION NEEDED)
      Doesn't substantively improve the problematic situation described
      above. (CITATION NEEDED)
      (MORE HERE)

(2) Require a normalization pass during the parsing of text-based Web content formats (perhaps after character encoding conversion but before parsing), but do not perform any further normalization. The preferred normalization would need to be defined (NFC appears to be the majority preference).

    Advantages:
      Requires changes to software and specifications in a very small
      number of places. (CITATION NEEDED)
      (MORE HERE)
    Disadvantages:
      (MORE HERE)

(3) Require that all data structures representing Web content be in a consistent normalization. (This may be a superset of (2), although perhaps not precisely, depending on whether parsing rules for any Web languages would vary depending on whether normalization was done before parsing.) The preferred normalization would need to be defined (NFC appears to be the majority preference).

    Advantages:
      (MORE HERE)
    Disadvantages:
      Requires changes to specifications and software at many points.
      (CITATION NEEDED)
      (MORE HERE)

(4) Require that all string comparisons done by implementations of Web technology report that strings that normalize to the same thing compare as equal. A preferred normalization would not need to be defined.

    Advantages:
      Allows whatever normalization the author preferred to produce the
      text in to persist without modification. (CITATION NEEDED)
      (MORE HERE)
    Disadvantages:
      Performance of comparisons. (CITATION NEEDED)
      Requires changes to specifications and software at many points.
      (CITATION NEEDED)
      (MORE HERE)

-- 
L. David Baron                                 http://dbaron.org/
Mozilla Corporation                       http://www.mozilla.com/