- From: Robert J Burns <rob@robburns.com>
- Date: Mon, 2 Feb 2009 21:15:37 -0600
- To: public-i18n-core@w3.org
- Message-Id: <A56A94A3-2E0D-4946-866C-D262FCA7A85E@robburns.com>
Regarding Andrew Cunningham's message[1], which I encountered reading the archives:

Hi Andrew et al,

I think it's important to take a step back, look at normalization from a broader perspective, and understand its intent. First, characters within NFC/NFD strings are intended within Unicode to be equivalent characters. That is, the intent of Unicode is that:

1) any two strings normalized to one of those forms (either one, it doesn't matter which) will match byte for byte; and

2) authors do not use two canonically equivalent characters (or character sequences) in a semantically distinct way.

The reason for NOT normalizing (aside from the performance gains, the just-being-lazy reason :-)) is to support cases where authors have mistakenly treated two canonically equivalent characters as semantically distinct (again, it is a mistake for an author to do that). So the prevalence of Unicode UAs that do not normalize (one way or the other, NFC or NFD) tends to increase the likelihood that authors will make that mistake. In addition, the prevalence of poorly designed UIs for input systems also increases the likelihood that authors will make such a mistake.

So this gets back to the recurring issue of strict error-handling. If the W3C can take a stand and insist on normalizing strings as much as possible, this will take care of one part of the author-error problem. The other issue, better GUI input methods, is not really something the W3C is in a position to address (at least to my knowledge). However, I think at the least the W3C could promote (in CSS and elsewhere):

1) Requiring implementations to perform canonical normalization of non-character tokens on parsing (NFC is the general thrust now, so there's nothing much to be gained by trying to reverse that; a brief sketch of what this could look like follows below).

2) Requiring implementations to produce NFC for newly produced content, especially for markup as opposed to content (which is consistent with NFC normalization on parsing, for editors that involve parsing).

3) Possibly doing the same for items (1) and (2) even for content.

4) For characters with the property NFC_Quick_Check=NO (and perhaps also the 102 characters with NFC_Quick_Check=MAYBE whenever they are used in a non-NFC-normalized string):
   * prohibit their use in markup (as opposed to content);
   * discourage their use in content.

5) For characters with the property NFKC_Quick_Check=NO (and perhaps also the 102 characters with NFKC_Quick_Check=MAYBE whenever they are used in a non-NFKC-normalized string):
   * prohibit their use in markup (as opposed to content);
   * discourage their use in content.

In general I think these rules should be used throughout all W3C recommendations. I think we're still early enough in the Unicode adoption process to address the situation in a correct manner. In other words, adopting these rules now would cause little pain for existing content, but any delay would lead to more trouble for content producers.
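As a concrete illustration of recommendations (1) and (4), here is a minimal sketch using Python's standard unicodedata module; the helper names are my own and are not drawn from any W3C or Unicode specification, and the string-level check is only a rough stand-in for consulting the NFC_Quick_Check property character by character.

```python
import unicodedata

def normalize_token(token: str) -> str:
    """Recommendation (1): canonically normalize a parsed markup token to NFC."""
    return unicodedata.normalize("NFC", token)

# Two canonically equivalent spellings of "e with acute": precomposed vs. base + combining mark.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

assert precomposed != decomposed                                     # raw code points differ
assert normalize_token(precomposed) == normalize_token(decomposed)   # byte-for-byte match after NFC

def acceptable_markup_name(name: str) -> bool:
    """Recommendation (4), roughly: reject markup names that are not already in NFC.
    unicodedata.is_normalized() (Python 3.8+) uses the NFC quick check internally
    where it can."""
    return unicodedata.is_normalized("NFC", name)

assert acceptable_markup_name(precomposed)
assert not acceptable_markup_name(decomposed)
```

A spec-grade implementation would presumably look up NFC_Quick_Check per character rather than testing whole strings, but the string-level check captures the intent of the recommendations.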
Boris has raised the question of where these problems arise from; that is helpful to understand in addressing the issue of Unicode normalization too. Three competing goals of Unicode lead to the normalization confusion:

1) the desire to have separate diacritical and other combining marks that can be combined in any way with any base character;

2) the desire to have more compact encodings, where common composed characters might be represented by a single code point rather than multiple code points (there was also a strong tendency toward precomposed characters, because simple text rendering then does not require OpenType-like property tables, and because of the preponderance of existing text systems that used precomposed characters); and

3) the desire to provide a code point allowing round-trip mapping of every other existing encoding into Unicode/UCS.

Without the second and third goals, no normalization form C or KC would be possible: only base characters would have been assigned in the UCS. With those two goals, however, many (but not all) precomposed characters have been assigned within Unicode/UCS.

Earlier Addison said:

> NFC does not guarantee that no combining marks appear in the text.
> Applying NFC only means that any combining marks that can be
> combined with their base characters are, in fact, combined.

That is not quite true either. Due to Unicode's normalization stabilization policy (and perhaps other reasons I'm not aware of), some combining marks that have equivalent precomposed forms do in fact remain uncombined in NFC. So folding issues remain even for normalized forms (though far fewer). For example, the sequence U+1D157 U+1D165 is expressible as the precomposed U+1D15E, yet U+1D15E never appears in an NFC string; the sequence U+1D157 U+1D165 appears instead.

So normalization form C does not guarantee that everything is composed. However, either normalization form (NFC or NFD) does guarantee that canonically equivalent characters are represented by the same code points, so string comparisons can be made between two or more strings. String length in terms of grapheme cluster boundaries is not dealt with, except by following the more complex grapheme cluster boundary algorithms[2]; those might be needed, for example, in validation where an input value must remain less than 4 characters and "character" is probably intended to mean grapheme cluster rather than code point or octet.

Normalization also does not quite guarantee that strings which render the same are represented by the same code points. Unicode had a withdrawn Technical Report[3] that might have dealt better with that issue. However, the recommendations I listed above, combined with a recommendation to be careful about mixing characters from different scripts within any single markup specification (including any NAME, NMTOKEN, attribute value, or other markup), would address much of it; for example, don't introduce a semantically distinct element in the HTML namespace and name it 'Α' (U+0391) alongside 'A' (U+0041).

So I don't think NFC has any strong advantages over NFD (or the advantages are often over-estimated). However, it is important to pick one normalization form or the other, and to remain aware (and make authors and implementors aware) of the limited benefits of normalization and of the issues that still remain even after normalizing. I also think handling this at parsing would be the best way to go.
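To make the U+1D15E and grapheme-cluster points above concrete, here is a small check using Python's unicodedata; the last lines assume the third-party regex package is installed, since it supports the \X grapheme-cluster pattern from UAX #29 and the standard library does not expose grapheme boundaries.

```python
import unicodedata

# MUSICAL SYMBOL HALF NOTE is canonically equivalent to U+1D157 + U+1D165,
# but it is on the composition exclusion list, so NFC leaves the sequence decomposed.
half_note = "\U0001D15E"
nfc = unicodedata.normalize("NFC", half_note)
print([f"U+{ord(c):04X}" for c in nfc])   # ['U+1D157', 'U+1D165']
assert nfc == "\U0001D157\U0001D165"

# Code-point length vs. grapheme-cluster length: a limit of "4 characters" in
# validation usually means grapheme clusters, not code points or octets.
word = "de\u0301ja\u0300"                 # "deja" with combining acute and grave
print(len(word))                          # 6 code points
print(len(word.encode("utf-8")))          # 8 octets

import regex                              # third-party: pip install regex
print(len(regex.findall(r"\X", word)))    # 4 grapheme clusters
```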
Unicode has now tried to get strict about new precomposed character assignments (which in itself introduces a Western bias, since it is minority languages that are left to be allocated and that will now NOT be assigned precomposed characters, for whatever limited benefits precomposed characters bring). So one could say that the preferred approach of Unicode is decomposed characters, while the preferred approach of the W3C appears to be NFC. Given those somewhat bikeshed-like disagreements, it is probably better to handle canonical normalization in the parser than to expect others to always stick with NFC in content creation.

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0116.html>
[2]: See <http://www.unicode.org/reports/tr29/> and <http://www.unicode.org/reports/tr14/>
[3]: <http://www.unicode.org/reports/tr30/tr30-4.html>