- From: Robert J Burns <rob@robburns.com>
- Date: Fri, 6 Feb 2009 15:45:04 -0600
- To: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Just to consider what is involved in terms of a parser algorithm to address canonical equivalent string matching, here's some background information. There are 1,115 code points with the NFC_Quick_Check=No property[1]. There are 102 code points with the NFC_Quick_Check=Maybe property[2].

For fully normalized NFC content, each parsed character (either markup and attribute values only, or markup, attribute values, and content) would need to be checked against a character set containing these code points. Nothing else would be required for such content. For the 102 "maybe" characters, a check would be needed to see whether the base character in the present combining character sequence requires NFC normalization (for each "maybe" there are a few base characters that are not allowed before it in NFC; everything else is allowed, and the combining sequence therefore qualifies as NFC).

For content that is not NFC normalized, whenever parsers encounter characters that match these combined character sets, they would need to branch into a normalization algorithm. This means that some performance hit would be involved whenever authors failed to normalize to NFC. However, even this case is nothing compared to what a text/html parser performs now to repair broken HTML. For an XML parser, checking for NFC would be a very small performance hit. Any author producing NFC content gets rewarded for doing so. This allows us to promote NFC as a best practice, and one with performance benefits too. The normalization check against a character set bitmap would not be a significant performance hit in proportion to the often-cited performance advantages of XML parsing over text/html parsing.

Also, I think it's worth noting that fixing these canonical string matching errors simply improves the web; it doesn't break it. Anne has suggested that authors may be relying on different canonical representations to mean different things in their markup.
But even if we can find real-world examples of this (and we haven't), surely we should be pushing authors to fix these things (this is a misuse of Unicode). For all of the things we're directing our CPU processing power towards, this fundamental part of text handling should be high on the list of priorities, especially when considering how non-intensive the processing is.

There are some definite I18N issues to be solved here. Some of these things probably need to be taken up with Unicode directly, but parser-stage handling of canonically equivalent strings is something I don't see being eliminated by addressing this at more fundamental levels (such as input systems, authoring tools, and fonts).

Finally, here's a more complete exposition of the previous example I provided, to help think through these issues:

1) Ệ (U+1EC6) [NFC]
2) Ê (U+00CA) + combining dot below (U+0323)
3) Ẹ (U+1EB8) + combining circumflex (U+0302)
4) E (U+0045) + combining dot below (U+0323) + combining circumflex (U+0302) [NFD]
5) E (U+0045) + combining circumflex (U+0302) + combining dot below (U+0323) [not normalized: marks out of canonical order]

Another example, this one a singleton, is:

1) 慈 (U+2F8A6) [non-normalized]
2) 慈 (U+6148) [NFC and NFD]

I note that the font HiraKakuProN-W3 on my system presents these with slightly different glyphs, which, as I said before, should be considered a bug (but, like makers of input systems, font makers really have not gotten clear norms about this). At least in the case of this character, the name ("CJK COMPATIBILITY IDEOGRAPH-2F8A6") provides some indication of discouraged use (which may be all an author encounters when using a character input system).

My feeling is that singletons are an ill-conceived part of NFC and NFD normalization (closer to compatibility decompositions than canonical decompositions), but that the non-singleton parts of normalization are essential to proper text handling (and I don't see how Unicode could have avoided, or could avoid in the future, such non-singleton canonical normalization).
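
To make the check-then-branch idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a parser implementation: it relies on the standard library's `unicodedata.is_normalized` (Python 3.8 or later) as a stand-in for the quick check against the No/Maybe character sets, and the helper names `nfc_if_needed` and `canonically_equal` are hypothetical.

```python
import unicodedata


def nfc_if_needed(s: str) -> str:
    """Return s untouched when it is already NFC; normalize only otherwise.

    Assumption: is_normalized() plays the role of the cheap check, so
    content that is already NFC never enters the normalization branch.
    """
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings by canonical equivalence rather than code points."""
    return nfc_if_needed(a) == nfc_if_needed(b)


# The five canonically equivalent spellings from the example above.
SEQUENCES = [
    "\u1EC6",         # precomposed
    "\u00CA\u0323",   # E-circumflex + dot below
    "\u1EB8\u0302",   # E-dot-below + circumflex
    "E\u0323\u0302",  # E + dot below + circumflex
    "E\u0302\u0323",  # E + circumflex + dot below
]

for s in SEQUENCES:
    codes = " ".join(f"U+{ord(c):04X}" for c in s)
    print(codes, "matches U+1EC6:", canonically_equal(s, SEQUENCES[0]))

# The singleton example: the compatibility ideograph against U+6148.
print("singleton matches:", canonically_equal("\U0002F8A6", "\u6148"))
```

Run as is, every comparison should report a match, while a plain `==` on the raw strings would treat all of these spellings as different.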
Take care,
Rob

[1]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=No:]>
[2]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=Maybe:]>
Received on Friday, 6 February 2009 21:45:47 UTC