Re: Unicode Normalization

> Just to consider what is involved in terms of a parser algorithm to
> address canonically equivalent string matching, here's some
> background information.
>
> There are 1,115 code points with the NFC_Quick_Check=No property
> value[1], and 102 code points with the NFC_Quick_Check=Maybe
> property value[2]. For fully normalized NFC content, each parsed
> character (either markup and attribute values only, or markup,
> attribute values, and content) would need to be checked against a
> character set containing these code points. Nothing else would be
> required for such content. For the 102 "maybe" characters, a
> further check would be needed to see whether the base character in
> the combining character sequence at hand requires NFC normalization
> (for each "maybe" character there are a few base characters with
> which the sequence is not allowed in NFC; everything else is
> allowed in NFC, and the combining sequence therefore qualifies as
> NFC).
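
In code, the scan described above looks roughly like this. This is a
minimal Python sketch, not a parser implementation: the nfc_qc_no and
nfc_qc_maybe sets are assumed to be built ahead of time from the UCD's
DerivedNormalizationProps.txt, and the function name is mine.

    import unicodedata

    def nfc_quick_check(text, nfc_qc_no, nfc_qc_maybe):
        """Return 'YES', 'NO', or 'MAYBE' for text against NFC,
        following the UAX #15 quick-check algorithm."""
        result = 'YES'
        last_ccc = 0
        for ch in text:
            ccc = unicodedata.combining(ch)
            # Combining marks out of canonical order also disqualify
            # fully normalized text (see the note on canonical
            # reordering at the end of this message).
            if ccc != 0 and ccc < last_ccc:
                return 'NO'
            cp = ord(ch)
            if cp in nfc_qc_no:
                return 'NO'
            if cp in nfc_qc_maybe:
                result = 'MAYBE'
            last_ccc = ccc
        return result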
>
> For content that was not NFC-normalized, whenever a parsed
> character matches either of these character sets, the parser would
> need to branch into a normalization algorithm. This means that some
> performance hit would be involved whenever authors failed to
> normalize to NFC. However, even this case is nothing compared to
> what a text/html parser performs now to repair broken HTML.
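
That branch is a one-liner on top of the quick check. Again a sketch:
ensure_nfc is my own name, and Python's unicodedata.normalize stands
in for whatever normalizer a real parser would link against.

    import unicodedata

    def ensure_nfc(text, nfc_qc_no, nfc_qc_maybe):
        # Fast path: text already in NFC is passed through untouched,
        # so authors who normalize pay nothing.
        if nfc_quick_check(text, nfc_qc_no, nfc_qc_maybe) == 'YES':
            return text
        # Slow path: only non-normalized (or "maybe") input pays for
        # the full normalization pass.
        return unicodedata.normalize('NFC', text)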
>
> For an XML parser, checking for NFC would be a very small
> performance hit. Any author producing NFC content gets rewarded for
> doing so. This allows us to promote NFC as a best practice, and one
> with performance benefits too. Checking normalization against a
> character-set bitmap would not be a significant performance hit in
> proportion to the often-cited performance advantages of XML parsing
> over text/html parsing.
>
> Also, I think it's worth noting that fixing these canonical string
> matching errors simply improves the web; it doesn't break it. Anne
> has suggested that authors may be relying on different canonical
> representations to mean different things in their markup. But even
> if we can find real-world examples of this (and we haven't), surely
> we should be pushing authors to fix these things (this is a misuse
> of Unicode).
>
> For all of the things we're directing our CPU processing power
> towards, this fundamental part of text handling should be high on
> the list of priorities, especially considering how non-intensive
> the processing is. There are some definite I18N issues to be solved
> here. Some of these things probably need to be taken up with
> Unicode directly, but parser-stage handling of canonical strings is
> something I don't see being eliminated by addressing the problem at
> more fundamental levels (such as input systems, authoring tools,
> and fonts).
>
> Finally, here's some more complete exposition of the previous  
> example I provided to help think through these issues:
>
> [snip of examples broken by my own email client]
>
>
> I note that the font HiraKakuProN-W3 on my system presents these
> with slightly different glyphs, which, as I said before, should be
> considered a bug (but, like input systems, font makers really have
> not gotten clear norms about this). At least in the case of this
> character's name ("CJK COMPATIBILITY IDEOGRAPH-2F8A6"), the name
> provides some indication of discouraged use (which may be all an
> author encounters when using a character input system). My feeling
> is that singletons are an ill-conceived part of NFC and NFD
> normalization (closer to compatibility decompositions than
> canonical decompositions), but that the non-singleton parts of
> normalization are essential to proper text handling (and I don't
> see how Unicode could have avoided, or could avoid in the future,
> such non-singleton canonical normalization).
>
> Take care,
> Rob
>
> [1]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=No:]>
> [2]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=Maybe:]>

Rescued from the list's web archive:

1) Ệ (U+1EC6) [NFC]
2) Ê (U+00CA)  ◌̣ (U+0323)
3) Ẹ (U+1EB8)  ◌̂ (U+0302)
4) E (U+0045)  ◌̂ (U+0302)  ◌̣ (U+0323)
5) E (U+0045)  ◌̣ (U+0323)  ◌̂ (U+0302) [NFD]
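
All five sequences are canonically equivalent; a quick sanity check
with Python's unicodedata module (my own illustrative snippet, not
part of any parser) shows they collapse to the same NFC and NFD
forms:

    import unicodedata

    forms = [
        "\u1EC6",              # 1) precomposed E-circumflex-dot-below
        "\u00CA\u0323",        # 2) E-circumflex + dot below
        "\u1EB8\u0302",        # 3) E-dot-below + circumflex
        "\u0045\u0302\u0323",  # 4) E + circumflex + dot below
        "\u0045\u0323\u0302",  # 5) E + dot below + circumflex (NFD)
    ]
    # Every form normalizes to the single NFC code point U+1EC6 ...
    assert all(unicodedata.normalize("NFC", f) == "\u1EC6"
               for f in forms)
    # ... and to the same NFD sequence (dot below before circumflex).
    assert all(unicodedata.normalize("NFD", f) == "\u0045\u0323\u0302"
               for f in forms)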

And the singleton example (discussed above) is:

1) 慈 (U+2F8A6) [non-normalized]
2) 慈 (U+6148) [NFC and NFD]
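
That singleton mapping can be verified the same way (again just an
illustrative snippet):

    import unicodedata

    compat = "\U0002F8A6"  # CJK COMPATIBILITY IDEOGRAPH-2F8A6
    # The singleton canonical decomposition maps it to U+6148 under
    # both NFC and NFD.
    assert unicodedata.normalize("NFC", compat) == "\u6148"
    assert unicodedata.normalize("NFD", compat) == "\u6148"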

Something I neglected to include for a parser-level normalization
algorithm (or really any normalization algorithm) is that, in
addition to checking for the roughly 1,200 NFC_Quick_Check
characters, the algorithm also needs to canonically reorder the
combining characters that are permitted in NFC. So for combining
character sequences that remain in NFC, the combining characters need
to be reordered according to their canonical combining class value
(such as E, ccc=0; combining dot below, ccc=220; and combining
circumflex accent, ccc=230). This reordering during parsing then
permits code-point-by-code-point comparison of any strings (and even
octet-by-octet matching comparisons when the strings are all
maintained in the same UTF; i.e., when we're not talking about
collation string comparison).
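
Here is a minimal sketch of that reordering step (my own illustrative
code; a real parser would fold this into its decomposition pass
rather than run it as a separate scan):

    import unicodedata

    def canonical_reorder(text):
        # Stable insertion sort of each run of nonzero-ccc combining
        # marks, per the Unicode Canonical Ordering Algorithm.
        # Starters (ccc=0) act as barriers that runs cannot cross.
        chars = list(text)
        for i in range(1, len(chars)):
            ccc = unicodedata.combining(chars[i])
            if ccc == 0:
                continue
            j = i
            while j > 0 and unicodedata.combining(chars[j - 1]) > ccc:
                chars[j - 1], chars[j] = chars[j], chars[j - 1]
                j -= 1
        return "".join(chars)

    # E + circumflex (ccc=230) + dot below (ccc=220) reorders to the
    # NFD sequence E + dot below + circumflex.
    assert canonical_reorder("E\u0302\u0323") == "E\u0323\u0302"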

Take care,
Rob

Received on Friday, 6 February 2009 22:53:55 UTC