Re: Unicode Normalization from Robert J Burns on 2009-02-06 (public-i18n-core@w3.org from January to March 2009)

From: Robert J Burns <rob@robburns.com>
Date: Fri, 6 Feb 2009 15:45:04 -0600
To: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <DB12182E-E23F-4173-9B04-1D835E86FB28@robburns.com>
Just to consider what is involved in terms of a parser algorithm to  
address canonical equivalent string matching, here's some background  
information.

There are 1,115 code points with the property NFC_Quick_Check=No[1]
and 102 code points with NFC_Quick_Check=Maybe[2]. For fully
normalized NFC content, each parsed character (either markup and
attribute values only, or markup, attribute values, and content) would
need to be checked against a character set containing these code
points. Nothing else would be required for such content. For the 102
"Maybe" characters, a further check would be needed to see whether the
base character of the current combining character sequence requires
NFC normalization (for each "Maybe" character there are a few base
characters it cannot follow in NFC; every other base is allowed, and
the combining sequence therefore qualifies as NFC).
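
The check described above can be sketched roughly as follows. This is
only an illustration, not a parser design: Python's unicodedata module
does not expose the NFC_Quick_Check property itself, so the sketch
relies on the fact that every code point below U+0300 has
NFC_Quick_Check=Yes and defers everything else to the library. The
names FIRST_NONTRIVIAL and is_nfc are hypothetical.

```python
import unicodedata

# Every code point below U+0300 has NFC_Quick_Check=Yes and combining
# class 0, so text made only of such code points is already NFC.
FIRST_NONTRIVIAL = 0x0300  # hypothetical name for this threshold


def is_nfc(text: str) -> bool:
    """Return True if text is already in NFC."""
    # Fast path: no character can be in the No or Maybe sets.
    if all(ord(ch) < FIRST_NONTRIVIAL for ch in text):
        return True
    # Slow path: let the library perform the full check
    # (unicodedata.is_normalized requires Python 3.8 or later).
    return unicodedata.is_normalized("NFC", text)
```

A real parser would presumably test each character against a bitmap of
the No and Maybe sets rather than a single threshold, but the shape of
the fast path is the same.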

For content that was not NFC normalized, whenever encountering  
characters that match these combined character sets, parsers would  
need to branch into a normalization algorithm. This means that some  
performance hit would be involved whenever authors failed to normalize  
to NFC. However, even this cost is small compared with the work a
text/html parser already does to repair broken HTML.
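
As a rough sketch of that branching, assuming the parser compares
names or attribute values as strings: normalization runs only when a
string fails the NFC check, so already-normalized content pays only
for the check. The helper names to_nfc and canonical_match are
hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize to NFC only when the text is not already NFC."""
    if unicodedata.is_normalized("NFC", text):  # Python 3.8+
        return text
    return unicodedata.normalize("NFC", text)


def canonical_match(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence."""
    if a == b:
        # Identical code point sequences need no normalization at all.
        return True
    return to_nfc(a) == to_nfc(b)
```

For example, canonical_match("caf\u00e9", "cafe\u0301") is true even
though the two strings differ code point by code point.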

For an XML parser, checking for NFC would incur only a very small
performance hit. Any author producing NFC content gets rewarded for
doing so. This allows us to promote NFC as a best practice, and one
with performance benefits too. Checking each character against a
character set bitmap would not be a significant performance hit in
proportion to the often-cited performance advantages of XML parsing
over text/html parsing.

Also, I think it's worth noting that fixing these canonical string
matching errors simply improves the web; it doesn't break it. Anne has
suggested that authors may be relying on different canonical  
representations to mean different things in their markup. But even if  
we can find real world examples of this (and we haven't), surely we  
should be pushing authors to fix these things (this is a misuse of  
Unicode).

For all of the things we're directing our CPU processing power  
towards, this fundamental part of text handling should be high on the  
list of priorities: especially when considering how non-intensive the  
processing is. There are some definite I18N issues to be solved here.  
And some of these things probably need to be taken up with Unicode
directly, but parser-stage handling of canonically equivalent strings
is something I don't see being eliminated by addressing this at more
fundamental levels (such as input systems, authoring tools, and fonts).

Finally, here's a more complete exposition of the previous example
I provided to help think through these issues:

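1) Ệ (U+1EC6) [NFC]
2) Ê (U+00CA) + combining dot below (U+0323)
3) Ẹ (U+1EB8) + combining circumflex (U+0302)
4) E (U+0045) + combining dot below (U+0323) + combining circumflex (U+0302) [NFD]
5) E (U+0045) + combining circumflex (U+0302) + combining dot below (U+0323)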
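
A short sketch to confirm that these five spellings are canonically
equivalent, using Python's unicodedata module (the variable names are
only illustrative). U+0323 (combining dot below) has canonical
combining class 220 and U+0302 (combining circumflex) has class 230,
so canonical ordering places U+0323 first in the decomposed form.

```python
import unicodedata

# The five spellings, written as code point escapes.
spellings = [
    "\u1EC6",        # precomposed
    "\u00CA\u0323",  # E-circumflex + dot below
    "\u1EB8\u0302",  # E-dot-below + circumflex
    "E\u0323\u0302", # E + dot below + circumflex
    "E\u0302\u0323", # E + circumflex + dot below
]

# Collapse each spelling to its NFC and NFD forms.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

# Combining classes that drive the canonical ordering.
ccc_dot_below = unicodedata.combining("\u0323")
ccc_circumflex = unicodedata.combining("\u0302")
```

All five collapse to the single string U+1EC6 under NFC and to
E U+0323 U+0302 under NFD.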

Another singleton example is:

1) 慈 (U+2F8A6) [non-normalized]
2) 慈  (U+6148) [NFC and NFD]

I note the font HiraKakuProN-W3 on my system presents these with
slightly different glyphs, which as I said before should be considered
a bug (but font makers, like input system makers, really have not been
given clear norms about this). At least in the case of the name of this
character ("CJK COMPATIBILITY IDEOGRAPH-2F8A6"), the name provides  
some indication of discouraged use (which may be all an author  
encounters when using a character input system). My feeling is that  
singletons are an ill-conceived part of NFC and NFD normalization  
(closer to compatibility decompositions than canonical  
decompositions), but that the non-singleton parts of normalization are  
essential to proper text handling (and I don't see how Unicode could  
have avoided or could avoid in the future such non-singleton canonical  
normalization).
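
To make the singleton distinction concrete, here is a minimal sketch
that flags singleton canonical decompositions using the mapping
Python's unicodedata module exposes. The function name is_singleton
is hypothetical, and the test is only my reading of "singleton": a
canonical (untagged) decomposition consisting of exactly one code
point.

```python
import unicodedata


def is_singleton(ch: str) -> bool:
    """True if ch canonically decomposes to exactly one code point."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        # No decomposition, or a compatibility mapping such as "<compat> ...".
        return False
    return len(mapping.split()) == 1


# U+2F8A6 maps to the single code point U+6148 under both NFC and NFD.
compat = "\U0002F8A6"
nfc = unicodedata.normalize("NFC", compat)
nfd = unicodedata.normalize("NFD", compat)
```

By this test U+2F8A6 is a singleton, while U+1EC6 from the previous
example is not, since it decomposes to the pair U+1EB8 U+0302.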

Take care,
Rob

[1]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=No:]>
[2]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=Maybe:]>
Received on Friday, 6 February 2009 21:45:56 UTC