- From: Robert J Burns <rob@robburns.com>
- Date: Fri, 6 Feb 2009 15:45:04 -0600
- To: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
To consider what a parser algorithm for canonical-equivalent string
matching would involve, here's some background information.
There are 1,115 code points with the property NFC_Quick_Check=No[1].
There are 102 code points with the property NFC_Quick_Check=Maybe[2]. For
fully normalized NFC content, each parsed character (either markup and
attribute values only, or markup, attribute values, and content) would
need to be checked against a character set containing these code
points. Nothing else would be required for such content. For the 102
"Maybe" characters, a further check would be needed to see whether the
base character of the current combining character sequence requires
NFC normalization (for each "Maybe" character there are a few base
characters that are not allowed with it in NFC; everything else is
allowed, and the combining sequence therefore qualifies as NFC).
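As a rough sketch of the check described above, the following Python
follows the quick-check loop given in UAX #15. The two sets passed in are
assumptions: they stand in for the NFC_Quick_Check=No and =Maybe code
points, which Python's standard library does not expose and which would
have to be loaded from the Unicode data files. The sample values are only
a tiny hand-picked subset for illustration, and the function name is mine.

```python
import unicodedata

# Hypothetical, tiny samples of the two property sets. A real parser would
# load all NFC_Quick_Check=No and =Maybe code points from the Unicode data.
SAMPLE_NO = {0x0340, 0x2F8A6}      # singleton decompositions
SAMPLE_MAYBE = {0x0302, 0x0323}    # combining circumflex, combining dot below


def nfc_quick_check(text, no_set=SAMPLE_NO, maybe_set=SAMPLE_MAYBE):
    """Return "YES", "NO" or "MAYBE" for whether text is in NFC."""
    last_ccc = 0
    result = "YES"
    for ch in text:
        ccc = unicodedata.combining(ch)  # canonical combining class
        # Combining marks out of canonical order can never be NFC.
        if ccc != 0 and last_ccc > ccc:
            return "NO"
        cp = ord(ch)
        if cp in no_set:
            return "NO"
        if cp in maybe_set:
            result = "MAYBE"  # needs a closer look at the preceding base
        last_ccc = ccc
    return result
```

A "YES" result means no further work; only "MAYBE" and "NO" would send
the parser down a slower path.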
For content that was not NFC normalized, whenever encountering
characters that match these combined character sets, parsers would
need to branch into a normalization algorithm. This means that some
performance hit would be involved whenever authors failed to normalize
to NFC. However, even this cost is small compared to what a
text/html parser already does to repair broken HTML.
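To illustrate the branch described above, here is a minimal sketch using
Python's standard unicodedata module (is_normalized needs Python 3.8 or
later). The function name to_nfc is mine, and a real parser would apply
this per token rather than per whole string.

```python
import unicodedata


def to_nfc(text):
    """Return text in NFC, normalizing only when the check fails."""
    # Fast path: already-normalized content is returned untouched.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: the author did not normalize, so pay for it here.
    return unicodedata.normalize("NFC", text)
```

Content that is already NFC comes back as the same object, so the only
cost for well-behaved authors is the check itself.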
For an XML parser, checking for NFC would incur only a very small
performance hit. Any author producing NFC content gets rewarded for
doing so. This allows
us to promote NFC as a best practice and one with performance benefits
too. The normalization checking against a character set bitmap would
not be a significant performance hit in proportion to the often cited
performance advantages of XML parsing over text/html parsing.
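One way such a character set bitmap could look, as a sketch only: one bit
per code point, about 136 KiB for the whole Unicode range. The names
build_bitmap, in_bitmap and NEEDS_ATTENTION are hypothetical, and the
sample set is a small hand-picked subset rather than the full No and
Maybe lists.

```python
MAX_CODE_POINT = 0x10FFFF


def build_bitmap(code_points):
    """Pack a set of code points into one bit per code point."""
    bitmap = bytearray((MAX_CODE_POINT + 1 + 7) // 8)
    for cp in code_points:
        bitmap[cp >> 3] |= 1 << (cp & 7)
    return bitmap


def in_bitmap(bitmap, ch):
    """True if the character's code point is flagged in the bitmap."""
    cp = ord(ch)
    return bool(bitmap[cp >> 3] & (1 << (cp & 7)))


# Hypothetical sample only; a real table would hold all No and Maybe points.
NEEDS_ATTENTION = build_bitmap({0x0302, 0x0323, 0x0340, 0x2F8A6})
```

Each lookup is a shift, a mask and one byte read, which is the kind of
per-character cost I have in mind.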
Also, I think it's worth noting that fixing these canonical string
matching errors simply improves the web; it doesn't break it. Anne has
suggested that authors may be relying on different canonical
representations to mean different things in their markup. But even if
we can find real world examples of this (and we haven't), surely we
should be pushing authors to fix these things (this is a misuse of
Unicode).
For all of the things we're directing our CPU processing power
towards, this fundamental part of text handling should be high on the
list of priorities, especially when considering how non-intensive the
processing is. There are some definite I18N issues to be solved here.
And some of these things probably need to be taken up with Unicode
directly, but parser stage handling of canonical strings is something
I don't see being eliminated by addressing this at more fundamental levels
(such as input systems, authoring tools, and fonts).
Finally, here's some more complete exposition of the previous example
I provided to help think through these issues:
1) Ệ (U+1EC6) [NFC]
2) Ê (U+00CA) ◌̣ (U+0323)
3) Ẹ (U+1EB8) ◌̂ (U+0302)
4) E (U+0045) ◌̂ (U+0302) ◌̣ (U+0323)
5) E (U+0045) ◌̣ (U+0323) ◌̂ (U+0302) [NFD]
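The equivalence of these five spellings can be checked with Python's
standard unicodedata module. This is only an illustration; the variable
names are mine. Note that U+0323 is COMBINING DOT BELOW and U+0302 is
COMBINING CIRCUMFLEX ACCENT.

```python
import unicodedata

# U+0323 is COMBINING DOT BELOW, U+0302 is COMBINING CIRCUMFLEX ACCENT.
FORMS = [
    "\u1EC6",          # precomposed
    "\u00CA\u0323",    # E-circumflex + dot below
    "\u1EB8\u0302",    # E-dot-below + circumflex
    "E\u0302\u0323",   # E + circumflex + dot below
    "E\u0323\u0302",   # E + dot below + circumflex
]

nfc_forms = {unicodedata.normalize("NFC", s) for s in FORMS}
nfd_forms = {unicodedata.normalize("NFD", s) for s in FORMS}

# A plain code point comparison sees five different strings...
distinct_raw = len(set(FORMS))
# ...while normalization collapses them to one NFC and one NFD form.
```

Without normalization, a selector or attribute match written in one of
these spellings would miss content written in any of the other four.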
Another example, this time a singleton, is:
1) 慈 (U+2F8A6) [non-normalized]
2) 慈 (U+6148) [NFC and NFD]
I note the font HiraKakuProN-W3 on my system presents these with
slightly different glyphs, which as I said before should be considered
a bug (but like input systems, font makers really have not gotten
clear norms about this). At least in the case of the name of this
character ("CJK COMPATIBILITY IDEOGRAPH-2F8A6"), the name provides
some indication of discouraged use (which may be all an author
encounters when using a character input system). My feeling is that
singletons are an ill-conceived part of NFC and NFD normalization
(closer to compatibility decompositions than canonical
decompositions), but that the non-singleton parts of normalization are
essential to proper text handling (and I don't see how Unicode could
have avoided or could avoid in the future such non-singleton canonical
normalization).
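The singleton behaviour in the second example can be confirmed the same
way, again only as an illustration with Python's unicodedata module (the
variable names are mine):

```python
import unicodedata

compat = "\U0002F8A6"   # CJK COMPATIBILITY IDEOGRAPH-2F8A6
unified = "\u6148"      # the unified ideograph it maps to

name = unicodedata.name(compat)
nfc = unicodedata.normalize("NFC", compat)
nfd = unicodedata.normalize("NFD", compat)
# Both normalization forms replace the singleton with the unified ideograph,
# so the original code point cannot survive either normalization.
```

Unlike the combining sequences earlier, there is no form in which the
compatibility code point is kept, which is why it behaves more like a
compatibility mapping.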
Take care,
Rob
[1]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=No:]>
[2]: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:NFC_Quick_Check=Maybe:]>
Received on Friday, 6 February 2009 21:45:56 UTC