- From: Robert J Burns <rob@robburns.com>
- Date: Wed, 18 Feb 2009 12:36:12 -0600
- To: public-i18n-core@w3.org
Hi Mark,

On Tue, 17 Feb 2009, at 15:07:42 -0800, Mark Davis <mark.davis@icu-project.org> wrote:

> I put together a short FAQ on NFC, including some figures on performance and
> frequencies.
>
> http://www.macchiato.com/unicode/nfc-faq

This FAQ is great. It definitely makes it clear that normalization is achievable in addition to being the right thing to do. There are some questions I had, however.

On the statistics: is this the percentage of total content (as in total characters) that is already in NFC, or the percentage of the total number of pages that conform to NFC in their entirety? Either way, some clarification in the FAQ would be a good idea. It might also be worthwhile to underscore how much ASCII-heavy markup and content may be skewing the numbers, and that growth in minority-script publishing would make normalization all the more important. My concern is that someone might read these statistics and think: "OK, the normalization issue is already solved by the content producers." That would be a mistake, since what the FAQ actually demonstrates is that ensuring normalization on the consumer side is very inexpensive.

Is Normalization to NFC Lossy?

Certainly for authors, font makers, character palettes and implementations that treat canonical singletons as equivalents, it is lossless almost by definition. The problem is that authors and implementors are not all that aware of the canonical equivalence of singletons. As recent threads and other examinations reveal, font makers, input system makers (particularly character palettes), and implementations otherwise do very little to make it clear which characters are equivalents. It is easy for an author to use two canonically equivalent characters in semantically distinct ways and not realize that the Unicode Standard discourages it. After all, authors are not likely to be intimately familiar with the Unicode normative criteria.

Take, for example, the earlier discussed ideographs [1]. Likewise kelvin (U+212A), angstrom (U+212B), ohm (U+2126), Euler's constant (U+2107), micro sign (U+00B5), etc. Some of these are canonical equivalents while others are compatibility equivalents. Many times font makers feel they are supposed to provide different glyphs for these characters on the one hand and their canonically decomposable equivalents on the other. Character palettes typically present them with nothing more than a localized name, a representative glyph and a code point (at most). Though these are discouraged-use characters, there is no way for an author to know that. Therefore I think it is not really safe to say that normalization is lossless. I want it to be. I want authors to follow the norms of Unicode, but I don't see how we can make such assumptions.

Earlier I noted that my mail client (Apple's Mail.app) actually normalizes on paste (precomposed, NFC-like); however, it does not normalize canonical singletons. This approach strikes me as much less lossy and also as a much faster normalization to perform.

Ideographic Variation Database

I think the IVD is a great advancement. It would make string comparison even simpler, since a limited range of variation selector characters could simply be ignored while comparing byte for byte (no lookups). However, is Unicode doing anything to migrate singletons to this approach?
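Just to make the singleton point above concrete, here is a rough sketch (Python, using the standard unicodedata module; my own example, not anything from the FAQ) of what NFC does to a few of the characters I mentioned. The micro sign is included for contrast, since it is only a compatibility equivalent and passes through NFC untouched.

```python
import unicodedata

def cps(s):
    """Show a string as a sequence of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Singletons mentioned above, plus the micro sign for contrast.
samples = {
    "KELVIN SIGN":   "\u212A",  # canonical singleton -> K (U+004B) under NFC
    "ANGSTROM SIGN": "\u212B",  # canonical singleton -> U+00C5 under NFC
    "OHM SIGN":      "\u2126",  # canonical singleton -> U+03A9 under NFC
    "MICRO SIGN":    "\u00B5",  # compatibility only: unchanged by NFC, folded by NFKC
}

for name, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"{name}: {cps(ch)} -> NFC {cps(nfc)}, NFKC {cps(nfkc)}")
```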
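And on the byte-for-byte comparison I have in mind for the IVD, something like the following sketch, which simply skips the variation-selector ranges U+FE00..U+FE0F and U+E0100..U+E01EF before comparing. The function names and the sample ideograph are just mine for illustration.

```python
def strip_variation_selectors(s: str) -> str:
    """Drop VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)."""
    return "".join(
        c for c in s
        if not (0xFE00 <= ord(c) <= 0xFE0F or 0xE0100 <= ord(c) <= 0xE01EF)
    )

def equal_ignoring_variation_selectors(a: str, b: str) -> bool:
    """Compare code point for code point once variation selectors are removed."""
    return strip_variation_selectors(a) == strip_variation_selectors(b)

# The same base ideograph with and without an IVD-style variation selector
# compares equal once the selector is ignored; no lookup table is needed.
base = "\u8FBB"               # a CJK ideograph
variant = "\u8FBB\U000E0100"  # same ideograph followed by VARIATION SELECTOR-17
assert equal_ignoring_variation_selectors(base, variant)
```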
Ideally, with variation selectors (perhaps independent of the IVD), all of the canonically decomposable singletons could be deprecated and specific canonically-equivalent/variation-selector combinations could be used instead. Has anything like this been proposed?

Just a few thoughts, but in any event thanks for the excellent FAQ.

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/www-style/2009Feb/0229.html>
Received on Wednesday, 18 February 2009 18:36:51 UTC