Re: NFC FAQ

Hi Mark,

On Tue, 17 Feb 2009, at 15:07:42 -0800, Mark Davis <mark.davis@icu-project.org> wrote:
> I put together a short FAQ on NFC, including some figures on performance
> and frequencies.
>
> http://www.macchiato.com/unicode/nfc-faq

This FAQ is great. It definitely makes it clear that normalization is achievable in addition to being the right thing to do.

I do have some questions, however. On the statistics: is this the percentage of total content (as in total characters) that is already in NFC, or the percentage of the total number of pages that conform to NFC in their entirety? Either way, some clarification in the FAQ would be a good idea.

It might be worthwhile to further underscore how much ASCII-heavy markup and content may be skewing the numbers, and that growth in minority-script publishing would make normalization all the more important. My concern is that someone might read these statistics and think, "OK, the normalization issue is already solved by the content producers." That would be a mistake, since what the FAQ does demonstrate is that ensuring normalization on the consumer side is very inexpensive.
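
For what it's worth, here is roughly what the consumer-side check looks like with the JDK's built-in java.text.Normalizer (ICU offers the same operations); the sample strings are just illustrative:

    import java.text.Normalizer;

    public class NfcCheck {
        public static void main(String[] args) {
            // Mostly-ASCII markup is already in NFC, so the check is a cheap
            // scan that allocates nothing.
            String ascii = "<p class=\"note\">plain markup</p>";

            // A decomposed sequence (e + U+0301 combining acute) is not in NFC.
            String decomposed = "cafe\u0301";

            System.out.println(Normalizer.isNormalized(ascii, Normalizer.Form.NFC));      // true
            System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false

            // Only pay for normalization when the check fails.
            String nfc = Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)
                    ? decomposed
                    : Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(nfc.equals("caf\u00E9")); // true: e + combining acute composed to é
        }
    }

For the ASCII-heavy case the isNormalized() check is a single cheap pass, which is exactly why doing this on the consumer side costs so little.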

Is Normalization to NFC Lossy?
Certainly, for authors, font makers, character palettes, and implementations that treat canonical singletons as equivalents, this is true almost by definition. The problem is that authors and implementors are not all that aware of the canonical equivalence of singletons. As recent threads and other examinations reveal, font makers, input system makers (particularly character palettes), and implementations otherwise do very little to make it clear which characters are equivalents. It is easy for an author to use two canonically equivalent characters in semantically distinct ways without realizing that the Unicode Standard discourages it. After all, authors are not likely to be intimately familiar with the Unicode normative criteria.
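
To illustrate how easily that can happen (the code points are the ones at issue, the code itself is just mine, using the JDK's Normalizer): U+2126 OHM SIGN and U+03A9 GREEK CAPITAL LETTER OMEGA are distinct code points an author might well use distinctly, yet they collapse to the same string under NFC:

    import java.text.Normalizer;

    public class SingletonCollapse {
        public static void main(String[] args) {
            String ohmSign = "\u2126"; // OHM SIGN
            String omega   = "\u03A9"; // GREEK CAPITAL LETTER OMEGA

            // Distinct code points: a raw comparison keeps them apart.
            System.out.println(ohmSign.equals(omega)); // false

            // NFC maps the singleton U+2126 to U+03A9, so the distinction is gone.
            System.out.println(Normalizer.normalize(ohmSign, Normalizer.Form.NFC)
                    .equals(Normalizer.normalize(omega, Normalizer.Form.NFC))); // true
        }
    }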

Take, for example, the ideographs discussed earlier[1]. Likewise the kelvin sign (U+212A), the angstrom sign (U+212B), the ohm sign (U+2126), the Euler constant (U+2107), the micro sign (U+00B5), etc. Some of these are canonical equivalents while others are compatibility equivalents. Many times font makers feel they are supposed to provide different glyphs for these characters on the one hand and their canonical equivalents on the other. Character palettes typically present them with nothing more than a localized name, a representative glyph, and a code point (at most). Though their use is discouraged, there is no way for an author to know that.
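
For reference, this is how those particular characters behave under NFC versus NFKC in the JDK (I have left the Euler constant out because I am not certain of its decomposition off-hand); the singleton cases change under NFC, while the micro sign only changes under NFKC:

    import java.text.Normalizer;

    public class EquivalenceKinds {
        static void show(String label, String s) {
            String nfc  = Normalizer.normalize(s, Normalizer.Form.NFC);
            String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);
            System.out.printf("%-13s U+%04X  NFC -> U+%04X  NFKC -> U+%04X%n",
                    label, s.codePointAt(0), nfc.codePointAt(0), nfkc.codePointAt(0));
        }

        public static void main(String[] args) {
            show("KELVIN SIGN",   "\u212A"); // canonical singleton: NFC -> U+004B "K"
            show("ANGSTROM SIGN", "\u212B"); // canonical singleton: NFC -> U+00C5 "Å"
            show("OHM SIGN",      "\u2126"); // canonical singleton: NFC -> U+03A9 "Ω"
            show("MICRO SIGN",    "\u00B5"); // compatibility only: unchanged by NFC, NFKC -> U+03BC "μ"
        }
    }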

Therefore I think it is not really safe to say that normalization is lossless. I want it to be. I want authors to follow the norms of Unicode, but I don't see how we can make such assumptions. Earlier I noted that my mail client (Apple's Mail.app) actually normalizes on paste (precomposed, NFC-like); however, it does not normalize canonical singletons. That approach strikes me as much less lossy and also as a much faster normalization to perform.
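
Here is a rough sketch of what I mean, not a claim about how Mail.app actually implements it: compose combining sequences the way NFC does, but leave a (here hard-coded, purely illustrative) set of canonical singletons untouched. A real implementation would derive the singleton set from UnicodeData.txt:

    import java.text.Normalizer;

    public class ComposeOnly {
        // Purely illustrative subset of canonical singletons to protect.
        private static boolean isProtectedSingleton(int cp) {
            return cp == 0x212A   // KELVIN SIGN
                || cp == 0x212B   // ANGSTROM SIGN
                || cp == 0x2126;  // OHM SIGN
        }

        // Compose combining sequences as NFC does, but pass the protected
        // singletons through untouched.
        static String composeOnly(String input) {
            StringBuilder out = new StringBuilder();
            StringBuilder segment = new StringBuilder();
            input.codePoints().forEach(cp -> {
                if (isProtectedSingleton(cp)) {
                    out.append(Normalizer.normalize(segment, Normalizer.Form.NFC));
                    segment.setLength(0);
                    out.appendCodePoint(cp);
                } else {
                    segment.appendCodePoint(cp);
                }
            });
            out.append(Normalizer.normalize(segment, Normalizer.Form.NFC));
            return out.toString();
        }

        public static void main(String[] args) {
            String input = "e\u0301 \u212A"; // e + combining acute, space, KELVIN SIGN

            // Combining sequence is composed, kelvin sign is preserved.
            System.out.println(composeOnly(input).equals("\u00E9 \u212A")); // true

            // Plain NFC composes the sequence and also folds kelvin to "K".
            System.out.println(Normalizer.normalize(input, Normalizer.Form.NFC)
                    .equals("\u00E9 K")); // true
        }
    }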

Ideographic Variation Database
I think the IVD is a great advance. It would make string comparison even simpler, since a limited range of variation selector characters could simply be ignored while comparing byte for byte (no lookups). However, is Unicode doing anything to migrate singletons to this approach? Ideally, with variation selectors (perhaps independent of the IVD), all of the canonically decomposable singletons could be deprecated and specific canonically-equivalent/variation-selector combinations could be used instead. Has anything like this been proposed?
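
To show the kind of comparison I have in mind (the ranges and the sample ideograph are my own choices, not anything from the FAQ): drop the variation selector ranges U+FE00..U+FE0F and U+E0100..U+E01EF, then compare directly:

    public class IgnoreVariationSelectors {
        // The variation selector ranges: U+FE00..U+FE0F and, for IVD
        // sequences, U+E0100..U+E01EF.
        private static boolean isVariationSelector(int cp) {
            return (cp >= 0xFE00 && cp <= 0xFE0F)
                || (cp >= 0xE0100 && cp <= 0xE01EF);
        }

        // Strip variation selectors so two strings can be compared code point
        // for code point with nothing more than a range test (no lookups).
        static String stripVariationSelectors(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            s.codePoints()
             .filter(cp -> !isVariationSelector(cp))
             .forEach(sb::appendCodePoint);
            return sb.toString();
        }

        static boolean equalsIgnoringVariation(String a, String b) {
            return stripVariationSelectors(a).equals(stripVariationSelectors(b));
        }

        public static void main(String[] args) {
            String plain   = "\u845B";             // an ideograph, U+845B
            String variant = "\u845B\uDB40\uDD00"; // the same ideograph + VS17 (U+E0100)

            System.out.println(plain.equals(variant));                   // false
            System.out.println(equalsIgnoringVariation(plain, variant)); // true
        }
    }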

Just a few thoughts, but in any event thanks for the excellent FAQ.

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/www-style/2009Feb/0229.html>
