
Re: NFC FAQ

From: Mark Davis <mark.davis@icu-project.org>
Date: Wed, 18 Feb 2009 11:40:30 -0800
Message-ID: <30b660a20902181140m7047e387s79ba10385692bcd7@mail.gmail.com>
To: Robert J Burns <rob@robburns.com>
Cc: public-i18n-core@w3.org
Some very brief comments.
Mark


On Wed, Feb 18, 2009 at 10:36, Robert J Burns <rob@robburns.com> wrote:

> Hi Mark,
>
> On Tue, 17 Feb 2009, at 15:07:42 -0800, Mark Davis <
> mark.davis@icu-project.org>  wrote:
>
>> I put together a short FAQ on NFC, including some figures on performance
>> and
>> frequencies.
>>
>> http://www.macchiato.com/unicode/nfc-faq
>>
>
> This FAQ is great. It definitely makes it clear that normalization is
> achievable, in addition to being the right thing to do.


Thanks.

>
>
> There are some questions I had, however. On the statistics, is this the
> percentage of total content (as in total characters) that conforms to NFC?
> Or is it the percentage of the total number of pages that conform to NFC in
> their entirety? Either way, some clarification in the FAQ would be a good
> idea.


Percentage of total characters. I also added some other information about
the sampling - please take a look to see if it is useful.
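
For concreteness, the two readings of the statistic can be sketched in a few
lines of Python using the standard unicodedata module (Python 3.8+ for
is_normalized); the sample strings here are hypothetical, not the FAQ's
actual corpus:

```python
import unicodedata

# Hypothetical "pages": one already in NFC, two that are not.
samples = [
    "caf\u00e9",    # precomposed e-acute: already NFC
    "cafe\u0301",   # combining acute accent: not NFC
    "\u212b",       # ANGSTROM SIGN, a canonical singleton: not NFC
]

# Character-weighted figure: characters belonging to already-NFC text.
chars_total = sum(len(s) for s in samples)
chars_nfc = sum(len(s) for s in samples if unicodedata.is_normalized("NFC", s))
pct_chars = 100.0 * chars_nfc / chars_total

# Page-weighted figure: whole pages that conform to NFC.
pct_pages = 100.0 * sum(
    unicodedata.is_normalized("NFC", s) for s in samples
) / len(samples)
```

The two percentages differ whenever page lengths differ, which is exactly
why the FAQ needs to say which one it reports.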


>
>
> It might be worthwhile to further underscore how much ASCII-heavy markup
> and content may be skewing the numbers, and that growth in minority-script
> publishing would make normalization all the more important. My concern
> is that someone might read these statistics and think: "OK, the
> normalization issue is already solved by the content producers". I think
> that would be a mistake, since what the FAQ does demonstrate is that
> ensuring normalization on the consumer side is very inexpensive.


Right.


>
>
> Is Normalization to NFC Lossy?
> Certainly, for authors, font makers, character palettes, and
> implementations that treat canonical singletons as equivalents, this is
> true almost by definition. The problem is that authors and implementors are
> not all that aware of the canonical equivalence of singletons. As recent
> threads and other examinations reveal, font makers, input-system makers
> (particularly character palettes), and other implementations do very
> little to make it clear which characters are equivalents. It is easy for an
> author to use two canonically equivalent characters in semantically distinct
> ways and not realize that the Unicode Standard discourages it. After all,
> authors are not likely to be intimately familiar with Unicode's
> normative criteria.
>
> Take, for example, the earlier discussed ideographs[1]. Likewise the kelvin
> sign (U+212A), angstrom sign (U+212B), ohm sign (U+2126), Euler constant
> (U+2107), micro sign (U+00B5), etc. Some of these are canonical equivalents
> while others are compatibility equivalents. Many times font makers feel
> they're supposed to provide different glyphs for these characters on the
> one hand and their canonically decomposable equivalents on the other.
> Character palettes typically present these with nothing more than a
> localized name, a representative glyph, and a code point (at most). Though
> they are discouraged-use characters, there's no way for an author to know
> that.
>
> Therefore I think it is not really safe to say that normalization is
> lossless. I want it to be. I want authors to follow the norms of Unicode,
> but I don't see how we can make such assumptions. Earlier I noted that my
> mail client (Apple's Mail.app) actually normalizes on paste (precomposed,
> NFC-like); however, it does not normalize canonical singletons. This
> approach strikes me as much less lossy and also as a much faster
> normalization to perform.


I disagree - and it is not appreciably faster (because of the frequencies
involved). I did add a comment to that section.
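
To make the singleton behavior concrete, a small Python sketch (using only
the standard unicodedata module) of what NFC actually does to the characters
Rob lists - canonical singletons are replaced, while compatibility-only
characters such as MICRO SIGN survive NFC and fold only under NFKC:

```python
import unicodedata

# Canonical singletons: NFC maps each to its single canonical equivalent.
assert unicodedata.normalize("NFC", "\u212a") == "K"       # KELVIN SIGN -> LATIN CAPITAL K
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"  # ANGSTROM SIGN -> A WITH RING
assert unicodedata.normalize("NFC", "\u2126") == "\u03a9"  # OHM SIGN -> GREEK CAPITAL OMEGA

# Compatibility-only characters are untouched by NFC; only NFKC folds them.
assert unicodedata.normalize("NFC", "\u00b5") == "\u00b5"   # MICRO SIGN unchanged under NFC
assert unicodedata.normalize("NFKC", "\u00b5") == "\u03bc"  # ... but NFKC -> GREEK SMALL MU
```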


>
> Ideographic Variation Database
> I think the IVD is a great advancement. It would make string comparison
> even simpler, since a limited range of variation-selector characters could
> simply be ignored while comparing byte for byte (no lookups). However, is
> Unicode doing anything to migrate singletons to this approach? Ideally,
> with variation selectors (perhaps independently of the IVD), all of the
> canonically decomposable singletons could be deprecated and specific
> canonically-equivalent/variation-selector combinations could be used
> instead. Has anything like this been proposed?


People would be encouraged to use the IVD to avoid problems.
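
A comparison routine of the kind Rob describes might look like the
following sketch (the function names are invented for illustration). It
filters the variation-selector ranges U+FE00..U+FE0F and U+E0100..U+E01EF
and then compares code point by code point, with no lookup tables:

```python
def strip_variation_selectors(s: str) -> str:
    """Drop VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)."""
    return "".join(
        ch for ch in s
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )

def equal_ignoring_vs(a: str, b: str) -> bool:
    # No normalization tables needed: filter, then compare directly.
    return strip_variation_selectors(a) == strip_variation_selectors(b)

# An ideograph with an IVD variation selector compares equal to the bare one:
assert equal_ignoring_vs("\u845b\U000E0100", "\u845b")
assert not equal_ignoring_vs("\u845b", "\u8475")
```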

>
>
> Just a few thoughts, but in any event, thanks for the excellent FAQ.
>
> Take care,
> Rob
>
> [1]: <http://lists.w3.org/Archives/Public/www-style/2009Feb/0229.html>
>
>
Received on Wednesday, 18 February 2009 19:41:12 GMT
