- From: Mark Davis <mark.davis@icu-project.org>
- Date: Wed, 18 Feb 2009 11:40:30 -0800
- To: Robert J Burns <rob@robburns.com>
- Cc: public-i18n-core@w3.org
- Message-ID: <30b660a20902181140m7047e387s79ba10385692bcd7@mail.gmail.com>
Some very brief comments.

Mark

On Wed, Feb 18, 2009 at 10:36, Robert J Burns <rob@robburns.com> wrote:

> Hi Mark,
>
> On Tue, 17 Feb 2009, at 15:07:42 -0800, Mark Davis <mark.davis@icu-project.org> wrote:
>
>> I put together a short FAQ on NFC, including some figures on performance
>> and frequencies.
>>
>> http://www.macchiato.com/unicode/nfc-faq
>
> This FAQ is great. It definitely makes it clear that normalization is
> achievable in addition to being the right thing to do. Thanks.
>
> There are some questions I had, however. On the statistics, is this the
> percentage of total content (as in total characters) that is appropriate
> to NFC? Or is it the percentage of the total number of pages that in
> total conform to NFC? Either way, some clarification on the FAQ would be
> a good idea.

Percentage of total characters. I also added some other information about
the sampling - please take a look to see if it is useful.

> It might be worthwhile to further underscore how much ASCII-heavy markup
> and content may be skewing the numbers, and that growth in minority-script
> publishing would make normalization all that much more important. My
> concern is that someone might read these statistics and think: "OK, the
> normalization issue is already solved by the content producers". I think
> that would be a mistake, since what the FAQ does demonstrate is that
> ensuring normalization on the consumer side is very inexpensive.

Right.

> Is Normalization to NFC Lossy?
>
> Certainly for authors, font makers, character palettes, and
> implementations that treat canonical singletons as equivalents this is
> true almost by definition. The problem with this is that authors and
> implementers are not all that aware of the canonical equivalence of
> singletons. As recent threads and other examinations reveal, font makers,
> input-system makers (particularly character palettes), and implementations
> otherwise do very little to make it clear which characters are
> equivalents. It is easy for an author to use two canonically equivalent
> characters in semantically distinct ways and not realize that the Unicode
> Standard discourages it. After all, authors are not likely to be
> intimately familiar with the Unicode normative criteria.
>
> Take for example the earlier-discussed ideographs [1]. Likewise kelvin
> (U+212A), angstrom (U+212B), ohm (U+2126), Euler's constant (U+2107),
> micro sign (U+00B5), etc. Some of these are canonical equivalents while
> others are compatibility equivalents. Many times font makers feel they're
> supposed to provide different glyphs for these characters on the one hand
> and their canonically decomposable equivalents on the other. Character
> palettes typically present these with nothing more than a localized name,
> a representative glyph, and a code point (at most). Though they are
> discouraged-use characters, there's no way for an author to know that.
>
> Therefore I think it is not really safe to say that normalization is
> lossless. I want it to be. I want authors to follow the norms of Unicode,
> but I don't see how we can make such assumptions. Earlier I noted that my
> mail client (Apple's Mail.app) actually normalizes on paste (precomposed,
> NFC-like); however, it does not normalize canonical singletons. This
> approach strikes me as much less lossy and also as a much faster
> normalization to perform.

I disagree - and it is not appreciably faster (because of the frequencies
involved). I did add a comment to that section.
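[As an illustration of the singleton distinction discussed above, here is a minimal sketch in Python (standard-library unicodedata only). The sample set and comments are for illustration and are not from the thread; the decomposition classifications come from UnicodeData.txt, and the quick check at the end assumes Python 3.8+ for unicodedata.is_normalized.]

```python
import unicodedata

# The characters Rob lists, with what Unicode says about each
# (decompositions per UnicodeData.txt).
samples = [
    ("KELVIN SIGN",    "\u212A"),  # canonical singleton -> U+004B 'K'
    ("ANGSTROM SIGN",  "\u212B"),  # canonical singleton -> U+00C5 'Å'
    ("OHM SIGN",       "\u2126"),  # canonical singleton -> U+03A9 'Ω'
    ("EULER CONSTANT", "\u2107"),  # compatibility only: NFC leaves it alone
    ("MICRO SIGN",     "\u00B5"),  # compatibility only: NFC leaves it alone
]

for name, ch in samples:
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"{name:14}  U+{ord(ch):04X}  NFC -> U+{ord(nfc):04X}  "
          f"NFKC -> U+{ord(nfkc):04X}")

# Consumer-side checking is cheap (Python 3.8+): the quick check passes
# immediately for the vast majority of real-world text.
print(unicodedata.is_normalized("NFC", "cafe\u0301"))  # False: composes to é
print(unicodedata.is_normalized("NFC", "caf\u00E9"))   # True: already NFC
```

[Running this shows NFC rewriting only the three canonical singletons, which is the loss Rob is concerned about; the compatibility characters pass through NFC untouched.]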
> Ideographic Variation Database
>
> I think the IVD is a great advancement. It would make string comparison
> even simpler, since a limited range of variation-selector characters could
> simply be ignored while comparing byte for byte (no lookups). However, is
> Unicode doing anything to migrate singletons to this approach? Ideally,
> with variation selectors (perhaps independent of the IVD), all of the
> canonically decomposable singletons could be deprecated and specific
> canonically-equivalent/variation-selector combinations could be used
> instead. Has anything like this been proposed?

People would be encouraged to use the IVD to avoid problems.

> Just a few thoughts, but in any event thanks for the excellent FAQ.
>
> Take care,
> Rob
>
> [1]: <http://lists.w3.org/Archives/Public/www-style/2009Feb/0229.html>
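[A hedged sketch of the variation-selector-blind comparison Rob describes: drop code points in the two variation-selector blocks, then compare the remainder code point for code point. The ranges are the standard VS blocks; the function names are ours, not from any Unicode specification.]

```python
# Variation Selectors (U+FE00..U+FE0F) and Variation Selectors
# Supplement (U+E0100..U+E01EF, the range used by the IVD).
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))

def strip_variation_selectors(s: str) -> str:
    """Remove all variation selectors, leaving only base characters."""
    return "".join(
        ch for ch in s
        if not any(lo <= ord(ch) <= hi for lo, hi in VS_RANGES)
    )

def equal_ignoring_vs(a: str, b: str) -> bool:
    """Compare code point for code point, ignoring variation selectors."""
    return strip_variation_selectors(a) == strip_variation_selectors(b)

# Example: an ideograph with a variation selector appended compares
# equal to the bare ideograph - no table lookups required.
assert equal_ignoring_vs("\u845B\U000E0100", "\u845B")
```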
Received on Wednesday, 18 February 2009 19:41:12 UTC