- From: John Cowan <jcowan@reutershealth.com>
- Date: Thu, 22 Feb 2001 20:12:19 -0500
- To: rden@loc.gov
- CC: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
Ray Denenberg wrote:

> The Library of Congress has reviewed the proposed W3C draft Character
> Model. We have two concerns, both pertaining to character
> normalization:
> (1) the preference for Unicode Normalization Format "C" (NFC), and
> (2) "early" normalization.

I have considered your message carefully, and must emphasize that in
this message I am speaking only for myself, not for the I18N IG. It
seems to me that your two stated concerns reflect two underlying
questions: (a) "Who's going to do the work?" and (b) "Can the work get
done at all?" I have reordered your message accordingly. I apologize
if I am telling you things you know perfectly well.

> This Unicode repertoire of characters for diacritical marks would
> allow libraries to serve data on the Web in a form closer to its
> legacy encoding.
>
> "Early" normalization puts the burden of conversion on library systems
> receiving queries and serving MARC records in response. We suspect
> libraries would object to this.

These things are undoubtedly true. However, the Rest Of Us, who are
your eventual customers, seem to prefer precomposed encodings. The
total work is minimized if you convert once at your Web interface,
rather than requiring all your users to convert every time they
download something. You will already need to decompose incoming
queries, which uses the same tables as NFC normalization. Even if you
serve up NFD data (fully decomposed) to your users, their queries will
be formulated in NFC terms, since essentially all non-Unicode
encodings except ANSEL itself are NFC-compatible.

The time overhead of NFC normalization is small, and there are simple
table-driven routines to do it. The necessary table size is a few
kilobytes, which is trivial today on any server.

> There are 96 letter-with-diacritic combinations, that do not have
> corresponding Unicode character encoding, used in library
> transliterations of non-Latin scripts into Latin.

Is it 100% clear to you that NFC does *not* forbid the use of
combining characters? If you need a Latin "f" with a dot below, you
encode it in NFC just as in ASCII/ANSEL or in unrestricted Unicode:
LATIN SMALL LETTER F followed by COMBINING DOT BELOW. The only
decomposed combinations forbidden by NFC are those which already have
canonically equivalent precomposed Unicode characters. Thus NFC
represents Latin "d" followed by a dot below as U+1E0D LATIN SMALL
LETTER D WITH DOT BELOW. In this way, your customers have to handle
only a single representation of each such character, while every
character remains representable.

--
There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein
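The composition behavior described above can be checked in a few lines;
what follows is a minimal sketch using Python's standard unicodedata
module (Python and unicodedata are illustrative choices, not anything
prescribed in the message itself):

    import unicodedata

    # "d" + COMBINING DOT BELOW (U+0323) has a precomposed equivalent,
    # so NFC composes it to U+1E0D LATIN SMALL LETTER D WITH DOT BELOW.
    d_dot = "d\u0323"
    assert unicodedata.normalize("NFC", d_dot) == "\u1e0d"

    # "f" + COMBINING DOT BELOW has no precomposed equivalent,
    # so NFC leaves the base letter plus combining mark untouched.
    f_dot = "f\u0323"
    assert unicodedata.normalize("NFC", f_dot) == f_dot

NFD reverses the first case, decomposing U+1E0D back into the base
letter followed by the combining mark, which is why the same tables
serve both directions.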
Received on Thursday, 22 February 2001 20:12:02 UTC