Re: Library of Congress comments on W3C Draft Character Model from John Cowan on 2001-02-23 (www-i18n-comments@w3.org from February 2001)

From: John Cowan <jcowan@reutershealth.com>
Date: Thu, 22 Feb 2001 20:12:19 -0500
To: rden@loc.gov
CC: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
Message-ID: <3A95B8F3.1090001@reutershealth.com>

Ray Denenberg wrote:

 > The Library of Congress has reviewed the proposed W3C draft Character
 > Model. We have two concerns, both pertaining to character
 > normalization:
 >  (1) the preference for Unicode Normalization Format "C" (NFC), and
 >  (2) "early" normalization.

I have considered your message carefully, and must emphasize that
in this message I am speaking only for myself, not for the I18N IG.
It seems to me that your two stated concerns reflect two underlying
questions: (a) "Who's going to do the work?" and (b) "Can the work get
done at all?"  I have reordered your message accordingly.
I apologize if I am telling you things you know perfectly well.

 > This Unicode repertoire of characters for diacritical marks would
 > allow libraries to serve data on the Web in a form closer to its
 > legacy encoding.
 >
 > "Early" normalization puts the burden of conversion on library systems
 > receiving queries and serving MARC records in response. We suspect
 > libraries would object to this.

These things are undoubtedly true.  However, the Rest Of Us, who
are your eventual customers, seem to prefer precomposed encodings.
The total work is minimized if you convert once at your Web interface,
rather than requiring all your users to convert every time
they download something.

You will already need to decompose incoming queries, which uses
the same tables as NFC normalization. Even if you serve up NFD data
(fully decomposed) to your users, their queries will be formulated in
NFC terms, since essentially all non-Unicode encodings except ANSEL
itself are NFC-compatible.
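
A minimal sketch of converting once at the Web interface, in both directions, assuming Python's standard `unicodedata` module stands in for the table-driven routines. The function names `to_internal` and `to_web` and the sample record are hypothetical, chosen only for illustration; they do not describe any actual library system.

```python
import unicodedata

def to_internal(query: str) -> str:
    """Decompose an incoming (typically precomposed) query so it can be
    matched against records stored in decomposed form."""
    return unicodedata.normalize("NFD", query)

def to_web(record_text: str) -> str:
    """Compose outgoing record text to NFC before serving it."""
    return unicodedata.normalize("NFC", record_text)

# Hypothetical stored heading in decomposed form:
# r + COMBINING CARON, a + COMBINING ACUTE ACCENT.
stored = "Dvor\u030ca\u0301k"

# The same name as a user would likely type it, in precomposed form.
query = "Dvo\u0159\u00e1k"

matches = to_internal(query) == stored   # the decomposed query matches the record
served = to_web(stored)                  # what the user receives
```

Both conversions happen at the boundary, so the stored data keeps its decomposed form and users never see it.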

The time overhead of NFC normalization is small, and there are simple
table-driven routines to do it.  The necessary table size is a few
kilobytes, which is trivial today on any server.

 > There are 96 letter-with-diacritic combinations, that do not have
 > corresponding Unicode character encoding, used in library
 > transliterations of non-Latin scripts into Latin.

Is it 100% clear to you that NFC does *not* forbid the use of combining
characters?  If you need to have a Latin "f" with a dot below,
you encode it in NFC just as in ASCII/ANSEL or unrestricted
Unicode:  LATIN SMALL LETTER F followed by COMBINING DOT BELOW.

The only decomposed combinations forbidden by NFC are those which
already have canonically equivalent Unicode characters.  Thus NFC
represents Latin "d" followed by a dot below as U+1E0D LATIN SMALL
LETTER D WITH DOT BELOW.
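
The two cases above can be checked with a short sketch, again assuming Python's standard `unicodedata` module as the normalizer; the variable names are illustrative only.

```python
import unicodedata

def nfc(text: str) -> str:
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)

f_dot = "f\u0323"   # LATIN SMALL LETTER F + COMBINING DOT BELOW
d_dot = "d\u0323"   # LATIN SMALL LETTER D + COMBINING DOT BELOW

nfc_f = nfc(f_dot)  # no precomposed character exists, so the pair is kept
nfc_d = nfc(d_dot)  # composed to the single character U+1E0D

for label, s in (("f + dot below", nfc_f), ("d + dot below", nfc_d)):
    # List the character names that make up each NFC result.
    print(label, "->", [unicodedata.name(c) for c in s])
```

The combining sequence for "f" passes through NFC unchanged, while the sequence for "d" is replaced by its canonically equivalent precomposed character.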

In this way, your customers have to handle only a single representation
of each such character, and yet every character can still be represented.

-- 
There is / one art || John Cowan <jcowan@reutershealth.com>
no more / no less || http://www.reutershealth.com
to do / all things || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein

Received on Thursday, 22 February 2001 20:12:02 UTC