Library of Congress comments on W3C Draft Character Model from Ray Denenberg on 2001-02-22 (www-i18n-comments@w3.org from February 2001)

From: Ray Denenberg <rden@loc.gov>
Date: Thu, 22 Feb 2001 18:54:34 -0500
To: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
CC: Ray Denenberg <rden@loc.gov>
Message-ID: <3A95A6BA.A393CB3F@rs8.loc.gov>

The Library of Congress has reviewed the proposed W3C draft Character Model. We
have two concerns, both pertaining to character normalization: (1) the
preference for Unicode Normalization Form C (NFC), and (2) “early”
normalization.

1. Normalization Form
NFC prescribes pre-composed characters (letter-with-diacritic) as opposed to
base-letter followed by non-spacing-modifier. This is contrary to the MARC
encoding model which prescribes the latter. (“MARC” refers to a class of
standard formats for bibliographic and related information. These formats are
widely adopted and deployed throughout the world.)  Unicode includes a
repertoire of non-spacing-modifier characters for use by communities that have
used the ANSEL (Extended Latin) repertoire for many years. This Unicode
repertoire of characters for diacritical marks would allow libraries to serve
data on the Web in a form closer to its legacy encoding. There are 96
letter-with-diacritic combinations, used in library transliterations of
non-Latin scripts into Latin, that have no corresponding precomposed Unicode
character. (This number, 96, is based on a 1997 analysis by Randall Barry of the
Library of Congress of the modified Latin script characters prescribed in the
ALA-LC Romanization Tables.) We doubt that Unicode will add codes for these 96
combinations.
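
To illustrate the distinction, here is a minimal Python sketch using the
standard unicodedata module. The helper name nfc_report and the two sample
strings are illustrative assumptions; in particular, g-with-tilde is merely a
combination known to have no precomposed code point and is not claimed to be
one of the 96 identified in the 1997 analysis.

```python
import unicodedata


def nfc_report(text: str) -> list:
    """Return the code points of `text` after NFC normalization."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


# e + COMBINING ACUTE ACCENT: a precomposed character (U+00E9) exists,
# so NFC replaces the two code points with one.
composable = "e\u0301"

# g + COMBINING TILDE: illustrative choice with no precomposed code point,
# so NFC has to leave the base letter and the mark as a sequence.
uncomposable = "g\u0303"

print(nfc_report(composable))    # ['U+00E9']
print(nfc_report(uncomposable))  # ['U+0067', 'U+0303']
```

The second case suggests that NFC data can still contain base-plus-mark
sequences whenever no precomposed character is available.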
    The problem this raises for libraries is not trivial, as a significant
number of records for transliterated data involve some of these 96 known
combinations.  The problem is compounded by the ASCII/ANSEL union, which has
resulted in an essentially open repertoire of combinations.  The characters in
these two sets were carried over to Unicode (ISO/IEC 10646). The number of potential
combinations of base letters with "combining" marks is large; often more than
one combining mark can be associated with a base letter. For example, for some
Vietnamese letters, there are three associated combining marks.
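
As a rough illustration of a base letter carrying more than one combining
mark, the following Python sketch decomposes a single Vietnamese letter with
the standard unicodedata module. The helper name split_marks and the sample
letter are assumptions chosen for illustration; the number of marks counted
under Unicode decomposition may differ from the count under ANSEL.

```python
import unicodedata


def split_marks(char: str):
    """Decompose `char` (NFD) into its base letter and combining-mark names."""
    decomposed = unicodedata.normalize("NFD", char)
    base = decomposed[0]
    # unicodedata.combining() is nonzero for combining marks.
    marks = [unicodedata.name(c) for c in decomposed[1:]
             if unicodedata.combining(c)]
    return base, marks


# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
base, marks = split_marks("\u1ec7")
print(base)   # e
print(marks)  # ['COMBINING DOT BELOW', 'COMBINING CIRCUMFLEX ACCENT']
```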
    Even for Latin-based languages, there are dialects and oddities that show up
on title pages. For example, writers on obscure topics such as ancient
manuscripts invent new letter-with-diacritic combinations, which ANSEL allows
catalogers to encode.

2. “Early” Normalization
The proposed model prescribes that normalization should occur "early", meaning
close to where the data is stored, before transmission.  This puts the burden of
conversion on library systems receiving queries and serving MARC records in
response.  We suspect libraries would object to this.
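
A minimal sketch of what this conversion step might look like on the serving
side, written in Python with the standard unicodedata module. The function
names serve_record and is_nfc and the sample title are hypothetical; the sketch
assumes the stored record text is already Unicode in decomposed form (base
letter followed by combining marks) and does not model MARC record structure.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if `text` is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def serve_record(stored_text: str) -> str:
    """Hypothetical serving step: normalize to NFC before transmission."""
    return unicodedata.normalize("NFC", stored_text)


# Assumed stored form: base letters followed by combining caron and acute.
stored = "Dvor\u030ca\u0301k"
sent = serve_record(stored)

print(is_nfc(stored), len(stored))  # False 8
print(is_nfc(sent), len(sent))      # True 6
```

Under "early" normalization, a step of this kind would fall to the system
holding the records rather than to the client receiving them.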



--
Ray Denenberg
Library of Congress
rden@loc.gov
202-707-5795

Received on Thursday, 22 February 2001 18:55:04 UTC