- From: Ray Denenberg <rden@loc.gov>
- Date: Thu, 22 Feb 2001 18:54:34 -0500
- To: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
- CC: Ray Denenberg <rden@loc.gov>
The Library of Congress has reviewed the proposed W3C draft Character Model. We have two concerns, both pertaining to character normalization: (1) the preference for Unicode Normalization Form C (NFC), and (2) “early” normalization.

1. Normalization Form C (NFC) prescribes pre-composed characters (letter-with-diacritic) as opposed to a base letter followed by a non-spacing modifier. This is contrary to the MARC encoding model, which prescribes the latter. (“MARC” refers to a class of standard formats for bibliographic and related information. These formats are widely adopted and deployed throughout the world.)

Unicode includes a repertoire of non-spacing modifier characters, for use by communities that have used the ANSEL (Extended Latin) repertoire for many years. This Unicode repertoire of characters for diacritical marks would allow libraries to serve data on the Web in a form closer to its legacy encoding.

There are 96 letter-with-diacritic combinations, used in library transliterations of non-Latin scripts into Latin, that have no corresponding pre-composed Unicode character. (This number, 96, is based on a 1997 analysis by Randall Barry of the Library of Congress of the modified Latin script characters prescribed in the ALA-LC Romanization Tables.) We doubt that Unicode will add codes for these 96 combinations. The problem this raises for libraries is not trivial, as a significant number of records for transliterated data involve some of these 96 known combinations.

The problem is compounded by the ASCII/ANSEL union, which has resulted in an essentially open repertoire of combinations. The characters in these two sets were carried over to Unicode (ISO 10646). The number of potential combinations of base letters with "combining" marks is large; often more than one combining mark can be associated with a base letter. For example, some Vietnamese letters have three associated combining marks.
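As a rough illustration of the difference between the two forms, the following is a minimal sketch using Python's standard `unicodedata` module. The helper names (`code_points`, `to_nfc`, `to_decomposed`) and the sample strings are hypothetical and chosen only for illustration; in particular, NFD is used here as a stand-in for the "base letter followed by non-spacing modifier" form, and "q with acute" is an arbitrary combination with no pre-composed code point, not necessarily one of the 96 counted above.

```python
import unicodedata


def code_points(text):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


def to_nfc(text):
    """Normalize to NFC: compose base + marks where a pre-composed form exists."""
    return unicodedata.normalize("NFC", text)


def to_decomposed(text):
    """Normalize to NFD: base letter followed by combining marks
    (assumed here to approximate the decomposed model described above)."""
    return unicodedata.normalize("NFD", text)


# A combination that has a pre-composed character: e + combining acute.
print(code_points(to_nfc("e\u0301")))          # ['U+00E9']
print(code_points(to_decomposed("\u00e9")))    # ['U+0065', 'U+0301']

# A base letter with more than one combining mark (a + circumflex + acute).
print(code_points(to_nfc("a\u0302\u0301")))    # ['U+1EA5']

# A combination with no pre-composed character: NFC leaves it as a sequence.
print(code_points(to_nfc("q\u0301")))          # ['U+0071', 'U+0301']
```

The last case suggests that NFC data can still contain combining sequences, so a receiver cannot assume that every letter-with-diacritic arrives as a single code point.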
Even for Latin-based languages, there are dialects and oddities that show up on title pages. For example, writers on obscure topics such as ancient manuscripts invent new letter-with-diacritic combinations, which ANSEL allows catalogers to encode.

2. “Early” normalization

The proposed model prescribes that normalization should occur "early", meaning close to where the data is stored, before transmission. This puts the burden of conversion on library systems that receive queries and serve MARC records in response. We suspect libraries would object to this.

--
Ray Denenberg
Library of Congress
rden@loc.gov
202-707-5795
Received on Thursday, 22 February 2001 18:55:04 UTC