Re: Library of Congress comments on W3C Draft Character Model

Hello Ray,

Many thanks for your comments. Just a few remarks, in addition to those
by John Cowan, which I think addressed most of the issues, and some
requests for additional information:

First, do you have any pointers to ANSEL? (Web-based ones are preferred.)
In particular, in the ANSEL system, are the diacritics stored before or
after the base letter?
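
For comparison: in Unicode, the combining marks always follow the base
letter, and NFC composes a pair into a single character wherever a
precomposed form exists. A minimal illustration, using Python's standard
unicodedata module (the sample strings here are just hypothetical data):

    import unicodedata

    decomposed = "e\u0301"                  # 'e' + COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)
    assert composed == "\u00e9"             # LATIN SMALL LETTER E WITH ACUTE
    print(unicodedata.name(decomposed[1]))  # COMBINING ACUTE ACCENT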

Second, John spoke about future systems, but I assume there are already
interfaces from MARC systems to Web pages. If they deal (at least partly)
with accented characters, there is a good chance that they try to convert
the data to iso-8859-1 (the first and most widely used character encoding
on the Web). If such systems exist, then the basic work is already done;
it just needs to be extended to the characters beyond iso-8859-1. Can you
check whether there are such systems? (Maybe even give us a URI if you
know one?) Or do Web interfaces of MARC systems use different strategies?
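
If it helps, here is a rough sketch of the conversion strategy I have in
mind, again in Python (the function name and sample strings are mine, not
taken from any existing MARC interface): compose to NFC first, serve what
fits in iso-8859-1 directly, and fall back to numeric character references
for the rest.

    import unicodedata

    def to_latin1_with_fallback(text):
        # Compose base letter + combining mark sequences wherever a
        # precomposed character exists, e.g. 'e' + U+0301 -> U+00E9.
        composed = unicodedata.normalize("NFC", text)
        # Encode as iso-8859-1; characters beyond its repertoire are
        # emitted as numeric character references for use in HTML/XML.
        return composed.encode("iso-8859-1", "xmlcharrefreplace")

    print(to_latin1_with_fallback("re\u0301sume\u0301"))  # b'r\xe9sum\xe9'
    print(to_latin1_with_fallback("Vie\u0323\u0302t"))    # b'Vi&#7879;t'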

Regards,   Martin.

At 18:54 01/02/22 -0500, Ray Denenberg wrote:
>The Library of Congress has reviewed the proposed W3C draft Character
>Model. We have two concerns, both pertaining to character normalization:
>(1) the preference for Unicode Normalization Format "C" (NFC), and (2)
>"early" normalization.
>
>1. Normalization Format
>NFC prescribes pre-composed characters (letter-with-diacritic) as opposed
>to base-letter followed by non-spacing-modifier. This is contrary to the
>MARC encoding model, which prescribes the latter. ("MARC" refers to a
>class of standard formats for bibliographic and related information.
>These formats are widely adopted and deployed throughout the world.)
>Unicode includes a repertoire of non-spacing-modifier characters, for use
>by communities that have used the ANSEL (Extended Latin) repertoire for
>many years. This Unicode repertoire of characters for diacritical marks
>would allow libraries to serve data on the Web in a form closer to its
>legacy encoding. There are 96 letter-with-diacritic combinations, used in
>library transliterations of non-Latin scripts into Latin, that do not
>have corresponding Unicode character encodings. (This number, 96, is
>based on a 1997 analysis by Randall Barry of the Library of Congress of
>the modified Latin script characters prescribed in the ALA-LC
>Romanization Tables.) We doubt that Unicode will add codes for these 96
>combinations.
>     The problem this raises for libraries is not trivial, as a
>significant number of records for transliterated data involve some of
>these 96 known combinations. The problem is compounded by the ASCII/ANSEL
>union, which has resulted in an essentially open repertoire of
>combinations. The characters in these two sets were carried over to
>Unicode (ISO 10646). The number of potential combinations of base letters
>with "combining" marks is large; often more than one combining mark can
>be associated with a base letter. For example, some Vietnamese letters
>carry three associated combining marks.
>     Even for Latin-based languages, there are dialects and oddities that
>show up on title pages; for example, writers on obscure topics such as
>ancient manuscripts invent new letter-with-diacritic combinations, which
>ANSEL allows catalogers to encode.
>
>2. "Early" normalization
>The proposed model prescribes that normalization should occur "early",
>meaning close to where the data is stored, before transmission. This puts
>the burden of conversion on library systems receiving queries and serving
>MARC records in response. We suspect libraries would object to this.
>
>--
>Ray Denenberg
>Library of Congress
>rden@loc.gov
>202-707-5795
>

Received on Friday, 23 February 2001 11:59:03 UTC