- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 1 May 1997 21:24:52 +0200 (MET DST)
- To: "Michael Kung <MKUNG.US.ORACLE.COM>" <MKUNG@us.oracle.com>
- cc: URI mailing list <uri@bunyip.com>
On 30 Apr 1997, Michael Kung <MKUNG.US.ORACLE.COM> wrote:

> Agree on the 'key words'. But this rule also implies that I cannot put any
> double byte English Alphabet in my company name (or I have to change my
> company name for URL).

Well, I think a company would not be well-advised (and would probably even be challenged in court) if it chose a name identical to another one except for the fact that it uses full-width instead of half-width characters. Also, it should probably be in the company's own interest to use a URL that doesn't cause problems for its users. Most users, especially new ones, are not very familiar with the artificial separation of character codes brought about by some confusion between characters and glyphs on current computers. Most companies with English-letter names will already have their ASCII URL long before full-width characters work.

That said, I know, like many others on this list, that this is only one case of a whole can of worms that we have to address in some way or another. As promised, I have therefore started to write a draft, which is currently at about the same stage as Larry's recent draft, so I can bother you with it below. Any comments are welcome. You don't need to have a solution for a problem; just helping me list the problems is extremely valuable.

Regards,   Martin.


Internet Draft                                               M. Duerst
<draft-duerst-i18n-norm-00?.txt>                  University of Zurich
Expires in six months                                         May 1997


             Normalization of Internationalized Identifiers

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months. Internet-Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet-Drafts as reference material or to cite them other than as a "working draft" or "work in progress".

To learn the current status of any Internet-Draft, please check the 1id-abstracts.txt listing contained in the Internet-Drafts Shadow Directories on ds.internic.net (US East Coast), nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim).

Distribution of this document is unlimited. Please send comments to the author at <mduerst@ifi.unizh.ch> or to the uri mailing list at uri@bunyip.com.

This document is currently a pre-draft, for restricted discussion only. It is intended to become part of a suite of documents related to the internationalization of URLs.

Abstract

The Universal Character Set (UCS) makes it possible to extend the repertoire of characters used in non-local identifiers beyond US-ASCII. The UCS contains a large overall number of characters, many codepoints for backwards compatibility, and various mechanisms to cope with the features of the writing systems of the world. These features can lead to ambiguities in representation. Such ambiguities are not a problem when representing running text, and therefore existing standards have only defined equivalences. For use in identifiers, which are compared using their binary representation by most software, this is not sufficient. This document defines a normalization algorithm and gives usage guidelines to avoid such ambiguities.

Table of contents
   1.  Introduction
   1.1 General
   ... (to be completed)
   Bibliography
   Author's Address

1. Introduction

1.1 General

For the identification of resources in networks, many kinds of identifiers are in use. Locally, identifiers can contain characters from all kinds of languages and scripts, but because these characters were encoded in many different ways, network identifiers had to be limited to a very restricted character repertoire, usually a subset of US-ASCII [US-ASCII].

With the definition of the Universal Character Set (UCS) [ISO 10646] [Unicode2], it becomes possible to extend the character repertoire of such identifiers. In some cases, this has already been done [Java] [URN-Syntax]; other cases are under study. While identifiers for resources of full worldwide interest should continue to be limited to a very restricted set of widely known characters, names for resources mainly used in a language-local or script-local context may provide significant additional user convenience if they can make use of a wider character repertoire [iURL rationale].

The UCS contains a large overall number of characters, many codepoints for backwards compatibility, and various mechanisms to cope with the features of the writing systems of the world. These all lead to ambiguities that in some cases can be resolved by careful display, printing, and examination by the reader, but which in other cases are intended to be unnoticeable by the reader.

In systems dealing with running text, such ambiguities can be handled by using various kinds of equivalences and normalizations, which may differ by implementation. However, software processing identifiers usually compares their binary representation to establish that two identifiers are identical. In some cases, additional processing is also done to account for the specifics of identifier syntax variation. To upgrade all such software to take the equivalences and ambiguities in the UCS into account would be extremely tedious. For some classes of identifiers, it would be impossible, because their binary representation is transparent in the sense that it may allow legacy character encodings besides a character encoding based on the UCS to be used and/or it may allow arbitrary binary data to be contained in identifiers.

In order to facilitate widespread use of identifiers containing characters from the UCS, this document therefore develops clear specifications for a normalization algorithm removing basic ambiguities, and guidelines for the use of characters with potential ambiguity.

1.? Guidelines

The specifications and guidelines in this document have been developed with the following goals in mind:

- Avoid bad surprises for users who cannot understand why two identifiers that look exactly the same don't match. The user in this case is an average user without any specific knowledge of character encoding, but with a basic dose of "computer literacy" (e.g. knowing that 0 and O have distinct keys on a keyboard).

- Restrict normalization to cases where it is really necessary; cover remaining ambiguities by guidelines.

- Define normalization so that it can be implemented using widely accessible documentation.
- Define normalization so that most identifiers currently existing locally are not affected.

- Take measures for the best possible compatibility with future additions to the UCS.

1.? Notation

Codepoints from the UCS are denoted as U+XXXX, where XXXX is their hexadecimal representation, according to [Unicode2, p.???]. Stretches of characters? Official character names and their components are written in all upper case.

2. Categories of Ambiguity and Problems

When comparing two sequences of codepoints from the UCS, various degrees of ambiguity can arise:

Category A: The two sequences are expected to be rendered exactly the same, are considered identical by the user, and cannot be disambiguated by context.

Category B: The two sequences are "semantically" different but difficult or impossible to distinguish in rendering.

Category C: ????? ????

There are also a number of codepoints in the UCS that should not be used for various reasons, mainly that they are not available on usual keyboards. These go into Category X.

?. Normalization of Combining Sequences

One of the main reasons for Category A ambiguities is the fact that the UCS contains a general mechanism for encoding diacritic combinations from base letters and modifying diacritics, but that many combinations also exist as precomposed codepoints. The following algorithm normalizes such combinations:

Step 1: Starting from the beginning of the identifier, find a maximal sequence of a base character (possibly decomposable) followed by modifying letters.

Step 2: Fully decompose the sequence found in Step 1, using all canonical decompositions defined in [Unicode2] and all canonical decompositions defined for future additions to the UCS.

Step 3: Sort the sequence of modifying letters found in Step 2 according to the canonical ordering algorithm of Section 3.9 of [Unicode2].

Step 4: Try to recombine as much as possible of the sequence resulting from Step 3 into a precomposed character by finding the longest initial match with any canonical decomposition sequence defined in [Unicode2], ignoring decomposition sequences of length 1.

Step 5: Use the result of Step 4 as output and continue with Step 1.

Note: In Step 4, the decomposition sequences in [Unicode2] have to be recursively expanded for each character (except for decomposition sequences of length 1) before application. Otherwise, a character such as U+1E1C, LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE, will not be recomposed correctly.

Note: In Step 4, canonical decompositions defined for future additions to the UCS are explicitly not considered, to ease forwards compatibility. It is assumed that systems knowing about newly defined precompositions will be able to decompose them correctly in Step 2, but that it is hard to change identifiers on older systems using a decomposed representation.

Note: A different definition of Step 4 may lead to shorter normalizations for some identifiers. The current definition was chosen for simplicity and implementation speed. (This may be subject to discussion, in particular if somebody has an implementation and is ready to share the code.)
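Since the previous note asks for implementations, here is a rough, illustrative sketch (not part of this specification) of the input-output behaviour intended by Steps 1-5. It is written in Python and approximates the canonical decomposition and combining-class data of [Unicode2] with the standard unicodedata module; the names normalize_combining and _RECOMPOSE are invented for illustration only.

   import sys
   import unicodedata

   # Recomposition table for Step 4: recursively expanded canonical
   # decomposition sequence -> precomposed character.  Compatibility
   # decompositions and singleton canonical decompositions (e.g.
   # KELVIN SIGN -> K) are excluded, as required by Step 4.
   _RECOMPOSE = {}
   for _cp in range(sys.maxunicode + 1):
       if 0xD800 <= _cp <= 0xDFFF:          # skip surrogate codepoints
           continue
       _ch = chr(_cp)
       _spec = unicodedata.decomposition(_ch)
       if not _spec or _spec.startswith('<') or len(_spec.split()) == 1:
           continue
       # NFD gives the recursive expansion in canonical order (Steps 2 and 3).
       _RECOMPOSE.setdefault(unicodedata.normalize('NFD', _ch), _ch)

   def normalize_combining(identifier):
       out = []
       i = 0
       while i < len(identifier):
           # Step 1: maximal sequence of a base character followed by
           # modifying (combining) letters.
           j = i + 1
           while j < len(identifier) and unicodedata.combining(identifier[j]):
               j += 1
           # Steps 2 and 3: full canonical decomposition, canonically ordered.
           seq = unicodedata.normalize('NFD', identifier[i:j])
           # Step 4: longest initial match against an expanded canonical
           # decomposition sequence; any remainder stays decomposed.
           for k in range(len(seq), 1, -1):
               pre = _RECOMPOSE.get(seq[:k])
               if pre is not None:
                   seq = pre + seq[k:]
                   break
           out.append(seq)                  # Step 5: output, continue with Step 1
           i = j
       return ''.join(out)

   # Example: U+0045 U+0327 U+0306 (E, combining cedilla, combining breve)
   # recombines to U+1E1C LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE:
   #   normalize_combining('E\u0327\u0306')  ->  '\u1E1C'

A real implementation would of course build the recomposition table directly from the character data files rather than scanning the whole codespace, and would apply the shortcut mentioned in the next note.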
Note: The above algorithm can be sped up by shortcuts, in particular by noting that precomposed characters (with the important exception of those that have a decomposition sequence of length 1) which are not followed by modifying letters are already normalized.

Note: A completely different algorithm that results in the same observed input-output behaviour is also acceptable.

Note: The exception for "precomposed letters that have a decomposition sequence of length 1" in Step 4 is necessary to avoid, e.g., the letter "K" being "aggregated" to KELVIN SIGN, U+212A.

?. Hangul Jamo Normalization

Hangul jamo (U+1100-U+11FF) provide ample possibilities for ambiguous notations and therefore must be carefully normalized. The following algorithm, or an equivalent in terms of input-output behaviour, should be used:

Step 1: A sequence of Hangul jamo is split up into syllables according to the definition of syllable boundaries on page 3-12 of [Unicode2]. Each of these syllables is processed according to Steps 2-4.

Step 2: Fillers are inserted as necessary to form a canonical syllable as defined on page 3-12 of [Unicode2].

Step 3: Sequences of choseong, jungseong, and jongseong (leading consonants, vowels, and trailing consonants) are replaced by a single choseong, jungseong, and jongseong, respectively, according to the compatibility decompositions given in [Unicode2]. If this is not possible, the sequence is malformed and the user should be warned.

Step 4: The sequence is replaced by a Hangul syllable (U+AC00-U+D7AF) if this is possible according to the algorithm given on pp. 3-12/3 of [Unicode2].

Note: We need something for dealing with compatibility jamo (U+3130...).

?. Other Cases of Ambiguities

General considerations about case.

Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A): The letter from the correct alphabet should be used in context with other letters from that alphabet. Mixed-alphabet identifiers have to be avoided. In the case of single letters mixed with numbers and the like, which should be avoided in the first place, it should be assumed that such letters are Latin if possible, and Cyrillic otherwise. Lower-case identifiers should be preferred because lower case has fewer such problems. (Should heuristics based on wider context (e.g. domain names) be mentioned?)

Half-width and full-width compatibility characters (U+FF00...): The version not in the compatibility section (i.e. half-width for Latin and symbols; full-width for Katakana, Hangul, "LIGHT VERTICAL", arrows, black square, and white circle) should be used wherever possible. Because half-width Latin characters may be needed in certain parts of certain identifiers anyway, keyboard settings in places where identifiers are input may be set to produce half-width Latin characters by default, making the input of full-width characters more tedious. Also, while the difference between half-width and full-width characters is clearly visible on computers in contexts that use fixed-pitch displays, it is not well transcribed on paper or with high-quality printing. Identifiers should never differ by a half-width/full-width difference only.

Vertical variants (U+FE30...): Should not be used, in particular because they are variants of characters that are already discouraged :-).

Small form variants (U+FE50...): Strongly discouraged (where do they come from?).
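As an illustration of the half-width/full-width guideline above, the following sketch (again not part of this specification, and with invented function names) maps characters from the compatibility block U+FF00-U+FFEF to their preferred, non-compatibility forms via the <wide> and <narrow> compatibility decompositions, as exposed by Python's unicodedata module.

   import unicodedata

   def preferred_width_form(ch):
       """Return the recommended variant of ch, or ch itself if it is
       not a half-width/full-width compatibility character."""
       if not 0xFF00 <= ord(ch) <= 0xFFEF:
           return ch
       fields = unicodedata.decomposition(ch).split()
       if fields and fields[0] in ('<wide>', '<narrow>'):
           # e.g. FULLWIDTH LATIN CAPITAL LETTER A -> A (half-width),
           #      HALFWIDTH KATAKANA LETTER KA -> KATAKANA LETTER KA.
           return ''.join(chr(int(f, 16)) for f in fields[1:])
       return ch

   def uses_discouraged_width(identifier):
       """True if the identifier contains a half-width/full-width
       compatibility character instead of its preferred form."""
       return any(preferred_width_form(c) != c for c in identifier)

Such a check could be applied when an identifier is created, so that identifiers differing only by a half-width/full-width distinction never come into existence in the first place.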
Ligatures (Latin and Arabic): Not covered by canonical decomposition. Need to write some normalization specs for them!

Other script-specific stuff.

Signs and symbols.

Punctuation.

?. Ideographic Ambiguities

Compatibility ideographs: How to handle the Korean case? How to handle the other stuff?

Warning about JIS 75/83 (97!) problems (~20 pairs).

Warning about backwards-compatibility non-unifications (about 100 pairs and some triples of differing seriousness; affecting inter-typographic-context work but not intra-TC).

Explanation about general differences due to simplifications.

Acknowledgements

I am grateful in particular to the following persons for contributing ideas, advice, criticism, and help: Mark Davis, Larry Masinter, (to be completed).

Bibliography

[HTML]       T. Berners-Lee and D. Connolly, "Hypertext Markup
             Language - 2.0" (RFC 1866), MIT/W3C, November 1995.

[Unicode2]   The Unicode Consortium, "The Unicode Standard, Version
             2.0", Addison-Wesley, Reading, MA, 1996.

[HTML-I18N]  F. Yergeau, G. Nicol, G. Adams, and M. Duerst,
             "Internationalization of the Hypertext Markup Language",
             Work in progress (draft-ietf-html-i18n-05.txt),
             August 1996.

Author's Address

Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland

Tel:    +41 1 257 43 16
Fax:    +41 1 363 00 35
E-mail: mduerst@ifi.unizh.ch

NOTE -- Please write the author's name with u-Umlaut wherever
possible, e.g. in HTML as D&uuml;rst.
Received on Thursday, 1 May 1997 15:25:26 UTC