- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 1 May 1997 21:24:52 +0200 (MET DST)
- To: "Michael Kung <MKUNG.US.ORACLE.COM>" <MKUNG@us.oracle.com>
- cc: URI mailing list <uri@bunyip.com>
On 30 Apr 1997, Michael Kung <MKUNG.US.ORACLE.COM> wrote:
> Agree on the 'key words'. But this rule also implies that I cannot put any
> double byte English Alphabet in my company name (or I have to change my
> company name for URL).
Well, I think a company would not be well-advised (and would probably
even be challenged in court) if it chose a name identical to another
one except for the fact that it uses full-width instead of half-width
characters.
Also, it should be in the company's own interest to use a URL that
doesn't cause problems for its users. Most users, especially new ones,
are not very familiar with the artificial separation of character
codes brought about by some confusion between characters and glyphs
on current computers. Most companies with English-letter names will
already have their ASCII URL long before full-width characters work.
That said, I know, like many others on this list, that this is only
one case of a whole can of worms that we have to address in one way
or another. As promised, I have therefore started to write a draft,
which is currently at about the same stage as Larry's recent draft,
so I can bother you with it below. Any comments are welcome. You
don't need to have a solution for a problem; just helping me list
the problems is extremely valuable.
Regards, Martin.
Internet Draft M. Duerst
<draft-duerst-i18n-norm-00?.txt> University of Zurich
Expires in six months May 1997
Normalization of Internationalized Identifiers
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working doc-
uments of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute work-
ing documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months. Internet-Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet-
Drafts as reference material or to cite them other than as a "working
draft" or "work in progress".
To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net
(Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
Rim).
Distribution of this document is unlimited. Please send comments to
the author at <mduerst@ifi.unizh.ch> or to the uri mailing list at
uri@bunyip.com. This document is currently a pre-draft, for
restricted discussion only. It is intended to become part of a suite
of documents related to the internationalization of URLs.
Abstract
The Universal Character Set (UCS) makes it possible to extend the
repertoire of characters used in non-local identifiers beyond US-
ASCII. The UCS contains a large overall number of characters, many
codepoints for backwards compatibility, and various mechanisms to
cope with the features of the writing systems of the world. These
features can lead to ambiguities in representation. Such ambiguities
are not a problem when representing running text, and therefore
existing standards have only defined equivalences. For use in
identifiers, which most software compares using their binary
representation, this is not sufficient. This document defines a
normalization algorithm and gives usage guidelines to avoid such
ambiguities.
Table of contents
1. Introduction ................................................... ?
1.1 General ......................................................?
To be completed
Bibliography .......................................................?
Author's Address ...................................................?
1. Introduction
1.1 General
For the identification of resources in networks, many kinds of
identifiers are in use. Locally, identifiers can contain characters
from all kinds of languages and scripts, but because such characters
are encoded differently on different systems, network identifiers had
to be limited to a very restricted character repertoire, usually a
subset of US-ASCII [US-ASCII].
With the definition of the Universal Character Set (UCS) [ISO 10646]
[Unicode2], it becomes possible to extend the character repertoire of
such identifiers. In some cases, this has already been done
[Java][URN-Syntax]; other cases are under study. While identifiers
for resources of full worldwide interest should continue to be
limited to a very restricted set of widely known characters, names for
resources mainly used in a language-local or script-local context may
provide significant additional user convenience if they can make use
of a wider character repertoire [iURL rationale].
The UCS contains a large overall number of characters, many code-
points for backwards compatibility, and various mechanisms to cope
with the features of the writing systems of the world. These all lead
to ambiguities that in some cases can be resolved by careful display,
printing, and examination by the reader, but which in other cases are
intended to be unnoticeable to the reader.
Systems handling running text can deal with such ambiguities by using
various kinds of equivalences and normalizations, which may differ by
implementation. However, software processing identifiers usually
compares their binary representation to establish that two identifiers
are identical. In some cases, some additional processing is also done
to account for the specifics of identifier syntax variation. To
upgrade all such software to take into account
the equivalences and ambiguities in the UCS would be extremely
tedious. For some classes of identifiers, it would be impossible
because their binary representation is transparent in the sense that
it may allow legacy character encodings besides a character encoding
based on UCS to be used and/or it may allow for arbitrary binary data
to be contained in identifiers.
In order to facilitate widespread use of identifiers containing
characters from the UCS, this document therefore develops a clear
specification of a normalization algorithm that removes basic
ambiguities, together with guidelines for the use of characters with
potential ambiguity.
1.? Guidelines
The specifications and guidelines in this document have been developed
with the following goals in mind:
- Avoid unpleasant surprises for users who cannot understand why two
  identifiers that look exactly the same don't match. The user in
  this case is an average user without any specific knowledge of
  character encoding, but with a basic dose of "computer literacy"
  (e.g. knows that 0 and O are distinct keys on a keyboard).
- Restrict normalization to cases where it is really necessary;
cover remaining ambiguities by guidelines.
- Define normalization so that it can be implemented using widely
accessible documentation.
- Define normalization so that most identifiers currently existing
locally are not affected.
- Take measures for best possible compatibility with future addi-
tions to the UCS.
1.? Notation
Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
hexadecimal representation, according to [Unicode2, p.???].
Stretches of characters? Official character names and their
components are given in all upper case.
2. Categories of Ambiguity and Problems
Comparing two sequences of codepoints from the UCS, various degrees
of ambiguity can arise:
Category A: The two sequences are expected to be rendered exactly the
same, considered identical by the user, and cannot be disambiguated
by context.
Category B: The two sequences are "semantically" different but diffi-
cult or impossible to distinguish in rendering.
Category C: ?????
????
There are also a number of codepoints in the UCS that should not be
used for various reasons, mainly that they are not available on usual
keyboards. These go into Category X.
?. Normalization of Combining Sequences
One of the main reasons for Category A ambiguities is the fact that
the UCS contains a general mechanism for encoding diacritic combina-
tions from base letters and modifying diacritics, but that many com-
binations also exist as precomposed codepoints.
The following algorithm normalizes such combinations:
Step 1: Starting from the beginning of the identifier, find a maximal
sequence of a base character (possibly decomposable) followed by mod-
ifying letters.
Step 2: Fully decompose the sequence found in step 1, using all
canonical decompositions defined in [Unicode2] and all canonical
decompositions defined for future additions to the UCS.
Step 3: Sort the sequence of modifying letters found in Step 2
according to the canonical ordering algorithm of Section 3.9 of [Uni-
code2].
Step 4: Try to recombine as much as possible of the sequence
resulting from Step 3 into a precomposed character by finding the
longest initial match with any canonical decomposition sequence
defined in [Unicode2], ignoring decomposition sequences of length 1.
Step 5: Use the result of Step 4 as output and continue with Step 1.
Note: In Step 4, the decomposition sequences in [Unicode2] have to be
recursively expanded for each character (except for decomposition
sequences of length 1) before application. Otherwise, a character
such as U+1E1C, LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE, will
not be recomposed correctly.
Note: In Step 4, canonical decompositions defined for future
additions to the UCS are explicitly not considered, to ease forward
compatibility. It is assumed that systems knowing about newly defined
precompositions will be able to decompose them correctly in Step 2,
but that it is hard to change identifiers on older systems using a
decomposed representation.
Note: A different definition of Step 4 may lead to shorter
normalizations for some identifiers. The current definition was
chosen for simplicity and implementation speed. (This may be subject
to discussion, in particular if somebody has an implementation and is
ready to share the code.)
Note: The above algorithm can be sped up by shortcuts, in particular
by noting that precomposed characters (with the important exception
of those that have a decomposition sequence of length 1) that are not
followed by modifying letters are already normalized.
Note: A completely different algorithm that results in the same
observed input-output behaviour is also acceptable.
Note: The exception for "precomposed letters that have a decomposi-
tion sequence of length 1" in Step 4 is necessary to avoid e.g. the
letter "K" being "aggregated" to "KELVIN SIGN" U+212A.
?. Hangul Jamo Normalization
Hangul Jamo (U+1100-U+11FF) provide ample possibilities for ambiguous
notations and therefore must be carefully normalized. The following
algorithm, or any equivalent in terms of input-output behaviour,
should be used:
Step 1: A sequence of Hangul jamo is split up into syllables
according to the definition of syllable boundaries on page 3-12 of
[Unicode2]. Each of these syllables is processed according to Steps
2-4.

Step 2: Fillers are inserted as necessary to form a canonical
syllable as defined on page 3-12 of [Unicode2].
Step 3: Sequences of choseong, jungseong, and jongseong (leading con-
sonants, vowels, and trailing consonants) are replaced by a single
choseong, jungseong, and jongseong respectively according to the com-
patibility decompositions given in [Unicode2]. If this is not possi-
ble, the sequence is malformed and the user should be warned.
Step 4: The sequence is replaced by a Hangul Syllable (U+AC00-U+D7AF)
if this is possible according to the algorithm given on pp. 3-12/3 of
[Unicode2].
Note: We still need something for dealing with compatibility Jamo
(U+3130...).
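Note (illustration only, not part of this specification): for modern
jamo (one leading consonant, one vowel, and an optional trailing
consonant), the composition of Step 4 reduces to a simple arithmetic
formula; the sketch below, including the function name, is only one
example of such an implementation.

   # Illustration only: arithmetic composition of a modern jamo
   # sequence into a single precomposed Hangul Syllable.
   S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
   L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

   def compose_syllable(l, v, t=''):
       l_index = ord(l) - L_BASE              # leading consonant (choseong)
       v_index = ord(v) - V_BASE              # vowel (jungseong)
       t_index = ord(t) - T_BASE if t else 0  # trailing consonant (jongseong)
       if not (0 <= l_index < L_COUNT
               and 0 <= v_index < V_COUNT
               and 0 <= t_index < T_COUNT):
           raise ValueError('not a composable modern jamo sequence')
       return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

   # U+1100 (choseong KIYEOK) + U+1161 (jungseong A) compose to
   # U+AC00, HANGUL SYLLABLE GA.
   assert compose_syllable('\u1100', '\u1161') == '\uAC00'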
?. Other Cases of Ambiguities
General considerations about case.
Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A):
The letter from the correct alphabet should be used in context with
other letters from that alphabet. Mixed-alphabet identifiers have to
be avoided. In the case of single letters mixed with numbers and
such, which should be avoided in the first place, it should be
assumed that such letters are Latin if possible, and Cyrillic
otherwise. Lower-case identifiers should be preferred because lower
case poses fewer such problems. (Should heuristics based on wider
context (e.g. domain names) be mentioned?)
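Note (illustration only, not part of this specification): a simple
heuristic check for mixed-alphabet identifiers can be based on the
official character names; the sketch below, including its names, is
only an example and ignores wider context such as domain names.

   # Illustration only: flag identifiers that mix Latin, Greek, and
   # Cyrillic letters, based on the official UCS character names.
   import unicodedata

   ALPHABETS = ('LATIN', 'GREEK', 'CYRILLIC')

   def mixes_alphabets(identifier):
       seen = set()
       for ch in identifier:
           name = unicodedata.name(ch, '')
           for alphabet in ALPHABETS:
               if name.startswith(alphabet + ' '):
                   seen.add(alphabet)
       return len(seen) > 1

   # 'A' (LATIN CAPITAL LETTER A) followed by U+0410 (CYRILLIC CAPITAL
   # LETTER A) mixes alphabets, although both render identically.
   assert mixes_alphabets('A\u0410')
   assert not mixes_alphabets('Abc')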
Half-width and full-width compatibility characters (U+FF00...): The
version not in the compatibility section (i.e. half-width for Latin
and symbols, full-width for Katakana, Hangul, "LIGHT VERTICAL",
arrows, black square, and white circle) should be used wherever
possible. Because half-width Latin characters may be needed in
certain parts of certain identifiers anyway, keyboard settings in
places where identifiers are input may be set to produce half-width
Latin characters by default, making the input of full-width
characters more tedious. Also, while the difference between
half-width and full-width characters is clearly visible on computers
in contexts that use fixed-pitch displays, it is not well preserved
on paper or in high-quality printing. Identifiers should never differ
by a half-width/full-width difference only.
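Note (illustration only, not part of this specification): the mapping
from the half-width/full-width compatibility forms back to the
preferred versions outside the compatibility section is given by
their compatibility decompositions; Python's NFKC transformation
applies them, and the sketch below restricts it to the Halfwidth and
Fullwidth Forms block so that other compatibility mappings are not
triggered. The function name is only an example.

   # Illustration only: fold half-width/full-width compatibility forms
   # (U+FF00-U+FFEF) to the preferred versions via compatibility
   # decomposition, leaving all other characters untouched.
   import unicodedata

   def fold_width(identifier):
       return ''.join(
           unicodedata.normalize('NFKC', ch)
           if 0xFF00 <= ord(ch) <= 0xFFEF else ch
           for ch in identifier)

   # Full-width 'URL' (U+FF35 U+FF32 U+FF2C) folds to ASCII 'URL';
   # half-width Katakana A (U+FF71) folds to U+30A2, KATAKANA LETTER A.
   assert fold_width('\uFF35\uFF32\uFF2C') == 'URL'
   assert fold_width('\uFF71') == '\u30A2'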
Vertical variants (U+FE30...): Should not be used, in particular
because they are variants of characters that are already discouraged
:-).
Small form variants (U+FE50...): Strongly discouraged (where do they
come from?).
Ligatures (Latin and Arabic). Not covered by canonical decomposition.
Need to write some normalization specs for them!
Other script-specific stuff.
Signs and symbols.
Punctuation.
?. Ideographic Ambiguities
Compatibility Ideographs: How to handle the Korean case? How to han-
dle the other stuff?
Warning about JIS 75/83 (97!) problems (~20 pairs).
Warning about backwards-compatibility non-unifications (about 100
pairs and some triples of differing seriousness; affecting inter-
typographic-context work but not intra-TC).
Explanation about general differences due to simplifications.
Acknowledgements
I am grateful in particular to the following persons for contributing
ideas, advice, criticism, and help: Mark Davis, Larry Masinter, (to be
completed).
Bibliography
[HTML] T. Berners-Lee and D. Connolly, "Hypertext Markup Lan-
guage - 2.0" (RFC1866), MIT/W3C, November 1995.
[Unicode2]   The Unicode Consortium, "The Unicode Standard, Version
             2.0", Addison-Wesley, Reading, MA, 1996.
[HTML-I18N] F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter-
nationalization of the Hypertext Markup Language",
Work in progress (draft-ietf-html-i18n-05.txt), August
1996.
Author's Address
Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland
Tel: +41 1 257 43 16
Fax: +41 1 363 00 35
E-mail: mduerst@ifi.unizh.ch
NOTE -- Please write the author's name with u-Umlaut wherever
possible, e.g. in HTML as Dürst.
Received on Thursday, 1 May 1997 15:25:26 UTC