Re: html, http, urls and internationalisation from Glenn Adams on 1996-01-31 (ietf-http-wg@w3.org from January to March 1996)

From: Glenn Adams <glenn@stonehand.com>
Date: Wed, 31 Jan 96 09:51:37 -0500
To: Mike_Spreitzer.PARC@xerox.com
Cc: keld@dkuug.dk, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <9601311451.AA23676@trubetzkoy.stonehand.com>

    Date: Tue, 30 Jan 1996 22:06:19 PST
    From: Mike_Spreitzer.PARC@xerox.com

    In Unicode, there are multiple ways to code a given
    character.  For example, Unicode includes Latin-1, which includes O-umlaut.
    Unicode also has an umlaut modifier, so that the same character can be
    coded as the two-code sequence "umlaut, O".  Do people who enter URLs
    have to be careful to do so in a certain canonical way?  Does a server
    have to canonicalize URLs it receives?

While it is true that, in certain cases, Unicode provides multiple ways
of encoding the same textual information (i.e., textual information for
which users make no semantic distinction between its representations), this
is not new to Unicode.  Case folding is of a similar nature.  From a
formal perspective, the "characters":

  (1) U+00FC LATIN SMALL LETTER U WITH DIAERESIS

  and

  (2) U+0075 LATIN SMALL LETTER U followed by
      U+0308 COMBINING DIAERESIS

are distinct.

The first encodes one 'character'; the second encodes two 'characters'.
On the other hand, the textual information encoded by both is generally
construed as equivalent in the context of a single orthographic practice
for a particular language.  [Of course, they may be construed differently
when representing information of distinct languages.]
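
To make the formal distinction concrete, here is a minimal Python sketch
that is not part of the original message. It uses only the standard
unicodedata module, and the names precomposed, decomposed and describe are
illustrative. It shows that the two encodings differ in length and do not
compare equal at the binary level.

```python
import unicodedata

# (1) one code point: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
precomposed = "\u00fc"
# (2) two code points: U+0075 followed by U+0308 COMBINING DIAERESIS
decomposed = "u\u0308"

def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

print(describe(precomposed))      # one name
print(describe(decomposed))       # two names
print(precomposed == decomposed)  # strict binary comparison: False
```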

Comparison of Unicode character data can occur according to different
notions of equivalence, ranging from binary equivalence to linguistic
equivalence.  The Unicode 1.1 Preprint Edition (Unicode Technical Report
#4, Section 4.4) specifies a standard algorithm for normalizing
differences in the encoding of combining characters that can be used in
the absence of specific information about semantic distinctions.

I would recommend that HTTP servers and HTML UAs employ this algorithm
by default when comparing Unicode-encoded character data, unless strict
binary equivalence is necessary or a higher-level, language-based
module is available for performing comparison on a per-language basis.
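
The recommended default can be sketched roughly as follows. This is a
hedged illustration only: it substitutes the canonical decomposition form
(NFD) provided by Python's unicodedata module for the algorithm in the
technical report, which is an assumption and may not match that algorithm
in every detail. The function name equivalent and its parameters binary
and language_compare are hypothetical.

```python
import unicodedata

def equivalent(a, b, binary=False, language_compare=None):
    """Compare two strings under one of three notions of equivalence.

    binary=True        -> strict code-point-for-code-point comparison
    language_compare   -> caller-supplied per-language comparison function
    otherwise          -> compare after canonical decomposition (NFD),
                          used here as a stand-in for the normalization
                          algorithm cited in the message (an assumption)
    """
    if binary:
        return a == b
    if language_compare is not None:
        return language_compare(a, b)
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Two encodings of the same path segment, as a server might receive them.
path_precomposed = "/m\u00fcnchen"
path_decomposed = "/mu\u0308nchen"

print(equivalent(path_precomposed, path_decomposed))               # True
print(equivalent(path_precomposed, path_decomposed, binary=True))  # False
```

Applying normalization only at comparison time leaves the received data
untouched, so a caller that does need strict binary equivalence can still
obtain it.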

Regards,
Glenn Adams

Received on Wednesday, 31 January 1996 15:14:18 UTC