- From: Glenn Adams <glenn@stonehand.com>
- Date: Wed, 31 Jan 96 09:51:37 -0500
- To: Mike_Spreitzer.PARC@xerox.com
- Cc: keld@dkuug.dk, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
    Date: Tue, 30 Jan 1996 22:06:19 PST
    From: Mike_Spreitzer.PARC@xerox.com

    In Unicode, there are multiple ways to code a given character. For example, Unicode includes Latin-1, which includes O-umlaut. Unicode also has an umlaut modifier, so that the same character can be coded as the two-code sequence "umlaut, O". Do people who enter URLs have to be careful to do so in a certain canonical way? Does a server have to canonicalize URLs it receives?

While it is true that, in certain cases, Unicode provides multiple ways of encoding the same textual information (i.e., textual information for which users make no semantic distinction between its representations), this is not new to Unicode. Case folding is of a similar nature.

From a formal perspective, the "characters":

(1) U+00FC LATIN SMALL LETTER U WITH DIAERESIS

and

(2) U+0075 LATIN SMALL LETTER U followed by U+0308 COMBINING DIAERESIS

are distinct. The first encodes one 'character'; the second encodes two 'characters'. On the other hand, the textual information encoded by both is generally construed as equivalent in the context of a single orthographic practice for a particular language. [Of course, they may be construed differently when representing information of distinct languages.]

Comparison of Unicode character data can occur according to different notions of equivalence, ranging from binary equivalence to linguistic equivalence. The Unicode 1.1 Preprint Edition (Unicode Technical Report #4, Section 4.4) specifies a standard algorithm for normalizing differences in the encoding of combining characters that can be used in the absence of specific information about semantic distinctions.

I would recommend that HTTP servers and HTML UAs employ this algorithm by default when comparing Unicode-encoded character data, unless strict binary equivalence is necessary or unless a higher-level, language-based module is available for performing comparison on a per-language basis.

Regards,
Glenn Adams
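The difference between binary equivalence and equivalence after normalization, for the two encodings (1) and (2) above, can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard `unicodedata.normalize`, which implements the later Unicode normalization forms (NFC, NFD) rather than the Technical Report #4 algorithm itself, and the helper names `binary_equal` and `canonical_equal` and the sample paths are hypothetical.

```python
import unicodedata

# The two encodings discussed in the message.
PRECOMPOSED = "\u00fc"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
DECOMPOSED = "u\u0308"   # U+0075 followed by U+0308 COMBINING DIAERESIS


def binary_equal(a: str, b: str) -> bool:
    """Strict comparison: the code point sequences must be identical."""
    return a == b


def canonical_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare after normalizing both strings to the same form.

    Assumption: NFC/NFD stand in here for the normalization
    algorithm the message refers to.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # Hypothetical URL paths that differ only in how the u-diaeresis is encoded.
    path_a = "/m" + PRECOMPOSED + "ller"
    path_b = "/m" + DECOMPOSED + "ller"
    print(len(PRECOMPOSED), len(DECOMPOSED))
    print(binary_equal(path_a, path_b))
    print(canonical_equal(path_a, path_b))
```

A server comparing with `binary_equal` would treat the two paths as different resources, while one comparing with `canonical_equal` would treat them as the same. Neither function attempts the per-language comparison mentioned above.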
Received on Wednesday, 31 January 1996 15:14:18 UTC