- From: Misha Wolf <Misha.Wolf@reuters.com>
- Date: Fri, 29 Nov 2002 15:35:20 +0000
- To: www-tag@w3.org
- Cc: xml-names-editor@w3.org, w3c-i18n-ig@w3.org
Hi Tim,

> -----Original Message-----
> From: Tim Bray [mailto:tbray@textuality.com]
> Sent: 29 November 2002 08:14
> To: WWW-Tag
> Subject: Posted draft of URI comparison finding
>
> I just posted, at http://www.textuality.com/tag/uri-comp.html, a
> first cut at some finding language in comparing URIs. I'm in
> Narita running for a plane so this got less proofreading than I
> usually have time for.

[...]

> Simple String Comparison
>
> If two URIs, considered as character strings, are found identical, then
> it is safe to conclude that they are equivalent. This type of
> equivalence test has very low computational cost and is in wide use in a
> variety of applications. The Namespaces in XML recommendation mandates
> this class of comparison when testing two Namespace Names for
> equivalence.

The Namespaces in XML recommendation states:

    [Definition:] URI references which identify namespaces are
    considered identical when they are exactly the same
    character-for-character. Note that URI references which are
    not identical in this sense may in fact be functionally
    equivalent. Examples include URI references which differ only
    in case, or which are in external entities which have different
    effective base URIs.

It does not explain what a "character" is or how one is to establish
whether two strings are "exactly the same character-for-character".

The Namespaces in XML 1.1 WD states:

    [Definition: IRI references which identify namespaces are
    considered identical if and only if they are exactly the same
    character-for-character.] Case differences and escaping
    differences (including case differences in escape sequences)
    are therefore significant. Note that IRI references which are
    not identical in this sense may in fact be functionally
    equivalent. Examples include IRI references which differ only
    in case or escaping, or which are in external entities which
    have different effective base URIs.

It still does not explain what a "character" is or how one is to
establish whether two strings are "exactly the same
character-for-character". It helpfully adds that:

  - Case differences [..] are [...] significant
  - [...] escaping differences are [...] significant
  - Case differences [...] in escape sequences [...] are [...] significant.

It does not, however, explain what an "escape sequence" is, so the 2nd
and 3rd points above are of no use to us.

> Testing strings for equivalence requires some basic precautions. This
> procedure is often referred to as "bit-for-bit" or "byte-for-byte"
> comparison, which is potentially misleading. Testing of strings for
> equality is normally based on pairwise comparison of the characters that
> make up the strings, starting from the first and proceeding until both
> strings are exhausted and all characters found to be equal, or a pair of
> characters compares unequal or one of the strings is exhausted before
> the other.

The above procedure cannot commence until one has decided where to look
for "characters".

> These character comparisons require that each pair of characters be put
> in comparable form.

Indeed.

> Should, for example, one URI be stored in a byte
> array in EBCDIC encoding, and the second be in a Java String object,
> bit-for-bit comparisons applied naively can produce both false-positive
> and false-negative errors. Thus in principle it is better to speak of
> equality on a character-for-character rather than byte-for-byte or
> bit-for-bit basis. In Unicode terminology, this would be properly
> referred to as codepoint-for-codepoint comparison.

Even this is unclear.

[...]

> %-Escaping Issues
>
> It would seem almost wilfully perverse to consider the characters
> represented respectively by %7A and %7a in the example above as
> different.

We cannot judge what is or is not perverse until we have a rigorous
definition.

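To make the distinction between byte-for-byte and codepoint-for-codepoint comparison concrete, here is a minimal Python sketch. It is only an illustration, not anything taken from the draft finding or from either Namespaces text: the example.com URIs, the helper name codepoint_equal, and the choice of cp037 as the EBCDIC code page are all assumptions made for the example.

```python
from urllib.parse import unquote


def codepoint_equal(a: str, b: str) -> bool:
    """Return True only if a and b hold the same code points in the same order."""
    if len(a) != len(b):
        return False
    return all(ord(x) == ord(y) for x, y in zip(a, b))


# Hypothetical URI, used only for illustration.
uri = "http://example.com/a"

# The same characters stored in two different encodings.
utf8_bytes = uri.encode("utf-8")
ebcdic_bytes = uri.encode("cp037")  # cp037 is one EBCDIC code page (an assumption)

# Byte-for-byte comparison reports a difference (a false negative) ...
print(utf8_bytes == ebcdic_bytes)
# ... while comparing the decoded code points does not.
print(codepoint_equal(utf8_bytes.decode("utf-8"), ebcdic_bytes.decode("cp037")))

# The reverse error: one byte value that stands for two different characters.
same_byte = b"\x81"
print(same_byte.decode("cp037"), repr(same_byte.decode("latin-1")))

# Code point comparison treats %7A and %7a as different ...
upper = "http://example.com/%7A"
lower = "http://example.com/%7a"
print(codepoint_equal(upper, lower))
# ... even though both escapes decode to the same character.
print(unquote(upper) == unquote(lower))
```

Python's own == on str values already compares code points; the helper only spells the procedure out. Note that the sketch has to be told which encoding each byte array uses before any comparison can start, which is the "where to look for characters" question raised above.
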
> In fact, since the Namespaces in XML recommendation specifies
> "character-for-character" comparison, it might be argued that since %7A
> and %7a must per RFC2396 represent the same character, XML namespaces
> which differ only in this respect might reasonably be considered equal.

It is not reasonable for the W3C to promote specifications which
require the reader to determine what is reasonable. We need
specifications which tell us clearly what they intend.

In an endeavour to achieve greater clarity, the I18N WG has proposed
[1] that:

1. The value of a namespace name is obtained by applying the steps
   described in:

      3.3.3 Attribute-Value Normalization
      http://www.w3.org/TR/REC-xml#AVNormalize

2. Identity between namespace names is determined by doing a binary
   match on the results of those steps.

3. Consequently, the following *are identical* (where "é" is a
   reference to an entity containing "é"):

      é é é é é

   and the following *differ* from the above and from one another:

      %c3%a9 %C3%a9 %c3%A9 %C3%A9 É

[1] http://lists.w3.org/Archives/Public/www-tag/2002Nov/0088

Regards,
Misha

-----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Get closer to the financial markets with Reuters Messaging - for more
information and to register, visit http://www.reuters.com/messaging

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.
Received on Friday, 29 November 2002 10:35:57 UTC