RE: Posted draft of URI comparison finding from Misha Wolf on 2002-11-29 (www-tag@w3.org from November 2002)

From: Misha Wolf <Misha.Wolf@reuters.com>
Date: Fri, 29 Nov 2002 15:35:20 +0000
To: www-tag@w3.org
Cc: xml-names-editor@w3.org, w3c-i18n-ig@w3.org
Message-Id: <200211291535.KAA11566@tux.w3.org>

Hi Tim,

> -----Original Message-----
> From: Tim Bray [mailto:tbray@textuality.com] 
> Sent: 29 November 2002 08:14
> To: WWW-Tag
> Subject: Posted draft of URI comparison finding
> 
> I just posted, at http://www.textuality.com/tag/uri-comp.html, a 
> first cut at some finding language in comparing URIs.  I'm in 
> Narita running for a plane so this got less proofreading than I 
> usually have time for.

[...]

> Simple String Comparison 
> 
> If two URIs, considered as character strings, are found identical, then
> it is safe to conclude that they are equivalent. This type of
> equivalence test has very low computational cost and is in wide use in a
> variety of applications. The Namespaces in XML recommendation mandates
> this class of comparison when testing two Namespace Names for
> equivalence. 

The Namespaces in XML recommendation states:

   [Definition:] URI references which identify namespaces are considered 
   identical when they are exactly the same character-for-character. 
   Note that URI references which are not identical in this sense may in 
   fact be functionally equivalent. Examples include URI references which 
   differ only in case, or which are in external entities which have 
   different effective base URIs. 

It does not explain what a "character" is or how one is to establish
whether two strings are "exactly the same character-for-character".
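To illustrate why the meaning of "character" matters, here is a minimal
Python sketch. The sample values and the helper names (code_points,
same_code_points, same_after_nfc) are illustrative assumptions, not
anything taken from the Recommendation. It shows two strings that a
reader would see as the same text but that differ when compared code
point by code point:

```python
import unicodedata

# Two spellings of the visible text "é" (hypothetical sample values).
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def code_points(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


def same_code_points(a, b):
    """Python compares str objects code point by code point."""
    return a == b


def same_after_nfc(a, b):
    """Compare after normalizing both strings to Unicode form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(code_points(precomposed), code_points(decomposed))
print(same_code_points(precomposed, decomposed))
print(same_after_nfc(precomposed, decomposed))
```

Whether "character-for-character" is meant in the first sense or the
second is exactly what the text leaves open.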

The Namespaces in XML 1.1 WD states:

   [Definition: IRI references which identify namespaces are considered 
   identical if and only if they are exactly the same character-for-
   character.] Case differences and escaping differences (including case 
   differences in escape sequences) are therefore significant. Note that 
   IRI references which are not identical in this sense may in fact be 
   functionally equivalent. Examples include IRI references which differ 
   only in case or escaping , or which are in external entities which 
   have different effective base URIs. 

It still does not explain what a "character" is or how one is to
establish whether two strings are "exactly the same character-for-
character". It helpfully adds that:

-  Case differences [...] are [...] significant

-  [...] escaping differences are [...] significant

-  Case differences [...] in escape sequences [...] are [...] 
   significant.

It does not, however, explain what an "escape sequence" is, so the 2nd
and 3rd points above are of no use to us.

> Testing strings for equivalence requires some basic precautions. This
> procedure is often referred to as "bit-for-bit" or "byte-for-byte"
> comparison, which is potentially misleading. Testing of strings for
> equality is normally based on pairwise comparison of the characters that
> make up the strings, starting from the first and proceeding until both
> strings are exhausted and all characters found to be equal, or a pair of
> characters compares unequal or one of the strings is exhausted before
> the other. 

The above procedure cannot commence until one has decided where to look
for "characters".

> These character comparisons require that each pair of characters be put
> in comparable form.

Indeed.

> Should, for example, one URI be stored in a byte
> array in EBCDIC encoding, and the second be in a Java String object,
> bit-for-bit comparisons applied naively can produce both false-positive
> and false-negative errors. Thus in principle it is better to speak of
> equality on a character-for-character rather than byte-for-byte or
> bit-for-bit basis. In Unicode terminology, this would be properly
> referred to as codepoint-for-codepoint comparison. 

Even this is unclear.
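The EBCDIC case in the quoted paragraph can be sketched in Python as
follows. This is only an illustration under stated assumptions: the URI
is a made-up example, code page cp037 stands in for "EBCDIC", and the
helper names are hypothetical.

```python
uri = "http://example.org/ns"  # hypothetical URI

ebcdic_bytes = uri.encode("cp037")  # cp037 is one EBCDIC code page
ascii_bytes = uri.encode("ascii")


def same_bytes(a, b):
    """Naive byte-for-byte comparison, ignoring the encodings."""
    return a == b


def same_characters(a, a_encoding, b, b_encoding):
    """Decode both byte strings first, then compare the characters."""
    return a.decode(a_encoding) == b.decode(b_encoding)


# False negative: the same URI, but the bytes differ.
print(same_bytes(ebcdic_bytes, ascii_bytes))
print(same_characters(ebcdic_bytes, "cp037", ascii_bytes, "ascii"))

# False positive: byte 0x81 is "a" in cp037 but U+0081 in ISO 8859-1.
print(same_bytes("a".encode("cp037"), "\x81".encode("latin-1")))
print(same_characters(b"\x81", "cp037", b"\x81", "latin-1"))
```

Decoding to a common form removes the encoding problem, but it does not
by itself say which sequence of code points counts as "the same".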

[...]

> %-Escaping Issues 
> 
> It would seem almost wilfully perverse to consider the characters
> represented respectively by %7A and %7a in the example above as
> different.

We cannot judge what is or is not perverse until we have a rigorous
definition.

> In fact, since the Namespaces in XML recommendation specifies
> "character-for-character" comparison, it might be argued that since %7A
> and %7a must per RFC2396 represent the same character, XML namespaces
> which differ only in this respect might reasonably be considered equal.

It is not reasonable for the W3C to promote specifications which require
the reader to determine what is reasonable.  We need specifications which
tell us clearly what they intend.
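The two readings at issue can be made concrete with a short Python
sketch. The URIs and function names below are hypothetical, and decoding
every escape with urllib.parse.unquote is used only to illustrate the
"same character" reading; it is not offered as a correct URI
normalization.

```python
from urllib.parse import unquote

# Hypothetical namespace names differing only in the case of an escape.
upper = "http://example.org/%7Auser"
lower = "http://example.org/%7auser"


def identical_as_written(a, b):
    """Compare the strings exactly as they appear in the document."""
    return a == b


def identical_after_unescaping(a, b):
    """Compare after decoding %-escapes (illustrative assumption)."""
    return unquote(a) == unquote(b)


print(identical_as_written(upper, lower))
print(identical_after_unescaping(upper, lower))
```

Under the first reading the two names differ; under the second they are
the same. A specification has to say which reading it intends.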

In an endeavour to achieve greater clarity, the I18N WG has proposed [1]
that:

1.  The value of a namespace name is obtained by applying the steps 
    described in:
       3.3.3 Attribute-Value Normalization
       http://www.w3.org/TR/REC-xml#AVNormalize

2.  Identity between namespace names is determined by doing a binary 
    match on the results of those steps.

3.  Consequently, the following *are identical* (where "&eacute;" is a 
    reference to an entity containing "é"):
       é
       &#xe9;
       &#xE9;
       &#233;
       &eacute;
    and the following *differ* from the above and from one another:
       %c3%a9
       %C3%a9
       %c3%A9
       %C3%A9
       É
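The proposal above can be tried out with a small Python sketch. It
assumes that the standard library's expat-based parser performs the
attribute-value normalization steps of section 3.3.3 on namespace
declarations; the base URI, the internal DTD subset, and the helper
names are all made up for the illustration.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # hypothetical base for the namespace names

# The ten spellings from the list above, in the same order.
VARIANTS = [
    "\u00e9", "&#xe9;", "&#xE9;", "&#233;", "&eacute;",
    "%c3%a9", "%C3%a9", "%c3%A9", "%C3%A9", "\u00c9",
]


def build_document(variants):
    """Build a document with one namespace declaration per variant."""
    items = "".join(f'<item xmlns="{BASE}{v}"/>' for v in variants)
    # The internal subset declares the eacute entity used above.
    return f'<!DOCTYPE doc [<!ENTITY eacute "&#233;">]><doc>{items}</doc>'


def namespace_names(xml_text):
    """Return each item's namespace name as reported by the parser."""
    root = ET.fromstring(xml_text)
    # ElementTree reports qualified tags as "{namespace}local".
    return [child.tag[1:].split("}", 1)[0] for child in root]


names = namespace_names(build_document(VARIANTS))
# Step 2 of the proposal: an exact match on the normalized values.
print(len(set(names[:5])), len(set(names)))
```

With this parser the first five names collapse to one value and the
remaining five stay distinct from it and from one another, giving six
distinct namespace names in all.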

[1] http://lists.w3.org/Archives/Public/www-tag/2002Nov/0088

Regards,
Misha



Received on Friday, 29 November 2002 10:35:57 UTC