RE: on "How to Compare Uniform Resource Identifiers" from Williams, Stuart on 2003-01-14 (www-tag@w3.org from January 2003)

From: Williams, Stuart <skw@hplb.hpl.hp.com>
Date: Tue, 14 Jan 2003 12:17:30 -0000
To: "'Dan Connolly'" <connolly@w3.org>
Cc: www-tag@w3.org
Message-ID: <5E13A1874524D411A876006008CD059F04A07213@0-mail-1.hpl.hp.com>
Hi Dan,

Some agreement and disagreements/commentary interwoven below...

> -----Original Message-----
> From: Dan Connolly [mailto:connolly@w3.org]
> Sent: 13 January 2003 20:02
> To: www-tag@w3.org
> Subject: on "How to Compare Uniform Resource Identifiers"
> 
<snip/>

> comments in document order...
> 
> | Such comparisons can have two outcomes, in this document labeled 
> | "equivalent" and "different"."
> 
> er... what about "identical"?
> 
> Also: this suggests that there's just one relationship
> between URIs. I think it's CRITICAL to be 100% clear
> that there are several:
> 
> 	identical, i.e. string-equal
> 	dns-equivalent, e.g. http://www.w3.org/ and http://WWW.W3.ORG/
> 	http-scheme-equivalent,
> 		e.g. http://Example.COM:80/ and http://example.com:80/
> 	cache-hit-likely-equivalent, e.g.
> 		http://example/ and http://example/index.html
> 
> and so on. And the cache-hit-likely-equivalent relation is
> usually parameterized by information that the consumer
> has picked up while interacting with the web; e.g.
> HTTP redirection replies and such.

I had similar concerns, expressed in the first comment in [1]. I like your
example equivalence relations. Does the paragraph I offered in [1] help at
all?
<quote>
"URI are equivalent with respect to some purpose. The strongest equivalence
relation is identity and arises between URI that are the same,
character-by-character. Other equivalence relations arises in a context
dependent way eg. two URI may be equivalent for purpose of retrieving
representations of a resource eg http://example.com and
http://example.com:80, but not for the purpose of naming a namespace. These
other equivalence relations respect the identity relation in that if two URI
are identical they remain equivalent under these other equivalence
relations."
</quote>

[1] http://lists.w3.org/Archives/Public/www-tag/2003Jan/0019.html 
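
One way to picture "equivalent with respect to some purpose" is to treat each
purpose as a key function, with two URI equivalent when their keys match. The
sketch below is only an illustration in Python: the function names are
hypothetical, and the assumption that port 80 is the http default is mine
rather than anything stated in the draft.

```python
from urllib.parse import urlsplit

def identity_key(uri: str) -> str:
    # Strongest relation: the URI is its own key (character-by-character).
    return uri

def http_retrieval_key(uri: str) -> str:
    # Hypothetical "retrieval" purpose: drop an explicit ":80" on http URIs.
    parts = urlsplit(uri)
    netloc = parts.netloc
    if parts.scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    return parts._replace(netloc=netloc).geturl()

def equivalent(a: str, b: str, key) -> bool:
    # Two URI are equivalent for a purpose when that purpose's keys match.
    return key(a) == key(b)

a, b = "http://example.com", "http://example.com:80"
print(equivalent(a, b, identity_key))        # False
print(equivalent(a, b, http_retrieval_key))  # True
```

Any relation built this way respects identity automatically, since identical
strings always produce identical keys.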


<snip/>
> 
> | it is never possible to be sure that they identify
> | different resources.
> 
> yes, it is; see the HTTP last modified example I gave
> prevously. @@

Hmmm.... do you have the pointer - I'd be interested in re-reading the
example. 

Maybe this is one of those "Never say never" cases... although the text
isn't explicitly qualified as "solely on the basis of comparing URI, it is
never possible..." I read that qualification into Tim's assertion. I'm
assuming that the HTTP last modified example uses more than just URI as the
basis of the comparison.

<snip/>

> | RFC2396 defines a URI as a sequence of characters, with the
> | definition of "character" not tied to any particular form of
> | storage; the characters may be stored on disk one byte per
> | character, in a Java string two bytes per character, painted
> | on the side of a bus, or spoken in conversation.
> 
> well said.

+1

<snip/>

> | RFC2396 defines a construct called a "URI reference" which
> | differs syntactically from URIs ...
> 
> The TAG has decided to use the term "URI" to include
> relative URI references. CRITICAL.

Hmmm... my recollection of what we agreed is slightly different.

I think that we agreed the use of the term URI for the absolute form of URI
References; that we did not invent a term for relative forms of URI
references and that the meaning of the term URI Reference was unchanged and
covered both absolute and relative forms and same-document references... see
final paragraph [2,3] and footnote #3 and minutes at [4].

[2] http://www.w3.org/2001/tag/2002/webarch-20021206#identification
[3] http://www.w3.org/TR/webarch/#identification
[4] http://www.w3.org/2002/09/24-tag-summary#archdoc-comments (re: Email
from Dan Connolly).


<snip/>

> | However, an application using this approach could reasonably consider
> | the following two URIs equivalent:
> |
> | example://a/b/c/%7A
> | eXAMPLE://a/b/../x/b/c/%7a
> 
> huh? how do you get that?
> 
> The consumer isn't licensed to conclude that
> example: and eXAMPLE refer to the same scheme,

Hmmm... RFC 2396 Section 6 grants some license to conclude that the scheme
names are equivalent... although it is not clear to me (today:-)) what the
qualification "When a scheme uses elements of the common syntax..." means
ie. what are the elements of the common syntax that a scheme can elect to
use or not?

<quote>
6. URI Normalization and Equivalence

   In many cases, different URI strings may actually identify the
   identical resource. For example, the host names used in URL are
   actually case insensitive, and the URL <http://www.XEROX.com> is
   equivalent to <http://www.xerox.com>. In general, the rules for
   equivalence and definition of a normal form, if any, are scheme
   dependent. When a scheme uses elements of the common syntax, it will
   also use the common syntax equivalence rules, namely that the scheme
   and hostname are case insensitive and a URL with an explicit ":port",
   where the port is the default for the scheme, is equivalent to one
   where the port is elided.
</quote>
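
The three rules named in that excerpt (case-insensitive scheme,
case-insensitive hostname, default port elided) can be sketched as a
comparison key. This is a rough Python illustration, not a statement of what
any scheme actually defines: the DEFAULT_PORTS table and the function name are
assumptions, and whether a given scheme "uses elements of the common syntax"
is exactly the open question above.

```python
from urllib.parse import urlsplit

# Assumed default-port table; a real one would come from scheme definitions.
DEFAULT_PORTS = {"http": 80}

def common_syntax_key(uri: str) -> tuple:
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()   # rule: scheme is case insensitive
    host = parts.hostname           # rule: hostname is case insensitive
                                    # (urlsplit returns it lowercased)
    port = parts.port
    if port is not None and port == DEFAULT_PORTS.get(scheme):
        port = None                 # rule: explicit default port == elided
    # Everything else (userinfo, path, query, fragment) is compared as-is.
    return (scheme, parts.username, parts.password, host, port,
            parts.path, parts.query, parts.fragment)

print(common_syntax_key("http://www.XEROX.com")
      == common_syntax_key("http://www.xerox.com"))    # True
print(common_syntax_key("http://Example.COM:80/")
      == common_syntax_key("http://example.com/"))     # True
```

Note that the path is left untouched, so http://example.com/A and
http://example.com/a still compare as different under this key.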


> nor that %7a and %7A are equivalent,

RFC 2396 Section 2.1 speaks of two mappings "URI Characters -> octets" and
"octets->original character sequence". It calls the second mapping a
character set and indicates that (at present) the charset is established by
external (to RFC2396) agreement. The language of section 2.1 appears to
delegate the 1st mapping to the scheme definition (4th paragraph begins "A
URI scheme may define a mapping from URI characters to octets;"). It may be
the intent of 2396 that there be a single such mapping that schemes
could elect to use, but the language appears to delegate the definition of
the mapping as well.

If I'm reading 2396 correctly, the escaped forms %7a and %7A arise in the
"URI Character" sequence. Whilst I think it is common that, for a given
scheme, the first mapping takes both these sequences to the same octet, the
definition of that mapping appears to be delegated to the scheme, so in
general one can't in fact know that %7a and %7A map to the same octet.
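
For comparison, here is what the conventional mapping does, where %xx is read
as two hexadecimal digits regardless of case. Python's urllib.parse hard-codes
that reading, so the sketch below shows only the common case and says nothing
about what a scheme is permitted to define; the helper name is hypothetical.

```python
from urllib.parse import unquote_to_bytes

def same_octets(a: str, b: str) -> bool:
    # Apply the conventional "URI characters -> octets" mapping to both
    # sequences and compare the resulting octets.
    return unquote_to_bytes(a) == unquote_to_bytes(b)

print(unquote_to_bytes("%7a"))      # b'z'
print(unquote_to_bytes("%7A"))      # b'z'
print(same_octets("%7a", "%7A"))    # True
print("%7a" == "%7A")               # False: the URI characters differ
```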

When I look at the side of a bus, I am left asking whether I'm looking at
"URI characters" or "original characters", even more so if characters
outside the ASCII character set appear in the symbols painted thereon. When
we talk of character-by-character comparison of URI, we also need to be
clear about whether we are talking of "URI Characters" or "original
characters".

I guess that's a somewhat long-winded agreement with Dan.

> nor
> that b/../x/c can be reduced to b/c.

Presumably because 1) one is only licensed to eliminate the "b/../" when
absolutizing (yuk) a relative URI and 2) the example may be broken (or I can't
absolutize things in my head)... if one were to eliminate the "b/.." I think
you'd be left with example://a/x/b/c (but I am flaky on this).
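
The arithmetic can be checked mechanically. The helper below is a simplified,
hypothetical dot-segment remover for absolute paths only. It does not
reproduce every edge case of the RFC 2396 resolution algorithm (trailing "."
handling, for instance), but it is enough for the path in the example.

```python
def remove_dot_segments(path: str) -> str:
    # Walk the "/"-separated segments, dropping "." and letting ".."
    # cancel the segment before it (never popping the leading empty
    # segment that stands for the root).
    out = []
    for seg in path.split("/"):
        if seg == "..":
            if len(out) > 1:
                out.pop()
        elif seg != ".":
            out.append(seg)
    return "/".join(out)

print(remove_dot_segments("/b/../x/b/c/%7a"))   # /x/b/c/%7a
```

So the second URI's path reduces to /x/b/c/%7a rather than /b/c/%7a, which
suggests the draft's example does not show what it was meant to show.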

> Producers should be warned against relying
> on these distinctions, but consumers aren't
> licensed to eliminate them.
> 
> CRITICAL.


> | It would seem almost willfully perverse to consider the
> | data represented respectively by %7A and %7a in the example
> | above as different, since per RFC2396 they must represent
> | the same octet.
> 
> which part of 2396 says that? %xx is just something a provider
> can choose to use as part of a URI for any reason whatsoever,
> the use of it to encode reserved characters is just a common
> use, but not something that's visible to consumers.
> 
> RFC2396 seems to be just broken on this; it says:
> 
> |  An escaped octet is encoded as a character triplet, consisting of the
> |  percent character "%" followed by the two hexadecimal digits
> |  representing the octet code. For example, "%20" is the escaped
> |  encoding for the US-ASCII space character.
> 
> but the US-ASCII space character isn't an octet.

I think this is "URI Character", "octet" and "original character" confusion.
I think the <space> character has to be an "original character" (it's not an
octet and it's not admissible as a "URI character") and that RFC2396 section
2.1 is telling us that "original characters" are mapped into octets (by a
character set established by means outside 2396) and that those octets are
mapped into "URI Characters" by a scheme dependent mapping.
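
The two-step chain described here (original characters to octets via a
charset, then octets to URI characters via escaping) can be sketched in
Python. Two assumptions are baked in: the charset is passed explicitly,
standing in for the "external agreement", and the second step uses urllib's
fixed %XX rule rather than any scheme dependent mapping. The function name is
hypothetical.

```python
from urllib.parse import quote_from_bytes

def to_uri_characters(original: str, charset: str) -> str:
    # Step 1: original characters -> octets, by an externally agreed charset.
    octets = original.encode(charset)
    # Step 2: octets -> URI characters; octets with no directly admissible
    # URI character become %XX triplets.
    return quote_from_bytes(octets, safe="")

print(to_uri_characters(" ", "ascii"))      # %20
print(to_uri_characters("é", "utf-8"))      # %C3%A9
print(to_uri_characters("é", "latin-1"))    # %E9
```

The last two lines show why the charset matters: the same original character
yields different URI characters under different agreements.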

> | Only %-escape characters where required by RFC2396.
> 
> Elsewhere in this document and in RFC2396, %-escaping is
> something done to octets, not to characters.

ie. only %-escape octets that do not have a direct mapping to an admissible
"URI Character" ?

> 
> -- 
> Dan Connolly, W3C http://www.w3.org/People/Connolly/
> 
> 

Regards

Stuart
Received on Tuesday, 14 January 2003 07:17:53 UTC