RE: Draft 2 of "How to Compare URIs"

From: Williams, Stuart <skw@hplb.hpl.hp.com>
Date: Mon, 6 Jan 2003 17:38:09 -0000
Message-ID: <5E13A1874524D411A876006008CD059F04A071DF@0-mail-1.hpl.hp.com>
To: "'Tim Bray'" <tbray@textuality.com>, Stefan Eissing <stefan.eissing@greenbytes.de>
Cc: WWW-Tag <www-tag@w3.org>

Hi Tim,

I have finally gotten round to reading your draft [1]. As Stefan said...
"Excellent read". A few thoughts/comments:

1) I wondered whether the introduction should say more about equivalence.  

It very quickly gets into talking about comparisons and refers to the
outcome of a comparison as "equivalent" or "different". I find myself
thinking of equivalence as a type of relation (reflexive, symmetric,
transitive) between URI (URI References?) and that given some set of URI,
different equivalence relations would partition the set differently. Maybe
something like:

"URI are equivalent with respect to some purpose. The strongest equivalence
relation is identity and arises between URI that are the same,
character-by-character. Other equivalence relations arise in a
context-dependent way, e.g. two URI may be equivalent for the purpose of
retrieving representations of a resource (e.g. http://example.com and
http://example.com:80), but not for the purpose of naming a namespace. These
other equivalence relations respect the identity relation in that if two URI
are identical they remain equivalent under these other equivalence
relations."

2) On the topic of %-escape encoding, which I continue to find confusing
despite the opening sentence in RFC 2396 section 2.1.

RFC 2396 appears to delegate the 'URI character -> octet' mapping to the URI
scheme definition. The 4th paragraph of Sec 2.1 begins:

  "A URI scheme may define a mapping from URI 
   characters to octets; whether this is done 
   depends on the scheme."

Then, regarding the second mapping, 'octets -> original characters', RFC 2396
says: "A charset defines this mapping." It goes on: "However, there is
currently no provision within the generic URI syntax to accomplish this
identification." It then offers possible options, including delegating the
charset default and/or selection mechanism to the URI scheme definition.

The URI Scheme registration template RFC2717 includes a field for "character
encoding consideration". However, on a quick scan of the scheme
registrations referenced from http://www.iana.org/assignments/uri-schemes I
couldn't find any that offered any "character encoding consideration" :-)

However, I think that there is an upside. Even if the first URI character ->
octet mapping is scheme dependent, I think that one can be confident that
for all %xx, for http://example.com/%xx and http://example.com/%xx, the
octet sequences arising from the first mapping will be identical because the
same scheme is in use. It's less clear that the second mapping, the charset
which maps octets to original characters, is going to be the same in all
contexts (like some of the forms examples)... however, in a given context...
http://example.com/%xx will be equivalent to itself (surely!).
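
A small Python sketch of the two mappings, under the assumption that the
scheme maps each %xx escape to the octet with value xx (which is what
urllib.parse.unquote_to_bytes does). The two charsets are chosen only for
illustration; nothing in the URI itself says which one applies.

```python
from urllib.parse import unquote_to_bytes

path = "/%E9"

# First mapping (URI characters -> octets): the same escape always yields
# the same octet, whatever the context.
octets = unquote_to_bytes(path)

# Second mapping (octets -> original characters): depends on a charset
# that the URI does not identify.
as_latin1 = octets.decode("iso-8859-1")
as_cyrillic = octets.decode("iso-8859-5")

print(octets, as_latin1, as_cyrillic)
```

The octets agree in every context, but the 'original characters' only agree
when both sides pick the same charset.
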

3) The three levels of 'URI characters', 'octets' and 'original characters'
discussed in RFC 2396 seem to suggest that an octet-by-octet comparison and
an 'original character-by-original character' comparison (modulo charset
selection issues) of http://example.com/%61 and http://example.com/a would
each make them equivalent, whereas a 'URI character-by-URI character'
comparison would make them different. This leaves me confused about whether,
when we speak of 'character-by-character' comparison, we are speaking of
'URI characters' or 'original characters'. That said, I also struggled with
the terms 'URI character' and 'original character' and may be confused about
them too.
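
The three comparisons can be sketched in Python as follows. The function
names are hypothetical, and unescaping the whole URI string is a
simplification: it ignores the fact that escaping a reserved character such
as "/" can change the meaning of a URI.

```python
from urllib.parse import unquote_to_bytes


def same_uri_characters(a, b):
    # Level 1: compare the URI strings exactly as written.
    return a == b


def same_octets(a, b):
    # Level 2: compare after mapping URI characters (and %xx) to octets.
    return unquote_to_bytes(a) == unquote_to_bytes(b)


def same_original_characters(a, b, charset):
    # Level 3: compare after decoding the octets with an assumed charset.
    return (unquote_to_bytes(a).decode(charset)
            == unquote_to_bytes(b).decode(charset))


u1 = "http://example.com/%61"
u2 = "http://example.com/a"

print(same_uri_characters(u1, u2))                # False
print(same_octets(u1, u2))                        # True
print(same_original_characters(u1, u2, "ascii"))  # True
```
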

Hmmm... not sure any of this is helpful. It (URI equivalence, that is) all
seems much more complicated than it ought to be. I kind of like the
operational notion of equivalence, which I think is where Larry has been
coming from in the past: in some context of use, two URI are
equivalent if one can be substituted for the other and give rise to
equivalent results (effects and side effects).



> -----Original Message-----
> From: Tim Bray [mailto:tbray@textuality.com]
> Sent: 13 December 2002 15:28
> To: Stefan Eissing
> Cc: WWW-Tag
> Subject: Re: Draft 2 of "How to Compare URIs"
> 
> Stefan Eissing wrote:
> > RFC 2396 Ch. 2.1
> > 
> > " In the simplest case, the original character sequence contains only
> > characters that are defined in US-ASCII, and the two levels of mapping
> > are simple and easily invertible: each 'original character' is
> > represented as the octet for the US-ASCII code for it, which is, in
> > turn, represented as either the US-ASCII character, or else the "%"
> > escape sequence for that octet."
> 
> You're saying you read this as "all characters in the ASCII range must
> use the ASCII codepoints for character->octet"?  I guess that's
> plausible, but I had read 2.1 to say "there are many character->octet
> mappings, one of the simplest being that for ASCII characters".  And
> assuming you're right, it still seems like there's a window open here,
> if you're operating in a non-ASCII environment then the char->octet
> mapping is left 100% undefined, so you can't know whether %xx == %xx for
> all %xx > 0x7f. -Tim

Received on Monday, 6 January 2003 12:42:37 UTC