- From: Williams, Stuart <skw@hplb.hpl.hp.com>
- Date: Tue, 14 Jan 2003 12:17:30 -0000
- To: "'Dan Connolly'" <connolly@w3.org>
- Cc: www-tag@w3.org
Hi Dan,

Some agreement and disagreements/commentary interwoven below...

> -----Original Message-----
> From: Dan Connolly [mailto:connolly@w3.org]
> Sent: 13 January 2003 20:02
> To: www-tag@w3.org
> Subject: on "How to Compare Uniform Resource Identifiers"
>
<snip/>
> comments in document order...
>
> | Such comparisons can have two outcomes, in this document labeled
> | "equivalent" and "different"."
>
> er... what about "identical"?
>
> Also: this suggests that there's just one relationship
> between URIs. I think it's CRITICAL to be 100% clear
> that there are several:
>
>   identical, i.e. string-equal
>   dns-equivalent, e.g. http://www.w3.org/ and http://WWW.W3.ORG/
>   http-scheme-equivalent,
>     e.g. http://Example.COM:80/ and http://example.com:80/
>   cache-hit-likely-equivalent, e.g.
>     http://example/ and http://example/index.html
>
> and so on. And the cache-hit-likely-equivalent relation is
> usually parameterized by information that the consumer
> has picked up while interacting with the web; e.g.
> HTTP redirection replies and such.

I expressed similar concerns in the first comment in [1]. I like your example equivalence relations. Does the paragraph I offered in [1] help at all?

<quote>
"URI are equivalent with respect to some purpose. The strongest equivalence relation is identity and arises between URI that are the same, character-by-character. Other equivalence relations arises in a context dependent way eg. two URI may be equivalent for purpose of retrieving representations of a resource eg http://example.com and http://example.com:80, but not for the purpose of naming a namespace. These other equivalence relations respect the identity relation in that if two URI are identical they remain equivalent under these other equivalence relations."
</quote>

[1] http://lists.w3.org/Archives/Public/www-tag/2003Jan/0019.html

<snip/>
>
> | it is never possible to be sure that they identify
> | different resources.
>
> yes, it is; see the HTTP last modified example I gave
> prevously. @@

Hmmm.... do you have the pointer? I'd be interested in re-reading the example. Maybe this is one of those "Never say never" cases... although the text isn't explicitly qualified as "solely on the basis of comparing URI, it is never possible...", I read that qualification into Tim's assertion. I'm assuming that the HTTP last modified example uses more than just URI as the basis of the comparison.

<snip/>
> | RFC2396 defines a URI as a sequence of characters, with the
> | definition of "character" not tied to any particular form of
> | storage; the characters may be stored on disk one byte per
> | character, in a Java string two bytes per character, painted
> | on the side of a bus, or spoken in conversation.
>
> well said.

+1

<snip/>
> | RFC2396 defines a construct called a "URI reference" which
> | differs syntactically from URIs ...
>
> The TAG has decided to use the term "URI" to include
> relative URI references. CRITICAL.

Hmmm... my recollection of what we agreed is slightly different. I think that we agreed on the use of the term URI for the absolute form of URI References; that we did not invent a term for relative forms of URI references; and that the meaning of the term URI Reference was unchanged and covered both absolute and relative forms and same-document references... see final paragraph [2,3] and footnote #3 and minutes at [4].

[2] http://www.w3.org/2001/tag/2002/webarch-20021206#identification
[3] http://www.w3.org/TR/webarch/#identification
[4] http://www.w3.org/2002/09/24-tag-summary#archdoc-comments (re: Email from Dan Connolly).

<snip/>
> | However, an application using this approach could reasonably consider
> | the following two URIs equivalent:
> |
> |   example://a/b/c/%7A
> |   eXAMPLE://a/b/../x/b/c/%7a
>
> huh? how do you get that?
>
> The consumer isn't licensed to conclude that
> example: and eXAMPLE refer to the same scheme,

Hmmm...
RFC 2396 Section 6 grants some license to conclude that the scheme names are equivalent... although it is not clear to me (today:-)) what the qualification "When a scheme uses elements of the common syntax..." means, i.e. what are the elements of the common syntax that a scheme can elect to use or not?

<quote>
6. URI Normalization and Equivalence

In many cases, different URI strings may actually identify the identical resource. For example, the host names used in URL are actually case insensitive, and the URL <http://www.XEROX.com> is equivalent to <http://www.xerox.com>. In general, the rules for equivalence and definition of a normal form, if any, are scheme dependent. When a scheme uses elements of the common syntax, it will also use the common syntax equivalence rules, namely that the scheme and hostname are case insensitive and a URL with an explicit ":port", where the port is the default for the scheme, is equivalent to one where the port is elided.
</quote>

> nor that %7a and %7A are equivalent,

RFC 2396 Section 2.1 speaks of two mappings, "URI Characters -> octets" and "octets->original character sequence". It calls the second mapping a character set and indicates that (at present) the charset is established by external (to RFC2396) agreement. The language of section 2.1 appears to delegate the first mapping to the scheme definition (4th paragraph begins "A URI scheme may define a mapping from URI characters to octets;"). It may be that 2396 intends there to be a single such mapping that schemes could elect to use, but the language appears to delegate the definition of the mapping as well.
If I'm reading 2396 correctly, the escaped forms %7a and %7A arise in the "URI Character" sequence. Whilst I think it is common that, for a given scheme, the first mapping will map both these sequences to the same octet, the definition of that mapping appears to be delegated to the scheme, so in general one can't in fact know that %7a and %7A map to the same octet.

When I look at the side of a bus, I am left asking whether I'm looking at "URI characters" or "original characters", even more so if characters outside the ASCII character set appear in the symbols painted thereon.

When we talk of character-by-character comparison of URI, we also need to be clear about whether we are talking of "URI Characters" or "original characters".

I guess that's a somewhat long-winded agreement with Dan.

> nor
> that b/../x/c can be reduced to b/c.

Presumably because 1) one is only licensed to eliminate the "b/../" when absolutizing (yuk) a relative URI and 2) the example may be broken (or I can't absolutize things in my head)... if one were to eliminate the "b/.." I think you'd be left with example://a/x/b/c (but I am flaky on this).

> Producers should be warned against relying
> on these distinctions, but consumers aren't
> licensed to eliminate them.
>
> CRITICAL.

> | It would seem almost willfully perverse to consider the
> | data represented respectively by %7A and %7a in the example
> | above as different, since per RFC2396 they must represent
> | the same octet.
>
> which part of 2396 says that? %xx is just something a provider
> can choose to use as part of a URI for any reason whatsoever,
> the use of it to encode reserved characters is just a common
> use, but not something that's visible to consumers.
>
> RFC2396 seems to be just broken on this; it says:
>
> | An escaped octet is encoded as a character triplet, consisting of the
> | percent character "%" followed by the two hexadecimal digits
> | representing the octet code.
> | For example, "%20" is the escaped
> | encoding for the US-ASCII space character.
>
> but the US-ASCII space character isn't an octet.

I think this is "URI Character", "octet" and "original character" confusion. I think the <space> character has to be an "original character" (it's not an octet and it's not admissible as a "URI character") and that RFC2396 section 2.1 is telling us that "original characters" are mapped into octets (by a character set established by means outside 2396) and that those octets are mapped into "URI Characters" by a scheme-dependent mapping.

> | Only %-escape characters where required by RFC2396.
>
> Elsewhere in this document and in RFC2396, %-escaping is
> something done to octets, not to characters.

i.e. only %-escape octets that do not have a direct mapping to an admissible "URI Character"?

>
> --
> Dan Connolly, W3C http://www.w3.org/People/Connolly/
>
>

Regards

Stuart
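
For anyone wanting to experiment with the distinctions argued over in this thread, here is a minimal Python sketch. It is not something proposed in the message above or defined by RFC 2396; the function names, the `DEFAULT_PORTS` table and the "aggressive" relation are hypothetical, chosen only to illustrate that "identical", "equivalent under the Section 6 common-syntax rules" and "equivalent after further normalization" are separate relations. The aggressive step (collapsing dot segments and upper-casing %xx hex digits) is precisely the one the thread suggests a consumer may not be licensed to apply.

```python
import re
from urllib.parse import urlsplit

# Assumption for illustration only: a tiny table of scheme default ports.
DEFAULT_PORTS = {"http": 80}


def identical(a: str, b: str) -> bool:
    """Strongest relation: character-by-character equality."""
    return a == b


def common_syntax_key(uri: str) -> tuple:
    """Apply only the RFC 2396 Section 6 rules: scheme and host are case
    insensitive, and an explicit default port counts as elided.
    Userinfo is ignored in this sketch."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port == DEFAULT_PORTS.get(scheme):
        port = None
    return (scheme, host, port, parts.path, parts.query, parts.fragment)


def remove_dot_segments(path: str) -> str:
    """Collapse "." and ".." segments of an absolute path (one starting
    with "/"), as is done when resolving a relative reference."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == "..":
            if len(out) > 1:      # never pop the leading empty segment
                out.pop()
            if is_last:
                out.append("")    # keep the trailing slash
        elif seg == ".":
            if is_last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def aggressive_key(uri: str) -> tuple:
    """Hypothetical looser relation: also collapse dot segments and
    upper-case the hex digits of %xx escapes in the path."""
    scheme, host, port, path, query, fragment = common_syntax_key(uri)
    path = remove_dot_segments(path)
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)
    return (scheme, host, port, path, query, fragment)


def relations(a: str, b: str) -> dict:
    """Report which of the three sketched relations hold between a and b."""
    return {
        "identical": identical(a, b),
        "common_syntax": common_syntax_key(a) == common_syntax_key(b),
        "aggressive": aggressive_key(a) == aggressive_key(b),
    }


print(relations("http://www.w3.org/", "http://WWW.W3.ORG/"))
print(relations("example://a/b/c/%7A", "eXAMPLE://a/b/../x/b/c/%7a"))
```

Under these assumptions the first pair is not identical but is equivalent under the common-syntax rules, while the second pair (the one from the document under review) is not equivalent under any of the three relations: collapsing "b/.." in /b/../x/b/c/%7a leaves /x/b/c/%7a, which is consistent with the suspicion above that the example is broken.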
Received on Tuesday, 14 January 2003 07:17:53 UTC