- From: Dan Connolly <connolly@w3.org>
- Date: Mon, 13 Jan 2003 14:02:21 -0600
- To: www-tag@w3.org
With apologies for taking so long, here are my review comments on

  How to Compare Uniform Resource Identifiers
  Author: Tim Bray
  http://www.textuality.com/tag/uri-comp-2.html
  Last-Modified: Fri, 13 Dec 2002 08:17:45 GMT

It certainly addresses my issue about "codepoint by codepoint" and
such from earlier drafts, and the good practice bit is really good.

I think there are some critical, though minor, bugs, and a number
of editorial nits. Unfortunately these comments are still a bit
rambly...

comments in document order...

| Such comparisons can have two outcomes, in this document labeled
| "equivalent" and "different".

er... what about "identical"?

Also: this suggests that there's just one relationship between
URIs. I think it's CRITICAL to be 100% clear that there are several:

  identical, i.e. string-equal

  dns-equivalent, e.g.
    http://www.w3.org/ and
    http://WWW.W3.ORG/

  http-scheme-equivalent, e.g.
    http://Example.COM:80/ and
    http://example.com:80/

  cache-hit-likely-equivalent, e.g.
    http://example/ and
    http://example/index.html

and so on. And the cache-hit-likely-equivalent relation is usually
parameterized by information that the consumer has picked up
while interacting with the web; e.g. HTTP redirection replies
and such.

| For these reasons, determination of equivalence or difference must
| be based on string comparison

that doesn't follow.

| one or more RFCs

why RFC? you mean specifications, no?

| it is never possible to be sure that they identify
| different resources.

yes, it is; see the HTTP last modified example I gave previously. @@

| the present document cannot really be understood without
| reference to that RFC

er... then why no link? [editorial]

| RFC2396 defines a URI as a sequence of characters, with the
| definition of "character" not tied to any particular form of
| storage; the characters may be stored on disk one byte per
| character, in a Java string two bytes per character, painted
| on the side of a bus, or spoken in conversation.

well said.
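
To make the point about several distinct relations concrete, here is a
minimal sketch in Python of three of the relations listed earlier. The
component-splitting regular expression is the one given in RFC 2396,
Appendix B; everything else (the function names, the reading of
"http-scheme-equivalent" as "host case folded and port 80 treated the
same as no port") is an assumption made for illustration, not something
taken from the draft under review. It does not handle IPv6 literals.

```python
import re

# Component-splitting expression from RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split(uri):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def identical(a, b):
    """String-equal: the only relation needing no outside knowledge."""
    return a == b


def _fold_host(authority):
    # Lowercase only the host[:port] part; userinfo keeps its case.
    userinfo, at, hostport = authority.rpartition("@")
    return userinfo + at + hostport.lower()


def dns_equivalent(a, b):
    """Equal once the DNS host name is case-folded; all else string-equal."""
    sa, auth_a, *rest_a = split(a)
    sb, auth_b, *rest_b = split(b)
    if auth_a is None or auth_b is None:
        return a == b
    return sa == sb and _fold_host(auth_a) == _fold_host(auth_b) and rest_a == rest_b


def _http_key(uri):
    scheme, authority, path, query, fragment = split(uri)
    if scheme != "http" or authority is None:
        return None
    userinfo, _, hostport = authority.rpartition("@")
    host, _, port = hostport.partition(":")
    # Assumption: an absent or empty port means the http default, 80.
    return (userinfo, host.lower(), port or "80", path, query, fragment)


def http_scheme_equivalent(a, b):
    """Hypothetical http-specific relation; falls back to string equality."""
    ka, kb = _http_key(a), _http_key(b)
    if ka is None or kb is None:
        return a == b
    return ka == kb
```

Each function answers a different question, and a pair of URIs can
satisfy one relation without satisfying a stricter one; e.g.
identical("http://www.w3.org/", "http://WWW.W3.ORG/") is false while
dns_equivalent() on the same pair is true. The cache-hit-likely
relation is left out because it depends on state the consumer has
gathered at run time.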

| RFC2396 specifies that every URI has a "scheme", a leading
| sequence of characters delimited by a colon character :

The scheme is one thing; the sequence of characters is a name
for that thing, no? Well, this draft does use "scheme" to refer
to the character sequence consistently...

| certain parts of HTTP URIs (but not others) are meant to
| be processed case-insensitively.

hmm... I wouldn't put it that way... Their semantics is grounded
in a case-insensitive namespace (dns). So yes, DNS servers need
to process them case-insensitively. But most of this URI how-to
is talking about client processing, so this seems misleading.

| RFC2396 defines a construct called a "URI reference" which
| differs syntactically from URIs ...

The TAG has decided to use the term "URI" to include
relative URI references. CRITICAL.

| It is generally impossible to compare relative URI
| references correctly.

what does "correctly" refer to here? You can strcmp()
URI references just fine. The result might not be very
relevant to life as we know it, but it's just fine
for, say, XPath's string-compare() function.

| Applications may choose to perform comparison operations on either the
| base URIs or the references including fragment identifiers.

another example, please.

| However, an application using this approach could reasonably consider
| the following two URIs equivalent:
|
| example://a/b/c/%7A
| eXAMPLE://a/b/../x/b/c/%7a

huh? how do you get that? The consumer isn't licensed
to conclude that example: and eXAMPLE refer to the same
scheme, nor that %7a and %7A are equivalent, nor
that b/../x/c can be reduced to b/c. Producers should
be warned against relying on these distinctions, but
consumers aren't licensed to eliminate them. CRITICAL.

| It would seem almost willfully perverse to consider the
| data represented respectively by %7A and %7a in the example
| above as different, since per RFC2396 they must represent
| the same octet.

which part of 2396 says that?
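
For concreteness, here is a small Python sketch of what a consumer
sees when it compares the two example URIs as they are quoted in this
message. The helper names fold_scheme() and fold_escapes() are
hypothetical; each one is an explicit, opt-in transformation that the
consumer would have to justify separately, which is the point at issue.

```python
import re

A = "example://a/b/c/%7A"
B = "eXAMPLE://a/b/../x/b/c/%7a"


def fold_scheme(uri):
    """Lowercase a leading scheme name, if there is one (an assumption,
    not something plain string comparison licenses)."""
    m = re.match(r"[A-Za-z][A-Za-z0-9+.-]*:", uri)
    if m is None:
        return uri  # relative reference: nothing to fold
    return m.group(0).lower() + uri[m.end():]


def fold_escapes(uri):
    """Uppercase the hex digits of every %xx triplet (again an assumption)."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), uri)


# Plain string comparison, the strcmp() view: the two are different.
print(A == B)

# Each fold removes exactly one distinction and nothing else.
print(fold_scheme(B))
print(fold_escapes(B))

# Even with both folds applied, the strings as quoted here still differ
# in their path component.
print(fold_escapes(fold_scheme(A)) == fold_escapes(fold_scheme(B)))
```

Keeping the folds as separate functions makes it visible which
assumption each comparison depends on, rather than hiding them all
behind a single "equivalent" answer.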

%xx is just something a provider can choose to use as
part of a URI for any reason whatsoever; the use of it
to encode reserved characters is just a common use,
but not something that's visible to consumers.

RFC2396 seems to be just broken on this; it says:

| An escaped octet is encoded as a character triplet, consisting of the
| percent character "%" followed by the two hexadecimal digits
| representing the octet code. For example, "%20" is the escaped
| encoding for the US-ASCII space character.

but the US-ASCII space character isn't an octet.

| Only %-escape characters where required by RFC2396.

Elsewhere in this document and in RFC2396, %-escaping is
something done to octets, not to characters.

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Monday, 13 January 2003 15:02:41 UTC