on "How to Compare Uniform Resource Identifiers" from Dan Connolly on 2003-01-13 (www-tag@w3.org from January 2003)

From: Dan Connolly <connolly@w3.org>
Date: Mon, 13 Jan 2003 14:02:21 -0600
To: www-tag@w3.org
Message-id: <1042488140.3997.61.camel@dirk.dm93.org>
With apologies for taking so long, here are
my review comments on

  How to Compare Uniform Resource Identifiers
  Author: Tim Bray
  http://www.textuality.com/tag/uri-comp-2.html
  Last-Modified: Fri, 13 Dec 2002 08:17:45 GMT

It certainly addresses my issue about "codepoint by
codepoint" and such from earlier drafts, and
the good practice bit is really good.

I think there are some critical, though minor,
bugs, and a number of editorial nits.

Unfortunately these comments are still a bit
rambly...

comments in document order...

| Such comparisons can have two outcomes, in this document labeled 
| "equivalent" and "different".

er... what about "identical"?

Also: this suggests that there's just one relationship
between URIs. I think it's CRITICAL to be 100% clear
that there are several:

	identical, i.e. string-equal
	dns-equivalent, e.g. http://www.w3.org/ and http://WWW.W3.ORG/
	http-scheme-equivalent,
		e.g. http://Example.COM:80/ and http://example.com:80/
	cache-hit-likely-equivalent, e.g.
		http://example/ and http://example/index.html

and so on. And the cache-hit-likely-equivalent relation is
usually parameterized by information that the consumer
has picked up while interacting with the web; e.g.
HTTP redirection replies and such.
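
To make the distinction concrete, here is a rough Python sketch of the first
three relations as separate predicates. The function names are mine, not from
the draft or any spec, and the port-80 default in the last one is my reading
of what "http-scheme-equivalent" should add on top of DNS case folding. Treat
it as an illustration of "several relations, each weaker than the last", not
as a definitive algorithm.

```python
from urllib.parse import urlsplit


def identical(a: str, b: str) -> bool:
    """String-equal: the strictest relation, no interpretation at all."""
    return a == b


def dns_equivalent(a: str, b: str) -> bool:
    """Equal once the host name alone is compared case-insensitively."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (
        # Compare the scheme as written; urlsplit() would lowercase it.
        a.partition(":")[0] == b.partition(":")[0]
        # urlsplit() exposes the host name already lowercased.
        and pa.hostname == pb.hostname
        and (pa.username, pa.password, pa.port)
        == (pb.username, pb.password, pb.port)
        and (pa.path, pa.query, pa.fragment)
        == (pb.path, pb.query, pb.fragment)
    )


def http_scheme_equivalent(a: str, b: str) -> bool:
    """Hypothetical http-specific relation: host case folding plus the
    assumption that an absent port means port 80."""
    if a.partition(":")[0] != "http" or b.partition(":")[0] != "http":
        return False
    pa, pb = urlsplit(a), urlsplit(b)
    return (
        pa.hostname == pb.hostname
        and (pa.port or 80) == (pb.port or 80)
        and (pa.username, pa.password) == (pb.username, pb.password)
        and (pa.path, pa.query, pa.fragment)
        == (pb.path, pb.query, pb.fragment)
    )
```

With these, http://www.w3.org/ and http://WWW.W3.ORG/ are not identical but
are dns-equivalent, while http://example.com/ and http://example.com:80/ are
only http-scheme-equivalent. The cache-hit-likely relation is left out on
purpose, since it depends on state picked up at run time.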


| For these reasons, determination of equivalence or difference must
| be based on string comparison

that doesn't follow.


| one or more RFCs

why RFC? you mean specifications, no?


| it is never possible to be sure that they identify
| different resources.

yes, it is; see the HTTP last modified example I gave
previously. @@

| the present document cannot really be understood without
| reference to that RFC

er... then why no link? [editorial]

| RFC2396 defines a URI as a sequence of characters, with the
| definition of "character" not tied to any particular form of
| storage; the characters may be stored on disk one byte per
| character, in a Java string two bytes per character, painted
| on the side of a bus, or spoken in conversation.

well said.

| RFC2396 specifies that every URI has a "scheme", a leading
| sequence of characters delimited by a colon character :

The scheme is one thing; the sequence of characters is a
name for that thing, no? Well, this draft does
use "scheme" to refer to the character sequence
consistently...

| certain parts of HTTP URIs (but not others) are meant to
| be processed case-insensitively.

hmm... I wouldn't put it that way...
Their semantics is grounded in a case-insensitive namespace
(DNS). So yes, DNS servers need to process them
case-insensitively. But most of this URI how-to is talking
about client processing, so this seems misleading.

| RFC2396 defines a construct called a "URI reference" which
| differs syntactically from URIs ...

The TAG has decided to use the term "URI" to include
relative URI references. CRITICAL.

| It is generally impossible to compare relative URI
| references correctly.

what does "correctly" refer to here? You can strcmp()
URI references just fine. The result might not be
very relevant to life as we know it, but it's just
fine for, say, XPath's string-compare() function.
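
A small Python sketch of that point: string comparison of two relative
references is perfectly well defined, it just answers a different question
than comparing what they resolve to. The base URIs and the reference below
are made up for illustration.

```python
from urllib.parse import urljoin

# Two relative references, found in two different documents.
ref_a = "../img/logo.png"
ref_b = "../img/logo.png"

# Hypothetical base URIs of the documents the references appeared in.
base_1 = "http://example.org/a/b/page.html"
base_2 = "http://example.org/x/y/page.html"

# The strcmp()-style comparison works and says "equal".
same_string = ref_a == ref_b

# Resolving against the respective bases gives different absolute URIs.
abs_1 = urljoin(base_1, ref_a)
abs_2 = urljoin(base_2, ref_b)
same_after_resolution = abs_1 == abs_2
```

Here same_string is True and same_after_resolution is False. Neither result
is "incorrect"; they are answers to different comparisons.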

| Applications may choose to perform comparison operations on either the
| base URIs or the references including fragment identifiers.

another example, please.


| However, an application using this approach could reasonably consider
| the following two URIs equivalent:
|
| example://a/b/c/%7A
| eXAMPLE://a/b/../x/b/c/%7a

huh? how do you get that?

The consumer isn't licensed to conclude that
example: and eXAMPLE: refer to the same scheme,
nor that %7a and %7A are equivalent, nor
that b/../x/c can be reduced to b/c.

Producers should be warned against relying
on these distinctions, but consumers aren't
licensed to eliminate them.
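
For what it's worth, here is a quick Python check of the quoted pair. It
uses posixpath.normpath as a stand-in for dot-segment removal, which is an
assumption on my part (it is a file-path routine, not a URI one), but for
this simple absolute path it collapses "b/.." the way one would expect.

```python
import posixpath
from urllib.parse import urlsplit

u1 = "example://a/b/c/%7A"
u2 = "eXAMPLE://a/b/../x/b/c/%7a"

# A conservative consumer compares the strings as given.
conservative_equal = u1 == u2

# Even a consumer that chose to collapse dot segments would get a
# different path: "/b/.." cancels out and "/x/b/c/%7a" remains.
path_1 = urlsplit(u1).path
path_2_collapsed = posixpath.normpath(urlsplit(u2).path)
```

conservative_equal is False, and path_2_collapsed comes out as /x/b/c/%7a,
which differs from /b/c/%7A by more than the case of the hex digits.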

CRITICAL.

| It would seem almost willfully perverse to consider the
| data represented respectively by %7A and %7a in the example
| above as different, since per RFC2396 they must represent
| the same octet.

which part of 2396 says that? %xx is just something a provider
can choose to use as part of a URI for any reason whatsoever.
Using it to encode reserved characters is just a common
use, but not something that's visible to consumers.

RFC2396 seems to be just broken on this; it says:

|  An escaped octet is encoded as a character triplet, consisting of the
|  percent character "%" followed by the two hexadecimal digits
|  representing the octet code. For example, "%20" is the escaped
|  encoding for the US-ASCII space character.

but the US-ASCII space character isn't an octet.

| Only %-escape characters where required by RFC2396.

Elsewhere in this document and in RFC2396, %-escaping is
something done to octets, not to characters.
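
The octet/character distinction is easy to see in Python, where the two are
different types. In this sketch (the choice of %E9 and of the two encodings
is mine, purely for illustration), an escape decodes to an octet, and the
octet only becomes a character once somebody picks a character encoding.

```python
from urllib.parse import unquote_to_bytes

# "%20" unescapes to one octet with value 0x20 ...
space_octet = unquote_to_bytes("%20")

# ... which is the space character only under an encoding such as US-ASCII.
space_char = space_octet.decode("ascii")

# The same holds for any octet: "%E9" is just the octet 0xE9.
octet = unquote_to_bytes("%E9")

# Read as ISO-8859-1 it is a character; read as UTF-8 it is not valid.
as_latin1 = octet.decode("latin-1")
try:
    octet.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

So "%-escape characters" only makes sense once a mapping from characters to
octets has been chosen, and nothing in the escape itself records that choice.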



-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Monday, 13 January 2003 15:02:41 UTC