- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 05 Dec 2002 08:50:05 +0900
- To: Tim Bray <tbray@textuality.com>, WWW-Tag <www-tag@w3.org>
Hello Tim,
At 00:13 02/11/29 -0800, Tim Bray wrote:
>I just posted, at http://www.textuality.com/tag/uri-comp.html, a first cut
>at some finding language in comparing URIs. I'm in Narita running for a
>plane so this got less proofreading than I usually have time for.
Thanks for your effort to write these things down.
Some comments that I haven't yet seen from others:
- Your doc should say that it applies equally well to IRIs as it
does to URIs (because it does).
- 'Software is commonly required to': Does this mean 'Software has a need to'
or 'Software is needed to'?
- "Since the world contains many characters useful in identifying resources
beyond those in US-ASCII, and since the special characters such as ':' and
'/' are also often useful, RFC2396 provides a mechanism for "%-escaping"
such characters; they are represented as a sequence of 2-digit hexadecimal
codes, each representing the value of one byte and preceded by the percent
sign '%'."
This assumes 1 character == 1 byte and a direct character -> %hh mapping,
which is clearly not the case: a character is first converted to a sequence
of one or more bytes, and each byte is then escaped as %hh. See section 2.1 of
http://www.ietf.org/rfc/rfc2396.txt.
(This is one of the very few places where the explanation is a bit different
for IRIs.)
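A minimal sketch of the character -> bytes -> %hh chain in Python, using the standard urllib.parse module. The sample character and the two encodings are chosen only for illustration and do not come from the draft finding:

```python
from urllib.parse import quote

# Character -> bytes (depends on the chosen encoding) -> one %hh per byte.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, a single character

utf8_form = quote(ch, encoding="utf-8")      # two bytes, so two escapes
latin1_form = quote(ch, encoding="latin-1")  # one byte, so one escape

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9
```

The same single character yields two different escaped forms, so the escape sequence identifies bytes, not characters.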
- RFC 2395 (one occurrence) -> RFC 2396
- * example://a/b/c/d/%7A
* eXAMPLE://a/b/../x/b/c/%7a
These two would not be equivalent even under RFC 2396 rules, because of
the /d in the first one but not in the second one.
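A rough way to check this, sketched in Python. The helper name `comparable` is hypothetical, and posixpath.normpath is only an approximation of dot-segment removal, not the RFC 2396 algorithm:

```python
import posixpath
from urllib.parse import urlsplit, unquote_to_bytes


def comparable(uri: str):
    """Hypothetical helper: reduce a URI to a rough comparison key."""
    parts = urlsplit(uri)  # urlsplit lowercases the scheme
    # Approximation: collapse "." and ".." segments with normpath.
    path = posixpath.normpath(parts.path)
    # Compare the bytes behind the escapes, so %7A and %7a agree.
    # Simplification: this would also merge an escaped %2F with a literal "/".
    return (parts.scheme, parts.netloc.lower(), unquote_to_bytes(path))


first = comparable("example://a/b/c/d/%7A")
second = comparable("eXAMPLE://a/b/../x/b/c/%7a")

print(first)            # ('example', 'a', b'/b/c/d/z')
print(second)           # ('example', 'a', b'/x/b/c/z')
print(first == second)  # False
```

Scheme case and escape case are both normalized away here, and the two paths still differ.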
- "It would seem almost wilfully perverse to consider the characters
represented respectively by %7A and %7a in the example above as different."
One can certainly argue about the stylistic merit of 'almost willfully
(spelling) perverse'. But that's not my point. The sentence assumes that
%7A and %7a represent a character, whereas in a URI (see again section 2.1
of http://www.ietf.org/rfc/rfc2396.txt) 'z', '%7A', and '%7a' are three
different ways to represent the byte <7a>, which in turn in most cases
(but this is not guaranteed) represents the character 'z'.
- "Another example:
* http://a/b/
* http://%61/b/
Such software might consider these equivalent, since %61 encodes the
character 'a' in both ASCII and UTF-8, but context becomes significant.
RFC2396 does not constrain the character encoding scheme of URIs; if the
original document were encoded in EBCDIC, or the URIs were sourced from two
different documents whose original encoding was not known, there is a
(slim) chance of a false-positive in finding these equivalent."
This is very clearly and completely wrong. %61 and 'a' in a URI are
ALWAYS equivalent (when looking at %hh-escaping-equivalence). EBCDIC
(or any other encoding) does not come into play at all here. There are
two places where EBCDIC can come into play:
1) the URI is represented as EBCDIC (e.g. if you read this mail on
an IBM mainframe). In that case, both 'a' and '%61' would be
represented in EBCDIC, but they would still be equivalent.
2) The resource is e.g. actually on an EBCDIC-based file system,
and the server exposes EBCDIC-based resource names directly.
Then both the 'a' and the '%61' would stand for a '/' (*)
(see e.g. http://www.egrannie.com/cheatsheets/asciiebcdic.html
for the actual table), or if there is an actual 'a' in the
resource name, it would have to be represented as %81.
[(*) that / would be a non-reserved one, i.e. a part of a
path component]
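Both points can be sketched in Python. The cp037 codec is assumed here as a representative EBCDIC code page; an actual server might use a different one:

```python
from urllib.parse import unquote_to_bytes

# Escaping equivalence is decided on bytes, before any charset is involved.
literal = unquote_to_bytes("a")
escaped = unquote_to_bytes("%61")
print(literal == escaped)  # True, both are the single byte 0x61

# Case 2: a server that reads that byte as EBCDIC (cp037 is an assumption).
as_ebcdic = escaped.decode("cp037")
print(as_ebcdic)           # /

# An actual 'a' in an EBCDIC resource name is the byte 0x81, i.e. %81.
ebcdic_a = "a".encode("cp037")
print(ebcdic_a)            # b'\x81'
```

Whatever the server does with the byte, 'a' and '%61' deliver the same byte to it, so they cannot be told apart.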
- "This is reasonable behavior based on the rules provided by RFC 2616,
which defines HTTP.": It may be worth mentioning that RFC 2616 also
defines the http: URI scheme; please see section 3.2.2 of
http://www.ietf.org/rfc/rfc2616.txt.
- A point which is very important to mention is that software
transporting URIs should avoid any changes in URIs, unless it has
very, very good and specific reasons to do so. This will avoid
false negatives under any kind of equivalence.
- "Web Robots, which are at pains to reduce the incidence of false negatives"
'are at pains' sounds colloquial and therefore difficult to understand
world-wide. Maybe 'try very hard'?
Regards, Martin.
Received on Wednesday, 4 December 2002 18:50:31 UTC