Re: Posted draft of URI comparison finding from Martin Duerst on 2002-12-04 (www-tag@w3.org from December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 05 Dec 2002 08:50:05 +0900
To: Tim Bray <tbray@textuality.com>, WWW-Tag <www-tag@w3.org>
Message-Id: <4.2.0.58.J.20021205080903.04f8b730@localhost>
Hello Tim,

At 00:13 02/11/29 -0800, Tim Bray wrote:

>I just posted, at http://www.textuality.com/tag/uri-comp.html, a first cut 
>at some finding language in comparing URIs.  I'm in Narita running for a 
>plane so this got less proofreading than I usually have time for.

Thanks for your effort to write these things down.
Some comments that I haven't yet seen from others:

- Your doc should say that it applies equally well to IRIs as it
   does to URIs (because it does).

- 'Software is commonly required to': Does this mean 'Software has a need to'
   or 'Software is needed to'?

- "Since the world contains many characters useful in identifying resources
    beyond those in US-ASCII, and since the special characters such as ':' and
    '/' are also often useful, RFC2396 provides a mechanism for "%-escaping"
    such characters; they are represented as a sequence of 2-digit hexadecimal
    codes, each representing the value of one byte and preceded by the percent
    sign '%'."

   This assumes 1 character == 1 byte, and a direct character -> %hh mapping,
   which is clearly not the case: a character is first converted to one or
   more bytes, and each byte is then escaped as %hh. See section 2.1 of
   http://www.ietf.org/rfc/rfc2396.txt.
   (This is one of the very few places where the explanation is a bit different
    for IRIs.)
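
   A minimal sketch of the character -> bytes -> %hh chain, in Python with
   only the standard library. The sample character and the two encodings are
   chosen purely for illustration; RFC 2396 itself does not fix the encoding.

```python
from urllib.parse import quote

ch = "\u00e9"  # 'é': one character outside US-ASCII (illustrative choice)

# Step 1: character -> bytes. The result depends on the character encoding.
utf8_bytes = ch.encode("utf-8")      # two bytes
latin1_bytes = ch.encode("latin-1")  # one byte

# Step 2: each byte -> %hh. quote() performs both steps for the given encoding.
utf8_escaped = quote(ch, encoding="utf-8")
latin1_escaped = quote(ch, encoding="latin-1")

print(len(utf8_bytes), utf8_escaped)      # 2 %C3%A9
print(len(latin1_bytes), latin1_escaped)  # 1 %E9
```

   The same character yields a different number of %hh triplets depending on
   the encoding, so there is no direct character -> %hh mapping.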

- RFC 2395 (one occurrence) -> RFC 2396

- * example://a/b/c/d/%7A
   * eXAMPLE://a/b/../x/b/c/%7a

   These two would not be equivalent even under RFC 2396 rules, because of
   the /d in the first one but not in the second one.
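
   This can be checked with a rough sketch. The helper names below are
   hypothetical, and the dot-segment handling is a simplification that covers
   only absolute paths like the ones in this example, not the full relative
   resolution rules of RFC 2396.

```python
import re


def remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' in an absolute path (simplified sketch)."""
    out = []
    for seg in path.split("/"):
        if seg == ".":
            continue
        if seg == "..":
            # Drop the previous segment, but keep the leading empty one
            # so the result still starts with '/'.
            if len(out) > 1:
                out.pop()
            continue
        out.append(seg)
    return "/".join(out)


def upper_escapes(path: str) -> str:
    """Uppercase the hex digits of every %hh triplet."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), path)


# Path parts of the two example URIs (scheme and authority 'a' left out).
first = upper_escapes(remove_dot_segments("/b/c/d/%7A"))
second = upper_escapes(remove_dot_segments("/b/../x/b/c/%7a"))

print(first)            # /b/c/d/%7A
print(second)           # /x/b/c/%7A
print(first == second)  # False
```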

- "It would seem almost wilfully perverse to consider the characters
    represented respectively by %7A and %7a in the example above as different."

   One can certainly argue about the stylistic merit of 'almost willfully
   (spelling) perverse'. But that's not my point. The sentence assumes that
   %7A and %7a represent a character, where in actual fact in a URI (see again
   section 2.1 of http://www.ietf.org/rfc/rfc2396.txt) 'z', '%7A', and '%7a'
   are three different ways to represent the byte <7a>, which in turn in most
   cases (but this is not guaranteed) represents the character 'z'.
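
   A small sketch of this, using Python's standard urllib.parse. The final
   decoding step assumes US-ASCII, which is exactly the part the URI syntax
   does not guarantee.

```python
from urllib.parse import unquote_to_bytes

forms = ["z", "%7A", "%7a"]

# All three spellings decode to the same single byte, 0x7a.
decoded = [unquote_to_bytes(f) for f in forms]
print(decoded)  # [b'z', b'z', b'z']

# Going from the byte to a character needs an encoding assumption
# (US-ASCII here, chosen for illustration).
as_ascii = decoded[0].decode("ascii")
print(as_ascii)  # z
```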

- "Another example:

     * http://a/b/
     * http://%61/b/

    Such software might consider these equivalent, since %61 encodes the
    character 'a' in both ASCII and UTF-8, but context becomes significant.
    RFC2396 does not constrain the character encoding scheme of URIs; if the
    original document were encoded in EBCDIC, or the URIs were sourced from two
    different documents whose original encoding was not known, there is a
    (slim) chance of a false-positive in finding these equivalent."

    This is very clearly and completely wrong. %61 and 'a' in a URI are
    ALWAYS equivalent (when looking at %hh-escaping-equivalence). EBCDIC
    (or any other encoding) doesn't come into play at all here. There are
    two places where EBCDIC can come into play:
    1) the URI is represented as EBCDIC (e.g. if you read this mail on
       an IBM mainframe). In that case, both 'a' and '%61' would be
       represented in EBCDIC, but they would still be equivalent.
    2) The resource is e.g. actually on an EBCDIC-based file system,
       and the server exposes EBCDIC-based resource names directly.
       Then both the 'a' and the '%61' would stand for a '/' (*)
       (see e.g. http://www.egrannie.com/cheatsheets/asciiebcdic.html
        for the actual table), or if there is an actual 'a' in the
        resource name, it would have to be represented as %81.
       [(*) that / would be a non-reserved one, i.e. a part of a
        path component]
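
    Both points can be sketched in Python. The codec name cp037 is my
    assumption for "EBCDIC" here (one common EBCDIC code page, available in
    Python's standard codecs); other EBCDIC variants may differ in places.

```python
from urllib.parse import unquote_to_bytes

# At the URI level, 'a' and '%61' denote the same byte, whatever the
# encoding in which the URI text itself happens to be stored.
a_literal = unquote_to_bytes("a")
a_escaped = unquote_to_bytes("%61")
same_byte = a_literal == a_escaped == b"\x61"
print(same_byte)  # True

# Case 2: a server that interprets URI bytes as EBCDIC resource names.
# Byte 0x61 then means '/', and a real 'a' is byte 0x81, i.e. %81.
ebcdic_meaning = a_escaped.decode("cp037")
ebcdic_a_escaped = "".join("%{:02X}".format(b) for b in "a".encode("cp037"))
print(ebcdic_meaning)    # /
print(ebcdic_a_escaped)  # %81
```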

- "This is reasonable behavior based on the rules provided by RFC 2616,
    which defines HTTP.": It may be worth mentioning that RFC 2616 also
    defines the http: URI scheme; please see
    http://www.ietf.org/rfc/rfc2616.txt, section 3.2.2.

- A point which is very important to mention is that software
   transporting URIs should avoid any changes in URIs, unless it has
   very, very good and specific reasons to do so. This will avoid
   false negatives under any kind of equivalence.

- "Web Robots, which are at pains to reduce the incidence of false negatives"
   'are at pains' sounds colloquial and therefore difficult to understand
   world-wide. Maybe 'try very hard'?


Regards,    Martin.
Received on Wednesday, 4 December 2002 18:50:31 UTC