Re: Posted draft of URI comparison finding from Tim Bray on 2002-12-11 (www-tag@w3.org from December 2002)

From: Tim Bray <tbray@textuality.com>
Date: Wed, 11 Dec 2002 09:30:00 -0800
To: Martin Duerst <duerst@w3.org>
Cc: WWW-Tag <www-tag@w3.org>
Message-ID: <3DF77618.30304@textuality.com>

Martin Duerst wrote:

> - "Since the world contains many characters useful in identifying resources
>    beyond those in US-ASCII, and since the special characters such as ':' and
>    '/' are also often useful, RFC2396 provides a mechanism for "%-escaping" such
>    characters; they are represented as a sequence of 2-digit hexadecimal codes,
>    each representing the value of one byte and preceded by the percent sign '%'."
> 
>   This assumes 1 character == 1 byte, and a direct character -> %hh mapping,
>   which is clearly not the case. See section 2.1 of
>   http://www.ietf.org/rfc/rfc2396.txt.

I just did (for the 87th time).  I will reword slightly to point out 
that octets represent characters and %-escapes represent octets.
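
To make the two-step mapping concrete, here is a minimal Python sketch. The helper name percent_escape is hypothetical, it escapes every octet (including ones RFC2396 would leave alone), and the UTF-8 default is an assumption, since RFC2396 does not fix the character->octet mapping.

```python
def percent_escape(text: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: escape every octet of `text` as %HH."""
    octets = text.encode(encoding)               # step 1: characters -> octets
    return "".join(f"%{b:02X}" for b in octets)  # step 2: octets -> %-escapes


one_octet = percent_escape("a")        # one character, one octet
two_octets = percent_escape("\u00e9")  # e-acute: one character, two octets in UTF-8
print(one_octet, two_octets)
```

The second call is the case the original wording missed: one character does not have to mean one %hh escape.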

> - "It would seem almost wilfully perverse to consider the characters represented
>    respectively by %7A and %7a in the example above as different."
> 
>   One can certainly argue about the stylistic merit of 'almost willfully (spelling)
>   perverse'. But that's not my point. The sentence assumes that %7A and %7a
>   represent a character,

Right, I'll fix that.
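
A small sketch of the corrected claim, using Python's urllib.parse.unquote_to_bytes: the two escapes differ only in the case of a hex digit, so they denote the same octet. What character that octet stands for is a separate question.

```python
from urllib.parse import unquote_to_bytes

upper = unquote_to_bytes("%7A")  # hex digits in upper case
lower = unquote_to_bytes("%7a")  # same digits in lower case

# Both escapes decode to the identical single octet 0x7A.
print(upper == lower)
```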

> - "Another example:
> 
>     * http://a/b/
>     * http://%61/b/
> 
>    Such software might consider these equivalent, since %61 encodes the
>    character 'a' in both ASCII and UTF-8, but context becomes significant.
>    RFC2396 does not constrain the character encoding scheme of URIs; if the
>    original document were encoded in EBCDIC, or the URIs were sourced from two
>    different documents whose original encoding was not known, there is a (slim)
>    chance of a false-positive in finding these equivalent."
> 
>    This is very clearly and completely wrong. %61 and 'a' in an URI are
>    ALWAYS equivalent (when looking at %hh-escaping-equivalence).

I'm having trouble here.  Section 2.1 is terribly fuzzy on this, and
says essentially nothing useful about the character->octet mapping,
giving UTF-8 as an interesting example.  By my reading, if that 'a' were 
encoded in EBCDIC in my instance, then RFC2396 wouldn't stop me from 
encoding that as %81.  Now, I've never seen this happen, but the point 
is that assuming ASCII or UTF-8 is just that: an assumption.

>   There are
>    two places where EBCDIC can come into play:
>    1) the URI is represented as EBCDIC (e.g. if you read this mail on
>       an IBM mainframe). In that case, both 'a' and '%61' would be
>       represented in EBCDIC, but they would still be equivalent.

Why couldn't 'a' be represented as %81?  And if I imported the URI with 
this encoding from that system, it's quite possible that the EBCDIC and 
ASCII versions of http://example.com/%81/ are in fact different.  -Tim

>    2) The resource is e.g. actually on an EBCDIC-based file system,
>       and the server exposes EBCDIC-based resource names directly.
>       Then both the 'a' and the '%61' would stand for a '/' (*)
>       (see e.g. http://www.egrannie.com/cheatsheets/asciiebcdic.html
>        for the actual table), or if there is an actual 'a' in the
>        resource name, it would have to be represented as %81.
>       [(*) that / would be a non-reserved one, i.e. a part of a
>        path component]

Right, so I can't be sure that %81 is the same as %81, depending on 
where they come from.  Or what am I missing?
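
Here is a rough sketch of what I mean, in Python. It assumes code page 037 (Python's "cp037" codec) as a stand-in for "EBCDIC", which is my choice for illustration and not something RFC2396 or Martin's note specifies; Latin-1 is likewise just one example of another possible reading of the octet 0x81.

```python
from urllib.parse import unquote_to_bytes

octet_61 = unquote_to_bytes("%61")  # the single octet 0x61
octet_81 = unquote_to_bytes("%81")  # the single octet 0x81

# The same octet read under different character encodings.
as_ascii = octet_61.decode("ascii")        # ASCII reading of 0x61
as_ebcdic_61 = octet_61.decode("cp037")    # EBCDIC (cp037) reading of 0x61
as_ebcdic_81 = octet_81.decode("cp037")    # EBCDIC (cp037) reading of 0x81
as_latin1_81 = octet_81.decode("latin-1")  # Latin-1 reading of 0x81 (a control code)

print(repr(as_ascii), repr(as_ebcdic_61), repr(as_ebcdic_81), repr(as_latin1_81))
```

The octets compare equal every time; it is only the octet->character step that depends on where the URI came from.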

> 
> - "This is reasonable behavior based on the rules provided by RFC 2616,
>    which defines HTTP.": It may be worth mentioning that rfc 2616 also
>    defines the http: URI scheme, please see
>    http://www.ietf.org/rfc/rfc2616.txt, section 3.2.2

Right.

> - A point which is very important to mention is that software
>   transporting URIs should avoid any changes in URIs, unless it has
>   very, very good and specific reasons to do so. This will avoid
>   false negatives under any kind of equivalence.

Right.

> - "Web Robots, which are at pains to reduce the incidence of false negatives"
>   'are at pains' sounds colloquial and therefore difficult to understand
>   world-wide. Maybe 'try very hard'?

Sigh.  "At pains" is formal and perhaps a bit old-fashioned rather than 
colloquial.  But OK. -Tim
Received on Wednesday, 11 December 2002 15:10:52 UTC