Re: "How to Compare URIs" update 3 from Martin Duerst on 2003-02-24 (www-tag@w3.org from February 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 24 Feb 2003 15:45:35 -0500
To: Tim Bray <tbray@textuality.com>
Cc: WWW-Tag <www-tag@w3.org>, uri@w3.org
Message-Id: <4.2.0.58.J.20030224150100.051f81c8@localhost>
At 15:59 03/02/22 -0800, Tim Bray wrote:
>Martin Duerst wrote:
>
>>Ah, now I see where your confusion is coming from.
>>The characters in a URI (the ones that are compared character-by-
>>character in namespaces) are just that, characters. URIs are
>>defined independent of any particular representation. The URI
>>spec says that /dir/a and /dir/%61 are equivalent, independent
>>of the representation. They are equivalent if they appear in
>>ASCII. They are equivalent if they appear on paper, on the
>>side of a bus, and so on. They are equivalent when spoken
>>over the radio. And they are equivalent when encoded as UTF-16
>>(as your Java example shows) or in EBCDIC.
>
>No, 'a' and %61 are *not* equivalent in an EBCDIC environment.  I just 
>don't see where, in RFC2396, it says that the hex-encoding is necessarily 
>that of the ASCII value of the character.

A character is (conceptually) never encoded directly into a
%-escape; there are always octets in between.
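
As a minimal sketch of that two-step view (character to octets, then
octets to %-escapes), assuming Python's "cp037" codec as a stand-in for
"EBCDIC" (there are many EBCDIC code pages) and with purely illustrative
function names:

```python
def percent_escape_octets(octets: bytes) -> str:
    """Step 2: each octet becomes '%' plus two hexadecimal digits."""
    return "".join(f"%{octet:02X}" for octet in octets)


def escape_char(ch: str, charset: str) -> str:
    """Step 1 then step 2: a character never reaches '%XX' directly."""
    octets = ch.encode(charset)   # character -> octets (charset-dependent)
    return percent_escape_octets(octets)  # octets -> escapes (fixed rule)


print(escape_char("a", "ascii"))  # octet 0x61
print(escape_char("a", "cp037"))  # octet 0x81 (cp037 is an assumed code page)
```

The only charset-dependent step is the first one; the escape itself is
just a spelling of the octet.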


>I repeat: if I'm on an EBCDIC computer, and the URI reads out as /dir/a, 
>that is *different* from /dir/%61.  Yes, this is egregiously broken and 
>stupid, but it's within the bounds set by RFC2396.

I agree that it may not be extremely clear. But I disagree that your
interpretation is within the bounds of RFC 2396. For example, in
"2. URI Characters and Escape Sequences", we have:

 >>>>>>>>
    Within a URI, characters are either used as delimiters, or to
    represent strings of data (octets) within the delimited portions.
    Octets are either represented directly by a character (using the US-
    ASCII character for that octet [ASCII]) or by an escape encoding.
    This representation is elaborated below.
 >>>>>>>>

Now let's take your example, "/dir/a". Let's assume that's a directory
name 'dir' and a file name 'a' on a computer that uses EBCDIC.
We don't have to care about the '/' here, because this is a separator
that is part of the URI syntax, independent of local usage (see e.g.
MSWin).

So now let's look at how the EBCDIC server exposes 'dir' and 'a'.
It can either decide to expose them as EBCDIC (which makes server
implementation easier) or to expose them as ASCII (which makes the
URI more readable).


If the server on the EBCDIC system decides to expose as EBCDIC,
then this will give us the following octets:

     /<84><89><99>/<81>

This then results in a URI of /%84%89%99/%81. There is no other
choice, as we have in "2.4.1. Escaped Encoding"

    An escaped octet is encoded as a character triplet, consisting of the
    percent character "%" followed by the two hexadecimal digits
    representing the octet code.

(Well, you could claim that instead of %84, it may also be %48, because
the RFC doesn't say in which order the digits go, but I hope you don't
want to go there.) For an example that is a bit different, let's
say '/d+r/a', we would get /<84><4E><99>/<81> in terms of octets,
and then /%84N%99/%81 in the actual URI (because the RFC clearly
says that the octet <4E> is encoded with US-ASCII, which results in
an 'N'). We could also use /%84%4E%99/%81.
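
The EBCDIC-exposure case can be sketched roughly as follows. This is an
illustration only: "cp037" is assumed as the EBCDIC code page,
expose_segment is a made-up name, and the set of characters allowed to
appear directly is the "unreserved" set from RFC 2396 section 2.3.

```python
# Unreserved characters per RFC 2396 section 2.3: alphanumerics plus marks.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)


def expose_segment(name: str, charset: str, escape_all: bool = False) -> str:
    """Turn a local name into one URI path segment.

    The name is first converted to octets in the server's chosen charset.
    Each octet is then written either as the US-ASCII character for that
    octet (if that character is unreserved) or as a %XX escape.
    """
    out = []
    for octet in name.encode(charset):
        ascii_char = chr(octet)
        if not escape_all and ascii_char in UNRESERVED:
            out.append(ascii_char)          # octet shown as its US-ASCII character
        else:
            out.append(f"%{octet:02X}")     # octet shown as an escape
    return "".join(out)


print("/" + expose_segment("dir", "cp037") + "/" + expose_segment("a", "cp037"))
print("/" + expose_segment("d+r", "cp037") + "/" + expose_segment("a", "cp037"))
print("/" + expose_segment("d+r", "cp037", escape_all=True))
```

Note that the 'N' in the second line of output is not the letter N of
the file name; it is simply the US-ASCII spelling of the octet that the
server's charset assigned to '+'.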

The other alternative is to expose the resource as US-ASCII,
i.e. have the conversion work being done on the server. In that
case, we have /<64><69><72>/<61>, which trivially results in
/dir/a. It could of course also result in /dir/%61, because
%61 is the escape for octet <61>. Please remember that it says:

 >>>>>>>>
    Octets are either represented directly by a character (using the US-
    ASCII character for that octet [ASCII]) or by an escape encoding.
    This representation is elaborated below.
 >>>>>>>>
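
Read that way, equivalence can be checked on the octets rather than on
the spelling. A rough sketch, assuming Python's standard
urllib.parse.unquote_to_bytes and splitting on '/' first so that an
escaped delimiter is not confused with a real one (same_path and
path_octets are illustrative names, not anything from the RFC):

```python
from urllib.parse import unquote_to_bytes


def path_octets(path: str) -> list:
    """Split on the '/' delimiter, then reduce each segment to its octets."""
    return [unquote_to_bytes(segment) for segment in path.split("/")]


def same_path(left: str, right: str) -> bool:
    """Two paths are treated as equivalent if they carry the same octets."""
    return path_octets(left) == path_octets(right)


print(same_path("/dir/a", "/dir/%61"))        # same octets, different spelling
print(same_path("/dir/a", "/%84%89%99/%81"))  # different octets
```

Nothing in this comparison depends on the machine it runs on, which is
the point: 'a' and %61 both stand for octet <61>.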

So overall, the server can choose how to expose a
resource name as a series of octets. But it cannot
expose the resource name as one octet when the octet is escaped
and as another octet when it is not escaped.
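
Seen from the server side, this means the segment is first reduced to
octets, and only then interpreted in whatever single charset the server
chose. A small sketch under the same assumptions as before ("cp037"
standing in for EBCDIC, segment_name a hypothetical helper):

```python
from urllib.parse import unquote_to_bytes


def segment_name(segment: str, server_charset: str) -> str:
    """Map a URI path segment back to the server's local name.

    Escaped and unescaped octets are treated identically: the escaping
    is undone first, and the charset is applied to the whole octet string.
    """
    octets = unquote_to_bytes(segment)
    return octets.decode(server_charset)


# EBCDIC exposure: both spellings carry the same octets, hence the same name.
print(segment_name("%84N%99", "cp037"))
print(segment_name("%84%4E%99", "cp037"))

# US-ASCII exposure: 'a' and '%61' are the same octet, hence the same name.
print(segment_name("a", "ascii") == segment_name("%61", "ascii"))
```

Whichever charset is passed in, the escaped and unescaped spellings of
an octet always give the same result; only the choice of charset
changes what that result is.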


>>RFC 2396 gives three levels, condensed in the following line:
>
>Actually, the problem is that RFC2396 is just hopelessly unclear.

It's pretty unclear, I definitely agree with that, but it's not
totally hopeless.


>The fact that you and I are unable to agree on what it says is 
>incontrovertible proof of this fact.  May I argue for a brief truce?

Ok.


>As part of the revision process, I'm working on an essay whose subject is 
>what RFC2396bis *should* say on the subject of data and characters and 
>octets and %-escaping, so that we don't have to have these endless 
>arguments.  Frankly I would rather not waste any more time arguing about 
>what the current revision of 2396 says.  -Tim

Well, if your proposals result in some clarification, and are reasonably
within existing mainstream practice, then that's a good thing.
I'm definitely looking forward to your proposal, and I'm glad to
help. Is that different from the "How to Compare URIs" doc?
If you can include the diagram in my last mail, I think that
will help.

Regards,    Martin.
Received on Monday, 24 February 2003 15:46:12 UTC