- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 24 Feb 2003 15:45:35 -0500
- To: Tim Bray <tbray@textuality.com>
- Cc: WWW-Tag <www-tag@w3.org>, uri@w3.org
At 15:59 03/02/22 -0800, Tim Bray wrote:
>Martin Duerst wrote:
>
>>Ah, now I see where your confusion is coming from.
>>The characters in a URI (the ones that are compared
>>character-by-character in namespaces) are just that, characters.
>>URIs are defined independent of any particular representation.
>>The URI spec says that /dir/a and /dir/%61 are equivalent,
>>independent of the representation. They are equivalent if they
>>appear in ASCII. They are equivalent if they appear on paper, on
>>the side of a bus, and so on. They are equivalent when spoken
>>over the radio. And they are equivalent when encoded as UTF-16
>>(as your Java example shows) or in EBCDIC.
>
>No, 'a' and %61 are *not* equivalent in an EBCDIC environment. I just
>don't see where, in RFC2396, it says that the hex-encoding is necessarily
>that of the ASCII value of the character.

A character (conceptually) never gets directly encoded into a
%-escaping; you always have octets in the middle.

>I repeat: if I'm on an EBCDIC computer, and the URI reads out as /dir/a,
>that is *different* from /dir/%61. Yes, this is egregiously broken and
>stupid, but it's within the bounds set by RFC2396.

I agree that it may not be extremely clear. But I disagree that
your interpretation is within the bounds of RFC 2396. For example,
in "2. URI Characters and Escape Sequences", we have:

>>>>>>>>
   Within a URI, characters are either used as delimiters, or to
   represent strings of data (octets) within the delimited portions.
   Octets are either represented directly by a character (using the
   US-ASCII character for that octet [ASCII]) or by an escape encoding.
   This representation is elaborated below.
>>>>>>>>

Now let's take your example, "/dir/a". Let's assume that's a
directory name 'dir' and a file name 'a' on a computer that uses
EBCDIC. We don't have to care about the '/' here, because this
is a separator that is part of the URI syntax, independent of
local usage (see e.g. MSWin).
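To make the quoted rule concrete, here is a minimal Python sketch of the URI-to-octets direction. The function name `uri_to_octets` is hypothetical, and for simplicity it maps the whole string, including the '/' delimiters, to octets. It assumes only what the quoted paragraph says: a %-escape stands for the octet named by its two hex digits, and any other character stands for its US-ASCII octet.

```python
def uri_to_octets(uri: str) -> bytes:
    """Map the characters of a URI to octets (sketch of RFC 2396, section 2).

    A "%XX" triplet yields the octet 0xXX; every other character yields
    its US-ASCII code. Nothing here depends on how the URI string itself
    is stored (ASCII, UTF-16, EBCDIC, paper, ...).
    """
    out = bytearray()
    i = 0
    while i < len(uri):
        if uri[i] == "%":
            digits = uri[i + 1:i + 3]
            if len(digits) != 2:
                raise ValueError("truncated escape in " + repr(uri))
            out.append(int(digits, 16))      # escaped octet
            i += 3
        else:
            out += uri[i].encode("ascii")    # octet represented directly
            i += 1
    return bytes(out)


# Both spellings denote the same octet sequence.
same = uri_to_octets("/dir/a") == uri_to_octets("/dir/%61")   # True
```

Under this reading, 'a' and %61 both denote octet <61> no matter what character set the local machine uses.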
So now let's look at how the EBCDIC server exposes 'dir' and 'a'.
It can either decide to expose them as EBCDIC (which makes server
implementation easier) or to expose them as ASCII (which makes
the URI more readable).

If the server on the EBCDIC system decides to expose as EBCDIC,
then this will give us the following octets:

/<84><89><99>/<81>

This then results in a URI of /%84%89%99/%81. There is no other
choice, as we have in "2.4.1. Escaped Encoding":

   An escaped octet is encoded as a character triplet, consisting of
   the percent character "%" followed by the two hexadecimal digits
   representing the octet code.

(Well, you could claim that instead of %84, it may also be %48,
because the RFC doesn't say which order the digits go, but I hope
you don't want to go there.)

For an example that is a bit different, let's say '/d+r/a', we
would get /<84><4E><99>/<81> in terms of octets, and then
/%84N%99/%81 in the actual URI (because the RFC clearly says
that the octet <4E> is encoded with US-ASCII, which results in
an 'N'). We could also use /%84%4E%99/%81.

The other alternative is to expose the resource as US-ASCII,
i.e. have the conversion work being done on the server. In that
case, we have /<64><69><72>/<61>, which trivially results in
/dir/a. It could of course also result in /dir/%61, because %61
is the escape for octet <61>. Please remember that it says:

>>>>>>>>
   Octets are either represented directly by a character (using the
   US-ASCII character for that octet [ASCII]) or by an escape encoding.
   This representation is elaborated below.
>>>>>>>>

So overall, the server can make the choice of how to expose a
resource name as a series of octets. But it doesn't have a choice
to expose the resource name as one octet if the octet is escaped,
and as another octet if the octet is not escaped.

>>RFC 2396 gives three levels, condensed in the following line:
>
>Actually, the problem is that RFC2396 is just hopelessly unclear.
It's pretty unclear, I definitely agree with that, but it's not
totally hopeless.

>The fact that you and I are unable to agree on what it says is
>incontrovertible proof of this fact. May I argue for a brief truce?

Ok.

>As part of the revision process, I'm working on an essay whose subject is
>what RFC2396bis *should* say on the subject of data and characters and
>octets and %-escaping, so that we don't have to have these endless
>arguments. Frankly I would rather not waste any more time arguing about
>what the current revision of 2396 says. -Tim

Well, if your proposals result in some clarification, and are
reasonably within existing mainstream practice, then that's
a good thing. I'm definitely looking forward to your proposal,
and I'm glad to help. Is that different from the "How to Compare
URIs" doc? If you can include the diagram in my last mail, I
think that will help.

Regards,    Martin.
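
As an illustration of the two server choices discussed above (expose names as EBCDIC octets or as US-ASCII octets), here is a minimal Python sketch. The helper names `octets_to_segment` and `expose` are hypothetical, and Python's `cp037` codec is used as a stand-in for "EBCDIC", which is an assumption since there are many EBCDIC code pages. An octet is written directly only when its US-ASCII character is unreserved in RFC 2396; everything else is %-escaped.

```python
import string

# RFC 2396 "unreserved" characters (alphanum | mark), as octet values.
UNRESERVED = set((string.ascii_letters + string.digits + "-_.!~*'()").encode("ascii"))


def octets_to_segment(octets: bytes) -> str:
    """Write each octet as its US-ASCII character or as a %XX escape."""
    return "".join(
        chr(o) if o in UNRESERVED else "%{:02X}".format(o)
        for o in octets
    )


def expose(names, codec):
    """Build a URI path from local names; `codec` is the server's choice
    of how names become octets. The '/' separators are URI syntax and
    are never passed through the codec."""
    return "/" + "/".join(octets_to_segment(n.encode(codec)) for n in names)


ebcdic_uri = expose(["dir", "a"], "cp037")   # "/%84%89%99/%81"
ascii_uri = expose(["dir", "a"], "ascii")    # "/dir/a"
plus_uri = expose(["d+r", "a"], "cp037")     # "/%84N%99/%81"
```

The codec is chosen once, before escaping. Whether an octet is later written directly or as an escape does not change which octet it is.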
Received on Monday, 24 February 2003 15:46:13 UTC