Re: "How to Compare URIs" update 3 from Martin Duerst on 2003-02-22 (www-tag@w3.org from February 2003)

From: Martin Duerst <duerst@w3.org>
Date: Sat, 22 Feb 2003 15:35:54 -0500
To: Tim Bray <tbray@textuality.com>
Cc: WWW-Tag <www-tag@w3.org>, uri@w3.org
Message-Id: <4.2.0.58.J.20030221180511.04dec3d0@localhost>

At 15:01 03/02/21 -0800, Tim Bray wrote:
>Martin Duerst wrote:

>>   I therefore suggest that the section on %-escaping be the first
>>   section of RFC2396-Sensitive Comparison, and that it nails down
>>   that %-equivalence (e.g. ~ == %7E == %7e) is (as currently observed)
>>   and be (to be specified by specifications) the minimal (baseline)
>>   equivalence to be respected by resolution-related operations
>>   (as opposed to namespace-like equivalences). (with the exception
>>   of reserved characters)
>
>Except for, in EBCDIC, %61 is '/', a reserved character; so on an EBCDIC 
>computer, what you see on the screen as /dir/a and /dir/%61 reference 
>different URIs.

Ah, now I see where your confusion is comming from.
The characters in an URI (the ones that are compared character-by-
character in namespaces) are just that, characters. URIs are
defined independent of any particular representation. The URI
spec says that /dir/a and /dir/%61 are equivalent, independent
of the representation. They are equivalent if they appear in
ASCII. They are equivalent if they appear on paper, on the
side of a bus, and so on. They are equivalent when spoken
over the radio. And they are equivalent when encoded as UTF-16
(as your Java example shows) or in EBCDIC.

RFC 2396 gives three levels, condensed in the following line:

URI character sequence->octet sequence->original character sequence

In practice, there are two more layers, one on each side.
We then get:

a) substrate: paper, metal, audio waves, ascii, UTF-16, EBCDIC,...
    We don't want to limit that to a particular encoding.
    ^
    |   conversion depending on substrate representation
    V
b) URI character sequence (just characters)
    ^
    |   conversion defined by RFC 2396 (always US-ASCII!)
    V
c) octet sequence (just octets)
    ^
    |   conversion currently scheme/server dependent, moving towards UTF-8
    V
d) original character sequence (file names on server, query strings,...)
    ^
    |   conversion server-dependent
    V
e) original octet sequence (e.g. UTF-16 for a filename on WinNT, EBCDIC
                             on an EBCDIC server, and so on)

Maybe this diagram should go into the new version of RFC 2396.

>And if you were writing EBCDIC software, you'd have to be careful.

Yes, converting from 'a' to '%61' needs an ASCII table, which is
implicit on an ASCII-based architecture. But apart from that,
there is probably not much difference.

>  Personally, if I were rewriting 2396 I would simply decree that all 
> %-escaping be done on the UTF-8 mapping and only the UTF-8 mapping,

I would be the first to agree with that. Unfortunately, it's not
as easy as that, because there is some legacy out there.

>and be forbidden unless it's compulsory. But that's not what 2396 
>currently says.

Exactly.

>>   I'd be glad to help with some actual text.
>
>Please. -Tim

Will do. Can you give me few hours?

Regards,   Martin.

Received on Saturday, 22 February 2003 15:36:06 UTC