- From: Martin Duerst <duerst@w3.org>
- Date: Sat, 22 Feb 2003 15:35:54 -0500
- To: Tim Bray <tbray@textuality.com>
- Cc: WWW-Tag <www-tag@w3.org>, uri@w3.org
At 15:01 03/02/21 -0800, Tim Bray wrote: >Martin Duerst wrote: >> I therefore suggest that the section on %-escaping be the first >> section of RFC2396-Sensitive Comparison, and that it nails down >> that %-equivalence (e.g. ~ == %7E == %7e) is (as currently observed) >> and be (to be specified by specifications) the minimal (baseline) >> equivalence to be respected by resolution-related operations >> (as opposed to namespace-like equivalences). (with the exception >> of reserved characters) > >Except for, in EBCDIC, %61 is '/', a reserved character; so on an EBCDIC >computer, what you see on the screen as /dir/a and /dir/%61 reference >different URIs. Ah, now I see where your confusion is comming from. The characters in an URI (the ones that are compared character-by- character in namespaces) are just that, characters. URIs are defined independent of any particular representation. The URI spec says that /dir/a and /dir/%61 are equivalent, independent of the representation. They are equivalent if they appear in ASCII. They are equivalent if they appear on paper, on the side of a bus, and so on. They are equivalent when spoken over the radio. And they are equivalent when encoded as UTF-16 (as your Java example shows) or in EBCDIC. RFC 2396 gives three levels, condensed in the following line: URI character sequence->octet sequence->original character sequence In practice, there are two more layers, one on each side. We then get: a) substrate: paper, metal, audio waves, ascii, UTF-16, EBCDIC,... We don't want to limit that to a particular encoding. ^ | conversion depending on substrate representation V b) URI character sequence (just characters) ^ | conversion defined by RFC 2396 (always US-ASCII!) V c) octet sequence (just octets) ^ | conversion currently scheme/server dependent, moving towards UTF-8 V d) original character sequence (file names on server, query strings,...) ^ | conversion server-dependent V e) original octet sequence (e.g. UTF-16 for a filename on WinNT, EBCDIC on an EBCDIC server, and so on) Maybe this diagram should go into the new version of RFC 2396. >And if you were writing EBCDIC software, you'd have to be careful. Yes, converting from 'a' to '%61' needs an ASCII table, which is implicit on an ASCII-based architecture. But apart from that, there is probably not much difference. > Personally, if I were rewriting 2396 I would simply decree that all > %-escaping be done on the UTF-8 mapping and only the UTF-8 mapping, I would be the first to agree with that. Unfortunately, it's not as easy as that, because there is some legacy out there. >and be forbidden unless it's compulsory. But that's not what 2396 >currently says. Exactly. >> I'd be glad to help with some actual text. > >Please. -Tim Will do. Can you give me few hours? Regards, Martin.
Received on Saturday, 22 February 2003 15:36:07 UTC