Re: Draft 2 of "How to Compare URIs" from Stefan Eissing on 2002-12-16 (www-tag@w3.org from December 2002)

From: Stefan Eissing <stefan.eissing@greenbytes.de>
Date: Mon, 16 Dec 2002 11:13:15 +0100
To: Tim Bray <tbray@textuality.com>
Cc: WWW-Tag <www-tag@w3.org>
Message-Id: <FD988E59-10DE-11D7-BAFF-00039384827E@greenbytes.de>

On Friday, 13 Dec 2002, at 16:28 (Europe/Berlin), Tim Bray wrote:

> Stefan Eissing wrote:
>
>> RFC 2396 Ch. 2.1
>> " In the simplest case, the original character sequence contains only 
>> characters that are defined in US-ASCII, and the two levels of 
>> mapping are simple and easily invertible: each 'original character' 
>> is represented as the octet for the US-ASCII code for it, which is, 
>> in turn, represented as either the US-ASCII character, or else the 
>> "%" escape sequence for that octet."
>
> You're saying you read this as "all characters in the ASCII range must 
> use the ASCII codepoints for character->octet"?  I guess that's 
> plausible, but I had read 2.1 to say "there are many character->octet 
> mappings, one of the simplest being that for ASCII characters".  And 
> assuming you're right, it still seems like there's a window open here, 
> if you're operating in a non-ASCII environment then the char->octet 
> mapping is

I'd like to close that window. :)
IMO, it does not matter in which environment one operates. URIs tend to
leak out into other environments (one could say they are designed to do
that) and, unfortunately, in my experience they tend to leave their
charset definition behind.

> left 100% undefined, so you can't know whether %xx == %xx for all
> %xx > 0x7f. -Tim

Ch. 2.1 continues:

"For original character sequences that contain non-ASCII characters, 
however, the situation is more difficult. Internet protocols that 
transmit octet sequences intended to represent character sequences are 
expected to provide some way of identifying the charset used, if there 
might be more than one"

So, I read this as: whatever your charset is, if your characters are
defined in US-ASCII, it's easy and you use US-ASCII code points. If you
have other characters, you have to make sure that the "other side" knows
what charset you are using.

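To illustrate that reading, here is a minimal Python sketch. It is not
taken from RFC 2396; the helper name `escape` and the choice of UTF-8
and ISO-8859-1 as example charsets are my own assumptions. It shows that
an ASCII character escapes the same way whatever the charset, while a
non-ASCII character yields different "%" sequences depending on it.

```python
from urllib.parse import quote, unquote_to_bytes

def escape(text: str, charset: str) -> str:
    """Hypothetical helper: map characters to octets in `charset`,
    then escape every octet that is not an unreserved character."""
    return quote(text, safe="", encoding=charset)

# An ASCII character: same result whatever charset is assumed.
ascii_utf8 = escape("a", "utf-8")
ascii_latin1 = escape("a", "iso-8859-1")

# A non-ASCII character (U+00E9): the escape depends on the charset.
e_acute_utf8 = escape("\u00e9", "utf-8")
e_acute_latin1 = escape("\u00e9", "iso-8859-1")

# The receiver only sees octets; without the charset it cannot tell
# which character was meant.
octets_utf8 = unquote_to_bytes(e_acute_utf8)
octets_latin1 = unquote_to_bytes(e_acute_latin1)
```

So the receiver of "%E9" needs to be told the charset out of band,
which is exactly the requirement the quoted paragraph describes.
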
One could therefore argue that the absence of an accompanying charset
indicates that US-ASCII (my preference would be UTF-8) is the base
charset. Otherwise, how can one safely assume that
"http://example.com/a%61" and "http://example.com/a%61" are equivalent
URIs? One might be US-ASCII based and the other EBCDIC based, if the
default charset for URIs varies...
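
A small Python sketch of the problem, under the assumption that code
page 037 (Python's `cp037` codec) stands in for "EBCDIC"; the variable
names are only illustrative. The same escaped octet decodes to different
characters depending on the charset the reader assumes.

```python
from urllib.parse import unquote_to_bytes

# The escape sequence only identifies an octet, not a character.
octet = unquote_to_bytes("%61")

# Reader assuming US-ASCII sees the letter "a".
as_ascii = octet.decode("ascii")

# Reader assuming EBCDIC (code page 037, an assumed example) sees "/".
as_ebcdic = octet.decode("cp037")

# Under EBCDIC cp037 the letter "a" is a different octet altogether.
a_in_ebcdic = "a".encode("cp037")
```

With a fixed base charset, "%61" would always mean the same character
and a comparison of the two URIs would be safe.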

Is there an environment where other default charsets for URIs make
sense?

//Stefan

Received on Monday, 16 December 2002 05:14:07 UTC