- From: Stefan Eissing <stefan.eissing@greenbytes.de>
- Date: Mon, 16 Dec 2002 11:13:15 +0100
- To: Tim Bray <tbray@textuality.com>
- Cc: WWW-Tag <www-tag@w3.org>
On Friday, 13.12.02, at 16:28 (Europe/Berlin), Tim Bray wrote:

> Stefan Eissing wrote:
>
>> RFC 2396 Ch. 2.1
>> "In the simplest case, the original character sequence contains only
>> characters that are defined in US-ASCII, and the two levels of
>> mapping are simple and easily invertible: each 'original character'
>> is represented as the octet for the US-ASCII code for it, which is,
>> in turn, represented as either the US-ASCII character, or else the
>> "%" escape sequence for that octet."
>
> You're saying you read this as "all characters in the ASCII range must
> use the ASCII codepoints for character->octet"? I guess that's
> plausible, but I had read 2.1 to say "there are many character->octet
> mappings, one of the simplest being that for ASCII characters". And
> assuming you're right, it still seems like there's a window open here:
> if you're operating in a non-ASCII environment, then the char->octet
> mapping is

I'd like to close that window. :)

IMO, it does not matter in which environment one operates. URIs tend to leak out into other environments (one could say they are designed to do that) and, unfortunately, in my experience they tend to leave their charset definition behind.

> left 100% undefined, so you can't know whether %xx == %xx for all
> %xx > 0x7f. -Tim

Ch. 2.1 continues:

"For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one."

So, I read this as: whatever your charset is, if your characters are defined in US-ASCII, it's easy and you use US-ASCII code points. If you have other characters, you have to make sure that the "other side" knows what charset you are using.
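For the ASCII case, the two levels of mapping the RFC describes really are trivially invertible. A minimal sketch in Python (using the standard library's `urllib.parse`; the sample string is my own):

```python
from urllib.parse import quote, unquote

# Level 1: character -> octet, here the US-ASCII mapping.
# Level 2: octet -> either the US-ASCII character itself or a %XX escape.
original = "fee fie foe"
escaped = quote(original)      # space is not allowed in a URI, so it
                               # becomes the escape for octet 0x20
print(escaped)                 # fee%20fie%20foe

# Both levels invert cleanly, as RFC 2396 Ch. 2.1 promises.
assert unquote(escaped) == original
```

The round trip is lossless precisely because both sides agree on the character->octet table (US-ASCII); the ambiguity only appears once that agreement is gone.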
One could therefore argue that the absence of an accompanying charset indicates that US-ASCII (my preference would be UTF-8) is the base charset. Otherwise, how can one safely assume that "http://example.com/a%61" and "http://example.com/a%61" are equivalent URIs? One might be US-ASCII based and the other EBCDIC based, if the default charsets for URIs vary...

Is there an environment where other default charsets for URIs do make sense?

//Stefan
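The ambiguity above can be shown concretely: the same URI string decodes to the same octets everywhere, but which characters those octets name depends on the assumed charset. A sketch, using Python's cp037 codec as a stand-in for an EBCDIC environment:

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding is charset-neutral: it only yields octets.
octets = unquote_to_bytes("a%61")
print(octets)                   # b'aa' -- two 0x61 octets

# Which *characters* those octets denote depends on the reader's charset.
print(octets.decode("ascii"))   # 'aa'  in US-ASCII
print(octets.decode("cp037"))   # '//'  in EBCDIC, where 0x61 is '/'
```

So without an agreed base charset, two consumers of the identical URI string can legitimately disagree about the character sequence it identifies, which is exactly the equivalence problem raised above.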
Received on Monday, 16 December 2002 05:14:07 UTC