- From: Martin Duerst <duerst@w3.org>
- Date: Fri, 20 Feb 2004 17:39:28 -0500
- To: "Adam M. Costello BOGUS address, see signature" <BOGUS@BOGUS.nicemice.net>, uri@w3.org
Hello Adam, >Experts, if my understanding of the model is flawed, please correct me! No, congratulations on your excellent understanding and explanation! The only flaw for practical purposes would be that iso-2022-jp isn't a very good choice, because Japanese file systems and similar things are euc-jp (unix) and shift_jis (PC and Mac). Regards, Martin. At 08:54 04/02/17 -0500, Adam M. Costello BOGUS address, see signature wrote: >Graham Klyne <GK@ninebynine.org> wrote: > > > I am concerned that on one hand the specification seems to say that it > > is agnostic with respect to the character encoding used, yet on the > > other hand it requires certain octet values expressed with %-encoding > > to be treated as specific characters. > >This is an inherently confusing aspect of URIs. The key is that there >are *three* charsets involved, one fixed (ASCII) and two unfixed (but >typically UTF-8 and ASCII). > >Suppose I have a string of characters that I want to represent as a >component of a URI (a path component, say). Let's say the string is >a single character: the Chinese character for "sun". First I need to >represent this character as a sequence of bytes, so that's where the >first charset comes in. I'm encouraged but not required to use UTF-8, >but for illustrative purposes let's say I choose iso-2022-jp. The >sequence of bytes is (in hex): 1B 24 42 46 7C 1B 28 42 > >Next, I need to represent that sequence of bytes as a sequence of >characters, because a URI is a sequence of characters. This is where >the second charset comes in, and it is always ASCII. All bytes that >map (via ASCII) to characters that are unreserved in that component (a >path component in this example) are represented by those characters, and >all other bytes are represented by percent-escapes. The sequence of >characters is therefore: %1B$BF%7C%1B(B > >If I am writing the URI on a napkin, I'm done. But if I'm transmitting >it over a network, I need to represent that sequence of characters as a >sequence of bytes. This is where te third charset comes in. Typically >it is some superset of ASCII (the URI will use only the ASCII subset), >but it need not be. Let's say I'm including the URI in the body of a >mail message that is Content-Type: text/plain; charset=ebcdic-us. Then >the URI will be represented by the sequence of bytes (in hex): 6C F1 C2 >5B C2 C5 6C F7 C3 6C F1 C2 4D C2 > >Experts, if my understanding of the model is flawed, please correct me! > >See also RFC-2396 section 2.1 (which covers the first two charsets) and >section 1.6 (which covers the third). > >AMC
Received on Friday, 20 February 2004 18:09:14 UTC