Re: draft-fielding-uri-rfc2396bis-04.txt from Martin Duerst on 2004-02-20 (uri@w3.org from February 2004)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 20 Feb 2004 17:39:28 -0500
To: "Adam M. Costello BOGUS address, see signature" <BOGUS@BOGUS.nicemice.net>, uri@w3.org
Message-Id: <4.2.0.58.J.20040220173633.02798830@localhost>

Hello Adam,

>Experts, if my understanding of the model is flawed, please correct me!

No, congratulations on your excellent understanding and explanation!

The only flaw for practical purposes is that iso-2022-jp
isn't a very good choice, because Japanese file systems and similar
things typically use euc-jp (Unix) or shift_jis (PC and Mac).
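
To make the effect of the first charset concrete, here is a minimal
Python sketch that percent-escapes the same character (U+65E5, the "sun"
character from the example quoted below) under UTF-8, euc-jp and
shift_jis. The codec names are Python's own, and the helper name
escape_with is hypothetical; none of this comes from the draft itself.

```python
from urllib.parse import quote

SUN = "\u65e5"  # the character used in the quoted example


def escape_with(charset: str) -> str:
    """Encode SUN in `charset`, then percent-escape the resulting bytes.

    Every byte produced by these three codecs is >= 0x80, so all of
    them end up as percent-escapes regardless of the `safe` set.
    """
    return quote(SUN.encode(charset), safe="")


for charset in ("utf-8", "euc_jp", "shift_jis"):
    print(f"{charset:10} {escape_with(charset)}")
# utf-8      %E6%97%A5
# euc_jp     %C6%FC
# shift_jis  %93%FA
```

The three results look nothing alike, which is why the receiver has to
know (or guess) which charset was used for the first step.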

Regards,   Martin.

At 08:54 04/02/17 -0500, Adam M. Costello BOGUS address, see signature wrote:

>Graham Klyne <GK@ninebynine.org> wrote:
>
> > I am concerned that on one hand the specification seems to say that it
> > is agnostic with respect to the character encoding used, yet on the
> > other hand it requires certain octet values expressed with %-encoding
> > to be treated as specific characters.
>
>This is an inherently confusing aspect of URIs.  The key is that there
>are *three* charsets involved, one fixed (ASCII) and two unfixed (but
>typically UTF-8 and ASCII).
>
>Suppose I have a string of characters that I want to represent as a
>component of a URI (a path component, say).  Let's say the string is
>a single character: the Chinese character for "sun".  First I need to
>represent this character as a sequence of bytes, so that's where the
>first charset comes in.  I'm encouraged but not required to use UTF-8,
>but for illustrative purposes let's say I choose iso-2022-jp.  The
>sequence of bytes is (in hex): 1B 24 42 46 7C 1B 28 42
>
>Next, I need to represent that sequence of bytes as a sequence of
>characters, because a URI is a sequence of characters.  This is where
>the second charset comes in, and it is always ASCII.  All bytes that
>map (via ASCII) to characters that are unreserved in that component (a
>path component in this example) are represented by those characters, and
>all other bytes are represented by percent-escapes.  The sequence of
>characters is therefore: %1B$BF%7C%1B(B
>
>If I am writing the URI on a napkin, I'm done.  But if I'm transmitting
>it over a network, I need to represent that sequence of characters as a
>sequence of bytes.  This is where the third charset comes in.  Typically
>it is some superset of ASCII (the URI will use only the ASCII subset),
>but it need not be.  Let's say I'm including the URI in the body of a
>mail message that is Content-Type: text/plain; charset=ebcdic-us.  Then
>the URI will be represented by the sequence of bytes (in hex):  6C F1 C2
>5B C2 C6 6C F7 C3 6C F1 C2 4D C2
>
>Experts, if my understanding of the model is flawed, please correct me!
>
>See also RFC-2396 section 2.1 (which covers the first two charsets) and
>section 1.6 (which covers the third).
>
>AMC
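
The three steps in the quoted message can be sketched in Python roughly
as follows. This is only an illustration under stated assumptions:
Python's cp037 codec stands in for ebcdic-us, the safe set "$(" is
chosen simply to match the characters the message leaves unescaped, and
the helper name hexdump is hypothetical.

```python
from urllib.parse import quote


def hexdump(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


text = "\u65e5"  # the Chinese character for "sun"

# Charset 1 (unfixed): characters -> bytes. UTF-8 is encouraged, but
# the message picks iso-2022-jp for illustration.
data = text.encode("iso2022_jp")
print(hexdump(data))   # 1B 24 42 46 7C 1B 28 42

# Charset 2 (always ASCII): bytes -> URI characters. Bytes whose ASCII
# character is allowed stay literal; all others become percent-escapes.
# Assumption: "$" and "(" are treated as allowed, as in the message.
uri_part = quote(data, safe="$(")
print(uri_part)        # %1B$BF%7C%1B(B

# Charset 3 (unfixed): URI characters -> bytes on the wire.
# Assumption: cp037 is used as a stand-in for ebcdic-us.
wire = uri_part.encode("cp037")
print(hexdump(wire))   # 6C F1 C2 5B C2 C6 6C F7 C3 6C F1 C2 4D C2
```

Only the second step is fixed by the URI syntax; swapping the codec
names in steps one and three changes the bytes but not the model.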

Received on Friday, 20 February 2004 18:09:14 UTC