Re: draft-fielding-uri-rfc2396bis-04.txt from Adam M. Costello BOGUS address, see signature on 2004-02-17 (uri@w3.org from February 2004)

From: Adam M. Costello BOGUS address, see signature <BOGUS@BOGUS.nicemice.net>
Date: Tue, 17 Feb 2004 08:54:20 -0500
To: uri@w3.org
Message-Id: <4.2.0.58.J.20040217085342.05f84b20@localhost>

Graham Klyne <GK@ninebynine.org> wrote:

 > I am concerned that on one hand the specification seems to say that it
 > is agnostic with respect to the character encoding used, yet on the
 > other hand it requires certain octet values expressed with %-encoding
 > to be treated as specific characters.

This is an inherently confusing aspect of URIs.  The key is that there
are *three* charsets involved, one fixed (ASCII) and two unfixed (but
typically UTF-8 and ASCII).

Suppose I have a string of characters that I want to represent as a
component of a URI (a path component, say).  Let's say the string is
a single character: the Chinese character for "sun".  First I need to
represent this character as a sequence of bytes, so that's where the
first charset comes in.  I'm encouraged but not required to use UTF-8,
but for illustrative purposes let's say I choose iso-2022-jp.  The
sequence of bytes is (in hex): 1B 24 42 46 7C 1B 28 42

Next, I need to represent that sequence of bytes as a sequence of
characters, because a URI is a sequence of characters.  This is where
the second charset comes in, and it is always ASCII.  All bytes that
map (via ASCII) to characters that are unreserved in that component (a
path component in this example) are represented by those characters, and
all other bytes are represented by percent-escapes.  The sequence of
characters is therefore: %1B$BF%7C%1B(B

If I am writing the URI on a napkin, I'm done.  But if I'm transmitting
it over a network, I need to represent that sequence of characters as a
sequence of bytes.  This is where te third charset comes in.  Typically
it is some superset of ASCII (the URI will use only the ASCII subset),
but it need not be.  Let's say I'm including the URI in the body of a
mail message that is Content-Type: text/plain; charset=ebcdic-us.  Then
the URI will be represented by the sequence of bytes (in hex):  6C F1 C2
5B C2 C5 6C F7 C3 6C F1 C2 4D C2

Experts, if my understanding of the model is flawed, please correct me!

See also RFC-2396 section 2.1 (which covers the first two charsets) and
section 1.6 (which covers the third).

AMC

Received on Tuesday, 17 February 2004 09:55:02 UTC