RE: URIEquivalence-15: characters in RFC 2396 (was: Re: [Minutes] 27 Jan 2003 TAG teleconf (..., IRIEverywhere-27, ...)) from Martin Duerst on 2003-02-04 (www-international@w3.org from January to March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 04 Feb 2003 17:52:54 -0500
To: "Williams, Stuart" <skw@hplb.hpl.hp.com>, "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
Cc: www-international@w3.org, Michel Suignard <michelsu@microsoft.com>
Message-Id: <4.2.0.58.J.20030204173241.078dd418@localhost>

Hello Stuart,

At 13:27 03/02/04 +0000, Williams, Stuart wrote:

>I am prone to think of the "URI character sequence" as the sequence of
>characters, constrained by URI syntax, that I might write on a piece of
>paper, or paint on the side of a bus. An "original character sequence"
>seems to be more about the character sequence I might have wanted to paint
>on the side of a bus, or present in a user interface (e.g. kanji), that is
>prohibited from direct representation by the constraints of generic URI syntax.

Based on my long experience and repeated reading of RFC 2396, I think
your interpretation comes very close. There is one caveat:
"original character sequences" refers not only to characters
that are prohibited from direct representation by the constraints
of the (generic or opaque) URI syntax, but refers to any kind of
character. You may well also have wanted to put this sequence
on the side of a bus or in a user interface, but the important
point is that it is what you originally had, for example in a
file name or directory name if that is how the URI was made up,
or the characters that you actually wanted to query for in the
query part.

>To come back to the one character or three question... '%7e' might be viewed
>as 3 "URI Characters"; one "octet"; and one "original character" '~'
>(maybe).

Yes, exactly. The 'maybe' for '~' is quite appropriate.
If somebody ran an http server on a computer where people
still used e.g. the German version of ISO 646
(see http://www.itscj.ipsj.or.jp/ISO-IR/021.pdf), then
the original character would be a sharp-s.

As another example, '%7c' would be three URI characters, which
correspond to one octet, which usually corresponds to '|' (vertical line)
as an original character, but which may also correspond to
o-umlaut in the German version of ISO 646, as well as to many
other characters in other versions of ISO 646. (Fortunately,
most ISO 646 versions except US-ASCII are pretty much dead these
days.)
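
To make the three views concrete, here is a minimal Python sketch. It is
only an illustration, not anything RFC 2396 prescribes: the function
names and the DIN_66003_OVERRIDES table are made up for this example,
and the table is typed in by hand from the German national version of
ISO 646, because Python ships no codec for it.

```python
from urllib.parse import unquote_to_bytes

# Hand-written assumption: the positions where the German version of
# ISO 646 (DIN 66003) differs from US-ASCII. Not a standard-library codec.
DIN_66003_OVERRIDES = {
    0x40: "§", 0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß",
}

def decode_7bit(octets, overrides=None):
    """Decode 7-bit octets as US-ASCII, with optional national overrides."""
    overrides = overrides or {}
    chars = []
    for b in octets:
        if b > 0x7F:
            raise ValueError("not a 7-bit octet: %#x" % b)
        chars.append(overrides.get(b, chr(b)))
    return "".join(chars)

def three_views(escaped, overrides=None):
    """Return (number of URI characters, number of octets, original characters)."""
    octets = unquote_to_bytes(escaped)
    return len(escaped), len(octets), decode_7bit(octets, overrides)

print(three_views("%7e"))                       # read as US-ASCII
print(three_views("%7e", DIN_66003_OVERRIDES))  # read as German ISO 646
print(three_views("%7c"))
print(three_views("%7c", DIN_66003_OVERRIDES))
```

The URI character count and the octet count are the same in every call;
only the last element, the original character, depends on which
character set the server is assumed to use.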

The general problem with all this language in RFC 2396 is that it's
not easy for everybody to imagine characters being represented as
octets being in turn represented as characters (and so on).
But that's very difficult to fix.
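
The layering can also be walked in the forward direction. The following
Python sketch is again only an illustration under assumptions: the
helper name to_uri_characters is hypothetical, and ISO 8859-1 and UTF-8
are picked merely as two example character sets. It goes from original
characters to octets, and from octets to URI characters.

```python
from urllib.parse import quote, unquote_to_bytes

def to_uri_characters(original, charset):
    """Return (octets, URI characters) for the given original characters."""
    # Step 1: original characters -> octets, via the chosen character set.
    octets = original.encode(charset)
    # Step 2: octets -> URI characters, escaping whatever is not unreserved.
    uri_chars = quote(octets, safe="")
    return octets, uri_chars

print(to_uri_characters("ö", "iso-8859-1"))
print(to_uri_characters("ö", "utf-8"))

# Going back yields only the octets; the character set is not recorded
# anywhere in the URI characters themselves.
print(unquote_to_bytes("%F6"))
```

The same original character gives different octets, and therefore
different URI characters, depending on the character set chosen in
step 1.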

Regards,     Martin.

Received on Tuesday, 4 February 2003 18:40:51 UTC