RE: URIEquivalence-15: characters in RFC 2396 (was: Re: [Minutes] 27 Jan 2003 TAG teleconf (..., IRIEverywhere-27, ...)) from Williams, Stuart on 2003-02-04 (www-tag@w3.org from February 2003)

From: Williams, Stuart <skw@hplb.hpl.hp.com>
Date: Tue, 4 Feb 2003 13:27:47 -0000
To: "'Martin Duerst'" <duerst@w3.org>, "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
Cc: www-international@w3.org, Michel Suignard <michelsu@microsoft.com>
Message-ID: <5E13A1874524D411A876006008CD059F04A072D0@0-mail-1.hpl.hp.com>

Martin,

> -----Original Message-----
> From: Martin Duerst [mailto:duerst@w3.org]
> Sent: 03 February 2003 19:13
> To: Ian B. Jacobs; www-tag@w3.org
> Cc: www-international@w3.org; Michel Suignard
> Subject: URIEquivalence-15: characters in RFC 2396 (was: Re: [Minutes]
> 27 Jan 2003 TAG teleconf (..., IRIEverywhere-27, ...))

<snip/>

> So overall, my conclusion on the question of whether RFC 2396
> would talk about '%7e' as one character (Roy) or three (Dan)
> is that for the most part, three is much more plausible.
> RFC 2396 defines '%7e' to be one instance of the syntax rule
> 'uric', but it doesn't explicitly say that 'uric' is a character.

Section 2.1 of RFC 2396 speaks of two sorts of character sequences, "URI
character sequences" and "original character sequences".

   "There are two mappings, one from URI characters to octets, and
   a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

   A URI is represented as a sequence of characters, not as a sequence
   of octets. That is because URI might be "transported" by means that
   are not through a computer network, e.g., printed on paper, read over
   the radio, etc."


The mapping into an octet sequence seems to involve the decoding of escape
sequences into octets, while the mapping into an "original character
sequence" is described in terms of a character set, and what character set
applies in a given setting is defined outside of RFC 2396:

   "Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used."
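
To make the two mappings concrete, here is a rough sketch in Python. The
sample string and the two charsets are my own illustrative assumptions, not
anything RFC 2396 prescribes; the only point is that the first mapping is
fixed by the escape syntax while the second depends on a charset chosen
elsewhere.

```python
from urllib.parse import unquote_to_bytes

# Assumed example: six URI characters carrying two escaped octets.
uri_chars = "%C3%A9"

# First mapping: URI character sequence -> octet sequence.
# This step needs no charset; it only decodes the %XX escapes.
octets = unquote_to_bytes(uri_chars)

# Second mapping: octet sequence -> original character sequence.
# The charset is NOT carried by the URI, so both choices below are
# assumptions a scheme or protocol would have to settle.
as_utf8 = octets.decode("utf-8")         # one original character
as_latin1 = octets.decode("iso-8859-1")  # two original characters

print(len(uri_chars), len(octets), len(as_utf8), len(as_latin1))
```

The same octets give one original character under one charset and two
under the other, which is why the charset has to be defined somewhere
outside the generic syntax.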

I am prone to think of the "URI character sequence" as the sequence of
characters, constrained by URI syntax, that I might write on a piece of
paper, or paint on the side of a bus. An "original character sequence"
seems to be more about the character sequence I might have wanted to paint
on the side of a bus, or present in a user interface (e.g. kanji), which
may contain characters that are prohibited from direct use by the
constraints of generic URI syntax.

To come back to the one character or three question... '%7e' might be viewed
as 3 "URI Characters"; one "octet"; and one "original character" '~'
(maybe).
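
As a small illustration of that count, again in Python: the choice of
US-ASCII for the second mapping is an assumption on my part (hence the
"maybe"), since RFC 2396 itself does not fix the charset.

```python
from urllib.parse import unquote_to_bytes

uri_sequence = "%7e"  # what one would write on paper

# URI character sequence -> octet sequence (escape decoding only).
octet_sequence = unquote_to_bytes(uri_sequence)

# Octet sequence -> original character sequence.
# Assumption: the applicable charset is US-ASCII.
original_sequence = octet_sequence.decode("us-ascii")

counts = (len(uri_sequence), len(octet_sequence), len(original_sequence))
print(counts, original_sequence)
```

Under that assumption the three lengths come out as 3, 1 and 1, with the
single original character being '~'.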

> Regards,    Martin.

Regards

Stuart

Received on Tuesday, 4 February 2003 08:31:53 UTC