- From: Williams, Stuart <skw@hplb.hpl.hp.com>
- Date: Tue, 4 Feb 2003 13:27:47 -0000
- To: "'Martin Duerst'" <duerst@w3.org>, "Ian B. Jacobs" <ij@w3.org>, www-tag@w3.org
- Cc: www-international@w3.org, Michel Suignard <michelsu@microsoft.com>
Martin,

> -----Original Message-----
> From: Martin Duerst [mailto:duerst@w3.org]
> Sent: 03 February 2003 19:13
> To: Ian B. Jacobs; www-tag@w3.org
> Cc: www-international@w3.org; Michel Suignard
> Subject: URIEquivalence-15: characters in RFC 2396 (was: Re: [Minutes]
> 27 Jan 2003 TAG teleconf (..., IRIEverywhere-27, ...))

<snip/>

> So overall, my conclusion on the question of whether RFC 2396
> would talk about '%7e' as one character (Roy) or three (Dan)
> is that for the most part, three is much more plausible.
> RFC 2396 defines '%7e' to be one instance of the syntax rule
> 'uric', but it doesn't explicitly say that 'uric' is a character.

Section 2.1 of RFC 2396 speaks of two sorts of character sequences, "URI character sequences" and "original character sequences".

"There are two mappings, one from URI characters to octets, and a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

A URI is represented as a sequence of characters, not as a sequence of octets. That is because URI might be "transported" by means that are not through a computer network, e.g., printed on paper, read over the radio, etc."

The mapping into an octet sequence seems to involve decoding escape sequences into octets, while the mapping into an "original character sequence" is described in terms of a character set (charset), and which charset applies in a given setting is defined outside of RFC 2396:

"Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used."
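A minimal sketch of the two mappings quoted above, for illustration only. The helper name `count_levels` is hypothetical, and the charset passed in is an assumption supplied by the caller, since the generic URI syntax does not identify one. Python's standard `urllib.parse.unquote_to_bytes` stands in for the first mapping (URI characters to octets), and `bytes.decode` for the second (octets to original characters).

```python
from urllib.parse import unquote_to_bytes


def count_levels(uri_chars: str, charset: str) -> tuple[int, int, int]:
    """Return (URI character count, octet count, original character count).

    `charset` is an assumption made by the caller; RFC 2396 itself gives
    no way to discover it from the URI alone.
    """
    # First mapping: URI character sequence -> octet sequence.
    octets = unquote_to_bytes(uri_chars)
    # Second mapping: octet sequence -> original character sequence.
    original = octets.decode(charset)
    return len(uri_chars), len(octets), len(original)


# '%7e' read as US-ASCII: three URI characters, one octet, one original character.
print(count_levels("%7e", "ascii"))

# The same escaped octets give different original-character counts
# depending on which charset is assumed.
print(count_levels("%E3%81%82", "utf-8"))
print(count_levels("%E3%81%82", "latin-1"))
```

The last two calls use identical URI characters and identical octets; only the assumed charset differs, which is why the count of "original characters" cannot be settled by the generic syntax on its own.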
I am prone to think of the "URI character sequence" as the sequence of characters, constrained by URI syntax, that I might write on a piece of paper or paint on the side of a bus. An "original character sequence" seems to be more about the character sequence I might have wanted to paint on the side of a bus, or present in a user interface (e.g. kanji), containing characters that are prohibited from direct use by the constraints of generic URI syntax.

To come back to the one character or three question... '%7e' might be viewed as three "URI characters"; one "octet"; and one "original character", '~' (maybe).

> Regards, Martin.

Regards

Stuart
Received on Tuesday, 4 February 2003 08:31:53 UTC