- From: Larry Masinter <masinter@parc.xerox.com>
- Date: Sun, 4 Jan 1998 22:09:46 PST
- To: uri-i18n@unicode.org
- CC: fielding@ics.uci.edu, uri@bunyip.com, Jacob Palme <jpalme@dsv.su.se>
Jacob Palme pointed out that the second (one line) paragraph of the current Section 2.1 of the URI/URL/whatever draft was hard to understand. With some amount of trepidation, I propose the following (alas lengthy) rewrite:

-----------------------------------------------

2.1 URIs and non-ASCII characters

The relationship between URIs and characters (for characters that are not part of ASCII) has been a source of confusion. To describe the relationship, it is useful to distinguish between a "character" (as a distinguishable semantic entity) and an "octet" (an 8-bit byte). There are two mappings, one from URI characters to octets, and a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

A URI is represented as a sequence of characters, not as a sequence of octets. That is because URIs might be "transported" by means that are not through a computer network, e.g., printed on paper, read over the radio.

URI schemes may define a mapping from URI characters to octets; whether this is done depends on the scheme. Commonly, within a delimited section of a URI a sequence of characters may be used to represent a sequence of octets. For example, the character "a" represents the octet 97 (decimal), while the character sequence "%", "0", "a" represents the octet 10 (decimal).

Secondarily, for some schemes and protocols, there is a second translation: the sequence of octets defined by a component of the URI is subsequently used to represent a sequence of characters. A 'charset' defines this mapping. There are many charsets in use in Internet Protocols. For example, UTF8 [UTF8] defines a mapping from sequences of octets to sequences of characters in the repertoire of ISO 10646.
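
As an illustrative aside to the draft text above, the two mappings can be sketched in Python. This is only a sketch under stated assumptions: the function names uri_chars_to_octets and octets_to_characters are hypothetical, the sample component "caf%C3%A9" is made up for the example, and the charset is passed in by the caller because the generic syntax does not supply one.

```python
from urllib.parse import unquote_to_bytes


def uri_chars_to_octets(component: str) -> bytes:
    """First mapping: URI characters -> octets.

    Unescaped ASCII characters map to their US-ASCII octet; a "%xx"
    triple maps to the octet with that hexadecimal value.
    """
    return unquote_to_bytes(component)


def octets_to_characters(octets: bytes, charset: str) -> str:
    """Second mapping: octets -> original characters.

    The charset is an assumption supplied by the caller (for example
    by the URI scheme); nothing in the octets themselves identifies it.
    """
    return octets.decode(charset)


# The two examples from the text: "a" is octet 97, "%0a" is octet 10.
assert uri_chars_to_octets("a") == bytes([97])
assert uri_chars_to_octets("%0a") == bytes([10])

# A component whose octets are then interpreted under UTF-8.
octets = uri_chars_to_octets("caf%C3%A9")
original = octets_to_characters(octets, "utf-8")
```

Keeping the two steps as separate functions mirrors the distinction drawn in the text: the first step depends only on the URI escaping rules, while the second depends on a charset that has to come from somewhere else.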
In the simplest case, the original character sequence contains only characters that are defined in US-ASCII, and the two levels of mapping are simple and easily invertible: each 'original character' is represented as the octet for the US-ASCII code for it, which is, in turn, represented as either the US-ASCII character, or else the "%" escape sequence for that octet.

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset to be used, if there might be more than one [RFC-char-standard]. However, there is currently no provision within the generic URI syntax to accomplish this identification. Of course, individual URI schemes may provide a way to indicate the charset used or define a default charset. In addition, there is no definition of the meaning of characters outside of a limited repertoire for interpretation of non-ASCII URI characters.

It is expected that a systematic treatment of character encoding within URIs will be developed as a future modification of this specification.
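
To make the difficulty concrete, here is a rough Python sketch of why the charset has to be identified somehow. The helper names encode_component and decode_component are hypothetical, and the choice of UTF-8 and ISO-8859-1 as the two competing charsets is only an assumption for illustration.

```python
from urllib.parse import quote, unquote_to_bytes


def encode_component(text: str, charset: str) -> str:
    """Original characters -> octets (via charset) -> URI characters."""
    return quote(text.encode(charset), safe="")


def decode_component(component: str, charset: str) -> str:
    """URI characters -> octets -> original characters (via charset)."""
    return unquote_to_bytes(component).decode(charset)


# US-ASCII only: the round trip is simple and invertible.
ascii_uri = encode_component("a b", "ascii")
assert decode_component(ascii_uri, "ascii") == "a b"

# Non-ASCII: the same character yields different URI characters
# depending on the charset the producer happened to use.
utf8_uri = encode_component("\u00e9", "utf-8")         # "%C3%A9"
latin1_uri = encode_component("\u00e9", "iso-8859-1")  # "%E9"
assert utf8_uri != latin1_uri

# A consumer that guesses the wrong charset silently recovers the
# wrong characters (two characters here instead of one) ...
misread = decode_component(utf8_uri, "iso-8859-1")
assert misread != "\u00e9" and len(misread) == 2

# ... or cannot decode the octets at all.
try:
    decode_component(latin1_uri, "utf-8")
    undecodable = False
except UnicodeDecodeError:
    undecodable = True
assert undecodable
```

The URI characters alone carry no indication of which charset was intended, which is the gap the paragraph above describes.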
Received on Monday, 5 January 1998 01:10:11 UTC