proposed rewrite of 2.1 of draft-fielding-uri-syntax ... from Larry Masinter on 1998-01-05 (uri@w3.org from January 1998)

From: Larry Masinter <masinter@parc.xerox.com>
Date: Sun, 4 Jan 1998 22:09:46 PST
To: uri-i18n@unicode.org
CC: fielding@ics.uci.edu, uri@bunyip.com, Jacob Palme <jpalme@dsv.su.se>
Message-ID: <34B0792A.1FF36EE4@parc.xerox.com>

Jacob Palme pointed out that the second (one line) paragraph
of the current Section 2.1 of the URI/URL/whatever draft was
hard to understand. With some amount of trepidation, I propose
the following (alas lengthy) rewrite:
-----------------------------------------------
2.1 URIs and non-ASCII characters   

   The relationship between URIs and characters (for characters that
   are not part of ASCII) has been a source of confusion. To describe
   the relationship, it is useful to distinguish between a "character"
   (as a distinguishable semantic entity) and an "octet" (an 8-bit
   byte). There are two mappings, one from URI characters to octets,
   and a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

   A URI is represented as a sequence of characters, not as a sequence
   of octets. This is because URIs might be "transported" by means
   other than a computer network, e.g., printed on paper or read over
   the radio.

   URI schemes may define a mapping from URI characters to octets;
   whether this is done depends on the scheme. Commonly, within a
   delimited section of a URI a sequence of characters may be
   used to represent a sequence of octets. For example, the character
   "a" represents the octet 97 (decimal), while the character sequence
   "%", "0", "a" represents the octet 10 (decimal).

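   As an illustration only, the first mapping might be sketched in
   Python as follows. The function name uri_chars_to_octets is
   hypothetical, and the sketch assumes that every unescaped character
   stands for its US-ASCII code; an actual scheme may define the
   mapping differently.

```python
import string

def uri_chars_to_octets(uri_chars: str) -> bytes:
    """Map a URI character sequence to the octets it represents."""
    octets = bytearray()
    i = 0
    while i < len(uri_chars):
        ch = uri_chars[i]
        if ch == "%":
            # A "%" escape is followed by exactly two hexadecimal digits.
            pair = uri_chars[i + 1:i + 3]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError(f"malformed escape at index {i}")
            octets.append(int(pair, 16))
            i += 3
        else:
            # Assumption: an unescaped character stands for its US-ASCII code.
            if ord(ch) > 127:
                raise ValueError(f"non-ASCII URI character at index {i}")
            octets.append(ord(ch))
            i += 1
    return bytes(octets)

print(list(uri_chars_to_octets("a")))    # [97]
print(list(uri_chars_to_octets("%0a")))  # [10]
```

   The two calls reproduce the examples above: "a" gives octet 97, and
   "%", "0", "a" gives octet 10.
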
   For some schemes and protocols, there is a second translation: the
   sequence of octets defined by a component of the URI is subsequently
   used to represent a sequence of characters. A 'charset' defines this
   mapping. There are many charsets in use in Internet protocols. For
   example, UTF-8 [UTF8] defines a mapping from sequences of octets to
   sequences of characters in the repertoire of ISO 10646.
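
   A minimal Python sketch of the two mappings applied in sequence is
   given here for illustration. The function name
   uri_component_to_original is hypothetical, and the charset is
   passed in explicitly because the generic syntax does not say which
   one applies.

```python
from urllib.parse import unquote_to_bytes

def uri_component_to_original(component: str, charset: str) -> str:
    """URI character sequence -> octet sequence -> original characters."""
    octets = unquote_to_bytes(component)  # first mapping: undo "%" escapes
    return octets.decode(charset)         # second mapping: apply the charset

# "%C3%A9" yields the octets 0xC3 0xA9, which UTF-8 maps to U+00E9.
print(uri_component_to_original("%C3%A9", "utf-8") == "\u00e9")  # True
```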
   
   In the simplest case, the original character sequence contains
   only characters that are defined in US-ASCII, and the two levels
   of mapping are simple and easily invertible: each 'original character'
   is represented as the octet for the US-ASCII code for it, which is,
   in turn, represented as either the US-ASCII character, or else the
   "%" escape sequence for that octet.

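   The simple US-ASCII case might be sketched in Python as below. The
   set UNESCAPED is a hypothetical choice of characters left
   unescaped, made only for this example; the function names are
   likewise illustrative.

```python
import string
from urllib.parse import unquote_to_bytes

# Hypothetical choice of characters left unescaped, for illustration only.
UNESCAPED = set(string.ascii_letters + string.digits + "-_.")

def ascii_to_uri_chars(original: str) -> str:
    """Original US-ASCII characters -> octets -> URI characters."""
    out = []
    # encode("ascii") raises UnicodeEncodeError for non-ASCII input.
    for octet in original.encode("ascii"):
        ch = chr(octet)
        out.append(ch if ch in UNESCAPED else f"%{octet:02X}")
    return "".join(out)

def uri_chars_to_ascii(uri_chars: str) -> str:
    """Inverse direction: URI characters -> octets -> US-ASCII characters."""
    return unquote_to_bytes(uri_chars).decode("ascii")

encoded = ascii_to_uri_chars("a b")
print(encoded)                      # a%20b
print(uri_chars_to_ascii(encoded))  # a b
```

   Because each step is one-to-one for US-ASCII input, decoding the
   encoded form returns the original character sequence.
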
   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols which
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used,
   if there might be more than one [RFC-char-standard].  However,
   there is currently no provision within the generic URI syntax to
   accomplish this identification. Of course, individual URI schemes
   may provide a way to indicate the charset used or define a default
   charset. In addition, the generic syntax does not define how URI
   characters outside a limited repertoire, that is, non-ASCII URI
   characters, are to be interpreted.
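
   The difficulty can be seen in a short Python sketch: the same octet
   sequence yields different original characters depending on the
   charset assumed. The two charsets below are chosen only as
   examples.

```python
octets = bytes([0xC3, 0xA9])  # e.g. the octets represented by "%C3%A9"

as_utf8 = octets.decode("utf-8")         # one character, U+00E9
as_latin1 = octets.decode("iso-8859-1")  # two characters, U+00C3 U+00A9

print(len(as_utf8), len(as_latin1))  # 1 2
```

   Without an indication of the charset, a recipient cannot tell which
   of the two readings was intended.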

   It is expected that a systematic treatment of character encoding
   within URIs will be developed as a future modification of this
   specification.

Received on Monday, 5 January 1998 01:10:11 UTC