Comments on section 2 of RFC2396bis from Tim Bray on 2003-03-08 (uri@w3.org from March 2003)

From: Tim Bray <tbray@textuality.com>
Date: Fri, 07 Mar 2003 16:09:21 -0800
To: uri@w3.org
Message-ID: <3E6934B1.2010407@textuality.com>
Ref: http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html

I have read section 2 about 11 times and now have a persistent nagging 
headache that just won't quit.  I now approximately half the time am 
convinced that I think that I understand what it's trying to say, but 
I'd have a hard time justifying it with text from the RFC.

It seems to me that the RFC defines - very clearly - the syntax of a 
URI.  However, the explanation of how those characters and escape 
sequences might have got there is pretty well opaque to me.

In explaining matters of character encoding, section 2.1 envisions 
something sort of standing behind the URI, the phrase original character 
is used (occasionally in quotes), as well as "original character 
sequences" (not in quotes).  So maybe there's a notion of an "original 
URI" hiding behind the URI?

This is confusing because the "original URI" might differ from the 
actual URI because

(a) it contains ASCII characters which are reserved, e.g. '/' or '%'
(b) it contains non-ASCII characters
(c) it contains non-character octets

A question: what gets painted on the side of a bus?  The URI or the 
"original" behind it?  The answer is probably "The URI", except for case 
(b), when it might become an IRI with the native non-ASCII characters 
appearing on the side of the bus.

(c) is kind of confusing and counter-intuitive, but is the only way I 
can explain the baffling language about mapping from characters to 
octets, and the phrase in 2.2 "The data for a URI component".

If section 2 were redrafted as follows, all the ambiguity and 
hand-waving would be squashed like a bug.

===============================================

2. Characters and URIs [New title]

A URI consists of a restricted set of characters, primarily chosen to 
  aid transcribability and usability both in computer systems and in 
non-computer communications. Characters used conventionally as 
delimiters around a URI are excluded.  The restricted set of 
characters consists of digits, letters, and a few graphic symbols 
chosen from those common to most of the character encodings and input 
facilities available to Internet users.

uric          = reserved / unreserved / escaped

Within a URI, characters are either used as delimiters or to 
represent strings of data (octets) within the delimited portions. [Same 
as now except lose last 2 sentences].

2.1 Encoding of Characters

In the general case, there are many mappings between characters as 
abstractions comprising the smallest atomic units of text and the octets 
used to store them in a computer.  The US-ASCII standard specifies not 
only a set of characters but a particular mode of storage where each 
character's numeric value (in the range 0-127) is stored directly in a 
single octet.  Note that many widely deployed systems for storing 
characters which include non-ASCII characters nonetheless store ASCII 
characters as specified by US-ASCII directly one per octet.  This 
includes Shift-JIS, EUC, UTF-8, and ISO-8859 (all parts).

This RFC does not mandate the use of any particular mapping between its 
character set and octets of computer storage.

2.2 The Characters in the URI Scheme

The "scheme" part of a URI consists of a sequence of ASCII characters 
which represent nothing except themselves.

2.3 The Characters in Non-Scheme Parts of the URI

The ASCII characters making up a component of a URI other than the 
scheme may represent an arbitrary sequence of octets.  The definitions 
of URI schemes MUST specify the interpretation of the characters in the 
components of URIs of that scheme.  There are some constraints on these 
interpretations:

- The interpretation MUST conform to the productions in this RFC, i.e.
   cannot rely on using a character which is forbidden to appear in
   the component.
- The interpretation must be consistent: two instances of a URI
   component which are equal in length and made of pairwise-identical
   ASCII characters MUST represent the same octets.
- The character "%" MUST always be followed by two hexadecimal values
   encoding the numeric value of a single octet.  The hexadecimal
   digits 'A' through 'F' are used identically to the digits 'a' through
   'f', so that two URI components which differ only in the case of
   hexadecimal digits used in %-encoded octets may safely be considered
   identical.

2.4 Textual URIs

Many schemes may wish to constrain the components of URIs to encode 
textual data, consisting only of characters from Unicode (ISO10646). 
This section describes a procedure for encoding textual data for use in 
URIs.  Schemes which describe textual URIs SHOULD use the procedure 
described in this section to generate URI components from textual data.

- ASCII characters which may legally appear in the component MUST
   appear directly as themselves, i.e. 'a' may not be encoded as %61.
- ASCII characters which may not legally appear in the component MUST
   be %-encoded using the numeric value specified by the US-ASCII
   standard, using the upper-case hexadecimal digits 'A' - 'F'.  i.e.
   '<' must always be encoded as %3C.
- Non-ASCII characters MUST be converted to a sequence of octets as
   specified by UTF-8, with each octet then %-encoded.  I.e. Ç
   (U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA) must always be encoded
   as %C7%65.

===============================
I think most of the rest of section 2 is pretty well OK.

-- 
Cheers, Tim Bray
         (ongoing fragmented essay: http://www.tbray.org/ongoing/)
Received on Friday, 7 March 2003 19:09:23 UTC