- From: Tim Bray <tbray@textuality.com>
- Date: Fri, 07 Mar 2003 16:09:21 -0800
- To: uri@w3.org
Ref: http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html

I have read section 2 about 11 times and now have a persistent nagging headache that just won't quit. About half the time I am now convinced that I think I understand what it's trying to say, but I'd have a hard time justifying it with text from the RFC.

It seems to me that the RFC defines - very clearly - the syntax of a URI. However, the explanation of how those characters and escape sequences might have got there is pretty well opaque to me.

In explaining matters of character encoding, section 2.1 envisions something sort of standing behind the URI; the phrase "original character" is used (occasionally in quotes), as well as "original character sequences" (not in quotes). So maybe there's a notion of an "original URI" hiding behind the URI?

This is confusing because the "original URI" might differ from the actual URI because

(a) it contains ASCII characters which are reserved, e.g. '/' or '%'
(b) it contains non-ASCII characters
(c) it contains non-character octets

A question: what gets painted on the side of a bus? The URI or the "original" behind it? The answer is probably "The URI", except for case (b), when it might become an IRI with the native non-ASCII characters appearing on the side of the bus. (c) is kind of confusing and counter-intuitive, but is the only way I can explain the baffling language about mapping from characters to octets, and the phrase in 2.2 "The data for a URI component".

If section 2 were redrafted as follows, all the ambiguity and hand-waving would be squashed like a bug.

===============================================

2. Characters and URIs [New title]

A URI consists of a restricted set of characters, primarily chosen to aid transcribability and usability both in computer systems and in non-computer communications. Characters used conventionally as delimiters around a URI are excluded.
The restricted set of characters consists of digits, letters, and a few graphic symbols chosen from those common to most of the character encodings and input facilities available to Internet users.

   uric = reserved / unreserved / escaped

Within a URI, characters are either used as delimiters or to represent strings of data (octets) within the delimited portions. [Same as now except lose last 2 sentences].

2.1 Encoding of Characters

In the general case, there are many mappings between characters as abstractions comprising the smallest atomic units of text and the octets used to store them in a computer. The US-ASCII standard specifies not only a set of characters but a particular mode of storage where each character's numeric value (in the range 0-127) is stored directly in a single octet.

Note that many widely deployed systems for storing characters which include non-ASCII characters nonetheless store ASCII characters as specified by US-ASCII directly one per octet. This includes Shift-JIS, EUC, UTF-8, and ISO-8859 (all parts).

This RFC does not mandate the use of any particular mapping between its character set and octets of computer storage.

2.2 The Characters in the URI Scheme

The "scheme" part of a URI consists of a sequence of ASCII characters which represent nothing except themselves.

2.3 The Characters in Non-Scheme Parts of the URI

The ASCII characters making up a component of a URI other than the scheme may represent an arbitrary sequence of octets. The definitions of URI schemes MUST specify the interpretation of the characters in the components of URIs of that scheme. There are some constraints on these interpretations:

- The interpretation MUST conform to the productions in this RFC, i.e. cannot rely on using a character which is forbidden to appear in the component.
- The interpretation must be consistent: two instances of a URI component which are equal in length and made of pairwise-identical ASCII characters MUST represent the same octets.
- The character "%" MUST always be followed by two hexadecimal digits encoding the numeric value of a single octet. The hexadecimal digits 'A' through 'F' are used identically to the digits 'a' through 'f', so that two URI components which differ only in the case of hexadecimal digits used in %-encoded octets may safely be considered identical.

2.4 Textual URIs

Many schemes may wish to constrain the components of URIs to encode textual data, consisting only of characters from Unicode (ISO 10646). This section describes a procedure for encoding textual data for use in URIs. Schemes which describe textual URIs SHOULD use the procedure described in this section to generate URI components from textual data.

- ASCII characters which may legally appear in the component MUST appear directly as themselves, e.g. 'a' may not be encoded as %61.
- ASCII characters which may not legally appear in the component MUST be %-encoded using the numeric value specified by the US-ASCII standard, using the upper-case hexadecimal digits 'A' - 'F', e.g. '<' must always be encoded as %3C.
- Non-ASCII characters MUST be converted to a sequence of octets as specified by UTF-8, with each octet then %-encoded, e.g. Ç (U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA) must always be encoded as %C3%87.

===============================

I think most of the rest of section 2 is pretty well OK.

--
Cheers, Tim Bray
(ongoing fragmented essay: http://www.tbray.org/ongoing/)
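For anyone who wants to poke at the proposed 2.4 procedure, here is a minimal Python sketch of it, together with the hex-case rule from the proposed 2.3. It is an illustration under stated assumptions, not part of the proposed RFC text: the names `encode_component`, `same_component` and `UNRESERVED` are made up for this example, and the default set of legal characters is borrowed from RFC 2396's "unreserved" production, whereas a real scheme would define its own set per component.

```python
import re

# Assumption: RFC 2396 "unreserved" (alphanumerics plus marks) stands in for
# "characters which may legally appear in the component". A real scheme
# would supply its own per-component set.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)


def encode_component(text, allowed=UNRESERVED):
    """Hypothetical helper: apply the three rules of the proposed 2.4."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            if ch in allowed:
                # Rule 1: legal ASCII characters appear as themselves.
                out.append(ch)
            else:
                # Rule 2: other ASCII characters are %-encoded, upper-case hex.
                out.append("%{:02X}".format(ord(ch)))
        else:
            # Rule 3: non-ASCII characters go through UTF-8, and each
            # resulting octet is %-encoded.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


def same_component(a, b):
    """Hypothetical helper: compare two components, ignoring the case of
    hexadecimal digits inside %-escapes (the proposed 2.3 rule)."""
    def normalize(s):
        return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), s)
    return normalize(a) == normalize(b)
```

With these definitions, `encode_component("<")` gives `%3C` and `encode_component("\u00c7")` gives `%C3%87`, while `same_component("%3c", "%3C")` is true. Passing the legal characters in as a parameter reflects the point that each scheme, not this sketch, decides what may appear in a component.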
Received on Friday, 7 March 2003 19:09:23 UTC