- From: Tim Bray <tbray@textuality.com>
- Date: Fri, 07 Mar 2003 16:09:21 -0800
- To: uri@w3.org
Ref: http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html
I have read section 2 about 11 times and now have a persistent nagging
headache that just won't quit. I now approximately half the time am
convinced that I think that I understand what it's trying to say, but
I'd have a hard time justifying it with text from the RFC.
It seems to me that the RFC defines - very clearly - the syntax of a
URI. However, the explanation of how those characters and escape
sequences might have got there is pretty well opaque to me.
In explaining matters of character encoding, section 2.1 envisions
something sort of standing behind the URI, the phrase original character
is used (occasionally in quotes), as well as "original character
sequences" (not in quotes). So maybe there's a notion of an "original
URI" hiding behind the URI?
This is confusing because the "original URI" might differ from the
actual URI because
(a) it contains ASCII characters which are reserved, e.g. '/' or '%'
(b) it contains non-ASCII characters
(c) it contains non-character octets
A question: what gets painted on the side of a bus? The URI or the
"original" behind it? The answer is probably "The URI", except for case
(b), when it might become an IRI with the native non-ASCII characters
appearing on the side of the bus.
(c) is kind of confusing and counter-intuitive, but is the only way I
can explain the baffling language about mapping from characters to
octets, and the phrase in 2.2 "The data for a URI component".
If section 2 were redrafted as follows, all the ambiguity and
hand-waving would be squashed like a bug.
===============================================
2. Characters and URIs [New title]
A URI consists of a restricted set of characters, primarily chosen to
aid transcribability and usability both in computer systems and in
non-computer communications. Characters used conventionally as
delimiters around a URI are excluded. The restricted set of
characters consists of digits, letters, and a few graphic symbols
chosen from those common to most of the character encodings and input
facilities available to Internet users.
uric = reserved / unreserved / escaped
Within a URI, characters are either used as delimiters or to
represent strings of data (octets) within the delimited portions. [Same
as now except lose last 2 sentences].
2.1 Encoding of Characters
In the general case, there are many mappings between characters as
abstractions comprising the smallest atomic units of text and the octets
used to store them in a computer. The US-ASCII standard specifies not
only a set of characters but a particular mode of storage where each
character's numeric value (in the range 0-127) is stored directly in a
single octet. Note that many widely deployed systems for storing
characters which include non-ASCII characters nonetheless store ASCII
characters as specified by US-ASCII directly one per octet. This
includes Shift-JIS, EUC, UTF-8, and ISO-8859 (all parts).
This RFC does not mandate the use of any particular mapping between its
character set and octets of computer storage.
2.2 The Characters in the URI Scheme
The "scheme" part of a URI consists of a sequence of ASCII characters
which represent nothing except themselves.
2.3 The Characters in Non-Scheme Parts of the URI
The ASCII characters making up a component of a URI other than the
scheme may represent an arbitrary sequence of octets. The definitions
of URI schemes MUST specify the interpretation of the characters in the
components of URIs of that scheme. There are some constraints on these
interpretations:
- The interpretation MUST conform to the productions in this RFC, i.e.
cannot rely on using a character which is forbidden to appear in
the component.
- The interpretation must be consistent: two instances of a URI
component which are equal in length and made of pairwise-identical
ASCII characters MUST represent the same octets.
- The character "%" MUST always be followed by two hexadecimal values
encoding the numeric value of a single octet. The hexadecimal
digits 'A' through 'F' are used identically to the digits 'a' through
'f', so that two URI components which differ only in the case of
hexadecimal digits used in %-encoded octets may safely be considered
identical.
2.4 Textual URIs
Many schemes may wish to constrain the components of URIs to encode
textual data, consisting only of characters from Unicode (ISO10646).
This section describes a procedure for encoding textual data for use in
URIs. Schemes which describe textual URIs SHOULD use the procedure
described in this section to generate URI components from textual data.
- ASCII characters which may legally appear in the component MUST
appear directly as themselves, i.e. 'a' may not be encoded as %61.
- ASCII characters which may not legally appear in the component MUST
be %-encoded using the numeric value specified by the US-ASCII
standard, using the upper-case hexadecimal digits 'A' - 'F'. i.e.
'<' must always be encoded as %3C.
- Non-ASCII characters MUST be converted to a sequence of octets as
specified by UTF-8, with each octet then %-encoded. I.e. Ç
(U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA) must always be encoded
as %C7%65.
===============================
I think most of the rest of section 2 is pretty well OK.
--
Cheers, Tim Bray
(ongoing fragmented essay: http://www.tbray.org/ongoing/)
Received on Friday, 7 March 2003 19:09:23 UTC