octets <=> ASCII conversion from Martin Duerst on 2004-03-07 (uri@w3.org from March 2004)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 07 Mar 2004 06:18:39 -0500
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: uri@w3.org
Message-Id: <4.2.0.58.J.20040307055111.07e84148@localhost>

I have carefully read up to and including section 4 of
draft-fielding-uri-rfc2396bis-04.txt. In general, the
document is in extremely good shape. But there are some
points that should be fixed. I'll mention them in separate
emails, the most important ones first.

Sections 2.1-2.4 repeatedly mention how data octets can be
represented in URIs. For most data octets, it is clearly
defined how they get represented. For example, the binary octet
"00100000" (I'll use the C/... notation 0x20 from here on) gets
represented as %20.
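
As a minimal sketch of that representation (Python; the helper name
percent_encode_octet is purely illustrative and does not come from the
draft, and the choice of uppercase hex digits is an assumption made
here for consistency):

```python
def percent_encode_octet(octet: int) -> str:
    """Return the percent-encoded triplet for one data octet (0x00-0xFF)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0x00-0xFF")
    # "%" followed by two hexadecimal digits (uppercase is an assumption here).
    return "%{:02X}".format(octet)


# The binary octet 00100000 is 0x20.
print(percent_encode_octet(0b00100000))  # %20
```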

For reserved characters, the document says:
"If no such delimiting role has been assigned, then a
reserved character appearing in a component represents the data octet
corresponding to its encoding in US-ASCII."

This allows one to get from reserved characters to octets, but does
not say how to get from, e.g., 0x40 to a reserved character ("@"
in this case). The reader will probably infer that the inverse
mapping is used, but this should be said in the document.
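
To illustrate the two directions, here is a rough sketch in Python. It
assumes that the inverse mapping is indeed plain US-ASCII, which is
exactly the point the draft leaves unstated; both function names are
hypothetical.

```python
def char_to_octet(ch: str) -> int:
    """Direction the draft defines: character -> data octet via US-ASCII."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Raises UnicodeEncodeError for anything outside US-ASCII.
    return ch.encode("ascii")[0]


def octet_to_char(octet: int) -> str:
    """Direction the reader has to infer: data octet -> character.

    Assumption: the inverse of the US-ASCII mapping is intended.
    """
    # Raises UnicodeDecodeError for octets above 0x7F.
    return bytes([octet]).decode("ascii")


print(hex(char_to_octet("@")))  # 0x40
print(octet_to_char(0x40))      # @
```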

The situation is even worse for unreserved characters.
The closest one comes to finding a correspondence between
data octets and unreserved characters is at the end of 2.3:
"For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A
and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore
(%5F), or tilde (%7E) should not be created by URI producers and,
when found in a URI, should be decoded to their corresponding
unreserved character by URI normalizers."
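
The normalizer behaviour described in that quote could be sketched as
follows (Python). The sketch assumes the octet-to-character
correspondence is US-ASCII, which the quoted text implies but does not
state; the names UNRESERVED and decode_unreserved are my own.

```python
import re

# ALPHA / DIGIT / "-" / "." / "_" / "~", as listed in the quoted text.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that correspond to unreserved characters.

    All other percent-encoded octets are left exactly as found.
    """
    def replace(match):
        # Assumption: the octet maps to a character via US-ASCII.
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return _PCT_TRIPLET.sub(replace, uri)


print(decode_unreserved("%41%2d%7E%20%40"))  # A-~%20%40
```

Note that each triplet is examined only once, so "%2541" stays
"%2541" rather than being decoded twice.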

The informed reader will probably say: "Hey, this looks too
similar to US-ASCII to be anything else, so let's assume that
it's US-ASCII". But this is not what the reader of a spec
should have to do.

So please, at the appropriate place, add a sentence saying
something like:
"Data octets which in the US-ASCII character encoding represent
unreserved characters can be represented by the corresponding
character. For example, the data octet 0x41 can be represented
by "%41" or by "A"; for readability and comparability, the latter
is strongly preferred."
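
On the producer side, the rule in the proposed sentence might look like
this (a Python sketch only; the names are hypothetical, and
percent-encoding every octet that is not unreserved, including
reserved characters, is a conservative simplification of mine rather
than something the draft requires):

```python
# Octets that represent unreserved characters in US-ASCII.
UNRESERVED_OCTETS = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def encode_data_octets(data: bytes) -> str:
    """Represent data octets, preferring the literal unreserved character."""
    parts = []
    for octet in data:
        if octet in UNRESERVED_OCTETS:
            # 0x41 becomes "A" rather than "%41".
            parts.append(chr(octet))
        else:
            # Simplification: everything else is percent-encoded.
            parts.append("%{:02X}".format(octet))
    return "".join(parts)


print(encode_data_octets(bytes([0x41, 0x20, 0x40])))  # A%20%40
```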


Regards,     Martin.

Received on Sunday, 7 March 2004 06:19:02 UTC