Re: octets <=> ASCII conversion (important) from Martin Duerst on 2004-04-21 (uri@w3.org from April 2004)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 21 Apr 2004 18:45:00 +0900
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: uri@w3.org
Message-Id: <4.2.0.58.J.20040421172714.06f94ed8@localhost>
At 01:06 04/04/21 -0700, Roy T. Fielding wrote:
>>However, the comment in my mail at
>>http://lists.w3.org/Archives/Public/uri/2004Mar/0012.html,
>>cited below, and including actual proposed text, does not
>>seem to have been addressed, nor did I find any reply saying
>>that or explaining why it would not need to be addressed, or
>>that (and how) it has been addressed.
>>
>>So in case you think that this has been addressed, please
>>tell me where/how.
>
>Section 2.5 was added to address this confusion.  It is the same
>issue that Mike Brown was discussing.  While I understand folks
>desire to have a standard give answers to common implementation
>questions, it is inappropriate for the standard to define what
>is the right implementation when no such definition is needed
>for interoperability.

Sorry, no. There are two separate issues:

1) How to put some actual data into a URI. This is what Mike
    was discussing, and this is what Section 2.5 (which is a
    great addition to the spec) discusses.

2) What percent-escapes are equivalent to what unescaped characters.
    The spec comes close to nailing this down in many places, but it
    never actually does so. And this is needed for interoperability.

    Section 6.2.2.2, Percent-Encoding Normalization, says:

    "The percent-encoding mechanism (Section 2.1) is a frequent source of
     variance among otherwise identical URIs. In addition to the case-
     insensitivity issue noted above, some URI producers percent-encode
     octets that do not require percent-encoding, resulting in URIs that
     are equivalent to their non-encoded counterparts. Such URIs should
     be normalized by decoding any percent-encoded octet that corresponds
     to an unreserved character, as described in Section 2.3."

    In order to normalize in an interoperable way, we have to know
    which percent-encoding corresponds to which unreserved character.
    The reference to Section 2.3 might help, but it only says:

    "For consistency, percent-encoded octets in the ranges of
       ALPHA (%41-%5A and %61-%7A),
       DIGIT (%30-%39),
       hyphen (%2D),
       period (%2E),
       underscore (%5F), or
       tilde (%7E)
     should not be created by URI producers and, when found in a URI,
     should be decoded to their corresponding unreserved character by
     URI normalizers."

     (Line breaks added by me.) The correspondence between percent-escapes
     and unescaped characters is given only in ranges, and only as an aside.
     For the DIGIT part alone, there are 10! = 3,628,800 possible
     permutations. Let's say we have a URI of
     "%30%31%32%33%34%35%36%37%38%39". What in the spec prohibits an
     implementation from claiming that "9876543210" is a conformant
     normalization of the above URI? Rather than listing the ranges, the
     spec should just say that the US-ASCII encoding is used. The set of
     characters is just 'unreserved'.
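
     To make the point concrete, here is a rough sketch (in Python) of the
     normalization I would expect every implementation to agree on. It
     assumes that an octet maps to the character with the same value in
     US-ASCII, which is exactly what the spec should state. The names
     UNRESERVED and normalize_unreserved are made up for this example and
     do not come from the draft.

```python
import re
import string

# The 'unreserved' set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# One percent-encoded octet: "%" followed by two hex digits.
_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that denote unreserved characters.

    Assumption (the point under discussion): the octet-to-character
    mapping is US-ASCII, so octet 0x30 is "0", 0x41 is "A", and so on.
    """

    def _decode(match: re.Match) -> str:
        octet = int(match.group(1), 16)
        if octet < 0x80:
            char = chr(octet)  # US-ASCII: octet value equals code point
            if char in UNRESERVED:
                return char
        # Reserved, non-ASCII, or otherwise excluded: leave the escape alone.
        return match.group(0)

    # A single left-to-right pass, so "%2541" stays "%2541" rather than
    # being decoded twice.
    return _PCT_OCTET.sub(_decode, uri)


print(normalize_unreserved("%30%31%32%33%34%35%36%37%38%39"))  # 0123456789
```

     With the US-ASCII mapping stated explicitly, "0123456789" is the only
     possible result for the URI above, and "9876543210" is ruled out.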

Regards,     Martin.

>  http://gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-05.html#identifying-data
>
>>In case you decided that it does not
>>need addressing, please tell me why you think so.
>
>The specific text that you supplied is not always true.
>While %31 is equivalent to "1", there is no requirement that
>data octets be represented in the URI syntax using the characters
>corresponding to their US-ASCII value -- they could just as easily
>be encoded in HEX first.  That is up to the URI producer.
>
>>>So please, at the appropriate place, add a sentence saying
>>>something like:
>>>"Data octets which in the US-ASCII character encoding represent
>>>unreserved characters can be represented by the corresponding
>>>character. For example, the data octet 0x41 can be represented
>>>by "%41" or by "A"; for readability and comparability, the latter
>>>is strongly preferred."
>
>....Roy
Received on Wednesday, 21 April 2004 23:07:25 UTC