Section 6 Normalization and Comparison from Williams, Stuart on 2003-04-14 (uri@w3.org from April 2003)

From: Williams, Stuart <skw@hplb.hpl.hp.com>
Date: Mon, 14 Apr 2003 16:14:26 +0100
To: "Tim Bray (E-mail)" <tbray@textuality.com>
Cc: "'uri@w3.org'" <uri@w3.org>
Message-ID: <5E13A1874524D411A876006008CD059F04A07483@0-mail-1.hpl.hp.com>

Tim,

I said earlier [1] that I had a couple more comments to make on the URI
comparison section of RFC2396bis [2]. It turns out that I really only have
two substantive comments, both on section 6.2.2.2 below.

Best regards

Stuart
--

6.2.2 Syntax-based normalisation and 
6.2.2.3 Path Segment Normalisation
-------------------------------------
[already noted in [1]]

These sections and section 4 "URI References" differ with respect to the
interpretation of "." and ".." in absolute forms of URI.

6.2.2.2 Escape Normalisation
----------------------------

States: "One cause is the choice of upper-case or lower-case letters for the
hexadecimal digits within the escape sequence (e.g., "%3a" versus "%3A").
Such sequences are always equivalent; for the sake of uniformity, URI
generators and normalizers are strongly encouraged to use upper-case letters
for the hex digits A-F."

"... Such sequences are always equivalent; ..." This seems to ignore the
purpose of the comparison. For example, are such sequences equivalent for
the purpose of naming a namespace?
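
To make the concern concrete, here is a minimal sketch in Python. The
example URIs and the helper name uppercase_escapes are made up for
illustration and are not taken from the draft. It shows two URIs that differ
only in the case of an escape's hex digits: they are distinct under a plain
character-by-character comparison, and identical only once the
normalisation the draft encourages has been applied.

```python
import re

# Matches one escape sequence: '%' followed by two hex digits.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_escapes(uri: str) -> str:
    """Upper-case the hex digits of every escape; leave all else untouched."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), uri)


# Hypothetical URIs differing only in the case of the hex digits.
a = "http://example.org/ns%3afoo"
b = "http://example.org/ns%3Afoo"

# Plain string comparison treats them as different names.
plain_equal = a == b

# After case normalisation of the escapes they compare equal.
normalised_equal = uppercase_escapes(a) == uppercase_escapes(b)

print(plain_equal, normalised_equal)
```

Which of the two results is the "right" one depends on what the comparison
is for, which is the point of the question above.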

Also states: "Only characters that are excluded from or reserved within the
URI syntax must be escaped when used as data. However, some URI generators
go beyond that and escape characters that do not require escaping, resulting
in URIs that are equivalent to their unescaped counterparts. Such URIs can
be normalized by unescaping sequences that represent the unreserved
characters, as described in Section 2.3."

I think that the reserved use of some characters is scoped by scheme and by
URI syntax component (scheme, authority, path, query, fragment), i.e. their
reserved purpose is only applicable in certain fields, and so escaping
should only be applied to a reserved character when its reserved purpose is
in scope.
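
A small sketch of what I mean by scoping, using "?" as the example
character and Python's urllib.parse.urlsplit as the parser. The URIs are
hypothetical, and urlsplit is only one parser's reading of the generic
syntax, not something the draft prescribes. In the path, a raw "?" ends the
path and starts the query, so it has to be escaped to be carried as data;
in the fragment the same character has no delimiting role and passes
through as data unescaped.

```python
from urllib.parse import urlsplit

# Raw '?' after the path: it acts as the path/query delimiter.
raw_in_path = urlsplit("http://example.org/a?b")

# Escaped '?' in the path: it stays inside the path as data.
escaped_in_path = urlsplit("http://example.org/a%3Fb")

# Raw '?' inside the fragment: no delimiting role there, so it is just data.
raw_in_fragment = urlsplit("http://example.org/a#x?y")

print(raw_in_path.path, raw_in_path.query)
print(escaped_in_path.path, escaped_in_path.query)
print(raw_in_fragment.query, raw_in_fragment.fragment)
```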

Also, it is not clear to me that it is legitimate, in general, to unescape
an escape sequence, because one doesn't know the character set of the
escaped character. Only the authority that minted the URI knows that;
looking at a URI, you only get to know what octet was escaped. [I think].
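
A minimal sketch of the ambiguity, in Python with urllib.parse. The escape
sequences are chosen by me for illustration, and Latin-1 and UTF-8 are just
two candidate character sets. The escape itself yields only an octet; which
character that octet stands for depends on a character set that the URI
does not carry.

```python
from urllib.parse import unquote, unquote_to_bytes

# All the URI itself tells us: the octet 0xE9 was escaped.
octets = unquote_to_bytes("%E9")

# Read as Latin-1, that octet is U+00E9 (e with acute accent).
as_latin1 = unquote("%E9", encoding="latin-1")

# Read as UTF-8, the lone octet is not a valid sequence; with
# errors="replace" the result is U+FFFD, the replacement character.
as_utf8 = unquote("%E9", encoding="utf-8", errors="replace")

# The reverse case: two octets that are one character in UTF-8
# but two different characters in Latin-1.
two_as_utf8 = unquote("%C3%A9", encoding="utf-8")
two_as_latin1 = unquote("%C3%A9", encoding="latin-1")

print(octets, ascii(as_latin1), ascii(as_utf8))
print(ascii(two_as_utf8), ascii(two_as_latin1))
```

So an unescaper that picks a character set on its own is guessing; only the
octets are recoverable from the URI.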

[1] http://lists.w3.org/Archives/Public/www-tag/2003Mar/0070.html
[2] http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html

Received on Monday, 14 April 2003 11:14:49 UTC