Re: Section 6 Normalization and Comparison from Roy T. Fielding on 2003-04-26 (uri@w3.org from April 2003)

From: Roy T. Fielding <fielding@apache.org>
Date: Fri, 25 Apr 2003 21:40:10 -0700
To: "Williams, Stuart" <skw@hplb.hpl.hp.com>
Cc: <uri@w3.org>
Message-Id: <29E9682D-77A1-11D7-82C4-000393753936@apache.org>

> 6.2.2.2 Escape Normalisation
> ----------------------------
>
> States: "One cause is the choice of upper-case or lower-case letters for the
> hexadecimal digits within the escape sequence (e.g., "%3a" versus "%3A").
> Such sequences are always equivalent; for the sake of uniformity, URI
> generators and normalizers are strongly encouraged to use upper-case letters
> for the hex digits A-F."
>
> "... Such sequences are always equivalent;..." This seems to ignore the
> aspect of the purpose of the comparison, e.g., are such sequences equivalent
> for the purpose of naming a namespace?

Yes, they are always equivalent.  They won't necessarily be the same for
comparison, but they are equivalent (which means applications can replace
one with the other if they so desire).
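
As a rough illustration of the distinction (a sketch only; the function
name uppercase_escapes and the example URIs are made up for this note, not
taken from the draft), two URIs that differ only in the case of their
escape hex digits fail a plain string comparison, but compare equal once a
normalizer has upper-cased the hex digits:

```python
import re

# Matches one percent-escape: "%" followed by exactly two hex digits.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def uppercase_escapes(uri: str) -> str:
    """Rewrite every escape so its hex digits use upper-case A-F."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), uri)

a = "http://example.com/a%3ab"
b = "http://example.com/a%3Ab"

print(a == b)                                        # False
print(uppercase_escapes(a) == uppercase_escapes(b))  # True
```

Nothing outside an escape sequence is touched, so the rest of the URI keeps
whatever case it had.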

> Also states: "Only characters that are excluded from or reserved within the
> URI syntax must be escaped when used as data. However, some URI generators
> go beyond that and escape characters that do not require escaping, resulting
> in URIs that are equivalent to their unescaped counterparts. Such URIs can
> be normalized by unescaping sequences that represent the unreserved
> characters, as described in Section 2.3."
>
> I think that the reserved use of some characters is scoped by scheme and URI
> syntax component (scheme, authority, path, query, fragment), i.e., their
> reserved purpose is only applicable in certain fields, and so escaping should
> only be applied to a reserved character when its reserved purpose is in
> scope.

Unescaping those characters isn't worth the risk, and considerably
complicates the normalizer.
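
To show how simple the conservative approach is, here is a minimal sketch
of a normalizer that unescapes only the unreserved characters and leaves
every other escape alone, whatever component it appears in.  It assumes the
unreserved set is the letters, digits, "-", ".", "_" and "~" (the set later
published in RFC 3986; Section 2.3 of the draft under discussion may list a
different set), and the name unescape_unreserved is hypothetical:

```python
import re
import string

# ASSUMPTION: unreserved set as in RFC 3986; the draft's set may differ.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def unescape_unreserved(uri: str) -> str:
    """Decode escapes of unreserved characters; keep all others escaped."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or excluded: keep the escape, upper-casing its hex digits.
        return "%" + match.group(1).upper()
    return _ESCAPE.sub(fix, uri)

print(unescape_unreserved("/%7Euser/a%2fb%41"))  # /~user/a%2FbA
```

The escaped "/" stays escaped because the sketch never has to ask which
scheme or component it is in, and that is what keeps it small.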

> Also, in general it is not clear to me that it is legitimate to unescape the
> escape sequence, because in general one doesn't know the character set of
> the escaped character - only the authority that minted the URI knows that -
> looking at a URI you only get to know what octet was escaped. [I think].

That doesn't matter because the octet remains the same whether it is
escaped or not.  The escaping merely prevents characters from being
misinterpreted as delimiters of components or of the URI itself.
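
A small sketch of that point, using Python's urllib.parse.unquote_to_bytes
(the helper name octets_of and the sample strings are made up for this
note): reducing a component to octets gives the same result for "%41" and
"A", and no character set is consulted until something chooses to interpret
the octets as text.

```python
from urllib.parse import unquote_to_bytes

def octets_of(component: str) -> bytes:
    """Return the octets a URI component stands for, escapes decoded."""
    return unquote_to_bytes(component)

# The escaped and unescaped forms carry the same octet.
print(octets_of("%41") == octets_of("A"))  # True

# Unescaping needs no character set: %E9 is simply the octet 0xE9.
raw = octets_of("caf%E9")
print(raw)                    # b'caf\xe9'

# A charset only matters if the octets are later read as text.
print(raw.decode("latin-1"))  # café
```

Read as ISO-8859-1 the final octet is "é"; on its own it is not valid UTF-8.
Either way the octet is 0xE9, escaped or not.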

....Roy

Received on Saturday, 26 April 2003 03:31:12 UTC