RE: Section 6 Normalization and Comparison from Williams, Stuart on 2003-04-28 (uri@w3.org from April 2003)

From: Williams, Stuart <skw@hplb.hpl.hp.com>
Date: Mon, 28 Apr 2003 10:35:21 +0100
To: "'Roy T. Fielding'" <fielding@apache.org>
Cc: uri@w3.org
Message-ID: <5E13A1874524D411A876006008CD059F04A074D2@0-mail-1.hpl.hp.com>

> > 6.2.2.2 Escape Normalisation
> > ----------------------------
> >
> > States: "One cause is the choice of upper-case or lower-case letters for
> > the hexadecimal digits within the escape sequence (e.g., "%3a" versus
> > "%3A"). Such sequences are always equivalent; for the sake of uniformity,
> > URI generators and normalizers are strongly encouraged to use upper-case
> > letters for the hex digits A-F."
> >
> > "... Such sequences are always equivalent;..." this seems to ignore
> > the purpose of the comparison - e.g. are such sequences equivalent
> > for the purpose of naming a namespace?
> 
> Yes, they are always equivalent.  They won't necessarily be 
> the same for comparison, but they are equivalent (which means 
> applications can replace one with the other if they so desire).

Oh...! The Namespaces 1.1 CR [1] gives the following example (well yes,
expressed in IRI rather than URI terms):

"The IRI references below are also all different for the purposes of
identifying namespaces: 
...
  http://www.example.org/~wilbur 
  http://www.example.org/%7ewilbur 
  http://www.example.org/%7Ewilbur 
"

Which I read as making these three identifiers *not* equivalent for the
purpose of naming a namespace.

[1] http://www.w3.org/TR/xml-names11/#IRIComparison
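
To make the distinction concrete, here is a rough Python sketch of my own
(not taken from the Namespaces CR or from the draft). It compares the three
references above character for character, and then again after upper-casing
the hex digits of each escape as 6.2.2.2 encourages. The helper name
uppercase_escapes is hypothetical.

```python
import re

# The three IRI references from the Namespaces 1.1 example quoted above.
refs = [
    "http://www.example.org/~wilbur",
    "http://www.example.org/%7ewilbur",
    "http://www.example.org/%7Ewilbur",
]

def uppercase_escapes(uri):
    """Upper-case the hex digits of every %XX escape (hypothetical helper)."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), uri)

# Character-for-character comparison: every reference is a distinct string.
distinct_as_strings = len(set(refs))

# Case-normalising the escapes merges %7e with %7E, but the unescaped
# "~" form is still a different string.
distinct_after_case_norm = len({uppercase_escapes(r) for r in refs})

print(distinct_as_strings, distinct_after_case_norm)
```

Under plain string comparison the three stay distinct, which is how I read
the Namespaces text; escape case normalisation alone merges only the two
escaped forms.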

<snip/>

> 
> Unescaping those characters isn't worth the risk, and 
> considerably complicates the normalizer.

Ok... I can see that tracking the scope of the reserved/restricted purpose
of a character would complicate both escaping and unescaping, and that it
gives more scope for mistakes.

> > Also, in general it is not clear to me that it is legitimate to
> > unescape the escape sequence, because in general one doesn't know the
> > character set of the escaped character - only the authority that minted
> > the URI knows that - looking at a URI you only get to know what octet
> > was escaped. [I think].
> 
> That doesn't matter because the octet remains the same 
> whether it is escaped or not.  The escaping merely prevents 
> characters from being misinterpreted as delimiters of 
> components or of the URI itself.

I agree, it's of no consequence for octet-based comparison (as in [2]: URI
character seq -> octet seq -> original character seq).

*If* the document were to say very clearly that URI comparisons should be
based on comparing octet sequences, at least for me, that would explain your
response above - ~, %7e, %7E all contribute the same to an octet sequence.

If it already does say so, then I apologise, because so far I have missed
it.

[2] http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html#rfc.section.2.1
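
For what it's worth, the octet-based reading can be sketched in a few lines
of Python. This is only my illustration, assuming every escape is simply
decoded to its octet; it ignores the reserved-character risk Roy mentions
above, so it is not a proposal for how a normalizer should behave. The
helper name octets is hypothetical; unquote_to_bytes is the standard
library function.

```python
from urllib.parse import unquote_to_bytes

refs = [
    "http://www.example.org/~wilbur",
    "http://www.example.org/%7ewilbur",
    "http://www.example.org/%7Ewilbur",
]

def octets(uri):
    """Map a URI character sequence to an octet sequence by decoding
    every %XX escape (hypothetical helper; decodes reserved octets too)."""
    return unquote_to_bytes(uri)

# All three references collapse to a single octet sequence.
octet_forms = {octets(r) for r in refs}
print(len(octet_forms))
```

So on an octet basis ~, %7e and %7E are indistinguishable, whereas a
character-by-character comparison keeps them apart.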

> 
> ....Roy
> 

Regards

Stuart

Received on Monday, 28 April 2003 05:35:40 UTC