Re: Section 6 Normalization and Comparison

From: Roy T. Fielding <fielding@apache.org>
Date: Mon, 28 Apr 2003 03:42:37 -0700
Cc: uri@w3.org
To: "Williams, Stuart" <skw@hplb.hpl.hp.com>
Message-Id: <20B27819-7966-11D7-99B7-000393753936@apache.org>

>> Yes, they are always equivalent.  They won't necessarily be
>> the same for comparison, but they are equivalent (which means
>> applications can replace one with the other if they so desire).
>
> Oh...! The Namespaces 1.1 CR [1] gives the following example (well yes,
> expressed in IRI rather than URI terms):
>
> "The IRI references below are also all different for the purposes of
> identifying namespaces:
> ...
>   http://www.example.org/~wilbur
>   http://www.example.org/%7ewilbur
>   http://www.example.org/%7Ewilbur
> "
>
> Which I read as making these three identifiers *not* equivalent for the
> purpose of naming a namespace.
>
> [1] http://www.w3.org/TR/xml-names11/#IRIComparison

The Namespaces CR is welcome to choose CDATA comparison over URI
comparison, but it has no choice in regards to URI equivalence.  It
cannot claim they are different -- it can only claim that they are
inconsistently written.

BTW, there is no reason for the Namespaces specification to include
the quoted text above -- they are over-specifying the protocol.  What
they should say is that identifiers are assumed to be in normal form
and are not normalized for consistency prior to comparison.
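
A rough Python sketch of that distinction, for illustration only.  The
helper name normalize_unreserved and the choice of unreserved set (taken
from the later RFC 3986) are assumptions of this sketch, not anything
defined in this thread.  The three identifiers differ as character
strings, yet collapse to one once escapes of unreserved characters are
decoded and the remaining escapes are uppercased:

```python
import re

# Unreserved characters as later listed in RFC 3986 (an assumption of
# this sketch; the draft under discussion may have differed in detail).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode escapes of unreserved characters; uppercase all others."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # e.g. %7e and %7E both become "~"
        return "%" + match.group(1).upper()  # delimiters stay escaped
    return _ESCAPE.sub(fix, uri)


uris = [
    "http://www.example.org/~wilbur",
    "http://www.example.org/%7ewilbur",
    "http://www.example.org/%7Ewilbur",
]

# Plain string (CDATA-style) comparison sees three different spellings.
distinct_as_strings = len(set(uris))

# After normalization only one identifier remains.
distinct_after_normalizing = len({normalize_unreserved(u) for u in uris})
```

Escaped delimiters such as %2F are deliberately left escaped here,
since decoding them would change how the URI is parsed.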

>>> Also, in general it is not clear to me that it is legitimate to
>>> unescape the escape sequence, because in general one doesn't know the
>>> character set of the escaped character - only authority that minted
>>> the URI knows that - looking at a URI you only get to know what octet
>>> was escaped. [I think].
>>
>> That doesn't matter because the octet remains the same
>> whether it is escaped or not.  The escaping merely prevents
>> characters from being misinterpreted as delimiters of
>> components or of the URI itself.
>
> I agree, it's of no consequence for octet based comparison (as in [2]
> URI Characters seq->octet seq->Original Character seq).
>
> *If* the document were to say very clearly that URI comparisons should
> be based on comparing octet sequences, at least for me, that would
> explain your response above - ~, %7e, %7E all contribute the same to an
> octet sequence.

That is mixing normalization with comparison.  The document doesn't say
that because it isn't usually necessary -- URIs are often compared with
the assumption that they are already in normal form.  That's the whole
point of the additions for section 6.
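
As a small illustration of the two separate steps, here is a Python
sketch.  The function name same_identifier is hypothetical; the only
library call used is urllib.parse.unquote_to_bytes from the standard
library.  The first part shows that the three spellings carry the same
octets; the second shows the usual comparison, which does no decoding at
all and simply relies on both sides already being in normal form:

```python
from urllib.parse import unquote_to_bytes

paths = ["/~wilbur", "/%7ewilbur", "/%7Ewilbur"]

# Escaped or not, each spelling yields the same octet sequence.
# Note: unquote_to_bytes decodes every escape, including delimiters
# such as %2F, so it is a demonstration here, not a normalizer.
octets = {unquote_to_bytes(p) for p in paths}


def same_identifier(a: str, b: str) -> bool:
    """Compare on the assumption that both are already in normal form."""
    return a == b


# Without prior normalization, inconsistent spellings compare unequal.
normal_vs_escaped = same_identifier(paths[0], paths[1])
```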

....Roy
Received on Monday, 28 April 2003 06:40:09 GMT
