Re: IRI and URI comparisions (was Re: charmodReview-17, LC-k lyne26, LC-kopecky5, LC-kopecky6, LC-booth3, LC-schema17) from Martin Duerst on 2004-03-28 (www-archive@w3.org from March 2004)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 28 Mar 2004 12:44:36 -0500
To: "Williams, Stuart" <skw@hp.com>, Chris Lilley <chris@w3.org>
Cc: www-archive@w3.org
Message-Id: <4.2.0.58.J.20040328110811.05adc398@localhost>

Hello Stuart,

At 08:54 04/03/26 +0000, Williams, Stuart wrote:

>[trimmed this down to just you and Martin]
>
> > I am saying that one should either compare IRIs, or
> > canonicalize the IRIs to URIs and compare the fully
> > canonicalized forms (ie, fully hexified and upper case, not
> > lower, for the hex digits A to F).
>
>So... if you do a character-by-character comparision for on two IRI and find
>them to be different - as a design requirement on the canonicalize IRI to
>URI mapping - would you expect the canonicalize URI to be different?
>
>ie.
>
>   forall x,y in IRI: not( x==y ) => not( iriToUri(x) == iriToUri(x) )
>
>   where == is character-by-character comparison.

Well, I would, but I'm probably too used to this stuff to count :-).

>Martin observed that another property of the current mapping is that
>
>   forall x in IRI: iriToUri(x) == iriToUri(iriToUri(x))
>
>which makes it impossible to achieve the first property - its easy to find a
>counter example where x and iriToUri(x) are different
>character-by-character.
>
>I don't know if this second property is a design requirement (URI map onto
>themselves).
>
>If one regards IRI and URI as distinct sets - ie. the identifiers that
>satisfy the generic URI syntax are URI and *not* IRI. IRI are any other
>identifiers that satisfy the current IRI syntax. If there were a reserved
>character in URI and IRI syntax that were only introduced unescaped into an
>URI by the IRI->URI mapping - then the IRI would map into an otherwised
>unused part of URI space. If the mapping were only applied to IRI (and not
>to things that were already URI) then it wouldn't be applied recursively,
>and... it may also be invertable.

It's somewhat difficult to speak about design requirements, because
although we looked at a lot of designs initially, that was a long time
ago, we mainly did it by trying to come up with somewhat reasonable
designs, and then picking the best one (given various constrains), rather
than doing something like an explicit requirements document.

But thinking about it, it's rather clear for me that it is a design
requirement. While language lawyers could probably live without it,
having a function that does not convert URIs into themselves would
create a lot of hopeless confusion for average users. Having URIs
not be a subset of IRIs would mean that you always have to use a
union type for anything of practical value. Also, with the current
design, there are different upgrade strategies that can be used
for different fields. For some cases, e.g. XML namespaces, it turns
out that IRIs are essentially already implemented because XML attributes
accept Unicode. So the actual upgrade strategy is just to let it happen.
In other cases, for example http, some more explicit conversions are
necessary. The fact that this design allows such differences is
an additional benefit, and if you want, a design requirement.

Regards,    Martin.

Received on Sunday, 28 March 2004 13:06:32 UTC