IRI and URI comparisons (was Re: charmodReview-17, LC-k lyne26, LC-kopecky5, LC-kopecky6, LC-booth3, LC-schema17) from Williams, Stuart on 2004-03-26 (www-archive@w3.org from March 2004)

From: Williams, Stuart <skw@hp.com>
Date: Fri, 26 Mar 2004 08:54:09 -0000
To: Chris Lilley <chris@w3.org>
Cc: Martin Duerst <duerst@w3.org>, www-archive@w3.org
Message-ID: <E864E95CB35C1C46B72FEA0626A2E80801EA1A07@0-mail-br1.hpl.hp.com>

Hello Chris,

[trimmed this down to just you and Martin]

> I am saying that one should either compare IRIs, or 
> canonicalize the IRIs to URIs and compare the fully 
> canonicalized forms (ie, fully hexified and upper case, not 
> lower, for the hex digits A to F).

So... suppose you do a character-by-character comparison of two IRIs and find
them to be different. As a design requirement on the canonicalizing IRI-to-URI
mapping, would you expect the canonicalized URIs to be different too?

ie.

  forall x,y in IRI: not( x==y ) => not( iriToUri(x) == iriToUri(y) )

  where == is character-by-character comparison.

Martin observed that another property of the current mapping is that

  forall x in IRI: iriToUri(x) == iriToUri(iriToUri(x))

which makes it impossible to achieve the first property: it's easy to find a
counterexample where x and y = iriToUri(x) differ character-by-character,
yet both map to the same URI.
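
To make the counterexample concrete, here is a minimal Python sketch. It
assumes a simplified stand-in for the mapping (iri_to_uri is a hypothetical
name, not the IRI draft's full algorithm) that percent-encodes each non-ASCII
character as its UTF-8 bytes with upper-case hex digits and leaves ASCII,
including existing %XX escapes, untouched. The example.org identifier is
made up for illustration.

```python
def iri_to_uri(iri: str) -> str:
    """Simplified stand-in for the IRI->URI mapping (an assumption, not the
    draft's full algorithm): percent-encode every non-ASCII character as its
    UTF-8 bytes, upper-case hex; leave ASCII characters as they are."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


x = "http://example.org/r\u00e9sum\u00e9"  # an IRI with two non-ASCII characters
y = iri_to_uri(x)                          # its URI form, which is all ASCII

print(x != y)                          # the two differ character-by-character
print(iri_to_uri(y) == y)              # second property: the mapping is idempotent
print(iri_to_uri(x) == iri_to_uri(y))  # so the first property fails for this pair
```

Under these assumptions any x containing a non-ASCII character gives such a
pair, since y is already in the image of the mapping.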

I don't know whether this second property is a design requirement (i.e. that
URIs map onto themselves).

Suppose one regards IRI and URI as distinct sets, i.e. the identifiers that
satisfy the generic URI syntax are URIs and *not* IRIs, and IRIs are any other
identifiers that satisfy the current IRI syntax. If there were a reserved
character in the URI and IRI syntax that was only ever introduced unescaped
into a URI by the IRI->URI mapping, then IRIs would map into an otherwise
unused part of URI space. If the mapping were applied only to IRIs (and not
to things that were already URIs), then it wouldn't be applied recursively,
and... it might also be invertible.
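
A rough Python sketch of that idea, purely illustrative: MARKER,
iri_to_uri_marked and uri_to_iri_marked are hypothetical names, and "!" is
only a placeholder for whichever reserved character might be set aside. It
assumes the marker never appears unescaped in an ordinary URI or IRI, and
treats "already a URI" as "all ASCII".

```python
import re

MARKER = "!"  # hypothetical reserved character, assumed unused elsewhere


def iri_to_uri_marked(identifier: str) -> str:
    """Map an IRI into the marker-tagged part of URI space.
    Identifiers that are already URIs (all ASCII here) are left alone,
    so the mapping is never applied recursively."""
    if identifier.isascii():
        return identifier
    out = []
    for ch in identifier:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"{MARKER}{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


# One or more consecutive MARKER + two-hex-digit groups.
_RUN = re.compile(rf"(?:{re.escape(MARKER)}[0-9A-F]{{2}})+")


def uri_to_iri_marked(uri: str) -> str:
    """Inverse mapping: decode each run of marker escapes as UTF-8."""
    def decode(match):
        hex_digits = match.group(0).replace(MARKER, "")
        return bytes.fromhex(hex_digits).decode("utf-8")
    return _RUN.sub(decode, uri)


iri = "http://example.org/r\u00e9sum\u00e9"
plain_uri = "http://example.org/r%C3%A9sum%C3%A9"

mapped = iri_to_uri_marked(iri)
print(mapped != iri_to_uri_marked(plain_uri))  # distinct inputs stay distinct
print(uri_to_iri_marked(mapped) == iri)        # and the IRI can be recovered
```

With the marker in place, the percent-escaped URI and the IRI no longer
collapse onto the same string, at the cost of giving up the property that the
mapped IRI equals the conventional %XX form.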

[Just thinking aloud]

Stuart.
--

> -----Original Message-----
> From: Chris Lilley [mailto:chris@w3.org] 
> Sent: 26 March 2004 03:32
> To: Williams, Stuart
> Cc: tag@w3.org; Martin Duerst
> Subject: Re: [Minutes] 22 March 2004 TAG teleconf 
> (charmodReview-17, LC-k lyne26, LC-kopecky5, LC-kopecky6, 
> LC-booth3, LC-schema17)
> 
> On Thursday, March 25, 2004, 1:52:46 PM, Stuart wrote:
> 
> WS> Hello Chris,
> 
> WS> [Apologies for holding a technical discussion here on tag... if it's 
> WS> going to go on we should move it elsewhere -  public-iri@w3.org seems 
> WS> most appropriate.]
> 
> >> Which is why it says to keep the character (in this case ~) as a 
> >> character. Once you start escaping it then there are escaped and 
> >> non-escaped forms and upper and lower case forms ....  so the IRI 
> >> spec does the right thing here.
> 
> WS> Hmmm... so on account of the "MUST NOT" above, which I take to be 
> WS> "the right thing" from the IRI spec, are you saying that there are 
> WS> IRI that cannot be mapped to URI?
> 
> Not at all.
> 
> I am saying that one should either compare IRIs, or 
> canonicalize the IRIs to URIs and compare the fully 
> canonicalized forms (ie, fully hexified and upper case, not 
> lower, for the hex digits A to F).
> 
> -- 
>  Chris Lilley                    mailto:chris@w3.org
>  Chair, W3C SVG Working Group
>  Member, W3C Technical Architecture Group
>

Received on Friday, 26 March 2004 04:33:03 UTC