RE: Character-by-Character Distinct IRI-> Char-by-char equivale nt URI (issue #IRIURIcharequiv-20) from Williams, Stuart on 2004-03-25 (public-iri@w3.org from March 2004)

From: Williams, Stuart <skw@hp.com>
Date: Thu, 25 Mar 2004 23:29:20 -0000
To: Martin Duerst <duerst@w3.org>
Cc: public-iri@w3.org
Message-ID: <E864E95CB35C1C46B72FEA0626A2E80801EA19FB@0-mail-br1.hpl.hp.com>
Hello Martin,

> I therefore am tentatively closing this issue with 'no action needed'.
> If you think that some change to the document is needed, 
> please say so (ideally with some actual proposed text).

The short answer is yes... I agree.

I chewed on this of quite a time. Even came with and idea to address your
challenge (which I'll mention below) but I think it amounts the same thing
as the MUST NOT in section 5.1:

   As an example,
   http://example.org/~user, http://example.org/%7euser, and
   http://example.org/%7Euser are not equivalent under this definition.
   In such a case, the comparison function MUST NOT map IRIs to URIs,
   because such a mapping would create additional spurious equivalences.

It was the "spurious equivalences" phrase that triggered my concern.

My idea to address your challenge, which I'm not overly committed to... and
I think amounts the the same as the MUST NOT above is this:

Regard the lexical spaces of URI and IRI as disjoint ie. if the identifier
string is valid under the URI grammar... it's a URI and *not* an IRI. IRI
are then the remaining strings that match the IRI grammar. Effectively this
forces IRI to be only those identifier strings that actually contain
characters that are not 'legal' in URI. Character by character IRI/IRI and
URI/URI comparision work as normal. Character-by-character IRI/URI
comparison always yield not-equal (because URI and IRI are disjoint). 

Thanks,

Stuart
--

> -----Original Message-----
> From: public-iri-request@w3.org 
> [mailto:public-iri-request@w3.org] On Behalf Of Martin Duerst
> Sent: 25 March 2004 20:21
> To: Williams, Stuart; Williams, Stuart; Williams, Stuart; Ian 
> B. Jacobs
> Cc: public-iri@w3.org
> Subject: Re: Character-by-Character Distinct IRI-> 
> Char-by-char equivalent URI (issue #IRIURIcharequiv-20)
> 
> 
> Hello Stuart,
> 
> Many thanks for forwarding this discussion to public-iri. I 
> have listed this as an issue at 
> http://www.w3.org/International/iri-edit/Overview.html#IRIURIc
harequiv-20.
> 
> At 16:38 04/03/25 +0000, Williams, Stuart wrote:
> 
> >[Shifting discussion to public-iri from tag@w3.org]
> >
> > > >So to be clear... the IRI draft section 5.1 seems to 
> suggest that 
> > > >it is possible for the IRI->URI mapping to generate superious 
> > > >character-by-character equivalences (between the resulting URI) 
> > > >from distinct IRI.
> > > >
> > > >If that can't in fact happen, then maybe the wording in 5.1 that 
> > > >suggest that it could should be changed.
> > >
> > > It can happen. The IRI http://www.example.org/ros&eacute; is not 
> > > character-by-character equivalent to the URI/IRI 
> > > http://www.example.org/ros%C3%A9, but when you convert the IRI 
> > > http://www.example.org/ros&eacute; to an URI, you get the URI 
> > > http://www.example.org/ros%C3%A9.
> >
> >So... what other IRI, distinct from 
> http://www.example.org/ros&eacute;
> >yields URI http://www.example.org/ros%C3%A9.
> >
> >Oh... yes I see, the IRI http://www.example.org/ros%C3%A9 
> (because in 
> >your view URI are IRI) maps to the URI 
> http://www.example.org/ros%C3%A9.
> >aaaaggghhhh.
> >
> >This gets us back to the question of whether IRI are 
> intended 1) as a 
> >replacement for URI as 'the' identifier for the Web, or 2) as a 
> >distinct set of a internationalised identifiers (distinct 
> value space, 
> >overlapping lexical space wrt URI) which are mapped to URI 
> for use in 
> >protocol elements that require URI
> 
> My answer here is:
> - both, depending on the situation
> - this might look bothering, but it really isn't much of an issue
> 
> 
> >- URI then remain 'the' identifier on the Web and IRI serve as an 
> >internationalisation friendly form for inclusion in documents and UI 
> >artifacts (from which URI are generated when necessary).
> 
> Once most of what the users will see are IRIs, I'm not sure 
> it's appropriate to call them 'UI artifacts'. Users will just 
> see them, and probably talk about them as URLs, not even 
> URIs, because they haven't yet learned that the 'politically 
> correct' name is now URI.
> 
> 
> >View 2 treats URI and IRI as different types even though 
> there lexical 
> >forms overlap. In this view I'm not sure quite how the IRI
> >http://www.example.org/ros%C3%A9 would arise.
> 
> Hopefully not too often. But somebody can just write it down, 
> as it is written down in this email. You are right that it 
> should be very rare, but from a spec point of view, the 
> question is how to disallow it. We can't disallow 
> percent-escaping because that would disallow 
> http://www.example.org/ros%E9, which has to be legal for 
> backwards compatibility purposes (and is different from the above).
> 
> 
> >I accept that simply looking at the lexical form leaves an ambiguity 
> >about whether one is looking at a URI or an IRI.
> 
> Yes, of course. And often (e.g. on a napkin or a business 
> card), that's the only thing you have.
> 
> 
> > > So what the IRI draft says is that if you need character-by- 
> > > character equivalence, for example for XML namespaces (e.g.
> > > XSLT) or for RDF to match nodes in a graph, don't convert 
> to an URI.
> >
> >ie. you do character-by-character comparison of two IRI.
> 
> Yes.
> 
> 
> > > >[Also, I'm not entirely comfortable with all "URIs are 
> IRIs" - but 
> > > >I'll dwell on that - this discomfort is at the level of 
> whether the 
> > > >set of real numbers and the set of integers are disjoint 
> or not - 
> > > >ie. they are different types (value-space) although the 
> may share 
> > > >lexical space presentation.]
> > >
> > > I'm not totally familiar with this kind of philosophical 
> discussion.
> > > I tend to think about this from an operational view. Most 
> if not all 
> > > operations that you can do on 'integer 2' you can do on 'real 2', 
> > > with the same results. Same the other way round.
> > > As far as I know, all the relevant operations you can do with the 
> > > URI http://www.example.org/ros%C3%A9 you can do with the IRI 
> > > http://www.example.org/ros%C3%A9, and vice versa, with the same 
> > > result. So trying to distinguish them will only 
> complicate the spec, 
> > > without adding anything other than confusing most readers.
> >
> >But you are demonstrating a situation where for two IRI x and y:
> >
> >         (not(x == y)) && (iriToUri(x) == iriToUri(y))
> >
> >operationally this is also confusing.
> 
> Looks like it may create problems. But up to now, I haven't 
> heard about any. This situation is very similar to a 
> situation purely on URIs. It is very easy to construct two 
> URIs that are guaranteed to always resolve to the same thing 
> but that don't compare character-by-character equivalent.
> 
> For example http://www.example.org/ (1) and 
> HTTP://www.example.org (2).
> The URI spec says that these are equivalent, but they 
> identify two different namespaces, and they identify two 
> different nodes in an RDF graph. So we could indeed create 
> two different types, let's call them aURI (the ones that are 
> only equivalent if they are character-by-character 
> equivalent) and bURI (they are equivalent in some more cases, 
> as far as the URI spec allows).
> (1) and (2) are the same bURI, but a different aURI. And you 
> can't even know whether they are aURIs or bURIs by just 
> looking at them. So which ones are 'the' identifier for the Web?
> aURIs or bURIs?
> 
> 
> So in summary:
> - This issue is not a problem because very similar issues also
>    exist with URIs themselves.
> - Given that IRIs allow (many) more characters than URIs, but we
>    want IRIs that are also URIs (or if you prefer: that also look
>    like URIs) to map to themselves, it is impossible to preserve
>    the same equivalence classes for character-by-character on IRIs
>    and on URIs. So there does not seem to be a solution
>    (if you know about one, please tell us).
> - There is already a clear warning in the draft about this, which
>    seems to have been clear enough to catch your attention.
> 
> I therefore am tentatively closing this issue with 'no action needed'.
> If you think that some change to the document is needed, 
> please say so (ideally with some actual proposed text).
> 
> 
> Regards,     Martin.
> 
>
Received on Thursday, 25 March 2004 18:29:51 UTC