Character-by-Character Distinct IRI-> Char-by-char equivalent URI from Williams, Stuart on 2004-03-25 (public-iri@w3.org from March 2004)

From: Williams, Stuart <skw@hp.com>
Date: Thu, 25 Mar 2004 16:38:31 -0000
To: Martin Duerst <duerst@w3.org>, "Williams, Stuart" <skw@hp.com>, "Williams, Stuart" <skw@hp.com>, "Ian B. Jacobs" <ij@w3.org>
Cc: public-iri@w3.org
Message-ID: <E864E95CB35C1C46B72FEA0626A2E80801EA19CE@0-mail-br1.hpl.hp.com>
[Shifting discussion to public-iri from tag@w3.org]

> >So to be clear... the IRI draft section 5.1 seems to suggest that it is 
> >possible for the IRI->URI mapping to generate superious 
> >character-by-character equivalences (between the resulting URI) from 
> >distinct IRI.
> >
> >If that can't in fact happen, then maybe the wording in 5.1 that 
> >suggest that it could should be changed.
> 
> It can happen. The IRI http://www.example.org/ros&eacute; is 
> not character-by-character equivalent to the URI/IRI 
> http://www.example.org/ros%C3%A9, but when you convert the 
> IRI http://www.example.org/ros&eacute; to an URI, you get the 
> URI http://www.example.org/ros%C3%A9.

So... what other IRI, distinct from http://www.example.org/ros&eacute;
yields URI http://www.example.org/ros%C3%A9.

Oh... yes I see, the IRI http://www.example.org/ros%C3%A9 (because in your
view URI are IRI) maps to the URI http://www.example.org/ros%C3%A9.
aaaaggghhhh.

This gets us back to the question of whether IRI are intended 1) as a
replacement for URI as 'the' identifier for the Web, or 2) as a distinct set
of a internationalised identifiers (distinct value space, overlapping
lexical space wrt URI) which are mapped to URI for use in protocol elements
that require URI - URI then remain 'the' identifier on the Web and IRI serve
as an internationalisation friendly form for inclusion in documents and UI
artifacts (from which URI are generated when necessary).

View 2 treats URI and IRI as different types even though there lexical forms
overlap. In this view I'm not sure quite how the IRI
http://www.example.org/ros%C3%A9 would arise.

I accept that simply looking at the lexical form leaves an ambiguity about
whether one is looking at a URI or an IRI.

> So what the IRI draft says is that if you need character-by- 
> character equivalence, for example for XML namespaces (e.g. 
> XSLT) or for RDF to match nodes in a graph, don't convert to an URI.

ie. you do character-by-character comparison of two IRI.

> >[Also, I'm not entirely comfortable with all "URIs are IRIs" - but I'll 
> >dwell on that - this discomfort is at the level of whether the set of 
> >real numbers and the set of integers are disjoint or not - ie. they are 
> >different types (value-space) although the may share lexical space 
> >presentation.]
> 
> I'm not totally familiar with this kind of philosophical discussion.
> I tend to think about this from an operational view. Most if 
> not all operations that you can do on 'integer 2' you can do 
> on 'real 2', with the same results. Same the other way round.
> As far as I know, all the relevant operations you can do with 
> the URI http://www.example.org/ros%C3%A9 you can do with the 
> IRI http://www.example.org/ros%C3%A9, and vice versa, with 
> the same result. So trying to distinguish them will only 
> complicate the spec, without adding anything other than 
> confusing most readers.

But you are demonstrating a situation where for two IRI x and y:

	(not(x == y)) && (iriToUri(x) == iriToUri(y))

operationally this is also confusing.

Stuart
--

> -----Original Message-----
> From: Martin Duerst [mailto:duerst@w3.org] 
> Sent: 25 March 2004 15:18
> To: Williams, Stuart; Williams, Stuart; Ian B. Jacobs
> Cc: tag@w3.org
> Subject: RE: [Minutes] 22 March 2004 TAG teleconf 
> (charmodReview-17, LC-klyne26, LC-kopecky5, LC-kopecky6, 
> LC-booth3, LC-schema17)
> 
> Hello Stuart,
> 
> At 12:46 04/03/25 +0000, Williams, Stuart wrote:
> > > From: Martin Duerst [mailto:duerst@w3.org]
> 
> > > What you are looking for, namely that with respect to 
> > > character-by-character equivalence, the IRI->URI mapping would have 
> > > the property that distinct IRI map to distinct URI, is basically 
> > > impossible.
> > >
> > > For example, you have (in HTML notation)
> > >     http://www.example.org/ros&eacute;
> > > as an IRI, but because all URIs are also IRIs, you also have
> > >     http://www.example.org/ros%C3%A9 (and three other case variants) 
> > > as an IRI. You can't have them be the same and different at the same 
> > > time.
> >
> >If by "other case variants" you're refering to variants that 
> end "%c3%A9"
> 
> Yes.

> >etc, that's not my concern since the IRI spec is clear that the mapping 
> >uses uppercase a-f. In anycase that would address suprious differences 
> >and the concern was triggered by the phrase "spurious equivalence".
> >
> >So to be clear... the IRI draft section 5.1 seems to suggest that it is 
> >possible for the IRI->URI mapping to generate superious 
> >character-by-character equivalences (between the resulting URI) from 
> >distinct IRI.
> >
> >If that can't in fact happen, then maybe the wording in 5.1 that 
> >suggest that it could should be changed.
> 
> It can happen. The IRI http://www.example.org/ros&eacute; is 
> not character-by-character equivalent to the URI/IRI 
> http://www.example.org/ros%C3%A9, but when you convert the 
> IRI http://www.example.org/ros&eacute; to an URI, you get the 
> URI http://www.example.org/ros%C3%A9.
> 
> So what the IRI draft says is that if you need character-by- 
> character equivalence, for example for XML namespaces (e.g. 
> XSLT) or for RDF to match nodes in a graph, don't convert to an URI.
> 
> 
> 
> >BTW "~" seems to be amongst the unreserved characters in RFC2396bis and 
> >so should not be %-encoded anyway - so maybe the example in the draft 
> >IRI spec is bogus.
> 
> It was originally not allowed in URIs. The newest URI draft 
> says that it should not be escaped, but it doesn't disallow 
> %7E or %7e.
> 
> 
> >[Also, I'm not entirely comfortable with all "URIs are IRIs" - but I'll 
> >dwell on that - this discomfort is at the level of whether the set of 
> >real numbers and the set of integers are disjoint or not - ie. they are 
> >different types (value-space) although the may share lexical space 
> >presentation.]
> 
> I'm not totally familiar with this kind of philosophical discussion.
> I tend to think about this from an operational view. Most if 
> not all operations that you can do on 'integer 2' you can do 
> on 'real 2', with the same results. Same the other way round.
> As far as I know, all the relevant operations you can do with 
> the URI http://www.example.org/ros%C3%A9 you can do with the 
> IRI http://www.example.org/ros%C3%A9, and vice versa, with 
> the same result. So trying to distinguish them will only 
> complicate the spec, without adding anything other than 
> confusing most readers.
> 
> Regards,  Martin.
> 
> 
> >Stuart
> >--
> >
> > > BTW, character-by-character equivalence on IRIs is what xml 
> > > namespaces as well as RDF specify. The former is well implemented 
> > > (with IRIs, not URIs, despite the fact that the 1.0 namespace rec 
> > > says URIs) in XSLT processors. The later has tests associated, in 
> > > http://www.w3.org/TR/2004/REC-rdf-testcases-20040210/
> > > at "Issue: rdf-charmod-uris has 4 tests". Late last year, 
> as shown at:
> > > http://www.w3.org/2003/11/results/rdf-core-tests.html,
> > > one test had 5 pass, one undecided, one fail, one test 
> had 7 pass, 
> > > and two tests had two pass and two undecided, all of 
> these very much 
> > > in similar patterns to other, unrelated tests.
> > > Yesterday, I tested with Tim with cwm, which also 
> conformed to the 
> > > RDF REC (although at an earlier point, it didn't).
> >
> >I think this is also a different matter concerned with IRI<->IRI 
> >comparison testing rather than comparison of URI resulting 
> from IRI->URI mapping.
> >
> > >
> > > Regards,    Martin.
> >
> >Stuart
> >--
> >
> > >
> > > At 14:15 04/03/24 +0000, Williams, Stuart wrote:
> > > >Ian,
> > > >
> > > >Under 2.2.1...
> > > >
> > > >[Stuart]
> > > >http://example.org/~user, http://example.org/%7euser, and 
> > > >http://example.org/%7Euser are not equivalent under this
> > > definition. In
> > > >such a case, the comparison function MUST NOT map IRIs to
> > > URIs, because
> > > >such a mapping would create additional spurious equivalences."
> > > >
> > > >...is part of a /me comment that went wrong. I'd be 
> happy for the 
> > > >ommitted /me line(s) that preceded these to be included in
> > > the record
> > > >(it would then make more sense) or for these lines to be 
> struck....
> > > >
> > > >[Stuart] I was disturbed to see the following in section 5.1
> > > of the IRI
> > > >Draft:
> > > >"As an example, http://example.org/~user,
> > > http://example.org/%7euser,
> > > >and http://example.org/%7Euser are not equivalent under this 
> > > >definition. In such a case, the comparison function MUST NOT
> > > map IRIs
> > > >to URIs, because such a mapping would create additional
> > > spurious equivalences."
> > > >
> > > >Not stated on IRC, but for info... what I found distrubing
> > > is that with
> > > >respect to character-by-character equivalence the IRI -> URI 
> > > >mapping does not have the property that distinct IRI map to 
> > > >distinct URI, as evidenced by the quoted example from the draft.
> > > >
> > > >Regards
> > > >
> > > >Stuart
> > > >--
>
Received on Thursday, 25 March 2004 11:42:12 UTC