Re: Character-by-Character Distinct IRI-> Char-by-char equivalent URI (issue #IRIURIcharequiv-20)

Hello Stuart,

Many thanks for forwarding this discussion to public-iri. I have listed
this as an issue at
http://www.w3.org/International/iri-edit/Overview.html#IRIURIcharequiv-20.

At 16:38 04/03/25 +0000, Williams, Stuart wrote:

>[Shifting discussion to public-iri from tag@w3.org]
>
> > >So to be clear... the IRI draft section 5.1 seems to suggest that it is
> > >possible for the IRI->URI mapping to generate superious
> > >character-by-character equivalences (between the resulting URI) from
> > >distinct IRI.
> > >
> > >If that can't in fact happen, then maybe the wording in 5.1 that
> > >suggest that it could should be changed.
> >
> > It can happen. The IRI http://www.example.org/rosé is
> > not character-by-character equivalent to the URI/IRI
> > http://www.example.org/ros%C3%A9, but when you convert the
> > IRI http://www.example.org/rosé to an URI, you get the
> > URI http://www.example.org/ros%C3%A9.
>
>So... what other IRI, distinct from http://www.example.org/rosé
>yields URI http://www.example.org/ros%C3%A9.
>
>Oh... yes I see, the IRI http://www.example.org/ros%C3%A9 (because in your
>view URI are IRI) maps to the URI http://www.example.org/ros%C3%A9.
>aaaaggghhhh.
>
>This gets us back to the question of whether IRI are intended 1) as a
>replacement for URI as 'the' identifier for the Web, or 2) as a distinct set
>of a internationalised identifiers (distinct value space, overlapping
>lexical space wrt URI) which are mapped to URI for use in protocol elements
>that require URI

My answer here is:
- both, depending on the situation
- this might look bothering, but it really isn't much of an issue


>- URI then remain 'the' identifier on the Web and IRI serve
>as an internationalisation friendly form for inclusion in documents and UI
>artifacts (from which URI are generated when necessary).

Once most of what the users will see are IRIs, I'm not sure
it's appropriate to call them 'UI artifacts'. Users will
just see them, and probably talk about them as URLs, not even
URIs, because they haven't yet learned that the 'politically
correct' name is now URI.


>View 2 treats URI and IRI as different types even though there lexical forms
>overlap. In this view I'm not sure quite how the IRI
>http://www.example.org/ros%C3%A9 would arise.

Hopefully not too often. But somebody can just write it down,
as it is written down in this email. You are right that it should
be very rare, but from a spec point of view, the question is how
to disallow it. We can't disallow percent-escaping because that
would disallow http://www.example.org/ros%E9, which has to be
legal for backwards compatibility purposes (and is different
from the above).


>I accept that simply looking at the lexical form leaves an ambiguity about
>whether one is looking at a URI or an IRI.

Yes, of course. And often (e.g. on a napkin or a business card),
that's the only thing you have.


> > So what the IRI draft says is that if you need character-by-
> > character equivalence, for example for XML namespaces (e.g.
> > XSLT) or for RDF to match nodes in a graph, don't convert to an URI.
>
>ie. you do character-by-character comparison of two IRI.

Yes.


> > >[Also, I'm not entirely comfortable with all "URIs are IRIs" - but I'll
> > >dwell on that - this discomfort is at the level of whether the set of
> > >real numbers and the set of integers are disjoint or not - ie. they are
> > >different types (value-space) although the may share lexical space
> > >presentation.]
> >
> > I'm not totally familiar with this kind of philosophical discussion.
> > I tend to think about this from an operational view. Most if
> > not all operations that you can do on 'integer 2' you can do
> > on 'real 2', with the same results. Same the other way round.
> > As far as I know, all the relevant operations you can do with
> > the URI http://www.example.org/ros%C3%A9 you can do with the
> > IRI http://www.example.org/ros%C3%A9, and vice versa, with
> > the same result. So trying to distinguish them will only
> > complicate the spec, without adding anything other than
> > confusing most readers.
>
>But you are demonstrating a situation where for two IRI x and y:
>
>         (not(x == y)) && (iriToUri(x) == iriToUri(y))
>
>operationally this is also confusing.

Looks like it may create problems. But up to now, I haven't heard
about any. This situation is very similar to a situation purely
on URIs. It is very easy to construct two URIs that are guaranteed
to always resolve to the same thing but that don't compare
character-by-character equivalent.

For example http://www.example.org/ (1) and HTTP://www.example.org (2).
The URI spec says that these are equivalent, but they identify
two different namespaces, and they identify two different nodes
in an RDF graph. So we could indeed create two different types,
let's call them aURI (the ones that are only equivalent if they
are character-by-character equivalent) and bURI (they are
equivalent in some more cases, as far as the URI spec allows).
(1) and (2) are the same bURI, but a different aURI. And you
can't even know whether they are aURIs or bURIs by just looking
at them. So which ones are 'the' identifier for the Web?
aURIs or bURIs?


So in summary:
- This issue is not a problem because very similar issues also
   exist with URIs themselves.
- Given that IRIs allow (many) more characters than URIs, but we
   want IRIs that are also URIs (or if you prefer: that also look
   like URIs) to map to themselves, it is impossible to preserve
   the same equivalence classes for character-by-character on IRIs
   and on URIs. So there does not seem to be a solution
   (if you know about one, please tell us).
- There is already a clear warning in the draft about this, which
   seems to have been clear enough to catch your attention.

I therefore am tentatively closing this issue with 'no action needed'.
If you think that some change to the document is needed, please say
so (ideally with some actual proposed text).


Regards,     Martin.

Received on Thursday, 25 March 2004 15:17:47 UTC