Re: URIEquivalence-15 and IRIs from Misha.Wolf@reuters.com on 2002-06-03 (www-tag@w3.org from June 2002)

From: <Misha.Wolf@reuters.com>
Date: Mon, 03 Jun 2002 16:08:19 +0100
To: Tim Bray <tbray@textuality.com>
Cc: www-tag@w3.org, w3c-i18n-ig@w3.org
Message-ID: <T5b43268486c407b707188@reuters.com>

Hi Tim,

On 03/06/2002 05:30:48 Tim Bray wrote:
> Martin Duerst wrote:
>
> > Here is my input on the issue of URI/IRI equivalence, for
> > your consideration. This is a very important issue for IRIs.
>
> Thank you Martin.  I just spent a lot of time reading the IRI draft and
> then Martin's email.
[...]

> Having said that, as Martin points out
>
> > The core choices from the view of IRIs are:
> >
> > a) 'character-by-character equivalence'
> >    (taking a %hh-escaping as three characters)
> > b) '%hh-escape equivalence' (equivalencing %hh-escape
> >    sequences with the characters (based on US-ASCII/UTF-8)
> >    they stand for (except for reserved characters!)
>
> On the face of it, (b) seems like the only sensible thing to do (among
> other things /%ba%be better be the same as /%BA%BE)... I'm feeling
> stupid; I'm sure I'm missing something because Martin isn't jumping up
> and down saying "do it this way!"  He seems to favor it, but he's being
> very careful... why?

There are several reasons to be careful.  The decisions taken on these
matters:
-  will be with us for a long time :-)
-  will affect URI matching in XML Namespaces
-  will affect URI matching in RDF

The matching algorithm must be very clearly specified, so we don't end
up, as we did with Namespaces, with differing interpretations.

> Hmm
>
>   http://example.com/a%2fb
>
> is not the same as
>
>   http://example.com/a/b
>
> which is I assume why (b) says "except for reserved chars".  So what
> other gotchas are there?
>
> As for the following four paragraphs, I'm sorry, I just don't understand
> them without some examples to make the issues (whatever they are)
> concrete.  If I'm dragging down the average level of insight in this
> discussion by being dense on this, I apologize, but in this case I
> regard myself as a proxy for the non-URI/i18n-guru who will have to
> fight through this.
>
> > The difference is more important for IRIs because the mapping
> > from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
> > some protocols/formats/APIs will support IRIs whereas others
> > (older/lower level) may not, having both escaped and unescaped
> > versions of the same IRI is probably more frequent than for
> > URIs (where %7E / ~ is the only example I have seen).
> > This is a strong argument for %hh-escape equivalence.

Escaped and non-escaped versions of IRIs will both exist (typically, at
different levels in the protocol stack).  A desire for maximising the
successful matching of these different versions would be a factor
supporting %hh-escape equivalence.
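To make option (b) concrete, here is a minimal sketch of one possible
reading of %hh-escape equivalence.  It is not taken from the IRI draft.
The names canonical_form and hh_escape_equivalent are made up for
illustration, and the choice to unescape only the RFC 2396 "unreserved"
characters (so that reserved characters such as "/" stay escaped) is my
assumption about what "except for reserved characters" would mean in
practice.

```python
import re

# Unreserved characters per RFC 2396: alphanumerics plus the "mark" set.
UNRESERVED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-_.!~*'()"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalise_escape(match):
    """Unescape unreserved ASCII; otherwise keep the escape, uppercased."""
    octet = int(match.group(1), 16)
    if octet < 0x80 and chr(octet) in UNRESERVED:
        return chr(octet)
    return "%{:02X}".format(octet)


def canonical_form(iri):
    """Assumed canonical form used only for comparison."""
    # Step 1: %hh-escape non-ASCII characters as UTF-8 octets.
    parts = []
    for ch in iri:
        if ord(ch) < 0x80:
            parts.append(ch)
        else:
            parts.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    # Step 2: normalise every %hh-escape that is now in the string.
    return _ESCAPE.sub(_normalise_escape, "".join(parts))


def hh_escape_equivalent(a, b):
    """Compare two IRIs/URIs under the assumed %hh-escape equivalence."""
    return canonical_form(a) == canonical_form(b)
```

Under these assumptions /%ba%be matches /%BA%BE and %7E matches ~, while
http://example.com/a%2fb still differs from http://example.com/a/b.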

> > Because conversion from a URI to an IRI is not guaranteed to succeed,
> > and even if it succeeds, is not guaranteed to produce the correct
> > result (i.e. the original characters), it is important to convert
> > from IRIs to URIs as late as possible. For %hh-escape equivalence,
> > this means that %hh-escaping is only done for the actual comparison,
> > but that the original IRI is always retained. This would need a
> > certain amount of resources (time or space).

The mapping from IRI to URI is not generally invertible, as RFC 2396
doesn't constrain the use of %hh-escaping to UTF-8 strings.  Given a URI
containing %hh-escapes of octets outside the ASCII range, we cannot tell
which character encoding was used to generate the %hh-escaped URI.

Consequently, the mapping from IRI to URI should take place as late as
possible in order to avoid information loss.  Note that this is the
position adopted by the XML specification [1].
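A small illustration of the information loss, using Python's standard
urllib.parse module.  The name try_unescape is a hypothetical helper,
and Latin-1 is picked only as an example of a non-UTF-8 encoding that a
URI producer might have used.

```python
from urllib.parse import quote, unquote

name = "D\u00fcrst"  # "Dürst"

# The same characters escaped under two different encodings.
utf8_escaped = quote(name, encoding="utf-8")       # D%C3%BCrst
latin1_escaped = quote(name, encoding="latin-1")   # D%FCrst


def try_unescape(escaped, encoding):
    """Unescape assuming `encoding`; return None if the octets are invalid."""
    try:
        return unquote(escaped, encoding=encoding, errors="strict")
    except UnicodeDecodeError:
        return None


# Guessing the wrong encoding either fails outright or silently yields
# the wrong characters.
fails = try_unescape(latin1_escaped, "utf-8")    # None: 0xFC is not valid UTF-8
wrong = try_unescape(utf8_escaped, "latin-1")    # two characters instead of one
right = try_unescape(utf8_escaped, "utf-8")      # the original string
```

The second case is the dangerous one: the conversion "succeeds" but does
not give back the original characters.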

If %hh-escape equivalence is selected, then a component which compares a
%hh-escaped string with an unescaped string will need to %hh-escape the
unescaped string for purposes of comparison, while retaining the
original unescaped string for subsequent use (as it could not regenerate
it from the %hh-escaped version).  The time vs space choice refers, I
think, to the implementor's freedom to:

a)  keep the two versions of the string (e.g. in an RDF database) --
    a cost in space but a saving of time, or

b)  keep just the unescaped string and %hh-escape it on demand --
    a cost in time but a saving of space.
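The two choices might look roughly like this.  This is only a sketch:
the class names BothForms and OnDemand and the helpers iri_to_uri and
upper_hex are invented for the example, and the comparison assumes the
incoming string is already %hh-escaped using UTF-8.

```python
import re


def iri_to_uri(iri):
    """%hh-escape every non-ASCII character as UTF-8 octets."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))
        for ch in iri
    )


def upper_hex(escaped):
    """Uppercase the hex digits of each %hh-escape before comparing."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), escaped)


class BothForms:
    """Choice (a): store both strings; more space, less time per match."""

    def __init__(self, iri):
        self.iri = iri              # original, retained for later use
        self.uri = iri_to_uri(iri)  # escaped copy, computed once

    def matches_escaped(self, escaped):
        return self.uri == upper_hex(escaped)


class OnDemand:
    """Choice (b): store only the original; escape again on every match."""

    def __init__(self, iri):
        self.iri = iri

    def matches_escaped(self, escaped):
        return iri_to_uri(self.iri) == upper_hex(escaped)
```

In both cases the original unescaped string stays available, which is
the point of converting as late as possible.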

> > The argument has been made that using character-by-character equivalence
> > would create strong pressures to not convert from IRIs to URIs prematurely,
> > which would be a good thing. It is difficult to judge whether this will
> > be the case; if things go well, it may indeed provide desirable
> > reinforcement, but if things go wrong, it may create additional confusion.

If the decision is to go for exact character-by-character equivalence,
this might discourage people from prematurely carrying out %hh-escaping
(and discarding the unescaped string).

> > It is thinkable to specify IRI equivalence by specifying character-by-
> > character equivalence for ASCII characters, and %hh-escape equivalence
> > for non-ascii characters. But the chance that this gets implemented
> > is probably very low.

I think that Martin is mentioning this option for completeness, though
it most probably isn't of much interest.  Here, '%7E' would *not* match
'~', but 'http://www.w3.org/People/Dürst' would match
'http://www.w3.org/People/D%c3%bcrst'.
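For completeness, a sketch of how this mixed rule could be implemented,
under my own reading of it: escapes of ASCII octets are compared exactly
as written, while non-ASCII characters and escapes of non-ASCII octets
are brought to a common uppercase UTF-8 %hh form.  The names
mixed_canonical and mixed_equivalent are hypothetical.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def mixed_canonical(iri):
    """Assumed comparison form for the mixed ASCII/non-ASCII rule."""
    # Non-ASCII characters become uppercase-hex UTF-8 escapes.
    escaped = "".join(
        ch if ord(ch) < 0x80
        else "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))
        for ch in iri
    )

    def fix(match):
        text = match.group(0)
        # Escapes of non-ASCII octets are case-folded; escapes of ASCII
        # octets (such as %7E) are left exactly as written.
        return text.upper() if int(match.group(1), 16) >= 0x80 else text

    return _ESCAPE.sub(fix, escaped)


def mixed_equivalent(a, b):
    return mixed_canonical(a) == mixed_canonical(b)
```

With this reading, mixed_equivalent("%7E", "~") is false, while an IRI
containing "ü" matches the same IRI with "%c3%bc" in its place.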

[1] http://www.w3.org/XML/xml-V10-2e-errata#E26

Regards,
Misha





Received on Monday, 3 June 2002 11:11:19 UTC