Re: URIEquivalence-15 and IRIs from Tim Bray on 2002-06-03 (www-tag@w3.org from June 2002)

From: Tim Bray <tbray@textuality.com>
Date: Sun, 02 Jun 2002 21:30:48 -0700
To: Martin Duerst <duerst@w3.org>
Cc: www-tag@w3.org, w3c-i18n-ig@w3.org
Message-ID: <3CFAF0F8.5060606@textuality.com>

Martin Duerst wrote:

> Here is my input on the issue of URI/IRI equivalence, for
> your consideration. This is a very important issue for IRIs.

Thank you Martin.  I just spent a lot of time reading the IRI draft and 
then Martin's email.  It would be awfully nice if draft-duerst-iri-xx.txt:

(a) were available in HTML so you could print it on a modern printer 
with correct page breaks
(b) contained some more examples.  I'm serious about this: I think no 
further drafts on this topic should be taken seriously unless they 
contain a lot of examples of what the various constraints mean.

Having said that, as Martin points out:

> The core choices from the view of IRIs are:
> 
> a) 'character-by-character equivalence'
>    (taking a %hh-escaping as three characters)
> b) '%hh-escape equivalence' (equating %hh-escape
>    sequences with the characters (based on US-ASCII/UTF-8)
>    they stand for, except for reserved characters!)

On the face of it, (b) seems like the only sensible thing to do (among 
other things /%ba%be better be the same as /%BA%BE)... I'm feeling 
stupid; I'm sure I'm missing something because Martin isn't jumping up 
and down saying "do it this way!"  He seems to favor it, but he's being 
very careful... why?
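
To make the /%ba%be case concrete, here is a minimal Python sketch. It is an illustration only, not taken from the IRI draft; the function names are hypothetical. It shows that choice (a) treats the two spellings as different, while uppercasing the hex digits of each escape (one small part of what choice (b) implies) makes them compare equal.

```python
import re

# Matches one %hh escape and captures its two hex digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def char_equivalent(a: str, b: str) -> bool:
    """Choice (a): character-by-character; each %hh counts as three characters."""
    return a == b


def upper_hex(uri: str) -> str:
    """Uppercase the hex digits of every %hh escape, leaving all else untouched."""
    return _ESCAPE.sub(lambda m: "%" + m.group(1).upper(), uri)


def hex_case_equivalent(a: str, b: str) -> bool:
    """Compare after folding the case of the hex digits only (hypothetical helper)."""
    return upper_hex(a) == upper_hex(b)
```

Under this sketch, char_equivalent("/%ba%be", "/%BA%BE") is false, while hex_case_equivalent on the same pair is true.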

Hmm

  http://example.com/a%2fb

is not the same as

  http://example.com/a/b

which is, I assume, why (b) says "except for reserved characters".  So 
what other gotchas are there?
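
One way to read choice (b), sketched in Python. This is a guess at the intended behaviour, not the draft's definition. It assumes the unreserved set of RFC 2396 (alphanumerics plus the "mark" characters); the names normalize and escape_equivalent are hypothetical. Escapes that stand for unreserved characters are decoded, every other escape is kept (with uppercase hex), and raw non-ASCII characters are escaped as UTF-8 before comparing.

```python
import re
import string

# Assumption: unreserved characters as in RFC 2396 (alphanum plus marks).
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _fold_escape(match: re.Match) -> str:
    """Decode an escape only if it stands for an unreserved ASCII character."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    # Reserved or non-ASCII byte: keep the escape, but uppercase its hex digits.
    return "%" + match.group(1).upper()


def normalize(iri: str) -> str:
    """Bring an IRI or URI into a comparable form (hypothetical helper)."""
    # Step 1: escape raw non-ASCII characters as UTF-8 %HH sequences.
    escaped = "".join(
        ch if ord(ch) < 128
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in iri
    )
    # Step 2: fold the escapes in a single pass, so "%25" is never decoded twice.
    return _ESCAPE.sub(_fold_escape, escaped)


def escape_equivalent(a: str, b: str) -> bool:
    """Choice (b): %hh-escape equivalence, except for reserved characters."""
    return normalize(a) == normalize(b)
```

With these assumptions, http://example.com/a%2fb and http://example.com/a/b stay different because "/" is reserved, while %7E and ~ compare equal, as do /%ba%be and /%BA%BE.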

As for the following four paragraphs, I'm sorry, I just don't understand 
them without some examples to make the issues (whatever they are) 
concrete.  If I'm dragging down the average level of insight in this 
discussion by being dense on this, I apologize, but in this case I 
regard myself as a proxy for the non-URI/i18n-guru who will have to 
fight through this.

> The difference is more important for IRIs because the mapping
> from IRIs to URIs is based on (UTF-8 and) %hh-escaping. Because
> some protocols/formats/APIs will support IRIs whereas others
> (older/lower level) may not, having both escaped and unescaped
> versions of the same IRI is probably more frequent than for
> URIs (where %7E / ~ is the only example I have seen).
> This is a strong argument for %hh-escape equivalence.
> 
> Because conversion from a URI to an IRI is not guaranteed to succeed,
> and even if it succeeds, is not guaranteed to produce the correct
> result (i.e. the original characters), it is important to convert
> from IRIs to URIs as late as possible. For %hh-escape equivalence,
> this means that %hh-escaping is only done for the actual comparison,
> but that the original IRI is always retained. This would need a
> certain amount of resources (time or space).
> 
> The argument has been made that using character-by-character equivalence
> would create strong pressures to not convert from IRIs to URIs prematurely,
> which would be a good thing. It is difficult to judge whether this will
> be the case; if things go well, it may indeed provide desirable
> reinforcement, but if things go wrong, it may create additional confusion.
> 
> It is thinkable to specify IRI equivalence by specifying character-by-
> character equivalence for ASCII characters, and %hh-escape equivalence
> for non-ASCII characters. But the chance that this gets implemented
> is probably very low.
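
An example may help with the second quoted paragraph. The sketch below is an assumption-laden illustration in Python, not the draft's algorithm; iri_to_uri and uri_to_iri are hypothetical names. It maps an IRI to a URI via UTF-8 and %hh-escaping, then tries to go back. The reverse direction fails when the escaped bytes are not valid UTF-8, which is one way conversion from a URI to an IRI "is not guaranteed to succeed".

```python
import re
from typing import Optional

# A run of escapes whose bytes are all non-ASCII (0x80 to 0xFF).
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def iri_to_uri(iri: str) -> str:
    """Escape every non-ASCII character as the %HH form of its UTF-8 bytes."""
    return "".join(
        ch if ord(ch) < 128
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in iri
    )


def uri_to_iri(uri: str) -> Optional[str]:
    """Try to turn escaped non-ASCII bytes back into characters.

    Returns None when a run of escaped bytes is not valid UTF-8.
    ASCII escapes such as %2F are deliberately left alone.
    """
    def decode_run(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        return raw.decode("utf-8")  # raises UnicodeDecodeError on bad input

    try:
        return _NON_ASCII_RUN.sub(decode_run, uri)
    except UnicodeDecodeError:
        return None
```

Here uri_to_iri("/%BA%BE") returns None, because the bytes BA BE are not valid UTF-8. Even when decoding succeeds, the result may be wrong if the escapes were originally produced from some other encoding, which is presumably why the quoted text recommends retaining the original IRI and escaping only for the comparison.
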
Received on Monday, 3 June 2002 00:31:40 UTC