Re: Migration of HTTP to the use of IRIs [altdesign-17] from Martin Duerst on 2004-05-09 (public-iri@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 09 May 2004 09:37:17 +0900
To: "Chris Haynes" <chris@harvington.org.uk>, "Michel Suignard" <michelsu@windows.microsoft.com>
Cc: <public-iri@w3.org>
Message-Id: <4.2.0.58.J.20040508144430.05a74018@localhost>

Hello Chris,

I have changed the issue for this mail to altdesign-17, because it
seems more appropriate.

At 11:07 04/05/07 +0100, Chris Haynes wrote:

>Michel,
>
>Thanks for this comment, but I think my point is still valid - even just for
>presentational uses.
>
>Given that many URI encodings exist 'in the wild' which use %HH escaping of
>non-UTF-8 sequences, I fail to see how one can know that it is valid to convert
>any such URI into an IRI (as per sect. 3.2) - even if just for presentational
>purposes.

Section 3.2 very clearly says that there is a risk that you convert
to something that didn't exist previously.
But in practice, this is not that much of an issue, because it is
very rare to find reasonable text encoded in legacy encodings that
matches UTF-8 byte patterns. Please try to find some examples yourself,
and you will see this.
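
One quick way to try this is to percent-decode a URI component and test
whether the resulting octets are well-formed UTF-8. The following is only
a minimal sketch; the function name is made up for illustration and is not
taken from the draft.

```python
from urllib.parse import unquote_to_bytes


def escapes_look_like_utf8(uri_component: str) -> bool:
    """Return True if the percent-decoded octets are well-formed UTF-8.

    Note that a pure ASCII component also returns True, because ASCII
    is a subset of UTF-8.
    """
    raw = unquote_to_bytes(uri_component)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# e-acute escaped from ISO-8859-1: a lone E9 octet is not valid UTF-8.
print(escapes_look_like_utf8("ros%E9"))     # False
# e-acute escaped from UTF-8: C3 A9 is a valid two-octet sequence.
print(escapes_look_like_utf8("ros%C3%A9"))  # True
```

Legacy-encoded text has to contain something like the ISO-8859-1 pair
A-tilde plus copyright sign (octets C3 A9) to pass this test by accident.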


>My concern is the same: unless there is some kind of syntactic indicator
>within the URI as a whole, how can one reliably know that UTF-8 has been
>used and that it is intended to have a corresponding IRI?

You are correct that one cannot do this with 100% certainty.
But then, if you study the URI spec very carefully, you will
find that it also doesn't guarantee that an 'a' in a URI
actually corresponds to an 'a' in the original data (e.g.
file name). For details, please see the "Laguna Beach"
example in Section 2.5 of draft-fielding-uri-rfc2396bis-05.txt,
for example at
http://gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-05.txt.

So in those rare cases where a URI with an octet sequence
that by chance corresponds to a UTF-8 pattern, but that was
never intended as UTF-8, is converted to an IRI, one will just
get a weird name, but reusing that name again e.g. in a browser
that accepts IRIs will lead back to the original resource.
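
The round trip can be sketched as follows. This is a simplified
illustration, not the full procedure of Section 3.2: the function names
are hypothetical, a run of escapes is converted only if every octet in it
is non-ASCII and the whole run is valid UTF-8, and the draft's further
restrictions on which characters may be converted are ignored.

```python
import re

# One or more consecutive %HH escapes.
_ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri(uri: str) -> str:
    """Turn escape runs that are valid non-ASCII UTF-8 into characters."""
    def decode_run(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        # Simplification: leave runs containing escaped ASCII untouched.
        if any(b < 0x80 for b in octets):
            return run
        try:
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            return run  # legacy-encoded escapes stay as they are
    return _ESCAPED_RUN.sub(decode_run, uri)


def iri_to_uri(iri: str) -> str:
    """Percent-escape the UTF-8 octets of every non-ASCII character."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )


# Suppose C3 A9 was meant as ISO-8859-1 (A-tilde, copyright sign).
original = "http://www.example.org/%C3%A9"
iri = uri_to_iri(original)          # shows e-acute: the "weird name"
print(iri_to_uri(iri) == original)  # True: same octets, same resource
```

The comparison assumes upper-case hex digits in the original URI; with
lower-case digits the result is equivalent but not character-identical.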



>It seems to me that IRI will only be deployed accurately and effectively if
>one of the following situations occurs:
>
>1) New protocol schemes (e.g. httpi, httpis ) are introduced which make it
>explicit that this spec. applies to the URI,

Introducing a new URI scheme is *extremely* expensive. I have heard
Tim Berners-Lee say this over and over again, and I know he knows it.
And in the case at hand, it's highly unnecessary. The cost of an
occasional accidental 'wrong' conversion back to an IRI (as discussed
above) is much, much smaller than the cost of introducing new schemes.

And what would the real benefit of new schemes be? Would they be
useful to distinguish URIs from true IRIs (I'm writing 'true' IRIs
here to exclude URIs which are by definition also IRIs)? Not really;
it's much cheaper to identify IRIs by checking for non-ASCII characters.
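
Such a check is tiny. A sketch, with a made-up function name and assuming
Python 3.7 or later for str.isascii():

```python
def is_true_iri(reference: str) -> bool:
    """True if the reference contains at least one non-ASCII character."""
    return not reference.isascii()


print(is_true_iri("http://www.example.org/ros\u00e9"))  # True
print(is_true_iri("http://www.example.org/ros%C3%A9"))  # False
```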

So they would only be used to distinguish URIs without known origin
from URIs originating from conversion from IRIs. But assume I had
an IRI like http://www.example.org/ros&#xE9; (rose'). In order
to pass it to others whom I know can only process URIs, not IRIs,
would I want to convert it to http://www.example.org/ros%C3%A9,
or to httpi://www.example.org/ros%C3%A9 ? The former, strictly
speaking, loses the information that this was an IRI, so converting
it back to rose' is a guess (but because of the UTF-8 patterns,
actually a rather safe one). But it actually will go to the
right page, on hundreds of millions of Web browsers, without
exception. The latter can safely be converted back to the IRI
(by all the software that knows how to do this, which currently
numbers exactly 0). But it will work only on the browsers
that know the httpi: scheme (again, currently numbering
exactly 0). For me the alternative is very clear,
http://www.example.org/ros%C3%A9 works in many more cases,
and is therefore much better.


>2) They are used within a closed environment in which it is a convention that
>only IRIs and IRI-derived URIs are in use (no legacy-encoding escapes, or they
>are allowed to be mis-interpreted)

The current draft clearly allows legacy-encoded escapes, for backwards
compatibility. I'm not sure what you mean by 'mis-interpreted', but
if you mean that they are converted to IRIs, then yes, the current
draft allows this in those cases where it is possible (i.e. the
byte pattern matches UTF-8,...). But this misinterpretation does
not lead to an actual misinterpretation of the resource that the
IRI identifies.


>3) A new market-dominating user agent is launched which behaves as if (2)
>above were the case - i.e. there is an attempt to establish IRIs as the de
>facto default through market force, ignoring or discarding resulting errors of
>presentation or of resource identification.
>
>My big fear is that without rapid progress on (1), IRIs on the open Internet
>will only ever take off if someone does (3) - which will be without benefit of
>adequate standards backing.

I'm not sure I understand you. Several browsers, for example
Opera and Safari, already implement IRIs. MS IE also does it
if the relevant flag is set correctly. And the standard is
close to done; this is the last real issue I'm trying to close.
So I don't see the problem.


>I'd love to either:
>
>a) be shown that my logic is faulty

I guess yes. Not in theory, where absolute correctness is the
only goal, but in practice, where big numbers and deployment
are important.

>or
>
>b) be pleasantly surprised by being told that there _is_ RFC work taking
>place on new schemes covering at least the space of http(s)

Some schemes may benefit from an update, in particular those that
haven't thought about internationalization. The first example that
would come to my mind is the mailto: scheme.


Regards,    Martin.



>otherwise, I fail to understand how IRIs will 'take off' in the 'real world' -
>where they are so badly needed.
>
>Chris
>
>
>
>
>----- Original Message -----
>From: "Michel Suignard" <michelsu@windows.microsoft.com>
>To: "Chris Haynes" <chris@harvington.org.uk>
>Cc: <public-iri@w3.org>; "Martin Duerst" <duerst@w3.org>
>Sent: Friday, May 07, 2004 1:43 AM
>Subject: RE: Migration of HTTP to the use of IRIs [queryclarify-16]
>
>
>
> > From:  Chris Haynes
> > Sent: Thursday, May 06, 2004 4:50 AM
> >
> > Actually, my original core concern has now been covered in your section
> > 1.2.a - Applicability, where you make it clear that "the intent is not to
> > introduce IRIs into contexts that are not defined to accept them".
> >
> > This now makes it clear that new schemas will be required to replace
> > http: , https: etc. These will need to be self-identifying in some way, so
> > that receiving equipment will know that an IRI is being presented.
> >
> > So, as I commented last June, I await with interest the recognition among
> > those responsible for the HTTP schema that new schemas with new names are
> > required before IRIs can be used.
>
>I'd like to comment on that. The IRI spec is fairly explicit on that IRI
>can be used as presentation elements for URI protocol elements (ref
>clause 3 intro). This is to recognize that applications out there have
>not waited for us for creating presentation layers that use non ascii
>native characters for schemes that supposedly should not use them (such
>as http). The presentation layer principle is there to support that. So
>I expect IRI to be used for both purposes:
>- presentation layer for existing URI schemes
>- core layer for new schemes exclusively defined using IRI for protocol
>elements syntax.
>
>For a while I'd expect the vast majority of IRI usage to be in the first
>category.
>
>Michel
>
>
Received on Saturday, 8 May 2004 20:44:13 UTC