Re: why use IRIs? from John C Klensin on 2012-07-02 (public-iri@w3.org from July 2012)

From: John C Klensin <john@jck.com>
Date: Mon, 02 Jul 2012 19:33:43 -0400
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Peter Saint-Andre <stpeter@stpeter.im>
cc: public-iri@w3.org
Message-ID: <E66151A9DA69D733A0A5166D@JcK-HP8200.jck.com>

--On Monday, June 25, 2012 18:22 +0900 "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> Hello Peter,
> 
> I think Björn already gave very good answers to your
> questions.

Martin, Björn, Peter, 

> On 2012/06/22 3:28, Peter Saint-Andre wrote:
>> <hat type='individual'/>
>> 
>> I've been thinking about IRIs, and I'm wondering: why would a
>> protocol "upgrade" from URIs to IRIs?
> 
> As Björn said, it's really more about new protocols than
> about upgrades. Also, different protocols (and formats) can
> upgrade in different ways. Sometimes, this can be done
> formally with extensions, at other times it's done gradually
> and sooner or later gets accepted in a spec. For other cases,
> of course, it may never happen.
>...

For whatever it is worth, I don't find that answer particularly
helpful.  My problem with it is one that we have discussed
pieces of before.  If the requirement were to make something
that was coupled closely enough to URIs to be a UI overlay, then
we have one set of issues.  The WG has moved beyond that into
precisely what you are commenting on above and that the key
draft seems to reflect -- a new protocol element to be used
primarily in new, or radically updated/upgraded, protocols.  

But, if we are going to define a new protocol element for new
uses, then why stick with the basic URI syntax framework?  We
already know that causes problems.  It is hard to localize
because it contains a lot of ASCII characters that are special
sometimes and not others, that may have non-Latin-script
lookalikes, and because parsing is method-dependent.  That
method-dependency makes it very hard to create variations that
are appropriate to the local writing system because one has to
be method-sensitive at too many different points.  If some
protocols are to permit only IRIs, some only URIs, and some
both, it would also be beneficial to be able to determine which
is which, rather than wondering whether an IRI that actually
contains only ASCII characters (and no escapes) is actually an
IRI or is just the URI it looks like.   Again, as long as IRIs
were just a UI overlay, it made no difference.  But, as a
protocol element, it does.
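
To illustrate the ambiguity described above, here is a minimal
Python sketch.  The function names are hypothetical and the check
is deliberately crude: it looks only at whether the string is pure
ASCII and does not validate URI or IRI syntax.

```python
def is_ascii_only(identifier: str) -> bool:
    """True if every character is in the ASCII range."""
    return all(ord(ch) < 128 for ch in identifier)


def classify(identifier: str) -> str:
    """Report what can be inferred from the characters alone."""
    if is_ascii_only(identifier):
        # Nothing in the string itself says whether the sender
        # meant a URI or an IRI that happens to be all ASCII.
        return "URI or IRI (cannot tell)"
    return "IRI"


print(classify("http://example.com/a"))   # URI or IRI (cannot tell)
print(classify("http://example.com/é"))   # IRI
```

A receiver that must treat the two differently has nothing in the
string to go on in the first case.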

I continue to believe that makes a strong case for doing
something that gets us internationalization by moving away from
the URI syntax model, probably to something that explicitly
identifies the data elements that make up a particular URI.  If,
for example, one insisted that domain names be identified as
such wherever they appear, the mess about whether something can
or should be given IDNA treatment (even if only to verify
U-label syntax) and the associated RFC 6055 considerations
become much easier to handle than if one has to guess whether
something might be a domain name or something else with periods
in it.
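
As a rough sketch of what "identified as such" could look like, the
Python below labels each data element explicitly and applies domain
name processing only to elements labeled as domains.  The Component
type, its "domain" and "text" labels, and to_wire() are all
assumptions made for illustration, not anything proposed in the
drafts.  Python's built-in "idna" codec implements IDNA 2003 and is
used only as a stand-in for whatever IDNA handling a real design
would call for.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Component:
    kind: str   # hypothetical labels: "domain" or "text"
    value: str


def to_wire(component: Component) -> str:
    """Encode one labeled component for transmission."""
    if component.kind == "domain":
        # Stand-in for IDNA processing (the stdlib codec is IDNA 2003).
        return component.value.encode("idna").decode("ascii")
    # Anything not labeled as a domain is percent-encoded as UTF-8 and
    # never given IDNA treatment, even if it contains periods.
    return quote(component.value, safe="")


print(to_wire(Component("domain", "bücher.example")))  # xn--bcher-kva.example
print(to_wire(Component("text", "v1.2.bücher")))       # v1.2.b%C3%BCcher
```

Because the label travels with the data, the second value is never
mistaken for a domain name despite its periods, so no guessing is
involved.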

Stated a little differently, if IRIs are protocol elements that
are intended to support new protocols, then it seems to me that
it is not obvious that the URI syntax is a constraint.
Certainly the WG has not had a serious discussion about what the
advantages of that constraint are and whether they outweigh the
disadvantages.

best,
    john

Received on Monday, 2 July 2012 23:33:50 UTC