Re: why use IRIs?

Hello John,

On 2012/07/02 10:03, John C Klensin wrote:
> (sorry - sent from wrong address)

[Sorry for forwarding as a moderator, I missed this one at first.
I have added the other address to the ignore list, so you should be able 
to post from either in the future.]

> --On Monday, June 25, 2012 18:22 +0900 "\"Martin J. Dürst\""
> <duerst@it.aoyama.ac.jp>  wrote:

>> As Björn said, it's really more about new protocols than
>> about upgrades. Also, different protocols (and formats) can
>> upgrade in different ways. Sometimes, this can be done
>> formally with extensions, at other times it's done gradually
>> and sooner or later gets accepted in a spec. For other cases,
>> of course, it may never happen.
>> ...
>
> For whatever it is worth, I don't find that answer particularly
> helpful.  My problem with it is one that we have discussed
> pieces of before.  If the requirement were to make something
> that was coupled closely enough to URIs to be a UI overlay, then
> we have one set of issues.  The WG has moved beyond that into
> precisely what you are commenting on above and that they key
> draft seems to reflect -- a new protocol element to be used
> primarily in new, or radically updated/upgraded, protocols.

It looks as if some of the discussion in the IRI WG might have led to 
the assumption that we are moving to calling IRIs a "Protocol Element" 
starting with the revision of RFC 3987. This is wrong.

RFC 3987 defines IRIs as a protocol element. Please see the first line 
of the abstract at http://tools.ietf.org/html/rfc3987.

Also, please note that IRIs have long been working, and still are 
working, in protocols/formats that are in no way new. The prime 
example here is HTML (of course, there it works with some warts, but 
no more warts than the average HTML feature).


> But, if we are going to define a new protocol element for new
> uses, then why stick with the basic URI syntax framework?  We
> already know that causes problems.

First, as said above, your presumption is wrong. Second, other solutions 
have been shown to have problems too.


> It is hard to localize
> because it contains a lot of ASCII characters that are special
> sometimes and not others, that may have non-Latin-script
> lookalikes, and because parsing is method-dependent.  That
> method-dependency makes it very hard to create variations that
> are appropriate to the local writing system because one has to
> be method-sensitive at too many different points.

The fact that URI/IRI characters are sometimes special and sometimes not 
comes from the fact that URIs/IRIs combine a lot of different 
components, and from the desire of people to not have to escape more 
than absolutely necessary. You can always just go ahead and escape all 
delimiters, and be on the safe side, if you don't want to complicate 
your life. This is completely independent of IRIs.

The problem with non-Latin-script (or for that matter, even Latin 
script) lookalikes is already present (and not solved (*)) in domain 
names. It's also a problem in internationalized email addresses, because 
there's ＠ (U+FF20), a full-width variant of @.

[(*) IDNA 2003 had a partial solution, but IDNA 2008 abandoned it.]

As for method-dependent parsing, do you mean scheme-dependent parsing? 
Given the wide variety of different syntax that all the various URI/IRI 
schemes deal with, the amount of parsing that can be done generically is 
actually pretty amazing, I'd think.
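
As a rough illustration of how much can be done generically, the sketch 
below applies the regular expression from Appendix B of RFC 3986 to 
references in different schemes, without any scheme-specific knowledge. 
The function name split_generic and the example addresses are made up 
for this illustration.

```python
import re

# Regular expression given in RFC 3986, Appendix B.
GENERIC = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_generic(reference: str) -> dict:
    # Hypothetical helper: split into the five generic components.
    # A component that is absent comes back as None.
    m = GENERIC.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

for ref in (
    "http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org",
    "ftp://ftp.example.org/pub/file.txt",
    "mailto:user@example.org",
):
    print(split_generic(ref))
```

For the mailto: case the generic split only yields a scheme and a path; 
anything beyond that needs scheme-specific processing.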

> If some
> protocols are to permit only IRIs, some only URIs, and some
> both, it would also be beneficial to be able to determine which
> is which, rather than wondering whether an IRI that actually
> contains only ASCII characters (and no escapes) is actually an
> IRI or is just the URI it looks like.

There is no "only IRIs". IRIs always include URIs. With that tweak, 
let's rewrite the above sentence in two different ways:

If some protocols/formats/applications are to permit only ASCII domain 
names, and others both ASCII and internationalized domain names, it 
would also be beneficial to be able to determine which is which, rather 
than wondering whether an IDN that actually contains only ASCII 
characters is actually an IDN or is just the ASCII domain name it looks 
like.

If some protocols/formats/applications are to permit only ASCII email 
addresses, and others both ASCII and internationalized email addresses, 
it would also be beneficial to be able to determine which is which, 
rather than wondering whether an internationalized email address that 
actually contains only ASCII characters is actually an internationalized 
email address or is just the ASCII email address it looks like.

I don't see a problem, but if IRIs have a problem, so do IDNs and 
internationalized email addresses.
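
For the domain name case, this can be seen with a small Python sketch. 
It assumes Python's built-in "idna" codec, which implements IDNA 2003 
(RFC 3490) rather than IDNA 2008; the helper name to_wire and the name 
"bücher.example" are made up for illustration.

```python
def to_wire(name: str) -> bytes:
    # Hypothetical helper: convert a domain name to its ASCII form
    # using Python's built-in IDNA 2003 codec.
    return name.encode("idna")

ascii_name = "www.ietf.org"
idn = "bücher.example"

# An all-ASCII name passes through unchanged: it is the same name
# whether or not you call it an IDN.
print(to_wire(ascii_name))

# A name with non-ASCII labels is converted to an "xn--" form.
print(to_wire(idn))
```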


> I continue to believe that makes a strong case for doing
> something that gets us internationalization by moving away from
> the URI syntax model, probably to something that explicitly
> identifies the data elements that make up a particular URI.  If,
> for example, one insisted that domain names be identified as
> such wherever they appear, the mess about whether something can
> or should be given IDNA treatment (even if only to verify
> U-label syntax) and the associated RFC 6055 considerations
> become much easier to handle than if one can to guess whether
> something might be a domain name or something else with periods
> in it.

This problem has three levels of difficulty.

1) For those schemes that follow the generic syntax (e.g. http, 
ftp,...), the domain name is easy to find.

2) There are a few schemes that don't use generic syntax, but use
domain names. A typical example is mailto:. Here you need 
scheme-specific processing.

3) Many URI schemes are open-ended. The typical example is the query 
part of the http scheme, which can contain domain names or even 
(suitably encoded) whole URIs. This is an example, please note the 
"www.ietf.org" at the end:
http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org
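
The three levels can be sketched in Python as follows. This uses only 
urllib.parse from the standard library; the function names and the 
mailto: address are made up for illustration, and the mailto: handling 
is deliberately simplistic (a single address, no further checks).

```python
from urllib.parse import urlsplit, parse_qs

def host_generic(iri: str):
    # Level 1: generic syntax, the host sits in the authority component.
    return urlsplit(iri).hostname

def host_mailto(iri: str):
    # Level 2: scheme-specific. Simplified assumption: one address,
    # with the domain being everything after the last "@".
    parts = urlsplit(iri)
    if parts.scheme != "mailto" or "@" not in parts.path:
        return None
    return parts.path.rpartition("@")[2]

def query_values(iri: str):
    # Level 3: the query can be split into names and values, but nothing
    # in the syntax says which values happen to be domain names.
    return parse_qs(urlsplit(iri).query)

example = "http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org"
print(host_generic(example))                   # prints www.google.com
print(host_mailto("mailto:user@example.org"))  # prints example.org
print(query_values(example))
```

At level 3, only the application that defines as_sitesearch knows that 
its value is a domain name.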

It is rather trivial to come up with a kind of format/data structure for 
this. I'll give a concrete example using XML, but of course, JSON or 
some other popular format would also work. The details are mostly 
bike-shedding.

<IRI>
   <scheme>http</scheme>
   <host type='dns'>
     <label>www</label>
     <label>google</label>
     <label>com</label>
   </host>
   <path>
     <segment>search</segment>
   </path>
   <query>
     <parameter>
       <name>as_q</name>
       <value>URI</value>
     </parameter>
     <parameter>
       <name>as_sitesearch</name>
       <value type='dns'>
         <label>www</label>
         <label>ietf</label>
         <label>org</label>
       </value>
     </parameter>
   </query>
</IRI>

Note that this duly identifies DNS 'stuff'. It's probably not too 
difficult for anybody to figure out why 
people/applications/formats/protocols use URIs/IRIs rather than 
something like the example above. I'm leaving this as an "exercise for 
the reader".
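
As a starting point for that exercise, here is a rough Python sketch 
that builds a structure like the one above from the compact form. It 
uses xml.etree.ElementTree and urllib.parse from the standard library. 
The function names and the dns_params argument are made up for this 
example; the caller has to say which query parameters hold domain 
names, because the IRI itself does not.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qsl

def add_dns_labels(parent, name):
    # Mark the element as DNS 'stuff' and add one <label> per label.
    parent.set("type", "dns")
    for label in name.split("."):
        ET.SubElement(parent, "label").text = label

def iri_to_xml(iri, dns_params=()):
    # Hypothetical converter; assumes a generic-syntax IRI with a host.
    parts = urlsplit(iri)
    root = ET.Element("IRI")
    ET.SubElement(root, "scheme").text = parts.scheme
    add_dns_labels(ET.SubElement(root, "host"), parts.hostname)
    path = ET.SubElement(root, "path")
    for segment in parts.path.split("/"):
        if segment:
            ET.SubElement(path, "segment").text = segment
    query = ET.SubElement(root, "query")
    for name, value in parse_qsl(parts.query):
        parameter = ET.SubElement(query, "parameter")
        ET.SubElement(parameter, "name").text = name
        value_el = ET.SubElement(parameter, "value")
        if name in dns_params:
            add_dns_labels(value_el, value)  # only known from outside
        else:
            value_el.text = value
    return root

iri = "http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org"
root = iri_to_xml(iri, dns_params={"as_sitesearch"})
xml_text = ET.tostring(root, encoding="unicode")
# Compare the length of the compact form with the structured form.
print(len(iri), len(xml_text))
```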


> Stated a little differently, if IRIs are protocol elements that
> are intended to support new protocols, then it seems to me that
> it is not obvious that the URI syntax is a constraint.
> Certainly the WG has not had a serious discussion about what the
> advantages of that constraint are and whether they outweigh the
> disadvantages.

I hesitate to refer to the charter of the IRI WG 
(http://datatracker.ietf.org/wg/iri/charter/) because some aspects of it 
(in particular the milestones) are hopelessly out of date. I see no 
indication whatsoever about removing the URI syntax constraint, and many 
indications that strongly (although not explicitly) contradict such a 
proposal.


Please note that while IRIs are intended for new protocols (in the sense 
that new protocols should preferably use IRIs and not just URIs), they 
are also intended for "gradual" updates where that's appropriate, and 
they are already used in many protocols/formats.


Regards,    Martin.

Received on Tuesday, 3 July 2012 06:19:57 UTC