- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Tue, 03 Jul 2012 15:19:22 +0900
- To: John C Klensin <john-ietf@jck.com>
- CC: Peter Saint-Andre <stpeter@stpeter.im>, public-iri@w3.org
Hello John,

On 2012/07/02 10:03, John C Klensin wrote:
> (sorry - sent from wrong address)

[Sorry for forwarding as a moderator, I missed this one at first. I have
added the other address to the ignore list, so you should be able to post
from either in the future.]

> --On Monday, June 25, 2012 18:22 +0900 "\"Martin J. Dürst\""
> <duerst@it.aoyama.ac.jp> wrote:

>> As Björn said, it's really more about new protocols than
>> about upgrades. Also, different protocols (and formats) can
>> upgrade in different ways. Sometimes, this can be done
>> formally with extensions, at other times it's done gradually
>> and sooner or later gets accepted in a spec. For other cases,
>> of course, it may never happen.
>> ...
>
> For whatever it is worth, I don't find that answer particularly
> helpful. My problem with it is one that we have discussed
> pieces of before. If the requirement were to make something
> that was coupled closely enough to URIs to be a UI overlay, then
> we have one set of issues. The WG has moved beyond that into
> precisely what you are commenting on above and that the key
> draft seems to reflect -- a new protocol element to be used
> primarily in new, or radically updated/upgraded, protocols.

It looks as if some of the discussion in the IRI WG might have led to the
assumption that we are moving to calling IRIs a "Protocol Element"
starting with the revision of RFC 3987. This is wrong. RFC 3987 already
defines IRIs as a protocol element. Please see the first line of the
abstract at http://tools.ietf.org/html/rfc3987.

Also, please note that IRIs have been working for a long time, and are
still working, in protocols/formats that are in no way new. The prime
example here is HTML (of course, there it works with some warts, but no
more warts than the average HTML feature).

> But, if we are going to define a new protocol element for new
> uses, then why stick with the basic URI syntax framework? We
> already know that causes problems.
First, as said above, your presumption is wrong. Second, other solutions
have been shown to have problems too.

> It is hard to localize
> because it contains a lot of ASCII characters that are special
> sometimes and not others, that may have non-Latin-script
> lookalikes, and because parsing is method-dependent. That
> method-dependency makes it very hard to create variations that
> are appropriate to the local writing system because one has to
> be method-sensitive at too many different points.

The fact that URI/IRI characters are sometimes special and sometimes not
comes from the fact that URIs/IRIs combine a lot of different components,
and from the desire of people to not have to escape more than absolutely
necessary. You can always just go ahead and escape all delimiters, and be
on the safe side, if you don't want to complicate your life. This is
completely independent of IRIs.

The problem with non-Latin-script (or for that matter, even Latin script)
lookalikes is already present (and not solved (*)) in domain names. It's
also a problem in internationalized email addresses, because there's ＠
(U+FF20), a full-width variant of @.

[(*) IDNA 2003 had a partial solution, but IDNA 2008 abandoned it.]

As for method-dependent parsing, do you mean scheme-dependent parsing?
Given the wide variety of different syntax that all the various URI/IRI
schemes deal with, the amount of parsing that can be done generically is
actually pretty amazing, I'd think.

> If some
> protocols are to permit only IRIs, some only URIs, and some
> both, it would also be beneficial to be able to determine which
> is which, rather than wondering whether an IRI that actually
> contains only ASCII characters (and no escapes) is actually an
> IRI or is just the URI it looks like.

There is no "only IRIs". IRIs always include URIs.
With that tweak, let's rewrite the above sentence in two different ways:

If some protocols/formats/applications are to permit only ASCII domain
names, and others both ASCII and internationalized domain names, it would
also be beneficial to be able to determine which is which, rather than
wondering whether an IDN that actually contains only ASCII characters is
actually an IDN or is just the ASCII domain name it looks like.

If some protocols/formats/applications are to permit only ASCII email
addresses, and others both ASCII and internationalized email addresses, it
would also be beneficial to be able to determine which is which, rather
than wondering whether an internationalized email address that actually
contains only ASCII characters is actually an internationalized email
address or is just the ASCII email address it looks like.

I don't see a problem, but if IRIs have a problem, so do IDNs and
internationalized email addresses.

> I continue to believe that makes a strong case for doing
> something that gets us internationalization by moving away from
> the URI syntax model, probably to something that explicitly
> identifies the data elements that make up a particular URI. If,
> for example, one insisted that domain names be identified as
> such wherever they appear, the mess about whether something can
> or should be given IDNA treatment (even if only to verify
> U-label syntax) and the associated RFC 6055 considerations
> become much easier to handle than if one has to guess whether
> something might be a domain name or something else with periods
> in it.

This problem has three levels of difficulty.

1) For those schemes that follow the generic syntax (e.g. http, ftp,...),
   the domain name is easy to find.

2) There are a few schemes that don't use generic syntax, but use domain
   names. A typical example is mailto:. Here you need scheme-specific
   processing.

3) Many URI schemes are open-ended.
The typical example is the query part of the http scheme, which can
contain domain names or even (suitably encoded) whole URIs. This is an
example; please note the "www.ietf.org" at the end:

http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org

It is rather trivial to come up with a kind of format/data structure for
this. I'll give a concrete example using XML, but of course, JSON or some
other popular format would also work. The details are mostly
bike-shedding.

<IRI>
  <scheme>http</scheme>
  <host type='dns'>
    <label>www</label>
    <label>google</label>
    <label>com</label>
  </host>
  <path>
    <segment>search</segment>
  </path>
  <query>
    <parameter>
      <name>as_q</name>
      <value>URI</value>
    </parameter>
    <parameter>
      <name>as_sitesearch</name>
      <value type='dns'>
        <label>www</label>
        <label>ietf</label>
        <label>org</label>
      </value>
    </parameter>
  </query>
</IRI>

Note that this duly identifies DNS 'stuff'. It's probably not too
difficult for anybody to figure out why
people/applications/formats/protocols use URIs/IRIs rather than something
like the example above. I'm leaving this as an "exercise for the reader".

> Stated a little differently, if IRIs are protocol elements that
> are intended to support new protocols, then it seems to me that
> it is not obvious that the URI syntax is a constraint.
> Certainly the WG has not had a serious discussion about what the
> advantages of that constraint are and whether they outweigh the
> disadvantages.

I hesitate to refer to the charter of the IRI WG
(http://datatracker.ietf.org/wg/iri/charter/) because some aspects of it
(in particular the milestones) are hopelessly out of date. Still, I see no
indication whatsoever in it of removing the URI syntax constraint, and
many indications that strongly (although not explicitly) contradict such a
proposal.
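To make the three levels of difficulty for finding domain names (discussed further up) a bit more concrete, here is a minimal sketch in Python using only the standard library's urllib.parse. The function name describe and the keys of the dictionary it returns are invented for this illustration and do not come from any spec; the mailto address is a placeholder.

```python
from urllib.parse import urlsplit, parse_qs


def describe(iri: str) -> dict:
    """Split an IRI/URI using only the generic syntax (no scheme-specific rules)."""
    parts = urlsplit(iri)
    return {
        "scheme": parts.scheme,
        # Level 1: for generic-syntax schemes the host is found directly.
        "host_labels": parts.hostname.split(".") if parts.hostname else [],
        "path_segments": [s for s in parts.path.split("/") if s],
        # parse_qs returns each parameter name mapped to a list of values.
        "query": parse_qs(parts.query),
    }


example = "http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org"
info = describe(example)
print(info["host_labels"])   # the host, split into DNS labels
print(info["query"])         # query values are plain strings

# Level 2: mailto: does not follow the generic syntax, so the generic
# split finds no host; the domain is buried in the path.
print(describe("mailto:user@example.org")["host_labels"])
```

The generic split recovers the scheme, host, path and query parameters from the compact string, but it has no way of knowing that the as_sitesearch value happens to be a domain name (level 3), and it finds no host at all for mailto: (level 2).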
Please note that while IRIs are intended for new protocols (in the sense
that new protocols should preferably use IRIs and not just URIs), they are
also intended for "gradual" updates where that's appropriate, and they are
already used in many protocols/formats.

Regards,    Martin.
Received on Tuesday, 3 July 2012 06:19:57 UTC