- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Tue, 25 Aug 2009 18:51:57 +0900
- To: Larry Masinter <masinter@adobe.com>
- CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, "Roy T. Fielding" <fielding@apache.org>, John C Klensin <klensin@jck.com>
Hello Larry,

On 2009/08/25 11:27, Larry Masinter wrote:

> I've continued to mull over the contradictory requirements
> for IDNA and IRI parsing.
>
> 1) IDNA requires (requests, demands, whatever) that there be no
> way in which a %xx percent-hex-encoded version of an Internationalized
> Domain Name ever be presented to a DNS resolver.

I fully agree that it's not a good idea to send a %-encoded version of an Internationalized Domain Name to a DNS resolver. But I fail to understand the "never ever" aspect of the above sentence. It sounds even stronger than a MUST, maybe MUST NEVER EVER? Can you or anybody else give more details on this?

In my understanding, except for some really, really rare weird cases, what happens is simply "host not found". So we have "incomplete implementation" + "edge case data" => "null result". Nobody is losing big money, nobody is hurt, no hardware is destroyed, etc. Similar things happen in many IETF protocols, even in DNS.

> 2) Traditionally, though, IRIs have been defined as requiring
> a scheme-independent (and syntax-independent) translation from IRI
> (or IRI-like-thing) to URI.

Yes, that's the principle. But please note that even now, RFC 3987 says:

   Systems accepting IRIs MAY convert the ireg-name component of an IRI
   as follows (before step 2 above) for schemes known to use domain
   names in ireg-name, if the scheme definition does not allow
   percent-encoding for ireg-name:

[see http://tools.ietf.org/html/rfc3987#section-3.1 for the details]

I think it would easily be possible to extend this to "scheme definitions which allow percent-encoding for ireg-name" (there are probably not yet that many of these currently, anyway) and "other scheme-specific components that represent domain names" (which would cover cases such as mailto).

> 3) URI schemes have host names in many places, not just one:
> mailto:person@host, ftp://user@host/path, http://host1/path?location=http://host2/path2
>
> I don't think these three things are compatible.
> If IRIs are defined
> by mapping to URIs using (2), then Internationalized Domain Names
> in different schemes (3) will translate to percent-hex-encoding
> domain names in their corresponding URIs, violating the requirement
> for (2).

I guess you wanted to say "violating the requirement for (1)". And that assumes that a domain name with %-encoding in a URI is handed to a DNS resolver without any additional processing. While this is probably what usually happens, especially if IRI->URI conversion and URI resolution are completely independent, it's not a given. There is also the possibility of a URI resolver that is compliant with that part of RFC 3986, and there is the possibility that the %-encoding will be resolved as part of general URI resolution (leading to raw UTF-8 being passed to the DNS resolver), and there are other possibilities.

> I can't see any way around (1) or (3), so this leaves me with the
> uncomfortable choice of abandoning (2), performing major violence
> on the IRI spec.
>
> So here's a swipe at how this might work (please don't shoot me yet):
>
> NO LONGER define an IRI by a generic IRI -> URI mapping.
> INSTEAD, IRI parsing is *scheme specific*. "Internationalized"
> (IRI) versions of URI schemes are defined as:

How would *scheme specific* work for host2 in http://host1/path?location=http://host2/path2 ? (I agree with Erik that we should think about that as just being data.)

> For each URI scheme, there is a corresponding IRI scheme.
> The grammar for the IRI scheme *MUST* be exactly the same
> as the grammar for the URI scheme of the same name, except
> that (a) every syntactic component in the URI scheme that
> allows "unreserved" characters from the URI spec should,
> in the IRI form, allow "Unreserved" characters from the
> IRI repertoire.
>
> In general, the mapping for handling IRIs and interpreting
> them CAN be defined using a generic IRIstring to URIstring
> component using percent-encoding, with the exception that
> host names are translated to the right IDNa format string.
>
> URI schemes do not AUTOMATICALLY get equivalent IRI schemes.
> So, data:, cid:, mid:, tag:, etc. etc. do *Not* have IRI
> equivalents automatically (if needed, someone can support
> them.)

For many schemes, it is indeed the case currently that they do not have an equivalent "IRI scheme". mailto would be the classical example. The mailto scheme, as of RFC 2368, doesn't allow any %-encoding, and therefore doesn't allow any IRIs except for those that are trivially identical to URIs. This is being worked on with http://tools.ietf.org/html/draft-duerst-mailto-bis-06.

On the other hand, I think it would be huge overkill to require that every scheme be defined twice (once for URIs and once for IRIs). It's not rocket science to find the components that are domain names in a scheme definition, so generic language for this in the IRI spec should do the job.

> Instead, we define IRI versions of "http:" and "https:"
> and (maybe) "file:" and "ftp:" and "mailto:" using the
> new generic IRI definition.
>
> This is painful, of course, but at least it seems to be more
> consistent with what's implemented.

Looking at Erik's mail (http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html), implementations seem to be anything but consistent. Why not have them move in the right direction?

In summary, (1) is not in the MUST NEVER EVER category, just in the "shit happens, but it's mostly harmless" category. For (2), RFC 3987 already sins a bit with regard to absolute scheme independence, and we can sin a bit more in iri-bis if that's deemed necessary.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
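
To make the difference between the two mappings discussed above concrete, here is a minimal Python sketch. It is only an illustration under several assumptions: the function names `generic_iri_to_uri` and `host_aware_iri_to_uri` are hypothetical, the host is taken from the authority component only (no IPv6 literals), and Python's built-in `idna` codec (which implements IDNA 2003) stands in for whatever ToASCII operation a real implementation would use. It is not the algorithm text of RFC 3987.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters never escaped here: URI reserved characters, plus "%" so that
# escapes already present in the input are not encoded a second time.
SAFE = "!#$&'()*+,/:;=?@[]%"


def generic_iri_to_uri(iri: str) -> str:
    """Scheme-independent mapping: percent-encode all non-ASCII as UTF-8."""
    return quote(iri, safe=SAFE)


def host_aware_iri_to_uri(iri: str) -> str:
    """Same mapping, except the authority's host goes through IDNA ToASCII.

    Assumption: the authority has the form [userinfo@]host[:port] and the
    host is a registered name, not an IPv6 literal.
    """
    parts = urlsplit(iri)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    ascii_host = host.encode("idna").decode("ascii")  # "xn--..." labels
    netloc = quote(userinfo, safe=SAFE) + at + ascii_host + colon + port
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))


iri = "http://bücher.example/path?location=http://münchen.example/"
print(generic_iri_to_uri(iri))
# http://b%C3%BCcher.example/path?location=http://m%C3%BCnchen.example/
print(host_aware_iri_to_uri(iri))
# http://xn--bcher-kva.example/path?location=http://m%C3%BCnchen.example/
```

Note that in this sketch the second host name, inside the query, is still percent-encoded: it is treated as plain data, as suggested above. A host in a scheme-specific position such as mailto:person@host would likewise not be touched, which is where scheme knowledge (or generic language in the spec about "components that represent domain names") would have to come in.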
Received on Tuesday, 25 August 2009 09:52:57 UTC