Re: IDNA and IRI document way forward from Martin J. Dürst on 2009-08-25 (public-iri@w3.org from August 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 25 Aug 2009 18:51:57 +0900
To: Larry Masinter <masinter@adobe.com>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, "Roy T. Fielding" <fielding@apache.org>, John C Klensin <klensin@jck.com>
Message-ID: <4A93B43D.20600@it.aoyama.ac.jp>

Hello Larry,

On 2009/08/25 11:27, Larry Masinter wrote:
> I've continued to mull over the contradictory requirements
> for IDNA and IRI parsing.
>
> 1) IDNA requires (requests, demands, whatever) that there be no
>    way in which a %xx percent-hex-encoded version of an Internationalized
>    Domain Name ever be presented to a DNS resolver.

I fully agree that it's not a good idea to send a %-encoded version of 
an Internationalized Domain Name to a DNS resolver. But I fail to 
understand the "never ever" aspect of the above sentence. It sounds even 
stronger than a MUST, maybe MUST NEVER EVER? Can you or anybody else 
give more details on this?

In my understanding, except for some really, really rare weird cases, 
what happens is simply "host not found". So we have "incomplete 
implementation" + "edge case data" => "null result". Nobody is losing big 
money, nobody gets hurt, no hardware is destroyed, etc. Similar things happen in 
many IETF protocols, even in DNS.


> 2) Traditionally, though, IRIs have been defined as requiring
>    a scheme-independent (and syntax-independent) translation from IRI
>    (or IRI-like-thing) to URI.

Yes, that's the principle. But please note that even now, RFC 3987 says:

    Systems accepting IRIs MAY convert the ireg-name component of an IRI
    as follows (before step 2 above) for schemes known to use domain
    names in ireg-name, if the scheme definition does not allow
    percent-encoding for ireg-name:

[see http://tools.ietf.org/html/rfc3987#section-3.1 for the details]

I think it would easily be possible to extend this to "scheme 
definitions which allow percent-encoding for ireg-name" (there are 
probably not yet that many of these currently, anyway) and "other 
scheme-specific components that represent domain names" (which would 
cover cases such as mailto).
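
As a rough illustration of the shape of such a conversion, here is a minimal
Python sketch: generic percent-encoding everywhere, with an IDNA exception for
the host of schemes known to use domain names. The names DOMAIN_SCHEMES,
RESERVED and iri_to_uri are hypothetical, the scheme list is an assumption, and
the sketch ignores IPv6 literals. It is not the RFC 3987 algorithm verbatim.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumption: schemes whose authority component is known to hold a DNS name.
DOMAIN_SCHEMES = {"http", "https", "ftp"}

# URI reserved characters plus "%", left untouched so that existing
# delimiters and existing percent-escapes are not re-encoded.
RESERVED = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI, treating the host specially for known schemes."""
    parts = urlsplit(iri)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")  # no IPv6 literal support

    if parts.scheme in DOMAIN_SCHEMES:
        # Scheme-aware exception: convert the domain name with the stdlib
        # "idna" codec (IDNA 2003 ToASCII) instead of percent-encoding it.
        host = host.encode("idna").decode("ascii")
    else:
        # Generic rule: UTF-8 percent-encoding, as for any other component.
        host = quote(host, safe=RESERVED)

    netloc = quote(userinfo, safe=RESERVED) + at + host + colon + port
    # Note: quote() also escapes a few ASCII characters (such as spaces)
    # that the generic mapping would leave alone.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=RESERVED),
        quote(parts.query, safe=RESERVED),
        quote(parts.fragment, safe=RESERVED),
    ))
```

With this sketch, a host under an unlisted scheme simply falls back to
percent-encoding, and a URI embedded in a query string is left alone as data.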

> 3) URI schemes have host names in many places, not just one:
>     mailto:person@host,   ftp://user@host/path, http://host1/path?location=http://host2/path2
>
> I don't think these three things are compatible. If IRIs are defined
> by mapping to URIs using (2), then Internationalized Domain Names
> in different schemes (3) will translate to percent-hex-encoding
> domain names in their corresponding URIs, violating the requirement
> for (2).

I guess you wanted to say "violating the requirement for (1)". And that 
assumes that a domain name with %-encoding in a URI is handed to a DNS 
resolver without any additional processing. While this is probably what 
usually happens, especially if IRI->URI conversion and URI resolution are 
completely independent, it's not a given. A URI resolver may be compliant 
with that part of RFC 3986, or the %-encoding may be resolved as part of 
general URI resolution (leading to raw UTF-8 being passed to the DNS 
resolver), and there are other possibilities.
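
To make one of these possibilities concrete, here is a minimal Python sketch of
a resolver-side step that undoes the %-encoding of a URI host and applies IDNA
before the name goes to DNS. The function name host_for_dns is hypothetical,
and the stdlib "idna" codec used here implements IDNA 2003 (RFC 3490). It is
one option among those mentioned, not a description of what any particular
implementation does.

```python
from urllib.parse import unquote


def host_for_dns(uri_host: str) -> str:
    """Turn a possibly %-encoded URI host into an ASCII name for DNS lookup."""
    # Undo the percent-encoding; strict decoding rejects invalid UTF-8
    # instead of silently replacing it.
    decoded = unquote(uri_host, encoding="utf-8", errors="strict")
    # Apply IDNA ToASCII per label; plain ASCII names pass through unchanged.
    return decoded.encode("idna").decode("ascii")
```

For example, host_for_dns("b%C3%BCcher.example") yields
"xn--bcher-kva.example", so the resolver never sees the %-encoded form.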

> I can't see any way around (1) or (3), so this leaves me with the
> uncomfortable choice of abandoning (2), performing major violence
> on the IRI spec.
>
> So here's a swipe at how this might work (please don't shoot me yet):
>
> NO LONGER define an IRI by a generic IRI ->  URI mapping.
> INSTEAD, IRI parsing is *scheme specific*. "Internationalized"
>    (IRI) versions of URI schemes are defined as:

How would *scheme specific* work for host2 in
    http://host1/path?location=http://host2/path2 ?
(I agree with Erik that we should think about that as just being data.)

>     For each URI scheme, there is a corresponding IRI scheme.
>     The grammar for the IRI scheme *MUST* be exactly the same
>      as the grammar for the URI scheme of the same name, except
>     that (a) every syntactic component in the URI scheme that
>     allows "unreserved" characters from the URI spec should,
>     in the IRI form, allow "Unreserved" characters from the
>     IRI repertoire.
>
>     In general, the mapping for handling IRIs and interpreting
>     them CAN be defined using a generic IRIstring to URIstring
>     component using percent-encoding, with the exception that
>     host names are translated to the right IDNA format string.
>
> URI schemes do not AUTOMATICALLY get equivalent IRI schemes.
> So, data:, cid:, mid:, tag:, etc. etc. do *Not* have IRI
> equivalents automatically (if needed, someone can support
> them.)

For many schemes, it is indeed the case currently that they do not have 
an equivalent "IRI scheme". mailto would be the classical example. The 
mailto scheme, as of RFC 2368, doesn't allow any %-encoding, and 
therefore doesn't allow any IRIs except for those that are trivially 
identical to URIs. This is being worked on with
http://tools.ietf.org/html/draft-duerst-mailto-bis-06.

On the other hand, I think it would be huge overkill to require that 
every scheme be defined twice (once for URIs and once for IRIs). It's 
not rocket science to find the components that are domain names in a 
scheme definition, so generic language for this in the IRI spec should 
do the job.
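
As a small sketch of what "finding the domain name components" could amount to
in code, the following Python fragment keeps one locator function per scheme.
The table DOMAIN_LOCATORS, the helper names, and the choice of schemes are all
assumptions for illustration; a real list would come from the scheme
definitions themselves.

```python
from urllib.parse import urlsplit


def _authority_host(parts):
    # Schemes of the form scheme://[userinfo@]host[:port]/...
    return [parts.hostname] if parts.hostname else []


def _mailto_domains(parts):
    # mailto:local@domain[,local@domain...]; the domain follows the last "@".
    domains = []
    for addr in parts.path.split(","):
        _local, at, domain = addr.rpartition("@")
        if at and domain:
            domains.append(domain)
    return domains


# Hypothetical table: where the domain names live in each known scheme.
DOMAIN_LOCATORS = {
    "http": _authority_host,
    "https": _authority_host,
    "ftp": _authority_host,
    "mailto": _mailto_domains,
}


def domain_components(iri: str) -> list:
    """Return the domain-name components of an IRI, or [] if none are known."""
    parts = urlsplit(iri)
    locator = DOMAIN_LOCATORS.get(parts.scheme)
    return locator(parts) if locator else []
```

Note that for http://host1/path?location=http://host2/path2 this returns only
host1; the embedded host2 is treated as plain data, and schemes without an
entry (data:, tag:, and so on) yield no domain components at all.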

> Instead, we define IRI versions of "http:" and "https:"
> and (maybe) "file:" and "ftp:" and "mailto:" using the
> new generic IRI definition.
>
> This is painful, of course, but at least it seems to be more
> consistent with what's implemented.

Looking at Erik's mail 
(http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html), 
implementations seem to be anything but consistent. Why not have 
them move in the right direction?

In summary, (1) is not in the MUST NEVER EVER category, just in the 
"shit happens, but it's mostly harmless" category. For (2), RFC 3987 
already sins a bit with regard to absolute scheme independence, and we 
can sin a bit more in iri-bis if that's deemed necessary.

Regards,    Martin.
-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 25 August 2009 09:52:57 UTC