Re: IDNA and IRI document way forward from Martin J. Dürst on 2009-07-29 (public-iri@w3.org from July 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Wed, 29 Jul 2009 15:44:33 +0900
To: Larry Masinter <masinter@adobe.com>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, URI <uri@w3.org>
Message-ID: <4A6FEFD1.8020206@it.aoyama.ac.jp>

Hello Larry,

Many thanks for this mail.

On 2009/07/29 13:52, Larry Masinter wrote:
> A note about the direction for URI and IRI:
>
> I'd like to make progress on two tasks
> 1)      resolve any issues with regard to IDNA standards and difficulties with internationalized domain names

There are definitely quite a few issues that need to be resolved.
One thing that is certain is that we have to align the BIDI rules in the IRI 
draft with what IDNA does.

> 2)      resolve any conflicts with the requirements for documenting current browser behavior for compatibility with HTML5

Conflicts between *what* and the current browser behavior?

> For (1) I think the main changes are:
>
> a)      CHANGE (clarify, update, whatever necessary) URI definitions so that no percent-hex-encoded %XX values are allowed in the "HOST NAME" section of any URI.  URIs that wish to refer to internationalized domain names may only use "A-label" domain names.  My understanding is that this is necessary for security reasons, and may require an update to some existing software (http://tools.ietf.org/html/draft-ietf-idnabis-defs-09#section-2.3.2.1). It may also require an update to the HTTP URI scheme and the Mailto: URI scheme.

Do you mean you propose to update RFC 3986 (STD 66)? Can you give more 
details on the "security reasons"? The section you cite does contain a 
lot of definitions, but nothing about security issues or %-encoding 
(which isn't surprising, because %-encoding is URI/IRI-specific, and 
gets resolved before the data goes to the DNS (or IDNA)).
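
To make the layering concrete, here is a minimal sketch in Python. It assumes the escaped octets are UTF-8 and uses Python's built-in "idna" codec, which follows the IDNA 2003 rules rather than the idnabis drafts. The helper name reg_name_to_a_label is made up for illustration.

```python
from urllib.parse import quote, unquote


def reg_name_to_a_label(reg_name: str) -> str:
    """Hypothetical helper: undo %-encoding, then convert to A-labels."""
    # URI/IRI layer: %XX escapes are resolved here, read as UTF-8 octets.
    unicode_name = unquote(reg_name, encoding="utf-8", errors="strict")
    # IDNA layer: Python's built-in "idna" codec (IDNA 2003 rules) only
    # ever sees the decoded Unicode name, never the % signs.
    return unicode_name.encode("idna").decode("ascii")


# Build the escaped form from the example domain used in this thread.
escaped = quote("恵比寿駅") + ".jp"
a_label = reg_name_to_a_label(escaped)
```

Under these assumptions, the escaped and the unescaped spelling of the name end up as the same A-label, so the DNS side never has to know that %-encoding existed.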

BTW, the issue of allowing %-escaping in the reg-name (NOT host name) 
part of a URI is recorded at 
http://labs.apache.org/webarch/uri/rev-2002/issues.html#036-host-escaping.

> b)      Update the IRI definition so that non-ASCII domain names, either in the generic syntax or the alternate syntax, are *not* mapped to URIs by a string algorithm, but rather are parsed, and any IRI ->  URI transformation handled by
> 1.       parsing according to the scheme definition
> 2.        mapping the parsed components based on whether they are host names, query strings (for HTML5), or non-hostname components
> 3.       reassembling the URI components after mapping

Treating host parts specially (i.e. converting them to punycode rather 
than to %-encoding) is already allowed in RFC 3987.
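
As a rough illustration of that kind of conversion (parse, map the host to punycode, %-encode the rest, reassemble), here is a sketch in Python. It is not the algorithm text of RFC 3987; the function name iri_to_uri and the sets of "safe" characters are my own simplifications, userinfo is ignored, and the "idna" codec used here follows IDNA 2003.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched outside the host. "%" is kept so that
# escapes already present are not encoded a second time.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri: str) -> str:
    """Hypothetical sketch: component-wise IRI -> URI conversion."""
    parts = urlsplit(iri)
    # Host part: convert to A-labels instead of %-encoding.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Everything else: UTF-8 octets, %-encoded.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = iri_to_uri("http://恵比寿駅.jp/駅?q=寿")
```

Note that the sketch needs only the generic syntax to find the host; nothing in it depends on the scheme.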

> We will need to review existing registered (and perhaps unregistered?) URI schemes to see if there are other places where host names appear than using the generic scheme://host/path  or scheme:local@remote syntax.  IRI->URI transformation will necessarily be scheme specific, and we'll need to basically define that new URI schemes must either
> (a)    not allow host names anywhere in them
> (b)   use the scheme://host/path syntax
> (c)    be a special exception  -- I think this may be limited to mailto:
>
> There are other uses of domain names in URIs currently; for example, cid: (content-ID) strings often contain domain names.  I'm not sure but it may be reasonable to *not* allow IRI forms, e.g., require that all URIs not using scheme://host/path syntax not allow hex-encoded octets above %7F, for example.

There are many other places where domain names appear in URIs and IRIs.
Take http://validator.w3.org/check?uri=http://www.ietf.org/ as a simple 
example (OT: it would be nice if the IETF site validated). Or take a 
site such as http://恵比寿駅.jp/. For the validator, that would be 
http://validator.w3.org/check?uri=http://恵比寿駅.jp/ or something similar 
(unfortunately, this is currently not supported). It is clearly 
impossible to rule this out. But even before that, doing scheme-wise 
processing kills the U in URIs.
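
A small Python sketch of the validator case shows why. A generic parser finds only the outer host; the inner domain name is plain query data, and only an application that knows what the uri parameter means can find it.

```python
from urllib.parse import parse_qs, urlsplit

outer = "http://validator.w3.org/check?uri=http://恵比寿駅.jp/"
parts = urlsplit(outer)

# What generic (or http-specific) parsing can identify as a host:
outer_host = parts.hostname

# The second domain name is just part of the query string. Finding it
# takes knowledge about this particular service, not about the scheme.
inner = parse_qs(parts.query)["uri"][0]
inner_host = urlsplit(inner).hostname
```

Here outer_host is validator.w3.org, while 恵比寿駅.jp only turns up after interpreting the query, which no per-scheme rule for host names would cover.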

> I think this is unfortunate and a pretty drastic change to the IRI document, but I don't think we're going to make progress if we don't take the bull by the horns.

Before taking anything by the horns (or the tail, or whatever) I'd like 
to know in great detail what exactly the actual (or alleged) bull is.

Regards,    Martin.

> Larry
> --
> http://larry.masinter.net
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 29 July 2009 06:45:32 UTC