IDNA and IRI document way forward

A note about the direction for URI and IRI:

I'd like to make progress on two tasks:
1) resolve any issues with regard to the IDNA standards and the difficulties with internationalized domain names
2) resolve any conflicts with the requirements for documenting current browser behavior for compatibility with HTML5

For (1) I think the main changes are:

a) CHANGE (clarify, update, whatever is necessary) the URI definitions so that no percent-hex-encoded %XX values are allowed in the "HOST NAME" section of any URI. URIs that need to refer to internationalized domain names may only use "A-label" domain names. My understanding is that this is necessary for security reasons, and that it may require an update to some existing software (http://tools.ietf.org/html/draft-ietf-idnabis-defs-09#section-2.3.2.1). It may also require an update to the HTTP URI scheme and the mailto: URI scheme.
b) Update the IRI definition so that non-ASCII domain names, in either the generic syntax or the alternate syntax, are *not* mapped to URIs by a string algorithm. Instead, the IRI is parsed, and any IRI -> URI transformation is handled by:
1. parsing according to the scheme definition
2. mapping the parsed components based on whether they are host names, query strings (for HTML5), or non-hostname components
3. reassembling the URI components after mapping
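
To make the parse / map / reassemble idea concrete, here is a rough Python sketch. It is only an illustration under several assumptions of mine, not a proposed algorithm: it handles only the generic scheme://host/path syntax, the function names (map_host, map_other, iri_to_uri) are made up, the query_encoding parameter is a stand-in for the HTML5 document-encoding case, and Python's built-in "idna" codec implements IDNA 2003 rather than the IDNAbis drafts.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when mapping non-host components: the RFC 3986
# reserved set plus "%" so that existing escapes are not double-encoded.
_KEEP = "%:/?#[]@!$&'()*+,;="


def map_host(host: str) -> str:
    """Map a host name to its A-label form; reject %XX in the host (item a)."""
    if "%" in host:
        raise ValueError("percent-encoded octets are not allowed in the host")
    if host.isascii():
        return host
    # Assumption: the stdlib "idna" codec (IDNA 2003) stands in for IDNA here.
    return host.encode("idna").decode("ascii")


def map_other(component: str, encoding: str = "utf-8") -> str:
    """Percent-encode non-ASCII characters of a non-host component."""
    return quote(component, safe=_KEEP, encoding=encoding)


def iri_to_uri(iri: str, query_encoding: str = "utf-8") -> str:
    # Step 1: parse (generic syntax only; other schemes need their own parser).
    scheme, netloc, path, query, fragment = urlsplit(iri)
    userinfo, at, hostport = netloc.rpartition("@")
    head, colon, tail = hostport.rpartition(":")
    if colon and (tail == "" or tail.isdigit()):
        host, port = head, colon + tail
    else:
        host, port = hostport, ""

    # Step 2: map each component according to what kind of component it is.
    netloc = map_other(userinfo) + at + map_host(host) + port
    path = map_other(path)
    query = map_other(query, query_encoding)  # hypothetical HTML5 hook
    fragment = map_other(fragment)

    # Step 3: reassemble.
    return urlunsplit((scheme, netloc, path, query, fragment))
```

With this sketch, iri_to_uri("http://bücher.example/straße?q=é") gives "http://xn--bcher-kva.example/stra%C3%9Fe?q=%C3%A9", while an input such as "http://ex%41mple.com/" is rejected because of the %XX in the host.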

We will need to review existing registered (and perhaps unregistered?) URI schemes to see whether host names appear anywhere other than in the generic scheme://host/path or scheme:local@remote syntax. IRI->URI transformation will necessarily be scheme-specific, and we'll basically need to define that new URI schemes must either:
(a) not allow host names anywhere in them
(b) use the scheme://host/path syntax
(c) be a special exception -- I think this may be limited to mailto:

There are other uses of domain names in URIs currently; for example, cid: (Content-ID) strings often contain domain names. I'm not sure, but it may be reasonable to *not* allow IRI forms for these, e.g., to require that all URIs not using the scheme://host/path syntax disallow hex-encoded octets above %7F.
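
As a sketch of what such a restriction could look like as a check (again only an illustration; the function names are hypothetical, and treating "starts with // after the scheme" as the test for the generic syntax is my simplification):

```python
import re

# A percent-escape whose octet value is above %7F, i.e. %80 through %FF.
_HIGH_ESCAPE = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def uses_generic_authority(uri: str) -> bool:
    """True if the URI has the scheme://host/path shape (simplified test)."""
    scheme, sep, rest = uri.partition(":")
    return bool(sep) and rest.startswith("//")


def violates_ascii_only_rule(uri: str) -> bool:
    """True if a non-generic-syntax URI carries non-ASCII data in any form."""
    if uses_generic_authority(uri):
        return False  # generic-syntax URIs are handled by the IRI mapping
    has_raw_non_ascii = any(ord(ch) > 0x7F for ch in uri)
    return has_raw_non_ascii or bool(_HIGH_ESCAPE.search(uri))
```

Under this rule "cid:foo%C3%A9@example.com" would be flagged, while "cid:foo@example.com" and "http://example.com/%C3%A9" would not.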

I think this is unfortunate and a pretty drastic change to the IRI document, but I don't think we're going to make progress if we don't take the bull by the horns.

Larry
--
http://larry.masinter.net

Received on Wednesday, 29 July 2009 04:55:00 UTC