RE: IDNA and IRI document way forward from Larry Masinter on 2009-08-25 (public-iri@w3.org from August 2009)

From: Larry Masinter <masinter@adobe.com>
Date: Mon, 24 Aug 2009 19:27:18 -0700
To: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <8B62A039C620904E92F1233570534C9B0118DB8F1E49@nambx04.corp.adobe.com>

I've continued to mull over the contradictory requirements
for IDNA and IRI parsing.

1) IDNA requires (requests, demands, whatever) that there be no
way in which a %xx percent-hex-encoded version of an Internationalized
Domain Name can ever be presented to a DNS resolver.

2) Traditionally, though, IRIs have been defined as requiring
a scheme-independent (and syntax-independent) translation from IRI
(or IRI-like-thing) to URI.

3) URI schemes have host names in many places, not just one:
mailto:person@host, ftp://user@host/path, http://host1/path?location=http://host2/path2

I don't think these three things are compatible. If IRIs are defined
by mapping to URIs using (2), then Internationalized Domain Names
in different schemes (3) will translate to percent-hex-encoded
domain names in their corresponding URIs, violating the requirement
in (1).
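
To make the conflict concrete, here is a minimal Python sketch of a
generic, scheme-independent mapping of the kind described in (2). The
function name generic_iri_to_uri and the host bücher.example are
illustrative assumptions, not taken from any spec; the sketch only
percent-encodes non-ASCII characters and knows nothing about syntax.

```python
from urllib.parse import quote


def generic_iri_to_uri(iri: str) -> str:
    """Hypothetical scheme-independent mapping: percent-encode the UTF-8
    bytes of every non-ASCII character and leave ASCII untouched."""
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="")
        for ch in iri
    )


# The mapping cannot tell a host name from any other component, so the
# domain name ends up percent-encoded in every scheme.
print(generic_iri_to_uri("http://bücher.example/path"))
print(generic_iri_to_uri("mailto:person@bücher.example"))
```

Under these assumptions the host comes out as b%C3%BCcher.example in
both cases, which is exactly the form that (1) says must never reach a
DNS resolver.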

I can't see any way around (1) or (3), so this leaves me with the
uncomfortable choice of abandoning (2), which does major violence
to the IRI spec.

So here's a swipe at how this might work (please don't shoot me yet):

NO LONGER define an IRI by a generic IRI -> URI mapping.
INSTEAD, IRI parsing is *scheme specific*. "Internationalized"
(IRI) versions of URI schemes are defined as:

For each URI scheme, there is a corresponding IRI scheme.
The grammar for the IRI scheme *MUST* be exactly the same
as the grammar for the URI scheme of the same name, except
that every syntactic component in the URI scheme that
allows "unreserved" characters from the URI spec should,
in the IRI form, allow "unreserved" characters from the
IRI repertoire.

In general, the mapping for handling IRIs and interpreting
them CAN be defined using a generic IRI-string to URI-string
conversion using percent-encoding, with the exception that
host names are translated to the right IDNA format string.
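
A rough Python sketch of what such a scheme-specific mapping might look
like for "http:" and "https:" follows. The names http_iri_to_uri and
percent_encode_non_ascii are hypothetical, and Python's built-in "idna"
codec (which implements the IDNA 2003 rules of RFC 3490) stands in for
whatever IDNA processing a real definition would require.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_non_ascii(text: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters only."""
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="")
        for ch in text
    )


def http_iri_to_uri(iri: str) -> str:
    """Hypothetical scheme-specific mapping for http/https IRIs."""
    parts = urlsplit(iri)
    if parts.scheme not in ("http", "https"):
        # Other schemes get no IRI form automatically in this sketch.
        raise ValueError("no IRI form defined for scheme: " + parts.scheme)

    # Split the authority into userinfo, host and port.
    # Simplification: IPv6 literals such as [::1] are not handled.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")

    # The host name goes through IDNA, never through percent-encoding.
    ascii_host = host.encode("idna").decode("ascii")

    netloc = percent_encode_non_ascii(userinfo) + at + ascii_host + colon + port
    return urlunsplit((
        parts.scheme,
        netloc,
        percent_encode_non_ascii(parts.path),
        percent_encode_non_ascii(parts.query),
        percent_encode_non_ascii(parts.fragment),
    ))


print(http_iri_to_uri("http://bücher.example/café?q=ü"))
```

Note that this sketch only recognizes the host in the authority
component. A host name embedded elsewhere, as in the ?location=
example in (3), would still be percent-encoded here.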

URI schemes do not AUTOMATICALLY get equivalent IRI schemes.
So, data:, cid:, mid:, tag:, etc. do *not* have IRI
equivalents automatically (if needed, someone can support
them).

Instead, we define IRI versions of "http:" and "https:"
and (maybe) "file:" and "ftp:" and "mailto:" using the
new generic IRI definition.

This is painful, of course, but at least it seems to be more
consistent with what's implemented.

Larry

--
http://larry.masinter.net

Received on Tuesday, 25 August 2009 02:27:55 UTC