
RE: IDNA and IRI document way forward

From: Larry Masinter <masinter@adobe.com>
Date: Mon, 24 Aug 2009 19:27:18 -0700
To: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <8B62A039C620904E92F1233570534C9B0118DB8F1E49@nambx04.corp.adobe.com>
I've continued to mull over the contradictory requirements
for IDNA and IRI parsing.

1) IDNA requires (requests, demands, whatever) that there be no
  way in which a %xx percent-hex-encoded version of an Internationalized
  Domain Name ever be presented to a DNS resolver.

2) Traditionally, though, IRIs have been defined as requiring
  a scheme-independent (and syntax-independent) translation from IRI
  (or IRI-like-thing) to URI.

3) URI schemes have host names in many places, not just one:
   mailto:person@host,   ftp://user@host/path, http://host1/path?location=http://host2/path2


I don't think these three things are compatible. If IRIs are defined
by mapping to URIs using (2), then Internationalized Domain Names
in different schemes (3) will translate to percent-hex-encoded
domain names in their corresponding URIs, violating the requirement
in (1).
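
To make the conflict concrete, here is a minimal Python sketch of a
scheme-independent mapping in the spirit of (2). The function name
generic_iri_to_uri and the example host are made up for illustration;
this is not the algorithm text from the IRI spec, only the general idea
of percent-encoding the UTF-8 bytes of every non-ASCII character without
looking at where host names sit.

```python
from urllib.parse import quote


def generic_iri_to_uri(iri: str) -> str:
    """Hypothetical generic mapping: percent-encode every non-ASCII
    character as UTF-8 bytes, with no knowledge of scheme or syntax."""
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in iri)


# The host name ends up percent-encoded, which is what (1) forbids.
print(generic_iri_to_uri("http://bücher.example/path"))
# -> http://b%C3%BCcher.example/path
print(generic_iri_to_uri("mailto:person@bücher.example"))
# -> mailto:person@b%C3%BCcher.example
```

Because the mapping cannot tell a host name from any other text, the
same thing happens wherever a host appears in (3).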

I can't see any way around (1) or (3), so this leaves me with the
uncomfortable choice of abandoning (2), which does major violence
to the IRI spec.

So here's a swipe at how this might work (please don't shoot me yet):

NO LONGER define an IRI by a generic IRI -> URI mapping.
INSTEAD, IRI parsing is *scheme-specific*. "Internationalized"
  (IRI) versions of URI schemes are defined as follows:
   

   For each URI scheme, there is a corresponding IRI scheme.
   The grammar for the IRI scheme *MUST* be exactly the same
   as the grammar for the URI scheme of the same name, except
   that every syntactic component in the URI scheme that
   allows "unreserved" characters from the URI spec should,
   in the IRI form, allow "Unreserved" characters from the
   IRI repertoire.

   In general, the mapping for handling and interpreting IRIs
   CAN be defined using a generic IRIstring to URIstring
   conversion using percent-encoding, with the exception that
   host names are translated to the right IDNA format string.
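
As a rough sketch of what such a scheme-specific mapping might look like
for "http:" and "https:", assuming Python's built-in "idna" codec (which
implements the IDNA2003 ToASCII operation) stands in for "the right IDNA
format string". The function names pct and http_iri_to_uri are
hypothetical, and userinfo is deliberately dropped to keep it short.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def pct(component: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters only."""
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in component)


def http_iri_to_uri(iri: str) -> str:
    """Hypothetical scheme-specific mapping for http/https IRIs."""
    parts = urlsplit(iri)
    if parts.scheme not in ("http", "https"):
        # Other schemes do not automatically get an IRI form.
        raise ValueError("no IRI form defined for scheme: " + parts.scheme)
    if parts.hostname is None:
        raise ValueError("IRI has no host component")
    # The host goes through IDNA ToASCII and is never percent-encoded.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Every other component uses the generic percent-encoding.
    return urlunsplit((parts.scheme, netloc, pct(parts.path),
                       pct(parts.query), pct(parts.fragment)))


print(http_iri_to_uri("http://bücher.example/päth?q=ü"))
# -> http://xn--bcher-kva.example/p%C3%A4th?q=%C3%BC
```

Note that a host name embedded in the query (the host2 case in (3))
would still be percent-encoded by this sketch, since only the authority
component is recognized as a host.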

URI schemes do not AUTOMATICALLY get equivalent IRI schemes.
So, data:, cid:, mid:, tag:, etc. do *not* have IRI
equivalents automatically (if needed, someone can define
them.)

Instead, we define IRI versions of "http:" and "https:"
and (maybe) "file:" and "ftp:" and "mailto:" using the
new generic IRI definition.

This is painful, of course, but at least it seems to be more
consistent with what's implemented.

Larry



--
http://larry.masinter.net

Received on Tuesday, 25 August 2009 02:27:55 GMT
