
IDNA, IRIs, and ://..../ authority field

From: Larry Masinter <masinter@adobe.com>
Date: Tue, 8 Apr 2014 00:23:22 +0000
To: "Phillips, Addison" <addison@lab126.com>, Martin Dürst <duerst@it.aoyama.ac.jp>, Leif Halvard Silli <lhs@malform.no>
CC: Anne van Kesteren <annevk@annevk.nl>, "Dave Thaler (dthaler@microsoft.com)" <dthaler@microsoft.com>, "Robin Berjon (robin@w3.org)" <robin@w3.org>, Mark Davis ☕ <mark@macchiato.com>, "www-archive@w3.org" <www-archive@w3.org>
Message-ID: <71361a27ca384ce0868a53add623c2fb@BL2PR02MB307.namprd02.prod.outlook.com>
There is a question about how IRI processors should convert IRIs to URIs when they encounter a previously unrecognized scheme, for example "smtp:" or "submit:", that uses the generic syntax "smtp://hostname/....".
While this is relevant to the IETF applications area working group's new URI registration document, it also affects others. Because it requires cross-organizational coordination, I'm not sure which list this should go on or which lists all of you read, but I hope I'm reaching the editors of the respective documents (I don't know who's handling IDNA, though, or which documents are being updated).




If 'hostname' is a non-ASCII string (IDN), should a processor converting the IRI to a URI use Punycode or %xx-hex-encoding for the authority field?

Currently schemes aren't required to reserve the 'authority' field for DNS names only, so a URI might look like

newscheme://non-ascii-but-not-dns-name/path

for which the "non-ascii-but-not-dns-name" authority shouldn't be Punycode encoded.
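To make the two candidate encodings concrete, here is a minimal Python sketch. The function names are hypothetical and chosen only for illustration. It relies on Python's built-in "idna" codec, which implements the IDNA 2003 ToASCII operation (not IDNA 2008), and it ignores scheme, port, and userinfo entirely.

```python
from urllib.parse import quote


def host_to_punycode(host: str) -> str:
    """Hypothetical helper: encode each label of a host with IDNA ToASCII.

    Uses Python's built-in "idna" codec (IDNA 2003), which yields
    "xn--" labels. Only meaningful if the authority really is a DNS name.
    """
    return host.encode("idna").decode("ascii")


def host_to_percent_encoded(host: str) -> str:
    """Hypothetical helper: percent-encode the UTF-8 bytes of the host.

    This treats the authority as an opaque string, which is the only
    option when it is not known to be a DNS name.
    """
    return quote(host.encode("utf-8"), safe="")


if __name__ == "__main__":
    host = "bücher.example"
    print(host_to_punycode(host))         # xn--bcher-kva.example
    print(host_to_percent_encoded(host))  # b%C3%BCcher.example
```

The same input yields two different ASCII forms, so a processor that does not know whether a new scheme's authority is a DNS name has no way to pick between them from the syntax alone.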


===== IRI status =====

URLs were originally defined as ASCII only.
It was quickly determined that allowing non-ASCII characters was desirable. But shoehorning UTF-8 into ASCII-only systems was unacceptable and Unicode was not yet widely deployed, so the tack taken was to leave "URI" alone and define a new protocol element, "IRI", with RFC 3987 published in 2005 (in sync with the RFC 3986 update to the URI definition).

The IRI -> URI transformation was specified, but it had options, so it wasn't a deterministic path. The URI -> IRI transformation was also heuristic, since there was no guarantee that %xx-encoded bytes in the URI were actually meant to be the percent-hex-encoded bytes of the UTF-8 encoding of a Unicode string.
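The heuristic nature of the URI -> IRI direction can be sketched in a few lines of Python. This is an illustration only, not the procedure from RFC 3987: the function name is hypothetical, and for simplicity it decodes every %xx escape, whereas a real converter would have to leave reserved characters such as %2F encoded.

```python
from urllib.parse import unquote_to_bytes


def percent_decode_if_utf8(component: str) -> str:
    """Hypothetical helper: undo %xx escapes only if the bytes are valid UTF-8.

    If the escaped bytes do not form valid UTF-8, the component is
    returned unchanged. Valid UTF-8 is still only a guess: the bytes
    may have been produced from some other encoding.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 (e.g. a Latin-1 byte such as %FC): keep the escapes.
        return component


if __name__ == "__main__":
    print(percent_decode_if_utf8("b%C3%BCcher"))  # bücher
    print(percent_decode_if_utf8("b%FCcher"))     # b%FCcher
```

Even when decoding succeeds, the result is a guess: "%C3%BC" is valid UTF-8 for "ü", but it is also a valid two-character Latin-1 sequence, which is why the transformation cannot be made exact.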

To address these issues a new working group (the IRI working group) was established in the IETF in 2009. Despite meeting several times, the group didn't get the attention of those active in WHATWG, W3C, or the Unicode consortium, and it was closed in 2014, with the idea that the documents that were in the IRI working group could be updated as individual submissions or within the "applications area" working group.  In particular, one of the IRI working group items was to update the "scheme guidelines and registration process", recently submitted as http://tools.ietf.org/html/draft-ietf-appsawg-uri-scheme-reg-00, which, of course, applies to IRIs as well.

Independently, the HTML5 specs in WHATWG/W3C defined "Web Address", in an attempt to match what some of the browsers were doing. This definition (mainly a published parsing algorithm) was moved out into a separate WHATWG document called "URL".

The world has also moved on. ICANN has approved non-ASCII top-level domains, and IDNA 2003 and 2008 didn't really address IRI encoding.
The Unicode consortium is working on UTS #46.

The big issue is to make the IRI -> URI transformation unambiguous and stable.  And I don't know what to do about non-domain-name, non-ASCII 'authority' fields.  There is some evidence that some processors are %xx-hex-encoding the UTF-8 of domain names in some circumstances.

There are four umbrella organizations (IETF, W3C, WHATWG, Unicode consortium) and multiple documents, and it's unclear whether there's a trajectory to make them consistent:



IETF
  - AppsAWG
      http://tools.ietf.org/html/draft-ietf-appsawg-uri-scheme-reg  (Dave Thaler)

  - Abandoned
      RFC 3987 (IETF Proposed Standard RFC for IRI)
      (abandoned) draft-ietf-iri-3987bis, draft-ietf-iri-comparison, draft-ietf-iri-bidi-guidelines (Masinter),
          originally intended to obsolete RFC 3987
  - IDNA
      IDNA 2003, 2008 specs

W3C
  HTML5 spec now references the WHATWG spec (Robin? Addison)

WHATWG
    http://url.spec.whatwg.org/  (Anne, Leif)
    has a fixed set of relative schemes: ftp, file, gopher (a mistake?), http, https, ws, wss
    Uses IDNA 2003, not 2008
    I'm not sure, but I think it re

Unicode consortium
    UTS #46, and I think others
    recommends translating toAscii and back? But it isn't specific about which schemes.


Received on Tuesday, 8 April 2014 00:23:53 UTC
