RE: IDNA, IRIs, and ://..../ authority field

+Richard

Would we be better off moving to a normal communication forum, such as public-iri@ ? 

Note that at least some of the below was developed in response to my working on a status report/summary document. I have taken a good bit of Larry's text below, which he previously suggested to me, and edited it into my document, which lives here:

   https://www.w3.org/International/wiki/IRIStatus


I'm less concerned about how to "tell the history" (I left my rump-and-not-entirely-accurate version in the document for the nonce) than about documenting the gaps in URL and proposing useful resolutions for them. I think I am correct in saying that the URL spec is basically seen as the vehicle for "fixing" IRI at this point in time.

At present I'm traveling (in Oman) and so my responses will be slow.

Addison Phillips
Globalization Architect (Amazon Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.



> -----Original Message-----
> From: Larry Masinter [mailto:masinter@adobe.com]
> Sent: Monday, April 07, 2014 5:23 PM
> To: Phillips, Addison; Martin Dürst; Leif Halvard Silli
> Cc: Anne van Kesteren; Dave Thaler (dthaler@microsoft.com); Robin Berjon
> (robin@w3.org); Mark Davis ☕; www-archive@w3.org
> Subject: IDNA, IRIs, and ://..../ authority field
> 
> There is a question about handling of IRIs with new schemes, and how IRI
> processors should convert IRIs to URIs when dealing with a previously
> unrecognized scheme. For example, "smtp:" or "submit:" which use the generic
> syntax "smtp://hostname/...."
> While this is relevant to the IETF applications area working group's new URI
> registration document, it also affects others, so I'm not sure what list this
> should go on, but I hope I'm reaching the editors of the respective documents (I
> don't know who's handling IDNA though or what documents are being updated).
> Since it's cross-organizational coordination required, I don't know which list all
> of you read.
> 
> 
> 
> 
> If 'hostname' is a non-ASCII string (IDN), should a processor trying to convert
> the IRI to a URI use Punycode or %xx-hex-encoding for the authority segment?
> 
> Currently schemes aren't required to reserve the 'authority' field for DNS
> names only, so a URI might look like
> 
> newscheme://non-ascii-but-not-dns-name/path
> 
> for which the "non-ascii-but-not-dns-name" shouldn't be Punycode encoded.
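
To make the two options concrete, here is a minimal Python sketch of both encodings applied to a non-ASCII authority. The host "bücher.example" and the two helper names are made up for illustration. Note also that Python's built-in "idna" codec implements IDNA 2003 (not IDNA 2008 or UTS #46), so this shows the shape of the choice rather than a recommended algorithm.

```python
from urllib.parse import quote


def authority_as_punycode(host: str) -> str:
    """Encode each label with Punycode via Python's IDNA 2003 codec."""
    # Only meaningful if the authority really is a DNS name.
    return host.encode("idna").decode("ascii")


def authority_as_percent(host: str) -> str:
    """Percent-encode the UTF-8 bytes of the authority."""
    # safe="" leaves only unreserved characters (letters, digits, "-._~") as-is.
    return quote(host.encode("utf-8"), safe="")


if __name__ == "__main__":
    host = "bücher.example"  # hypothetical host, for illustration only
    print(authority_as_punycode(host))  # xn--bcher-kva.example
    print(authority_as_percent(host))   # b%C3%BCcher.example
```

A processor that does not know whether a new scheme's authority is a DNS name has no way to pick between these two outputs from the string alone, which is the crux of the question.
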
> 
> 
> ===== IRI status ==
> 
> URLs were originally defined as ASCII only. It was quickly determined that it was
> desirable to allow non-ASCII characters, but shoehorning UTF-8 into ASCII-only
> systems was unacceptable and Unicode was not yet widely deployed, so the tack
> was taken to leave "URI" alone and define a new protocol element, "IRI", with RFC
> 3987 published in 2005 (in sync with the RFC 3986 update to the URI definition).
> 
> The IRI -> URI transformation was specified (but it had options, so it wasn't a
> deterministic path) and the URI -> IRI transformation was also heuristic, since
> there was no guarantee that %xx-encoded bytes in the URI were actually meant
> to be the percent-hex-encoded bytes of a UTF-8 encoding of a Unicode string.
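
The heuristic nature of the URI -> IRI direction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, and the "decode as strict UTF-8, otherwise give up" rule is one possible heuristic, not the procedure from RFC 3987.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def percent_decoded_text(component: str) -> Optional[str]:
    """Return the component as Unicode text if its %xx bytes form valid UTF-8.

    Returns None when the bytes are not valid UTF-8, in which case a
    converter would have to leave the %xx escapes untouched.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(percent_decoded_text("b%C3%BCcher"))  # valid UTF-8 for "bücher"
    print(percent_decoded_text("b%FCcher"))     # Latin-1 byte, not valid UTF-8
```

Even when decoding succeeds, nothing proves the bytes were meant as UTF-8 text rather than opaque data, so the result remains a guess.
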
> 
> To address these issues a new working group was established in IETF in 2009
> (The IRI working group) but despite meeting several times, the group didn't get
> the attention of those active in WHATWG, W3C or Unicode consortium, and the
> IRI group was closed in 2014, with the idea that the documents that were in the
> IRI working group could be updated as individual submissions or within the
> "applications area" working group.  In particular, one of the IRI working group
> items was to update the "scheme guidelines and registration process", recently
> submitted http://tools.ietf.org/html/draft-ietf-appsawg-uri-scheme-reg-00
> which, of course, applies to IRIs as well.
> 
> Independently, the HTML5 specs in WHATWG/W3C defined "Web Address", in
> an attempt to match what some of the browsers were doing. This definition
> (mainly a published parsing algorithm) was moved out into a separate
> WHATWG document called "URL".
> 
> The world has also moved on. ICANN has approved non-ASCII top-level domains,
> and IDNA 2003 and 2008 didn't really address IRI encoding.
> The Unicode Consortium is working on UTS #46.
> 
> The big issue is to make the IRI -> URI transformation non-ambiguous and
> stable.  And I don't know what to do about non-domain-name non-ascii
> 'authority' fields.  There is some evidence that some processors are %xx-hex-
> encoding the UTF8 of domain names in some circumstances.
> 
> There are four umbrella organizations (IETF, W3C, WHATWG, Unicode
> consortium) and multiple documents, and it's unclear whether there's a
> trajectory to make them consistent:
> 
> 
> 
> IETF
>   - AppsAWG
>       http://tools.ietf.org/html/draft-ietf-appsawg-uri-scheme-reg  (Dave Thaler)
> 
>   - Abandoned
>       RFC 3987 (IETF Proposed Standard RFC for IRI)
>       = (abandoned) draft-ietf-iri-3987bis, draft-ietf-iri-comparison,
>         draft-ietf-iri-bidi-guidelines (Masinter)
>         intended originally to obsolete RFC 3987
>   - IDNA
>       IDNA 2003, 2008 specs
> 
> W3C
>   HTML5 spec now references the WHATWG spec (Robin? Addison)
> 
> WHATWG
>    http://url.spec.whatwg.org/  (Anne, Leif)
>    Has a fixed set of relative schemes: ftp, file, gopher (a mistake?), http,
>    https, ws, wss
>    Uses IDNA 2003 not 2008
>    I'm not sure, but I think it re
> 
> Unicode consortium
>     UTS #46, and I think others
>     Recommends translating toAscii and back? But it isn't specific about
>     which schemes.
> 

Received on Tuesday, 8 April 2014 03:35:01 UTC