RE: IDNA and IRI document way forward from Larry Masinter on 2009-08-29 (public-iri@w3.org from August 2009)

From: Larry Masinter <masinter@adobe.com>
Date: Fri, 28 Aug 2009 17:39:40 -0700
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, "Roy T. Fielding" <fielding@apache.org>, John C Klensin <klensin@jck.com>
Message-ID: <8B62A039C620904E92F1233570534C9B0118DB9ABBFA@nambx04.corp.adobe.com>

Here is one way to think about what I'm talking about.
RFC 3987 and the current draft specify:

  IRI --translating--> URI 

and then (as necessary):

  URI --parsing--> parsed-URI-component(s)

but most deployed browsers actually do

  IRI --parsing--> parsed-IRI-component(s)

and then (as necessary):

  parsed-IRI-component --translating--> parsed-URI-component

where 'translating' might actually be different
for different components (hostname, form query
parameters).
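
To make the difference concrete, here is a rough sketch of the two
orderings in Python. It is only an illustration under stated assumptions,
not what any browser or RFC 3987 literally does: the function names and
the SAFE set are made up for this example, Python's built-in "idna" codec
(IDNA 2003) stands in for hostname translation, and every other component
simply gets UTF-8 percent-encoding.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumption: characters left untouched when percent-encoding
# (reserved characters plus "%"); unreserved ones are always kept by quote().
SAFE = "/:?#[]@!$&'()*+,;=%"

def translate_then_parse(iri):
    """IRI --translating--> URI --parsing--> components."""
    # The whole string is percent-encoded first, so the hostname
    # gets the same treatment as every other component.
    uri = quote(iri, safe=SAFE)
    return urlsplit(uri)

def parse_then_translate(iri):
    """IRI --parsing--> components --translating--> URI components."""
    parts = urlsplit(iri)
    # Hostname: translated with IDNA ToASCII rather than percent-encoding.
    # Userinfo is dropped here to keep the sketch short.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Other components: UTF-8 percent-encoding (a per-component choice;
    # form query parameters might well use a different encoding).
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

iri = "http://bücher.example/straße"
print(translate_then_parse(iri).netloc)  # percent-encoded hostname
print(parse_then_translate(iri))         # punycode hostname
```

The first ordering leaves a percent-encoded hostname that still has to be
dealt with later; the second can pick the translation per component.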

> Yes, that's the principle. But please note that even now, RFC 3987 says:
>    Systems accepting IRIs MAY convert the ireg-name component of an IRI
>    as follows (before step 2 above) for schemes known to use domain
>    names in ireg-name, if the scheme definition does not allow
>    percent-encoding for ireg-name:

I think this should be a MUST rather than a MAY.

> On the other hand, I think it would be a huge overkill to require that 
> every scheme be defined twice (once for URIs and once for IRIs). 

New schemes should be defined as IRIs if that's applicable. 

The old schemes mainly need a general update based on the new
IRI generic syntax. There are a few special cases, but they
should be addressed specially.


> Looking at Erik's mail 
> (http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html), 
> implementations seem to be everything else but consistent. Why not have 
> them move in the right direction?

I think "parse then escape" is more common than "escape then parse"
so I think this is the "right direction".


> For (2), RFC 3987 
> already sins a bit with regards to absolute scheme-independency, and we 
> can sin a bit more in iri-bis if that's deemed necessary.

I think it's just a matter of going the whole way; at the least, we
should look at what the spec would look like if it proposed that.

At this point, I'm thinking of updating RFC 4395 also
 http://www.rfc-editor.org/rfc/rfc4395.txt 
"Guidelines and Registration Procedures for New URI Schemes"

to encourage scheme definitions to

*  be explicit about the applicability or processing methods
   for Unicode strings (default: not allowed)
*  be explicit about HTTP-like "operations" like GET and POST
   (default: not defined)
*  

and starting a review of registered schemes
  http://www.iana.org/assignments/uri-schemes.html

to update any that need IRI definitions.

I think for consistency that the IRI document 
should acknowledge that these are often
popularly called "URLs" but that term is used only loosely,
and that formal specifications should distinguish between
URL, URI, IRI, LEIRI, HREF and the various other non-terminals.

The HTML document attempts to be precise in so many places that
using a loose term where a precise one is called for seems
inappropriate, but I hope to push off that
discussion until I have at least rough drafts of the updated
IRI document and a new registry doc.

Larry
--
http://larry.masinter.net

Received on Saturday, 29 August 2009 00:40:34 UTC