Re: IDNA and IRI document way forward from Erik van der Poel on 2009-07-29 (public-iri@w3.org from July 2009)

From: Erik van der Poel <erikv@google.com>
Date: Wed, 29 Jul 2009 12:39:19 -0700
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, URI <uri@w3.org>, "John Klensin (klensin@jck.com)" <klensin@jck.com>, Vint Cerf <vint@google.com>
Message-ID: <c07a32650907291239m1f1c7de5k58e78f8fbc08b2af@mail.gmail.com>
Hi all,

First of all, I am glad that this discussion is happening, and
especially happy to see references to "current browser behavior". :-)

I am a bit concerned about a couple of things in the new IRI bis draft:

http://www.ietf.org/id/draft-duerst-iri-bis-06.txt

There are a couple of new terms in this document: Legacy Extended IRIs
(LEIRIs) and Hypertext References. The difference between LEIRIs and
IRIs appears to be whether or not you %-escape certain characters,
such as Space and Unicode Private Use characters. However, Space is an
ASCII character, so one would expect this to be more related to URIs,
which raises the question of why we don't have the term LEURI as well.

Instead of defining new terms like LEIRI and Hypertext References, I
wonder if it might be better to have Standards Track RFCs on URIs and
IRIs and a BCP (Best Current Practice) on the contexts in which these
*RIs occur. The BCP would cover such issues as whether or not to
%-escape Space and Private Use.

The BCP could also cover HTML-specific issues such as the character
encoding of the query part (after the question mark), or the
HTML-specific material could be moved back to W3C.

Erik

On Wed, Jul 29, 2009 at 4:28 AM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
> Hello Larry,
>
> On 2009/07/29 16:07, Larry Masinter wrote:
>>
>> I confess that I'm just coming back up to speed on the
>> issues, and hope you'll forgive me for missing some of
>> the history,
>
> No problem with that.
>
>> It seems there are at least two communities (IDN/IDNA and
>> IRI/WEB) which should have been working together for
>> the past many years, haven't been, and we're now facing
>> some difficulties in bringing their perspectives together,
>> especially when those perspectives have been built
>> into long-standing and finely argued documents.
>
> I guess it's possible to see the situation this way, but I think it's also
> possible to see the glass as (at least) half full. Browser vendors have been
> implementing both IDNs and IRIs. Michel Suignard in particular was involved
> in both efforts. I also was and am on both lists, and tried to notice any
> relevant issues (which doesn't mean that they all made it into the current
> draft).
>
>> I'm not entirely sure of the use case and difficulties,
>> which I will try to track down in more detail.
>
> Great. Very much looking forward to that.
>
>> Just as personal speculation, however,
>> I could easily imagine some problems if it were
>> possible to register domain names which actually
>> contained percent-hex-hex sequences.
>>
>> www.%77%33.org vs www.w3.org?
>
> That was the direction that my speculation would go too.
> It is certainly true that the DNS as such easily accepts any byte sequence
> in its labels. As far as I understand, that even includes null bytes,
> because the packets that the DNS sends use an initial byte to indicate the
> length of a label (taking two bits for other purposes, that results in the
> label length limit of 63 bytes).
>
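To make the length-byte point above concrete, here is a minimal sketch of
how a name is laid out in DNS wire format. The function name
encode_dns_name is hypothetical, and the sketch ignores name compression
and the overall name length limit; it only shows that each label is a
length byte followed by arbitrary bytes, with the length capped at 63
because only six bits of that byte are available.

```python
def encode_dns_name(labels):
    """Encode a sequence of byte-string labels as length-prefixed DNS labels.

    Hypothetical helper for illustration only: no compression, no check
    on the total length of the name.
    """
    out = bytearray()
    for label in labels:
        # Two bits of the length byte are taken for other purposes,
        # so a label length has to fit in the remaining six bits (max 63).
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1 to 63 bytes long")
        out.append(len(label))
        out += label  # any byte value is accepted, including 0x00
    out.append(0)  # the zero-length root label terminates the name
    return bytes(out)


plain = encode_dns_name([b"www", b"w3", b"org"])
with_null = encode_dns_name([b"www", b"w\x003", b"org"])
```

Nothing in this layout rejects the null byte in the second example; any
restriction on label contents has to come from a layer above the wire
format.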
> However, "%77%33" was never legal in URIs before RFC 3986 made it mean
> the same as "w3". Also, "%77%33" is definitely not something any DNS
> registrar would allow for registration, nor is it something that other IETF
> protocols would accept as a host name if they put any restrictions on it at
> all.
>
>> Perhaps that would be a problem not just for IRIs
>> but for other kinds of processing too.  Can this
>> be disallowed at the URI parsing level? Only at
>> the IRI level?
>
> Well, "%77%33" as such would not need to be disallowed.
> One would just have to write "%2577%2533" if one really had a need to
> express it (which I guess would be the case extremely rarely).
>
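The two spellings discussed above can be checked with a short sketch. It
uses Python's urllib.parse.unquote as a stand-in for the percent-decoding
step; the variable names are just for illustration, and real URI
processors may decode at different points.

```python
from urllib.parse import unquote

# "%77%33" percent-decodes to "w3", so the two host names become equivalent.
host = "www.%77%33.org"
decoded_host = unquote(host)

# To carry a literal "%77%33", the percent signs themselves are escaped
# as "%25". One round of decoding then yields the literal text.
escaped_literal = "%2577%2533"
decoded_once = unquote(escaped_literal)

# A processor that (incorrectly) decodes a second time ends up at "w3" again.
decoded_twice = unquote(decoded_once)
```

The last line is the reason decoding should happen exactly once: a second
pass turns the escaped literal back into the name it was meant to be
distinct from.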
>> I see the difficulties of creating a provision for
>> scheme-specific parsing and restrictions on host names
>> containing %xx hex-encoded bytes in URIs are even
>> greater than what I imagined.
>
> I very much have to agree that we have to weigh the difficulties against
> the benefits.
>
>
>>> That would be
>>> http://validator.w3.org/check?uri=http://恵比寿駅.jp/
>
> Or http://%E6%81%B5%E6%AF%94%E5%AF%BF%E9%A7%85.jp/ when escaped
> (using UTF-8, as prescribed by RFC 3986/7).
>
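For reference, the escaped form quoted above can be reproduced with a few
lines. This is only a sketch of the UTF-8 percent-encoding step, using
Python's urllib.parse.quote; it does not perform any IDNA processing, and
the variable names are illustrative.

```python
from urllib.parse import quote, unquote

label = "恵比寿駅"

# quote() encodes the string as UTF-8 by default and percent-escapes
# every resulting byte that is not an unreserved ASCII character.
escaped_label = quote(label)
escaped_uri = "http://" + escaped_label + ".jp/"

# Decoding the escapes as UTF-8 gives back the original label.
round_trip = unquote(escaped_label)
```

Each of the four characters takes three bytes in UTF-8, which is why the
escaped label consists of twelve %XX triplets.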
>> I'm sure there are difficulties even in circumstances that
>> don't use "?", but this is especially difficult since the
>> HTML-URL/HREF/WebAddress handling of non-ASCII query parameters
>> adds some ambiguity to the translation of this into URI space.
>
> Yes. But we should try to have these localized to those contexts,
> if possible.
>
>>> It's very clearly impossible to rule this out.
>>
>> Difficult, but not impossible.
>>
>>> But even before that, doing scheme-wise processing
>>>  kills the U in URIs.
>>
>> And the I in Internationalized and several other things. Let's
>> stick to identifying issues and alternatives.
>
> I agree.
>
> Regards,   Martin.
>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>
>
Received on Wednesday, 29 July 2009 19:40:00 UTC