Re: IDN handling, please help

Hello Larry,

I'm sorry the RFC 3987 text is causing confusion.

On 2009/08/30 10:08, Larry Masinter wrote:
> I'm reading this text over and over again, and I really don't get it. Can someone explain what the distinction is between "scheme definition does not allow percent-encoding for ireg-name, and scheme definition DOES allow percent-encoding for ireg-name"?  What schemes allow percent-encoding for ireg-name, for example?

The distinction we wanted to make is whether a scheme's definition says 
that you can use %-escaping in the reg-name part of the URI. For 
example, the 'mailto:' scheme wouldn't allow %-escaping there, because 
(as of RFC 2368) it doesn't allow %-escaping anywhere. The idea was to 
create a kind of scheme-by-scheme "opt-in" for %-encoding in reg-name, 
so that we would not end up with schemes where most implementations 
were not prepared for %-encoding in reg-name.

However, while trying to come up with more explicit examples, I 
discovered that my impression was wrong. I had thought that many 
schemes predating the IRI spec disallowed %-escaping through explicit 
syntax. In fact, most of them (e.g. HTTP in RFC 2616, see
http://tools.ietf.org/html/rfc2616#section-3.2.2) simply use the 
<host> production. That production doesn't allow %-escaping as defined 
in RFC 2396 (see http://tools.ietf.org/html/rfc2396#section-3.2.2), but 
it does once the reference is updated to RFC 3986 
(http://tools.ietf.org/html/rfc3986#section-3.2.2). So under the theory 
that such updates (e.g. from RFC 2396 to RFC 3986) happen 
automatically, there is no opt-in anymore.
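
To make the difference between the two productions concrete, here is a 
rough Python sketch. The regular expressions are my own transcription 
of the hostname syntax in RFC 2396 and the reg-name syntax in RFC 3986; 
the names HOSTNAME_2396 and REG_NAME_3986 are made up for illustration, 
and IP literals are left out.

```python
import re

# RFC 2396, section 3.2.2 (hostname only, IPv4address omitted):
#   hostname    = *( domainlabel "." ) toplabel [ "." ]
#   domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
#   toplabel    = alpha    | alpha    *( alphanum | "-" ) alphanum
_domainlabel = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
_toplabel = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
HOSTNAME_2396 = re.compile(rf"(?:{_domainlabel}\.)*{_toplabel}\.?")

# RFC 3986, section 3.2.2:
#   reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME_3986 = re.compile(
    r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
)

def allowed_2396(host: str) -> bool:
    """True if host fits the RFC 2396 hostname production."""
    return HOSTNAME_2396.fullmatch(host) is not None

def allowed_3986(host: str) -> bool:
    """True if host fits the RFC 3986 reg-name production."""
    return REG_NAME_3986.fullmatch(host) is not None
```

With these, "r%C3%A9sum%C3%A9.example.org" is rejected by the RFC 2396 
check but accepted by the RFC 3986 one, while 
"xn--rsum-bpad.example.org" passes both.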

> Not sure what problem this is solving, or why the two algorithms are different, or whether one is just a shortcut in a special case.

In terms of a 'full-stack' implementation, it can be seen as a shortcut. 
However, it is also important as a backwards-compatibility measure.

The corresponding backwards-compatibility text in RFC 3986 says:
                                              URI producers should
    provide these registered names in the IDNA encoding, rather than a
    percent-encoding, if they wish to maximize interoperability with
    legacy URI resolvers.
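
As an illustration of the two encodings, here is a minimal Python 
sketch. The function names are hypothetical. It relies on Python's 
built-in "idna" codec, which implements the IDNA of RFC 3490, but that 
codec does not let you set the UseSTD3ASCIIRules and AllowUnassigned 
flags the way the spec text asks for, so please treat it as an 
approximation rather than a conforming implementation.

```python
from urllib.parse import quote

def pct_encode_host(host: str) -> str:
    """Straightforward conversion: percent-encode the UTF-8 bytes.

    Letters, digits and "_.-~" are left alone by quote(), so the
    dots between labels survive.
    """
    return quote(host, safe="")

def idna_encode_host(host: str) -> str:
    """Conversion via ToASCII on each dot-separated label.

    Assumption: Python's "idna" codec is close enough to the ToASCII
    operation for this example. It raises UnicodeError on failure,
    which corresponds to an IRI that cannot be resolved.
    """
    return host.encode("idna").decode("ascii")

host = "r\u00e9sum\u00e9.example.org"
print("http://" + pct_encode_host(host))
print("http://" + idna_encode_host(host))
```

The first form is what the straightforward conversion produces; the 
second is the one that legacy resolvers are likely to understand.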


I hope this is at least a start toward clearing things up.

Regards,    Martin.

> =================================================
>
>
>
> Systems accepting IRIs MAY convert the ireg-name component of an IRI as follows (before step 2 above) for schemes known to use domain names in ireg-name, if the scheme definition does not allow percent-encoding for ireg-name: Replace the ireg-name part of the IRI by the part converted using the ToASCII operation specified in Section 4.1 of [RFC3490] on each dot-separated label, and by using U+002E (FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set to TRUE, and with the flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE otherwise. The ToASCII operation may fail, but this would mean that the IRI cannot be resolved. This conversion SHOULD be used when the goal is to maximize interoperability with legacy URI resolvers. For example, the IRI
> "http://r&#xE9;sum&#xE9;.example.org"
> may be converted to
> "http://xn--rsum-bpad.example.org"
> instead of
> "http://r%C3%A9sum%C3%A9.example.org".
>
> An IRI with a scheme that is known to use domain names in ireg-name, but where the scheme definition does not allow percent-encoding for ireg-name, meets scheme-specific restrictions if either the straightforward conversion or the conversion using the ToASCII operation on ireg-name result in an URI that meets the scheme-specific restrictions.
>
>
> --
> http://larry.masinter.net
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Monday, 31 August 2009 10:20:10 UTC