Re: referencing IDNA2008 (and IDNA2003?)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp> · Date: Sun, 24 Oct 2010 12:14:54 +0900

[cc'ing the IRI WG list; please remove for topics that are not related 
to IRIs]

On 2010/10/23 23:40, Patrik Fältström wrote:
>
> On 22 okt 2010, at 21.35, John C Klensin wrote:
>
>> So, if either
>> the domain-attribute or the request-host contain non-ASCII
>> characters, it needs to convert those strings to A-labels
>> (IDNA2008) or via ToASCII (IDNA2003).
>
> It is a little bit more complicated than this unfortunately.
> If what you might get as "input" (either X or Y) might be an IRI,
> there is a set of IRIs that, the way I read the IRI spec, might
> contain strings that are not IDNA2008 compatible.

Well, RFC 3987 (the current IRI spec) was written before IDNA2008,
so that wouldn't be surprising. Also, IRIs are a syntax to "mix and 
match" various naming mechanisms. There can be many components in an 
IRI, and the reg-name part is only one of them. Both from a spec point of
view (we don't want to repeat everything that can be found in other IETF 
documents) and from an implementation viewpoint (many operations on IRIs 
are to some extent generic), we can't always do everything that we would 
like to do for every IRI component.

Also, the IRI reg-name part, like the reg-name part in URIs, is not 
limited to traditional DNS/IDNA, just as we cannot be sure what directory 
system (if any at all) is behind the path component. 
So this further limits any hard rules (i.e. MUST).

> I have lately started to believe that the only IRIs I would like
> to see in a context like yours are the ones that a) are in UTF-8 and

That's a question of carrier. If a single encoding is appropriate, then 
UTF-8 is definitely recommended. But the IRI spec doesn't have to say 
that, it's RFC 2277 that says that.

> b) fulfil the requirement that they can be transformed
> to a URI and back with a 1:1 mapping specified in the IRI spec.

Well, one issue that comes in here is %-encoding, i.e. for the domain 
name part, we have three possible encodings:
a) Unicode characters, i.e. a real IRI
b) %-encoding, as defined in RFC 3986
c) punycode

I agree that it's a very good idea if in any of these three forms, we 
RECOMMEND to only have things that can convert to the other forms and 
back. (and if I weren't on a plane just now in the north of the Pacific, 
I'd put a ticket to that effect into the tracker)
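
As a rough illustration of the three forms and the round trips between 
them, here is a minimal Python sketch for a single label. The label 
"bücher" and the helper names are made up for this example, and the raw 
punycode codec is used only to show the encoding: it performs none of the 
IDNA2008 validity checks, so this is not a conformance test.

```python
from urllib.parse import quote, unquote

ACE_PREFIX = "xn--"  # prefix that marks a Punycode-encoded label


def to_percent(label: str) -> str:
    """Form b): %-encode the UTF-8 octets of the label."""
    return quote(label, safe="")


def to_ace(label: str) -> str:
    """Form c): Punycode plus ACE prefix; ASCII-only labels are left alone."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(label: str) -> str:
    """Back from form c) to form a); other labels are returned unchanged."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def round_trips(label: str) -> bool:
    """True if the label survives a) -> b) -> a) and a) -> c) -> a)."""
    via_percent = unquote(to_percent(label))
    via_ace = from_ace(to_ace(label))
    return via_percent == label and via_ace == label


label = "bücher"
print(to_percent(label))   # b%C3%BCcher
print(to_ace(label))       # xn--bcher-kva
print(round_trips(label))  # True
```

A real implementation would also have to reject labels that IDNA2008 
disallows before accepting any of the three forms.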

> Now there is a new IRI draft out, and I have not checked the details in it,

Not really a problem. Many of the details are not really worked out yet. 
But comments are appreciated at any time.

> but I think we all would like to have:
>
> - IDNA2008 where there is a 1:1 mapping between A-label and U-label,
> and no mapping like IDNA2003 (potential mapping _must_ really happen
> outside of whatever distributed comparison algorithm we are using)

The IRI spec, in parallel to the URI spec, contains a whole detailed 
section on various ways to compare IRIs. Different applications have 
vastly different needs. XML namespace processing and RDF, for example, 
do it character by character, even 'http:' and 'Http:' are different for 
them. Spiders and similar software, on the other hand, are very 
aggressive and don't mind a few false positives, and for them some 
mappings (be that IDNA2003 or TR46 or RFC 5895) may make a lot of sense.
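
To make the difference concrete, here is a small Python sketch of two 
points on that range: plain character-by-character comparison, and a 
comparison that first lowercases the scheme and the host. The function 
names and example IRIs are invented for illustration, and the second 
function is only a simplification (it assumes there is no userinfo in the 
authority and applies none of the IDNA-related mappings).

```python
from urllib.parse import urlsplit, urlunsplit


def simple_equal(a: str, b: str) -> bool:
    """Character-by-character comparison, with no normalization at all."""
    return a == b


def case_normalized_equal(a: str, b: str) -> bool:
    """Compare after lowercasing scheme and host only."""
    def norm(iri: str) -> str:
        p = urlsplit(iri)
        # Assumption: no userinfo in the authority, so lowercasing the
        # whole netloc only affects the host (and an ASCII port).
        return urlunsplit(
            (p.scheme.lower(), p.netloc.lower(), p.path, p.query, p.fragment)
        )
    return norm(a) == norm(b)


a = "http://example.org/ns"
b = "Http://Example.ORG/ns"
print(simple_equal(a, b))           # False
print(case_normalized_equal(a, b))  # True
```

The path is deliberately left untouched, so "/ns" and "/NS" still compare 
as different under both functions.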

> - IRIs and URIs that only contain domain names that are IDNA2008
> compatible (U-label or A-label in the domain name part)

Yes, ideally. But we can't educate all end users inputting IRIs about 
IDNA, and we can't map on input whenever somebody inputs an IRI (the 
address bar may be easy, but email or plain text editors are not).

Regards,   Martin.

> If we start with that as base rules, then you can hopefully
> in your spec add additional "temporary rules" that might be recommended
> for backward compatibility reasons. But I think you should really call them that.
>
> If you have these rules, then you can -- modulo A-label/U-label transformation
> and URI/IRI transformation that both are 1:1 -- do much simpler comparison
> than what you otherwise can do if you have to start doing transformations
> of Unicode strings (regardless of the encoding of the Unicode string).
>
> What is important though is that you in the security consideration section
> explicitly note that there are many, many combinations of octets
> that are not only invalid when these rules are applied, but that, if you
> are unlucky, might cause buffer overflow issues (at best) when you try to do
> various things with the strings, such as A-label/U-label transformation.
>
>     Patrik

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp