Re: §2.1.3 IRI/URI Canonicalization does not address IRIs with IDNs

Thanks very much for taking the time to look at this, Felix.

I must admit that I had assumed there was a one-to-one mapping between 
Punycode and Unicode. I based this on finding at least one online 
service that converts between the two [1] and a Perl module that seems 
to offer the same.

Hmmm... I see that Verisign stores IDNs as Punycode, so maybe the 
canonicalisation step should be to convert _to_ Punycode? But, as Kevin 
has pointed out, if two different Unicode strings can give the same 
Punycode, we could be in trouble here anyway.
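
To make the collision concrete, here is a minimal sketch using Python's 
built-in "punycode" and "idna" codecs. It assumes the "idna" codec's 
IDNA2003 behaviour (Nameprep followed by Punycode), which is not 
necessarily what any particular registry does. It suggests that Punycode 
on its own round-trips, and that the loss happens in the Nameprep 
mapping applied before encoding.

```python
# Punycode alone is reversible for a single label.
label = "exåmple"
encoded = label.encode("punycode")           # ASCII form, without the "xn--" prefix
decoded = encoded.decode("punycode")
round_trips = decoded == label

# The "idna" codec applies Nameprep first (assumption: IDNA2003 rules),
# which maps "ß" to "ss" and folds case before Punycode-encoding each label.
with_sharp_s = "exåmpleß.org".encode("idna")
with_double_s = "exåmpless.org".encode("idna")
collide = with_sharp_s == with_double_s

print(round_trips, collide)
```

If both values print as True, then converting _to_ the ASCII form is 
deterministic, while converting back cannot recover which spelling was 
originally typed.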

Our basic need is that we must be able to be certain whether a given IRI 
does or does not match a small data set. Typically, something like

<iriset>
   <includehosts>exåmple.org</includehosts>
   <pathcontains>foó</pathcontains>
</iriset>

For a given IRI, we need to be 100% sure whether it does or does not 
match these conditions - i.e. that it has a host with the last two 
components of exåmple.org and a path that contains foó.

Given that an IRI may, or may not, have been re-encoded in one of 
several different ways, how can we canonicalise it before matching?
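
One possible shape for this, offered only as a sketch: reduce both the 
IRI and the values in the data set to a single form, then compare. The 
choices below are assumptions rather than anything POWDER has agreed: 
the host is taken to its ASCII (Punycode) form via Python's IDNA2003 
"idna" codec, the path is percent-decoded as UTF-8 and put into 
Normalization Form C, and the function names canonical_host and 
iri_matches are made up for illustration.

```python
import unicodedata
from urllib.parse import urlsplit, unquote


def canonical_host(host):
    """Reduce a host name to lower-case ASCII (assumption: IDNA2003 ToASCII)."""
    host = unicodedata.normalize("NFC", host)
    return host.encode("idna").decode("ascii").lower()


def iri_matches(iri, include_host, path_contains):
    """True if the IRI's host ends with include_host and its path contains path_contains."""
    parts = urlsplit(iri)
    host = canonical_host(parts.hostname or "")
    target = canonical_host(include_host)
    # Match the host itself or any subdomain of it, on whole labels only.
    host_ok = host == target or host.endswith("." + target)
    # Undo percent-encoding, then normalise so that composed and
    # decomposed spellings of the same character compare equal.
    path = unicodedata.normalize("NFC", unquote(parts.path))
    wanted = unicodedata.normalize("NFC", path_contains)
    return host_ok and wanted in path


print(iri_matches("http://www.xn--exmple-jua.org/a/fo%C3%B3/b", "exåmple.org", "foó"))
print(iri_matches("http://www.example.org/foó", "exåmple.org", "foó"))
```

Comparing in the ASCII direction avoids having to decide which Unicode 
string a given Punycode label "really" was.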

And Eric has pointed out that å can be written either as U+00E5 or as 
'a' + U+030A (combining ring above), so there's clearly a lot to this. 
POWDER is a small group with limited resources, so any help you can 
offer would be greatly appreciated.
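
A small sketch of Eric's point, using Python's standard unicodedata 
module: the two spellings are different code point sequences, and 
Unicode normalization is the usual way to make them compare equal. The 
choice of NFC here is an assumption; NFD would unify them equally well.

```python
import unicodedata

composed = "\u00e5"        # å as a single code point
decomposed = "a\u030a"     # 'a' followed by combining ring above

# As raw strings the two spellings differ, in content and in length.
raw_equal = composed == decomposed
lengths = (len(composed), len(decomposed))

# After normalization to the same form they are identical.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

print(raw_equal, lengths, nfc_equal, nfd_equal)
```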

Phil.

[1] http://www.idnforums.com/converter/

Felix Sasaki wrote:
> Hi Phil,
> 
> I was looking into this section in your attachment:
> 
> [
> 2.1.3.4 Internationalized Domain Names
>    * Internationalized Domain Names (IDNs) should be converted from 
> Punycode [RFC3492] into their UTF-8 string representations. So that, for 
> example:
>      http://www.xn--exmple-jua.org/
>      becomes
>      http://www.exåmple.org/.
> ]
> 
> If you have
> http://www.xn--exmpless-jua.org/
> It is not possible to decide whether it should become
> http://www.exåmpless.org/
> or
> http://www.exåmpleß.org/
> since "ss" in the Punycode string could have been originally "ss" or "ß".
> So I think this canonicalization step is not feasible. I'm also not sure 
> if it is necessary: If you get http://www.xn--exmpless-jua.org/ you 
> could process it in Powder just "as is", without trying to go to the 
> representation with non-ASCII characters. The same for 
> http://www.exåmpless.org/ . But maybe I am missing something?
> 
> Just let me know what you think. Note that the problem of the 
> unidirectional relation between "ß" and "ss" is a problem of IDNs which 
> will soon be addressed by a proposed IETF Working Group, see 
> http://www.alvestrand.no/pipermail/idna-update/2008-March/001343.html
> 
> Felix

Received on Thursday, 10 April 2008 08:19:30 UTC