- From: Phil Archer <parcher@icra.org>
- Date: Thu, 10 Apr 2008 09:14:25 +0100
- To: Felix Sasaki <fsasaki@w3.org>
- CC: Eric Prud'hommeaux <eric@w3.org>, public-powderwg@w3.org, public-i18n-core@w3.org
Thanks very much for taking the time to look at this, Felix.

I must admit that I made the assumption that there was a one-to-one mapping between Punycode and Unicode. I based this on finding at least one online service that converts between the two [1] and a Perl module that seems to offer the same.

Hmmm... I see that Verisign stores IDNs as Punycode, so maybe the canonicalisation step should be to convert _to_ Punycode? But, as Kevin has pointed out, if two different Unicode strings can give the same Punycode, we could be in trouble here anyway.

Our basic need is that we must be able to be certain whether a given IRI does or does not match a small data set. Typically, something like

<iriset>
  <includehosts>exåmple.org</includehosts>
  <pathcontains>foó</pathcontains>
</iriset>

For a given IRI, we need to be 100% sure whether it does or does not match these conditions, i.e. that it has a host whose last two components are exåmple.org and a path that contains foó. Given that an IRI may, or may not, have been re-encoded in one of several different ways, how can we canonicalise it before matching?

And Eric has pointed out that å can be written either as U+00E5 or as 'a' + U+030A, so there's clearly a lot to this.

POWDER is a small group with limited resources, so any help you can offer would be greatly appreciated.

Phil.

[1] http://www.idnforums.com/converter/

Felix Sasaki wrote:
> Hi Phil,
>
> I was looking into this section in your attachment:
>
> [
> 2.1.3.4 Internationalized Domain Names
> * Internationalized Domain Names (IDNs) should be converted from
> Punycode [RFC3492] into their UTF-8 string representations. So that, for
> example:
> http://www.xn--exmple-jua.org/
> becomes
> http://www.exåmple.org/.
> ]
>
> If you have
> http://www.xn--exmpless-jua.org/
> It is not possible to decide whether it should become
> http://www.exåmpless.org/
> or
> http://www.exåmpleß.org/
> since "ss" in the Punycode string could have been originally "ss" or "ß".
> So I think this canonicalization step is not feasible. I'm also not sure
> if it is necessary: If you get http://www.xn--exmpless-jua.org/ you
> could process it in Powder just "as is", without trying to go to the
> representation with non-ASCII characters. The same for
> http://www.exåmpless.org/ . But maybe I am missing something?
>
> Just let me know what you think. Note that the problem of the
> unidirectional relation between "ß" and "ss" is a problem of IDNs which
> will soon be addressed by a proposed IETF Working Group, see
> http://www.alvestrand.no/pipermail/idna-update/2008-March/001343.html
>
> Felix
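
To make the question of canonicalising before matching concrete, here is a minimal sketch in Python of the "convert _to_ Punycode" direction. It is not POWDER's method, only an illustration under stated assumptions: the helper names (canonical_host, canonical_path, matches) are hypothetical, percent-encoded paths are assumed to be UTF-8, and the host step relies on Python's built-in "idna" codec, which implements IDNA 2003 (including the nameprep step that folds "ß" to "ss").

```python
import unicodedata
from urllib.parse import urlsplit, unquote


def canonical_host(host: str) -> str:
    """Hypothetical helper: reduce a host name to its ASCII (Punycode) form.

    NFC makes U+00E5 and 'a' + U+030A identical; the idna codec then
    applies IDNA 2003 ToASCII label by label. ASCII labels pass through
    unchanged, so they are lower-cased explicitly afterwards.
    """
    nfc = unicodedata.normalize("NFC", host)
    return nfc.encode("idna").decode("ascii").lower()


def canonical_path(path: str) -> str:
    """Hypothetical helper: undo percent-encoding (assumed UTF-8), then NFC."""
    return unicodedata.normalize("NFC", unquote(path))


def matches(iri: str, include_host: str, path_contains: str) -> bool:
    """Sketch of an includehosts + pathcontains test on canonical forms."""
    parts = urlsplit(iri)
    host = canonical_host(parts.hostname or "")
    wanted = canonical_host(include_host)
    # Match the host itself or any subdomain of it.
    host_ok = host == wanted or host.endswith("." + wanted)
    return host_ok and canonical_path(path_contains) in canonical_path(parts.path)


if __name__ == "__main__":
    # Precomposed and decomposed spellings of the same host compare equal.
    print(canonical_host("ex\u00e5mple.org"))   # xn--exmple-jua.org
    print(canonical_host("exa\u030ample.org"))  # xn--exmple-jua.org

    # The same conditions match three different spellings of one IRI.
    for iri in (
        "http://www.ex\u00e5mple.org/a/fo\u00f3/b",
        "http://www.xn--exmple-jua.org/a/fo%C3%B3/b",
        "http://www.exa\u030ample.org/a/foo\u0301/b",
    ):
        print(matches(iri, "ex\u00e5mple.org", "fo\u00f3"))  # True

    # Under IDNA 2003, "ß" and "ss" collapse to the same ASCII form.
    print(canonical_host("ex\u00e5mple\u00df.org")
          == canonical_host("ex\u00e5mpless.org"))  # True
```

Comparing on the ASCII side sidesteps the ambiguity Felix describes: the mapping to the ASCII form is many-to-one, so it can always be computed, whereas the reverse direction cannot recover whether the original had "ss" or "ß". The cost is that hosts differing only in that respect are treated as the same host.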
Received on Thursday, 10 April 2008 08:19:30 UTC