Re: §2.1.3 IRI/URI Canonicalization does not address IRIs with IDNs

Eric,

Thanks for the chat on Monday. As promised, I've tried to write up what 
we discussed. This follows on from work that WG member Kevin Smith did 
back in January [1], which addressed some of the issues initially raised 
by Thomas Roessler. Those changes should have been included in the 
current public draft but were left out by my mistake.

OK, I've now added in some more based on our conversation and the 
discussion here with Felix. My guess is that you'll think we've still 
missed a few details, but I'm not so sure.

A POWDER document is an XML instance. Comparison between an IRI and the 
data in that document may be carried out at any layer - browser, network 
or wherever. I am uneasy about saying things like "if you're working at 
the network layer do this, else do the other." The copy-and-paste case 
doesn't apply: we have an application that has access to the POWDER doc 
and is fed an IRI. Its job is to decide whether or not the IRI is in 
scope, extract the correct description and take action accordingly. A 
POWDER processor isn't something you'd ever paste an IRI into (except 
in a test environment).

Also, if you run a script that converts %-encoded triples into 
characters, except those that encode reserved characters, and there 
aren't any such triples, then OK, nothing happens.
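
To make that concrete, here is a minimal sketch of such a script, in 
Python rather than Perl. It is not the text of the draft: the function 
name decode_triples is hypothetical, the reserved set is taken from 
RFC 3986 section 2.2, and I am assuming that "%" itself stays encoded 
and that triples are UTF-8 bytes.

```python
import re

# Reserved characters (gen-delims and sub-delims), RFC 3986 section 2.2.
RESERVED = set(":/?#[]@!$&'()*+,;=")
# Assumption: "%" itself also stays encoded so the result is unambiguous.
KEEP_ENCODED = RESERVED | {"%"}

# A run of one or more consecutive %XX triples.
_TRIPLE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_triples(text):
    """Decode %XX triples to characters, except reserved ones (hypothetical helper)."""
    def repl(match):
        run = match.group(0)
        raw = bytes.fromhex(run.replace("%", ""))
        try:
            chars = raw.decode("utf-8")  # assumption: triples are UTF-8 bytes
        except UnicodeDecodeError:
            return run  # not valid UTF-8: leave the run untouched
        out = []
        for ch in chars:
            if ch in KEEP_ENCODED:
                # Re-emit reserved characters as upper-case triples.
                out.append("%{:02X}".format(ord(ch)))
            else:
                out.append(ch)
        return "".join(out)

    return _TRIPLE_RUN.sub(repl, text)


if __name__ == "__main__":
    path = "/D%C3%BCsseldorf?stra%C3%9Fenname=Wald+Stra%C3%9Fe&benutzer=eric%40w3%2Eorg"
    print(decode_triples(path))
    # /Düsseldorf?straßenname=Wald+Straße&benutzer=eric%40w3.org
    print(decode_triples("/no/triples/here"))
    # /no/triples/here
```

In this sketch %40 stays encoded because "@" is reserved, while %2E 
becomes "." and the UTF-8 triples become "ü" and "ß". A string with no 
triples comes back unchanged.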

I'm working on a little Perl-based thing to test some of this out and am 
pleased to have found a module that does IDN conversion [2]!
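
For anyone who wants to try the same conversion without Perl, here is a 
rough equivalent using the "idna" codec in Python's standard library. 
That codec implements IDNA 2003, and the helper names host_to_ascii and 
host_to_unicode are mine, purely for illustration.

```python
def host_to_ascii(host):
    """Convert a Unicode host name to its ASCII (punycode) form, IDNA 2003."""
    return host.encode("idna").decode("ascii")


def host_to_unicode(host):
    """Convert an ASCII (punycode) host name back to Unicode."""
    return host.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(host_to_ascii("www.bravå.nu"))               # www.xn--brav-toa.nu
    print(host_to_unicode("www.xn--brav-toa.nu"))      # www.bravå.nu
    # IDNA 2003 maps "ß" to "ss" before encoding, so this one does not
    # round-trip back to the original spelling.
    print(host_to_ascii("straßenausblick.example"))    # strassenausblick.example
    print(host_to_unicode("strassenausblick.example")) # strassenausblick.example
```

The last two lines show the non-reversible Eszett mapping that Felix 
mentions below.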

The attached doc is an update of the currently published one and 
includes the rewrites of the canonicalisation and data encoding 
sections. Any comments from you or others would be greatly appreciated.

Phil.

[1] http://lists.w3.org/Archives/Public/public-powderwg/2008Jan/0023.html
[2] http://search.cpan.org/~cfaerber/Net-IDN-tools-0.99_20071012.1/lib/Net/IDN/Punycode.pm

Eric Prud'hommeaux wrote:
> * Felix Sasaki <fsasaki@w3.org> [2008-02-12 09:47+0900]
>> Hi Eric (putting i18n core into the loop),
>>
>> Eric Prud'hommeaux wrote:
>>> http://www.w3.org/2007/powder/Group/powder-grouping/20080128.html#canon
>>> does not include IDN example or rules.
>>>   
>> there is no need for an IDN example or rule. IRI vs. URI, and IRI>URI  
>> conversion (percent escaping) are a step, which is independent of  
>> preprocessing necessary for domain name resolution. See also the  
>> processing described at
>>
>> http://www.w3.org/International/articles/idn-and-iri/#idn
> 
> As you mention below, this does depend on the use case. Imagine a
> proxy content blocker working with an IRI like:
>   http://straßenausblick.example/Düsseldorf?straßenname=Wald+Straße&benutzer=eric%40w3%2Eorg
> The HTTP proxy sees that in a Host and path tuple:
> [[
> GET /D%C3%BCsseldorf?stra%C3%9Fenname=Wald+Stra%C3%9Fe&benutzer=eric%40w3%2Eorg HTTP/1.1
> Host: strassenausblick.example
> ]]
> 
> (note that "ß" => "ss" (tx for pointing this out, Felix)). Matching
> that against labels for the original IRI is non-trivial as you need to
> look for labels for both straßenausblick.example and
> strassenausblick.example.
> 
>>> An example (working) IDN IRI:
>>>   http://www.bravå.nu/
>>> The IDN is punycoded when the IRI is expressed as a URI:
>>>   http://www.xn--brav-toa.nu/
>>>
>>> == homonyms ==
>>> å can be written either Ue5 or 'a' + U30a (COMBINING RING ABOVE).
>>> This results in a different punycoded IDN. 
>> the punycode is only "seen" by the domain name server which uses it for  
>> domain name resolution. There is no need to use it for *IRI/URI*  
>> Canonicalization.
> 
> I think it's seen at the proxy, but not at the location bar.
> 
>>> Unicode gives *some*
>>> c14n (or folding) rules, but not all, and they are not cheap to
>>> implement.
>>>
>>> == fixing ==
>>> This should probably be addressed in an update of mnot's URISpace Note
>>>   http://www.w3.org/TR/urispace
>>>
>>> I recommend inserting in 2.1.3.3 Punycode (or maybe IDN) Conversion:
>>>
>>>   • Internationalized Domain Names (IDNs) are converted from their
>>>     punycode form to Unicode code points.
>>>   
>> where does this happen?
> 
> Good question, and I think, the crux of the issue. If the user pastes
> an IRI into the location bar, the matching code should, I believe,
> work with it directly (no '%' unescaping). If it's a proxy or the
> server, then I expect the job is as described above. If the IRI is in
> an HTML href, then it needs to be un-HTML-escaped, but again *not* be
> '%' unescaped or it will corrupt the URI.
> 
> The example URI includes an email address (eric@w3.org) as a user
> identifier. I would expect POWDER descriptions of the resource to
> include the '%'s, e.g. "...benutzer=eric%40w3%2Eorg". Likewise, I
> would expect a link in HTML to simply &-encode the string:
> <a href="http://straßenausblick.example/Düsseldorf?straßenname=Wald+Straße&amp;benutzer=eric%40w3%2Eorg">mein haus</a>
> but I haven't researched that at all.
> 
> Perhaps the whole of §2.1.3 IRI/URI Canonicalization should focus
> on a few example scenarios and say what decoding is necessary.
> 
>>                         Note that in IDNA version 2003, roundtripping  
>> Unicode > punycode < Unicode is not possible, since during the step  
>> Unicode > punycode, non-reversible mapping (e.g. Eszett > ss) are made.  
>> But as said above, I think this is out of scope for IRI/URI  
>> canonicalization.
> 
> appreciatively noted and used for the example.

Received on Wednesday, 9 April 2008 14:04:50 UTC