Re: Re: §2.1.3 IRI/URI Canonicalization does not address IRIs with IDNs

* Felix Sasaki <fsasaki@w3.org> [2008-02-12 09:47+0900]
> Hi Eric (putting i18n core into the loop),
>
> Eric Prud'hommeaux wrote:
>> http://www.w3.org/2007/powder/Group/powder-grouping/20080128.html#canon
>> does not include IDN example or rules.
>>   
>
> there is no need for an IDN example or rule. IRI vs. URI, and IRI>URI  
> conversion (percent escaping) are a step, which is independent of  
> preprocessing necessary for domain name resolution. See also the  
> processing described at
>
> http://www.w3.org/International/articles/idn-and-iri/#idn

As you mention below, this does depend on the use case. Imagine a
proxy content blocker working with an IRI like:
  http://straßenausblick.example/Düsseldorf?straßenname=Wald+Straße&benutzer=eric%40w3%2Eorg
The HTTP proxy sees that as a request line plus a Host header:
[[
GET /D%C3%BCsseldorf?stra%C3%9Fenname=Wald+Stra%C3%9Fe&benutzer=eric%40w3%2Eorg HTTP/1.1
Host: strassenausblick.example
]]

(Note that "ß" becomes "ss"; thanks for pointing this out, Felix.)
Matching that request against labels for the original IRI is
non-trivial, as you need to look for labels for both
straßenausblick.example and strassenausblick.example.
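
To make the split concrete, here is a minimal Python sketch of how the
IRI above turns into what the proxy sees. It assumes the standard
library's "idna" codec (which implements IDNA 2003) and
urllib.parse.quote; the helper name iri_to_wire is mine, purely for
illustration, and is not anything POWDER defines.

```python
from urllib.parse import urlsplit, quote

def iri_to_wire(iri):
    """Return (host, request_target) roughly as an HTTP proxy sees them.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(iri)
    # IDNA 2003 ToASCII via Python's built-in codec; this maps "ß" to "ss".
    host = parts.hostname.encode("idna").decode("ascii")
    # Percent-encode non-ASCII as UTF-8, leaving existing %XX escapes alone.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&+%")
    target = path + ("?" + query if query else "")
    return host, target

iri = ("http://straßenausblick.example/Düsseldorf"
       "?straßenname=Wald+Straße&benutzer=eric%40w3%2Eorg")
host, target = iri_to_wire(iri)

# A matcher would have to check labels for both spellings of the host.
candidate_hosts = {urlsplit(iri).hostname, host}
```

The point of candidate_hosts is only that a label lookup keyed on one
spelling will miss descriptions written with the other.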

>> An example (working) IDN IRI:
>>   http://www.bravå.nu/
>> The IDN is punycoded when the IRI is expressed as a URI:
>>   http://www.xn--brav-toa.nu/
>>
>> == homonyms ==
>> å can be written either as U+00E5 or as 'a' + U+030A (COMBINING RING ABOVE).
>> This results in a different punycoded IDN. 
>
> the punycode is only "seen" by the domain name server which uses it for  
> domain name resolution. There is no need to use it for *IRI/URI*  
> Canonicalization.

I think the punycode is seen at the proxy, though not in the location bar.
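
On the homonym point quoted above, a small Python sketch of the two
spellings of å. It assumes the standard library's raw "punycode" codec
(no normalization) and its "idna" codec (IDNA 2003, whose nameprep
step includes NFKC normalization); the variable names are mine.

```python
import unicodedata

composed = "brav\u00e5"      # å as a single code point, U+00E5
decomposed = "brava\u030a"   # 'a' followed by U+030A COMBINING RING ABOVE

# Raw Punycode (no normalization step) differs between the two spellings.
raw_composed = composed.encode("punycode")
raw_decomposed = decomposed.encode("punycode")

# Normalizing to NFC first makes the two spellings identical.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed

# Python's "idna" codec runs nameprep before Punycode, so both
# spellings yield the same ACE label there.
ace_composed = ("www." + composed + ".nu").encode("idna")
ace_decomposed = ("www." + decomposed + ".nu").encode("idna")
```

So whether the two spellings end up as different punycoded names
depends on whether a normalization step runs before the Punycode step.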

>> Unicode gives *some*
>> c14n (or folding) rules, but not all, and they are not cheap to
>> implement.
>>
>> == fixing ==
>> This should probably be addressed in an update of mnot's URISpace Note
>>   http://www.w3.org/TR/urispace
>>
>> I recommend inserting in 2.1.3.3 Punycode (or maybe IDN) Conversion:
>>
>>   • Internationalized Domain Names (IDNs) are converted from their
>>     punycode form to Unicode code points.
>>   
>
> where does this happen?

Good question and, I think, the crux of the issue. If the user pastes
an IRI into the location bar, the matching code should, I believe,
work with it directly (no '%' unescaping). If it's a proxy or the
server, then I expect the job is as described above. If the IRI is in
an HTML href, then it needs to be un-HTML-escaped but, again, *not*
'%'-unescaped, or the URI will be corrupted.

The example URI includes an email address (eric@w3.org) as a user
identifier. I would expect POWDER descriptions of the resource to
include the '%'s, e.g. "...benutzer=eric%40w3%2Eorg". Likewise, I
would expect a link in HTML to simply &-encode the string:
<a href="http://straßenausblick.example/Düsseldorf?straßenname=Wald+Straße&amp;benutzer=eric%40w3%2Eorg">mein haus</a>
but I haven't researched that at all.
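
A quick Python sketch of the href case, assuming the standard
library's html.unescape and urllib.parse.unquote; it only illustrates
the two layers of escaping and is not a statement of what POWDER
processors do.

```python
import html
from urllib.parse import unquote

href = ("http://straßenausblick.example/Düsseldorf"
        "?straßenname=Wald+Straße&amp;benutzer=eric%40w3%2Eorg")

# Undo only the HTML layer: "&amp;" becomes "&", the %XX escapes survive.
iri = html.unescape(href)

# Percent-unescaping as well yields a different string, which would no
# longer match a description written with the '%'s.
over_decoded = unquote(iri)
```

Here iri still ends in "benutzer=eric%40w3%2Eorg", while over_decoded
ends in "benutzer=eric@w3.org".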

Perhaps the whole of §2.1.3 IRI/URI Canonicalization should focus
on a few example scenarios and say what decoding is necessary.

>                         Note that in IDNA version 2003, roundtripping  
> Unicode > punycode < Unicode is not possible, since during the step  
> Unicode > punycode, non-reversible mapping (e.g. Eszett > ss) are made.  
> But as said above, I think this is out of scope for IRI/URI  
> canonicalization.

appreciatively noted and used for the example.
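
For the record, the lossy round trip is easy to reproduce in Python,
assuming its built-in "idna" codec (IDNA 2003); the variable names are
mine.

```python
# Eszett is mapped to "ss" on the way to ASCII, so it cannot come back.
original = "straßenausblick.example"
round_tripped = original.encode("idna").decode("idna")

# A name without such a mapping does survive the round trip.
other = "www.bravå.nu"
other_round_tripped = other.encode("idna").decode("idna")
```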
-- 
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Tuesday, 12 February 2008 07:15:52 UTC