- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Tue, 12 Feb 2008 02:15:37 -0500
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: public-powderwg@w3.org, public-i18n-core@w3.org
- Message-ID: <20080212071537.GA10529@w3.org>
* Felix Sasaki <fsasaki@w3.org> [2008-02-12 09:47+0900]
> Hi Eric (putting i18n core into the loop),
>
> Eric Prud'hommeaux wrote:
>> http://www.w3.org/2007/powder/Group/powder-grouping/20080128.html#canon
>> does not include IDN example or rules.
>>
>
> there is no need for an IDN example or rule. IRI vs. URI, and IRI>URI
> conversion (percent escaping) are a step, which is independent of
> preprocessing necessary for domain name resolution. See also the
> processing described at
>
> http://www.w3.org/International/articles/idn-and-iri/#idn

As you mention below, this does depend on the use case. Imagine a
proxy content blocker working with an IRI like:

  http://straßenausblick.example/Düsseldorf?straßenname=Wald+Straße&benutzer=eric%40w3%2Eorg

The HTTP proxy sees that in a Host and path tuple:
[[
GET /D%C3%BCsseldorf?stra%C3%9Fenname=Wald+Stra%C3%9Fe&benutzer=eric%40w3%2Eorg HTTP/1.1
Host: strassenausblick.example
]]
(note that "ß" => "ss" (tx for pointing this out, Felix)). Matching
that against labels for the original IRI is non-trivial as you need
to look for labels for both straßenausblick.example and
strassenausblick.example.

>> An example (working) IDN IRI:
>> http://www.bravå.nu/
>> The IDN is punycoded when the IRI is expressed as a URI:
>> http://www.xn--brav-toa.nu/
>>
>> == homonyms ==
>> å can be written either Ue5 or 'a' + U30a (COMBINING RING ABOVE).
>> This results in a different punycoded IDN.
>
> the punycode is only "seen" by the domain name server which uses it for
> domain name resolution. There is no need to use it for *IRI/URI*
> Canonicalization.

I think it's seen at the proxy, but not at the location bar.

>> Unicode gives *some*
>> c14n (or folding) rules, but not all, and they are not cheap to
>> implement.
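
For what it's worth, the proxy case and the homonym issue above can be
sketched in a few lines of Python. This is only an illustration under
my own assumptions, not a proposal for the spec text: the helper names
iri_to_request and nfc_label are made up, and Python's built-in "idna"
codec (which follows IDNA 2003) stands in for whatever a real browser
or proxy does.

```python
import unicodedata
from urllib.parse import quote, urlsplit


def iri_to_request(iri):
    """Hypothetical helper: map an http IRI to (Host value, request target)."""
    parts = urlsplit(iri)
    # IDNA 2003 ToASCII through Python's "idna" codec. Its nameprep step
    # is lossy: "ß" becomes "ss", so the original label cannot be recovered.
    host = parts.hostname.encode("idna").decode("ascii")
    # Percent-encode non-ASCII characters as UTF-8 octets. "%" is listed
    # as safe so existing escapes (%40, %2E) are left alone, not doubled.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&+%")
    target = path + ("?" + query if query else "")
    return host, target


def nfc_label(label):
    """Hypothetical helper: fold a label to NFC before any comparison."""
    return unicodedata.normalize("NFC", label)


iri = ("http://straßenausblick.example/Düsseldorf"
       "?straßenname=Wald+Straße&benutzer=eric%40w3%2Eorg")
host, target = iri_to_request(iri)
print("GET", target, "HTTP/1.1")
print("Host:", host)

# The homonym problem with the raw "punycode" codec (no nameprep applied):
precomposed = "brav\u00e5"   # U+00E5
decomposed = "brava\u030a"   # 'a' + U+030A COMBINING RING ABOVE
print(precomposed.encode("punycode"))            # b'brav-toa'
print(decomposed.encode("punycode"))             # a different byte string
print(nfc_label(decomposed).encode("punycode"))  # b'brav-toa' again
```

A label matcher sitting at the proxy would still need to try both
straßenausblick.example and strassenausblick.example, since the second
form is all it can see in the Host header.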
>>
>> == fixing ==
>> This should probably be addressed in an update of mnot's URISpace Note
>> http://www.w3.org/TR/urispace
>>
>> I recommend inserting in 2.1.3.3 Punycode (or maybe IDN) Conversion:
>>
>> • Internationalized Domain Names (IDNs) are converted from their
>> punycode form to Unicode code points.
>>
>
> where does this happen?

Good question, and I think, the crux of the issue. If the user pastes
an IRI into the location bar, the matching code should, I believe,
work with it directly (no '%' unescaping). If it's a proxy or the
server, then I expect the job is as described above. If the IRI is in
an HTML href, then it needs to be un-HTML-escaped, but again *not* be
'%' unescaped or it will corrupt the URI. The example URI includes an
email address (eric@w3.org) as a user identifier. I would expect
POWDER descriptions of the resource to include the '%'s, e.g.
"...benutzer=eric%40w3%2Eorg". Likewise, I would expect a link in
HTML to simply &-encode the string:
  <a href="http://straßenausblick.example/Düsseldorf?straßenname=Wald+Straße&benutzer=eric%40w3%2Eorg">mein haus</a>
but I haven't researched that at all.

Perhaps the whole of §2.1.3 IRI/URI Canonicalization should focus on
a few example scenarios and say what decoding is necessary.

> Note that in IDNA version 2003, roundtripping
> Unicode > punycode < Unicode is not possible, since during the step
> Unicode > punycode, non-reversible mapping (e.g. Eszett > ss) are made.
> But as said above, I think this is out of scope for IRI/URI
> canonicalization.

appreciatively noted and used for the example.
-- 
-eric

office: +1.617.258.5741 32-G528, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Tuesday, 12 February 2008 07:15:52 UTC