- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Fri, 18 Jan 2008 17:37:11 +0900
- To: olivier Thereaux <ot@w3.org>
- Cc: Tools dev list <public-qa-dev@w3.org>
Hello Olivier,

Just a very quick reply, without having looked much at the sources.

First, I haven't done much Perl lately; I find Ruby much more pleasant and interesting these days. (Perl: An object is simply a referent blessed into a package... Ruby: [no need to explain objects])

Second, URIs by strict definition don't contain non-ASCII characters, so trashing them is definitely one way to handle them, although not the most user-friendly.

If you put an IRI module on top of a URI module, the IRI module would convert non-ASCII characters to percent-escapes. If the URI module works according to the latest spec, it would actually grok percent-escapes in the domain name part. But this is relatively new, so I'm not sure it's implemented. "Grok percent-escapes" would then also mean that it has to do a conversion to punycode for resolution.

The most difficult part of implementing IRIs is having some source of information about the encoding of incoming IRIs. In Perl, this can be done roughly by requiring that the encoding is UTF-8, which should work in most cases.

Implementing IRIs on top of a URI module does indeed mean that some of the parsing has to be implemented twice, or that you have to convert back from percent-encoding to characters to be able to expose IRI components such as host, path, ... in the IRI module.

At 15:27 08/01/18, olivier Thereaux wrote:
>Hi Martin,
>
>I was doing some tests with IRIs in perl and your name kept cropping
>up in documentation, so I was wondering if you could answer some of my
>doubts. Do you know what the state of adoption of IRIs (and in
>particular IDNs) is in perl?
>
>I have seen some IDN-related modules (e.g. [1]) being released, but it
>seems that the top obstacle to nicely handling IRIs in perl is that
>the URI module [2] is not IRI-friendly. As my little test script
>(attached, but not worth much) showed, the URI constructor ignores and
>trashes all non-ascii characters in the host [3].
>
>[1] Net::IDN::Encode
>[2] http://search.cpan.org/~gaas/URI-1.35/URI.pm
>[3] http://search.cpan.org/~gaas/URI-1.35/URI.pm#CONSTRUCTORS
>
>I was hoping I'd be able to 1) construct the URI object and THEN 2)
>nameprep and encode to punycode the hostname with something like:
>
>  $uri->host( domain_to_ascii($uri->host) );

This might work, but it's a hack. You still want to expose the original (non-ASCII) host from the IRI module. In general, you need two versions of each component: a (potentially) non-ASCII one and a percent-escaped one. And for the host, there are three:

- (potentially) non-ASCII
- percent-escaped
- punycode

If you don't do it that way, you may lose information and create surprises for some users.

Regards, Martin.

>but that won't work because by that time all the non-ascii characters
>in the hostname have already been trashed by URI::Escape. The other
>solution would be to first encode into punycode, then construct the
>URI object, but that means reinventing the wheel and parsing the URI
>by hand (to get the host part) first.
>
>So, that's not satisfying. What is surprising me is that apparently
>there is nothing in the tracker for this module mentioning IDNs and
>punycode. Maybe no one has yet suggested to the module maintainers that
>instead of trashing all non-ascii chars, they should be attempting a
>punycode conversion?
>
>Do you recall any such discussion? Have you already experimented in
>this area?
>
>Thanks.
>--
>olivier

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
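
Illustrative note: the three host forms and the IRI-to-URI conversion described in the reply can be sketched roughly as follows. This is not the Perl code under discussion; it is a minimal Python sketch, chosen only because Python's standard library includes an "idna" codec (IDNA 2003 with nameprep). The function names host_forms and iri_to_uri are hypothetical, the input is assumed to be an already-decoded Unicode string (so the encoding question raised above is sidestepped), and userinfo is ignored for brevity.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def host_forms(host):
    """Return the three representations of a host named in the reply.

    `host` is assumed to be a decoded Unicode string.
    """
    return {
        # (potentially) non-ASCII form, kept so no information is lost
        "unicode": host,
        # UTF-8 bytes of each non-ASCII character, percent-escaped
        "percent_escaped": quote(host, safe=""),
        # ASCII-compatible form used for DNS resolution (IDNA 2003 codec)
        "punycode": host.encode("idna").decode("ascii"),
    }


def iri_to_uri(iri):
    """Convert an IRI to a URI: punycode host, percent-escaped rest.

    Simplification (assumption): userinfo is dropped, and existing
    percent-escapes are left untouched by keeping '%' in the safe set.
    """
    parts = urlsplit(iri)
    host = parts.hostname or ""
    ascii_host = host.encode("idna").decode("ascii") if host else ""
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="/?=&%")
    fragment = quote(parts.fragment, safe="/?=&%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(host_forms("bücher.example")["punycode"])  # xn--bcher-kva.example
    print(iri_to_uri("http://bücher.example/straße"))
```

Keeping all three forms side by side, rather than overwriting the host in place, is what avoids the information loss mentioned above: the original Unicode host can still be shown to users while the punycode form is used for resolution.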
Received on Friday, 18 January 2008 08:38:00 UTC