IRIs in perl? URI module doesn't seem to grok IDNs/Punycode

From: olivier Thereaux <ot@w3.org>
Date: Fri, 18 Jan 2008 15:27:22 +0900
Message-Id: <61C86210-BAEE-4110-87A5-90C5995E4FBB@w3.org>
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: Tools dev list <public-qa-dev@w3.org>

Hi Martin,

I was doing some tests with IRIs in Perl and your name kept cropping
up in the documentation, so I was wondering if you could answer a few
questions. Do you know what the state of adoption of IRIs (and IDNs in
particular) is in Perl?

I have seen some IDN-related modules (e.g. [1]) being released, but it
seems that the main obstacle to handling IRIs nicely in Perl is that
the URI module [2] is not IRI-friendly. As my little test script
(attached, but not worth much) shows, the URI constructor does not
preserve non-ASCII characters in the host: it trashes them [3].

[1] Net::IDN::Encode
[2] http://search.cpan.org/~gaas/URI-1.35/URI.pm
[3] http://search.cpan.org/~gaas/URI-1.35/URI.pm#CONSTRUCTORS

I was hoping I'd be able to 1) construct the URI object and THEN 2)
run nameprep on the hostname and encode it to Punycode with something
like:

     $uri->host( domain_to_ascii($uri->host) );

but that won't work, because by that time all the non-ASCII characters
in the hostname have already been trashed by URI::Escape. The other
solution would be to encode the host into Punycode first and then
construct the URI object, but that means reinventing the wheel and
parsing the URI by hand (to get the host part) first.
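
A minimal sketch of that second approach, assuming domain_to_ascii
from Net::IDN::Encode [1] is available. The helper name iri_to_uri is
hypothetical, and the regex split is a simplification of the generic
syntax rather than a full IRI parser; it only touches the host and
leaves the path and query to the URI constructor:

```perl
use strict;
use warnings;
use utf8;    # this file contains a literal non-ASCII hostname
use URI;
use Net::IDN::Encode qw(domain_to_ascii);

# Hypothetical helper (not part of URI or Net::IDN::Encode):
# convert the host of an IRI to its ASCII form, then build the URI object.
sub iri_to_uri {
    my ($iri) = @_;

    # Split off scheme and authority by hand; everything else is kept as-is.
    my ($scheme, $authority, $rest) =
        $iri =~ m{^([^:/?#]+)://([^/?#]*)(.*)$}s
        or return URI->new($iri);    # no authority part: nothing to convert

    # authority = [ userinfo "@" ] host [ ":" port ]
    my ($userinfo, $host, $port) =
        $authority =~ m{^(?:(.*)\@)?(.*?)(?::(\d*))?$}s;

    # Only hosts with non-ASCII characters need the IDNA conversion.
    $host = domain_to_ascii($host) if $host =~ /[^\x00-\x7F]/;

    my $ascii = "$scheme://"
              . (defined $userinfo ? "$userinfo\@" : '')
              . $host
              . (defined $port ? ":$port" : '')
              . $rest;
    return URI->new($ascii);
}

my $uri = iri_to_uri('http://bücher.example/index.html');
print $uri->host, "\n";
```

With a host such as bücher.example this should print the Punycode form
(xn--bcher-kva.example) instead of a mangled hostname, but it is
exactly the hand-parsing I would rather not have to maintain.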

So, that's not satisfying. What surprises me is that apparently
nothing in the tracker for this module mentions IDNs or Punycode.
Maybe no one has yet suggested to the module maintainers that,
instead of trashing all non-ASCII characters, they should attempt a
Punycode conversion?

Do you recall any such discussion? Have you already experimented in  
this area?

Received on Friday, 18 January 2008 06:27:41 UTC