Re: IRIs in perl? URI module doesn't seem to grok IDNs/Punycode

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Fri, 18 Jan 2008 17:37:11 +0900
Message-Id: <6.0.0.20.2.20080118172200.09a82270@localhost>
To: olivier Thereaux <ot@w3.org>
Cc: Tools dev list <public-qa-dev@w3.org>

Hello Olivier,

Just a very quick reply, without having much of a
look at sources.

First, I haven't done much Perl lately; I find Ruby much
more pleasant and interesting these days.
(Perl: An object is simply a referent blessed into a package...
 Ruby: [no need to explain objects])

Second, URIs by strict definition don't contain non-ASCII
characters, so trashing them is definitely one way to handle
them, although not the most user-friendly. If you put an IRI
module on top of a URI module, the IRI module would convert
non-ASCII characters to percent-escapes. If the URI module
works according to the latest spec, it would actually grok
percent-escapes in the domain name part. But this is relatively
new, so I'm not sure it's implemented. "grok percent-escapes"
would then also mean that it has to do a conversion to punycode
for resolution.
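
The two steps above can be sketched roughly as follows. This is in Python
rather than Perl only so that it runs on the standard library alone; the
function names percent_escape and host_for_resolution are made up for this
illustration, and Python's built-in "idna" codec implements the older
IDNA 2003 rules, which is an assumption about what is good enough here.

```python
from urllib.parse import quote, unquote, urlsplit

# Every printable ASCII character is treated as safe, so that only
# non-ASCII characters get percent-escaped (as UTF-8 bytes).
ASCII_SAFE = "".join(chr(c) for c in range(0x21, 0x7F))


def percent_escape(iri: str) -> str:
    """IRI layer: turn non-ASCII characters into UTF-8 percent-escapes."""
    return quote(iri, safe=ASCII_SAFE)


def host_for_resolution(uri: str) -> str:
    """URI layer: read percent-escapes in the host and convert to Punycode."""
    host = urlsplit(uri).hostname or ""
    # Undo the escapes first, then let the codec do nameprep + Punycode.
    return unquote(host).encode("idna").decode("ascii")


uri = percent_escape("http://bücher.example/straße")
print(uri)                       # http://b%C3%BCcher.example/stra%C3%9Fe
print(host_for_resolution(uri))  # xn--bcher-kva.example
```

The host is only converted to Punycode at the point where it is needed for
resolution; the percent-escaped URI itself stays unchanged.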

The most difficult part of implementing IRIs is to have some
source of information about the encoding of incoming IRIs.
In Perl, this can be done roughly by requiring that the encoding
is UTF-8, which should work in most cases. Implementing IRIs
on top of a URI module means indeed that some of the parsing
has to be implemented twice, or that you have to re-convert
from percent-encoding to characters to be able to expose
IRI components such as host, path,... in the IRI module.
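
As a rough illustration of both points (again in Python, with hypothetical
names decode_incoming and iri_components): incoming bytes are required to be
UTF-8, and the components of an already percent-escaped URI are converted
back to characters before being exposed. Note that unquote() also decodes
ASCII escapes such as %2F, which a careful implementation would probably
want to leave alone.

```python
from urllib.parse import unquote, urlsplit


def decode_incoming(raw: bytes) -> str:
    """Require UTF-8 input; raise instead of guessing another encoding."""
    return raw.decode("utf-8", errors="strict")


def iri_components(uri: str) -> dict:
    """Parse a percent-escaped URI and expose host and path as characters."""
    parts = urlsplit(uri)
    return {
        "host": unquote(parts.hostname or "", errors="strict"),
        "path": unquote(parts.path, errors="strict"),
    }


print(iri_components("http://b%C3%BCcher.example/stra%C3%9Fe"))
```

Input in a legacy encoding such as Latin-1 fails in decode_incoming with a
UnicodeDecodeError instead of being silently misread.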

At 15:27 08/01/18, olivier Thereaux wrote:
>Hi Martin,
>
>I was doing some tests with IRIs in perl and your name kept cropping  
>up in documentation, so I was wondering if you could answer some of my  
>doubts. Do you know what the state of adoption of IRIs (and in
>particular IDNs) is in Perl?
>
>I have seen some IDN-related modules (e.g. [1]) being released, but it
>seems that the top obstacle to nicely handling IRIs in Perl is that
>the URI module [2] is not IRI-friendly. As my little test script
>(attached, but not worth much) showed, the URI constructor ignores and
>trashes all non-ASCII characters in the host [3].
>
>[1] Net::IDN::Encode
>[2] http://search.cpan.org/~gaas/URI-1.35/URI.pm
>[3] http://search.cpan.org/~gaas/URI-1.35/URI.pm#CONSTRUCTORS
>
>I was hoping I'd be able to 1) construct the URI object and THEN 2)
>nameprep and encode to punycode the hostname with something like:
>
>     $uri->host( domain_to_ascii($uri->host) );

This might work, but it's a hack. You still want to expose the
original (non-ASCII) host from the IRI module. In general, you
need two versions for each component, a (potentially) non-ASCII
one and a percent-escaped one. And for the host, there are three:
- (potentially) non-ASCII
- percent-escaped
- punycode
If you don't do it that way, you may lose information and
create surprises for some users.
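
A minimal sketch of keeping all three host forms side by side, in Python and
with a hypothetical HostForms record (the field names are invented for this
example, and the "idna" codec follows IDNA 2003):

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class HostForms:
    original: str         # (potentially) non-ASCII, as the user wrote it
    percent_escaped: str  # UTF-8 bytes as %XX escapes
    punycode: str         # ASCII form used for DNS resolution


def host_forms(host: str) -> HostForms:
    """Derive the escaped and Punycode forms without discarding the original."""
    return HostForms(
        original=host,
        percent_escaped=quote(host, safe=""),
        punycode=host.encode("idna").decode("ascii"),
    )


forms = host_forms("bücher.example")
print(forms.original)         # bücher.example
print(forms.percent_escaped)  # b%C3%BCcher.example
print(forms.punycode)         # xn--bcher-kva.example
```

Because the original is stored rather than overwritten, nothing is lost when
the ASCII forms are computed.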

Regards,   Martin.

>but that won't work because by that time all the non-ASCII characters
>in the hostname have already been trashed by URI::Escape. The other
>solution would be to first encode into punycode, then construct the  
>URI object, but that means reinventing the wheel and parsing the URI  
>by hand (to get the host part) first.
>
>So, that's not satisfying. What is surprising me is that apparently  
>there is nothing in the tracker for this module mentioning IDNs and  
>punycode. Maybe no one has yet suggested to the module maintainers that
>instead of trashing all non-ASCII chars, they should be attempting a
>punycode conversion?
>
>Do you recall any such discussion? Have you already experimented in  
>this area?
>
>Thanks.
>-- 
>olivier 


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     
Received on Friday, 18 January 2008 08:38:00 GMT