Re: [URN] URI internationalization

[Cross-posted to URI list, from URN-IETF list]

At 09:05 15-11-96 -0700, Ron Daniel wrote:
>I think I18N for URLs is a more difficult problem than it has been for
>URNs. We have a large number of existing URLs in a variety of character
>sets.

Well, no, it appears that we don't really have that.  Last spring I searched
for non-ASCII URLs (both raw 8-bit octets and %XY escapes with X>=8) and found
very few out on the Web (cf.
<http://www.alis.com:8085/~yergeau/conf/www5/robot.en.html>).  Fewer than
0.25% in fact, and some of those were typos (a division sign instead of a
tilde, for instance) that didn't work until corrected by hand.

Furthermore, compatibility is made easier by the fact that UTF-8 data can be
quite reliably recognized as such.  Given a UR*, a server can test it for
UTF-8 validity.  If the test fails, it is an 'old' UR* in some encoding other
than UTF-8; the server can process it as it did before and nothing is broken.
If it passes, the server just processes it as UTF-8.  A little experimentation
(more is needed) shows that false positives are unlikely, provided one takes
care of 7-bit ISO-2022-like encodings that look like ASCII (and thus like
UTF-8) but are not.  As for complexity, a UTF-8 validator fits in about 20
lines of C.
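
To make the test above concrete, here is a rough sketch in C, not a reference
implementation.  The function names (is_valid_utf8, has_iso2022_shift,
treat_as_utf8) are made up for illustration.  It assumes the server has
already turned any %XY escapes into raw octets, and it follows the present-day
definition of UTF-8 (at most four octets per character, with overlong forms
and surrogates rejected), which is stricter than a bare bit-pattern check and
so runs somewhat longer than 20 lines.  The ISO-2022 guard is only a
heuristic: it looks for the ESC, SO and SI control octets.

```c
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation octets expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */

        if (c < 0x80) { i++; continue; }                    /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation octet or invalid lead */

        if (len - i <= need) return 0;                      /* truncated */
        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;              /* not 10xxxxxx */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return 0;                             /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;         /* surrogate */
        if (cp > 0x10FFFF) return 0;                        /* out of range */
        i += need + 1;
    }
    return 1;
}

/* Heuristic (an assumption of this sketch): 7-bit ISO-2022-style data
 * switches character sets with ESC (0x1B), SO (0x0E) or SI (0x0F). */
int has_iso2022_shift(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0x1B || buf[i] == 0x0E || buf[i] == 0x0F)
            return 1;
    return 0;
}

/* 1: process the octets as UTF-8.  0: fall back to the legacy handling. */
int treat_as_utf8(const unsigned char *buf, size_t len)
{
    return !has_iso2022_shift(buf, len) && is_valid_utf8(buf, len);
}
```

A server would call treat_as_utf8() on the unescaped path and keep its
existing code path whenever the result is 0, so old UR*s behave as before.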

>While I18N for URLs is a legitimate issue, it is not an issue for the
>URN-WG (IMHO). The URI list is still alive, that might be the proper
>place to begin discussions.

Agreed; I have cross-posted there.  Please limit replies to the URI list.

Regards,

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tel: +1 (514) 747-2547
Fax : +1 (514) 747-2561

Received on Friday, 15 November 1996 16:17:02 UTC