- From: Francois Yergeau <yergeau@alis.com>
- Date: Fri, 15 Nov 1996 16:11:48 -0500
- To: Ron Daniel <rdaniel@acl.lanl.gov>
- Cc: urn-ietf@bunyip.com, uri@bunyip.com
[Cross-posted to URI list, from URN-IETF list]

At 09:05 15-11-96 -0700, Ron Daniel wrote:
>I think I18N for URLs is a more difficult problem than it has been for
>URNs. We have a large number of existing URLs in a variety of character
>sets.

Well, no, it appears we don't really have that. I made a search for non-ASCII URLs last spring (both 8-bit octets and %XY with X>=8), and found very few out on the Web (cf. <http://www.alis.com:8085/~yergeau/conf/www5/robot.en.html>). Fewer than 0.25% in fact, and some of those were typos (divide signs instead of tildes, for instance) that didn't work until corrected by hand.

Furthermore, compatibility is made easier by the fact that UTF-8 data can be quite reliably recognized as such. Given a UR*, a server can test it for UTF-8 validity. If the test fails, it's an 'old' UR* in some encoding other than UTF-8; the server can process it as it did before and nothing is broken. If it passes, the server just processes it as UTF-8. A little experimentation (more is needed) shows that false positives are unlikely, provided one takes care of 7-bit ISO-2022-like encodings that look like ASCII (and thus like UTF-8) but are not. As for complexity, a UTF-8 validator fits in about 20 lines of C.

>While I18N for URLs is a legitimate issue, it is not an issue for the
>URN-WG (IMHO). The URI list is still alive, that might be the proper
>place to begin discussions.

Agreed, I cross-posted there. Please limit replies to the URI list.

Regards,

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561
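
The validity test described in the message (try the UR* as UTF-8, fall back to the old handling if it fails) might be sketched in C roughly as follows. This is only an illustration, not the validator the message refers to. The function name `looks_like_utf8` is hypothetical. The sketch assumes the input is the raw octets after any %XY unescaping, and it follows the later, stricter definition of UTF-8 (at most 4 octets per character, no overlong forms, no surrogates), which is narrower than the definition current in 1996. Treating any ESC octet as a sign of an ISO-2022-style encoding is likewise an assumption, one simple way to handle the 7-bit case mentioned above.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8
 * and contains no ESC octet, 0 otherwise. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        int n;                      /* continuation octets expected */

        /* Assumption: ESC suggests a 7-bit ISO-2022-style encoding. */
        if (c == 0x1B)
            return 0;

        if (c < 0x80) {             /* plain ASCII octet */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;               /* stray continuation or invalid lead */
        }

        if (len - i <= (size_t)n)   /* sequence truncated at end of input */
            return 0;
        i++;

        while (n-- > 0) {
            if ((buf[i] & 0xC0) != 0x80)
                return 0;           /* not a continuation octet */
            cp = (cp << 6) | (buf[i] & 0x3F);
            i++;
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

A server following the scheme above would call this on the unescaped path and process the UR* as UTF-8 only when it returns 1, otherwise handling it as it did before. Note that a pure-ASCII UR* also passes, which is harmless since ASCII is a subset of UTF-8.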
Received on Friday, 15 November 1996 16:17:02 UTC