Re: [URN] URI internationalization

Francois Yergeau (yergeau@alis.com)
Fri, 15 Nov 1996 16:11:48 -0500


Message-Id: <2.2.32.19961115211148.006eb9e0@genstar.alis.ca>
Date: Fri, 15 Nov 1996 16:11:48 -0500
To: Ron Daniel <rdaniel@acl.lanl.gov>
From: Francois Yergeau <yergeau@alis.com>
Subject: Re: [URN] URI internationalization
Cc: urn-ietf@bunyip.com, uri@bunyip.com

[Cross-posted to URI list, from URN-IETF list]

At 09:05 15-11-96 -0700, Ron Daniel wrote:
>I think I18N for URLs is a more difficult problem than it has been for
>URNs. We have a large number of existing URLs in a variety of character
>sets.

Well, no, it appears we don't really have that.  I made a search for
non-ASCII URLs last spring (both 8-bit octets and %XY with X>=8), and found
very few out on the Web (cf.
<http://www.alis.com:8085/~yergeau/conf/www5/robot.en.html>).  Less than
0.25% in fact, and then some were typos (divide signs instead of tilde, for
instance) that didn't work until corrected by hand.
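
To make the search criterion concrete, here is a minimal sketch in C of the
test described above: a URL counts as non-ASCII if it contains a raw 8-bit
octet or a %XY escape whose first hex digit X is 8 or higher.  This is not
the robot's actual code; the function name url_has_non_ascii is hypothetical
and chosen only for illustration.

```c
#include <ctype.h>

/* Returns 1 if url carries any octet >= 0x80, written either as a raw
   8-bit octet or as a %XY escape whose first hex digit X is 8..F.
   Returns 0 for a URL that is entirely 7-bit ASCII.
   (Hypothetical helper, not taken from the original survey.) */
int url_has_non_ascii(const char *url)
{
    const unsigned char *p = (const unsigned char *)url;

    for (; *p != '\0'; p++) {
        if (*p >= 0x80)
            return 1;                       /* raw 8-bit octet */
        /* The && chain stops at the terminating NUL, so p[2] is only
           read when p[1] is a hex digit. */
        if (*p == '%' && isxdigit(p[1]) && isxdigit(p[2])) {
            int x = tolower(p[1]);
            if (x == '8' || x == '9' || (x >= 'a' && x <= 'f'))
                return 1;                   /* escape %8Y .. %FY */
        }
    }
    return 0;
}
```

An escaped tilde such as %7E is left alone, since its first digit is below 8,
while something like caf%E9 is flagged.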

Furthermore, compatibility is made easier by the fact that UTF-8 data can be
quite reliably recognized as such.  Given a UR*, a server can test it for
UTF-8 validity.  If the test fails, it's some 'old' UR* in some encoding
other than UTF-8; the server can process it as it did before and nothing is
broken.  If the test passes, the server just processes it as UTF-8.  A little
experimentation (more is needed) shows that false positives are unlikely,
provided one takes care of 7-bit ISO-2022-like encodings that look like
ASCII (and thus like UTF-8) but are not.  As for complexity, a UTF-8
validator fits in about 20 lines of C.
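
As a rough illustration of the size claim, a validator along these lines
could look like the sketch below.  It is not the code referred to above, and
the name is_valid_utf8 is hypothetical.  It assumes the stricter definition
of UTF-8 later fixed in RFC 3629 (at most four octets per character, no
overlong forms, no surrogates), which postdates this message, and it takes
already-unescaped octets as input.  It does nothing about the 7-bit
ISO-2022 caveat, which would need a separate check.

```c
#include <stddef.h>

/* Returns 1 if the n octets at s form well-formed UTF-8, 0 otherwise.
   Pure ASCII input also returns 1, since ASCII is a subset of UTF-8.
   (Hypothetical helper; assumes the RFC 3629 rules.) */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i++];
        unsigned long cp, min;   /* decoded value, smallest legal value */
        int more;                /* continuation octets still expected  */

        if (c < 0x80) continue;                          /* ASCII */
        else if ((c & 0xE0) == 0xC0) { more = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { more = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { more = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;        /* stray continuation or invalid lead octet */

        while (more-- > 0) {
            if (i >= n || (s[i] & 0xC0) != 0x80)
                return 0;     /* truncated or malformed sequence */
            cp = (cp << 6) | (s[i++] & 0x3F);
        }
        if (cp < min) return 0;                          /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* surrogate */
        if (cp > 0x10FFFF) return 0;                     /* out of range */
    }
    return 1;
}
```

A server following the scheme above would call this on the unescaped octets
of a UR*: a result of 0 means "treat as a legacy encoding, as before", and a
result of 1 means "treat as UTF-8".  For example, the Latin-1 octet E9
followed by an ASCII letter fails the test, while the UTF-8 pair C3 A9
passes.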

>While I18N for URLs is a legitimate issue, it is not an issue for the
>URN-WG (IMHO). The URI list is still alive, that might be the proper
>place to begin discussions.

Agreed, I cross-posted there.  Please limit replies to the URI list.

Regards,

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tél : +1 (514) 747-2547
Fax : +1 (514) 747-2561