Date: Tue, 15 Apr 1997 13:33:35 -0700 (PDT)
From: Chris Newman <Chris.Newman@innosoft.com>
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <SIMEON.9704151143.E@tp7.Jck.com>
To: John C Klensin <email@example.com>
Cc: IETF URI list <firstname.lastname@example.org>
Message-Id: <Pine.SOL.3.95.970415130735.22015Kemail@example.com>

On Tue, 15 Apr 1997, John C Klensin wrote:
> It would have been better had URLs been carefully and
> thoughtfully internationalized from the very beginning.
> For whatever reasons, they weren't. A conversion now is
> going to be painful. But, if the pain is worth it, and I
> suspect it might be, then let's look to a balanced,
> equitable, *international* solution, not using UTF-8
> encoding in the hope that no one who uses ideographic
> characters will be bothered about what happens to them.

UTF-8 requires 2 octets to encode characters from the 8859-1 set which
normally take 1 octet. UTF-8 requires 3 octets to encode ideographic
characters from UCS-2 which normally require 2 octets. So western
Europeans take a worse storage hit from UTF-8 than ideographic languages
do.

I'd be willing to consider an alternative proposal to hex-encoded UTF-8 in
URLs, but I can't think of one that's viable in practice other than MIME
encoded words (which are too disgusting to consider).

I will say that it took me about 10 minutes to write a hex-encoded UTF-8
to UCS-2 converter which looked up the character descriptions in the
publicly available Unicode tables.
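
For illustration only, here is a minimal Python sketch of the kind of converter described above, together with the octet arithmetic from the first paragraph. It is not the original converter, which was not posted. The function names `describe_hex_utf8` and `octet_cost` are hypothetical, and Python's standard `unicodedata` module is assumed as a stand-in for the published Unicode tables.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def describe_hex_utf8(hex_encoded):
    """Decode %XX-escaped UTF-8 into (UCS-2 value, character name) pairs.

    Hypothetical helper: the names come from Python's unicodedata module,
    assumed here as a substitute for the published Unicode tables.
    """
    text = unquote_to_bytes(hex_encoded).decode("utf-8")
    pairs = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # UCS-2 is a fixed 16-bit form and cannot hold this character.
            raise ValueError("U+%X does not fit in UCS-2" % code_point)
        pairs.append((code_point, unicodedata.name(ch, "<unnamed>")))
    return pairs


def octet_cost(ch, legacy_encoding):
    """Return (octets in the legacy encoding, octets in UTF-8) for one character."""
    return len(ch.encode(legacy_encoding)), len(ch.encode("utf-8"))


if __name__ == "__main__":
    # A Latin-1 letter and a CJK ideograph, each hex-encoded as UTF-8.
    print(describe_hex_utf8("%C3%A9"))     # U+00E9
    print(describe_hex_utf8("%E4%B8%AD"))  # U+4E2D

    # 8859-1: 1 octet becomes 2 in UTF-8 (a factor of 2).
    print(octet_cost("\u00e9", "latin-1"))
    # UCS-2 (big-endian, no byte-order mark): 2 octets become 3 (a factor of 1.5).
    print(octet_cost("\u4e2d", "utf-16-be"))
```

The two `octet_cost` calls show the relative overhead argued in the message: a factor of 2 for a character from the upper half of 8859-1, against a factor of 1.5 for an ideograph that UCS-2 stores in 2 octets.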