- From: John C Klensin <klensin@mci.net>
- Date: Fri, 25 Apr 1997 04:40:55 -0400 (EDT)
- To: Edward Cherlin <cherlin@newbie.net>
- Cc: uri@bunyip.com
On Thu, 24 Apr 1997 23:25:10 -0700 Edward Cherlin <cherlin@newbie.net> wrote:

>...
> Those with less popular character sets are out in the cold today. Unicode
> will bring them in from the cold, since it is a general solution that has
> fairly wide implementation (in Windows NT, Macintosh OS, several flavors of
> UNIX, Java, and so on, and in applications such as Microsoft Office 97 and
> Alis Tango Web browser).
>...
> There is no hope of getting every legacy character encoding incorporated
> into modern software by any means other than Unicode.

Edward,

This is not true, and these discussions seem to be difficult enough without engaging in hyperbole. In particular:

(i) However widely Unicode is implemented, actual patterns of use, especially outside areas that use Latin-based alphabets, are still much heavier for systems based on character set (or code page) switching (mostly, but not entirely, utilizing ISO 2022 designators) than they are for Unicode. (The switching mechanism itself is sketched after this message.)

(ii) In many cases (including some applications that end up on the systems you have mentioned), Unicode (or something like it) is used as an internal representation, but what goes over the wire is a character set switching (or shifting) system. There is some small (but not zero) risk that Unicode, like X.400, will end up being more of a common conversion and representation format than a format that end-user applications actually use natively.

(iii) Even if "Unicode" is the right solution, it does not automatically follow that the best representation is UTF-8. A case can be made that, if one is going to have to resort to hex encoding anyway, simply hex-encoding the UCS-2 string will give better behavior more of the time (where "more" is measured by weighting the use of different ranges of the Unicode coding space by the number of people in the world who use those characters). (A size comparison is sketched after this message.)

(iv) It is not hard to demonstrate that, in the medium to long term, there are some character set encoding requirements for which Unicode will not suffice, and it will be necessary to go to multi-plane 10646 (which is one of several reasons why IETF recommendation documents have fairly consistently pointed to 10646 and not Unicode); the two are not the same. In particular, while the comment in (iii) can easily and correctly be rewritten as a UCS-4 statement, UTF-8 becomes, IMO, pathological (and its own excuse for compression) when one starts dealing with plane 3 or 4, much less, should we be unlucky enough to get there, plane 200 or so. (The octet arithmetic is sketched after this message.)

    john

p.s. I haven't changed my mind -- I still don't like 2022 as a "character set" or as a data representation, largely because I don't like stateful character encodings. But I think we need to make decisions based on reality rather than wishful thinking, evangelism, or pure optimism.
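The stateful switching in (i) and the postscript is easiest to see in raw bytes: an ISO-2022-JP stream announces each character set with an escape sequence before using it. A minimal sketch, assuming a present-day Python 3 interpreter and its standard iso2022_jp codec (both are, of course, anachronisms relative to this message, used purely for illustration):

    # Encode mixed ASCII/Japanese text with ISO 2022 designator switching,
    # then with stateless UTF-8, and compare the raw bytes.
    text = "Tokyo: \u6771\u4eac"        # "Tokyo: " followed by two kanji

    stateful = text.encode("iso2022_jp")
    stateless = text.encode("utf-8")

    # The ISO 2022 form contains ESC $ B (designate JIS X 0208) before the
    # kanji and ESC ( B (designate ASCII) after them; the meaning of every
    # byte depends on which designation is currently in effect.
    print(stateful)
    print(stateless)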
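The size claim in (iii) can be made concrete: under RFC 1738-style %-escaping, each UTF-8 octet of a non-ASCII character costs three characters, while simple hex-encoding of a UCS-2 code unit always costs four. A minimal sketch in the same assumed Python 3, with arbitrarily chosen sample characters:

    from urllib.parse import quote

    # One character each from ASCII, Latin-1, CJK, and Devanagari.
    samples = ["a", "\u00e9", "\u4e2d", "\u0915"]

    for ch in samples:
        utf8_escaped = quote(ch, safe="")   # %-escaped UTF-8, e.g. %E4%B8%AD
        ucs2_hex = "%04X" % ord(ch)         # raw hex of the UCS-2 code unit
        print(f"U+{ord(ch):04X}: escaped UTF-8 {utf8_escaped!r:>12} "
              f"({len(utf8_escaped)} chars) vs UCS-2 hex {ucs2_hex} (4 chars)")

Everything at or above U+0800, which covers virtually all CJK and Indic text, needs nine escaped characters per character under UTF-8 against a flat four hex digits under UCS-2; that is the population-weighted case the argument turns on.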
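The plane arithmetic behind (iv) follows from the UTF-8 length rules of the era (RFC 2044 defined UTF-8 over all of UCS-4, using up to six octets). A minimal sketch; utf8_octets below is a hand-written restatement of those length boundaries, not a library call:

    def utf8_octets(cp: int) -> int:
        # Octet counts per the original UTF-8 definition (RFC 2044, 1996),
        # which covered code positions up to U+7FFFFFFF.
        if cp < 0x80:      return 1
        if cp < 0x800:     return 2
        if cp < 0x10000:   return 3   # the whole UCS-2 / Basic Multilingual Plane
        if cp < 0x200000:  return 4
        if cp < 0x4000000: return 5
        return 6

    for plane in (0, 1, 3, 4, 200):
        cp = plane * 0x10000 + 0x1234     # an arbitrary position in that plane
        print(f"plane {plane:3}: UTF-8 = {utf8_octets(cp)} octets, raw UCS-4 = 4 octets")

At planes 3 and 4, UTF-8 already needs the same four octets as fixed-width UCS-4 while keeping its variable-length complexity; by plane 200 it is strictly longer, which is the sense in which it becomes "its own excuse for compression."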
Received on Friday, 25 April 1997 04:41:02 UTC