Re: internationalization/ISO10646 question - UTF-16

begin quotation by Markus Scherer on 2002/12/19 14:03 -0800:
> Chris Newman wrote:
>> UTF-16 is a terrible encoding for interoperability.  There are 3
>
> Not true, especially if it's declared properly. It is interoperable, and
> it is at least as compact as, or more compact than, UTF-8 for all
> non-Latin texts.

If the people who created UTF-16 hadn't messed around with the BOM crap and 
had instead mandated network byte order in files, interfaces and on the 
network, then it would interoperate well.  But in today's world, UTF-16 
will interoperate just as well as TIFF does, since it made the same mistake 
(actually worse than TIFF, since the BOM is optional).  I've seen programs 
which offer to save TIFF files in "Mac format" (big-endian) or "PC format" 
(little-endian) -- just to show you how well that game works.  Meanwhile 
JFIF/JPEG, PNG, and GIF interoperate well because they mandated a single 
byte order.
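
To make the endianness point concrete, here's a quick Python sketch (the 
sample string is arbitrary) showing that the same text has three legitimate 
UTF-16 byte sequences, and that a receiver which guesses the byte order 
wrong gets silent garbage rather than an error:

    text = "Ab"

    print(text.encode("utf-16-be").hex())  # 00410062 -- big-endian, no BOM
    print(text.encode("utf-16-le").hex())  # 41006200 -- little-endian, no BOM
    print(text.encode("utf-16").hex())     # BOM first, byte order chosen by the sender

    # Without a BOM or an out-of-band declaration the receiver must guess,
    # and guessing wrong is not detected -- it just yields different characters:
    le_bytes = text.encode("utf-16-le")
    print(le_bytes.decode("utf-16-be"))    # U+4100 U+6200 -- CJK ideographs, not "Ab"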

>> published non-interoperable variants of UTF-16 (big-endian,
>> little-endian, BOM/switch-endian) and only one of the variants can be
>
> Yes, but the variants are minor - endianness and BOM.

But more than sufficient to cause user-visible interoperability problems. 
See past experience with TIFF.

>> auto-detected with any chance of success (and none of them can be
>> auto-detected as well as UTF-8).  It's not a fixed-width encoding, so
>> you don't get the fixed-width benefits that UCS-4 would provide (unless
>
> Well, few encodings are fixed-width, and some popular encodings are a lot
> more complicated. Fixed-width encodings are useful for processing, but
> this is not an issue for transport.

Exactly true.  For transport, interoperability trumps all other 
requirements.  I brought this up because there is a common misconception 
that UTF-16 is fixed-width.  Well, it's mostly fixed-width -- meaning you 
get none of the advantages of a fixed-width encoding, and because the 
variable-width case (surrogate pairs) is uncommon, it adds a new set of 
interoperability problems for exactly those additional characters.  It 
violates the "avoid uncommon cases" design rule.  Because UTF-8 is 
variable-width in the common case, its multi-byte handling gets exercised 
constantly, so it's much more likely to interoperate over the entire 
Unicode repertoire than UTF-16.
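
Here's a small Python sketch of the "mostly fixed-width" problem; the 
character U+1D11E (MUSICAL SYMBOL G CLEF) is picked only as an example of 
something outside the BMP:

    s = "a\U0001D11E"                 # 2 characters

    utf16 = s.encode("utf-16-be")
    print(len(utf16) // 2)            # 3 code units -- the clef became a surrogate
                                      # pair, so code assuming one code unit per
                                      # character miscounts exactly here

    utf8 = s.encode("utf-8")
    print(len(utf8))                  # 5 bytes -- variable width, but the multi-byte
                                      # path is exercised by nearly all non-ASCII text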

>> So this raises the question: why would any sensible protocol designer
>> ever what to transport UTF-16 over the wire?  There may be a few rare
>> corner cases where it makes sense, but in general UTF-8 is superior in
>> almost all instances.  I suspect the only reason we see UTF-16 on the
>> wire is because some programmers are too lazy to convert from an
>> internal variant of UTF-16 to interoperable UTF-8 on the wire, and
>> haven't thought through the bad consequences of their laziness.
>
> Way overstated. UTF-16 and several other Unicode charsets are very
> useful, depending on which protocol. Since UTF-8 is not terribly
> efficient, there is not particular reason to favor it over other Unicode
> charsets when one designs new protocols where ASCII compatibility is
> moot. IMHO.

Time and again people have created obscure binary protocols because they 
are more "efficient".  Most of these protocols have been huge failures 
because they are vastly less efficient when it comes to interoperability 
and diagnosability, which are usually far more important qualities than the 
number of bytes on the wire.  The minor space savings of UTF-16 relative to 
UTF-8 do not justify the huge loss in interoperability.  If space is an 
issue, apply a general-purpose compression algorithm to UTF-8 -- that will 
be vastly more efficient than UTF-16 without the loss of interoperability, 
auto-detection, or the ability to re-use existing IETF protocol support 
code.
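
As a rough sketch of that argument (Python, with zlib standing in for any 
general-purpose compressor and a made-up, highly repetitive sample string, 
so treat the numbers as an illustration of the comparison rather than a 
benchmark):

    import zlib

    sample = "これは日本語の例文です。" * 50   # CJK text, where raw UTF-16 beats raw UTF-8

    utf8 = sample.encode("utf-8")
    utf16 = sample.encode("utf-16-be")

    print(len(utf8), len(utf16))        # raw sizes: UTF-16 is smaller here
    print(len(zlib.compress(utf8)))     # compressed UTF-8 is smaller still, with no
                                        # byte-order ambiguity left to manage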

The most successful IETF application protocols have wisely sacrificed 
attempts to conserve bytes in exchange for improved diagnosability, 
interoperability and backwards compatibility.

> Remember that UTF-8 was designed to shoehorn Unicode/UCS into Unix file
> systems, nothing more. Where ASCII byte-stream compatibility is not an
> issue, there are Unicode charsets that are more efficient than UTF-8,
> different ones for different uses.

That may be the history, but UTF-8 was designed far better than UTF-16 when 
it comes to all aspects of interoperability.  Thus it should be the 
preferred encoding for all transport protocols and all interface points 
between systems from different vendors.  When UTF-16 is promoted instead of 
UTF-8, I consider that detrimental to the deployment of Unicode.

All the UTF-16 APIs in Windows and MacOS are a huge barrier to deployment 
of Unicode on those platforms, since all the code has to be rewritten (and 
most of it never is).  If they had instead retrofitted UTF-8 into the 
existing 8-bit APIs, we'd have much better Unicode deployment.
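
A small Python sketch of why the 8-bit route works (the filename is 
invented): UTF-8 never reuses byte values 0x00-0x7F inside a multi-byte 
sequence, so byte-oriented code that scans for ASCII delimiters or NUL 
terminators keeps working, while UTF-16 strings are full of NUL bytes:

    path = "/tmp/файл.txt"             # hypothetical filename with non-ASCII characters

    u8 = path.encode("utf-8")
    print(u8.split(b"/"))              # splitting on the ASCII '/' delimiter still works
    print(b"\x00" in u8)               # False -- safe for NUL-terminated C strings

    u16 = path.encode("utf-16-le")
    print(b"\x00" in u16)              # True -- every ASCII character carries a NUL byte,
                                       # so strlen()/strcpy()-style code truncates it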

                - Chris

Received on Monday, 23 December 2002 21:54:50 UTC