- From: Chris Newman <Chris.Newman@Sun.COM>
- Date: Mon, 23 Dec 2002 18:48:51 -0800
- To: Markus Scherer <markus.scherer@jtcsv.com>, charsets <ietf-charsets@iana.org>
begin quotation by Markus Scherer on 2002/12/19 14:03 -0800:

> Chris Newman wrote:
>> UTF-16 is a terrible encoding for interoperability. There are 3
>
> Not true, especially if it's declared properly. It is interoperable, and
> it is at least as compact as, or more compact than, UTF-8 for all
> non-Latin texts.

If the people who created UTF-16 hadn't messed around with the BOM crap
and had instead mandated network byte order in files, interfaces and on
the network, then it would interoperate well. But in today's world, UTF-16
will interoperate just as well as TIFF does, since it made the same
mistake (actually worse than TIFF, since the BOM is optional). I've seen
programs which offer to save TIFF files in "Mac format" (big-endian) or
"PC format" (little-endian) -- just to show you how well that game works.
Meanwhile JFIF/JPEG, PNG, and GIF interoperate well because they mandated
a byte order. (A byte-level sketch of the three UTF-16 wire forms appears
below.)

>> published non-interoperable variants of UTF-16 (big-endian,
>> little-endian, BOM/switch-endian) and only one of the variants can be
>
> Yes, but the variants are minor - endianness and BOM.

But more than sufficient to cause user-visible interoperability problems.
See past experience with TIFF.

>> auto-detected with any chance of success (and none of them can be
>> auto-detected as well as UTF-8). It's not a fixed-width encoding, so
>> you don't get the fixed-width benefits that UCS-4 would provide (unless
>
> Well, few encodings are fixed-width, and some popular encodings are a lot
> more complicated. Fixed-width encodings are useful for processing, but
> this is not an issue for transport.

Exactly true. For transport, interoperability trumps all other
requirements. I brought this up because there is a common misconception
that UTF-16 is fixed-width. Well, it's mostly fixed-width -- meaning you
get none of the advantages of a fixed-width encoding, and because the
variable-width case (surrogate pairs) is uncommon, it is rarely exercised
and so adds a new set of interoperability problems related to those
additional characters. It violates the "avoid uncommon cases" design rule.
(See the surrogate-pair sketch below.) Because UTF-8 is variable-width in
the common case, it's much more likely to interoperate over the entire
Unicode repertoire than UTF-16.

>> So this raises the question: why would any sensible protocol designer
>> ever want to transport UTF-16 over the wire? There may be a few rare
>> corner cases where it makes sense, but in general UTF-8 is superior in
>> almost all instances. I suspect the only reason we see UTF-16 on the
>> wire is because some programmers are too lazy to convert from an
>> internal variant of UTF-16 to interoperable UTF-8 on the wire, and
>> haven't thought through the bad consequences of their laziness.
>
> Way overstated. UTF-16 and several other Unicode charsets are very
> useful, depending on which protocol. Since UTF-8 is not terribly
> efficient, there is no particular reason to favor it over other Unicode
> charsets when one designs new protocols where ASCII compatibility is
> moot. IMHO.

Time and again people have created obscure binary protocols because they
are more "efficient". Most of these protocols have been huge failures
because they are vastly less efficient when it comes to interoperability
and diagnosability, which are usually far more important qualities than
the number of bytes on the wire. The minor space savings of UTF-16
relative to UTF-8 do not justify the huge loss in interoperability.

If space is an issue, apply a general-purpose compression algorithm to
UTF-8 -- that will be vastly more efficient than UTF-16 without the loss
of interoperability, auto-detection, or the ability to re-use existing
IETF protocol support code. The most successful IETF application
protocols have wisely sacrificed attempts to conserve bytes in exchange
for improved diagnosability, interoperability and backwards-compatibility.
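To put rough numbers on both claims -- a minimal sketch; the sample
strings are arbitrary, and real ratios vary with the text:

```python
import zlib

# Arbitrary sample strings; exact ratios depend on the actual text.
samples = {
    "English":  "The quick brown fox jumps over the lazy dog. ",
    "Russian":  "Съешь же ещё этих мягких французских булок. ",
    "Japanese": "いろはにほへと ちりぬるを わかよたれそ ",
}

for name, text in samples.items():
    text = text * 40                      # something closer to message-sized
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")      # raw big-endian, no BOM
    print(f"{name:9} UTF-8: {len(utf8):5}  UTF-16: {len(utf16):5}  "
          f"deflated UTF-8: {len(zlib.compress(utf8)):5}")
```

UTF-16 does win on raw size for many non-Latin scripts, but the deflated
UTF-8 column wins everywhere, and it keeps a single unambiguous byte
sequence on the wire.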
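And here is the byte-level sketch of the three published wire forms
promised above (codec names are Python's; its "utf-16" codec is the
BOM-prefixed, switch-endian form):

```python
text = "héllo"

# Three different byte sequences for the same five characters.
for codec in ("utf-16-be", "utf-16-le", "utf-16"):  # "utf-16" prepends a BOM
    print(f"{codec:9} -> {text.encode(codec).hex(' ')}")

# UTF-8 has exactly one form, and its strict structural rules mean a
# plain validity check already works as a practical auto-detector:
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(text.encode("utf-8")))      # True
print(looks_like_utf8(text.encode("utf-16-le")))  # False: e9 00 is malformed
```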
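And the surrogate-pair sketch (the example character is arbitrary -- any
character outside the Basic Multilingual Plane behaves this way):

```python
ch = "\U00010400"            # U+10400, outside the BMP

utf16 = ch.encode("utf-16-be")
print(utf16.hex(" "))        # d8 01 dc 00 -- two 16-bit units, a surrogate
                             # pair: the "mostly fixed-width" exception
utf8 = ch.encode("utf-8")
print(utf8.hex(" "))         # f0 90 90 80 -- four bytes, produced by the
                             # same length rules as every other character
```

Code that assumes one 16-bit unit per character works until the first such
character shows up -- exactly the uncommon-case trap described above.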
> Remember that UTF-8 was designed to shoehorn Unicode/UCS into Unix file
> systems, nothing more. Where ASCII byte-stream compatibility is not an
> issue, there are Unicode charsets that are more efficient than UTF-8,
> different ones for different uses.

That may be the history, but UTF-8 was designed far better than UTF-16
when it comes to all aspects of interoperability. Thus it should be the
preferred encoding for all transport protocols and all interface points
between systems from different vendors.

When UTF-16 is promoted instead of UTF-8, I consider that detrimental to
the deployment of Unicode. All the UTF-16 APIs in Windows and MacOS are a
huge barrier to deployment of Unicode on those platforms, since all the
code has to be rewritten (and most of it never is). If they had instead
retro-fitted UTF-8 into the existing 8-bit APIs, we'd have much better
Unicode deployment.
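To make that retro-fit point concrete, a minimal sketch (the record format
here is made up for illustration): UTF-8 is ASCII-transparent and 8-bit
clean, so byte-oriented code keeps working, while UTF-16 breaks it.

```python
# Hypothetical byte-oriented record, as an existing 8-bit API might see it.
record = "name=José;city=東京".encode("utf-8")

# ASCII ';' and '=' can never occur inside a multi-byte UTF-8 sequence,
# so a legacy byte-level parser splits the record correctly:
fields = dict(f.split(b"=", 1) for f in record.split(b";"))
print({k.decode("utf-8"): v.decode("utf-8") for k, v in fields.items()})

# UTF-16 is not 8-bit clean: plain ASCII grows embedded NUL bytes, and
# ASCII-valued bytes can appear inside other characters' code units.
print("name=José".encode("utf-16-be").hex(" "))
```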
- Chris

Received on Monday, 23 December 2002 21:54:50 UTC