- From: Markus Scherer <markus.scherer@jtcsv.com>
- Date: Thu, 19 Dec 2002 14:03:12 -0800
- To: charsets <ietf-charsets@iana.org>
Chris Newman wrote:

> UTF-16 is a terrible encoding for interoperability. There are 3

Not true, especially if it's declared properly. It is interoperable, and
it is at least as compact as, or more compact than, UTF-8 for all
non-Latin texts.

> published non-interoperable variants of UTF-16 (big-endian,
> little-endian, BOM/switch-endian) and only one of the variants can be

Yes, but the variants are minor - endianness and BOM.

> auto-detected with any chance of success (and none of them can be
> auto-detected as well as UTF-8). It's not a fixed-width encoding, so
> you don't get the fixed-width benefits that UCS-4 would provide (unless

Well, few encodings are fixed-width, and some popular encodings are a
lot more complicated. Fixed-width encodings are useful for processing,
but this is not an issue for transport.

Exchanging data over a wire in UTF-32/UCS-4 would be crazy. You would
knowingly waste at least 33% and almost always 50% of your bandwidth
transmitting 0s, compared with UTF-16. Besides, UTF-32 has the same 3
variants.

> you ignore a slew of plane-1 characters) and it doesn't have any of the

which occur rarely

> useful characteristics of UTF-8 (nearly complete compatibility with code
> written to operate on 8-bit character strings).

True, but if you use a converter anyway for input/output, as you have to
do in a MIME world, then you have to do that for any charset.

> So this raises the question: why would any sensible protocol designer
> ever want to transport UTF-16 over the wire? There may be a few rare
> corner cases where it makes sense, but in general UTF-8 is superior in
> almost all instances. I suspect the only reason we see UTF-16 on the
> wire is because some programmers are too lazy to convert from an
> internal variant of UTF-16 to interoperable UTF-8 on the wire, and
> haven't thought through the bad consequences of their laziness.

Way overstated. UTF-16 and several other Unicode charsets are very
useful, depending on the protocol. Since UTF-8 is not terribly
efficient, there is no particular reason to favor it over other Unicode
charsets when one designs new protocols where ASCII compatibility is
moot. IMHO.

Remember that UTF-8 was designed to shoehorn Unicode/UCS into Unix file
systems, nothing more. Where ASCII byte-stream compatibility is not an
issue, there are Unicode charsets that are more efficient than UTF-8,
different ones for different uses.

Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless
otherwise noted.
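To put numbers on the size comparison and the three UTF-16 variants
discussed above, here is a minimal sketch in Python; the sample strings
are arbitrary illustrations (not from the thread), and any non-Latin BMP
text behaves similarly:

```python
# Byte counts for the Unicode encodings discussed above.
# Sample strings are arbitrary; swap in any text to compare.
samples = {
    "Latin":    "Hello, world",
    "Cyrillic": "Здравствуйте",
    "Japanese": "こんにちは世界",
}

for name, text in samples.items():
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(name, sizes)
# For the Latin sample utf-8 is smallest; for the Cyrillic sample utf-8
# and utf-16-le tie; for the Japanese sample utf-16-le wins. utf-32-le
# doubles utf-16-le in every case: the high bytes are always zero.

# The three UTF-16 variants differ only in byte order and BOM:
print("A".encode("utf-16-be"))  # b'\x00A'  big-endian, no BOM
print("A".encode("utf-16-le"))  # b'A\x00'  little-endian, no BOM
print("A".encode("utf-16"))     # BOM first, then native byte order:
                                # b'\xff\xfeA\x00' on a little-endian host
```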