- From: Kenneth Whistler <kenw@sybase.com>
- Date: Thu, 16 Dec 1999 10:25:21 -0800 (PST)
- To: Harald@Alvestrand.no
- Cc: ietf-charsets@iana.org, kenw@sybase.com, mark.davis@us.ibm.com
I concur with some of Harald's list of disadvantages for UTF-16 as an interchange format, but find myself puzzled by some of the others:

> My list of disadvantages:
>
> - No compatibility with cstrings due to NULL

This is an obvious problem for interworking with APIs that use 8-bit character sets. But I agree with François that this issue will disappear over time as people create appropriate interfaces to work with 16-bit strings. The real issue is not the NULLs but the datatype difference.

> - Inability to represent characters outside Planes 0-16

WG2 and UTC are converging on a point of view that characters outside of Planes 0-16 should *never* be assigned. This may be formally written into 10646. The rationale here is that nearly all 10646 implementations are following the Unicode Standard, by necessity, to achieve interoperability in areas that are left unspecified by 10646. Formalizing this convergence by constraining the code space range that could ever be assigned standard characters would close down this nagging issue of incompatibility between the Unicode Standard and 10646.

In that case, UTF-8, UTF-16, and UTF-32 would *all* have the exact same representational capability, and would all be completely interconvertible forms.

> - VERY bad expansion factor for characters outside Plane 0 (100% overhead)

This claim I do not understand at all:

                       bytes per character
scalar value (hex)     UTF-8   UTF-16   UTF-32
0..7F                    1       2        4
80..7FF                  2       2        4
800..FFFD                3       2        4
10000..10FFFD            4       4        4

The only size advantage for UTF-8 is for ASCII values, and UTF-16 has the clear size advantage for East Asian data.

> - No ability to mix ASCII and UTF-16 elements in a simple viewer

This is a very important transitional and developmental advantage for UTF-8, absolutely.

> - Two incompatible byte orders

Also an admitted problem for UTF-16 and UTF-32, but not significantly more complex than defining interchange formats for any datatype that has to be expressed in machine words larger than a byte wide.
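
The byte counts in the size table, and the two byte orders, can be checked with a minimal sketch. It assumes Python 3's built-in codecs; the helper names `encoded_sizes` and `byte_orders` and the list of sample values are hypothetical, chosen only for illustration.

```python
# Sketch: bytes needed per character in each encoding form, using
# Python 3's built-in codecs. The explicit "-be"/"-le" codec names are
# used so that no byte order mark is added to the output.

def encoded_sizes(code_point):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one scalar value."""
    ch = chr(code_point)
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


def byte_orders(text):
    """Return the (big-endian, little-endian) UTF-16 serializations of text."""
    return text.encode("utf-16-be"), text.encode("utf-16-le")


# Hypothetical sample values: the first and last scalar value of each table row.
SAMPLES = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFD, 0x10000, 0x10FFFD]

if __name__ == "__main__":
    for cp in SAMPLES:
        utf8, utf16, utf32 = encoded_sizes(cp)
        print(f"U+{cp:04X}  UTF-8={utf8}  UTF-16={utf16}  UTF-32={utf32}")

    # The same character in both byte orders; each form contains a zero
    # byte, which is what trips up APIs expecting NUL-terminated 8-bit strings.
    print(byte_orders("A"))
```

Applying `byte_orders` to an ASCII letter also shows the zero byte behind the cstring objection: every ASCII character serializes to one zero byte plus its ASCII value, in an order that depends on the chosen byte order.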
>
> My list of advantages:
>
> - Does not require conversion between UCS-2 and UTF-16 when only Plane 0
>   characters are used in the UTF-16

UCS-2 is a dead issue in any case. All Unicode implementations should at this point formally be UTF-16 implementations, whether they are actually supporting the interpretation of surrogate pairs or not. If they are claiming conformance to Unicode 2.0 or higher, then they are UTF-16.

--Ken

>
> Note that the single advantage may be listed as a disadvantage if there
> turns out to be lots of applications that "support" UTF-16 the way they
> currently "support" Unicode - by throwing away the high-order bits....
>
>                     Harald A
>
> --
> Harald Tveit Alvestrand, EDB Maxware, Norway
> Harald.Alvestrand@edb.maxware.no
>
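
The difference between interpreting surrogate pairs and discarding high-order bits can be sketched as follows. This is an illustration under stated assumptions, not code from either correspondent: the function names are hypothetical, and `truncate_to_16_bits` models only one possible reading of "throwing away the high-order bits", applied to a 16-bit code unit.

```python
# Sketch of the UTF-16 surrogate-pair arithmetic for scalar values
# above U+FFFF, next to a lossy 16-bit truncation for contrast.

def to_surrogates(code_point):
    """Split a supplementary-plane scalar value into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane scalar value")
    offset = code_point - 0x10000        # 20-bit offset into Planes 1-16
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into its scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def truncate_to_16_bits(code_point):
    """Assumed model of a lossy implementation: keep only the low 16 bits."""
    return code_point & 0xFFFF


if __name__ == "__main__":
    cp = 0x10041
    high, low = to_surrogates(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X} -> U+{from_surrogates(high, low):X}")
    # Truncation silently turns U+10041 into U+0041, an unrelated character.
    print(f"truncated: U+{truncate_to_16_bits(cp):04X}")
```

The pair round-trips without loss, while the truncated value collides with a Plane 0 character and cannot be recovered.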
Received on Thursday, 16 December 1999 13:29:55 UTC