Re: internationalization/ISO10646 question

On Fri, 06 Dec 2002 13:13:41 -0800
Chris Newman <Chris.Newman@sun.com> wrote:

> 
> UTF-16 is a terrible encoding for interoperability.  There are 3 published 
> non-interoperable variants of UTF-16 (big-endian, little-endian, 
> BOM/switch-endian) and only one of the variants can be auto-detected with 
> any chance of success (and none of them can be auto-detected as well as 
> UTF-8). 

Unfortunately, as far as I know, UTF-8 is not free of such problems either:
(1) it may appear with or without the Unicode signature, (2) it can be confused
with other ASCII-compatible encodings (especially when the text has only a few
non-ASCII characters), (3) redundant octet sequences create a vulnerability, and
(4) non-BMP characters may be encoded in either 4 or 6 octets (e.g., writeUTF of
java.io.DataOutput and readUTF of java.io.DataInput).  I know that
Corrigendum #1: UTF-8 Shortest Form addresses (3), but I am not sure whether
implementations are free of this vulnerability.
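
To make the four points concrete, here is a minimal Python sketch. The helper
names (has_signature, is_strict_utf8, six_octet_form) are hypothetical and
exist only for this illustration; the sketch assumes Python 3's built-in strict
"utf-8" codec, and the 6-octet form is built by hand as two 3-octet surrogates,
which is my understanding of what surrogate-based encoders produce.

```python
def has_signature(data: bytes) -> bool:
    """(1) True if the octets start with the UTF-8 signature EF BB BF."""
    return data.startswith(b"\xef\xbb\xbf")


def is_strict_utf8(data: bytes) -> bool:
    """(3) True if Python's strict decoder accepts the octets as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def six_octet_form(ch: str) -> bytes:
    """(4) Encode one non-BMP character as two 3-octet surrogates."""
    units = ch.encode("utf-16-be")            # high and low surrogate, 2 octets each
    hi = int.from_bytes(units[0:2], "big")
    lo = int.from_bytes(units[2:4], "big")
    # "surrogatepass" lets the codec emit octets for lone surrogates.
    return (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")


# (1) The same text with and without the signature is a different octet string.
with_sig = "a".encode("utf-8-sig")
without_sig = "a".encode("utf-8")

# (2) Valid UTF-8 is also valid ISO-8859-1, just with a different meaning.
misread = "\u00e9".encode("utf-8").decode("latin-1")

# (3) C0 AF is a redundant (non-shortest) encoding of "/".
overlong_slash = b"\xc0\xaf"

# (4) U+10000 in 4 octets versus 6 octets.
four = "\U00010000".encode("utf-8")
six = six_octet_form("\U00010000")
```

A decoder that accepts overlong_slash as "/" after a path check has already
run on the raw octets is the kind of vulnerability meant in (3).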

I would be very happy if some encoding of Unicode became free of interoperability
and security problems.  But I am not happy yet.

-- 
MURATA Makoto <murata@hokkaido.email.ne.jp>

Received on Tuesday, 24 December 2002 21:50:49 UTC