Re: internationalization/ISO10646 question from MURATA Makoto on 2002-12-25 (ietf-charsets@w3.org from October to December 2002)

From: MURATA Makoto <murata@hokkaido.email.ne.jp>
Date: Wed, 25 Dec 2002 11:51:06 +0900
To: Chris Newman <Chris.Newman@sun.com>
Cc: Marcin Hanclik <mhanclik@poczta.onet.pl>, ietf-charsets@iana.org
Message-id: <20021225113735.8C21.MURATA@hokkaido.email.ne.jp>

On Fri, 06 Dec 2002 13:13:41 -0800
Chris Newman <Chris.Newman@sun.com> wrote:

> 
> UTF-16 is a terrible encoding for interoperability.  There are 3 published 
> non-interoperable variants of UTF-16 (big-endian, little-endian, 
> BOM/switch-endian) and only one of the variants can be auto-detected with 
> any chance of success (and none of them can be auto-detected as well as 
> UTF-8). 

Unfortunately, as far as I know, UTF-8 is not free of such problems either:
(1) it may appear with or without the Unicode signature (BOM), (2) it can be confused 
with other ASCII-compatible encodings (especially when a program has only a few 
non-ASCII characters), (3) redundant (non-shortest-form) octet sequences create a 
vulnerability, and (4) non-BMP characters are encoded in either 4 or 6 octets (e.g., 
6 octets with writeUTF of java.io.DataOutput and readUTF of java.io.DataInput).  I know 
that Corrigendum #1: UTF-8 Shortest Form addresses (3), but I am not sure whether 
implementations are free of this vulnerability.
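
Points (1), (3), and (4) can be checked with a small Java sketch.  This is only an 
illustration, not a conformance test: the class name Utf8Pitfalls and its helper 
methods are made up for this example, and the behavior shown assumes a reasonably 
recent JDK whose built-in UTF-8 decoder rejects non-shortest forms.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
class Utf8Pitfalls {

    // Number of octets the string occupies in standard UTF-8.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of octets DataOutput.writeUTF emits for the string,
    // excluding the 2-octet length prefix that writeUTF writes first.
    static int writeUtfLength(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    // True if a decoder configured to report errors accepts the octets as UTF-8.
    static boolean strictlyValidUtf8(byte[] octets) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(octets));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Number of Java chars produced when decoding the octets as UTF-8.
    static int decodedLength(byte[] octets) {
        return new String(octets, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        // (4) U+10000 is the first non-BMP character.
        String nonBmp = new String(Character.toChars(0x10000));
        System.out.println(standardUtf8Length(nonBmp)); // 4
        System.out.println(writeUtfLength(nonBmp));     // 6

        // (3) C0 AF is a redundant two-octet form of '/' (U+002F).
        byte[] overlongSlash = {(byte) 0xC0, (byte) 0xAF};
        byte[] plainSlash = {0x2F};
        System.out.println(strictlyValidUtf8(overlongSlash)); // false
        System.out.println(strictlyValidUtf8(plainSlash));    // true

        // (1) The signature EF BB BF is decoded as U+FEFF, not stripped.
        byte[] signed = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(decodedLength(signed)); // 2
    }
}
```

The 6-octet result arises because writeUTF encodes each UTF-16 surrogate as its own 
3-octet sequence, so data written this way is not valid standard UTF-8 for non-BMP 
characters.  A decoder that accepted C0 AF as '/' would let that character slip past 
any check performed on the raw octets before decoding.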

I would be very happy if some encoding of Unicode became free of interoperability 
and security problems.  But I am not happy yet.

-- 
MURATA Makoto <murata@hokkaido.email.ne.jp>

Received on Tuesday, 24 December 2002 21:50:49 UTC