Re: internationalization/ISO10646 question from Chris Newman on 2003-01-02 (ietf-charsets@w3.org from January to March 2003)

From: Chris Newman <Chris.Newman@Sun.COM>
Date: Thu, 02 Jan 2003 15:15:11 -0800
To: MURATA Makoto <murata@hokkaido.email.ne.jp>
Cc: Marcin Hanclik <mhanclik@poczta.onet.pl>, ietf-charsets@iana.org
Message-id: <2147483647.1041520511@nifty-jr.west.sun.com>

(1) When UTF-8 leaks out with a BOM, that is the result of buggy software 
since the BOM simply isn't needed for UTF-8.

(2) This is not an issue in most Internet protocols since the standards 
have required proper charset labelling for many years.  Ironically, most of 
the countries with widely deployed software that violates the standards by 
emitting unlabelled charsets use encodings that are very easy to 
distinguish from UTF-8.

(3) The UTF-8 overlong sequence issue is sufficiently well documented that 
any security problems in practice are the result of buggy code.  It's an 
extremely minor security issue now, particularly compared to the lookalike 
character problem which impacts all encodings of Unicode and many other 
character sets.

(4) If octet count is an issue, use a general-purpose compression layer, 
which will vastly exceed any savings possible with encoding tricks.
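
To make points (1) through (3) concrete, here is a minimal Python sketch. 
It assumes the standard library's strict UTF-8 decoder is an acceptable 
validator; the function names decode_utf8_strict and looks_like_utf8 are 
made up for illustration and come from no particular library.

```python
import codecs


def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8, tolerating a leading BOM and rejecting malformed input."""
    # Point (1): the BOM carries no byte-order information in UTF-8,
    # so a receiver can simply drop it if buggy software emitted one.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Point (3): the strict decoder raises UnicodeDecodeError on overlong
    # forms such as C0 AF (a two-octet encoding of "/").
    return data.decode("utf-8", errors="strict")


def looks_like_utf8(data: bytes) -> bool:
    """Point (2): heuristic only; legacy multi-octet text rarely validates."""
    try:
        decode_utf8_strict(data)
        return True
    except UnicodeDecodeError:
        return False
```

For example, looks_like_utf8(b"\xc0\xaf") is False because the overlong 
form is rejected, and the Shift_JIS octets for a short Japanese string 
fail validation as well, which is why unlabelled legacy text is usually 
easy to tell apart from UTF-8.  This is a heuristic, not a substitute for 
proper charset labelling.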

Is UTF-8 perfect?  No.  But the benefits greatly outweigh the costs when 
compared to any other charset I've seen, and particularly when compared to 
UTF-16.

                - Chris

begin quotation by MURATA Makoto on 2002/12/25 11:51 +0900:

> On Fri, 06 Dec 2002 13:13:41 -0800
> Chris Newman <Chris.Newman@sun.com> wrote:
>
>>
>> UTF-16 is a terrible encoding for interoperability.  There are 3
>> published non-interoperable variants of UTF-16 (big-endian,
>> little-endian, BOM/switch-endian) and only one of the variants can be
>> auto-detected with any chance of success (and none of them can be
>> auto-detected as well as UTF-8).
>
> Unfortunately, as far as I know, UTF-8 is not free of such problems.
> (1) With or without the Unicode signature, (2) possible confusion with
> other ASCII-compatible encodings (especially when a program has a few
> non-ASCII characters), (3) vulnerability caused by redundant octet
> sequences, and (4) use of 4 or 6 octets for non-BMP characters (e.g.,
> writeUTF and readUTF of java.io.DataOutput).  I know that Corrigendum
> #1: UTF-8 Shortest Form addresses (3), but I am not sure if
> implementations are free of this vulnerability.
>
> I would be very happy if some encoding of Unicode becomes free of
> interoperability or security problems.  But I am not happy yet.
>
> --
> MURATA Makoto <murata@hokkaido.email.ne.jp>
>

Received on Thursday, 2 January 2003 18:21:15 UTC