- From: Chris Newman <Chris.Newman@Sun.COM>
- Date: Thu, 02 Jan 2003 15:15:11 -0800
- To: MURATA Makoto <murata@hokkaido.email.ne.jp>
- Cc: Marcin Hanclik <mhanclik@poczta.onet.pl>, ietf-charsets@iana.org
(1) When UTF-8 leaks out with a BOM, that is the result of buggy software, since the BOM simply isn't needed for UTF-8.

(2) This is not an issue in most Internet protocols, since the standards have required proper charset labelling for many years. Ironically, most of the countries with widely deployed software that violates the standards by emitting unlabelled charsets use encodings that are very easy to distinguish from UTF-8.

(3) The UTF-8 overlong sequence issue is sufficiently well documented that any security problems in practice are the result of buggy code. It's an extremely minor security issue now, particularly compared to the lookalike character problem, which impacts all encodings of Unicode and many other character sets.

(4) If octet count is an issue, use a general-purpose compression layer, which will vastly exceed any savings possible with encoding tricks.

Is UTF-8 perfect? No. But the benefits greatly outweigh the costs when compared to any other charset I've seen, and particularly when compared to UTF-16.

- Chris

begin quotation by MURATA Makoto on 2002/12/25 11:51 +0900:
> On Fri, 06 Dec 2002 13:13:41 -0800
> Chris Newman <Chris.Newman@sun.com> wrote:
>
>> UTF-16 is a terrible encoding for interoperability. There are 3
>> published non-interoperable variants of UTF-16 (big-endian,
>> little-endian, BOM/switch-endian) and only one of the variants can be
>> auto-detected with any chance of success (and none of them can be
>> auto-detected as well as UTF-8).
>
> Unfortunately, as far as I know, UTF-8 is not free of such problems.
> (1) With or without the Unicode signature, (2) possible confusion with
> other ASCII-compatible encodings (especially when a program has a few
> non-ASCII characters), (3) vulnerability caused by redundant octet
> sequences, and (4) use of 4 or 6 octets for non-BMP characters (e.g.,
> writeUTF and readUTF of java.io.DataOutput). I know that Corrigendum
> #1: UTF-8 Shortest Form addresses (3), but I am not sure if
> implementations are free of this vulnerability.
>
> I would be very happy if some encoding of Unicode becomes free of
> interoperability or security problems. But I am not happy yet.
>
> --
> MURATA Makoto <murata@hokkaido.email.ne.jp>
>
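
Points (1) and (3) in this exchange can be illustrated with a minimal Python sketch. It assumes Python 3's built-in strict UTF-8 decoder, which rejects non-shortest-form sequences. The helper names strip_utf8_bom and is_valid_utf8 are hypothetical, chosen only for illustration, and do not come from either message.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes under strict UTF-8 rules."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A receiver that tolerates a stray BOM can strip it before decoding.
text = strip_utf8_bom(b"\xef\xbb\xbfabc").decode("utf-8")

# C0 AF is an overlong (two-octet) form of "/" (U+002F); a strict
# decoder must refuse it rather than map it to the ASCII character.
overlong_slash_ok = is_valid_utf8(b"\xc0\xaf")
plain_slash_ok = is_valid_utf8(b"/")
```

Whether a given deployed decoder behaves this way is exactly the open question raised in the quoted message; the sketch only shows what a conforming shortest-form check looks like from the caller's side.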
Received on Thursday, 2 January 2003 18:21:15 UTC