- From: Markus Scherer <markus.scherer@jtcsv.com>
- Date: Mon, 06 Jan 2003 08:49:23 -0800
- To: charsets <ietf-charsets@iana.org>
Chris Newman wrote:
> Software which is fully UTF-8 native will likely work just fine.  UTF-8
> aware software already has support for variable width characters,

As Ken pointed out, Murata-san's concerns with "6-octet UTF-8" almost certainly are about illegal encodings of surrogate pairs with 2*3 bytes, because such converters do or at least did exist. In the CESU-8 discussion "6 bytes" usually meant pairs of 3-byte sequences.

> That's actually the most serious flaw in UTF-16.  It's a variable width
> encoding, but the variable width characters are an uncommon case
> (currently).  That means all the code to support non-16 bit characters
> in UTF-16 is an uncommon case and those codepaths haven't been tested
> (if they exist).  Thus you can expect deployed UTF-16 based software to
> break in various ways as non-BMP characters show up.

This is true, but regardless of which UTF is used for processing. UTF-8-based software also used to assume that supplementary code points would never occur, and used to hardcode that assumption with 16-bit wchar_t and 16-bit-indexed character lookups.

Despite UTF-8's design, a lot of UTF-8 software wrongly encoded supplementary code points (by encoding surrogate pairs instead), wrongly decoded them (many decoders truncated code points to the lower 16 bits - visible in several popular browsers, until very recently at least), and failed to look up properties for supplementary code points.

Code paths for dealing with supplementary characters were tested as little with UTF-8 as they were with UTF-16.

> Unfortunately, I'm afraid the majority of software will fall in the
> latter two categories.

Regardless of UTF.

On the other hand, writers of low-level Unicode libraries have spent time over the last several years testing and upgrading their code. For example, ICU fully handles supplementary code points, using UTF-16 - in both C/C++ and Java.
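To make the failure modes above concrete, here is a minimal Python sketch. It is not taken from ICU or from any converter mentioned in this thread; the function names (utf8_encode, surrogate_pair_encode, truncate16) and the sample code point U+10400 are my own choices for illustration. It contrasts the well-formed 4-byte UTF-8 form of a supplementary code point with the illegal 2*3-byte form produced by encoding the surrogate pair, and shows what truncation to 16 bits does to the value.

```python
def _three_bytes(unit: int) -> bytes:
    """Pack a 16-bit value into the 3-byte UTF-8 bit pattern."""
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as well-formed UTF-8 (1 to 4 bytes)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return _three_bytes(cp)
    # Supplementary code point: one 4-byte sequence.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def surrogate_pair_encode(cp: int) -> bytes:
    """The illegal "6-byte" form: each UTF-16 surrogate encoded as 3 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # lead surrogate
    low = 0xDC00 | (v & 0x3FF)   # trail surrogate
    return _three_bytes(high) + _three_bytes(low)


def truncate16(cp: int) -> int:
    """What a decoder with 16-bit storage does: keep only the low 16 bits."""
    return cp & 0xFFFF


cp = 0x10400
print(utf8_encode(cp).hex(" "))            # f0 90 90 80
print(surrogate_pair_encode(cp).hex(" "))  # ed a0 81 ed b0 80
print(hex(truncate16(cp)))                 # 0x400
```

A strict decoder rejects the 6-byte form (Python's own bytes.decode("utf-8") raises UnicodeDecodeError on it), while the truncating decoder silently turns U+10400 into U+0400, a different character altogether.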
Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.
Received on Monday, 6 January 2003 16:46:23 UTC