Re: internationalization/ISO10646 question from Markus Scherer on 2003-01-06 (ietf-charsets@w3.org from January to March 2003)

From: Markus Scherer <markus.scherer@jtcsv.com>
Date: Mon, 06 Jan 2003 08:49:23 -0800
To: charsets <ietf-charsets@iana.org>
Message-id: <3E19B393.7020403@jtcsv.com>

Chris Newman wrote:
> Software which is fully UTF-8 native will likely work just fine.  UTF-8 
> aware software already has support for variable width characters, 

As Ken pointed out, Murata-san's concerns about "6-octet UTF-8" are almost certainly about illegal 
encodings of supplementary code points as surrogate pairs, using 2*3 bytes, because such converters 
do or at least did exist. In the CESU-8 discussion, "6 bytes" usually meant pairs of 3-byte sequences.
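
To make the "2*3 bytes" case concrete, here is a minimal Python sketch that contrasts the 
well-formed 4-byte UTF-8 form of a supplementary code point with the 6-byte surrogate-pair form. 
The function names are hypothetical and chosen for illustration only; this is not taken from any 
particular converter.

```python
def utf8_encode(cp: int) -> bytes:
    """Well-formed UTF-8 for one code point, using Python's own codec."""
    return chr(cp).encode("utf-8")


def three_byte_form(unit: int) -> bytes:
    """Pack a 16-bit value into the 3-byte UTF-8 bit pattern (1110xxxx 10xxxxxx 10xxxxxx)."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def surrogate_pair_encode(cp: int) -> bytes:
    """Illustration of the illegal form: split a supplementary code point
    into a UTF-16 surrogate pair and encode each surrogate in 3 bytes."""
    if cp < 0x10000:
        raise ValueError("only supplementary code points are split into pairs")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)      # lead surrogate
    low = 0xDC00 | (v & 0x3FF)     # trail surrogate
    return three_byte_form(high) + three_byte_form(low)


cp = 0x10400  # a supplementary code point, used here only as an example
print(utf8_encode(cp).hex(" "))            # 4 bytes
print(surrogate_pair_encode(cp).hex(" "))  # 6 bytes, i.e. 2*3
```

A strict UTF-8 decoder, such as the one in Python 3, rejects the second form because the 3-byte 
sequences decode to surrogate values.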

> That's actually the most serious flaw in UTF-16.  It's a variable width 
> encoding, but the variable width characters are an uncommon case 
> (currently).  That means all the code to support non-16 bit characters 
> in UTF-16 is an uncommon case and those codepaths haven't been tested 
> (if they exist).  Thus you can expect deployed UTF-16 based software to 
> break in various ways as non-BMP characters show up.

This is true, but it holds regardless of which UTF is used for processing. UTF-8-based software 
also used to assume that supplementary code points would never occur, and hardcoded that assumption 
with 16-bit wchar_t and 16-bit-indexed character lookups. Despite UTF-8's design, a lot of UTF-8 software 
wrongly encoded supplementary code points (by encoding surrogate pairs instead), wrongly decoded 
them (many decoders truncated code points to the lower 16 bits - visible in several popular 
browsers, until very recently at least), and failed to look up properties for supplementary code 
points. Code paths for dealing with supplementary characters were tested as little with UTF-8 as 
they were with UTF-16.
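
The truncation failure mentioned above can be sketched in a few lines of Python. Both functions 
are hypothetical illustrations, not code from any browser or library: the first decodes a 4-byte 
UTF-8 sequence properly, the second keeps only the lower 16 bits, as a decoder storing its result 
in a 16-bit type would.

```python
def decode_4byte(seq: bytes) -> int:
    """Decode one 4-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx).
    Validation of the lead and trail bytes is omitted to keep the sketch short."""
    b0, b1, b2, b3 = seq
    return (
        ((b0 & 0x07) << 18)
        | ((b1 & 0x3F) << 12)
        | ((b2 & 0x3F) << 6)
        | (b3 & 0x3F)
    )


def decode_4byte_truncating(seq: bytes) -> int:
    """Same decoding, but the result is cut to 16 bits, which is the bug."""
    return decode_4byte(seq) & 0xFFFF


seq = b"\xf0\x90\x90\x80"                 # well-formed UTF-8 for U+10400
print(hex(decode_4byte(seq)))             # the supplementary code point
print(hex(decode_4byte_truncating(seq)))  # an unrelated BMP code point
```

The truncated result is a valid-looking BMP code point, so the error passes silently and shows up 
only as a wrong character.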

> Unfortunately, I'm afraid the majority of software will fall in the 
> latter two categories.

That is true regardless of UTF. On the other hand, writers of low-level Unicode libraries have spent time over 
the last several years testing and upgrading their code. For example, ICU fully handles 
supplementary code points, using UTF-16 - in both C/C++ and Java.
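
As a rough illustration of what handling supplementary code points in UTF-16 involves, here is a 
minimal Python sketch that walks a sequence of 16-bit code units and combines surrogate pairs. It 
is not ICU's API or implementation; the function name and the choice to pass unpaired surrogates 
through unchanged are assumptions made for this example.

```python
from typing import Iterator, Sequence


def code_points_from_utf16(units: Sequence[int]) -> Iterator[int]:
    """Yield code points from 16-bit code units, pairing surrogates.
    Unpaired surrogates are yielded as-is (an assumption of this sketch)."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        if is_lead and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine lead and trail surrogate into one supplementary code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


units = [0x0041, 0xD801, 0xDC00, 0x0042]  # "A", one surrogate pair, "B"
print([hex(cp) for cp in code_points_from_utf16(units)])
```

Property lookups and similar operations would then be indexed by the full code point value rather 
than by a single 16-bit unit.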

Best regards,
markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.

Received on Monday, 6 January 2003 16:46:23 UTC