Re: [Moderator Action] Re: BOM & Unicode editors from Asmus Freytag on 2000-06-06 (www-international@w3.org from April to June 2000)

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Mon, 05 Jun 2000 23:49:58 -0700
To: "Addison Phillips [FCOM]" <AddisonP@flashcom.net> (by way of "Martin J. Duerst" <duerst@w3.org>), www-international@w3.org
Message-Id: <4.2.0.58.20000605234331.01f01398@popd.ix.netcom.com>

At 02:09 PM 6/6/00 +0900, Addison Phillips [FCOM] wrote:
>Actually, in Win2000 and later, MS products mean UTF-16LE.

No. The designation UTF-16LE is reserved for the case that you label the 
data stream externally with the byte order. MS products (at least for plain 
text files) tag the data with a BOM character, making the data UTF-16 
(albeit in the 'little-endian' flavor). As I wrote, there is no shortcut 
designation for this.
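
To illustrate the distinction, here is a minimal Python sketch. The sample text and the helper name sniff_byte_order are made up for illustration and say nothing about how any MS product is implemented. It shows that data labelled UTF-16LE carries no BOM, while a BOM-tagged little-endian stream can be identified from its first two bytes.

```python
import codecs

text = "A"

# Externally labelled as UTF-16LE: the byte order comes from the label,
# so no BOM is written into the data.
labelled = text.encode("utf-16-le")

# Self-describing UTF-16: a BOM (U+FEFF) comes first, followed by the
# code units in little-endian order.
tagged = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def sniff_byte_order(data: bytes) -> str:
    """Hypothetical helper: guess the byte order from a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian (BOM)"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian (BOM)"
    return "unknown without an external label"


print(labelled)                    # raw code units only
print(tagged)                      # BOM bytes FF FE, then the code units
print(sniff_byte_order(tagged))
print(sniff_byte_order(labelled))
```

Decoding the tagged bytes with Python's generic "utf-16" codec consumes the BOM and returns the original text, whereas the untagged bytes need the "utf-16-le" label to be read reliably.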

>Older products
>really mean UCS-2 (as in, they don't understand surrogates and converting
>UTF-8 values beyond 0xFFFF will result in undefined behavior or data loss).
>Of course, support for UTF-8 was spotty or non-existant in those products
>anyway, so I guess it works out to be the same.

Actually, since most of these older products don't interpret surrogate 
values, you can expect a fair amount of blind pass-through, although I'm 
sure that you can easily find bugs that cause (or let the user cause) 
surrogate pairs to be split or truncated. In the long run, it matters more 
how soon programs provide full support, whether via UTF-8 or UTF-16.
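
A rough Python sketch of both behaviours follows. All function names are hypothetical and the sample string is arbitrary; the code only models a program that handles text as 16-bit code units. Copying the units untouched preserves a surrogate pair, while truncating at an arbitrary code unit count can leave a dangling high surrogate.

```python
import struct


def to_units(s: str) -> list:
    """Split a string into 16-bit UTF-16 code units."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def from_units(units: list) -> str:
    """Reassemble a string from 16-bit UTF-16 code units."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def naive_truncate(units: list, n: int) -> list:
    """What a surrogate-unaware program does: cut at n code units."""
    return units[:n]


def safe_truncate(units: list, n: int) -> list:
    """Cut at n code units, but never leave a dangling high surrogate."""
    cut = units[:n]
    if cut and is_high_surrogate(cut[-1]):
        cut = cut[:-1]
    return cut


text = "a\U00010400b"        # U+10400 needs a surrogate pair in UTF-16
units = to_units(text)       # [0x0061, 0xD801, 0xDC00, 0x0062]

# Blind pass-through: the pair survives because nothing touches it.
assert from_units(units) == text

broken = naive_truncate(units, 2)   # ends in a lone high surrogate
intact = safe_truncate(units, 2)    # drops the incomplete pair instead
print([hex(u) for u in broken])
print([hex(u) for u in intact])
```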

A./

Received on Tuesday, 6 June 2000 02:41:39 UTC