Re: BOM & Unicode editors from Asmus Freytag on 2000-06-05 (www-international@w3.org from April to June 2000)

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Mon, 05 Jun 2000 12:54:36 -0700
To: "Michael (michka) Kaplan" <michka@trigeminal.com>, "Martin J. Duerst" <duerst@w3.org>, "Saba Sundaramurthy" <ssundaramurthy@verisign.com>, "Chris Lilley" <chris@w3.org>
Cc: <mozilla-i18n@mozilla.org>, <www-international@w3.org>, <i18n-prog@acoin.com>
Message-Id: <4.2.0.58.20000605124627.01d98e28@popd.ix.netcom.com>

At 07:54 AM 6/5/00 -0700, Michael (michka) Kaplan wrote:
>There has long been controversy over the fact that MS products use "Unicode"
>to mean UCS-2

In the new, more precise terminology you would say that "MS products use 
'Unicode' to mean UTF-16". Since plain text files are prefixed with a BOM, 
the encoding is UTF-16 (internally tagged, endianness can be determined 
from the BOM) instead of UTF-16LE (little endian, externally tagged and no 
BOM 
allowed). There is, incidentally, no shorthand to describe "UTF-16 with BOM 
that I know (from other information) to be little endian".
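
To make the distinction concrete, here is a minimal Python sketch. It assumes the data is already known to be some form of UTF-16 (a UTF-32 little endian BOM begins with the same two bytes, so this is not a general encoding detector). The function name `sniff_utf16` and the sample text are made up for illustration. Python's "utf-16" codec plays the role of the internally tagged form and "utf-16-le" the externally tagged one.

```python
import codecs


def sniff_utf16(data: bytes) -> str:
    """Describe the byte order signalled by a leading BOM, if there is one.

    Hypothetical helper: assumes `data` is known to be UTF-16 of some kind.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "UTF-16 with BOM, little endian"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "UTF-16 with BOM, big endian"
    return "no BOM; byte order must come from external information"


text = "Hi"

# Internally tagged: a BOM followed by the code units in the order it signals.
with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Externally tagged: little endian code units only, no BOM.
without_bom = text.encode("utf-16-le")

# The "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
decoded_internal = with_bom.decode("utf-16")

# The "utf-16-le" codec treats a leading FF FE as an ordinary character,
# so the BOM survives as U+FEFF at the start of the text.
decoded_as_le = with_bom.decode("utf-16-le")
```

Decoding the BOM-prefixed bytes as "utf-16-le" leaves a stray U+FEFF in the result, which is why the BOM is not allowed in data labelled UTF-16LE.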

>and consider UTF-8 to be a multibyte encoding.

There is nothing wrong with this. UTF-8 is a very proper multibyte 
encoding. Its smallest interpretable element is a byte, and like all 
multibyte encodings, each character is encoded by a byte sequence which may 
have one of several lengths, in this case 1, 2, 3 or 4 bytes.
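
As a quick illustration of the four lengths, the following Python sketch encodes one sample character from each range. The sample characters are my own choice, not anything from the thread.

```python
# One sample character per UTF-8 sequence length (illustrative choices).
samples = [
    "\u0041",      # LATIN CAPITAL LETTER A, below U+0080
    "\u00e9",      # LATIN SMALL LETTER E WITH ACUTE, below U+0800
    "\u20ac",      # EURO SIGN, below U+10000
    "\U0001d11e",  # MUSICAL SYMBOL G CLEF, outside the BMP
]

# Number of bytes each character occupies when encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```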

The two distinguishing facts about UTF-8 are that it is self-synchronizing, 
which is a nice feature for a multibyte encoding, and that it can express 
all Unicode characters (identical subset).
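
Self-synchronization can be sketched in a few lines of Python. Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, and no lead byte does, so from any byte offset one can step backward to the start of the enclosing character without decoding from the beginning. The helper names `is_continuation` and `char_start` are hypothetical, and the sketch assumes well-formed UTF-8 input.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx, i.e. it is not a lead byte."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Return the offset of the first byte of the character containing index.

    Assumes `data` is well-formed UTF-8 and 0 <= index < len(data).
    """
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# "a", EURO SIGN, "b": one byte, three bytes, one byte.
data = "a\u20acb".encode("utf-8")

# Every offset inside the three-byte sequence resolves to its lead byte.
starts = [char_start(data, i) for i in range(len(data))]
```

Many older multibyte encodings lack this property, because a given byte value can be either a lead byte or a trail byte depending on what precedes it.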

A./

Received on Monday, 5 June 2000 15:46:22 UTC