Re: BOM & Unicode editors

At 07:54 AM 6/5/00 -0700, Michael (michka) Kaplan wrote:
>There has long been controversy over the fact that MS products use "Unicode"
>to mean UCS-2

In the new, more precise terminology you would say that "MS products use
'Unicode' to mean UTF-16". Since plain text files are prefixed with a BOM,
the encoding is UTF-16 (internally tagged, endianness can be determined
from the BOM) instead of UTF-16LE (little endian, externally tagged and no
BOM allowed). There is, incidentally, no shorthand to describe "UTF-16 with
a BOM that I know (from other information) to be little endian".
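
To make the tagged/untagged distinction concrete, here is a minimal
sketch in Python. The helper name sniff_utf16_byte_order is made up for
illustration; it only assumes the standard codecs module and its
"utf-16" and "utf-16-le" codecs.

```python
import codecs

def sniff_utf16_byte_order(data: bytes) -> str:
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    return "unknown"                           # no BOM: needs an external tag

# "UTF-16LE": externally tagged, the encoder writes no BOM.
untagged = "hi".encode("utf-16-le")

# "UTF-16": internally tagged, here built by hand with a little-endian BOM.
tagged = codecs.BOM_UTF16_LE + untagged

# The "utf-16" decoder reads the BOM and strips it.
decoded_tagged = tagged.decode("utf-16")

# The "utf-16-le" decoder does not treat the BOM specially, so it
# survives as the character U+FEFF at the start of the text.
decoded_as_le = tagged.decode("utf-16-le")
```

In this sketch the same bytes decode to different strings depending on
which label the reader assumes, which is the practical reason the two
names are worth keeping apart.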

>and consider UTF-8 to be a multibyte encoding.

There is nothing wrong with this. UTF-8 is a very proper multibyte
encoding. Its smallest interpretable element is a byte, and like all
multibyte encodings, each character is encoded by a byte sequence which may
have one of several lengths, in this case 1, 2, 3 or 4 bytes.
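
A small Python sketch of the four sequence lengths. The sample
characters are my own choice, one from each length class.

```python
# One sample character per UTF-8 sequence length (chosen for illustration).
samples = [
    "\u0041",      # LATIN CAPITAL LETTER A
    "\u00e9",      # LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # EURO SIGN
    "\U0001d11e",  # MUSICAL SYMBOL G CLEF (outside the BMP)
]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex(' ')}")
```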

The two distinguishing facts about UTF-8 are that it is self-synchronizing,
which is a nice feature for a multibyte encoding, and that it can express
all Unicode characters (identical subset).
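
Self-synchronization can be sketched as follows, assuming well-formed
input. Trailing bytes in UTF-8 always have the bit pattern 10xxxxxx, so
a reader dropped at an arbitrary offset only has to skip those to find
the start of the next character. The function name next_boundary is
hypothetical.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Hypothetical helper: first index >= pos that begins a character.

    Assumes data is well-formed UTF-8. Trailing bytes match 10xxxxxx,
    so they are skipped until a lead byte or the end of data is reached.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

text = "a\u20acb".encode("utf-8")   # bytes: 61 E2 82 AC 62

# Offset 2 lands in the middle of the three-byte EURO SIGN sequence.
resync = next_boundary(text, 2)

# Decoding from the resynchronized offset yields whole characters only.
tail = text[resync:].decode("utf-8")
```

At most three bytes need to be skipped for well-formed data, since no
sequence is longer than four bytes.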

A./

Received on Monday, 5 June 2000 15:46:22 UTC