RE: BOM & Unicode editors from Asmus Freytag on 2000-05-11 (www-international@w3.org from April to June 2000)

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Thu, 11 May 2000 16:34:51 -0700
To: Saba Sundaramurthy <ssundaramurthy@verisign.com>, "'Robert A. Rosenberg'" <rarpsl@flashcom.net>
Cc: mozilla-i18n@mozilla.org, www-international@w3.org, i18n-prog@acoin.com
Message-Id: <4.2.0.58.20000511162757.01e6dac0@popd.ix.netcom.com>

At 10:09 AM 5/11/00 -0700, Saba Sundaramurthy wrote:
>         UTF-8 characters may expand to any number of bytes (up to 6 for
>UCS-4), I don't think byte order is important since the sequence will be
>written out one byte at a time in the correct order.

Consensus is forming to restrict both UCS-4 and UTF-32 to the same code points
that can be reached with UTF-16. That would result in UTF-8 being limited
to 4 bytes maximum.*
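
As a rough illustration of the 4-byte limit, the Python sketch below encodes
the largest code point of each UTF-8 length class, ending at U+10FFFF, the
highest code point reachable through UTF-16 surrogate pairs. The names
`samples` and `utf8_lengths` are made up for this example.

```python
# Largest code point that fits in 1, 2, 3 and 4 UTF-8 bytes respectively.
# U+10FFFF is the highest code point UTF-16 can reach via surrogate pairs.
samples = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]

# Map each code point to the length of its UTF-8 encoding.
utf8_lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}

# The same top code point as a UTF-16 surrogate pair (two 16-bit units).
top_as_utf16 = chr(0x10FFFF).encode("utf-16-be")

for cp, length in utf8_lengths.items():
    print(f"U+{cp:04X}: {length} byte(s) in UTF-8")
print("U+10FFFF in UTF-16BE:", top_as_utf16.hex())
```

Under that restriction nothing above U+10FFFF is ever encoded, so the 5- and
6-byte forms mentioned in the quoted text would never be needed.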

>     As confirmed by Michka, the BOM is placed in UTF-8 files only as a
>'magic cookie'.
>

That is correct. While you need to know the byte order when converting from
UTF-16 or UTF-32 (aka UCS-4), once the data is in UTF-8 there is no
ambiguity about the arrangement of bytes, and the BOM is a 'signature', as
we like to call it. (It's also not the bytes 'FE' 'FF' but their UTF-8
transformation, 'EF' 'BB' 'BF'.)
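
To make the difference concrete, here is a small Python sketch that encodes
U+FEFF in the three forms. The variable names are only illustrative; the
`codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

bom = "\ufeff"  # U+FEFF, the code point used as byte order mark / signature

# In UTF-16 the two bytes swap with byte order, which is what makes the
# BOM useful for detecting endianness.
utf16_be = bom.encode("utf-16-be")  # b'\xfe\xff'
utf16_le = bom.encode("utf-16-le")  # b'\xff\xfe'

# In UTF-8 there is exactly one byte sequence, so it only acts as a signature.
utf8_sig = bom.encode("utf-8")      # b'\xef\xbb\xbf'

# A file that starts with the signature; the "utf-8-sig" codec strips it.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")
has_signature = data.startswith(codecs.BOM_UTF8)
text = data.decode("utf-8-sig")

print(utf16_be.hex(), utf16_le.hex(), utf8_sig.hex(), has_signature, text)
```

Decoding the same bytes with plain "utf-8" would leave U+FEFF as the first
character of the text, which is why a reader has to choose whether to strip it.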

A./

*(This will result in reducing the private use characters in UCS-4 to 
137,472 characters)
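
One way to arrive at that figure, sketched in Python. It assumes the count
covers the private use area in the BMP (U+E000 to U+F8FF) plus all of planes
15 and 16, including the last two code points of each plane; that counting
convention is an assumption, not something stated in the message.

```python
# Private use area inside the Basic Multilingual Plane: U+E000..U+F8FF.
bmp_private_use = 0xF8FF - 0xE000 + 1        # 6,400 code points

# Planes 15 and 16 (U+F0000..U+10FFFF), counted here as two whole planes.
plane_size = 0x10000                         # 65,536 code points per plane
supplementary_private_use = 2 * plane_size   # 131,072 code points

total_private_use = bmp_private_use + supplementary_private_use
print(f"{total_private_use:,}")
```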

Received on Thursday, 11 May 2000 19:46:01 UTC