RE: BOM & Unicode editors

At 10:09 AM 5/11/00 -0700, Saba Sundaramurthy wrote:
>         UTF-8 characters may expand to any number of bytes (up to 6 for
>UCS-4), I don't think byte order is important since the sequence will be
>written out one byte at a time in the correct order.

Consensus is forming to restrict both UCS-4 and UTF-32 to the same code points
that can be reached with UTF-16.* That would limit UTF-8 to a maximum of
4 bytes per character.
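
To make the 4-byte limit concrete, here is a minimal Python sketch. The constant and helper names are my own, chosen only for illustration; the one assumption is that U+10FFFF is the highest code point reachable with UTF-16. It encodes the top code point of each UTF-8 length class and reports the byte count.

```python
# Highest code point reachable with UTF-16 (end of plane 16).
MAX_UTF16_CODE_POINT = 0x10FFFF


def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))


# The last code point of each UTF-8 length class (1, 2, 3 and 4 bytes).
for cp in (0x7F, 0x7FF, 0xFFFF, MAX_UTF16_CODE_POINT):
    print(f"U+{cp:04X} -> {utf8_length(cp)} byte(s)")
```

Nothing at or below U+10FFFF needs more than 4 bytes, so the 5- and 6-byte forms would simply go unused under the restriction.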


>     As confirmed by Michka, the BOM is placed in UTF-8 files only as a
>'magic cookie'.
>


That is correct. While you need to know the byte order when converting from
UTF-16 or UTF-32 (aka UCS-4), once the data is in UTF-8 there is no
ambiguity about the arrangement of bytes, and the BOM is a 'signature', as
we like to call it. (It's also not the bytes 'FE' 'FF' but the UTF-8
transformation of U+FEFF: 'EF' 'BB' 'BF'.)
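
As a rough illustration of the point about the signature, the Python sketch below encodes U+FEFF three ways. The helper `has_utf8_signature` is a hypothetical name of mine, not something from any tool mentioned in this thread; `codecs.BOM_UTF8` is the standard-library constant for the UTF-8 form.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the byte order mark

# In UTF-16 the byte order changes the serialized form of the BOM...
utf16_be = BOM.encode("utf-16-be")  # bytes FE FF
utf16_le = BOM.encode("utf-16-le")  # bytes FF FE

# ...but in UTF-8 there is exactly one form: EF BB BF.
utf8_sig = BOM.encode("utf-8")


def has_utf8_signature(data: bytes) -> bool:
    """Report whether a byte string starts with the UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)


print(utf16_be.hex(), utf16_le.hex(), utf8_sig.hex())
print(has_utf8_signature(utf8_sig + b"hello"))
```

The signature carries no byte-order information in UTF-8; it only marks the data as (probably) UTF-8.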

A./

*(This will reduce the number of private use characters in UCS-4 to
137,472.)
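
For anyone who wants to check the arithmetic behind that figure, here is a small Python sketch. It assumes the remaining private use space is the BMP private use area (U+E000..U+F8FF) plus all of planes 15 and 16, and it counts every code point in those planes, including the last two of each plane, which are noncharacters. The variable names are mine.

```python
# Private use ranges that remain reachable with UTF-16 (assumed ranges).
BMP_PRIVATE_USE = range(0xE000, 0xF8FF + 1)   # private use area in the BMP
PLANE_15 = range(0xF0000, 0xFFFFF + 1)        # supplementary private use A
PLANE_16 = range(0x100000, 0x10FFFF + 1)      # supplementary private use B

# Count every code point in each range, noncharacters included.
total = len(BMP_PRIVATE_USE) + len(PLANE_15) + len(PLANE_16)

print(len(BMP_PRIVATE_USE), len(PLANE_15), len(PLANE_16), total)
```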

Received on Thursday, 11 May 2000 19:46:01 UTC