Re: UTF-8 signature / BOM in CSS from jcowan@reutershealth.com on 2003-12-07 (www-international@w3.org from October to December 2003)

From: <jcowan@reutershealth.com>
Date: Sat, 6 Dec 2003 23:48:51 -0500
To: Etan Wexler <ewexler@stickdog.com>
Cc: Chris Lilley <chris@w3.org>, www-international@w3.org, w3c-css-wg@w3.org, www-style@w3.org
Message-ID: <20031207044851.GB3264@skunk.reutershealth.com>

Etan Wexler scripsit:

> There is no encoding scheme UTF-16. There is a "charset" value in the 
> IANA registry called "UTF-16", but UTF-16 is an encoding form. Any 
> serialized UTF-16 document is either big-endian or little-endian. 
> Nevertheless, the "UTF-16" label is allowed and in use, so we resort to 
> the BOM to disambiguate the label's meaning. The situation with UTF-32 
> is analogous.

The meaning of the encoding scheme UTF-16 is that if the first two bytes are
0xFE 0xFF, then the rest is interpreted as big-endian UTF-16; if the first
two bytes are 0xFF 0xFE, then the rest is interpreted as little-endian UTF-16;
otherwise, the whole is interpreted as big-endian UTF-16.
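
A minimal Python sketch of that rule, for illustration only. The function
name decode_utf16_scheme is hypothetical, and the byte order is chosen by
hand rather than with Python's own "utf-16" codec, since that codec falls
back to the platform's native byte order when no BOM is present, not to
big-endian:

```python
def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes labeled with the encoding scheme "UTF-16".

    Hypothetical helper: the leading two bytes select the byte order,
    and big-endian is the fallback when no BOM is present.
    """
    if data[:2] == b"\xfe\xff":
        # Big-endian BOM: strip it, decode the rest as big-endian.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # Little-endian BOM: strip it, decode the rest as little-endian.
        return data[2:].decode("utf-16-le")
    # No BOM: the whole stream is interpreted as big-endian.
    return data.decode("utf-16-be")
```

With this sketch, b"\xfe\xff\x00A", b"\xff\xfeA\x00" and b"\x00A" should all
decode to the single character "A".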

In the encoding schemes UTF-16BE and UTF-16LE, the interpretation is always
big-endian or little-endian respectively; if the first character is U+FEFF,
then it is a ZWNBSP (ZERO WIDTH NO-BREAK SPACE) and part of the data stream,
not a byte order mark.
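
The contrast can be sketched in Python as well. The helper name
decode_utf16_fixed is hypothetical; it relies only on the standard
"utf-16-be" and "utf-16-le" codecs, which leave a leading U+FEFF in the
decoded text instead of consuming it:

```python
def decode_utf16_fixed(data: bytes, big_endian: bool) -> str:
    """Decode bytes labeled UTF-16BE or UTF-16LE.

    Hypothetical helper: the byte order comes from the label alone,
    so a leading U+FEFF is kept as an ordinary character.
    """
    codec = "utf-16-be" if big_endian else "utf-16-le"
    return data.decode(codec)


# The first character of each result is U+FEFF, followed by "A".
be_text = decode_utf16_fixed(b"\xfe\xff\x00A", big_endian=True)
le_text = decode_utf16_fixed(b"\xff\xfeA\x00", big_endian=False)
```

Here both be_text and le_text are two characters long, because the U+FEFF
is data rather than a signature.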

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  jcowan@reutershealth.com
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.

Received on Saturday, 6 December 2003 23:49:46 UTC