Re: byte order mark article from John Cowan on 2012-11-21 (www-international@w3.org from October to December 2012)

From: John Cowan <cowan@mercury.ccil.org>
Date: Wed, 21 Nov 2012 16:27:27 -0500
To: Anne van Kesteren <annevk@annevk.nl>
Cc: www-international@w3.org
Message-ID: <20121121212726.GA13361@mercury.ccil.org>

Anne van Kesteren scripsit:

> * Per my reading of the HTML specification you can use utf-16le and
> utf-16be without a BOM. It does not even require it for utf-16,
> although I suppose Unicode might (though Unicode is not very correct
> here with respect to what implementations do). 

Per Unicode, in UTF-16LE and UTF-16BE documents, there is no such
thing as a BOM.  If a UTF-16LE document begins FF FE, that means the
first character is U+FEFF, ZERO WIDTH NO-BREAK SPACE; likewise if
a UTF-16BE document begins FE FF.
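
As a rough illustration of that distinction, here is a small Python
sketch.  It assumes that Python's "utf-16-le", "utf-16-be" and "utf-16"
codecs follow the Unicode definitions on this point; the sample bytes
are made up for the example.

```python
# The letter "A" in both byte orders, preceded by the bytes that
# would be a BOM under the plain "utf-16" label.
le_bytes = b"\xff\xfeA\x00"
be_bytes = b"\xfe\xff\x00A"

# With an explicit byte order there is no BOM: U+FEFF is kept
# as the first character of the text.
as_le = le_bytes.decode("utf-16-le")
as_be = be_bytes.decode("utf-16-be")

# Under the plain "utf-16" label the same leading bytes are treated
# as a BOM and removed from the decoded text.
as_utf16_from_le = le_bytes.decode("utf-16")
as_utf16_from_be = be_bytes.decode("utf-16")

print(repr(as_le), repr(as_utf16_from_le))
print(repr(as_be), repr(as_utf16_from_be))
```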

In the UTF-16 encoding, a leading FF FE or FE FF is a BOM rather than a
character, and all following pairs of bytes are interpreted little-endian
or big-endian respectively.  If the first two bytes are neither of these,
a higher-level protocol must decide whether to interpret the pairs of
bytes as big- or little-endian.  If no higher-level protocol exists,
the interpretation is big-endian by default.
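
The rule above can be sketched in Python roughly as follows.  This is
only an illustration: the function name decode_utf16 and its "hint"
parameter (standing in for whatever a higher-level protocol reports)
are hypothetical and do not come from any specification or library.

```python
import codecs
from typing import Optional


def decode_utf16(data: bytes, hint: Optional[str] = None) -> str:
    """Decode bytes labelled "UTF-16" (hypothetical helper).

    hint is "le", "be" or None and stands in for a higher-level
    protocol's statement about byte order.
    """
    # A leading FF FE or FE FF is a BOM: it selects the byte order
    # and is not part of the text.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: defer to the higher-level protocol if it said anything.
    if hint in ("le", "be"):
        return data.decode("utf-16-" + hint)
    # No BOM and no higher-level protocol: big-endian by default.
    return data.decode("utf-16-be")
```

For example, decode_utf16(b"\x00A") falls through to the big-endian
default, while decode_utf16(b"A\x00", hint="le") relies on the hint.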

-- 
Unless it was by accident that I had            John Cowan
offended someone, I never apologized.           cowan@ccil.org
        --Quentin Crisp                         http://www.ccil.org/~cowan

Received on Wednesday, 21 November 2012 21:27:49 UTC