
New XML PE: UTF-8 BOM

From: John Cowan <cowan@ccil.org>
Date: Sat, 30 Dec 2006 14:54:27 -0500
To: public-xml-core-wg@w3.org
Message-ID: <20061230195427.GB23104@ccil.org>

I took up the question of the UTF-8 BOM with the Unicode Technical
Committee after carefully reading what the Unicode Standard
versions 4.0 and 5.0 have to say on the subject, thus:

> > Am I correct in thinking that a conformant process that reads <EF
> > BB BF> from the beginning of a byte stream that purports to be in
> > the UTF-8 encoding scheme has the choice of discarding it as a BOM
> > or accepting it as a ZWNBSP?

I did not request a formal interpretative ruling, but Ken Whistler,
one of the leading lights of the UTC, replied as follows:

> I think in isolation, the answer to that would have to be
> formally, yes, because <EF BB BF> at the start of a UTF-8
> byte stream is ambiguous.
> 
> In a more complex context, where you could specify a conversion
> going on between UTF-8 and one or more UTF-16 or UTF-32-based
> encoding schemes, you could specify some instances where either
> operation (discarding and not interpreting, or retaining and
> interpreting as ZWNBSP) could be conformant or non-conformant.
> It would depend on whether the operation willy-nilly changed
> an intended BOM into a ZWNBSP (or vice versa), or retained the
> intended meaning.

(Note that the Unicode term "encoding scheme" corresponds to the
IETF/W3C term "encoding".)
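
As an illustration of the two readings, here is a minimal sketch in
Python. It assumes the standard `utf-8` codec (which keeps the three
bytes as a character) and the `utf-8-sig` codec (which discards them as
a signature); the sample content `<doc/>` is made up for the example.

```python
import codecs

# EF BB BF followed by some content (the content is illustrative only).
data = codecs.BOM_UTF8 + "<doc/>".encode("utf-8")

# Reading 1: retain the bytes and interpret them as U+FEFF (ZWNBSP).
as_zwnbsp = data.decode("utf-8")

# Reading 2: discard the bytes as a byte order mark.
as_bom = data.decode("utf-8-sig")

print(repr(as_zwnbsp))  # '\ufeff<doc/>'
print(repr(as_bom))     # '<doc/>'
```

Both results come from the same byte stream, which is the ambiguity
described above.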

I understand this to mean that if we wish to *require* <EF BB BF>
to be interpreted as a BOM in a UTF-8 document (as I think we clearly
do), we must spell the requirement out in the XML Recommendations and
cannot rely on inheriting it from Unicode.  In the case of a document
entity, there is no ambiguity: U+FEFF cannot appear at the beginning
as a character, because the document production does not allow it there.
For an external parsed entity, however, U+FEFF *can* appear at the
beginning.  I therefore propose the following language as a new
paragraph in 4.3.3 for both XML 1.0 and XML 1.1:

	If the replacement text of an external entity is to
	begin with the character U+FEFF, and no text declaration
	is present, then a Byte Order Mark MUST be present,
	whether the entity is encoded in UTF-8 or UTF-16.
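
The effect of this wording can be sketched in Python as follows. The
helpers `encode_entity` and `decode_entity` are hypothetical names, not
part of any XML library; the sketch assumes no text declaration is
present and that the reader always treats one leading BOM as a
signature and drops it.

```python
import codecs

# BOM byte sequences for the encodings used in this sketch.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
}

def encode_entity(text, encoding="utf-8"):
    """Hypothetical writer: emit a BOM when the text begins with U+FEFF."""
    body = text.encode(encoding)
    if text.startswith("\ufeff"):
        return BOMS[encoding] + body
    return body

def decode_entity(data, encoding="utf-8"):
    """Hypothetical reader: drop one leading BOM, then decode the rest."""
    bom = BOMS[encoding]
    if data.startswith(bom):
        data = data[len(bom):]
    return data.decode(encoding)

text = "\ufeffhello"

# With the BOM, the leading U+FEFF survives the round trip.
assert decode_entity(encode_entity(text)) == text

# Without it, the reader consumes the character as if it were a BOM.
assert decode_entity(text.encode("utf-8")) == "hello"
```

In UTF-8 the encoded entity then begins with <EF BB BF EF BB BF>: the
first three bytes are the BOM, the next three are the U+FEFF character
itself.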

-- 
Barry gules and argent of seven and six,        John Cowan
on a canton azure fifty molets of the second.   cowan@ccil.org
        --blazoning the U.S. flag               http://www.ccil.org/~cowan
Received on Saturday, 30 December 2006 19:54:48 GMT
