- From: John Cowan <cowan@ccil.org>
- Date: Sat, 30 Dec 2006 14:54:27 -0500
- To: public-xml-core-wg@w3.org
I took up the question of the UTF-8 BOM with the Unicode Technical
Committee after carefully reading what the Unicode Standard versions
4.0 and 5.0 have to say on the subject, thus:

> > Am I correct in thinking that a conformant process that reads <EF
> > BB BF> from the beginning of a byte stream that purports to be in
> > the UTF-8 encoding scheme has the choice of discarding it as a BOM
> > or accepting it as a ZWNBSP?

I did not request a formal interpretative ruling, but Ken Whistler,
one of the leading lights of the UTC, replied as follows:

> I think in isolation, the answer to that would have to be
> formally, yes, because <EF BB BF> at the start of a UTF-8
> byte stream is ambiguous.
>
> In a more complex context, where you could specify a conversion
> going on between UTF-8 and one or more UTF-16 or UTF-32-based
> encoding schemes, you could specify some instances where either
> operation (discarding and not interpreting, or retaining and
> interpreting as ZWNBSP) could be conformant or non-conformant.
> It would depend on whether the operation willy-nilly changed
> an intended BOM into a ZWNBSP (or vice versa), or retained the
> intended meaning.

(Note that the Unicode term "encoding scheme" corresponds to the
IETF/W3C term "encoding".)

I understand this to mean that if we wish to *require* <EF BB BF>
to be interpreted as a BOM in a UTF-8 document (as I think we clearly
do) we must spell the requirement out in the XML Recommendations
and cannot rely on inheriting it from Unicode.

In the case of a document entity, there is no ambiguity: U+FEFF cannot
appear at the beginning. For an external entity, however, U+FEFF
*can* appear at the beginning. I therefore propose the following
language as a new paragraph in 4.3.3 for both XML 1.0 and XML 1.1:

        If the replacement text of an external entity is to
        begin with the character U+FEFF, and no text declaration
        is present, then a Byte Order Mark MUST be present,
        whether the entity is encoded in UTF-8 or UTF-16.
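To illustrate the two readings of <EF BB BF> and the effect of the proposed rule, here is a minimal sketch using Python's standard codecs. It covers UTF-8 only, and the function name encode_external_entity is hypothetical, introduced purely for illustration; it is not taken from any XML processor.

```python
import codecs

# <EF BB BF> followed by some content, as it might arrive in a byte stream.
data = codecs.BOM_UTF8 + "<doc/>".encode("utf-8")

# Reading 1: retain the bytes and interpret them as U+FEFF (ZWNBSP).
as_zwnbsp = data.decode("utf-8")

# Reading 2: discard the bytes as a BOM ("utf-8-sig" strips one leading BOM).
as_bom = data.decode("utf-8-sig")


def encode_external_entity(replacement_text: str) -> bytes:
    """Hypothetical helper: serialize an external entity that has no text
    declaration as UTF-8, following the proposed rule."""
    body = replacement_text.encode("utf-8")
    if replacement_text.startswith("\ufeff"):
        # The text itself begins with U+FEFF, so a BOM must precede it;
        # otherwise a reader that discards a BOM would lose the character.
        return codecs.BOM_UTF8 + body
    return body


if __name__ == "__main__":
    print(repr(as_zwnbsp))
    print(repr(as_bom))
    print(encode_external_entity("\ufeffx"))
```

Under this sketch, an entity whose replacement text begins with U+FEFF is serialized as <EF BB BF EF BB BF ...>, so a reader that always discards one leading BOM still recovers the intended U+FEFF.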
-- 
Barry gules and argent of seven and six,        John Cowan
on a canton azure fifty molets of the second.   cowan@ccil.org
        --blazoning the U.S. flag               http://www.ccil.org/~cowan
Received on Saturday, 30 December 2006 19:54:48 UTC