Re: UTF-8 with Unicode line separator and BOM

Terje Bless <link@tss.no> wrote:

> >That's why the validator correctly reports errors (apart from BOM).
> 
> So there isn't any reason that it should be barfing on the BOM?

Actually, IMHO, this falls into a "crack" between the First Edition
(REC-xml-19980210) and the Second Edition (REC-xml-20001006) of XML 1.0.

"F. Autodetection of Character Encodings" of REC-xml-19980210,
though non-normative, described an algorithm for autodetecting the
character encoding.  It made no mention of a BOM in UTF-8, so it would
not be unreasonable to report the byte sequence EF BB BF at the
beginning of an XML entity as an error.  I looked at the source code
of SP 1.3.4 as well as 1.3, and it seems the XMLDecoder class is based
on Appendix F of REC-xml-19980210.

  cf. http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing

Appendix F of REC-xml-20001006, however, does cover the case where
a BOM is used with UTF-8.  Appendix F was completely rewritten in
REC-xml-20001006, and I think this is the most significant change
between REC-xml-19980210 and REC-xml-20001006.

  cf. http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
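
To illustrate the difference, here is a rough Python sketch of
signature-based autodetection.  It is only my simplified reading of the
two appendices, not SP's XMLDecoder code: the names sniff_xml_encoding,
decode_entity and the second_edition flag are made up for this example,
and the UCS-4 and EBCDIC signatures are left out.

```python
import codecs

def sniff_xml_encoding(head: bytes, second_edition: bool = True):
    """Guess (encoding, bytes_to_skip) from the first bytes of an entity.

    Hypothetical helper: second_edition=False imitates a detector that
    does not know about the UTF-8 BOM.
    """
    if second_edition and head.startswith(codecs.BOM_UTF8):  # EF BB BF
        return ("utf-8", 3)
    if head.startswith(codecs.BOM_UTF16_BE):                 # FE FF
        return ("utf-16-be", 2)
    if head.startswith(codecs.BOM_UTF16_LE):                 # FF FE
        return ("utf-16-le", 2)
    if head.startswith(b"\x00<\x00?"):       # UTF-16 big-endian, no BOM
        return ("utf-16-be", 0)
    if head.startswith(b"<\x00?\x00"):       # UTF-16 little-endian, no BOM
        return ("utf-16-le", 0)
    # "<?xm" in an ASCII-compatible encoding, or anything else:
    # assume UTF-8 until an encoding declaration says otherwise.
    return ("utf-8", 0)

def decode_entity(data: bytes, second_edition: bool = True) -> str:
    """Decode an entity, skipping whatever signature the sniffer recognised."""
    encoding, skip = sniff_xml_encoding(data[:4], second_edition)
    return data[skip:].decode(encoding)

doc = codecs.BOM_UTF8 + b'<?xml version="1.0"?><a/>'
with_bom_support = decode_entity(doc, second_edition=True)
without_bom_support = decode_entity(doc, second_edition=False)
```

With second_edition=True the decoded text starts with "<?xml".  With
second_edition=False the EF BB BF bytes are decoded as an ordinary
U+FEFF character in front of the XML declaration, which is the sort of
input a parser could then reasonably reject.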

So, according to the Second Edition of XML 1.0, the validator should
not be barfing on the BOM.

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Tuesday, 17 October 2000 13:48:10 UTC