Re: UTF-8 with Unicode line separator and BOM

From: Masayasu Ishikawa <mimasa@w3.org>
Date: Wed, 18 Oct 2000 02:47:31 +0900
To: link@tss.no
Cc: christian.ottosson@kurir.net, plh@w3.org, www-validator@w3.org
Message-Id: <20001018024731P.mimasa@w3.mag.keio.ac.jp>

Terje Bless <link@tss.no> wrote:

> >That's why the validator correctly reports errors (apart from BOM).
> So there isn't any reason that it should be barfing on the BOM?

Actually, IMHO, this falls into a "crack" between the First Edition
(REC-xml-19980210) and the Second Edition (REC-xml-20001006) of XML 1.0.

"F. Autodetection of Character Encodings" of REC-xml-19980210,
though non-normative, provided an algorithm for autodetecting the
character encoding.  It made no mention of a BOM in UTF-8, so it would
not be unreasonable to report the byte sequence EF BB BF at the
beginning of an XML entity as an error.  I looked at the source code
of SP 1.3.4 as well as 1.3, and it seems the XMLDecoder class is based
on Appendix F of REC-xml-19980210.

  cf. http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing

Appendix F of REC-xml-20001006, however, does mention the case when
the BOM is used in UTF-8.  Appendix F was completely rewritten in
REC-xml-20001006, and I think this is the most significant change
between REC-xml-19980210 and REC-xml-20001006.

  cf. http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

So, according to the Second Edition of XML 1.0, the validator should
not be barfing on the BOM.
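
To illustrate the BOM cases listed in Appendix F of REC-xml-20001006,
here is a minimal sketch in Python.  It is only an illustration, not
the code of SP's XMLDecoder or of the validator; the names BOM_TABLE
and sniff_bom are made up for this example, and the cases without a
BOM (e.g. guessing from "<?xml") are left out.

```python
# Hypothetical sketch of the "with a BOM" cases from Appendix F of
# REC-xml-20001006. Not taken from SP or the validator.

# Four-byte signatures come first so that FF FE 00 00 is reported as
# UCS-4 rather than being mistaken for a UTF-16 BOM followed by NULs.
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "UCS-4 (big-endian, 1234 order)"),
    (b"\xff\xfe\x00\x00", "UCS-4 (little-endian, 4321 order)"),
    (b"\x00\x00\xff\xfe", "UCS-4 (unusual order 2143)"),
    (b"\xfe\xff\x00\x00", "UCS-4 (unusual order 3412)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    entity = b"\xef\xbb\xbf<?xml version='1.0'?><doc/>"
    name, skip = sniff_bom(entity)
    # Skip the BOM before handing the rest to the parser.
    print(name, entity[skip:].decode("utf-8"))
```

A decoder written against the First Edition's table would have no
EF BB BF entry, which would explain why those three bytes end up being
reported as an error.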

Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium
Received on Tuesday, 17 October 2000 13:48:10 UTC