Re: UTF-8 with Unicode line separator and BOM from Masayasu Ishikawa on 2000-10-17 (www-validator@w3.org from October 2000)

From: Masayasu Ishikawa <mimasa@w3.org>
Date: Wed, 18 Oct 2000 02:47:31 +0900
To: link@tss.no
Cc: christian.ottosson@kurir.net, plh@w3.org, www-validator@w3.org
Message-Id: <20001018024731P.mimasa@w3.mag.keio.ac.jp>

Terje Bless <link@tss.no> wrote:

> >That's why the validator correctly reports errors (apart from BOM).
> 
> So there isn't any reason that it should be barfing on the BOM?

Actually this is a "crack" between the First Edition (REC-xml-19980210)
and the Second Edition (REC-xml-20001006) of XML 1.0, IMHO.

"F. Autodetection of Character Encodings" of REC-xml-19980210,
though non-normative, provided an algorithm for autodetecting the
character encoding.  It made no mention of a BOM in UTF-8, so it would
not be unreasonable to report the byte sequence EF BB BF at the
beginning of an XML entity as an error.  I looked at the source code
of SP 1.3.4 as well as 1.3, and it seems the XMLDecoder class is based
on Appendix F of REC-xml-19980210.

  cf. http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing

Appendix F of REC-xml-20001006, however, does mention the case when
the BOM is used in UTF-8.  Appendix F was completely rewritten in
REC-xml-20001006, and I think this is the most significant change
between REC-xml-19980210 and REC-xml-20001006.

  cf. http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
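
To illustrate, the BOM cases of the rewritten Appendix F could be
sketched roughly as follows.  This is only an illustration in Python,
not the XMLDecoder code from SP; the function name guess_encoding and
the returned labels are made up for the example, and the cases without
a BOM are reduced to a single "<?xm" check.

```python
# Sketch only: the BOM cases from Appendix F of REC-xml-20001006.
# guess_encoding and its return labels are hypothetical names.

def guess_encoding(data: bytes) -> str:
    """Guess an entity's encoding family from its first bytes."""
    # The four-byte UCS-4 patterns are tested first, because
    # FF FE 00 00 and FE FF 00 00 also begin with a UTF-16 BOM.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UCS-4 (big-endian, 1234)"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UCS-4 (little-endian, 4321)"
    if data.startswith(b"\x00\x00\xff\xfe"):
        return "UCS-4 (unusual octet order, 2143)"
    if data.startswith(b"\xfe\xff\x00\x00"):
        return "UCS-4 (unusual octet order, 3412)"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16 (big-endian)"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16 (little-endian)"
    # The case the First Edition did not mention: a BOM in UTF-8.
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (with BOM)"
    # No BOM: an ASCII-compatible "<?xm" means the encoding
    # declaration has to be read to decide the exact encoding.
    if data.startswith(b"<?xm"):
        return "ASCII-compatible (read the encoding declaration)"
    return "UTF-8 (no BOM, no declaration)"
```

With a check like this, an entity starting with EF BB BF 3C 3F 78 6D 6C
is treated as UTF-8 with a BOM rather than rejected at the first byte.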

So, according to the Second Edition of XML 1.0, the validator should
not be barfing on the BOM.

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Tuesday, 17 October 2000 13:48:10 UTC