Re: UTF-8 with Unicode line separator and BOM from Masayasu Ishikawa on 2000-10-23 (www-validator@w3.org from October 2000)

From: Masayasu Ishikawa <mimasa@w3.org>
Date: Mon, 23 Oct 2000 13:41:44 +0900
To: www-validator@w3.org
Message-Id: <20001023134144K.mimasa@w3.mag.keio.ac.jp>

Terje Bless <link@tss.no> wrote:

> IOW: SP has not yet been updated to recognize the BOM as it was only really
> standardized, lesse, two weeks ago.

Well, not really two weeks ago - the XML 1.0 Second Edition is
supposed to be the same as the XML 1.0 First Edition as corrected
by the XML 1.0 Specification Errata.

  cf. http://www.w3.org/XML/xml-19980210-errata

BOM in UTF-8 was first mentioned in E44 (which was superseded by E105),
dated 2000-01-06.  So it's been there for about 9 months.  But anyway,
yes, SP has not yet been updated to recognize the BOM in UTF-8.

  cf. http://www.w3.org/XML/xml-19980210-errata#E44
      http://www.w3.org/XML/xml-19980210-errata#E105

> And since this is still version 1.0 of
> XML it's impossible to tell if the document is written for "XML 1.0 First
> Edition" or "XML 1.0 Second Edition" so you have to try sniffing for the
> BOM for all XML 1.0 documents and -- until SP is updated (if it's ever
> updated) -- manually suppress the error?

We are planning to enhance support for various character encodings,
by converting them to UTF-8 before validation.  Similarly, BOM in
UTF-8 could be removed before validation so that SP won't be barfing
on it.
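
The BOM removal described above can be sketched as follows. This is only
an illustration in Python, not the validator's actual code; the helper
name strip_utf8_bom is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present.

    Hypothetical helper: a preprocessing step that could run before the
    document is handed to a parser that does not recognize the BOM.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only a BOM at the very start of the entity is removed; a U+FEFF that
appears later in the data is left untouched.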

BTW, back to one of the original questions,

Christian Ottosson <christian.ottosson@kurir.net> wrote:

> Do you 
> recommend the use of the BOM, as a UTF-8 signature, or should it be 
> omitted?

*Personally* I would recommend NOT using the BOM in UTF-8 whenever
character encoding information can be provided by other means.  And
in XML, detecting that an XML entity is encoded in UTF-8 can be done
without the BOM.
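
As a rough illustration of detection without the BOM, the sketch below
(Python, with a hypothetical function name guess_xml_encoding) looks at
the first bytes and the encoding declaration, and otherwise falls back
to UTF-8. It assumes no external encoding information (such as an HTTP
charset parameter) is available, and it deliberately ignores UTF-32,
EBCDIC and UTF-16 without a BOM, so it is far from a complete
implementation of the autodetection described in the XML 1.0
specification.

```python
import codecs
import re

# Encoding declaration inside an XML declaration at the very start of the data.
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its leading bytes (sketch)."""
    # A BOM, when present, settles the question directly.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # No BOM: an ASCII-compatible "<?xml" lets us read the declaration.
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither BOM nor encoding declaration: assume UTF-8.
    return "utf-8"
```

For example, guess_xml_encoding(b'<?xml version="1.0"?><doc/>') yields
"utf-8" even though the data carries no BOM.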

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Monday, 23 October 2000 00:41:42 UTC