Re: Strange advice re BOM and UTF-8

From: Chris Lilley <chris@w3.org>
Date: Thu, 7 Dec 2006 13:35:36 +0100
Message-ID: <1523046432.20061207133536@w3.org>
To: olivier Thereaux <ot@w3.org>
Cc: www-validator@w3.org, www-international@w3.org

On Wednesday, December 6, 2006, 4:09:31 PM, olivier wrote:

oT> Hi Chris,

oT> On Dec 6, 2006, at 23:35 , Chris Lilley wrote:
>> I was surprised to see, on the W3C DTD validator, the following  
>> advice:
>>   The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
>>   cause problems for some text editors and older browsers. You may
>>   want to consider avoiding its use until it is better supported.
>> This is odd because the use of a BOM with UTF-8 files is
>> a) standards compliant, to Unicode and to XML and to CSS
>> b) common practice
>> c) allows text editors to auto-detect the encoding of a plain text
>> document.
>> I believe therefore that the advice is incorrect and indeed
>> potentially damaging.

oT> I am not an expert so all my knowledge about UTF-8 with BOM comes  
oT> from hearsay and some documentation I have read, and the picture I  
oT> was having so far was pointing toward the fact that the BOM for utf-8
oT> was not very necessary (it is only a signature, not a mention of byte
oT> order, isn't it?),

It is indeed a signature. It has therefore moved from being
theoretically possible but rarely used to being common.

As an example, Notepad on Windows 2000 and Windows XP uses it to tell
the difference between a UTF-8 text file and a system codepage text file.
So if you edit in Notepad and save as UTF-8 you will get a BOM. To
avoid getting one, you need to save as some other encoding. This is
not desirable.
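
To make the signature idea concrete, here is a minimal sketch of that
kind of check. It is not Notepad's actual algorithm; the function name
guess_text_encoding and the "cp1252" fallback (standing in for "the
system codepage") are assumptions for illustration only.

```python
import codecs


def guess_text_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Pick a codec name for a plain text file from its first bytes.

    Hypothetical helper: 'fallback' stands in for the system codepage.
    """
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes UTF-8 and drops the leading signature.
        return "utf-8-sig"
    return fallback


# Encoding with "utf-8-sig" writes the signature, as Notepad does.
sample = "café".encode("utf-8-sig")
text = sample.decode(guess_text_encoding(sample))
```

Without the signature, the same bytes would be decoded with the
fallback codepage and the accented character would come out wrong.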

oT>  and indeed sometimes (although perhaps more and  
oT> more rarely) harmful because of implementations that do not  
oT> understand the mark.

That's rather old hearsay now, especially since the Unicode Consortium
clarified the use of the BOM for UTF-8 and since XML (around 3rd
edition, IIRC) made a similar clarification.

oT> Docs I know include:
oT> http://www.w3.org/International/questions/qa-utf8-bom
oT> http://unicode.org/unicode/faq/utf_bom.html#BOM
oT> and both seem to point towards a cautious usage of a BOM for utf-8,  
oT> or no usage at all

oT> Do you have other references worth reading on the topic?

F Autodetection of Character Encodings (Non-Normative)

which notes that the presence of EF BB BF means the stream can be
confidently assumed to be UTF-8, while in the absence of both a BOM and
an XML encoding declaration, the stream is either "UTF-8 without an
encoding declaration, or else the data stream is mislabeled (lacking a
required encoding declaration), corrupt, fragmentary, or enclosed in a
wrapper of some kind"
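
A rough sketch of the cases just described, covering only the UTF-8
signature and ASCII-compatible streams. The full appendix also handles
UTF-16, UCS-4 and EBCDIC byte patterns, which are left out here. The
function name sniff_xml_encoding and the simplified regular expression
are assumptions for illustration, not text from the spec.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Simplified match for encoding="..." in an ASCII-compatible XML
# declaration at the very start of the stream (an assumption, not the
# spec's grammar).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes."""
    # EF BB BF: the stream can be confidently assumed to be UTF-8.
    if data.startswith(UTF8_BOM):
        return "utf-8"
    # No BOM: use the encoding declaration if one is present.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither BOM nor declaration: treat as UTF-8 without an encoding
    # declaration (or the stream is mislabeled, corrupt, and so on).
    return "utf-8"
```

The last branch is the weakest guess of the three, which is why a
signature at the start of the stream is useful.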

I18n comments on CSS 2.1

"Mention should be made of the Unicode BOM and its relationship to the
encoding of the file. Is BOM allowed?"

CSS 2.1

When a style sheet resides in a separate file, user agents must
observe the following priorities when determining a style sheet's
character encoding (from highest priority to lowest):

   1. An HTTP "charset" parameter in a "Content-Type" field (or similar
      parameters in other protocols)
   2. BOM and/or @charset (see below)
   3. <link charset=""> or other metadata from the linking mechanism (if any)
   4. charset of referring style sheet or document (if any)
   5. Assume UTF-8
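
The priority order above can be sketched as follows. This is a
simplified illustration: step 2 only recognises the UTF-8 signature and
an ASCII-compatible @charset rule at the very start of the file, and it
does not deal with a BOM and an @charset rule that disagree. The
function and parameter names are hypothetical.

```python
import codecs
import re
from typing import Optional

# An @charset rule at the very start of the sheet, ASCII-compatible only.
_CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def stylesheet_encoding(
    data: bytes,
    http_charset: Optional[str] = None,
    link_charset: Optional[str] = None,
    referrer_charset: Optional[str] = None,
) -> str:
    """Resolve a style sheet's encoding, highest priority first."""
    # 1. HTTP "charset" parameter in the Content-Type field.
    if http_charset:
        return http_charset
    # 2. BOM and/or @charset (only the UTF-8 signature is checked here).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    match = _CHARSET_RULE.match(data)
    if match:
        return match.group(1).decode("ascii", "replace")
    # 3. <link charset=""> or other metadata from the linking mechanism.
    if link_charset:
        return link_charset
    # 4. Charset of the referring style sheet or document.
    if referrer_charset:
        return referrer_charset
    # 5. Assume UTF-8.
    return "utf-8"
```

Note that the BOM sits at step 2, so it outranks everything except the
HTTP header.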


 Chris Lilley                    mailto:chris@w3.org
 Interaction Domain Leader
 Co-Chair, W3C SVG Working Group
 W3C Graphics Activity Lead
 Co-Chair, W3C Hypertext CG
Received on Thursday, 7 December 2006 12:35:56 UTC
