Re: Strange advice re BOM and UTF-8

On Wednesday, December 6, 2006, 4:09:31 PM, olivier wrote:

oT> Hi Chris,

oT> On Dec 6, 2006, at 23:35 , Chris Lilley wrote:
>> I was surprised to see, on the W3C DTD validator, the following  
>> advice:
>>
>>   The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
>>   cause problems for some text editors and older browsers. You may
>>   want to consider avoiding its use until it is better supported.
>>
>> This is odd because the use of a BOM with UTF-8 files is
>>
>> a) standards compliant, to Unicode and to XML and to CSS
>> b) common practice
>> c) allows text editors to auto-detect the encoding of a plain text
>> document.
>>
>> I believe therefore that the advice is incorrect and indeed
>> potentially damaging.

oT> I am not an expert so all my knowledge about UTF-8 with BOM comes  
oT> from hearsay and some documentation I have read, and the picture I  
oT> was having so far was pointing toward the fact that the BOM for utf-8
oT> was not very necessary (it is only a signature, not a mention of byte
oT> order, isn't it?),

It is indeed a signature. It has moved from being theoretically
possible but rarely used to being common.

As an example, Notepad on Windows 2000 and Windows XP uses it to tell
the difference between a UTF-8 text file and a system codepage text
file. So if you edit in Notepad and save as UTF-8 you will get a BOM.
To avoid getting one, you need to save as some other encoding, which
is not desirable.
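
To illustrate the kind of check involved, here is a minimal sketch in
Python. It is not Notepad's actual logic; the function name and the
cp1252 fallback (standing in for "the system codepage") are
assumptions made for the example.

```python
import codecs
from typing import Tuple


def decode_text_file(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode raw file bytes, using the UTF-8 signature to pick the encoding.

    Returns (text, encoding_used). The fallback codepage is an assumption;
    a real editor would ask the operating system for it.
    """
    if data.startswith(codecs.BOM_UTF8):
        # EF BB BF present: treat the rest as UTF-8 and drop the signature.
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    # No signature: assume the legacy codepage. Bytes that are undefined
    # in that codepage are replaced rather than raising an error.
    return data.decode(fallback, errors="replace"), fallback
```

Without the signature, the same bytes could plausibly be read either
way, which is why the editor relies on it.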

oT>  and indeed sometimes (although perhaps more and  
oT> more rarely) harmful because of implementations that do not  
oT> understand the mark.

That's rather old hearsay now, especially since the Unicode Consortium
clarified the use of the BOM for UTF-8 and since XML (around 3rd
edition, IIRC) made a similar clarification.



oT> Docs I know include:
oT> http://www.w3.org/International/questions/qa-utf8-bom
oT> http://unicode.org/unicode/faq/utf_bom.html#BOM
oT> and both seem to point towards a cautious usage of a BOM for utf-8,  
oT> or no usage at all

oT> Do you have other references worth reading on the topic?

XML 1.0, Appendix F: Autodetection of Character Encodings (Non-Normative)
http://www.w3.org/TR/xml/#sec-guessing

which notes that a leading EF BB BF means the stream can be
confidently assumed to be UTF-8, while in the absence of both a BOM
and an XML encoding declaration the stream is either "UTF-8 without an
encoding declaration, or else the data stream is mislabeled (lacking a
required encoding declaration), corrupt, fragmentary, or enclosed in a
wrapper of some kind"
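
A rough sketch of the BOM part of that autodetection, in Python. It
covers only the byte-order-mark cases, not the full table in the
appendix, and the function name is made up for the example.

```python
import codecs
from typing import Optional

# Longer marks are listed first, because the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(head: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none.

    None means the caller still has to look for an XML encoding
    declaration, and treat the stream as UTF-8 if there is none.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

So a stream starting with EF BB BF is reported as UTF-8 before any of
its content has to be parsed.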

I18n comments on CSS 2.1
http://www.w3.org/International/2005/05/css2-1-review.html

"Mention should be made of the Unicode BOM and its relationship to the
encoding of the file. Is BOM allowed?"

CSS 2.1
http://www.w3.org/TR/CSS21/syndata.html#q23

When a style sheet resides in a separate file, user agents must
observe the following priorities when determining a style sheet's
character encoding (from highest priority to lowest):

   1. An HTTP "charset" parameter in a "Content-Type" field (or similar
      parameters in other protocols)
   2. BOM and/or @charset (see below)
   3. <link charset=""> or other metadata from the linking mechanism (if any)
   4. charset of referring style sheet or document (if any)
   5. Assume UTF-8
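
The priority order above could be sketched like this in Python. This
is a simplification, not the CSS 2.1 algorithm itself: step 2 here
only recognises a UTF-8 BOM or a leading ASCII-compatible @charset
rule, whereas the spec defines a more detailed byte-pattern table. The
function and parameter names are assumptions made for the example.

```python
import codecs
import re
from typing import Optional

# Simplified: only matches an @charset rule at the very start of the
# sheet, written in ASCII-compatible bytes.
_AT_CHARSET = re.compile(rb'@charset "([^"]*)";')


def _bom_or_at_charset(data: bytes) -> Optional[str]:
    """Step 2: in-band information from the style sheet itself."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    match = _AT_CHARSET.match(data)
    if match:
        return match.group(1).decode("ascii", errors="replace")
    return None


def stylesheet_encoding(
    data: bytes,
    http_charset: Optional[str] = None,
    link_charset: Optional[str] = None,
    referrer_charset: Optional[str] = None,
) -> str:
    """Pick an encoding for an external style sheet, highest priority first."""
    if http_charset:                      # 1. HTTP Content-Type charset
        return http_charset
    in_band = _bom_or_at_charset(data)    # 2. BOM and/or @charset
    if in_band:
        return in_band
    if link_charset:                      # 3. <link charset=""> or similar
        return link_charset
    if referrer_charset:                  # 4. referring sheet or document
        return referrer_charset
    return "utf-8"                        # 5. assume UTF-8
```

For instance, a sheet served without an HTTP charset but starting with
EF BB BF resolves to UTF-8 at step 2, before any linking metadata is
consulted.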

   

-- 
 Chris Lilley                    mailto:chris@w3.org
 Interaction Domain Leader
 Co-Chair, W3C SVG Working Group
 W3C Graphics Activity Lead
 Co-Chair, W3C Hypertext CG

Received on Thursday, 7 December 2006 12:35:50 UTC