- From: Chris Lilley <chris@w3.org>
- Date: Thu, 7 Dec 2006 13:35:36 +0100
- To: olivier Thereaux <ot@w3.org>
- Cc: www-validator@w3.org, www-international@w3.org
On Wednesday, December 6, 2006, 4:09:31 PM, olivier wrote:

oT> Hi Chris,

oT> On Dec 6, 2006, at 23:35 , Chris Lilley wrote:

>> I was surprised to see, on the W3C DTD validator, the following
>> advice:
>>
>> The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
>> cause problems for some text editors and older browsers. You may
>> want to consider avoiding its use until it is better supported.
>>
>> This is odd because the use of a BOM with UTF-8 files is
>>
>> a) standards compliant, to Unicode and to XML and to CSS
>> b) common practice
>> c) allows text editors to auto-detect the encoding of a plain text
>> document.
>>
>> I believe therefore that the advice is incorrect and indeed
>> potentially damaging.

oT> I am not an expert so all my knowledge about UTF-8 with BOM comes
oT> from hearsay and some documentation I have read, and the picture I
oT> was having so far was pointing toward the fact that the BOM for utf-8
oT> was not very necessary (it is only a signature, not a mention of byte
oT> order, isn't it?),

It is indeed a signature. It has therefore moved from being
theoretically possible but rarely used to being common.

As an example, Notepad on Windows 2000 and Windows XP uses it to tell
the difference between a UTF-8 text file and a system codepage text
file. So if you edit in Notepad and save as UTF-8 you will get a BOM.
To avoid getting one, you need to save as some other encoding. This is
not desirable.

oT> and indeed sometimes (although perhaps more and
oT> more rarely) harmful because of implementations that do not
oT> understand the mark.

That's rather old hearsay now, especially since the Unicode Consortium
clarified the use of the BOM for UTF-8 and since XML (around 3rd
edition, IIRC) made a similar clarification.
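
As a rough illustration of the signature check described above, here is a minimal Python sketch of how an editor might tell a UTF-8 file from a system codepage file by looking for the BOM. The function name `decode_text` and the `cp1252` fallback are assumptions made for the example; this is not a description of how Notepad is actually implemented.

```python
import codecs

# The UTF-8 signature is the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def decode_text(data, fallback="cp1252"):
    """Return (text, encoding_used) for a plain text byte string.

    Assumption for this sketch: anything without a UTF-8 BOM is
    treated as being in the system codepage, here cp1252.
    """
    if data.startswith(UTF8_BOM):
        # Signature present: strip it and decode as UTF-8.
        return data[len(UTF8_BOM):].decode("utf-8"), "utf-8"
    # No signature: fall back to the assumed system codepage.
    return data.decode(fallback), fallback
```

The same byte sequence for an accented character decodes differently in the two cases, which is why a signature at the start of the file is useful to an editor that has no other metadata to go on.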
oT> Docs I know include:
oT> http://www.w3.org/International/questions/qa-utf8-bom
oT> http://unicode.org/unicode/faq/utf_bom.html#BOM
oT> and both seem to point towards a cautious usage of a BOM for utf-8,
oT> or no usage at all

oT> Do you have other references worth reading on the topic?

XML 1.0, Appendix F: Autodetection of Character Encodings (Non-Normative)
http://www.w3.org/TR/xml/#sec-guessing

which notes that the presence of EF BB BF means the stream can be
confidently assumed to be UTF-8, while in the absence of a BOM and the
absence of an xml encoding declaration,

"UTF-8 without an encoding declaration, or else the data stream is
mislabeled (lacking a required encoding declaration), corrupt,
fragmentary, or enclosed in a wrapper of some kind"

I18n comments on CSS 2.1
http://www.w3.org/International/2005/05/css2-1-review.html

"Mention should be made of the Unicode BOM and its relationship to the
encoding of the file. Is BOM allowed?"

CSS 2.1
http://www.w3.org/TR/CSS21/syndata.html#q23

When a style sheet resides in a separate file, user agents must
observe the following priorities when determining a style sheet's
character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field (or similar
   parameters in other protocols)
2. BOM and/or @charset (see below)
3. <link charset=""> or other metadata from the linking mechanism (if
   any)
4. charset of referring style sheet or document (if any)
5. Assume UTF-8

-- 
 Chris Lilley                    mailto:chris@w3.org
 Interaction Domain Leader
 Co-Chair, W3C SVG Working Group
 W3C Graphics Activity Lead
 Co-Chair, W3C Hypertext CG
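
The CSS 2.1 priority list quoted in the message above can be sketched roughly as follows. This is a simplified illustration, not a conforming implementation: the function name and parameters are hypothetical, only the UTF-8 BOM is recognised, a BOM is taken to win over any @charset rule, and the @charset rule is matched only in its plain ASCII form at the very start of the file.

```python
import codecs
import re

# Simplified pattern for an @charset rule at the very start of the sheet.
CHARSET_RULE = re.compile(rb'@charset "([^"]*)";')


def stylesheet_encoding(http_charset=None, data=b"",
                        link_charset=None, referrer_charset=None):
    """Pick an encoding for an external style sheet, highest priority first."""
    # 1. HTTP "charset" parameter in the Content-Type field.
    if http_charset:
        return http_charset
    # 2. BOM and/or @charset (assumption: the BOM is checked first).
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    match = CHARSET_RULE.match(data)
    if match:
        return match.group(1).decode("ascii", "replace")
    # 3. <link charset=""> or other metadata from the linking mechanism.
    if link_charset:
        return link_charset
    # 4. charset of the referring style sheet or document.
    if referrer_charset:
        return referrer_charset
    # 5. Assume UTF-8.
    return "utf-8"
```

Each step is consulted only if every higher-priority source is absent, so the BOM matters exactly when the server sends no charset parameter.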
Received on Thursday, 7 December 2006 12:35:56 UTC