- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 07 Dec 2006 12:53:42 +0900
- To: Chris Lilley <chris@w3.org>, www-validator@w3.org
- Cc: www-international@w3.org
Hello Chris,

At 23:35 06/12/06, Chris Lilley wrote:

>Hello www-validator,
>
>I was surprised to see, on the W3C DTD validator, the following advice:
>
>    The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
>    cause problems for some text editors and older browsers. You may
>    want to consider avoiding its use until it is better supported.
>
>This is odd because the use of a BOM with UTF-8 files is
>
>a) standards compliant, to Unicode and to XML and to CSS

For Unicode, that wasn't clear initially. For XML, again, the original
edition said nothing about (or against) a BOM on UTF-8. Most if not all
initial implementations of XML parsers silently assumed that UTF-8
entities would not start with a BOM. Some of these implementations are
still around, sometimes maybe even in silicon.

In particular, the second edition of XML 1.0 mentions the BOM for UTF-8:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-no-ext-info
But the first edition doesn't:
http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing

>b) common practice

To some extent, yes. But that's one reason for the warning. If the
practice were very rare, we wouldn't have cared to add a warning to
the validator.

>c) allows text editors to auto-detect the encoding of a plain text
>document.

Yes. But there is no such thing as a BOM for encodings such as
iso-8859-1, iso-8859-2, iso-8859-3, and so on. And these are very
difficult to distinguish from each other. On the other hand, UTF-8 can
extremely easily be auto-detected if needed, except for the edge case
mentioned by Addison. So the signature situation is backwards: the
encoding that needs a signature least has one, whereas the other
encodings don't.

Also, as Ira and Asmus mentioned, the BOM interferes with certain kinds
of processing. On Windows, and for users working directly with files,
the BOM isn't too problematic. On the other hand, using anything in the
direction of Unix-like tools such as pipes makes the BOM a real pain.

More basically, UTF-8 without a BOM has several fundamental and
important properties, neither of which applies to UTF-8 with a BOM:

1) Any US-ASCII data is also UTF-8.

2) Any operation on UTF-8 data that can treat non-ASCII data as
   'black boxes' can be implemented as an operation on US-ASCII, with
   the only additional restriction that octets with the most
   significant bit set are left untouched.

1) can be fudged if a program instructed to produce UTF-8 checks
whether all its output will be US-ASCII, and in that case doesn't add
a BOM. But checking all your output before starting output is often
difficult or impossible.

The wording for 2) is a bit long, but there is an enormous number of
scripts and programs that meet these conditions. They are particularly
frequent where people have an idea or an itch and hack something
together. Some of these are hacks in the bad sense, with all kinds of
problems, but others are great ideas implemented well and quickly.
Using the BOM for UTF-8 denies this fertile breeding ground to UTF-8,
and makes basic internationalization a special step in many cases
where that wouldn't be necessary.

So as a conclusion, the BOM can be both very helpful AND very
damaging, depending on circumstances.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
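
As a rough illustration of the auto-detection point above, here is a
minimal Python 3 sketch; the function name looks_like_utf8 and the
validate-by-decoding heuristic are assumptions for the example, not
anything mandated by Unicode or XML:

    # Sketch: detecting UTF-8 by validation, no BOM required.
    def looks_like_utf8(data: bytes) -> bool:
        """Return True if the byte string decodes cleanly as UTF-8.

        Pure US-ASCII also passes, which is fine: ASCII is UTF-8.
        ISO-8859-x text containing non-ASCII octets almost never forms
        valid UTF-8 sequences, so false positives are rare (the edge
        case Addison mentioned is short, carefully chosen byte runs).
        """
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8("Dürst".encode("utf-8")))        # True
    print(looks_like_utf8("Dürst".encode("iso-8859-1")))   # False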
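
And a minimal sketch of property 2), assuming a byte-oriented script
that only interprets ASCII delimiters; the first_field helper and the
tab-separated layout are made up for the illustration:

    # Sketch: a byte-oriented ASCII tool applied to UTF-8 data.
    def first_field(line: bytes) -> bytes:
        """Return the first tab-separated field of a line of bytes."""
        return line.split(b"\t", 1)[0]

    line_plain = "名前\t値\n".encode("utf-8")      # UTF-8, no BOM
    line_bom   = b"\xef\xbb\xbf" + line_plain      # UTF-8 BOM prepended

    print(first_field(line_plain))   # the bytes of "名前", as expected
    print(first_field(line_bom))     # BOM octets glued onto the field

Without a BOM, the ASCII-only splitter works unchanged on UTF-8 data,
because the non-ASCII octets pass through as black boxes; with a BOM,
the three signature octets end up attached to the first field, and any
later comparison against the expected field value fails.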
Received on Thursday, 7 December 2006 03:55:34 UTC