Re: Strange advice re BOM and UTF-8 from Martin Duerst on 2006-12-07 (www-validator@w3.org from December 2006)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 07 Dec 2006 12:53:42 +0900
To: Chris Lilley <chris@w3.org>, www-validator@w3.org
Cc: www-international@w3.org
Message-Id: <6.0.0.20.2.20061207103308.095a51e0@localhost>

Hello Chris,

At 23:35 06/12/06, Chris Lilley wrote:
>
>Hello www-validator,
>
>I was surprised to see, on the W3C DTD validator, the following advice:
>
>  The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to
>  cause problems for some text editors and older browsers. You may
>  want to consider avoiding its use until it is better supported.
>
>This is odd because the use of a BOM with UTF-8 files is
>
>a) standards compliant, to Unicode and to XML and to CSS

For Unicode, that wasn't clear initially. For XML,
again, the original edition said nothing about (or against)
a BOM on UTF-8. Most if not all initial implementations of XML
parsers silently assumed that UTF-8 entities would not start
with a BOM. Some of these implementations are still around,
sometimes maybe even in silicon.

In particular, the second edition of XML 1.0 mentions the BOM for UTF-8:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing-no-ext-info
But the first edition doesn't:
http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing

>b) common practice

To some extent, yes. But that's one reason for the warning.
If the practice were very rare, we wouldn't have bothered
to add a warning to the validator.

>c) allows text editors to auto-detect the encoding of a plain text
>document.

Yes. But there is no such thing as a BOM for encodings such
as iso-8859-1, iso-8859-2, iso-8859-3, and so on. And these
are very difficult to distinguish. On the other hand, UTF-8
can extremely easily be auto-detected if needed, except
for the edge case mentioned by Addison. So the signature
situation is backwards: the encoding that needs it least
has one, whereas the other encodings don't.
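
As a rough illustration of the detection point, here is a minimal
sketch in Python. The function name looks_like_utf8 and the sample
strings are made up for this example; it is not taken from the
validator or any particular editor. It relies only on the fact that
a strict UTF-8 decode rejects most non-UTF-8 byte sequences, while
two iso-8859-x encodings both accept the same octets.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("iso-8859-1")

# Text in a legacy 8-bit encoding is rarely well-formed UTF-8.
utf8_ok = looks_like_utf8(utf8_bytes)
latin1_ok = looks_like_utf8(latin1_bytes)

# The same octet is valid in both of these encodings, so nothing in
# the data itself says which one was meant.
ambiguous = b"\xf5"
as_latin1 = ambiguous.decode("iso-8859-1")
as_latin2 = ambiguous.decode("iso-8859-2")
```

Note that pure US-ASCII input also passes this check, which is
harmless because such data reads the same either way.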

Also, as Ira and Asmus mentioned, the BOM interferes with
certain kinds of processing. On Windows and for users
working directly with files, the BOM isn't too problematic.
On the other hand, using anything in the direction of
Unix-like tools, such as pipes and the like, makes the BOM
a real pain.
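
To make the pipe problem concrete, here is a small sketch in Python,
assuming two files that each start with a UTF-8 BOM and are simply
concatenated, as a byte-oriented tool would do. The variable names
and file contents are invented for the example.

```python
import codecs

# Two hypothetical files, each written with a leading UTF-8 BOM.
part_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-level concatenation knows nothing about BOMs.
joined = part_a + part_b

# The "utf-8-sig" codec strips a BOM only at the very start of the
# data, so the second BOM survives as U+FEFF in the middle of the text.
text = joined.decode("utf-8-sig")
stray_bom_count = text.count("\ufeff")
```

Without the BOMs, the concatenation would just be the two lines of
text and no cleanup step would be needed.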

More basically, UTF-8 without a BOM has several fundamental
and important properties, neither of which applies to UTF-8
with a BOM:
1) Any US-ASCII data is also UTF-8
2) Any operations on UTF-8 data that can treat non-ASCII
   data as 'black boxes' can be implemented as operations
   on US-ASCII with the only additional restriction that
   octets with the most significant bit set are left
   untouched.
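
The two properties can be sketched in Python as follows. This is
only an illustration under my own assumptions: ascii_upper is a
made-up helper, and the "#!" check stands in for any tool that looks
at the first octets of its input.

```python
import codecs


def ascii_upper(data: bytes) -> bytes:
    """Uppercase a-z octet by octet; every other octet passes through."""
    out = bytearray()
    for octet in data:
        if 0x61 <= octet <= 0x7A:      # US-ASCII 'a'..'z'
            out.append(octet - 0x20)
        else:                          # includes all octets >= 0x80
            out.append(octet)
    return bytes(out)


# Property 2: an ASCII-only operation keeps BOM-less UTF-8 well-formed.
sample = "naïve café".encode("utf-8")
upper_text = ascii_upper(sample).decode("utf-8")

# Property 1: US-ASCII data is UTF-8 as it stands, but not with a BOM.
script = b"#!/bin/sh\n"
with_bom = codecs.BOM_UTF8 + script
plain_is_ascii = script.isascii()
bom_is_ascii = with_bom.isascii()
bom_keeps_prefix = with_bom.startswith(b"#!")
```

The non-ASCII letters come out unchanged because the helper never
touches their octets, and the BOM version no longer starts with the
two octets an ASCII-level tool would look for.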

1) can be fudged if a program instructed to produce UTF-8
checks whether all its output will be US-ASCII, and in that
case, doesn't add a BOM. But checking all your output before
starting output is often difficult or impossible.
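
A sketch of that fudge in Python, with a hypothetical function name,
shows where the difficulty comes from: the decision about the first
three octets depends on the last chunk of output, so everything has
to be held back until the end.

```python
import codecs
from typing import Iterable


def encode_with_optional_bom(chunks: Iterable[str]) -> bytes:
    """Encode as UTF-8, adding a BOM only if some output is non-ASCII."""
    # The whole output must be collected before the first octet can be
    # written, which a streaming program often cannot afford.
    body = "".join(chunks).encode("utf-8")
    if body.isascii():
        return body
    return codecs.BOM_UTF8 + body


ascii_only = encode_with_optional_bom(["abc", "def"])
with_accent = encode_with_optional_bom(["abc", "é"])
```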

The wording for 2) is a bit long, but there is an enormous
number of scripts and programs that meet these conditions.
They are particularly frequent where people have an idea or
an itch and hack something together. Some of these are hacks
in the bad sense, with all kinds of problems, but others
are great ideas implemented well and quickly. Using the BOM
for UTF-8 denies this fertile breeding ground to UTF-8,
and makes basic internationalization a special step in
many cases where that wouldn't be necessary.

So in conclusion, the BOM can be both very helpful AND
very damaging, depending on circumstances.

Regards,     Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Thursday, 7 December 2006 03:55:34 UTC