Re: Should the UTF-8 BOM trump overriding via HTTP or by users?

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 7 Jun 2011 16:36:41 +0200
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: www-international <www-international@w3.org>
Message-ID: <20110607163641857797.8ec9bcf8@xn--mlform-iua.no>
Bjoern Hoehrmann, Tue, 07 Jun 2011 06:39:34 +0200:
> * Leif Halvard Silli wrote:
>> Subject: Should the UTF-8 BOM trump overriding via HTTP or by users?
> Higher-level information overrides lower-level information, explicit
> information overrides fallbacks, and user agents should do what their
> users want them to do. So, HTTP-level Content-Type overrides document-
> internal information, a BOM overrides user-chosen fallbacks, and user-
> chosen overrides trump anything else.

You portray the BOM as a "fallback". It actually is an encoding 

For XML files without BOM or encoding declaration, there is a 
fallback/default, which should override user-chosen fallback as well: 
UTF-8. HTML5 or HTML4 do not operate with UTF-8 as fallback in that 
case. Therefore, the BOM is even more interesting to use in HTML than 
in XML.

Anyway, the above is only general reasoning (which makes lots of 
sense). But where is this "über spec" which says that that is how 
it should work? The only one I have found is XML 1.0, which says that, 
when there is external encoding information which conflicts with the 
BOM or the XML declaration, then, quote:

In the interests of interoperability, however, the following rule is 
recommended.
  * If an XML entity is in a file, the Byte-Order Mark and encoding 
    declaration are used (if present) to determine the character encoding.

Note that this means that if the document has no BOM and no encoding 
declaration, then the HTTP header will win even though UTF-8 is the 
default encoding.
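
To make that order concrete, here is a minimal sketch of the rule as I 
read it above: BOM first, then the encoding declaration, then the HTTP 
charset, then the UTF-8 default. The function name, the returned labels 
and the simplified declaration regex are assumptions made for 
illustration only. It does not handle UTF-32 or declarations in 
non-ASCII-compatible encodings, and it is not a full implementation of 
XML 1.0 Appendix F.

```python
import codecs
import re
from typing import Optional, Tuple

# BOM byte sequences for the encodings discussed here (UTF-32 omitted).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"), # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"), # FF FE
]

# Simplified match for <?xml ... encoding="..."?> at the very start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_file_encoding(data: bytes,
                      http_charset: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoding, source) using the order BOM > declaration > HTTP > UTF-8."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, "bom"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower(), "declaration"
    if http_charset:
        # Only reached when the entity carries no internal encoding information.
        return http_charset.lower(), "http"
    return "utf-8", "default"
```

With this order, a document that has neither BOM nor declaration and is 
served with charset=windows-1252 is read as windows-1252, not as UTF-8.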

The scenario you describe, by contrast, does not operate with any 
conflict, it only describes a priority order, which literally frees the 
user agent from making any choice. A conflict only occurs when a double 
set of information clashes so that the user agent has to make a choice.

> If you want X and your "agent" does Y against your wishes, then it's
> not really an agent acting on your behalf,

Well, I am open to discuss whether it is correct of IE/Webkit to not 
permit the user to override the encoding whenever there is a UTF-8 BOM. 
However, the primary issue here is not *my* will but the HTTP will. 

> I don't think that point
> merits any discussion at all. Similarily, for fallbacks, there is no
> other way this could work due to the semantics of "fallback".

BOM is not fallback info.

> That leaves the BOM versus Content-Type. If you let the BOM override
> the Content-Type header, it would be impossible to send content that
> starts with something that looks like a BOM but really isn't over the
> protocol. "Content-Type overrides BOM" is a "If X then Y" situation.

"Looks like a BOM". Looks like or are exactly those bytes? Can you 
describe a use case? When and how can an XML document/entity legally 
start with the BOM if it is not meant to be interpreted as the BOM?

XML in fact says: ]] If the replacement text of an external entity is 
to begin with the character U+FEFF, and no text declaration is present, 
then a Byte Order Mark MUST be present, whether the entity is encoded 
in UTF-8 or UTF-16.[[

(An HTML page cannot legally begin with that character.)
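
A small sketch of why "looks like" and "is exactly those bytes" come to 
the same thing in UTF-8: the character U+FEFF encodes to EF BB BF, which 
is the BOM itself. The example uses Python's standard "utf-8-sig" codec, 
which consumes at most one leading BOM. The helper name and the sample 
text are only illustrative assumptions.

```python
import codecs


def strip_one_bom(data: bytes) -> str:
    """Decode UTF-8, treating at most one leading EF BB BF as a BOM."""
    return data.decode("utf-8-sig")


# Text whose content is itself meant to begin with U+FEFF.
content = "\ufeffhello"

# Encoded without an extra BOM: the bytes start with EF BB BF, so a
# decoder cannot tell the leading character apart from a BOM.
bare = content.encode("utf-8")

# Encoded with a real BOM in front, as the XML rule quoted above demands.
marked = codecs.BOM_UTF8 + bare

print(repr(strip_one_bom(bare)))    # the leading U+FEFF is consumed as the BOM
print(repr(strip_one_bom(marked)))  # the BOM is consumed, the U+FEFF survives
```

So content that is really meant to start with U+FEFF only survives if a 
BOM precedes it, which is the case the XML text above provides for.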

> The other way around you get "If X then Y except when also A, then B,
> ohh, and not X but Z then C, and..." as you've made the process de-
> pendant on the internet media type.

I don't see why the BOM should not win even if you serve text/plain. 
Just now I added a test page for this as well:


The results speak for themselves. If you don't agree with the 
IE/WebKit behavior, then bug reports against them would be in order!
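
For comparison, the two precedence orders under discussion can be 
sketched side by side. With bom_trumps=False the function follows the 
order you describe (user override, then HTTP, then BOM, then the 
user-chosen fallback). With bom_trumps=True it follows the IE/WebKit 
behavior as reported in this thread, where a BOM beats both HTTP and the 
user's override. The function name, the parameter names and the 
"windows-1252" fallback are hypothetical. This is not taken from any 
browser's source.

```python
from typing import Optional


def choose_encoding(bom: Optional[str],
                    http_charset: Optional[str],
                    user_override: Optional[str],
                    user_fallback: str = "windows-1252",  # hypothetical default
                    bom_trumps: bool = False) -> str:
    """Pick an encoding label under one of the two precedence orders."""
    if bom_trumps and bom:
        # Reported IE/WebKit behavior: the BOM cannot be overridden.
        return bom
    # Otherwise: user override > HTTP Content-Type > BOM > user fallback.
    for candidate in (user_override, http_charset, bom):
        if candidate:
            return candidate
    return user_fallback
```

The two orders only differ when a BOM is present together with an HTTP 
charset or a user override that names another encoding, which is the 
conflict case this thread is about.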

> Anyone who wants the BOM to take precedence over the HTTP Content-Type
> header, or the charset parameter within it, is welcome to make an I-D
> to that effect that updates RFC 2616 and RFC 4288 and possibly others.
> Trying to sneak in such changes through backdoors is unacceptable. So,
> if "HTML5" has rules as you suggest, that is most likely an error.

I opened the thread by saying that I have filed a bug report 
*against* HTML5. So, no. But if the bug report is successful, then HTML 
will say that. Thanks for the proposal regarding I-D. Never say never! 

Leif Halvard Silli
Received on Tuesday, 7 June 2011 14:37:12 UTC