Re: Should the UTF-8 BOM trump overriding via HTTP or by users? from Leif Halvard Silli on 2011-06-07 (www-international@w3.org from April to June 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 7 Jun 2011 16:36:41 +0200
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: www-international <www-international@w3.org>
Message-ID: <20110607163641857797.8ec9bcf8@xn--mlform-iua.no>
Bjoern Hoehrmann, Tue, 07 Jun 2011 06:39:34 +0200:
> * Leif Halvard Silli wrote:
>> Subject: Should the UTF-8 BOM trump overriding via HTTP or by users?
> 
> Higher-level information overrides lower-level information, explicit
> information overrides fallbacks, and user agents should do what their
> users want them to do. So, HTTP-level Content-Type overrides document-
> internal information, a BOM overrides user-chosen fallbacks, and user-
> chosen overrides trump anything else.

You portray the BOM as a "fallback". It actually is an encoding 
signature.
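
To make "signature" concrete, here is a minimal sketch of how an agent 
might sniff the leading bytes. The names sniff_bom and BOM_SIGNATURES 
are hypothetical, and UTF-32 is left out to keep it short; it is an 
illustration, not any browser's actual algorithm.

```python
import codecs

# Signatures checked in order. UTF-32 is omitted from this sketch
# (a UTF-32LE BOM begins with the same two bytes as the UTF-16LE one).
BOM_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no signature is found."""
    for bom, name in BOM_SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

The point is that the result comes from the bytes of the document 
itself and not from a guess, which is why I call it a signature and 
not a fallback.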

For XML files without a BOM or encoding declaration, there is a 
fallback/default, which should override the user-chosen fallback as 
well: UTF-8. Neither HTML5 nor HTML4 operates with UTF-8 as the 
fallback in that case. Therefore, the BOM is even more interesting to 
use in HTML than in XML.

Anyway, the above is only general reasoning (which makes lots of 
sense). But where is this "über spec" which says that this is how 
it should work? The only one I have found is XML 1.0, which says that, 
when there is external encoding information which conflicts with the 
BOM or the XML declaration, then, quote:

]]
In the interests of interoperability, however, the following rule is 
recommended.
 * If an XML entity is in a file, the Byte-Order Mark and encoding 
declaration are used (if present) to determine the character encoding.
[[

Note that this means that if the document has no BOM or encoding 
declaration, then the HTTP header will win even though UTF-8 is the 
default encoding.

The scenario you describe, by contrast, does not operate with any 
conflict; it only describes a priority order, which literally frees the 
user agent from making any choice. A conflict only occurs when a double 
set of information clashes so that the user agent has to make a choice.
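
A rough sketch of the two priority orders under discussion, with None 
standing for "this source says nothing". The function names are 
hypothetical, and the exact position of the user override in the 
second order is my assumption; the thread only reports that IE/WebKit 
let the UTF-8 BOM win over both HTTP and the user.

```python
def first_known(*candidates):
    """Return the first candidate that is not None, else None."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None

def http_first(user_override, http_charset, bom, declared, fallback):
    # Order as described in the quoted text: user override, then HTTP,
    # then document-internal information, then the fallback.
    return first_known(user_override, http_charset, bom, declared, fallback)

def bom_first(user_override, http_charset, bom, declared, fallback):
    # Assumed order where the BOM, as a signature, outranks everything.
    return first_known(bom, user_override, http_charset, declared, fallback)

# The two orders only differ when a BOM is present and another source
# disagrees with it.
conflict = (None, "iso-8859-1", "utf-8", None, "windows-1252")
no_bom = (None, "iso-8859-1", None, None, "windows-1252")
```

With the conflict tuple, http_first gives "iso-8859-1" and bom_first 
gives "utf-8"; with the no_bom tuple, both give "iso-8859-1".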

> If you want X and your "agent" does Y against your wishes, then it's
> not really an agent acting on your behalf,

Well, I am open to discussing whether it is correct of IE/WebKit to not 
permit the user to override the encoding whenever there is a UTF-8 BOM. 
However, the primary issue here is not *my* will but the HTTP will. 

> I don't think that point
> merits any discussion at all. Similarily, for fallbacks, there is no
> other way this could work due to the semantics of "fallback".

BOM is not fallback info.

> That leaves the BOM versus Content-Type. If you let the BOM override
> the Content-Type header, it would be impossible to send content that
> starts with something that looks like a BOM but really isn't over the
> protocol. "Content-Type overrides BOM" is a "If X then Y" situation.

"Looks like a BOM". Looks like, or are exactly those bytes? Can you 
describe a use case? When and how can an XML document/entity legally 
start with the BOM if it is not meant to be interpreted as the BOM?

XML in fact says: ]] If the replacement text of an external entity is 
to begin with the character U+FEFF, and no text declaration is present, 
then a Byte Order Mark MUST be present, whether the entity is encoded 
in UTF-8 or UTF-16.[[

(An HTML page cannot, legally, begin with that character.)
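
A small illustration of that XML rule for UTF-8, assuming a decoder 
that strips exactly one leading signature (Python's "utf-8-sig" codec 
behaves this way). The variable names are only for the example.

```python
import codecs

text = "\ufeffabc"  # replacement text that begins with U+FEFF

# The BOM comes first, then the encoded U+FEFF character itself.
payload = codecs.BOM_UTF8 + text.encode("utf-8")

# "utf-8-sig" removes one leading signature and leaves the rest alone,
# so the U+FEFF that belongs to the content survives.
decoded = payload.decode("utf-8-sig")
```

Here the payload starts with EF BB BF twice, and decoding gives back 
the original text, including its leading U+FEFF.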

> The other way around you get "If X then Y except when also A, then B,
> ohh, and not X but Z then C, and..." as you've made the process de-
> pendant on the internet media type.

I don't see why the BOM should not win even if you serve text/plain. 
Just now I added a test page for this as well:

http://malform.no/testing/html5/bom/#ccc


The results speak for themselves. If you don't agree with the 
IE/WebKit behavior, then bug reports against them would be in order!

> Anyone who wants the BOM to take precedence over the HTTP Content-Type
> header, or the charset parameter within it, is welcome to make an I-D
> to that effect that updates RFC 2616 and RFC 4288 and possibly others.
> Trying to sneak in such changes through backdoors is unacceptable. So,
> if "HTML5" has rules as you suggest, that is most likely an error.

I opened the thread by saying that I have filed a bug report 
*against* HTML5. So, no. But if the bug report is successful, then HTML 
will say that. Thanks for the proposal regarding an I-D. Never say 
never! :-)
-- 
Leif Halvard Silli
Received on Tuesday, 7 June 2011 14:37:12 UTC