- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Tue, 7 Jun 2011 16:36:41 +0200
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: www-international <www-international@w3.org>
Bjoern Hoehrmann, Tue, 07 Jun 2011 06:39:34 +0200:
> * Leif Halvard Silli wrote:
>> Subject: Should the UTF-8 BOM trump overriding via HTTP or by users?
>
> Higher-level information overrides lower-level information, explicit
> information overrides fallbacks, and user agents should do what their
> users want them to do. So, HTTP-level Content-Type overrides
> document-internal information, a BOM overrides user-chosen fallbacks,
> and user-chosen overrides trump anything else.

You portray the BOM as a "fallback". It actually is an encoding signature. For XML files without a BOM or encoding declaration, there is a fallback/default, which should override a user-chosen fallback as well: UTF-8. Neither HTML5 nor HTML4 operates with UTF-8 as the fallback in that case. Therefore, the BOM is even more interesting to use in HTML than in XML.

Anyway, the above is only general reasoning (which makes lots of sense). But where is this "über spec" which says that this is how it should work? The only one I have found is XML 1.0, which says that, when there is external encoding information which conflicts with the BOM or the XML declaration, then, quote:

]] In the interests of interoperability, however, the following rule is recommended.
 * If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding. [[

Note that this means that if the document has no BOM or encoding declaration, then the HTTP header will win even though UTF-8 is the default encoding.

The scenario you describe, by contrast, does not involve any conflict; it only describes a priority order, which literally frees the user agent from making any choice. A conflict only occurs when two sets of information clash so that the user agent has to make a choice.
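
As an illustration only, here is a minimal Python sketch of the XML 1.0 file rule quoted above, as I read it: an in-document signature (BOM, then encoding declaration) wins when present, external information such as an HTTP charset is used only when both are absent, and UTF-8 is the last resort. The function names (`sniff_bom`, `sniff_xml_declaration`, `pick_encoding`) and the `http_charset` parameter are hypothetical. The sketch ignores UTF-32 and only recognises a declaration written in an ASCII-compatible encoding. It does not model what any browser actually does.

```python
import codecs
import re

# Byte signatures checked by this sketch (UTF-32 is deliberately left out).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches encoding="..." inside an XML declaration at the very start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_bom(data):
    """Return the encoding signalled by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_xml_declaration(data):
    """Return the encoding named in a leading XML declaration, or None."""
    match = _DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None


def pick_encoding(data, http_charset=None):
    """Apply the priority order: BOM, declaration, external info, UTF-8."""
    return (
        sniff_bom(data)
        or sniff_xml_declaration(data)
        or http_charset
        or "utf-8"
    )
```

With this ordering, `pick_encoding(b"\xef\xbb\xbf<p>hi</p>", http_charset="iso-8859-1")` yields `"utf-8"`, while a document with neither a BOM nor a declaration takes whatever `http_charset` says. That second case is the one where the HTTP header wins over the UTF-8 default.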
> If you want X and your "agent" does Y against your wishes, then it's
> not really an agent acting on your behalf,

Well, I am open to discussing whether it is correct of IE/WebKit not to permit the user to override the encoding whenever there is a UTF-8 BOM. However, the primary issue here is not *my* will but the HTTP will.

> I don't think that point
> merits any discussion at all. Similarly, for fallbacks, there is no
> other way this could work due to the semantics of "fallback".

The BOM is not fallback info.

> That leaves the BOM versus Content-Type. If you let the BOM override
> the Content-Type header, it would be impossible to send content that
> starts with something that looks like a BOM but really isn't over the
> protocol. "Content-Type overrides BOM" is a "If X then Y" situation.

"Looks like a BOM". Looks like, or is exactly those bytes? Can you describe a use case? When and how can an XML document/entity legally start with the BOM if it is not meant to be interpreted as the BOM? XML in fact says:

]] If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16. [[

(An HTML page cannot, legally, begin with that character.)

> The other way around you get "If X then Y except when also A, then B,
> ohh, and not X but Z then C, and..." as you've made the process
> dependent on the internet media type.

I don't see why the BOM should not win even if you serve text/plain. Just now I added a test page for this as well:

http://malform.no/testing/html5/bom/#ccc

The results speak for themselves. If you don't agree with the IE/WebKit behavior, then bug reports against them would be in order!

> Anyone who wants the BOM to take precedence over the HTTP Content-Type
> header, or the charset parameter within it, is welcome to make an I-D
> to that effect that updates RFC 2616 and RFC 4288 and possibly others.
> Trying to sneak in such changes through backdoors is unacceptable. So,
> if "HTML5" has rules as you suggest, that is most likely an error.

I opened the thread by saying that I have filed a bug report *against* HTML5. So, no. But if the bug report is successful, then HTML will say that.

Thanks for the proposal regarding an I-D. Never say never! :-)
-- 
Leif Halvard Silli
Received on Tuesday, 7 June 2011 14:37:12 UTC