Re: Should the UTF-8 BOM trump overriding via HTTP or by users? from Leif Halvard Silli on 2011-06-08 (www-international@w3.org from April to June 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 8 Jun 2011 04:47:52 +0200
To: John Cowan <cowan@mercury.ccil.org>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, www-international <www-international@w3.org>
Message-ID: <20110608044752892719.f58e772d@xn--mlform-iua.no>

John Cowan, Tue, 7 Jun 2011 13:41:56 -0400:
> Leif Halvard Silli scripsit:

>> ]]
>> In the interests of interoperability, however, the following rule is 
>> recommended.
>>  * If an XML entity is in a file, the Byte-Order Mark and encoding 
>> declaration are used (if present) to determine the character encoding.
>> [[

> Did you paste the wrong quotation?  That explicitly refers to XML entities
> in files; i.e. without HTTP metadata.

The quote appears under the heading "F.2 Priorities in the Presence of 
External Encoding Information". Perhaps section '2.11 End-of-Line 
Handling' gives a hint. It says: "XML parsed entities are often stored 
in computer files […]". Because when a parsed entity is stored in a file, 
the file has to include encoding info, which this section suggests reusing.
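
As a rough illustration of the quoted F.2 rule for entities in files, here is a minimal Python sketch. The function name `sniff_file_encoding` is made up for this example, and letting the BOM win over the encoding declaration is an assumption (the quoted rule only says that both are used). It only covers UTF-8 and UTF-16 BOMs and ASCII-compatible declarations.

```python
import codecs
import re

# Only UTF-8 and UTF-16 BOMs are covered. A UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM, so this sketch would misreport it.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
]

# Matches an encoding pseudo-attribute in an XML declaration that sits at
# the very start of the entity (ASCII-compatible encodings only).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_file_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity read from a file."""
    # Assumption: a BOM, if present, takes priority over the declaration.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: fall back to the encoding declaration, if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Neither BOM nor declaration: XML 1.0 defaults to UTF-8.
    return "UTF-8"
```

With this ordering, a file that starts with a UTF-8 BOM is reported as UTF-8 even if its declaration names ISO-8859-1.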

> In any case, Appendix F is non-normative.  The algorithm described in
> http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html ,
> which has no authority except my own, allows an 8-BOM to override any
> XML declaration.  It doesn't handle XML parsed entities.

But is that in line with XML 1.0? XML describes normative "fatal error" 
situations related to encoding:

1. When external encoding info is absent:
   a) A processor fed with an entity whose encoding differs from
      the info in the XML declaration.
   b) If a BOM and an XML encoding declaration are lacking too: feeding
      a processor with an entity which isn't UTF-8 encoded.

2. To not have the XML declaration as the very first part of the 
entity. (Example: A UTF-8 encoded doc with a BOM and an XML 
declaration, but which for some reason is read as ISO-8859-1. Only 
Opera allows the user to place the parser in 'fatal error' mode 
this way.)

3. A parser presented with an encoding it is unable to handle.

4. Discovering byte sequences that are illegal in the current encoding.

5. Unless a higher-level protocol defines the encoding, and unless the 
document is in UTF-8 or UTF-16 (so "UTF-16LE" is not covered!), it 
is an error to not have an encoding declaration.
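
To make some of the situations above concrete, here is a minimal Python sketch that flags items 1b, 3 and 4 for a byte string. The function name `encoding_problem` and the returned labels are invented for this illustration. It assumes that external info wins over a UTF-8 BOM, which wins over the declaration, and it only recognises a UTF-8 BOM. It is not a conforming XML processor.

```python
import codecs
import re
from typing import Optional

# Encoding pseudo-attribute of an XML declaration at the start of the data
# (ASCII-compatible encodings only).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def encoding_problem(data: bytes, external: Optional[str] = None) -> Optional[str]:
    """Return a label for the first problem found, or None if none is found."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    match = _DECL.match(body)
    declared = match.group(1).decode("ascii") if match else None

    # Assumed priority: external info, then BOM, then declaration, then UTF-8.
    encoding = external or ("UTF-8" if has_bom else declared) or "UTF-8"

    try:
        codecs.lookup(encoding)
    except LookupError:
        return "3: unsupported encoding " + encoding

    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        if external is None and not has_bom and declared is None:
            return "1b: no BOM or declaration, and not UTF-8"
        return "4: illegal byte sequence for " + encoding
    return None
```

For instance, `encoding_problem(b'<a>\xe9</a>')` reports the 1b label, because the lone byte 0xE9 is not valid UTF-8 and nothing else announces another encoding.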

PS: For XML, it turns out that Firefox is as unwilling to let the 
user override the UTF-8 encoding as Webkit. It just takes another angle 
on it: If the XML page is served via HTTP, with an incorrect encoding 
label in the Content-Type:, then it leads to the yellow screen of death. 
*And it is impossible for the user to fix it by manually selecting e.g. 
UTF-8.* 

If the same file is consumed via the file protocol, then Firefox will 
ignore the XML declaration, if there is one. And if there is no XML 
encoding declaration, then it will default to UTF-8, as it will when 
there is a BOM. However, it will not allow the user to change the 
encoding!

Leif Halvard Silli

Received on Wednesday, 8 June 2011 02:48:24 UTC