Re: Should the UTF-8 BOM trump overriding via HTTP or by users? from Leif Halvard Silli on 2011-06-09 (www-international@w3.org from April to June 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 9 Jun 2011 09:04:01 +0200
To: www-international <www-international@w3.org>
Cc: John Cowan <cowan@mercury.ccil.org>
Message-ID: <20110609090401531862.04ce13e8@xn--mlform-iua.no>

Leif Halvard Silli, Thu, 9 Jun 2011 01:38:11 +0200:
> John Cowan, Wed, 8 Jun 2011 16:28:19 -0400:
>> Leif Halvard Silli scripsit:
>> 
>>> So, really, I don't know if Firefox uses your algorithm for the
>>> file:// protocol. All I know is that its *parser* fails to return
>>> 'fatal error' when the BOM and the declaration differ. Based on the
>>> XML parsers I have used recently (Webkit, Gecko, Opera, 'oXygen XML
>>> editor', 'XMLmind XML editor'), it is the *exception* (only Webkit
>>> does it)
> 
> Error: Webkit also does it. [ snip ]

>>> rather than the rule, that file protocol parsing returns
>>> "fatal error" whenever encoding declaration differs from the BOM.
>> 
>> That's clearly a bug, then.  If the encoding declaration is *not* UTF-8,
>> then the BOM is not a BOM at all, but characters preceding the XML
>> declaration.  That means the input is not well formed.
> 
> Even the RXP parser [1], [ snip ] have that bug. 

And Xerces. 

Both Xerces and RXP treat the UTF-8 BOM as the UTF-8 BOM (or possibly 
ignore it completely, which is perhaps more likely) and just check that 
they know the name of the encoding in the XML encoding declaration. They 
do not check that the encoding declaration is compatible with the UTF-8 
BOM. They ignore the charset parameter of the HTTP Content-Type header too. 

To be accurate: in the one case that the XML test suite covers (namely 
<?xml version="1.0" encoding="UTF-16"?> used in a UTF-8 encoded file), 
Xerces and RXP do emit 'fatal error'. 
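
For illustration, here is a rough Python sketch of the check John 
describes, i.e. the one that Xerces and RXP seem to skip. It is not taken 
from any of the parsers mentioned. The names UTF8_BOM, declared_encoding 
and bom_conflicts_with_declaration are made up for this example, and it 
assumes the declaration can be read as ASCII-compatible bytes and that 
only the exact label "UTF-8" (case-insensitive) counts as compatible with 
the BOM.

```python
import re
from typing import Optional

# The three bytes of the UTF-8 encoded U+FEFF (the "UTF-8 BOM").
UTF8_BOM = b"\xef\xbb\xbf"

# Hypothetical helper pattern: the start of an XML declaration, with an
# optional encoding pseudo-attribute captured as group 1.
_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
)


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    # Skip a leading UTF-8 BOM so the declaration is matched at offset 0.
    body = data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
    match = _DECL.match(body)
    if match is None or match.group(1) is None:
        return None
    return match.group(1).decode("ascii")


def bom_conflicts_with_declaration(data: bytes) -> bool:
    """True when a UTF-8 BOM is followed by a non-UTF-8 encoding declaration.

    Assumption: this is the situation a parser would report as a fatal
    error under the reading discussed in this thread.
    """
    if not data.startswith(UTF8_BOM):
        return False
    encoding = declared_encoding(data)
    return encoding is not None and encoding.lower() != "utf-8"
```

With this sketch, the test suite case (a UTF-8 BOM followed by 
encoding="UTF-16") is flagged, and so is a UTF-8 BOM followed by, say, 
encoding="ISO-8859-1", which is the kind of case that apparently goes 
undetected. 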
-- 
Leif H Silli

Received on Thursday, 9 June 2011 07:04:40 UTC