Re: Should the UTF-8 BOM trump overriding via HTTP or by users? from Leif Halvard Silli on 2011-06-08 (www-international@w3.org from April to June 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 8 Jun 2011 15:38:07 +0200
To: John Cowan <cowan@mercury.ccil.org>
Cc: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Bjoern Hoehrmann <derhoermi@gmx.net>, www-international <www-international@w3.org>
Message-ID: <20110608153807198657.eae08c00@xn--mlform-iua.no>

John Cowan, Tue, 7 Jun 2011 23:09:42 -0400:
> Leif Halvard Silli scripsit:
> 
>>> In any case, Appendix F is non-normative.  The algorithm [...],
>>> which has no authority except my own, allows an 8-BOM to override
>>> any XML declaration.  It doesn't handle XML parsed entities.
>> 
>> But is that in line with XML 1.0?
> 
> The sniffer just attempts to discover the encoding: it doesn't check the
> document for correctness. If the document is not well-formed, it may
> return the wrong answer. 

So, you mean: it is a step before the doc is fed to the parser?

I don't see how this is different from what Safari and IE do when they 
override the HTTP header whenever they see the UTF-8 BOM. (Firefox 
behaves like your algorithm for the file:// protocol.)
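
To make the precedence concrete, here is a minimal sketch in Python of a pre-parse step in which the UTF-8 BOM trumps the HTTP charset. The function name `sniff` and its parameters are hypothetical; this is not the actual code of any browser or of the sniffer discussed above.

```python
import codecs

def sniff(data: bytes, http_charset=None):
    """Pick an encoding before the bytes reach the parser (illustrative only)."""
    # Assumption: a UTF-8 BOM overrides whatever the HTTP header claims.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Otherwise fall back to the header value, which may be None.
    return http_charset
```

With this ordering, `sniff(codecs.BOM_UTF8 + b"<r/>", "iso-8859-1")` yields "utf-8", while the same bytes without the BOM yield "iso-8859-1".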

  ...
>> XML describes normative "fatal error" situations related to encoding:
>> 
>> 1. When external encoding info is absent: a) A processor fed with an
>> entity whose encoding differs from the info in the XML declaration.
> 
> This is not actually testable: bad encoding will at best produce an
> error related to 4 below.

There is not necessarily any need for a test (that is: step 4 is not 
always needed). It can be a matter of comparing the encoding labels: 
first the parser determines the encoding, and if it uses the BOM to 
do so and thereafter discovers that the XML encoding declaration says 
"KOI8-R", then we have the "fatal error" situation. 
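
The label comparison I have in mind can be sketched roughly as follows, in Python. All names (`bom_encoding`, `declared_encoding`, `label_conflict`) are hypothetical, the regular expression is a simplification of the XML declaration grammar, and only the UTF-8 BOM is handled. No byte-level validation (point 4) is involved.

```python
import codecs
import re

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def bom_encoding(data: bytes):
    """Return (encoding, BOM length) if a UTF-8 BOM is present, else (None, 0)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    return None, 0

def declared_encoding(data: bytes, offset: int = 0):
    """Return the label in the XML declaration starting at offset, or None."""
    m = _DECL_RE.match(data[offset:])
    return m.group(1).decode("ascii") if m else None

def label_conflict(data: bytes) -> bool:
    """True when a UTF-8 BOM is present but the declaration names another encoding."""
    bom, skip = bom_encoding(data)
    decl = declared_encoding(data, skip)
    if bom is None or decl is None:
        return False
    # Encoding labels are compared case-insensitively.
    return decl.lower() != bom
```

For a document that starts with the UTF-8 BOM and declares encoding="KOI8-R", `label_conflict` returns True without looking at a single byte of the content.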

But does any XML parser obey this rule? At least Webkit, Opera and 
Firefox do not. They instead accept the BOM and ignore the XML encoding 
declaration. (Exception: if the encoding in the declaration is an 
unknown encoding, then Webkit shows a fatal error - but this is actually 
3 - see below.)

>>    b) If BOM and XML encoding declaration is lacking too: feeding a
>>    processor with an entity which isn't in UTF-8 encoded.
> 
> Again, only testable if non-UTF8 bytes are found.

Except for the "Again": Agreed.

>> 2. To not have the XML declaration as the very first part of
>> the entity. (Example: An UTF-8 encoded doc with a BOM and a XML
>> declaration, but which for some reason is read as ISO-8859-1. Only
>> Opera allows the user to, this way, place the parser in 'fatal error'
>> mode.)

I take your silence as agreement. The "some reason" could be an HTTP 
header.
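
A small Python illustration of point 2, assuming the bytes are simply decoded with the overriding label (the helper name `declaration_is_first` is hypothetical): read as ISO-8859-1, the three BOM bytes turn into ordinary characters that precede the XML declaration.

```python
import codecs

doc = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><r/>'

def declaration_is_first(text: str) -> bool:
    """True when the XML declaration is the very first thing in the entity."""
    return text.startswith("<?xml")

# Decoded as ISO-8859-1, the BOM bytes EF BB BF become three visible characters.
as_latin1 = doc.decode("iso-8859-1")

# Decoded as UTF-8 with BOM handling ("utf-8-sig"), the BOM is consumed.
as_utf8 = doc.decode("utf-8-sig")
```

Here `declaration_is_first(as_latin1)` is False, because the text begins with U+00EF U+00BB U+00BF, whereas `declaration_is_first(as_utf8)` is True.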

>> 3. A parser presented with an encoding it is unable to handle
> 
> That can only happen if the encoding declaration, HTTP header, or other
> high-level protocol contains something the parser can't identify.

The exact wording: "XML processor encounters an entity with an encoding 
that it is unable to process". I agree that "encounters" probably 
means "is told" (via the means you mention).

But, for a UTF-8 encoded doc with a BOM whose encoding declaration says 
"US-BSCII", do parsers actually show an error? Well, Opera and 
Firefox treat it the same as 1 - see above. Whereas Webkit shows a fatal 
error - I wonder why Webkit didn't show a fatal error in 1 as well.
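
What "is told an encoding it is unable to process" amounts to can be sketched like this in Python. Python's own codec registry stands in for whatever set of encodings a given parser supports, which is an assumption, and the name `can_process` is hypothetical.

```python
import codecs

def can_process(label: str) -> bool:
    """True if the label names an encoding known to the codec registry."""
    try:
        codecs.lookup(label)
    except LookupError:
        # Unknown label: the situation described in point 3.
        return False
    return True
```

Under this stand-in, "KOI8-R" is a known label and "US-BSCII" is not, so only the latter would put a strict processor into the point 3 situation.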

>> 4. Discovering byte sequences that are illegal in the current encoding
> 
> See above.
> 
>> 5. Unless higher level protocol defines the encoding, and unless the
>> document is in UTF-8 or UTF-16 (so "UTF-16LE" is not covered!), then
>> it is an error to not have an encoding declaration.
> 
> Correct.

You are blessing what XML says (in my rewording, though). Do you say 
that XML is incorrect w.r.t. the other points above? 
-- 
Leif Halvard Silli

Received on Wednesday, 8 June 2011 13:38:37 UTC