Re: Should the UTF-8 BOM trump overriding via HTTP or by users? from Leif Halvard Silli on 2011-06-08 (www-international@w3.org from April to June 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 8 Jun 2011 18:50:48 +0200
To: John Cowan <cowan@mercury.ccil.org>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, www-international <www-international@w3.org>
Message-ID: <20110608185048330991.d9de5d7b@xn--mlform-iua.no>

John Cowan, Wed, 8 Jun 2011 11:54:30 -0400:
> Leif Halvard Silli scripsit:
> 
>> So, you mean: it is a step before the doc is fed to the parser?
> 
> Yes.

So that algorithm effectively plays the role of external encoding 
information. Because otherwise the XML parser is not permitted to 
interpret the document differently from what the XML encoding 
declaration says.

>> I don't see how this is different from what Safari and IE do when they 
>> override the HTTP header whenever they see the UTF-8 BOM.
> 
> My algorithm is a "file" algorithm; it doesn't know anything about HTTP
> headers.

Yes, but I am thinking about the principle. This thread is meant to 
focus on what parsers are allowed to do. Just as one can inspect the 
BOM and the XML declaration before feeding the document to the parser, 
one could also inspect the BOM, the declaration *and* HTTP. Just as 
your algorithm effectively is external encoding information, an 
algorithm that also takes HTTP into account before doing the 
overriding would just be another form of external encoding 
information.
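
To make the principle concrete, here is a minimal sketch of such a 
pre-parser step. It is only an illustration under my own assumptions, 
not the algorithm of any browser or of John's file algorithm: the 
function name external_encoding, the bom_trumps_http switch and the 
returned (encoding, source) pair are all hypothetical.

```python
import codecs

def external_encoding(data: bytes, http_charset=None, bom_trumps_http=True):
    """Decide, before parsing, which encoding to hand to the XML parser.

    Hypothetical sketch: returns (encoding, source), where source says
    which piece of information won. (None, "none") means that nothing
    external was found, so the parser falls back to its own rules
    (the XML encoding declaration and the defaults).
    """
    has_utf8_bom = data.startswith(codecs.BOM_UTF8)

    # Policy under discussion: let the UTF-8 BOM override the HTTP header.
    if has_utf8_bom and (bom_trumps_http or http_charset is None):
        return "utf-8", "bom"

    # Otherwise the HTTP charset, if any, is the external information.
    if http_charset is not None:
        return http_charset, "http"

    return None, "none"
```

With bom_trumps_http=True this mimics the overriding described above 
for Safari and IE; with False, the HTTP header keeps its precedence. 
Either way the result reaches the parser as external encoding 
information.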

>> (Firefox behaves like your algorithm for the file:// protocol.)
> 
> Good.  :-)

I still question its validity according to XML. It blurs the 
draconian error handling of XML.

>>>> XML describes normative "fatal error" situations related to encoding:
>>>> 
>>>> 1. When external encoding info is absent: a) A processor fed with an
>>>> entity whose encoding differs from the info in the XML declaration.
>>> 
>>> This is not actually testable: bad encoding will at best produce an
>>> error related to 4 below.
>> 
>> There is not necessarily a need for a test (that is: step 4 is not 
>> always needed). It can be a matter of comparing the encoding labels. 
>> Because: first the parser determines the encoding. And if it uses the 
>> BOM to determine the encoding, and thereafter discovers that the XML 
>> encoding declaration says "KOI8-R", then we have the "fatal error" 
>> situation.
> 
> True.
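
The label comparison can be sketched as follows. This is a rough 
illustration only, assuming a UTF-8 BOM and no external encoding 
information; the names FatalEncodingError, declared_encoding and 
check_labels are hypothetical, and the regular expression is a 
simplification of the real XML declaration grammar.

```python
import codecs
import re

class FatalEncodingError(Exception):
    """Hypothetical stand-in for an XML processor's fatal error."""

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
_DECL = re.compile(
    rb'''^<\?xml\s[^>]*?encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1'''
)

def declared_encoding(data: bytes):
    """Return the label from the XML declaration, or None if there is none."""
    m = _DECL.match(data)
    return m.group(2).decode("ascii") if m else None

def check_labels(data: bytes):
    """Compare the UTF-8 BOM against the declared label; no decoding test."""
    if not data.startswith(codecs.BOM_UTF8):
        return None  # no UTF-8 BOM, so this check does not apply

    label = declared_encoding(data[len(codecs.BOM_UTF8):])
    if label is None:
        return "utf-8"  # the BOM alone determines the encoding

    try:
        normalized = codecs.lookup(label).name
    except LookupError:
        # Unknown label: a different fatal error case (3 in the list).
        raise FatalEncodingError("unknown encoding label: " + label)

    if normalized != "utf-8":
        raise FatalEncodingError("BOM says UTF-8, declaration says " + label)
    return "utf-8"
```

A document that starts with the UTF-8 BOM but declares 
encoding="KOI8-R" raises the error here without a single character of 
content being decoded, which is the point: the mismatch is detectable 
from the labels alone.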

Even XML editors, like the well-known oXygen, do not display a fatal 
error for this, though. 

But perhaps it includes a non-parser step which checks the code first, 
using your algorithm, and "overrides"?

>> But does any XML parser obey this rule? At least Webkit, Opera and 
>> Firefox do not. They instead accept the BOM and ignore the XML 
>> encoding declaration. (Exception: if the encoding in the declaration 
>> is an unknown encoding, then Webkit shows a fatal error - but this is 
>> actually 3 - see below.)
> 
> Those are XML parsers inside browsers, which I know little about.
> 
>> I take your silence as agreement. The "some reason" could be an HTTP 
>> header.
> 
> Quite so.
-- 
Leif H Silli

Received on Wednesday, 8 June 2011 16:51:20 UTC