Re: Should the UTF-8 BOM trump overriding via HTTP or by users? from John Cowan on 2011-06-08 (www-international@w3.org from April to June 2011)

From: John Cowan <cowan@mercury.ccil.org>
Date: Tue, 7 Jun 2011 23:09:42 -0400
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, www-international <www-international@w3.org>
Message-ID: <20110608030942.GB14459@mercury.ccil.org>

Leif Halvard Silli scripsit:

> > In any case, Appendix F is non-normative.  The algorithm [...],
> > which has no authority except my own, allows an 8-BOM to override
> > any XML declaration.  It doesn't handle XML parsed entities.
>
> But is that in line with XML 1.0?

The sniffer just attempts to discover the encoding: it doesn't check the
document for correctness.  If the document is not well-formed, it may
return the wrong answer.  In addition, some (hypothetical) encodings
will not be correctly sniffed.  For example, the imaginary us-bscii
encoding, which is the same as us-ascii except that 0x61 is 'b' and 0x62
is 'a', will be sniffed as us-ascii.
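
To make the limits concrete, here is a minimal sketch of a sniffer of this kind. It is not the algorithm discussed in this thread: the names `BOMS`, `DECL`, and `sniff` are hypothetical, UTF-32 and BOM-less UTF-16 are ignored, and the fallback to UTF-8 is an assumption based on the XML default.

```python
import re

# Hypothetical BOM table; UTF-32 is deliberately left out of this sketch.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

# Matches an encoding declaration only in ASCII-compatible encodings.
DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff(data: bytes) -> str:
    """Guess the encoding; never checks the document for correctness."""
    # A BOM, if present, overrides any XML declaration.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    m = DECL.match(data)
    if m:
        return m.group(1).decode("ascii").lower()
    # Assumed default when neither a BOM nor a declaration is found.
    return "utf-8"

# A document written in the imaginary us-bscii: 'a' and 'b' swap bytes.
text = '<?xml version="1.0" encoding="us-bscii"?><doc/>'
bscii_bytes = text.translate(str.maketrans("ab", "ba")).encode("ascii")
print(sniff(bscii_bytes))
```

The last lines reproduce the us-bscii case: on the wire the declaration is byte-for-byte the one for us-ascii, so the sniffer has no way to tell the two apart.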

> XML describes normative "fatal error" situations related to encoding:
>
> 1. When external encoding info is absent: a) A processor fed with an
> entity whose encoding differs from the info in the XML declaration.

This is not actually testable: a wrong encoding declaration will at
best produce an error of the kind described in 4 below.
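
A small illustration, using Python's built-in codecs purely as an example (the sample string is made up): an entity that is really ISO-8859-15 but is declared as ISO-8859-1 decodes without any error, because every byte is legal in both.

```python
# The entity is really ISO-8859-15 ...
data = "price: 10 €".encode("iso-8859-15")

# ... but is decoded according to a (wrong) ISO-8859-1 declaration.
as_declared = data.decode("iso-8859-1")   # no exception is raised
as_intended = data.decode("iso-8859-15")

# The texts differ, yet nothing in the byte stream reveals the mismatch.
print(as_declared == as_intended)
```

Byte 0xA4 is the euro sign in ISO-8859-15 and the currency sign in ISO-8859-1, so the decoder silently yields different text rather than an error.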

>    b) If the BOM and the XML encoding declaration are lacking too:
>    feeding a processor with an entity which isn't UTF-8 encoded.

Again, this is only testable if byte sequences that are not valid UTF-8
are found.
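
As a rough sketch of what "testable" means here (the helper name `is_valid_utf8` is hypothetical, and Python's strict decoder stands in for a real parser's check):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

# ISO-8859-1 text with a non-ASCII letter: detectably not UTF-8.
print(is_valid_utf8("café".encode("iso-8859-1")))

# ISO-8859-1 text that is pure ASCII: indistinguishable from UTF-8.
print(is_valid_utf8("cafe".encode("iso-8859-1")))

# ISO-8859-1 text whose bytes happen to form valid UTF-8: also undetected.
print(is_valid_utf8("Ã©".encode("iso-8859-1")))
```

Only the first case can be reported as an error; the other two are accepted as UTF-8 even though the entity was produced in a different encoding.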

> 2. To not have the XML declaration as the very first part of
> the entity. (Example: A UTF-8 encoded doc with a BOM and an XML
> declaration, but which for some reason is read as ISO-8859-1. Only
> Opera allows the user to, this way, place the parser in 'fatal error'
> mode.)
>
> 3. A parser presented with an encoding it is unable to handle

That can only happen if the encoding declaration, HTTP header, or other
high-level protocol names an encoding the parser can't identify.
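
For instance, with Python's codec registry standing in for the set of encodings a parser supports (the helper `can_handle` is a hypothetical name):

```python
import codecs

def can_handle(label: str) -> bool:
    """Return True if the encoding label resolves to a known codec."""
    try:
        codecs.lookup(label)
    except LookupError:
        return False
    return True

print(can_handle("ISO-8859-1"))
print(can_handle("us-bscii"))
```

The check is made on the label alone, before any content bytes are examined, which is why this error depends only on what the declaration or the protocol says.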

> 4. Discovering byte sequences that are illegal in the current encoding

See above.

> 5. Unless higher level protocol defines the encoding, and unless the
> document is in UTF-8 or UTF-16 (so "UTF-16LE" is not covered!), then
> it is an error to not have an encoding declaration.

Correct.

-- 
John Cowan  cowan@ccil.org  http://ccil.org/~cowan
If I have seen farther than others, it is because I was standing on
the shoulders of giants.
        --Isaac Newton

Received on Wednesday, 8 June 2011 03:10:05 UTC