Re: Should the UTF-8 BOM trump overriding via HTTP or by users?

Leif Halvard Silli scripsit:

> > In any case, Appendix F is non-normative.  The algorithm [...],
> > which has no authority except my own, allows an 8-BOM to override
> > any XML declaration.  It doesn't handle XML parsed entities.
> But is that in line with XML 1.0?

The sniffer just attempts to discover the encoding: it doesn't check the
document for correctness.  If the document is not well-formed, it may
return the wrong answer.  In addition, some (hypothetical) encodings
will not be correctly sniffed.  For example, the imaginary us-bscii
encoding, which is the same as us-ascii except that 0x61 is 'b' and 0x62
is 'a', will be sniffed as us-ascii.
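To make the us-bscii case concrete, here is a minimal sketch in Python. The `sniff` function is a toy stand-in, not the algorithm referred to above, and `bscii_encode` is a hypothetical codec built only from the description in this message. The sniffer has to read the declaration as if it were ASCII-compatible, so the bytes that spell "us-bscii" in us-bscii come out as "us-ascii":

```python
import re

# Hypothetical us-bscii codec: us-ascii with the byte values of 'a' and 'b' swapped.
SWAP = bytes.maketrans(b"ab", b"ba")

def bscii_encode(text: str) -> bytes:
    """Encode ASCII-only text in the imaginary us-bscii encoding."""
    return text.encode("ascii").translate(SWAP)

def sniff(data: bytes) -> str:
    """Toy sniffer: a UTF-8 BOM wins, then an ASCII-readable XML declaration."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    m = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']', data)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"  # default when neither a BOM nor a declaration is present

text = '<?xml version="1.0" encoding="us-bscii"?><data>abba</data>'
doc = bscii_encode(text)
print(sniff(doc))  # the declaration bytes read as "us-ascii"
```

The sniffer never looks past the declaration, so nothing tells it that the element content decodes to the wrong letters.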

> XML describes normative "fatal error" situations related to encoding:
> 1. When external encoding info is absent: a) A processor fed with an
> entity whose encoding differs from the info in the XML declaration.

This is not actually testable: a mismatched encoding will at best produce
an error of the kind described in 4 below.

>    b) If BOM and XML encoding declaration are lacking too: feeding a
>    processor with an entity which isn't UTF-8 encoded.

Again, only testable if non-UTF-8 bytes are found.
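A small Python sketch of what "testable" means here, assuming the only check available is a strict UTF-8 decode (the helper name `utf8_violation` is made up for illustration). An ISO-8859-1 entity is caught only when it happens to contain a byte sequence that is illegal in UTF-8:

```python
def utf8_violation(data: bytes):
    """Return the offset of the first byte that is illegal in UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

# ISO-8859-1 entities with no BOM and no encoding declaration.
ascii_only = "<a>plain</a>".encode("latin-1")   # pure ASCII range
accented = "<a>caf\u00e9</a>".encode("latin-1")  # contains the lone byte 0xE9
lookalike = "<a>\u00c3\u00a9</a>".encode("latin-1")  # bytes 0xC3 0xA9, valid UTF-8

for entity in (ascii_only, accented, lookalike):
    print(entity, utf8_violation(entity))
```

Only the second entity is detectable. The first is byte-for-byte valid UTF-8, and the third decodes without complaint to a different character than its author intended.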

> 2. To not have the XML declaration as the very first part of
> the entity. (Example: A UTF-8 encoded doc with a BOM and an XML
> declaration, but which for some reason is read as ISO-8859-1. Only
> Opera allows the user to, this way, place the parser in 'fatal error'
> mode.)
> 3. A parser presented with an encoding it is unable to handle

That can only happen if the encoding declaration, HTTP header, or other
high-level protocol contains something the parser can't identify.
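As an illustration only, using Python's codec registry as a stand-in for whatever table of encodings a given parser supports (the helper `can_handle` is hypothetical), the check amounts to looking up the label that came from the declaration or the protocol:

```python
import codecs

def can_handle(label: str) -> bool:
    """True if this runtime has a codec registered under the given label."""
    try:
        codecs.lookup(label)
    except LookupError:
        return False
    return True

for label in ("utf-8", "ISO-8859-1", "us-bscii"):
    print(label, can_handle(label))
```

The label is all the parser has to go on; an unrecognised name fails the lookup before any content is decoded.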

> 4. Discovering byte sequences that are illegal in the current encoding

See above.

> 5. Unless higher level protocol defines the encoding, and unless the
> document is in UTF-8 or UTF-16 (so "UTF-16LE" is not covered!), then
> it is an error to not have an encoding declaration.


Received on Wednesday, 8 June 2011 03:10:05 UTC