Who's on top -- charset param or BOM or encoding declaration from Henry S. Thompson on 2013-10-04 (public-xml-core-wg@w3.org from October 2013)

From: Henry S. Thompson <ht@inf.ed.ac.uk>
Date: Fri, 04 Oct 2013 21:55:04 +0100
To: public-xml-core-wg@w3.org
Cc: Richard Tobin <richard@inf.ed.ac.uk>
Message-ID: <f5bsiwgkb87.fsf@troutbeck.inf.ed.ac.uk>
The XML spec. distinguishes two cases wrt encoding issues: with
vs. without external encoding information (i.e., for my purposes, a
Content-Type with an XML media type and a charset parameter in an HTTP
response header).

In the 'with external information' (EI) case the normative
spec. says very little, and the non-normative Appendix F says "defer
to 3023 or successor".

Existing tool behaviour and some precedents suggest that wrt the
question of which source is authoritative in the EI case the answer
should be as follows:

   1) A BOM, if present, is authoritative;
   2) In the absence of a BOM, the charset parameter is authoritative.
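
As a minimal sketch of (1) and (2), the precedence could be written as
below.  The names sniff_bom and authoritative_encoding are made up for
illustration and do not come from the XML spec. or any tool; the sketch
also assumes that only UTF-8 and UTF-16 BOMs matter (UTF-32 is left out).

```python
import codecs

# BOM signatures for the two encodings every XML processor must support.
# UTF-32 BOMs are deliberately omitted to keep the sketch small.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for signature, name in _BOMS:
        if data.startswith(signature):
            return name
    return None


def authoritative_encoding(data: bytes, charset=None):
    """Rule 1: a BOM, if present, wins.  Rule 2: otherwise the charset does."""
    return sniff_bom(data) or charset
```

When neither a BOM nor a charset parameter is present the function
returns None, which corresponds to the 'without EI' rules of the spec.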

This leaves at least two questions for the XML spec., however: 

  a) Is it an error if there is a BOM and there is a charset
     parameter, and they conflict?

  b) Is it an error if, in the absence of a BOM, there is a charset
     parameter and there is an encoding declaration and they conflict?

Here's what the spec. says today: a tabulation of every normative
requirement and/or error in section 4.3.3.  The first set (A-H) always
applies, the second (W-Z) only in the absence of EI:

 A) All XML processors *must* be able to read entities in both the
    UTF-8 and UTF-16 encodings

 B) XML processors *must* be able to use [a BOM] to differentiate
    between UTF-8 and UTF-16 encoded documents

 C) If the replacement text of an external entity is to begin with the
    character U+FEFF, and no text declaration is present, then a Byte
    Order Mark *must* be present

 D) It is a *fatal error* for a TextDecl to occur other than at the
    beginning of an external entity

 E) [from the BNF] It is a *fatal error* for an XMLDecl to occur other
    than at the beginning of an entity

 F) It is a *fatal error* when an XML processor encounters an entity
    with an encoding that it is unable to process

 G) It is a fatal error if an XML entity is determined (via default,
    encoding declaration, or higher-level protocol) to be in a
    certain encoding but contains byte sequences that are not legal
    in that encoding

 H) [I]t is a fatal error if an entity encoded in UTF-8 contains any
    ill-formed code unit sequences

------

 W) [E]ntities which are stored in an encoding other than UTF-8
    or UTF-16 *must* [contain] an encoding declaration

 X) [I]t is a *fatal error* for an entity including an encoding
    declaration to be presented to the XML processor in an encoding
    other than that named in the declaration

 XX) [Theorem derived from B, D|E and X] It is a *fatal error* for an
     XML entity beginning with a BOM to declare an encoding other than
     that implied by the BOM

 Y) [It is a *fatal error*] for an entity which begins with neither a
    Byte Order Mark nor an encoding declaration to use an encoding
    other than UTF-8

 Z) [It is a] *fatal error* if an XML entity contains no encoding
    declaration and its content is not legal UTF-8 or UTF-16

(Note that (B) and (C) are both consistent with proposal (1))

But what is relevant for the WG is whether there is anything we should
do wrt W-Z.  That is, should there be new errors parallel to some or
all of W-Z when there _is_ EI?

Here's what they would look like:

 W') [E]ntities without an encoding declaration which are delivered in
     an encoding other than UTF-8 or UTF-16 *must* provide a charset
     parameter

This is a sensible constraint on transcoders, including non-XML-aware
transcoders -- if you transcode out of UTF-..., you *must* tell me to
what.

 X') [I]t is a *fatal error* for an entity including an encoding
     declaration to be presented to the XML processor with a charset
     parameter other than that named in the declaration

The existing 3023bis draft includes something like this.  I don't
think it can be retained, because it conflicts with W' for
non-XML-aware transcoders: doing the responsible thing would produce
documents which conflict with X'.  Note there is _no way_ to always
win if the charset and the encoding decl disagree: there are plausible
scenarios in which either one might be 'right'.
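
To make the transcoder scenario concrete, here is a small sketch.  The
helper blind_transcode and the sample document are hypothetical; the
point is only that a transcoder which does not look inside the entity
leaves the encoding declaration untouched, so an honest charset
parameter ends up disagreeing with it.

```python
import re


def blind_transcode(data: bytes, source: str, target: str):
    """Re-encode the bytes without parsing them; return (new_bytes, charset)."""
    return data.decode(source).encode(target), target


# A document that correctly declares the encoding it is stored in.
original = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")

# The transcoder does the responsible thing per W': it reports the new encoding.
delivered, charset = blind_transcode(original, "iso-8859-1", "utf-8")

# The declaration inside the document still names the old encoding.
declared = re.search(rb'encoding="([^"]+)"', delivered).group(1).decode("ascii")

print(charset, declared)
```

Here the charset parameter is the 'right' one and the declaration is
stale, yet X' would make the delivered document a fatal error.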

 XX') It is a *fatal error* for an XML entity beginning with a BOM to
      declare an encoding other than that implied by the BOM.

That is, the BOM is authoritative regardless of whether EI is present
or not.

 Y') [leads to a generalization of W'] Non-UTF-8-encoded entities
     without an encoding declaration *must* be delivered with a
     charset parameter and/or (in the case of UTF-16) a BOM

 Z') [Follows from Y']

So, my tentative conclusion is that I will add something like Y' to
3023bis, and also replace the existing language in 3023bis which
amounts to X' with something that notes that it's not an error per the
XML spec. if there's a conflict between the charset param and the
encoding decl, and that the charset param takes precedence in such a
case.

XX' would be a useful amendment to XML itself.

Questions/comments?  Sorry this is so long!

Ah, and I'm going to make it longer.  Here's a complete analysis by
cases:

The 'encoding' column gives the encoding a conformant processor will
process the document in, given the above and the three previous
columns.

'-' means not specified.  'E' means some specified encoding.  'B'
means some particular BOM.  'F' means some specified encoding distinct
from E.  'C' means a specified encoding distinct from 'B'. 'NUE' means
some specified non-Unicode encoding.

'reason' gives the constraint(s) above which is/are violated in the
case of 'status' err (i.e. the document is not well-formed, NWF).

'UTF-8?' in the 'BOM' column means the same result obtains with or
without a UTF-8 BOM.

   charset   BOM    enc decl.  encoding     status   reason
     -        -        -         UTF-8        OK
     -        B        -           B          OK       
     -        -        E           E          OK
     -        B        B           B          OK
     -        B        C           B          err      XX

All of this is distinct from the encoding actually _used_ in the document.
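
The table above (the case without EI) can be read as a small decision
function.  This is only a sketch: resolve_without_ei is a hypothetical
name, and it assumes the BOM and the encoding declaration have already
been reduced to comparable labels (or None when absent).

```python
def resolve_without_ei(bom=None, decl=None):
    """Return (encoding, status, reason) for an entity with no charset parameter.

    bom and decl are already-normalized encoding labels, or None if absent.
    """
    if bom is not None:
        if decl is not None and decl != bom:
            return bom, "err", "XX"  # declaration contradicts the BOM
        return bom, "OK", ""
    if decl is not None:
        return decl, "OK", ""
    return "UTF-8", "OK", ""  # default when nothing is specified


for row in [(None, None), ("UTF-16", None), (None, "ISO-8859-1"),
            ("UTF-16", "UTF-16"), ("UTF-16", "UTF-8")]:
    print(row, resolve_without_ei(*row))
```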

In all the cases marked 'OK' above, and similarly for those below, a
document may still be NWF for reason G, if the encoding actually used
differs from 'encoding' and the document contains enough characters to
provoke at least some "illegal sequence" errors from a decoder for
'encoding' (if any such characters exist).
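
A quick illustration of reason G, as a sketch only (check_legal and the
sample bytes are made up for this example): a document stored in
ISO-8859-1 but processed as UTF-8 fails as soon as it contains a
non-ASCII character, while a pure-ASCII document never provokes the
error.

```python
def check_legal(data: bytes, encoding: str) -> bool:
    """True if every byte sequence in data is legal in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Actually encoded in ISO-8859-1; the e-acute becomes the single byte 0xE9.
latin1_doc = '<?xml version="1.0"?><p>caf\u00e9</p>'.encode("iso-8859-1")
ascii_doc = b'<?xml version="1.0"?><p>cafe</p>'

print(check_legal(latin1_doc, "utf-8"))       # 0xE9 followed by '<' is not legal UTF-8
print(check_legal(latin1_doc, "iso-8859-1"))  # legal in the encoding actually used
print(check_legal(ascii_doc, "utf-8"))        # ASCII-only content is legal either way
```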

   charset   BOM    enc decl.  encoding     status   reason
     B        B        -           B          OK
*    E        B        -           B          OK      BOM is authoritative
*    E        B        B           B          OK      BOM is authoritative
     E        B        C           B         err      XX'
     E        -        -           E          OK
     E        -        E           E          OK
*    E        -        F           E          OK      charset > encoding
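
The table above (the case with EI) can be sketched the same way.  As
before, resolve_with_ei is a hypothetical name and the inputs are
assumed to be already-normalized labels; the third element of the
result is the rule name for an error, or a note for the rows marked *
where the sources conflict without an error.

```python
def resolve_with_ei(charset, bom=None, decl=None):
    """Return (encoding, status, note) for an entity delivered with a charset.

    charset, bom and decl are normalized encoding labels; bom and decl may
    be None when absent.
    """
    if bom is not None:
        if decl is not None and decl != bom:
            return bom, "err", "XX'"  # declaration contradicts the BOM
        if charset != bom:
            return bom, "OK", "BOM is authoritative"  # a * row
        return bom, "OK", ""
    if decl is not None and decl != charset:
        return charset, "OK", "charset > encoding"  # the last * row
    return charset, "OK", ""


print(resolve_with_ei("ISO-8859-1", bom="UTF-16"))
print(resolve_with_ei("ISO-8859-1", decl="UTF-8"))
```

A non-empty note on an 'OK' result marks exactly the places where a
processor could choose to issue a warning.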

We might want to say something about warnings in the * cases, where
there is conflicting information but no error, on my proposal.

We could choose to make the last line an error, but neither xmllint,
nor any browser I have tested, reports an error in this case.

xmllint, IE, Firefox and Chrome all prefer the charset.  If a non-E
sequence occurs in the document in this case, they _do_ either
complain (FF, IE, xmllint) or silently do the wrong thing (Chrome); see
http://www.ltg.ed.ac.uk/~ht/ov-test/mtc.xml.  Opera is the
odd one out -- it prefers the encoding decl., and so does _not_
complain.

ht
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]
Received on Friday, 4 October 2013 20:55:32 UTC