- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Fri, 04 Oct 2013 21:55:04 +0100
- To: public-xml-core-wg@w3.org
- Cc: Richard Tobin <richard@inf.ed.ac.uk>
The XML spec. distinguishes two cases wrt encoding issues: with
vs. without external encoding information (i.e., for my purposes, a
Content-Type with an XML media type and a charset parameter in an
HTTP response header).
In the 'with external information' (EI) case the normative
spec. says very little, and the non-normative Appendix F says "defer
to 3023 or successor".
Existing tool behaviour and some precedents suggest that wrt the
question of which source is authoritative in the EI case the answer
should be as follows:
1) A BOM, if present, is authoritative;
2) In the absence of a BOM, the charset parameter is authoritative.
This leaves at least two questions for the XML spec., however:
a) Is it an error if there is a BOM and there is a charset
parameter, and they conflict?
b) Is it an error if, in the absence of a BOM, there is a charset
parameter and there is an encoding declaration and they conflict?
Here's what the spec. says today: a tabulation of every normative
requirement and/or error in section 4.3.3. The first set (A-H) always
applies; the second (W-Z) applies only in the absence of EI:
A) All XML processors *must* be able to read entities in both the
UTF-8 and UTF-16 encodings
B) XML processors *must* be able to use [a BOM] to differentiate
between UTF-8 and UTF-16 encoded documents
C) If the replacement text of an external entity is to begin with the
character U+FEFF, and no text declaration is present, then a Byte
Order Mark *must* be present
D) It is a *fatal error* for a TextDecl to occur other than at the
beginning of an external entity
E) [from the BNF] It is a *fatal error* for an XMLDecl to occur other
than at the beginning of an entity
F) It is a *fatal error* when an XML processor encounters an entity
with an encoding that it is unable to process
G) It is a fatal error if an XML entity is determined (via default,
encoding declaration, or higher-level protocol) to be in a
certain encoding but contains byte sequences that are not legal
in that encoding
H) [I]t is a fatal error if an entity encoded in UTF-8 contains any
ill-formed code unit sequences
------
W) [E]ntities which are stored in an encoding other than UTF-8
or UTF-16 *must* [contain] an encoding declaration
X) [I]t is a *fatal error* for an entity including an encoding
declaration to be presented to the XML processor in an encoding
other than that named in the declaration
XX) [Theorem derived from B, D|E and X] It is a *fatal error* for an
XML entity beginning with a BOM to declare an encoding other than
that implied by the BOM
Y) [It is a *fatal error*] for an entity which begins with neither a
Byte Order Mark nor an encoding declaration to use an encoding
other than UTF-8
Z) [It is a] *fatal error* if an XML entity contains no encoding
declaration and its content is not legal UTF-8 or UTF-16
(Note that (B) and (C) are both consistent with proposal (1))
But what is relevant for the WG is whether there is anything we should
do wrt W--Z. That is, should there be new errors parallel to some or
all of W-Z when there _is_ EI?
Here's what they would look like:
W') [E]ntities without an encoding declaration which are delivered in
an encoding other than UTF-8 or UTF-16 *must* provide a charset
parameter
This is a sensible constraint on transcoders, including non-XML-aware
transcoders -- if you transcode out of UTF-..., you *must* tell me to
what.
X') [I]t is a *fatal error* for an entity including an encoding
declaration to be presented to the XML processor with a charset
parameter other than that named in the declaration
The existing 3023bis draft includes something like this. I don't
think it can be retained, because it conflicts with W' for
non-XML-aware transcoders: doing the responsible thing would produce
documents which conflict with X'. Note there is _no way_ to always
win if the charset and the encoding decl disagree: there are plausible
scenarios in which either one might be 'right'.
XX') It is a *fatal error* for an XML entity beginning with a BOM to
declare an encoding other than that implied by the BOM.
That is, the BOM is authoritative regardless of whether EI is present
or not.
Y') [leads to a generalization of W'] Non-UTF-8-encoded entities
without an encoding declaration *must* be delivered with a
charset parameter and/or (in the case of UTF-16) a BOM
Z') [Follows from Y']
So, my tentative conclusion is that I will add something like Y' to
3023bis, and also replace the existing language in 3023bis which
amounts to X' with something that notes that it's not an error per the
XML spec. if there's a conflict between the charset param and the
encoding decl, and that the charset param takes precedence in such a
case.
XX' would be a useful amendment to XML itself.
Questions/comments? Sorry this is so long!
Ah, and I'm going to make it longer. Here's a complete analysis by
cases:
The 'encoding' column gives the encoding a conformant processor will
process the document in, given the above and the three previous
columns.
'-' means not specified. 'E' means some specified encoding. 'B'
means some particular BOM. 'F' means some specified encoding distinct
from E. 'C' means a specified encoding distinct from 'B'. 'NUE' means
some specified non-Unicode encoding.
'reason' gives the constraint(s) above which is/are violated in the
case of 'status' err (i.e. not well-formed, NWF).
'UTF-8?' in the 'BOM' column means the same result obtains with or
without a UTF-8 BOM.
   charset  BOM  enc decl.  encoding  status  reason
   -        -    -          UTF-8     OK
   -        B    -          B         OK
   -        -    E          E         OK
   -        B    B          B         OK
   -        B    C          B         err     XX
All of this is distinct from the encoding actually _used_ in the document.
In all the cases marked 'OK' above, similarly for those below, a
document may be NWF for reason G, if it contains a sufficient range of
characters encoded in the encoding actually used so as to provoke at
least some "illegal sequence" errors from a decoder for 'encoding' (if
any such characters exist).
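Reason G can be illustrated with a minimal Python sketch. The helper name
check_legal is hypothetical, and Python's strict codecs stand in for
whatever decoder a real XML processor would use, so this is only an
approximation of processor behaviour:

```python
def check_legal(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly in `encoding`.

    A False result corresponds to the reason-G fatal error: the entity
    was determined to be in `encoding` but contains byte sequences that
    are not legal in it. (Hypothetical helper, not from any XML library.)
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Bytes actually written in ISO-8859-1 but processed as UTF-8:
latin1_bytes = "café".encode("iso-8859-1")       # b'caf\xe9'
print(check_legal(latin1_bytes, "utf-8"))         # False: illegal sequence

# Pure-ASCII content never provokes the error, whatever was intended:
print(check_legal(b"cafe", "utf-8"))              # True

# The reverse mismatch is silent: every byte is legal ISO-8859-1,
# so UTF-8 bytes processed as ISO-8859-1 decode without complaint.
utf8_bytes = "café".encode("utf-8")              # b'caf\xc3\xa9'
print(check_legal(utf8_bytes, "iso-8859-1"))      # True (but mojibake)
```

The last case is why the "(if any such characters exist)" caveat matters:
for some encodings no byte sequence is illegal, so a mismatch can pass
undetected.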
   charset  BOM  enc decl.  encoding  status  reason
   B        B    -          B         OK
 * E        B    -          B         OK      BOM is authoritative
 * E        B    B          B         OK      BOM is authoritative
   E        B    C          B         err     XX'
   E        -    -          E         OK
   E        -    E          E         OK
 * E        -    F          E         OK      charset > encoding
We might want to say something about warnings in the * cases, where
there is conflicting information but no error, on my proposal.
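The case analysis in the two tables above can be sketched as a small
Python function. This is only an illustration of the proposal, not code
from any existing processor: the names Result, resolve and _same are
hypothetical, a BOM is represented by the name of the encoding it
implies, and encoding names are assumed to be comparable after simple
case-folding (real label matching is more involved).

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    encoding: str   # encoding a conformant processor would use
    status: str     # "OK" or "err"
    reason: str     # constraint violated, or the note for a '*' case


def _same(a: str, b: str) -> bool:
    # Assumption: case-insensitive comparison is enough for this sketch.
    return a.strip().lower() == b.strip().lower()


def resolve(charset: Optional[str],
            bom: Optional[str],
            decl: Optional[str]) -> Result:
    """Apply the proposed precedence: BOM, then charset, then enc decl."""
    if bom is not None:
        # A BOM, if present, is authoritative.
        if decl is not None and not _same(decl, bom):
            return Result(bom, "err", "XX'" if charset is not None else "XX")
        if charset is not None and not _same(charset, bom):
            return Result(bom, "OK", "BOM is authoritative")   # '*' case
        return Result(bom, "OK", "")
    if charset is not None:
        # No BOM: the charset parameter is authoritative.
        if decl is not None and not _same(decl, charset):
            return Result(charset, "OK", "charset > encoding")  # '*' case
        return Result(charset, "OK", "")
    if decl is not None:
        return Result(decl, "OK", "")
    return Result("UTF-8", "OK", "")


print(resolve(None, None, None))
print(resolve(None, "UTF-16", "UTF-8"))
print(resolve("ISO-8859-1", "UTF-8", None))
print(resolve("ISO-8859-1", None, "UTF-8"))
```

A non-empty reason on an "OK" result marks one of the * rows, which is
where a warning could be issued if we decide to recommend one.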
We could choose to make the last line an error, but neither xmllint,
nor any browser I have tested, reports an error in this case.
xmllint, IE, Firefox and Chrome all prefer the charset, and _do_
either complain (FF, IE, xmllint) or silently do the wrong thing
(Chrome) in this case if a non-E sequence occurs in the document (see
http://www.ltg.ed.ac.uk/~ht/ov-test/mtc.xml). Opera is the
odd-one-out -- it prefers the encoding decl., and so does _not_
complain.
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
Received on Friday, 4 October 2013 20:55:32 UTC