- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Fri, 04 Oct 2013 21:55:04 +0100
- To: public-xml-core-wg@w3.org
- Cc: Richard Tobin <richard@inf.ed.ac.uk>
The XML spec. distinguishes two cases wrt encoding issues: with vs. without external (i.e. for my purposes a Content-Type with an XML media type and a charset parameter in an HTTP Response Header) encoding information.

In the 'with external information (EI)' case the normative spec. says very little, and the non-normative Appendix F says "defer to 3023 or successor".

Existing tool behaviour and some precedents suggest that wrt the question of which source is authoritative in the EI case the answer should be as follows:

  1) A BOM, if present, is authoritative;

  2) In the absence of a BOM, the charset parameter is authoritative.

This leaves at least two questions for the XML spec., however:

  a) Is it an error if there is a BOM and there is a charset parameter, and they conflict?

  b) Is it an error if, in the absence of a BOM, there is a charset parameter and there is an encoding declaration and they conflict?

Here's what the spec. says today, a tabulation of every normative requirement and/or error in section 4.3.3.
The first set (A-H) always apply, the second (W-Z) only in the absence of EI:

  A) All XML processors *must* be able to read entities in both the UTF-8 and UTF-16 encodings

  B) XML processors *must* be able to use [a BOM] to differentiate between UTF-8 and UTF-16 encoded documents

  C) If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark *must* be present

  D) It is a *fatal error* for a TextDecl to occur other than at the beginning of an external entity

  E) [from the BNF] It is a *fatal error* for an XMLDecl to occur other than at the beginning of an entity

  F) It is a *fatal error* when an XML processor encounters an entity with an encoding that it is unable to process

  G) It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding

  H) [I]t is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences

  ------

  W) [E]ntities which are stored in an encoding other than UTF-8 or UTF-16 *must* [contain] an encoding declaration

  X) [I]t is a *fatal error* for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration

  XX) [Theorem derived from B, D|E and X] It is a *fatal error* for an XML entity beginning with a BOM to declare an encoding other than that implied by the BOM

  Y) [It is a *fatal error*] for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8

  Z) [It is a] *fatal error* if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16

(Note that (B) and (C) are both consistent with proposal (1))

But what is relevant for the WG is whether there is anything we should do wrt W--Z.
That is, should there be new errors parallel to some or all of W-Z when there _is_ EI? Here's what they would look like:

  W') [E]ntities without an encoding declaration which are delivered in an encoding other than UTF-8 or UTF-16 *must* provide a charset parameter

      This is a sensible constraint on transcoders, including non-XML-aware transcoders -- if you transcode out of UTF-..., you *must* tell me to what.

  X') [I]t is a *fatal error* for an entity including an encoding declaration to be presented to the XML processor with a charset parameter other than that named in the declaration

      The existing 3023bis draft includes something like this. I don't think it can be retained, because it conflicts with W' for non-XML-aware transcoders: doing the responsible thing would produce documents which conflict with X'. Note there is _no way_ to always win if the charset and the encoding decl disagree: there are plausible scenarios in which either one might be 'right'.

  XX') It is a *fatal error* for an XML entity beginning with a BOM to declare an encoding other than that implied by the BOM.

      That is, the BOM is authoritative regardless of whether EI is present or not.

  Y') [leads to a generalization of W'] Non-UTF-8-encoded entities without an encoding declaration *must* be delivered with a charset parameter and/or (in the case of UTF-16) a BOM

  Z') [Follows from Y']

So, my tentative conclusion is that I will add something like Y' to 3023bis, and also replace the existing language in 3023bis which amounts to X' with something that notes that it's not an error per the XML spec. if there's a conflict between the charset param and the encoding decl, and that the charset param takes precedence in such a case.

XX' would be a useful amendment to XML itself.

Questions/comments? Sorry this is so long!

Ah, and I'm going to make it longer.
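As a rough illustration of the precedence proposed above (not part of any spec text), the following Python sketch picks the encoding a processor would use from the three sources. The names `sniff_bom`, `resolve` and `Decision` are hypothetical, invented for this example. The sketch also assumes that encoding labels can be compared literally after lowercasing, so it does not reconcile a label such as UTF-16LE with a UTF-16 BOM, and it ignores UTF-32 BOMs.

```python
import codecs
from typing import NamedTuple, Optional

# BOM byte sequences and the encoding family each one implies.
# Assumption: UTF-32 BOMs are out of scope for this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


class Decision(NamedTuple):
    encoding: str  # encoding the processor would decode with
    status: str    # "OK" or "err"
    reason: str    # violated constraint ("XX" or "XX'"), else ""


def _norm(label: Optional[str]) -> Optional[str]:
    # Assumption: labels are compared literally after trimming and lowercasing.
    return label.strip().lower() if label else None


def resolve(charset: Optional[str], bom: Optional[str],
            enc_decl: Optional[str]) -> Decision:
    """Apply the proposed order: BOM, then charset, then encoding decl, then UTF-8."""
    charset, bom, enc_decl = _norm(charset), _norm(bom), _norm(enc_decl)
    if bom is not None:
        # XX / XX': a BOM and a conflicting encoding declaration is an error,
        # whether or not external information is present.
        if enc_decl is not None and enc_decl != bom:
            return Decision(bom, "err", "XX'" if charset else "XX")
        return Decision(bom, "OK", "")
    if charset is not None:
        # The charset parameter wins over a conflicting declaration; no error.
        return Decision(charset, "OK", "")
    if enc_decl is not None:
        return Decision(enc_decl, "OK", "")
    return Decision("utf-8", "OK", "")
```

For instance, `resolve("iso-8859-1", None, "utf-8")` selects the charset parameter without reporting an error, which is the behaviour of the tools that prefer the charset.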
Here's a complete analysis by cases:

The 'encoding' column gives the encoding a conformant processor will process the document in, given the above and the three previous columns.

  '-' means not specified.
  'E' means some specified encoding.
  'B' means some particular BOM.
  'F' means some specified encoding distinct from E.
  'C' means a specified encoding distinct from 'B'.
  'NUE' means some specified non-Unicode encoding.
  'reason' gives the constraint(s) above which is/are violated in the case of 'status' NWF.
  'UTF-8?' in the 'BOM' column means the same result obtains with or without a UTF-8 BOM.

    charset  BOM  enc decl.  encoding  status  reason
    -        -    -          UTF-8     OK
    -        B    -          B         OK
    -        -    E          E         OK
    -        B    B          B         OK
    -        B    C          B         err     XX

All of this is distinct from the encoding actually _used_ in the document. In all the cases marked 'OK' above, similarly for those below, a document may be NWF for reason G, if it contains a sufficient range of characters encoded in the encoding actually used so as to provoke at least some "illegal sequence" errors from a decoder for 'encoding' (if any such characters exist).

    charset  BOM  enc decl.  encoding  status  reason
    B        B    -          B         OK
  * E        B    -          B         OK      BOM is authoritative
  * E        B    B          B         OK      BOM is authoritative
    E        B    C          B         err     XX'
    E        -    -          E         OK
    E        -    E          E         OK
  * E        -    F          E         OK      charset > encoding

We might want to say something about warnings in the * cases, where there is conflicting information but no error, on my proposal.

We could choose to make the last line an error, but neither xmllint, nor any browser I have tested, reports an error in this case. xmllint, IE, Firefox and Chrome all prefer the charset, and _do_ either complain (FF, IE, xmllint) or silently do the wrong thing (Chrome) in this case if a non-E sequence occurs in the document (see http://www.ltg.ed.ac.uk/~ht/ov-test/mtc.xml). Opera is the odd-one-out -- it prefers the encoding decl., and so does _not_ complain.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
Received on Friday, 4 October 2013 20:55:32 UTC