- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Thu, 14 Jul 2011 17:57:45 +0100
- To: www-tag@w3.org
- Cc: public-xml-processing-model-comments@w3.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[resending with correct references -- thank you, Ashok]

I took an action some time ago to review the discussion in section 9.2
of the HTML5 spec. [1] in regard to how external entity processing
during XML DOCTYPE statement parsing is specified.

This section contains the following:

  "This specification provides the following additional information
   that user agents should use when retrieving an external entity: the
   public identifiers given in the following list all correspond to
   _the URL given by this link_.

     -//W3C//DTD XHTML 1.0 Transitional//EN
     -//W3C//DTD XHTML 1.1//EN
     -//W3C//DTD XHTML 1.0 Strict//EN
     -//W3C//DTD XHTML 1.0 Frameset//EN
     -//W3C//DTD XHTML Basic 1.0//EN
     -//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
     -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
     -//W3C//DTD MathML 2.0//EN
     -//WAPFORUM//DTD XHTML Mobile 1.0//EN

  "Furthermore, user agents should attempt to retrieve the above
   external entity's content when one of the above public identifiers
   is used, and should not attempt to retrieve any other external
   entity's content." [emphasis added]

The "URL given by this link" is a data: URI which resolves to a string
of 2125 entity declarations.

This amounts, as far as I can see, to suggesting (note the use of
'should' throughout (recall that the HTML5 spec. does not
typographically distinguish RFC2119 language, all uses of 'must',
'should' etc. are normative unless explicitly noted to the contrary))
that the XML parser invoked by a user agent for XHTML documents should

  a) Use a catalog [2] which maps all the above public identifiers to
     the given fixed string;

  b) Not otherwise process the external subset at all.

In principle, there's a lot to recommend this approach. It would
evidently solve a bunch of interop problems, and drastically reduce
the load on W3C servers.

In practice, it leaves open a number of questions, which I think need
to be addressed:

  1) Why 'should' and not 'must'? If ensuring interop is the goal
     here, surely we want user agents all to just _do_ this. . .

  2) Why not a number of other public identifiers? For example,

       -//W3C//DTD XHTML Basic 1.0//EN
       -//W3C//DTD SVG 1.0//EN
       -//W3C//DTD SVG 1.1//EN
       -//W3C//MathML 1.0//EN

  3) What exactly is that list of entities? How would I know if
     there was a mistake of omission?

  4) What about the _internal_ subset? Should it be processed
     (consistent with the catalog story) or not (consistent with what
     the XML spec. says processors may do, since the external subset
     is "a special kind of external entity", and non-validating XML
     processors may stop 'processing' the internal subset once they
     choose not to read an external entity)?

  5) What if the XML declaration for the document at hand includes
     "standalone='no'" (or no standalone, which the XML spec. requires
     to be interpreted as 'no')?

(Note that as it stands Polyglot [3] does not allow either an XML
declaration or an internal subset).

It seems to me the interoperability of existing XHTML toolchains and
HTML5 user agents is implicated by one or more of the above -- what
should the TAG say, and to whom? Should the TAG and the XML
Processing Model WG work together to define a Processor Profile [4]
which could be referenced normatively in section 9.2 of the HTML5
spec.?

ht

[1] http://www.w3.org/TR/2011/WD-html5-20110525/the-xhtml-syntax.html#parsing-xhtml-documents [Last Call WD]
[2] http://www.oasis-open.org/committees/entity/spec-2001-08-06.html
[3] http://www.w3.org/TR/2011/WD-html-polyglot-20110525/
[4] http://www.w3.org/TR/xml-proc-profiles/
- --
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFOHyAJkjnJixAXWBoRApUCAJ9iaxkl0K46nvBfKvHmAmhVmSYJQwCeJQ3e
XZmXCGcHVXPrr7qNRBaeiq4=
=3Mv0
-----END PGP SIGNATURE-----
Received on Thursday, 14 July 2011 16:58:20 UTC
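
The catalog behaviour described in points a) and b) of the message can be illustrated with a small lookup policy. This is only a sketch under stated assumptions, not what any user agent or the HTML5 spec. actually implements: the function name `resolve_external_entity` and the constant `ENTITY_SET_URL` are hypothetical, and the URL value is a stand-in for the data: URI the spec. links to, which is not reproduced here. Whitespace normalization of the public identifier follows the usual XML Catalogs convention and is likewise an assumption about how a parser would match.

```python
from typing import Optional

# Public identifiers quoted from section 9.2 of the HTML5 Last Call WD.
KNOWN_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML Basic 1.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN",
    "-//W3C//DTD MathML 2.0//EN",
    "-//WAPFORUM//DTD XHTML Mobile 1.0//EN",
})

# Hypothetical stand-in for the fixed resource (a data: URI holding the
# entity declarations) that the spec. maps every listed identifier to.
ENTITY_SET_URL = "urn:example:html5-entity-declarations"


def resolve_external_entity(public_id: Optional[str],
                            system_id: Optional[str]) -> Optional[str]:
    """Return the resource to read for an external entity, or None.

    (a) Any listed public identifier maps to the one fixed resource.
    (b) Everything else is not retrieved at all; system_id is ignored,
        so no network fetch is ever triggered by the DOCTYPE.
    """
    if public_id is None:
        return None
    # Collapse runs of whitespace before matching (assumed normalization).
    normalized = " ".join(public_id.split())
    if normalized in KNOWN_PUBLIC_IDS:
        return ENTITY_SET_URL
    return None
```

Under this policy a DOCTYPE naming `-//W3C//DTD SVG 1.1//EN` yields `None`, which is exactly the situation question 2) asks about: entity references that depend on that external subset would go unresolved.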