- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Thu, 14 Jul 2011 17:57:45 +0100
- To: www-tag@w3.org
- Cc: public-xml-processing-model-comments@w3.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
[resending with correct references -- thank you, Ashok]
I took an action some time ago to review how section 9.2 of the HTML5
spec. [1] specifies external entity processing during parsing of the
XML DOCTYPE declaration.
This section contains the following:
"This specification provides the following additional information
that user agents should use when retrieving an external entity: the
public identifiers given in the following list all correspond to _the
URL given by this link_.
-//W3C//DTD XHTML 1.0 Transitional//EN
-//W3C//DTD XHTML 1.1//EN
-//W3C//DTD XHTML 1.0 Strict//EN
-//W3C//DTD XHTML 1.0 Frameset//EN
-//W3C//DTD XHTML Basic 1.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
-//W3C//DTD MathML 2.0//EN
-//WAPFORUM//DTD XHTML Mobile 1.0//EN
"Furthermore, user agents should attempt to retrieve the above
external entity's content when one of the above public identifiers
is used, and should not attempt to retrieve any other external
entity's content." [emphasis added]
The "URL given by this link" is a data: URI which resolves to a string
of 2125 entity declarations.
Note the use of 'should' throughout: the HTML5 spec. does not
typographically distinguish RFC2119 language, so all uses of 'must',
'should' etc. are normative unless explicitly noted to the contrary.
As far as I can see, then, this amounts to suggesting that the XML
parser invoked by a user agent for XHTML documents should:
a) Use a catalog [2] which maps all the above public identifiers to
the given fixed string;
b) Not otherwise process the external subset at all.
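To make (a) and (b) concrete, here is a minimal sketch in Python of the
lookup such a parser might perform. It is only an illustration of my
reading above, not anything the HTML5 spec. defines: the function name
is hypothetical, and FIXED_ENTITY_DECLARATIONS is a two-line stand-in
for the real data: URI content, which I have not reproduced.

```python
from typing import Optional

# The nine public identifiers quoted from section 9.2 above.
KNOWN_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML Basic 1.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN",
    "-//W3C//DTD MathML 2.0//EN",
    "-//WAPFORUM//DTD XHTML Mobile 1.0//EN",
})

# ASSUMPTION: illustrative stand-in only. The real resource is a much
# longer string of entity declarations.
FIXED_ENTITY_DECLARATIONS = (
    '<!ENTITY nbsp "&#160;">\n'
    '<!ENTITY eacute "&#233;">\n'
)


def resolve_external_subset(public_id: Optional[str]) -> Optional[str]:
    """Return the text to parse as the external subset, or None.

    (a) Every known public identifier maps to the same fixed string,
        whatever system identifier accompanies it.
    (b) Anything else yields None, meaning "do not retrieve or process
        the external subset at all".
    """
    if public_id in KNOWN_PUBLIC_IDS:
        return FIXED_ENTITY_DECLARATIONS
    return None
```

Note that the system identifier plays no part in the lookup, which is
what distinguishes this from ordinary retrieval of the external subset.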
In principle, there's a lot to recommend this approach. It would
evidently solve a bunch of interop problems, and drastically reduce
the load on W3C servers.
In practice, it leaves open a number of questions, which I think need
to be addressed:
1) Why 'should' and not 'must'?
If ensuring interop is the goal here, surely we want user agents
all to just _do_ this. . .
2) Why not a number of other public identifiers?
For example:
-//W3C//DTD XHTML Basic 1.0//EN
-//W3C//DTD SVG 1.0//EN
-//W3C//DTD SVG 1.1//EN
-//W3C//MathML 1.0//EN
3) What exactly is that list of entities? How would I know if an
entity had been omitted by mistake?
4) What about the _internal_ subset? Should it be processed
(consistent with the catalog story) or not (consistent with what
the XML spec. says processors may do, since the external subset is
"a special kind of external entity", and a non-validating XML
processor may stop 'processing' the internal subset once it
chooses not to read an external entity)?
5) What if the XML declaration for the document at hand includes
"standalone='no'" (or omits the standalone declaration, which the
XML spec. requires to be interpreted as 'no')?
(Note that as it stands Polyglot [3] does not allow either an XML
declaration or an internal subset).
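On question 3, one way to check for omissions mechanically would be to
pull the declared names out of the entity string and compare them with
whatever set one expects. The sketch below (Python) assumes such an
expected set is available from somewhere, e.g. compiled from the DTDs
behind the listed public identifiers; the function names and the sample
data are hypothetical, and the regular expression only handles simple
general entity declarations.

```python
import re
from typing import Iterable, Set

# Matches the name in a general entity declaration such as
# <!ENTITY nbsp "&#160;">. Parameter entities (<!ENTITY % name ...>)
# are deliberately not matched, since '%' cannot start a name here.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s')


def declared_entity_names(declarations: str) -> Set[str]:
    """Collect the names of all general entities declared in the text."""
    return set(ENTITY_DECL.findall(declarations))


def missing_entities(declarations: str, expected: Iterable[str]) -> Set[str]:
    """Return the expected names that the declarations fail to define."""
    return set(expected) - declared_entity_names(declarations)


# Illustrative data only, not the real entity list.
sample = '<!ENTITY nbsp "&#160;"><!ENTITY eacute "&#233;">'
omitted = missing_entities(sample, {"nbsp", "eacute", "hellip"})
```

This says nothing about where the authoritative expected set should
come from, which is the real substance of the question.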
It seems to me the interoperability of existing XHTML toolchains and
HTML5 user agents is implicated by one or more of the above -- what
should the TAG say, and to whom? Should the TAG and the XML
Processing Model WG work together to define a Processor Profile [4]
which could be referenced normatively in section 9.2 of the HTML5
spec.?
ht
[1] http://www.w3.org/TR/2011/WD-html5-20110525/the-xhtml-syntax.html#parsing-xhtml-documents [Last Call WD]
[2] http://www.oasis-open.org/committees/entity/spec-2001-08-06.html
[3] http://www.w3.org/TR/2011/WD-html-polyglot-20110525/
[4] http://www.w3.org/TR/xml-proc-profiles/
- --
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
iD8DBQFOHyAJkjnJixAXWBoRApUCAJ9iaxkl0K46nvBfKvHmAmhVmSYJQwCeJQ3e
XZmXCGcHVXPrr7qNRBaeiq4=
=3Mv0
-----END PGP SIGNATURE-----
Received on Thursday, 14 July 2011 16:58:19 UTC