XML parsing and external entities in HTML5 -- ACTION-440

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[resending with correct references -- thank you, Ashok]

I took an action some time ago to review section 9.2 of the HTML5
spec. [1], specifically how it specifies external entity processing
during parsing of the XML document type (DOCTYPE) declaration.

This section contains the following:

 "This specification provides the following additional information
  that user agents should use when retrieving an external entity: the
  public identifiers given in the following list all correspond to _the
  URL given by this link_.

    -//W3C//DTD XHTML 1.0 Transitional//EN
    -//W3C//DTD XHTML 1.1//EN
    -//W3C//DTD XHTML 1.0 Strict//EN
    -//W3C//DTD XHTML 1.0 Frameset//EN
    -//W3C//DTD XHTML Basic 1.0//EN
    -//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
    -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
    -//W3C//DTD MathML 2.0//EN
    -//WAPFORUM//DTD XHTML Mobile 1.0//EN

 "Furthermore, user agents should attempt to retrieve the above
  external entity's content when one of the above public identifiers
  is used, and should not attempt to retrieve any other external
  entity's content."  [emphasis added]

The "URL given by this link" is a data: URI which resolves to a string
of 2125 entity declarations.
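
For anyone wanting to inspect such a list, here is a minimal sketch of
how the declared entity names could be pulled out of a data: URI.  The
function name, the sample URI and the UTF-8 decoding are assumptions of
mine for illustration; the real URI from the spec. is not reproduced
here, and the regular expression is a rough scan, not an XML parser.

```python
import re
from urllib.parse import quote
from urllib.request import urlopen


def entity_names(data_url):
    """Return the names declared by <!ENTITY ...> in a data: URI, in order."""
    # urllib's default opener handles the data: scheme directly.
    with urlopen(data_url) as response:
        # Assumption: the payload is UTF-8 text.
        text = response.read().decode("utf-8")
    # Rough scan for general entity declarations.
    return re.findall(r"<!ENTITY\s+(\S+)\s", text)


# Hypothetical two-entry stand-in for the much longer string in the spec.
sample_decls = '<!ENTITY nbsp "&#160;"><!ENTITY eacute "&#233;">'
sample_url = "data:text/plain;charset=utf-8," + quote(sample_decls)

if __name__ == "__main__":
    print(entity_names(sample_url))
```

Comparing the resulting names against an independently maintained
entity list would be one way to look for omissions.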

Note the use of 'should' throughout.  The HTML5 spec. does not
typographically distinguish RFC2119 language: all uses of 'must',
'should' etc. are normative unless explicitly noted to the contrary.
As far as I can see, the quoted text amounts to suggesting that the
XML parser invoked by a user agent for XHTML documents should

  a) Use a catalog [2] which maps all the above public identifiers to
     the given fixed string;

  b) Not otherwise process the external subset at all.
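
As a rough illustration of (a) and (b), here is a sketch using Python's
expat binding.  It is only my reading of the behaviour, not anything
the spec. prescribes: the two-entity string is a stand-in for the real
fixed string, and parse_text and the two sample documents are
hypothetical names introduced for the example.

```python
import xml.parsers.expat as expat

# The public identifiers listed in section 9.2, as quoted above.
KNOWN_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML Basic 1.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN",
    "-//W3C//DTD MathML 2.0//EN",
    "-//WAPFORUM//DTD XHTML Mobile 1.0//EN",
})

# Assumption: a tiny stand-in for the fixed string of entity declarations.
FIXED_ENTITY_DECLS = '<!ENTITY nbsp "&#160;"><!ENTITY eacute "&#233;">'


def parse_text(document):
    """Parse `document` and return its character data as one string."""
    parser = expat.ParserCreate()
    # Ask expat to hand the external subset to our handler.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    chunks = []
    parser.CharacterDataHandler = chunks.append

    def external_entity_ref(context, base, system_id, public_id):
        if public_id in KNOWN_PUBLIC_IDS:
            # (a) Catalog hit: feed the fixed declarations to a child parser.
            child = parser.ExternalEntityParserCreate(context)
            child.Parse(FIXED_ENTITY_DECLS, True)
        # (b) Anything else is left unread; system_id is never dereferenced.
        return 1

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.Parse(document, True)
    return "".join(chunks)


XHTML_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>a&nbsp;b</p></body></html>'
)
SVG_DOC = (
    '<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" '
    '"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">'
    '<svg xmlns="http://www.w3.org/2000/svg"><text>a&nbsp;b</text></svg>'
)

if __name__ == "__main__":
    print(repr(parse_text(XHTML_DOC)))
    print(repr(parse_text(SVG_DOC)))
```

In this sketch the reference to nbsp is expanded for the XHTML
document, while for the SVG document the external subset is never read
and the reference is skipped without any network access.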

In principle, there's a lot to recommend this approach.  It would
evidently solve a bunch of interop problems, and drastically reduce
the load on W3C servers.

In practice, it leaves open a number of questions, which I think need
to be addressed:

 1) Why 'should' and not 'must'?

    If ensuring interop is the goal here, surely we want user agents
    all to just _do_ this. . .

 2) Why not a number of other public identifiers?

    For example, -//W3C//DTD XHTML Basic 1.0//EN
                 -//W3C//DTD SVG 1.0//EN
                 -//W3C//DTD SVG 1.1//EN
                 -//W3C//MathML 1.0//EN

 3) What exactly is that list of entities?  How would I know if there
    was a mistake of omission?

 4) What about the _internal_ subset?  Should it be processed
    (consistent with the catalog story) or not (consistent with what
    the XML spec. says processors may do, since the external subset is
    "a special kind of external entity", and a non-validating XML
    processor may stop 'processing' the internal subset once it
    chooses not to read an external entity)?

 5) What if the XML declaration for the document at hand includes
    "standalone='no'" (or omits standalone altogether, which the XML
    spec. requires to be interpreted as 'no')?

    (Note that as it stands Polyglot [3] does not allow either an XML
     declaration or an internal subset).

It seems to me that the interoperability of existing XHTML toolchains
and HTML5 user agents is affected by one or more of the above -- what
should the TAG say, and to whom?  Should the TAG and the XML
Processing Model WG work together to define a Processor Profile [4]
which could be referenced normatively in section 9.2 of the HTML5
spec.?

ht

[1] http://www.w3.org/TR/2011/WD-html5-20110525/the-xhtml-syntax.html#parsing-xhtml-documents [Last Call WD]
[2] http://www.oasis-open.org/committees/entity/spec-2001-08-06.html
[3] http://www.w3.org/TR/2011/WD-html-polyglot-20110525/
[4] http://www.w3.org/TR/xml-proc-profiles/
- -- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFOHyAJkjnJixAXWBoRApUCAJ9iaxkl0K46nvBfKvHmAmhVmSYJQwCeJQ3e
XZmXCGcHVXPrr7qNRBaeiq4=
=3Mv0
-----END PGP SIGNATURE-----

Received on Thursday, 14 July 2011 16:58:20 UTC