[closed] Re: XML parsing and external entities in HTML5 -- ACTION-440

At the 13 Oct telcon we agreed that this was an informative message, not a comment on the spec.

"Henry S. Thompson" <ht@inf.ed.ac.uk> writes:
> I took an action some time ago to review the discussion in section 9.2
> of the HTML5 spec. [1] in regard to how external entity processing
> during XML DOCTYPE statement parsing is specified.
>
> This section contains the following:
>
>  "This specification provides the following additional information
>   that user agents should use when retrieving an external entity: the
>   public identifiers given in the following list all correspond to _the
>   URL given by this link_.
>
>     -//W3C//DTD XHTML 1.0 Transitional//EN
>     -//W3C//DTD XHTML 1.1//EN
>     -//W3C//DTD XHTML 1.0 Strict//EN
>     -//W3C//DTD XHTML 1.0 Frameset//EN
>     -//W3C//DTD XHTML Basic 1.0//EN
>     -//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
>     -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
>     -//W3C//DTD MathML 2.0//EN
>     -//WAPFORUM//DTD XHTML Mobile 1.0//EN
>
>  "Furthermore, user agents should attempt to retrieve the above
>   external entity's content when one of the above public identifiers
>   is used, and should not attempt to retrieve any other external
>   entity's content."  [emphasis added]
>
> The "URL given by this link" is a data: URI which resolves to a string
> of 2125 entity declarations.
>
> This amounts, as far as I can see, to suggesting (note the use of
> 'should' throughout; recall that the HTML5 spec. does not
> typographically distinguish RFC2119 language, so all uses of 'must',
> 'should' etc. are normative unless explicitly noted to the contrary)
> that the XML parser invoked by a user agent for XHTML documents should
>
>   a) Use a catalog which maps all the above public identifiers to
>      the given fixed string;
>
>   b) Not otherwise process the external subset at all.
>
> In principle, there's a lot to recommend this approach.  It would
> evidently solve a bunch of interop problems, and drastically reduce
> the load on W3C servers.
>
> In practice, it leaves open a number of questions, which I think need
> to be addressed:
>
>  1) Why 'should' and not 'must'?
>
>     If ensuring interop is the goal here, surely we want user agents
>     all to just _do_ this. . .
>
>  2) Why not a number of other public identifiers?
>
>     For example, -//W3C//DTD XHTML Basic 1.0//EN
>                  -//W3C//DTD SVG 1.0//EN
>                  -//W3C//DTD SVG 1.1//EN
>                  -//W3C//MathML 1.0//EN
>
>  3) What exactly is that list of entities?  How would I know if there
>     was a mistake of omission?
>
>  4) What about the _internal_ subset?  Should it be processed
>     (consistent with the catalog story) or not (consistent with what
>     the XML spec. says processors may do, since the external subset is
>     "a special kind of external entity", and non-validating XML
>     processors may stop 'processing' the internal subset once they
>     choose not to read an external entity)?
>
>  5) What if the XML declaration for the document at hand includes
>     "standalone='no'" (or no standalone declaration, which the XML
>     spec. requires to be interpreted as 'no')?
>
>     (Note that as it stands Polyglot [2] does not allow either an XML
>      declaration or an internal subset).
>
> It seems to me the interoperability of existing XHTML toolchains and
> HTML5 user agents is implicated by one or more of the above -- what
> should the TAG say, and to whom?  Should the TAG and the XML
> Processing Model WG work together to define a Processor Profile [3]
> which could be referenced normatively in section 9.2 of the HTML5
> spec.?
>
> ht
>
> [1] http://www.w3.org/TR/2011/WD-html5-20110525/the-xhtml-syntax.html#parsing-xhtml-documents [Last Call WD]
> [2] http://www.w3.org/TR/2011/WD-html-polyglot-20110525/
> [3] http://www.w3.org/TR/xml-proc-profiles/
> --
>        Henry S. Thompson, School of Informatics, University of Edinburgh
>       10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
>                 Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
>                        URL: http://www.ltg.ed.ac.uk/~ht/
>  [mail from me _always_ has a .sig like this -- mail without it is forged spam]
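
For illustration only, the catalog-style behaviour summarised in (a) and (b) of the quoted message might be sketched as below. This is a rough sketch, not anything the HTML5 spec. or either correspondent provides: the function name resolve_external_entity and the constants are hypothetical, and FIXED_ENTITY_DECLARATIONS is a two-entry stand-in for the real string of 2125 entity declarations behind the data: URI. The whitespace normalization follows the XML rule for matching public identifiers.

```python
from typing import Optional

# Public identifiers listed in the quoted passage from section 9.2.
KNOWN_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML Basic 1.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN",
    "-//W3C//DTD MathML 2.0//EN",
    "-//WAPFORUM//DTD XHTML Mobile 1.0//EN",
})

# ASSUMPTION: a tiny stand-in for the fixed string of entity declarations;
# the real data: URI content is much longer and is not reproduced here.
FIXED_ENTITY_DECLARATIONS = (
    '<!ENTITY nbsp "&#160;">\n'
    '<!ENTITY eacute "&#233;">\n'
)


def resolve_external_entity(public_id: Optional[str]) -> Optional[str]:
    """Return the fixed declarations for a listed public identifier.

    Returns None for anything else, meaning "do not retrieve", which
    corresponds to point (b): no other external entity is processed.
    """
    if public_id is None:
        return None
    # Collapse runs of whitespace and trim the ends before matching.
    normalized = " ".join(public_id.split())
    if normalized in KNOWN_PUBLIC_IDS:
        return FIXED_ENTITY_DECLARATIONS
    return None
```

Under this sketch, a lookup for "-//W3C//DTD XHTML 1.0 Strict//EN" yields the fixed string, while "-//W3C//DTD SVG 1.1//EN" (one of the identifiers raised in question 2) yields None, so nothing is fetched from the network in either case.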

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676
www.marklogic.com

Received on Thursday, 13 October 2011 15:12:43 UTC