Re: XML parsing and external entities in HTML5 -- ACTION-440 from Henri Sivonen on 2011-07-15 (www-tag@w3.org from July 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 15 Jul 2011 12:20:41 +0200
To: www-tag@w3.org
Message-ID: <1310725241.2270.17.camel@shuttle>

> In practice, it leaves open a number of questions, which I think need
> to be addressed:
> 
>  1) Why 'should' and not 'must'?
> 
>     If ensuring interop is the goal here, surely we want user agents
>     all to just _do_ this. . .

My guess is that it's a should rather than a must in order to appear to step a bit less on the toes of XML orthodoxy.

>  2) Why not a number of other public identifiers?
> 
>     For example, -//W3C//DTD XHTML Basic 1.0//EN
>                  -//W3C//DTD SVG 1.0//EN
>                  -//W3C//DTD SVG 1.1//EN
>                  -//W3C//MathML 1.0//EN
> 
>  3) What exactly is that list of entities?  How would I know if there
>     was a mistake of omission?

The list is from
http://mxr.mozilla.org/mozilla-central/source/parser/htmlparser/src/nsExpatDriver.cpp#287

We can compare the two lists to verify that the spec correctly describes the behavior of Gecko. Existing content on the Web can't be relying on anything else, because the content wouldn't have worked.

Historically, WebKit's corresponding list of magic public ids has been narrower. Historically, IE hasn't supported XHTML served as XML. Historically, Opera's market share has been low enough that one has to assume that there isn't a significant amount of Opera-only content, since it's likely authors would have checked with other browsers, too.

The list in the spec is motivated by the "Support Existing Content" principle. Items should never, ever be added to the list. If you want to publish XML on the Web, use UTF-8 instead of entity references. If you really want to use entity references, use one of the magic public ids and ignore DTD-based validity.

>  4) What about the _internal_ subset?  Should it be processed
>     (consistent with the catalog story) or not (consistent with what
>     the XML spec. says processors may do, since the external subset is
>     "a special kind of external entity", and non-validating XML
>     processor may stop 'processing' the internal subset once they
>     choose not to read an external entity)?

The HTML5 spec (unfortunately!) has no jurisdiction over the processing of the internal subset. The XML spec requires the internal subset to be processed. The part of HTML5 you quoted doesn't override anything in the XML spec. It just prescribes the behavior of the external entity resolver, whose behavior is out of scope of the XML spec and, thus, may be claimed to be in the jurisdiction of another spec that wishes to say something specific about it.
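
To make the resolver idea concrete, here is a minimal sketch using Python's xml.parsers.expat bindings. It is not Gecko's code. The names MAGIC_PUBLIC_IDS, ENTITY_DECLS and parse_text are hypothetical, the set of public identifiers is an illustrative subset rather than the full list in the spec, and the two entity declarations stand in for the full entity set. The point is only that the resolver never fetches the system identifier: for a known public id it feeds a fixed set of entity declarations to the parser, and for anything else it supplies nothing.

```python
import xml.parsers.expat as expat

# Assumption: an illustrative subset of "magic" public ids, not the full spec list.
MAGIC_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
}

# Assumption: two declarations standing in for the full entity set.
ENTITY_DECLS = b'<!ENTITY nbsp "&#160;"><!ENTITY eacute "&#233;">'


def parse_text(doc: bytes) -> str:
    """Parse doc and return its character data, resolving the external
    subset from the fixed table above instead of from the network."""
    parser = expat.ParserCreate()
    # Without this call, expat never asks for the external subset.
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    chunks = []
    parser.CharacterDataHandler = chunks.append

    def resolve(context, base, system_id, public_id):
        # system_id is deliberately ignored: nothing is ever fetched.
        if public_id in MAGIC_PUBLIC_IDS:
            sub = parser.ExternalEntityParserCreate(context)
            sub.Parse(ENTITY_DECLS, True)
        # Nonzero tells expat to carry on with the main document.
        return 1

    parser.ExternalEntityRefHandler = resolve
    parser.Parse(doc, True)
    return "".join(chunks)


doc = (b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
       b'<html>caf&eacute;&nbsp;au lait</html>')
print(repr(parse_text(doc)))
```

The internal subset, if present, is still processed by the XML parser itself as usual; the handler above only decides what the external subset contains.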

>  5) What if the XML declaration for the document at hand includes
>     "standalone='no'" (or no standalone, which the XML spec. requires
>     to be interpreted as 'no')?

Then what HTML5 says applies. For standalone="yes", it makes sense not to process external entities at all. Gecko initializes expat accordingly. Perhaps HTML5 should suggest that.
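
As a sketch of the standalone case, again with Python's expat bindings rather than Gecko's actual initialization code: expat has a param-entity parsing mode that skips the external subset when the document declares standalone="yes". The helper name external_subset_requests is hypothetical; it just counts how often the parser asks the resolver for an external entity.

```python
import xml.parsers.expat as expat


def external_subset_requests(doc: bytes) -> int:
    """Return how many times expat asked for an external entity."""
    parser = expat.ParserCreate()
    # Ask for the external subset unless the document is standalone="yes".
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    requests = []

    def on_external_entity(context, base, system_id, public_id):
        requests.append(public_id)
        return 1  # nonzero: continue parsing, supplying no declarations

    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(doc, True)
    return len(requests)


DOCTYPE = (b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
           b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
STANDALONE_YES = b'<?xml version="1.0" standalone="yes"?>' + DOCTYPE + b'<html/>'
STANDALONE_NO = b'<?xml version="1.0" standalone="no"?>' + DOCTYPE + b'<html/>'
NO_DECLARATION = DOCTYPE + b'<html/>'

for label, doc in [("yes", STANDALONE_YES), ("no", STANDALONE_NO),
                   ("absent", NO_DECLARATION)]:
    print(label, external_subset_requests(doc))
```

Note that with standalone="yes" a reference to an entity that is only declared in the external subset becomes a well-formedness error, so such documents cannot rely on the magic public ids at all.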

> It seems to me the interoperability of existing XHTML toolchains and
> HTML5 user agents is implicated by one or more of the above -- what
> should the TAG say, and to whom?  

To whoever wants to consume XML on the Web in a browser-compatible way, say that they should do what HTML5 says here.

Those who want to use XHTML5 in non-Web offline XML systems can configure their XML parsers and catalogs however they wish.

> Should the TAG and the XML
> Processing Model WG work together to define a Processor Profile [4]
> which could be referenced normatively in section 9.2 of the HTML5
> spec.?

I don't see value in that.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 15 July 2011 10:21:15 UTC