Re: XML parsing and external entities in HTML5 -- ACTION-440 from David Carlisle on 2011-07-14 (www-tag@w3.org from July 2011)

From: David Carlisle <davidc@nag.co.uk>
Date: Thu, 14 Jul 2011 21:08:34 +0100
To: www-tag@w3.org
Message-ID: <4E1F4CC2.4000205@nag.co.uk>
> In practice, it leaves open a number of questions, which I think need
> to be addressed:
>
>  1) Why 'should' and not 'must'?
>
>     If ensuring interop is the goal here, surely we want user agents
>     all to just _do_ this. . .

I wasn't involved in this bit, but if you make it a must then an
off-the-shelf conformant xml parser wouldn't be able to parse xhtml in
a conformant way, which might be a bit odd.

>
>  2) Why not a number of other public identifiers?
>
>     For example, -//W3C//DTD XHTML Basic 1.0//EN
>                  -//W3C//DTD SVG 1.0//EN
>                  -//W3C//DTD SVG 1.1//EN
>                  -//W3C//MathML 1.0//EN

Personally I think that an xhtml related xml parser ought to use the
same entity set for _all_ xml _all_ the time, so that you could finally
have a spec for passing around fragments of xml like <span>&nbsp;</span>
without them being not well formed.

Failing that, including at least the standard html declaration

<!DOCTYPE html>
<html>...

would be useful.
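
The fragment problem mentioned above is easy to reproduce with a stock
parser. A minimal sketch, assuming Python's standard
xml.etree.ElementTree stands in for the off-the-shelf conformant xml
parser; the helper name is_well_formed is made up for this illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the stock XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# &amp; is one of the five entities predefined by XML itself.
print(is_well_formed("<span>&amp;</span>"))   # True
# &nbsp; is declared nowhere, so with no DTD at all the reference
# is a well-formedness error.
print(is_well_formed("<span>&nbsp;</span>"))  # False
```

With a fixed entity set implied for all xml, the second fragment would
parse just like the first.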



>
>  3) What exactly is that list of entities?

The list from the w3c entities spec [1], specifically the htmlmathml
list [2].

>     How would I know if there
>     was a mistake of omission?

(I must compare that with [2] as part of my last call review of html5.)
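
A check for omissions could be sketched roughly as below. This is only
an illustration under stated assumptions: SAMPLE_ENT is made-up input in
DTD entity-declaration syntax (it is not the content of [2]), the names
declared_names and ENTITY_DECL are hypothetical, and Python's
html.entities.html5 table is used as a stand-in for the list in the
html5 spec.

```python
import re
from html.entities import html5

# Made-up sample declarations; "madeupname" is deliberately fake.
SAMPLE_ENT = '''
<!ENTITY nbsp "&#xA0;">
<!ENTITY alpha "&#x3B1;">
<!ENTITY madeupname "&#x2603;">
'''

# Matches simple general entity declarations with a quoted value.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][A-Za-z0-9.]*)\s+"[^"]*"\s*>')

def declared_names(ent_text):
    """Collect the entity names declared in a chunk of DTD text."""
    return set(ENTITY_DECL.findall(ent_text))

# html5 keys look like 'nbsp;' (a few legacy names also appear
# without the semicolon); keep the ';' forms and strip the ';'.
reference = {name[:-1] for name in html5 if name.endswith(";")}

declared = declared_names(SAMPLE_ENT)
# Names declared in the sample but absent from the reference list.
print(sorted(declared - reference))
# Number of reference names the sample fails to declare.
print(len(reference - declared))
```

Running the same two set differences over the real file and the real
spec list would show omissions in either direction.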


>
>  4) What about the _internal_ subset?  Should it be processed
>     (consistent with the catalog story) or not (consistent with what
>     the XML spec. says processors may do, since the external subset is
>     "a special kind of external entity", and non-validating XML
>     processor may stop 'processing' the internal subset once they
>     choose not to read an external entity)?

I think internal subsets should be parsed and the entities within them
defined, but authors should be encouraged not to use them, for
compatibility with html.
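
For comparison, here is what processing the internal subset looks like
in a stock parser. A minimal sketch, again assuming Python's
xml.etree.ElementTree; the document string is made up for illustration:

```python
import xml.etree.ElementTree as ET

# The internal subset declares nbsp, so the reference in the body
# is defined and the document is well formed.
doc = '<!DOCTYPE span [<!ENTITY nbsp "&#160;">]><span>&nbsp;</span>'

root = ET.fromstring(doc)
# The entity reference has been replaced by U+00A0.
print(repr(root.text))  # '\xa0'
```

The same markup served as text/html would not get this treatment, which
is why authors are better off not relying on it.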

>
>  5) What if the XML declaration for the document at hand includes
>     "standalone='no'" (or no standalone, which the XML spec. requires
>     to be interpreted as 'no')?

I've managed to ignore standalone for over a decade, so ignoring it
forever wouldn't trouble me :-)

>
>     (Note that as it stands Polyglot [2] does not allow either an XML
>      declaration or an internal subset).

It also doesn't allow entity references apart from the predefined xml
ones; this restriction could be dropped if the full entity set were
implied by <!DOCTYPE html>.

>
> It seems to me the interoperability of existing XHTML toolchains and
> HTML5 user agents is implicated by one or more of the above -- what
> should the TAG say, and to whom?  Should the TAG and the XML
> Processing Model WG work together to define a Processor Profile [3]
> which could be referenced normatively in section 9.2 of the HTML5
> spec.?
>

In a follow-up message:

> For sure, I should have included that as well -- either of the fixed
> lists in this section may need updating quite regularly. . .

As Anne commented, we have traditionally strongly resisted adding or
removing any entities because the story (on the XML side) is so drastic:
if your catalog switches in a dtd with a different entity set, a single
undefined entity reference renders the entire document not well formed.
The xml entities spec covers all the entities that have been published
by W3C or ISO, and as far as I recall there has only been one new name
added since MathML 1 in 1998, so adding names has not traditionally been
a regular occurrence. As for the other "fixed list" that you refer to,
the list of public identifiers that trigger the entities, I think that
should be infinitely extended, as I note above.


David


[1] http://www.w3.org/TR/2010/REC-xml-entity-names-20100401/

[2] http://www.w3.org/2003/entities/2007/htmlmathml-f.ent
Received on Thursday, 14 July 2011 20:09:09 UTC