
Re: David's less simple example (was: Marcos simple sample)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Tue, 28 Feb 2012 17:57:12 +0000
Cc: David Carlisle <davidc@nag.co.uk>
Message-Id: <EFA07706-8E78-4097-A10D-B33DBF48A78D@jenitennison.com>
To: "public-xml-er@w3.org Community Group" <public-xml-er@w3.org>

On 28 Feb 2012, at 15:49, David Carlisle wrote:
> To distinguish things a bit, it's worth looking at something a bit less like well-formed XML, say
> <math><one<two<three</one><two></tree></math>
> Using <math> as an outer element has the advantage that you can test
> with an html5 parser (the <math> puts html5 in its "foreign content"
> xml-like mode, where /> means what it is supposed to mean). One desirable
> property of XML-ER would be that it wasn't totally unlike the behaviour
> of HTML5 on such content.
> Using V.nu's parser you can see the result of parsing the above:
> http://livedom.validator.nu/?%3C!DOCTYPE%20html%3E%0A%3Cmath%3E%3Cone%3Ctwo%3Cthree%3C%2Fone%3E%3Ctwo%3E%3C%2Ftree%3E%3C%2Fmath%3E
> removing the html head and body implied in the html context results in a
> parse tree of
> <math><oneU00003CtwoU00003CthreeU00003C
> one=""><two></two></oneU00003CtwoU00003CthreeU00003C></math>
> which is what it is. I don't think it matters too much what the parse
> tree is. That is, I don't think it's worth trying to argue about any
> meaning implied by the original markup. The important thing is that
> html5 specifies a deterministic algorithm that returns a tree. Unless
> there is some overwhelming objection, I think XML-ER should return the
> same tree. (To be honest I haven't checked what Anne's draft spec would
> make of this yet).
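
For readers wondering where the "U00003C" in that tree comes from: as far as I understand it, this is the HTML5 infoset coercion, which rewrites each character that is not allowed in an XML name as "U" followed by six uppercase hex digits of its code point. A minimal sketch of that mapping follows; the function name coerce_name is hypothetical, and the set of allowed characters is a deliberately simplified ASCII-only approximation of the XML Name production, not the full rule.

```python
def coerce_name(name: str) -> str:
    """Rewrite characters that cannot appear in an XML name as U + 6 hex digits.

    Assumption: only ASCII letters, "_" and ":" may start a name, and ASCII
    digits, "-" and "." may follow. The real XML Name production is broader.
    """
    def allowed(ch: str, first: bool) -> bool:
        if ch.isascii() and (ch.isalpha() or ch in "_:"):
            return True
        if not first and ch.isascii() and (ch.isdigit() or ch in "-."):
            return True
        return False

    out = []
    for i, ch in enumerate(name):
        if allowed(ch, first=(i == 0)):
            out.append(ch)
        else:
            # "<" is code point 0x3C, so it becomes "U00003C"
            out.append("U%06X" % ord(ch))
    return "".join(out)


print(coerce_name("one<two<three<"))
```

Here the input string is the tag name an HTML5 tokenizer would read from David's example, since the name only ends at whitespace, "/" or ">".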

Although I agree that the important thing is a deterministic algorithm that produces a tree, I think it *is* worth arguing about the meaning implied by the original markup -- or at least about how a person might have arrived at this markup from some well-formed XML -- specifically to address the editor use case for XML-ER, as George highlights later in this thread.

To take a slightly less degenerate case, if someone started with the well-formed:

  <math><three /></math>

and then started typing a new tag before the <three> empty element:

  <math><two<three /></math>

I think it is much more reasonable for this to be interpreted in an editor as the tree

  + math
    + two
      + three

(with the <two> element flagged as having an error) than it is to be interpreted as the tree

  + math
    + twoU00003Cthree

(with the <twoU00003Cthree> flagged as having an error).
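
To make the difference between the two readings concrete, here is a rough sketch of a toy tag-only parser with a switch for how a "<" inside a tag name is handled. It is not the HTML5 algorithm or any proposed XML-ER algorithm. The names parse, outline and recover_at_lt are all made up for illustration, and attributes and character data are simply ignored. With recover_at_lt=False the name runs on until whitespace, "/" or ">", which approximates the HTML5 tag-name behaviour. With recover_at_lt=True a "<" ends the current tag, flags it, and starts a new one.

```python
def parse(text, recover_at_lt):
    """Toy parser: returns a nested list of [name, flagged, children] nodes."""
    root = ["#root", False, []]
    stack = [root]
    i = 0
    while i < len(text):
        if text[i] != "<":
            i += 1  # character data is ignored in this sketch
            continue
        i += 1
        closing = text.startswith("/", i)
        if closing:
            i += 1
        start = i
        # Characters that end a tag name; "<" only under the recovering policy.
        stop = " \t\n/>" + ("<" if recover_at_lt else "")
        while i < len(text) and text[i] not in stop:
            i += 1
        name = text[start:i]
        flagged = "<" in name
        self_closing = False
        if i < len(text) and text[i] == "<":
            # Tag cut short by the next "<": flag it and leave the "<" unread.
            flagged = True
        else:
            end = text.find(">", i)
            if end == -1:
                end = len(text)
            self_closing = text[i:end].rstrip().endswith("/")
            i = end + 1
        if closing:
            # Close the innermost open element with this name, if there is one.
            if name in [node[0] for node in stack[1:]]:
                while stack[-1][0] != name:
                    stack.pop()
                stack.pop()
            continue
        node = [name, flagged, []]
        stack[-1][2].append(node)
        if not self_closing:
            stack.append(node)
    return root[2]


def outline(nodes, depth=0):
    """Render the tree in the "+ name" notation, writing "<" as U00003C."""
    lines = []
    for name, flagged, children in nodes:
        shown = name.replace("<", "U00003C")
        lines.append("  " * depth + "+ " + shown + (" (error)" if flagged else ""))
        lines.extend(outline(children, depth + 1))
    return lines


sample = "<math><two<three /></math>"
print("\n".join(outline(parse(sample, recover_at_lt=True))))
print("\n".join(outline(parse(sample, recover_at_lt=False))))
```

Under these assumptions the first call yields math containing a flagged two that contains three, and the second yields math containing a single flagged twoU00003Cthree.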

While I agree that it's useful to be consistent with HTML5 parsing, I don't think we should be overly slavish about it. Browsers already have an HTML5-specified parsing algorithm that can be applied to XML, but because it's HTML5-aware, it doesn't meet our first requirement, which is to be compatible with XML.

Given that we're going to be asking browsers to implement a different algorithm anyway, I don't see that the benefits from being consistent with HTML5 are so massive that they outweigh the benefits of having a single algorithm that is usable in the editing and ingesting environments as well as in browsers.


Jeni Tennison
Received on Tuesday, 28 February 2012 17:57:37 UTC
