Re: Draft from Jeni Tennison on 2012-02-20 (public-xml-er@w3.org from February 2012)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Mon, 20 Feb 2012 23:42:24 +0000
To: Anne van Kesteren <annevk@opera.com>, public-xml-er@w3.org
Message-Id: <6B63B0EF-C2EB-4E78-90DA-391B5F816083@jenitennison.com>

Hi Anne,

Thank you very much for getting a draft in place. It's really helpful to have something concrete to start to discuss.

Having looked through the draft, I think there are two fairly fundamental issues that we ought to take a position on at an early stage.

First, I don't think we should call a parser that does error recovery an 'XML Parser', because that will just confuse people or rub them up the wrong way (make them think that we're redefining XML). Perhaps call it a 'Recovering XML Parser' instead? (Or does that sound too much like 'Recovering Alcoholic'?) And instead of:

  This specification defines the parsing rules for XML documents, whether 
  they are syntactically correct or not.

say something like:

  This specification defines the rules for building a tree from XML documents 
  and documents that purport to be XML documents but are not well-formed.

The important thing here is to make sure the spec language does not claim to be redefining what XML is: there is no such thing as non-well-formed XML. If you would like me to go through and pick out other specific instances where I think the language needs to be more careful about that, I'm happy to do so; just let me know.
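
To illustrate the distinction, here is a minimal Python sketch using the standard library's (non-recovering) parser. The helper name is_well_formed is made up for this example and is not from the draft; the point is only that an ordinary XML parser reports an error and produces no tree when the input is not well-formed, which is exactly the behaviour a recovering parser would replace.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if a strict (non-recovering) parser accepts the input.

    Hypothetical helper for illustration only.
    """
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        # The strict parser stops at the first well-formedness error
        # and does not hand back a partial tree.
        return False


good = "<a><b/></a>"
bad = "<a><b></a>"  # <b> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```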

Second, I think we need to reach some kind of agreement on what it is exactly that a Recovering XML Parser creates, at least in terms of how the error recovery is specified. It seems to me that there are (at least) five options:

  1. a well-formed XML document
  2. a sequence of (SAX) events [1]
  3. an XML infoset [2]
  4. a DOM [3]
  5. an XDM [4]

Now pretty much all of these can map to the others easily enough, but there are subtle differences (eg XDM can have multiple document elements while a DOM can't) that will influence what kinds of error recovery are possible. It might also prove easier or harder to describe the parsing in terms of one particular model over another, and certainly implementations will be oriented towards a particular model (eg browsers, I guess, would think in DOM terms, whereas most XML-based processing such as XProc/XSLT/XQuery happens on an XDM).
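
As a rough illustration of the single-document-element constraint, here is a small Python sketch against the standard library's minidom, which is just one DOM implementation. The function names are invented for this example, and the DocumentFragment is only an approximate stand-in for a container that permits several top-level elements; it is not an XDM implementation.

```python
import xml.dom
import xml.dom.minidom


def second_root_rejected():
    """Try to give a DOM Document two element children."""
    doc = xml.dom.minidom.Document()
    doc.appendChild(doc.createElement("first"))
    try:
        doc.appendChild(doc.createElement("second"))
    except xml.dom.HierarchyRequestErr:
        # minidom refuses a second document element.
        return True
    return False


def fragment_top_level_count():
    """A fragment, unlike a Document, can hold several top-level elements."""
    doc = xml.dom.minidom.Document()
    frag = doc.createDocumentFragment()
    frag.appendChild(doc.createElement("first"))
    frag.appendChild(doc.createElement("second"))
    return len(frag.childNodes)


print(second_root_rejected())
print(fragment_top_level_count())
```

So if the input has two top-level elements, a parser that must build a DOM Document has to wrap, drop or nest one of them, whereas a model that allows several top-level elements could keep both.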

The spec currently talks in terms of a DOM. I think that's a good choice given that browsers are a major audience for this spec, but I just wanted to make sure that we're all agreed that's the right kind of tree to aim for, with all its various quirks, because it will restrict the trees that can be built and how particular types of errors are recovered from.

I think it's also important to have language somewhere that states that a Recovering XML Parser can use other kinds of APIs as long as it does so in a way that's consistent with the DOM described in 4.3 Tree construction. A DOM doesn't make sense for every implementation. (This does make me lean somewhat reluctantly towards specifying the tree in terms of an XML infoset, which at least has defined mappings into both DOM and XDM models.)
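
To make "consistent with the DOM" a bit more concrete, here is a hedged Python sketch of the kind of check I have in mind: the same well-formed document is read once through a SAX-style event API and once into a DOM, and the elements are compared in document order. The class and function names are made up for this example, and it uses the standard library's strict parsers, not a recovering one.

```python
import xml.dom.minidom
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Record element names in the order their start events arrive."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def sax_element_names(data):
    handler = NameCollector()
    xml.sax.parseString(data, handler)
    return handler.names


def dom_element_names(data):
    doc = xml.dom.minidom.parseString(data)
    names = []

    def walk(node):
        # Depth-first walk, which visits elements in document order.
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                names.append(child.tagName)
                walk(child)

    walk(doc)
    return names


DOC = b"<a><b/><c><d/></c></a>"
print(sax_element_names(DOC) == dom_element_names(DOC))
```

A real conformance statement would need to cover attributes, text, comments and so on, but the principle is the same: whatever API is exposed, it should describe the same tree.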

Cheers,

Jeni

[1] http://www.saxproject.org/
[2] http://www.w3.org/TR/xml-infoset/
[3] http://www.w3.org/TR/DOM-Level-3-Core/ (or perhaps something closer to that defined in HTML5)
[4] http://www.w3.org/TR/xpath-datamodel/

On 20 Feb 2012, at 15:01, Anne van Kesteren wrote:

> On Sat, 18 Feb 2012 19:09:25 +0100, Anne van Kesteren <annevk@opera.com> wrote:
>> My draft (I will get a dvcs.w3.org repository to put it in next week)
> 
> Got one this morning:
> 
>  http://dvcs.w3.org/hg/xml-er/raw-file/tip/Overview.html
> 
> 
> -- 
> Anne van Kesteren
> http://annevankesteren.nl/
> 
> 

-- 
Jeni Tennison
http://www.jenitennison.com
Received on Monday, 20 February 2012 23:42:49 UTC