Another approach from James Clark on 2012-11-26 (public-xml-er@w3.org from November 2012)

From: James Clark <jjc@jclark.com>
Date: Mon, 26 Nov 2012 14:03:06 +0700
To: public-xml-er@w3.org
Message-ID: <CANz3_EZFQicbZqhtWP2aFg1UV7oo48Ydi18000k=bWbJ0WHuQg@mail.gmail.com>
Members of this group might be interested in an alternative approach to
error recovery for (Micro)XML that I have been working on. There is a draft
spec here:

  https://github.com/jclark/microxml-er/blob/master/recovery.md

and a (very new, lightly tested) JavaScript implementation that you can
play with online here:

  http://jclark.github.com/microxml-er/

The implementation displays a JSON representation of the parsed document.
The format for an element is [name, attributes, content], where name is a
string, attributes is an object and content is a list of strings and
elements.  I should mention that this implementation is not designed for
performance; rather it is designed to be as close as I can make it to a
translation into JavaScript of the informal prose of the specification.
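
As an illustration of the [name, attributes, content] shape described above, here is a minimal Python sketch. It is not the JavaScript implementation: it uses Python's strict xml.etree parser (so it does no error recovery), and the helper name to_json_form is hypothetical. It only shows what the JSON form of a small well-formed element would look like under the description given in this message.

```python
import json
import xml.etree.ElementTree as ET


def to_json_form(elem):
    """Convert an ElementTree element to [name, attributes, content].

    Hypothetical helper: name is a string, attributes is a dict, and
    content is a list of strings and nested elements in document order.
    """
    content = []
    if elem.text:
        content.append(elem.text)
    for child in elem:
        content.append(to_json_form(child))
        # Text following a child belongs to the parent's content list.
        if child.tail:
            content.append(child.tail)
    return [elem.tag, dict(elem.attrib), content]


doc = ET.fromstring('<p class="note">Hello <b>world</b>!</p>')
print(json.dumps(to_json_form(doc)))
# prints: ["p", {"class": "note"}, ["Hello ", ["b", {}, ["world"]], "!"]]
```

Whether the online implementation emits exactly these strings (for example, how it treats whitespace) is not verified here; only the nesting structure is taken from the description.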

This work started off in the context of MicroXML, but doing good error
recovery for MicroXML requires reasonable handling of XML constructs that
MicroXML disallows, so I've ended up dealing with most of XML.

I see the main differences between my approach and the current approach as
follows.

1. I don't yet handle DTDs. The approach I would suggest for this is for
the error-correcting parser to restrict itself to identifying where the
internal subset starts and ends.  The characters of the internal and
external subsets would then be parsed as described in the XML spec. If
there is an error in these subsets, then the instance parsing would
continue as if there were no DOCTYPE declaration (or perhaps ignore the part
of the DTD after the first error). At this stage in the life of XML, I
don't think it's worth the added complexity in the specification or the
implementation that would be required to try to continue parsing the
internal/external subset after an error is discovered. What's important is
that everything in the instance is always parsed.

2. Instead of switching on characters, my state machine switches on what I
call lexical tokens, which are defined in terms of regular expressions.
This radically simplifies the state machine, as you can see from my spec.
Overall I have tried hard to make the spec as simple, declarative and
high-level as I can.

3. I haven't given much weight to compatibility with HTML5 error recovery.

4. At the moment, I have defined only a document-type-independent tree
builder phase.  However, I don't think this should be the only option. In
particular, I think there should be an XHTML-specific tree builder phase,
which would allow XHTML browsers to produce an error-corrected tree that is
closer to user expectations. I also think that it would be useful for
error-correcting schema processors for grammar-based schema languages (such
as XSD and RELAX NG) to be able to operate on the output of the
tokenization phase.

5. Although my tokenization phase is streaming, my tree builder phase is
not. I found this necessary in order to provide satisfactory handling of
documents without a single root element (e.g. a document that has text but no
tags, or that has multiple top-level elements).  An alternative way to deal
with this might be to have the parser return a DocumentFragment.
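
Points 2 and 5 can be illustrated with a rough Python sketch. Everything in it is an assumption made for illustration: the token names and regular expressions are not the lexical tokens defined in the draft spec, and the wrapper element name "root" and the function choose_root are hypothetical. The first part shows a tokenizer driven by regular expressions rather than by individual characters; the second shows why choosing the root needs the whole list of top-level items, which is what prevents that step from streaming.

```python
import re

# Hypothetical token definitions, NOT those of the draft spec.
# Every "<" matches at least stray_lt and every other character
# matches text, so the tokens always cover the whole input.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<end_tag></[A-Za-z_][\w.-]*\s*>)"
    r"|(?P<start_tag><[A-Za-z_][\w.-]*[^<>]*>)"
    r"|(?P<text>[^<]+)"
    r"|(?P<stray_lt><)",
    re.DOTALL,
)


def tokenize(source):
    """Yield (kind, lexeme) pairs; a later phase would switch on kind."""
    for match in TOKEN_RE.finditer(source):
        yield match.lastgroup, match.group()


def choose_root(top_level, wrapper_name="root"):
    """Pick a root from all top-level items (strings and element lists).

    The decision depends on every top-level item, so it can only be
    made once the end of the input has been reached.
    """
    elements = [n for n in top_level if isinstance(n, list)]
    stray_text = [n for n in top_level if isinstance(n, str) and n.strip()]
    if len(elements) == 1 and not stray_text:
        return elements[0]
    # Text-only input or several top-level elements: wrap everything.
    return [wrapper_name, {}, list(top_level)]


tokens = list(tokenize("a < b <c>x</c>"))
```

Here tokens begins with ("text", "a ") followed by ("stray_lt", "<"), so a stray "<" becomes an ordinary token for the state machine to handle instead of a fatal error. Returning the unwrapped top_level list instead of calling choose_root would correspond to the DocumentFragment alternative.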

James
Received on Monday, 26 November 2012 07:03:56 UTC