- From: James Clark <jjc@jclark.com>
- Date: Mon, 26 Nov 2012 14:03:06 +0700
- To: public-xml-er@w3.org
- Message-ID: <CANz3_EZFQicbZqhtWP2aFg1UV7oo48Ydi18000k=bWbJ0WHuQg@mail.gmail.com>
Members of this group might be interested in an alternative approach to error recovery for (Micro)XML that I have been working on. There is a draft spec here:

https://github.com/jclark/microxml-er/blob/master/recovery.md

and a (very new, lightly tested) JavaScript implementation that you can play with online here:

http://jclark.github.com/microxml-er/

The implementation displays a JSON representation of the parsed document. The format for an element is [name, attributes, content], where name is a string, attributes is an object and content is a list of strings and elements. I should mention that this implementation is not designed for performance; rather it is designed to be as close as I can make it to a translation into JavaScript of the informal prose of the specification.

This work started off in the context of MicroXML, but doing good error recovery for MicroXML requires reasonable handling of XML constructs that MicroXML disallows, so I've ended up dealing with most of XML.

I see the main differences between my approach and the current approach as follows.

1. I don't yet handle DTDs. The approach I would suggest for this is for the error-correcting parser to restrict itself to identifying where the internal subset starts and ends. The characters of the internal and external subsets would then be parsed as described in the XML spec. If there is an error in these subsets, then the instance parsing would continue as if there was no DOCTYPE declaration (or perhaps ignore the part of the DTD after the first error). At this stage in the life of XML, I don't think it's worth the added complexity in the specification or the implementation that would be required to try to continue parsing the internal/external subset after an error is discovered. What's important is that everything in the instance is always parsed.

2. Instead of switching on characters, my state machine switches on what I call lexical tokens, which are defined in terms of regular expressions. This radically simplifies the state machine, as you can see from my spec. Overall I have tried hard to make the spec as simple, declarative and high-level as I can.

3. I haven't given much weight to compatibility with HTML5 error recovery.

4. At the moment, I have defined only a document-type-independent tree builder phase. However, I don't think this should be the only option. In particular, I think there should be an XHTML-specific tree builder phase, which would allow XHTML browsers to produce an error-corrected tree that is closer to user expectations. I also think that it would be useful for error-correcting schema processors for grammar-based schema languages (such as XSD and RELAX NG) to be able to operate on the output of the tokenization phase.

5. Although my tokenization phase is streaming, my tree builder phase is not. I found this necessary in order to provide satisfactory handling of documents without a single root element (e.g. a document that has text but no tags, or that has multiple top-level elements). An alternative way to deal with this might be to have the parser return a DocumentFragment.

James
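
For illustration only, the [name, attributes, content] JSON representation described in the message can be exercised with a short JavaScript sketch. The sample value and the helper names textOf and serialize are hypothetical and are not taken from the linked implementation or spec; the sketch assumes only the shape stated above (name is a string, attributes is an object with string values, content is a list of strings and elements).

```javascript
// Hypothetical sample in the [name, attributes, content] format.
const doc = ["p", { class: "note" }, ["Hello, ", ["em", {}, ["world"]], "!"]];

// Concatenate all text in an element, depth first.
function textOf(element) {
  const content = element[2];
  return content
    .map((item) => (typeof item === "string" ? item : textOf(item)))
    .join("");
}

// Escape characters that are special in character data.
function escapeText(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Escape characters that are special in a double-quoted attribute value.
function escapeAttr(s) {
  return escapeText(s).replace(/"/g, "&quot;");
}

// Turn an element back into markup (always using explicit end tags).
function serialize(element) {
  const [name, attributes, content] = element;
  const attrs = Object.entries(attributes)
    .map(([key, value]) => ` ${key}="${escapeAttr(value)}"`)
    .join("");
  const body = content
    .map((item) => (typeof item === "string" ? escapeText(item) : serialize(item)))
    .join("");
  return `<${name}${attrs}>${body}</${name}>`;
}

console.log(textOf(doc));
console.log(serialize(doc));
```

Because content mixes strings and nested elements in one list, both helpers only need a single typeof check per item to recurse.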
Received on Monday, 26 November 2012 07:03:56 UTC