Re: Error recovery from James Clark on 2012-11-18 (public-microxml@w3.org from November 2012)

From: James Clark <jjc@jclark.com>
Date: Sun, 18 Nov 2012 08:55:47 +0700
To: John Cowan <cowan@mercury.ccil.org>
Cc: Michael Sokolov <sokolov@falutin.net>, liam@w3.org, Uche Ogbuji <uche@ogbuji.net>, "public-microxml (public-microxml@w3.org)" <public-microxml@w3.org>
Message-ID: <CANz3_Eb5XmZqWfpZz11ppEPg6dpc+V8hE_c8R6bfDmQnDXVu6Q@mail.gmail.com>

On Sun, Nov 18, 2012 at 7:21 AM, John Cowan <cowan@mercury.ccil.org> wrote:

> James Clark scripsit:
>
> > For the purposes of recovery, I plan to use an extended definition of
> > nameStartChar:
> >
> > nameStartChar ::= [A-Za-z_:$] | [#x80-#x10FFFF]
> >
> > So the tree you get would be as if MicroXML allowed colons as a
> > nameStartChar.
>
> I think that's Just Wrong.  An error-correcting parser should produce
> a valid MicroXML data model, and the data model does not allow
> colons in names.

I've found it useful to work with a generalization of the MicroXML data
model that does not restrict the characters that occur in names or in data.
This is useful because you can use it to represent not only MicroXML but
also XML 1.0, XML 1.0 Fifth Edition, XML 1.1 and HTML, all of which have
slightly different restrictions on what is allowed in names or in data.
You can also use this data model to represent the result of XML Namespaces
processing (by putting the URI in the name).
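
As a rough illustration of "putting the URI in the name": the message does not
say which encoding is meant, so the `{uri}local` form below is an assumption.
It is the form Python's xml.etree.ElementTree produces after namespace
processing. The helper `expanded_name` is a hypothetical name used only for
this sketch.

```python
import xml.etree.ElementTree as ET


def expanded_name(uri, local):
    """Fold a namespace URI into a single name string (assumed {uri}local form)."""
    return f"{{{uri}}}{local}" if uri else local


# ElementTree resolves prefixes while parsing, so each tag already carries
# its URI; the prefix "p" does not survive in the tree.
root = ET.fromstring(
    '<p:doc xmlns:p="http://example.org/ns"><p:item/></p:doc>'
)
names = [el.tag for el in root.iter()]
print(names)
```

Such names contain characters that no XML-family syntax allows in a name,
which is why the generalized model must not restrict name characters.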

In implementation terms, the idea is to have a syntax-independent DOM, with
just elements, attributes and characters. There are then separate parsers
and serializers that convert between this DOM and various concrete
syntaxes.  Different syntaxes will vary in which DOMs they can represent.
It's the job of the serializer for a particular syntax to throw an error if
it's given a DOM that the syntax cannot represent.  The MicroXML parser
guarantees that it will produce a DOM of this general kind for any input.
It doesn't guarantee that the DOM can be serialized as valid MicroXML.
Particularly for forbidden characters in attribute values and data, I think
users will be much happier if the parser passes the characters through as
is, instead of replacing them by some other character (although I do plan
to provide an option for the parser to perform replacement).
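
A minimal sketch of this split, assuming nothing about the actual
implementation: `Element`, `UnrepresentableError` and `serialize_microxml`
are hypothetical names, and the name check is a simplified ASCII-only
approximation rather than the full MicroXML name grammar.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """Syntax-independent node: any string is accepted as a name."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


class UnrepresentableError(ValueError):
    """Raised when a tree cannot be written in the target syntax."""


# Simplified assumption: ASCII-only names with no colon.
_ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")


def _escape(text, quote=False):
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return text.replace('"', "&quot;") if quote else text


def serialize_microxml(node):
    """Serialize a node, rejecting names this syntax cannot represent."""
    if isinstance(node, str):
        return _escape(node)
    for name in [node.name, *node.attributes]:
        if not _ASCII_NAME.match(name):
            raise UnrepresentableError(f"name not representable: {name!r}")
    attrs = "".join(
        f' {k}="{_escape(v, quote=True)}"' for k, v in node.attributes.items()
    )
    body = "".join(serialize_microxml(child) for child in node.children)
    return f"<{node.name}{attrs}>{body}</{node.name}>"
```

Here a recovering parser could build `Element("svg:rect")` without complaint,
and only `serialize_microxml` would reject it; a serializer for a syntax that
allows colons would accept the same tree.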

I think there's something fundamental about the
element/attributes/characters trinity as a data model for markup. I think
this data model is of greater significance than the particular syntactic
choices we've made in MicroXML.

I also have a slightly subversive agenda for error recovery: I see it as a
way to work around some of the annoying syntactic restrictions that XML
compatibility has forced on MicroXML. I also hope to investigate
alternative syntaxes for this data model that aren't constrained by
compatibility with XML.

I would like to change the Data Model section of the spec to separate out
the aspects of the data model that are purely there as a result of the
syntactic constraints of MicroXML.

James

Received on Sunday, 18 November 2012 01:56:36 UTC