- From: James Clark <jjc@jclark.com>
- Date: Sun, 18 Nov 2012 08:55:47 +0700
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: Michael Sokolov <sokolov@falutin.net>, liam@w3.org, Uche Ogbuji <uche@ogbuji.net>, "public-microxml (public-microxml@w3.org)" <public-microxml@w3.org>
- Message-ID: <CANz3_Eb5XmZqWfpZz11ppEPg6dpc+V8hE_c8R6bfDmQnDXVu6Q@mail.gmail.com>
On Sun, Nov 18, 2012 at 7:21 AM, John Cowan <cowan@mercury.ccil.org> wrote:

> James Clark scripsit:
>
> > For the purposes of recovery, I plan to use an extended definition of
> > nameStartChar:
> >
> > nameStartChar ::= [A-Za-z_:$] | [#x80-#x10FFFF]
> >
> > So the tree you get would be as if MicroXML allowed colons as a
> > nameStartChar.
>
> I think that's Just Wrong. An error-correcting parser should produce
> a valid MicroXML data model, and the data model does not allow
> colons in names.

I've found it useful to work with a generalization of the MicroXML data model that does not restrict the characters that occur in names or in data. This is useful because you can use it to represent not only MicroXML but also XML 1.0, XML 1.0 Fifth Edition, XML 1.1 and HTML, all of which have slightly different restrictions on what is allowed in names or in data. You can also use this data model to represent the result of XML Namespaces processing (by putting the URI in the name).

In implementation terms, the idea is to have a syntax-independent DOM, with just elements, attributes and characters. There are then separate parsers and serializers that convert between this DOM and various concrete syntaxes. Different syntaxes will vary in which DOMs they can represent. It's the job of the serializer for a particular syntax to throw an error if it's given a DOM that the syntax cannot represent.

The MicroXML parser guarantees that it will produce such a DOM for any input. It doesn't guarantee that the DOM can be serialized as valid MicroXML. Particularly for forbidden characters in attribute values and data, I think users will be much happier if the parser passes the characters through as is, instead of replacing them by some other character (although I do plan to provide an option for the parser to perform replacement).

I think there's something fundamental about the element/attributes/characters trinity as a data model for markup.
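
For illustration only, here is a minimal Python sketch of the arrangement described above: a DOM that accepts any name, and a syntax-specific serializer that throws when it is handed a tree it cannot represent. The names Element, UnrepresentableError, check_microxml_name and serialize_microxml are hypothetical, and the name check is deliberately reduced to the no-colon rule quoted in this thread rather than the full MicroXML name grammar.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Element:
    """Syntax-independent node: a name, attributes, and children.

    Names and text are unrestricted here; each serializer decides
    what its own syntax can represent.
    """
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element or str items


class UnrepresentableError(ValueError):
    """Raised when a tree cannot be written in the target syntax."""


def check_microxml_name(name: str) -> None:
    # Assumption: simplified stand-in for the real MicroXML name rules.
    # Only the "no colons in names" restriction is enforced.
    if not name or ":" in name:
        raise UnrepresentableError(f"name not allowed in MicroXML: {name!r}")


def serialize_microxml(node) -> str:
    """Write a tree as MicroXML-style text, or raise if it does not fit."""
    if isinstance(node, str):
        return escape(node)  # escapes &, < and >
    check_microxml_name(node.name)
    parts = [node.name]
    for attr_name, value in node.attributes.items():
        check_microxml_name(attr_name)
        parts.append(f"{attr_name}={quoteattr(value)}")
    body = "".join(serialize_microxml(child) for child in node.children)
    return f"<{' '.join(parts)}>{body}</{node.name}>"
```

In this sketch a recovering parser could hand back Element("svg:rect") without complaint; the error only surfaces if that tree is later passed to serialize_microxml, while a serializer for a more permissive syntax could accept the same tree.
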
I think this data model is of greater significance than the particular syntactic choices we've made in MicroXML.

I also have a slightly subversive agenda for error recovery: I see it as a way to work around some of the annoying syntactic restrictions that XML compatibility has forced on MicroXML. I also hope to investigate alternative syntaxes for this data model that aren't constrained by compatibility with XML.

I would like to change the Data Model section of the spec to separate out the aspects of the data model that are purely there as a result of the syntactic constraints of MicroXML.

James
Received on Sunday, 18 November 2012 01:56:36 UTC