- From: Kurt Cagle <kurt.cagle@gmail.com>
- Date: Wed, 22 Dec 2010 22:18:21 -0500
- To: David Carlisle <davidc@nag.co.uk>
- Cc: public-html-xml@w3.org
- Message-ID: <AANLkTimFWgcfj+=1MzOQR7BTMA8_x1xddRkLHk9yo3Pa@mail.gmail.com>
David,

Thanks for the clarification - I'd not realized that these were separate projects.

Concerning parsers, however, I think that you can reframe the debate away from "how do we improve XML?" to "how do we improve the XML experience?" Consider, for instance, the characteristics of a hypothetical lax XML parser (leaving aside the HTML issues for a moment). Such a parser would take potentially ill-formed XML as an input and would apply a core set of heuristics to the data. Such heuristics might include the following:

1) If a default namespace is not defined globally but an explicit namespace is, and the child elements of that namespace are in the default namespace, then put them into the explicit namespace:

   <ns1:foo xmlns:ns1="myFooNS">
     <bar/>
     <bat/>
   </ns1:foo>

would map to

   <ns1:foo xmlns:ns1="myFooNS">
     <ns1:bar/>
     <ns1:bat/>
   </ns1:foo>

2) If you have an element that repeats without being terminated between repeats, then each repeat will be considered a sibling:

   <foo>
     <bar>ABC
     <bar>123
   </foo>

becomes

   <foo>
     <bar>ABC</bar>
     <bar>123</bar>
   </foo>

3) An element with mixed content will be considered to contain that mixed content until another element of the same name is encountered:

   <foo>
     <bar>This is a <a>bit of <b>data
     <bar>This is another <a>bit of data
   </foo>

would render as

   <foo>
     <bar>This is a <a>bit of <b>data</b></a></bar>
     <bar>This is another <a>bit of data</a></bar>
   </foo>

4) Entities would be matched to the HTML core set and converted into their equivalent numeric character references.

And so forth. As the parser works through these cases, it assigns a weight that indicates the likelihood that a given heuristic rule determines the correct configuration. After the parsing is done, these weights are used to calculate a confidence level for the XML document - the likelihood that the document reproduced by the parse corresponds to the intent of the creator of the content. In the case of well-formed XML this confidence is 1. You could even apply the same heuristics to non-XML documents such as JSON, and so long as there was no ambiguity in those heuristics, the result would be an XML mapping with a confidence of 1.

The default heuristics for such a parser could be extended or replaced by a heuristics document, which I would likely see as an augmented schema (either XSD or RNG) plus Schematron. This could be set up to handle HTML5 parsing as well as other schemas, and would also handle potential identification of stand-alone content such as an SVG, even outside of the context of HTML5 (such as SVG without the appropriate namespace appearing within an XSL-FO document). Such a heuristics configuration file would definitely be a specialist's tool, but in general a user of such a parser would only be utilizing it when dealing with known schemas (although which schema within that set may not be known).

I can even give a few use cases where this would have a lot of value:

1) RSS 2.0 documents are notorious for being "unparseable" as XML. A heuristic parser, however, could parse such an RSS document, storing it internally as XML 1.0, while giving a specific degree of confidence that what was parsed was in fact what was intended. This can be especially useful when processing documents in bulk.

2) We recently received a collection of several gigabytes' worth of genericode documents and discovered that while the containing element was in the genericode namespace, everything else was in a default namespace. A default heuristic parser would likely have handled this use case, but you could also pass in the genericode XSDs in order to increase the overall confidence in the document.

3) Markup text entered into an HTML textarea field tends to be parseable only a fraction of the time. A heuristic parser could provide a much greater likelihood of matching the text to markup than trying to handle special cases via JavaScript external to such a parser.
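To make the confidence idea concrete, here is a minimal sketch in Python of the kind of parser being described. It is purely illustrative, not an existing library: the module name, the two repairs it implements (the entity mapping of heuristic 4 plus a crude "close whatever is still open at end of input" repair), and the per-heuristic weights are all invented for the example.

# lax_xml.py - a hypothetical sketch of a lax XML parser: try strict parsing
# first; if that fails, apply a couple of simple heuristics, each of which
# discounts a confidence score that starts at 1.0 for well-formed input.
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Illustrative guesses, not calibrated values.
HEURISTIC_WEIGHTS = {"entities": 0.99, "close_open_tags": 0.90}

def fix_entities(text):
    """Heuristic 4: rewrite HTML named entities as numeric character refs."""
    count = 0
    def repl(m):
        nonlocal count
        name = m.group(1)
        if name in ("amp", "lt", "gt", "quot", "apos"):
            return m.group(0)                      # already legal XML
        if name in name2codepoint:
            count += 1
            return "&#%d;" % name2codepoint[name]
        return m.group(0)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text), count

def close_open_tags(text):
    """Crude recovery: append end tags for elements still open at EOF."""
    stack, count = [], 0
    for m in re.finditer(r"<(/?)([A-Za-z_][\w:.-]*)[^>]*?(/?)>", text):
        closing, name, selfclosing = m.group(1), m.group(2), m.group(3)
        if selfclosing:
            continue
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
    while stack:
        text += "</%s>" % stack.pop()
        count += 1
    return text, count

def lax_parse(text):
    """Return (ElementTree root or None, confidence in [0, 1])."""
    confidence = 1.0
    try:
        return ET.fromstring(text), confidence     # well-formed: confidence 1
    except ET.ParseError:
        pass
    for name, heuristic in (("entities", fix_entities),
                            ("close_open_tags", close_open_tags)):
        text, repairs = heuristic(text)
        confidence *= HEURISTIC_WEIGHTS[name] ** repairs
    try:
        return ET.fromstring(text), confidence
    except ET.ParseError:
        return None, 0.0                           # heuristics were not enough

# Example: HTML entities and unclosed trailing tags are repaired, and the
# confidence drops a little for each repair made.
root, conf = lax_parse("<foo><bar>caf&eacute;")
print(conf)               # something less than 1.0
print(ET.tostring(root))  # b'<foo><bar>caf&#233;</bar></foo>'

The point of the sketch is simply that a confidence level falls out naturally as the product of per-repair weights, with well-formed input passing straight through at a confidence of 1; a real implementation would carry the full heuristic set and the schema-driven heuristics document described above.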
The problem of trying to create a subset of XML (sans namespaces et al.) is that those namespaces and other features do have value to someone, and everyone's edge case is different. If, on the other hand, you concentrated on building lax (a.k.a. heuristic) parsers and accepted the notion that documents may have confidence levels, then you could handle moderately ill-formed XML while at the same time keeping the core specifications cleanly within XML.

I don't necessarily think that this is all that different from Anne van Kesteren's ideas, save that rather than redefining XML, it simply expands the degree of tolerance for working with XML content.

Kurt Cagle
XML Architect
*Lockheed / US National Archives ERA Project*

On Wed, Dec 22, 2010 at 4:34 PM, David Carlisle <davidc@nag.co.uk> wrote:

> On 22/12/2010 20:59, Kurt Cagle wrote:
>
>> XML5 is how Henri Sivonen and others on the HTML5 WG are referring to
>> XML parsed by that parser.
>
> Not really. Henri was (I would think) referring to Anne's XML5 parser
>
> http://code.google.com/p/xml5/
>
> which is a lax parser for xml markup, but a private project of Anne's
> unrelated to HTML5 as currently specified.
>
> The HTML5 spec defines two ways of parsing what might loosely be called
> xml content.
>
> XHTML5 which is the xml serialisation of html, which is (as xhtml 1.0)
> intended to be parsed by an xml+namespaces parser with draconian error
> handling.
>
> "foreign content" which is the parse mode used by the html5 parser for
> text/html for the content of <svg> and <math> which parses in lax html
> style, the main difference of foreign content parser mode being that />
> denotes empty tag rather than start tag.
>
> David
Received on Thursday, 23 December 2010 03:19:25 UTC