
Re: What problem is this task force trying to solve and why?

From: Kurt Cagle <kurt.cagle@gmail.com>
Date: Wed, 22 Dec 2010 22:18:21 -0500
Message-ID: <AANLkTimFWgcfj+=1MzOQR7BTMA8_x1xddRkLHk9yo3Pa@mail.gmail.com>
To: David Carlisle <davidc@nag.co.uk>
Cc: public-html-xml@w3.org
David,

Thanks for the clarification - I'd not realized that these were separate
projects.

Concerning parsers, however, I think that you can reframe the debate away
from "how do we improve XML?" to "how do we improve the XML experience?"

Consider, for instance, the characteristics of a hypothetical lax XML parser
(leaving aside the HTML issues for a moment). Such a parser would take
potentially ill-formed XML as an input, and would apply a core set of
heuristics to the data. Such heuristics might include the following:

1) If a default namespace is not defined globally but an explicit
namespace is, and the child elements of an element in that namespace are in
the default namespace, then put them into the explicit namespace:

<ns1:foo xmlns:ns1="myFooNS">
     <bar/>
     <bat/>
</ns1:foo>

would map to

<ns1:foo xmlns:ns1="myFooNS">
     <ns1:bar/>
     <ns1:bat/>
</ns1:foo>

2) If an element repeats without being terminated between repeats, then
the repeat will be considered a sibling of the first occurrence:

<foo>
    <bar>ABC
    <bar>123
</foo>

becomes:

<foo>
    <bar>ABC</bar>
    <bar>123</bar>
</foo>

3) An element with mixed content will be considered to contain that mixed
content until another element of the same name is encountered:

<foo>
    <bar>This is an <a>bit of <b>data
    <bar>This is another <a>bit of data
</foo>

would render as

<foo>
    <bar>This is an <a>bit of <b>data</b></a></bar>
    <bar>This is another <a>bit of data</a></bar>
</foo>

4) Entities would be matched to the HTML core set and converted into their
equivalent numeric character references.
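
To make rules 2 through 4 concrete, here is a minimal sketch in Python. It
is only an illustration under stated assumptions, not a proposed
implementation: the helper names repair_tags and numeric_entities are
hypothetical, Python's html.entities.name2codepoint table (the HTML 4.01
entity set) stands in for "the HTML core set", and comments, CDATA sections,
processing instructions and ">" inside attribute values are ignored. Note
that it will also split elements that legitimately nest inside an element of
the same name, which is exactly the kind of guess that should lower the
confidence of the result.

```python
import re
from html.entities import name2codepoint

# One tag: optional "/", a name, anything else, optional trailing "/".
TAG = re.compile(r"<(/?)([^\s<>/]+)([^<>]*?)(/?)>")
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def repair_tags(text):
    """Close unterminated elements per rules 2 and 3 (hypothetical helper)."""
    stack, out = [], []

    def close_through(name):
        # Emit end tags for everything opened since `name`, then `name` itself.
        while stack:
            top = stack.pop()
            out.append(f"</{top}>")
            if top == name:
                break

    for part in re.split(r"(<[^<>]+>)", text):
        tag = TAG.fullmatch(part)
        if tag is None:
            out.append(part)  # character data passes through untouched
            continue
        closing, name, _attrs, empty = tag.groups()
        if closing:
            if name in stack:
                close_through(name)
            # a stray end tag with no matching start tag is dropped
        elif empty:
            out.append(part)  # <bar/> needs no repair
        else:
            if name in stack:
                close_through(name)  # repeated open element becomes a sibling
            stack.append(name)
            out.append(part)
    while stack:
        out.append(f"</{stack.pop()}>")  # close whatever is still open at the end
    return "".join(out)


def numeric_entities(text):
    """Rule 4: swap known HTML entity names for numeric character references."""
    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names alone
        return f"&#{name2codepoint[name]};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", swap, text)
```

Running repair_tags over the inputs of rules 2 and 3 yields the element
structure shown in the corresponding outputs above; the only difference is
that the trailing whitespace of each line ends up inside the element that
gets closed.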

And so forth. As the parser works through these cases, it assigns each
repair a weight that indicates the likelihood that the heuristic rule chose
the correct configuration. After the parsing is done, these weights are used
to calculate a confidence level for the XML document - the likelihood that
the document reproduced by the parser corresponds to the intent of the
creator of this content. In the case of well-formed XML this confidence is
1. You could even apply the same heuristics to non-XML documents such as
JSON, and so long as there was no ambiguity in those heuristics, the result
would be an XML mapping of confidence 1.
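
As a rough illustration of how such weights might combine, the sketch below
multiplies them together. Treating the repairs as independent and using a
product is my assumption here, not something any parser is known to do, and
the Repair record and the example weights are hypothetical. The one property
it is meant to show is that a document needing no repairs comes out with a
confidence of exactly 1.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Repair:
    rule: str      # which heuristic fired, e.g. "sibling-repeat"
    weight: float  # assumed likelihood in (0, 1] that the repair matches intent


def document_confidence(repairs):
    """Combine per-repair weights into a single document-level confidence."""
    for repair in repairs:
        if not 0.0 < repair.weight <= 1.0:
            raise ValueError(f"weight out of range for rule {repair.rule!r}")
    # The product of an empty sequence is 1, so well-formed input scores 1.
    return prod(repair.weight for repair in repairs)
```

For example, two repairs weighted 0.9 and 0.8 give a document confidence of
about 0.72, while an unambiguous mapping that fires no heuristic at all stays
at 1.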

The default heuristics for such a parser could be extended or replaced by a
heuristics document, which I would likely see as an augmented schema (either
XSD or RNG) + Schematron. This could be set up to handle HTML5 parsing as
well as other schemas, and would also handle potential identification of
stand-alone content such as an SVG, even outside of the context of HTML5
(such as SVG without the appropriate namespace appearing within an XSL-FO
document).

Such a heuristics configuration file would definitely be a specialist's
tool, but in general a user of such a parser would only be utilizing it when
they are dealing with known schemas (although which schema within that set
may not be known).

I can even give a few use cases where this would have a lot of value:

1) RSS 2.0 documents are notorious for being "unparseable" as XML. A
heuristic parser, however, could parse such an RSS document, storing it
internally as XML 1.0, while giving a specific degree of confidence that
what was parsed was in fact what was intended. This can be especially useful
when processing bulk documents.

2) We recently received a collection of several gigabytes worth of
genericode documents, and discovered that while the containing element was
in the genericode namespace, everything else was in a default namespace. A
default heuristic parser would likely have handled this use case, but you
could also pass in the genericode XSDs in order to increase the overall
confidence in the document.

3) Markup text entered into an HTML textarea field tends to be parseable
only a fraction of the time. A heuristic parser could provide a much greater
likelihood of matching the text to markup than trying to handle special
cases via JavaScript external to such a parser.
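
The genericode case in item 2 is essentially heuristic 1 from the first
list, and since such documents are already well-formed it can be sketched on
top of an ordinary XML parser. The following is a minimal sketch using
Python's xml.etree.ElementTree; the function name inherit_namespace is
hypothetical, and the "myFooNS" namespace from the earlier example stands in
for the real genericode namespace. It returns the number of elements it
moved, which a parser could use when weighting its confidence.

```python
import xml.etree.ElementTree as ET


def inherit_namespace(elem, inherited=None):
    """Move un-namespaced descendants into the nearest ancestor's namespace.

    Returns how many elements were moved.
    """
    moved = 0
    if elem.tag.startswith("{"):
        # ElementTree spells namespaced tags as "{uri}local".
        inherited = elem.tag[1:].split("}", 1)[0]
    elif inherited is not None:
        elem.tag = "{%s}%s" % (inherited, elem.tag)
        moved = 1
    for child in elem:
        moved += inherit_namespace(child, inherited)
    return moved


root = ET.fromstring('<ns1:foo xmlns:ns1="myFooNS"><bar/><bat/></ns1:foo>')
moved_count = inherit_namespace(root)
```

After the call, both children of root carry the tag namespace "myFooNS" and
moved_count is 2. Elements that already have a namespace of their own are
left alone and pass that namespace on to their own un-namespaced children.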

The problem with trying to create a subset of XML (sans namespaces et al) is
that those namespaces and other features do have value to someone, and
everyone's edge case is different. If on the other hand you concentrated on
building lax (aka heuristic) parsers and accepted the notion that documents
may have confidence levels, then you can handle moderately ill-formed XML
while at the same time keeping the core specifications cleanly within XML.

I don't necessarily think that this is all that different from Anne van
Kesteren's ideas, save that rather than redefining XML, it simply expands
the degree of tolerance for working with XML content.

Kurt Cagle
XML Architect
*Lockheed / US National Archives ERA Project*



On Wed, Dec 22, 2010 at 4:34 PM, David Carlisle <davidc@nag.co.uk> wrote:

> On 22/12/2010 20:59, Kurt Cagle wrote:
>
>> XML5 is how Henri Sivonen and others on the HTML5 WG are referring to
>> XML parsed by that parser.
>>
>
>
> Not really. Henri was (I would think) referring to Anne's XML5 parser
>
> http://code.google.com/p/xml5/
>
> which is a lax parser for xml markup, but a private project of Anne's
> unrelated to HTML5 as currently specified.
>
> The HTML5 spec defines two ways of parsing what might loosely be called xml
> content.
>
> XHTML5 which is the xml serialisation of html, which is (as xhtml 1.0)
> intended to be parsed by an xml+namespaces parser with draconian error
> handling.
>
> "foreign content" which is the parse mode used by the html5 parser for
> text/html for the content of  <svg> and <math> which parses in lax html
> style, the main difference of foreign content parser mode being that />
> denotes empty tag rather than start tag.
>
> David
>
Received on Thursday, 23 December 2010 03:19:25 GMT
