Re: What problem is this task force trying to solve and why? from Kurt Cagle on 2010-12-23 (public-html-xml@w3.org from December 2010)

From: Kurt Cagle <kurt.cagle@gmail.com>
Date: Thu, 23 Dec 2010 10:03:40 -0500
To: John Cowan <cowan@mercury.ccil.org>
Cc: David Carlisle <davidc@nag.co.uk>, public-html-xml@w3.org
Message-ID: <AANLkTikyXShi-2aS9-aPXarGhB1FVcD+uQevs+4gX27+@mail.gmail.com>
John,

I would contend that when a web browser attempts to parse ill-formed HTML,
it is doing precisely this kind of "kludge".

In both cases what you are attempting to do is solve the "Grandmother
problem" - how do you take input from non-programmers (my Grandmother, for
instance) or from poorly designed devices and discern from that the intent
of the content. This is what most HTML parsers do, and seems to be the
anticipated behavior that the HTML community has for XML content. There is
some validity in that approach (the use cases of malformed RSS for
instance), but the result is that you have to go to a model of, yes,
guessing the intent of the user based upon the most likely form of error *when
that content is malformed in the first place.*

Read the post again. I am positing a new parser (as opposed to rewriting the
WHOLE of the XML canon, *plus* changing literally hundreds of billions of
XML documents currently in circulation) that would serve to take XML content
and attempt to intelligently discern what the intent of the user was. I've
laid out the mechanisms by which such a parser would work, and tried to make
the point that, yes, you can in fact change the heuristics based upon a set
of configuration files in those cases where you DID have a general idea of
the provenance of the XML. With work, you could even do it via streaming,
which would be ideal for the case of parsing such content within web
browsers for rendering.
I would also argue about your definition of a kludge. One of the key tenets
of the HTML5 working group is that the Grandmother principle is common and
pervasive, and that because of this the parser has to "discern" the input,
based upon a known schema. Frankly, it's not a kludge - it is a deliberately
thought out strategy to deal with the fact that real world data is dirty,
and I think that's a very compelling argument. What I am arguing is simply
that rather than seeing HTML5 as being some kind of blessed language that
has its own inner workings, you look at HTML5 as being XML for a second,
then ask what would need to change in that dirty-data parser to generalize
this to the level of XML.

Most of the problems that people have working with XML come from rules that
can seem arcane and arbitrary, and that, without a fairly sophisticated
understanding of the language, don't make sense. Consider my
first example. To David Carlisle's point, I fully recognize that this is
valid XML. I would also contend that in most cases, it is counterintuitive
to the vast majority of non-XML coders:

<ns1:foo xmlns:ns1="myFooNS">
    <bar/>
    <bat/>
</ns1:foo>
Listing 1. A namespaced element wraps anonymous content.

internally maps to:

<ns1:foo xmlns:ns1="myFooNS">
    <ns2:bar xmlns:ns2="myUndeclaredDefaultNamespace"/>
    <ns2:bat xmlns:ns2="myUndeclaredDefaultNamespace"/>
</ns1:foo>
Listing 2. This maps internally to a new set of namespaces in the default
namespace realm.

(XSLT is an obvious example of this approach).

However, to the vast majority of non-XML people, Listing 1 is INTENDED to
be:

<ns1:foo xmlns:ns1="myFooNS">
    <ns1:bar/>
    <ns1:bat/>
</ns1:foo>
Listing 3. Anonymous elements map to the declared namespace.

This is one of those cases where the obvious case is wrong, and it occurs
with surprising regularity.

This is where confidence comes in - if I parse the above, the heuristic (and
it is a heuristic) would say: in the case of default content within a
declared namespace, there is a 65% chance that what was intended was Listing
3, a 35% chance that what was intended was Listing 2. These specific
percentages could obviously be changed via customization. If I parse the
above, what is the highest overall confidence that I can achieve given
uncertain results? Here it would be 65%, though if there were other such
rules in place, the cumulative confidence would be the product of all of
those.
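
To make the arithmetic concrete, here is a minimal Python sketch of the
cumulative confidence described above. The function name and the second
rule's 90% figure are assumptions made up for illustration; this is not an
existing parser API.

```python
from math import prod


def cumulative_confidence(rule_confidences):
    """Multiply the confidences of every heuristic rule that fired.

    An empty list (no heuristic was needed) yields 1, i.e. the input
    was parsed without any guessing.
    """
    return prod(rule_confidences)


# Only the namespace rule fired: unprefixed children inside a prefixed
# parent were read as Listing 3 (65%).
single = cumulative_confidence([0.65])

# A second, purely hypothetical rule with 90% confidence also fired.
combined = cumulative_confidence([0.65, 0.90])

# No rule fired at all: the document needed no interpretation.
clean = cumulative_confidence([])
```

With these numbers the combined confidence drops to about 58.5%, which is
the sense in which each additional guess lowers the overall figure.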

If I'm parsing an XSLT document, I would require that the parser have a
confidence of 100% - and would generate an error if there were any
ambiguities that would arise. In short, such a parser would be a strictly
conforming one. On the other hand, let's say that I have XML representing a
playlist for a music program, and that there are perhaps a dozen or more
different vendors that each produce such playlists, but not all of them are
well-formed XML (and this also happens with alarming regularity) and the
ones that do conform have schemas that differ from standard ones in subtle
ways (the ordering of items is a big one). Ordering matters in XML when you
have schema validation and are employing <xs:sequence> - which is pretty
much the norm for most industrial grade schemas.

Given that scenario, you as a playlist developer could take the parser and,
rather than accepting the default configuration, feed it a configuration
file that would augment or override the defaults. That file would add rules
saying that certain element patterns map in certain ways to a target schema,
that specific "record" elements given outside of a container map to a
different internal structure, and so forth, and that the presence of these
particular elements carries specific confidences as well. Would this involve
XSLT or XQuery?
Yes, probably, though at some level a parser and an XSLT transformer are not
that different (as Michael Kay would no doubt verify).
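
As a rough illustration of "augment or override the defaults", the Python
sketch below merges a small XML rule file over a built-in rule table. The
format of the file, the rule ids and all of the numbers other than the 65%
default are hypothetical; no such configuration format has been defined.

```python
import xml.etree.ElementTree as ET

# Built-in heuristics (hypothetical rule id; 65% is the figure used above).
DEFAULT_RULES = {"unprefixed-child-inherits-parent-namespace": 0.65}

# Hypothetical vendor-specific configuration for playlist documents.
PLAYLIST_CONFIG = """
<rules>
  <rule id="unprefixed-child-inherits-parent-namespace" confidence="0.90"/>
  <rule id="record-outside-container" confidence="0.75"/>
</rules>
"""


def load_rules(config_xml, defaults=DEFAULT_RULES):
    """Return the defaults with every rule in config_xml added or overridden."""
    rules = dict(defaults)  # copy, so the defaults themselves stay untouched
    for rule in ET.fromstring(config_xml.strip()).iter("rule"):
        rules[rule.get("id")] = float(rule.get("confidence"))
    return rules


rules = load_rules(PLAYLIST_CONFIG)
```

Here the playlist file raises the confidence of the namespace rule and adds
one rule that the defaults do not know about, while a parser created without
the file would still see only the defaults.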

The point is that such a parser would still return a confidence level about
the resulting parsed content that can be used to establish thresholds of
confidence - this playlist is likely valid, this playlist may have enough
information that it can be displayed, even if it doesn't have everything,
this playlist is garbage and should be rejected out of hand. Playlists,
OPML, RSS feeds, even HTML, there's a whole universe of WEB-BASED content
that fits into the category of being useful but not strictly conforming to
established XML practices, and if XML is going to have any utility on the
web, then a fuzzy approach to parsing *when applicable* strikes me as the
easiest solution to achieve.
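
A minimal Python sketch of those thresholds follows. The function name, the
three labels and the cut-off values 0.9 and 0.5 are assumptions chosen for
illustration; a real application would pick its own.

```python
def classify(confidence, accept=0.9, display=0.5):
    """Sort a parse result into one of three buckets by its confidence."""
    if confidence >= accept:
        return "likely valid"
    if confidence >= display:
        return "displayable"
    return "reject"


# A playlist parsed with one 65% guess can still be shown to the user.
playlist_result = classify(0.65)

# A strict consumer (the XSLT case) sets both thresholds to 100%, so any
# ambiguity at all leads to rejection.
strict_result = classify(0.65, accept=1.0, display=1.0)
```

The same parse result is therefore usable by a forgiving application and an
error for a strict one, without the parser itself changing.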

To David Carlisle's points - the approach that I'm suggesting is one that's
well known in XML circles - rather than encoding your business rules (in
this case the schematic parsing rules) in code, you put it into ... um ...
XML files. I DON'T KNOW what the default heuristics would be, and at the
moment frankly don't care - because these rules are dynamic.

Would it take rebuilding parsers? Yes. Do I have some hand-waving on details
here? Yes, definitely - I haven't even begun to define what such a
configuration file would look like here, though I have some ideas. What I'm
arguing for is the principle - that by taking this approach, you solve
several problems at once:

1) processing all of those "XML" documents out there that are strictly
ill-formed and that up to now have been out of reach of XML.
2) differentiating strictly conforming XML - necessary for mission-critical
applications - from the more ill-formed XML.
3) parsing JSON or YAML (or HTML) into XML.

Serializers would work the same way, possibly up to and including the
generation of "malformed" content.

My gut feeling is that creating a MicroXML is not the solution - it's another
specification, and like all such specifications will end up generating more
new infrastructure on top of it. Using HTML5 and JSON is also not the
solution - there are too many places where JSON is inadequate as a language,
and HTML5 is, at least from my perspective, simply XML with quirks mode
enabled. Given that, it would seem that the best place to tackle the
impedance mismatches is at the point of entry and egress - the parsing and
serialization stacks.

My two cents worth, anyway.

Kurt Cagle
Invited Expert
W3C Web Forms Working Group


On Thu, Dec 23, 2010 at 1:02 AM, John Cowan <cowan@mercury.ccil.org> wrote:

> Kurt Cagle scripsit:
>
> > Consider, for instance, the characteristics of a hypothetical lax XML
> > parser
>
> Yeeks.  What you are doing here, AFAICS, is trying to design a kludge.
> By comparison, HTML parsing is an *evolved* kludge: it got to be the
> way it is as a result of natural selection (more or less).  The trouble
> with designing a kludge is, why this particular kludge and not one of any
> number of possible closely related kludges?  For the normal application
> of kludges as one-offs, this doesn't matter, but redesigning XML parsing
> is anything but a one-off.
>
> > As the parser works through these cases, it assigns a weight that
> > indicates the likelihood that a given heuristic rule determines the
> > correct configuration.
>
> Based on what?  To do this in a sound way, you'd have to have a lot of
> information about broken XML and what the creator *meant* to express
> by it.  I don't know any source of that information.  Otherwise you are
> not truly doing heuristics, but just guessing a priori about what kinds of
> error-generating processes are more important and what are less important.
>
> --
> In my last lifetime,                            John Cowan
> I believed in reincarnation;                    http://www.ccil.org/~cowan
> in this lifetime,                               cowan@ccil.org
> I don't.  --Thiagi
>
Received on Thursday, 23 December 2010 15:04:45 UTC