- From: Kurt Cagle <kurt.cagle@gmail.com>
- Date: Thu, 23 Dec 2010 10:03:40 -0500
- To: John Cowan <cowan@mercury.ccil.org>
- Cc: David Carlisle <davidc@nag.co.uk>, public-html-xml@w3.org
- Message-ID: <AANLkTikyXShi-2aS9-aPXarGhB1FVcD+uQevs+4gX27+@mail.gmail.com>
John,

I would contend that when a web browser attempts to parse ill-formed HTML, it is doing precisely this kind of "kludge". In both cases what you are attempting to do is solve the "Grandmother problem" - how do you take input from non-programmers (my Grandmother, for instance) or from poorly designed devices and discern from it the intent of the content? This is what most HTML parsers do, and it seems to be the behavior that the HTML community anticipates for XML content. There is some validity in that approach (the use case of malformed RSS, for instance), but the result is that you have to go to a model of, yes, guessing the intent of the user based upon the most likely form of error *when that content is malformed in the first place.*

Read the post again. I am positing a new parser (as opposed to rewriting the WHOLE of the XML canon, *plus* changing literally hundreds of billions of XML documents currently in circulation) that would serve to take XML content and attempt to intelligently discern what the intent of the user was. I've laid out the mechanisms by which such a parser would work, and tried to make the point that, yes, you can in fact change the heuristics based upon a set of configuration files in those cases where you DID have a general idea of the provenance of the XML. With work, you could even do it via streaming, which would be ideal for parsing such content within web browsers for rendering.

I would also argue about your definition of a kludge. One of the key tenets of the HTML5 working group is that the Grandmother principle is common and pervasive, and that because of this the parser has to "discern" the input based upon a known schema. Frankly, it's not a kludge - it is a deliberately thought-out strategy to deal with the fact that real-world data is dirty, and I think that's a very compelling argument. What I am arguing is simply that rather than seeing HTML5 as some kind of blessed language with its own inner workings, you look at HTML5 as being XML for a second, then ask what would need to change in that dirty-data parser to generalize it to the level of XML.

Most of the problems that people have working with XML are that there are rules that can seem arcane and arbitrary, and that, without a fairly sophisticated understanding of the language, don't make sense. Consider my first example. To David Carlisle's point, I fully recognize that this is valid XML. I would also contend that in most cases it is counterintuitive to the vast majority of non-XML coders:

    <ns1:foo xmlns:ns1="myFooNS">
        <bar/>
        <bat/>
    </ns1:foo>

Listing 1. A namespaced element wraps anonymous (unprefixed) content.

Internally, this maps to:

    <ns1:foo xmlns:ns1="myFooNS">
        <ns2:bar xmlns:ns2="myUndeclaredDefaultNamespace"/>
        <ns2:bat xmlns:ns2="myUndeclaredDefaultNamespace"/>
    </ns1:foo>

Listing 2. The anonymous content maps internally to a new set of namespaces in the default namespace realm. (XSLT is an obvious example of this approach.)

However, to the vast majority of non-XML people, Listing 1 is INTENDED to be:

    <ns1:foo xmlns:ns1="myFooNS">
        <ns1:bar/>
        <ns1:bat/>
    </ns1:foo>

Listing 3. Anonymous elements map to the declared namespace.

This is one of those cases where the obvious case is wrong, and it occurs with surprising regularity. This is where confidence comes in - if I parse the above, the heuristic (and it is a heuristic) would say: in the case of default content within a declared namespace, there is a 65% chance that what was intended was Listing 3, and a 35% chance that what was intended was Listing 2.
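To make that concrete, here is a rough sketch - purely illustrative, with the names and percentages as placeholders rather than any proposed API - of how a single heuristic might carry its confidence along with its repair:

    from dataclasses import dataclass

    @dataclass
    class Repair:
        description: str   # what the heuristic decided to do
        confidence: float  # estimated likelihood this matches the author's intent

    def resolve_unprefixed_child(parent_prefix, child_name):
        # Heuristic for Listings 1-3: an unprefixed child inside a prefixed
        # parent is most likely *intended* to share the parent's namespace
        # (Listing 3), and less likely to belong in the default/no namespace
        # (Listing 2).
        candidates = [
            Repair(f"treat <{child_name}> as <{parent_prefix}:{child_name}>", 0.65),
            Repair(f"leave <{child_name}> in the default (no) namespace", 0.35),
        ]
        # Return the most likely interpretation, keeping its confidence so the
        # caller can decide whether it clears whatever threshold it requires.
        return max(candidates, key=lambda r: r.confidence)

    repair = resolve_unprefixed_child("ns1", "bar")
    print(repair.description, repair.confidence)   # best guess, at 0.65

The point is only that the chosen interpretation and its confidence travel together, so whatever consumes the parse can decide what to do with an uncertain repair.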
These specific percentages could obviously be changed via customization. The question, then, is: if I parse the above, what is the highest overall confidence that I can achieve given uncertain results? Here it would be 65%, though if there were other such rules in play, the cumulative confidence would be the product of all of those. If I'm parsing an XSLT document, I would require that the parser have a confidence of 100% - and it would generate an error for any ambiguities that arise. In short, such a parser would be a strictly conforming one.

On the other hand, let's say that I have XML representing a playlist for a music program, and that there are perhaps a dozen or more different vendors that each produce such playlists, but not all of them are well-formed XML (this also happens with alarming regularity), and the ones that do conform have schemas that differ from the standard ones in subtle ways (the ordering of items is a big one). Ordering matters in XML when you have schema validation and are employing <xs:sequence> - which is pretty much the norm for most industrial-grade schemas.

Given that scenario, you as a playlist developer could take the parser but, rather than accepting the default configuration, feed it a configuration file that would augment or override the defaults - adding rules saying that certain element patterns would map in certain ways to a target schema, that specific "record" elements appearing outside of a container would be mapped to a different internal structure, and so forth - and that the presence of these particular elements would have specific confidences associated with them as well. Would this involve XSLT or XQuery? Yes, probably, though at some level a parser and an XSLT transformer are not that different (as Michael Kay would no doubt verify).

The point is that such a parser would still return a confidence level about the resulting parsed content that can be used to establish thresholds: this playlist is likely valid; this playlist may have enough information that it can be displayed, even if it doesn't have everything; this playlist is garbage and should be rejected out of hand. Playlists, OPML, RSS feeds, even HTML - there's a whole universe of WEB-BASED content that fits into the category of being useful but not strictly conforming to established XML practices, and if XML is going to have any utility on the web, then a fuzzy approach to parsing *when applicable* strikes me as the easiest solution to achieve.

To David Carlisle's points - the approach that I'm suggesting is one that's well known in XML circles: rather than encoding your business rules (in this case the schematic parsing rules) in code, you put them into ... um ... XML files. I DON'T KNOW what the default heuristics would be, and at the moment frankly don't care - because these rules are dynamic. Would it take rebuilding parsers? Yes. Do I have some hand-waving on details here? Yes, definitely - I haven't even begun to define what such a configuration file would look like, though I have some ideas. What I'm arguing for is the principle - that by taking this approach, you solve several problems at once:

1) processing all of those "XML" documents out there that are strictly ill-formed and that up to now have been out of reach of XML;

2) differentiating strictly complying XML - necessary for mission-critical applications - from the more ill-formed variety; and

3) parsing JSON or YAML (or HTML) into XML.
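Tying that back to the thresholds above - again, a sketch only, with the numbers and names being placeholders rather than a specification - the per-rule confidences would multiply into a document-level score that each application checks against its own bar:

    from math import prod

    # Each heuristic repair applied during a parse carries a confidence;
    # the document-level confidence is the product of all of them.
    def document_confidence(repair_confidences):
        return prod(repair_confidences) if repair_confidences else 1.0

    # Hypothetical per-application thresholds: XSLT demands strict conformance
    # (any ambiguity is fatal), while a playlist feed can tolerate fuzziness.
    THRESHOLDS = {"xslt": 1.0, "playlist": 0.5}

    def accept(doc_type, repair_confidences):
        return document_confidence(repair_confidences) >= THRESHOLDS[doc_type]

    print(accept("xslt", [0.65]))           # False - 0.65 < 1.0, reject
    print(accept("playlist", [0.65, 0.9]))  # True  - 0.585 >= 0.5, usable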
Serializers would work the same way, possibly up to and including the generation of "malformed" content.

My gut feeling is that creating a MicroXML is not the solution - it's another specification, and like all such specifications it will end up generating more new infrastructure on top of it. Using HTML5 and JSON is also not the solution - there are too many places where JSON is inadequate as a language, and HTML5 is, at least from my perspective, simply XML with quirks mode enabled. Given that, it would seem that the best place to tackle the impedance mismatches is at the point of entry and egress - the parsing and serialization stacks.

My two cents worth, anyway.

Kurt Cagle
Invited Expert
W3C Web Forms Working Group

On Thu, Dec 23, 2010 at 1:02 AM, John Cowan <cowan@mercury.ccil.org> wrote:

> Kurt Cagle scripsit:
>
> > Consider, for instance, the characteristics of a hypothetical lax XML
> > parser
>
> Yeeks. What you are doing here, AFAICS, is trying to design a kludge.
> By comparison, HTML parsing is an *evolved* kludge: it got to be the
> way it is as a result of natural selection (more or less). The trouble
> with designing a kludge is, why this particular kludge and not one of any
> number of possible closely related kludges? For the normal application
> of kludges as one-offs, this doesn't matter, but redesigning XML parsing
> is anything but a one-off.
>
> > As the parser works through these cases, it assigns a weight that
> > indicates the likelihood that a given heuristic rule determines the
> > correct configuration.
>
> Based on what? To do this in a sound way, you'd have to have a lot of
> information about broken XML and what the creator *meant* to express
> by it. I don't know any source of that information. Otherwise you are
> not truly doing heuristics, but just guessing a priori about what kinds of
> error-generating processes are more important and what are less important.
>
> --
> In my last lifetime,              John Cowan
> I believed in reincarnation;      http://www.ccil.org/~cowan
> in this lifetime,                 cowan@ccil.org
> I don't.  --Thiagi
Received on Thursday, 23 December 2010 15:04:45 UTC