
scalability and forgiving parsers

From: Larry Masinter <masinter@adobe.com>
Date: Fri, 20 Jan 2012 08:28:05 -0800
To: Anne van Kesteren <annevk@opera.com>
CC: "public-html-xml@w3.org" <public-html-xml@w3.org>
Message-ID: <C68CB012D9182D408CED7B884F441D4D06A86D2D1C@nambxv01a.corp.adobe.com>
There was a thread 

Somehow I wonder if, by "the apparent perception that an HTML parser is somehow vastly more complex than its XML counterpart" (and a pointer to a Google+ thread), Anne was referring to my comment on that thread about scalability. I kind of gave up trying to make my point in a thread in Google+, but I thought I would try here.

It's really the same point as in http://masinter.blogspot.com/2010/01/over-specification-is-anti-competitive.html .

The general principle is that the more you constrain exactly how an agent is to behave, the more you constrain the implementation style and methodology of implementations.  At least on first principles, every MUST in a specification -- if it means anything at all -- has to make some otherwise compliant implementations non-compliant.  If it doesn't, it isn't really normative.... 

At least theoretically, if you specify a simple language based on matching brackets which you can parse or scan with a regular expression, and then add rules for how other strings should also parse, you constrain how to write a scanner. Maybe you don't necessarily make scanners more _complicated_; you just reduce the flexibility, e.g., you can't use some technology you've implemented for other purposes (oh, YACC or something).
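
A minimal sketch of that point in Python, under loud assumptions: the toy language (lowercase tags plus text) and the recovery rule (a stray "<" or ">" becomes one-character text) are made up for illustration and are not taken from the XML or HTML specifications. The strict scanner is just a loop around one regular expression; once the recovery rule is a MUST, every conforming scanner has to reproduce the second function's output for malformed input as well.

```python
import re

# Hypothetical toy language: <name>, </name>, and runs of text without '<' or '>'.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<>]+)")


def _token_from(match):
    """Turn one regex match into an (kind, value) token."""
    slash, name, text = match.groups()
    if name is not None:
        return ("close" if slash else "open", name)
    return ("text", text)


def strict_scan(source):
    """Tokenize, raising ValueError on any input outside the grammar."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"not well-formed at offset {pos}")
        tokens.append(_token_from(match))
        pos = match.end()
    return tokens


def forgiving_scan(source):
    """Same grammar plus one assumed recovery rule for stray brackets."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            # Assumed rule: emit the offending character as text and go on.
            tokens.append(("text", source[pos]))
            pos += 1
            continue
        tokens.append(_token_from(match))
        pos = match.end()
    return tokens
```

Both functions agree on well-formed input such as "<a>hi</a>"; they differ only on input like "a < b", where the strict scanner is free to stop however it likes and the forgiving one is pinned to a specific token sequence.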

I *think* the counter-argument is that this doesn't apply to XML (XHTML) and HTML, that XML (XHTML) is just as complicated as HTML, that there can be as wide a variety of HTML parsers and processors as there are of XML ones.

Is any of these a possibility:
* you agree that the general principle makes sense, but you disagree about its application to XML/HTML?
* you disagree with the general principle?
* you thought I was making a different point?


Larry


-----Original Message-----
...

I do not feel too strongly, and please publish if this is all that is holding the document back, but I do think a comment I made earlier still stands. The comparison between how HTML instructs an agent to "recover  from markup errors" whereas XML is unforgiving is skewed. I think the reality is more that HTML creates a tree out of any given input and XML defines a number of conditions that will not result in a tree. I think this is important because of the apparent perception that an HTML parser is somehow vastly more complex than its XML counterpart. See e.g.  
https://plus.google.com/103429767916333774260/posts/R6dPzhbc94R for an example of that.

...
Received on Friday, 20 January 2012 16:28:45 GMT
