Re: Comments on HTML WG face to face meetings in France Oct 08 from Maciej Stachowiak on 2008-11-17 (www-tag@w3.org from November 2008)

From: Maciej Stachowiak <mjs@apple.com>
Date: Sun, 16 Nov 2008 22:28:19 -0800
To: elharo@metalab.unc.edu
Cc: Jonas Sicking <jonas@sicking.cc>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, noah_mendelsohn@us.ibm.com, Dean Edridge <dean@dean.org.nz>, public-html <public-html@w3.org>, www-tag@w3.org
Message-id: <CCA014AD-FB8A-4130-8631-E59329AAD798@apple.com>

On Nov 16, 2008, at 10:03 PM, Elliotte Harold wrote:

>
> Yes. Error handling is fine. Error correction is much more  
> problematic. It makes the spec far harder to understand and  
> implement. In essence, the path taken by HTML 5 is that there is no  
> such thing as a document which is in error. All byte streams become  
> legal HTML documents. That's not how they phrase it, but that's the  
> effect.

Since the HTML5 spec itself says the opposite (that some byte streams  
are conforming HTML documents and some are not), I think it is a  
stretch to say it makes any byte stream legal.

Let's compare with a well-known semantic language, English. Many  
utterances in English contain syntax errors. Many such utterances will  
still be understood correctly by a native speaker, and indeed the  
listener will not bother to flag the error most of the time. But that  
does not mean that all utterances are correct English.

> It's an interesting idea, and might even work (though I'm skeptical)  
> but it very much raises the bar for implementing parsers, and is  
> contrary to the design of XML at a very deep level. In essence, it  
> is a fundamental rejection of one of the core values of XML. It is  
> the polar opposite of draconian error handling.

I agree that defining detailed error handling sounds like it would  
greatly increase implementation complexity. However, it should be kept  
in mind that most software will consume HTML5 content using an  
off-the-shelf parser, much as it does for XML content. Such parsers  
have been written already in at least Python, Java and Ruby, so it is  
not an infeasible task. I'll go further and say that using an  
off-the-shelf parser is not only common but a good idea. Most software  
has no reason to write its own XML or HTML parser or serializer, and  
doing so is more likely to lead to mistakes than to be in any way  
helpful.
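
To make the off-the-shelf point concrete, here is a minimal sketch in  
Python. It uses the standard library's html.parser module purely as a  
stand-in: that module is lenient but does not implement the HTML5  
parsing algorithm, so take it as an illustration of the calling  
pattern, not of HTML5 conformance. The names LinkCollector,  
collect_links and SLOPPY are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Return every link target found, however sloppy the markup."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Unclosed <p>, misnested <b>/<a>, and an unquoted attribute value.
SLOPPY = '<p>Unclosed <a href="one.html">one<p><b>bad <a href=two.html>two</b></a>'

if __name__ == "__main__":
    print(collect_links(SLOPPY))
```

The calling code never deals with the malformed input itself; all of  
that stays inside the library.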

And finally, in my experience, it is not necessarily even true that  
the error handling of HTML makes it harder to implement parsing than  
for XML. In WebKit, the pieces of code implementing HTML and XML  
parsing are close to the same size, and that is not even including the  
libxml library that does most of the heavy lifting in XML parsing. If  
you include that, then the XML parsing code is several times bigger.  
There are a few reasons for this. First, support for the internal  
subset adds a lot of complexity to XML parsing. Second, an XML parser  
is required to detect and report many error conditions; this is much  
of the same complexity that results from HTML error handling, but in  
fact it is sometimes worse, because an HTML parser can treat a number  
of error conditions exactly the same way as non-error conditions if it  
is not seeking specifically to report the error.
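
The difference in error behavior can be sketched with two Python  
standard-library parsers. This is only an illustration under  
assumptions: xml.etree.ElementTree stands in for a conforming XML  
parser, html.parser stands in for a lenient HTML parser (it is not an  
HTML5 implementation), and the helper names are invented for the  
example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gathers character data and ignores tag structure entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(markup):
    """Lenient path: misnested tags go through the same code as valid ones."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


def xml_text(markup):
    """Strict path: the parser must detect the error and refuse the input."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None  # not well-formed, so no content is returned
    return "".join(root.itertext())


MISNESTED = "<p><b>bold <i>both</b> italic</i></p>"
WELL_FORMED = "<p><b>bold <i>both</i></b><i> italic</i></p>"

if __name__ == "__main__":
    for sample in (MISNESTED, WELL_FORMED):
        print(repr(html_text(sample)), repr(xml_text(sample)))
```

The lenient collector needs no extra branch for the misnested input,  
while the strict path has to recognize the error and report it.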

So overall, I would say that in practice it is not a big problem that  
the parsing algorithm handles so many different errors. Indeed, for  
many applications, this is outweighed by the benefit of being able to  
process content more like browsers do.

Regards,
Maciej

Received on Monday, 17 November 2008 06:29:02 UTC