- From: Maciej Stachowiak <mjs@apple.com>
- Date: Sun, 16 Nov 2008 22:28:19 -0800
- To: elharo@metalab.unc.edu
- Cc: Jonas Sicking <jonas@sicking.cc>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, noah_mendelsohn@us.ibm.com, Dean Edridge <dean@dean.org.nz>, public-html <public-html@w3.org>, www-tag@w3.org
On Nov 16, 2008, at 10:03 PM, Elliotte Harold wrote:

>
> Yes. Error handling is fine. Error correction is much more
> problematic. It makes the spec far harder to understand and
> implement. In essence, the path taken by HTML 5 is that there is no
> such thing as a document which is in error. All byte streams become
> legal HTML documents. That's not how they phrase it, but that's the
> effect.

Since the HTML5 spec itself says the opposite (that some byte streams are conforming HTML documents and some are not), I think it is a stretch to say it makes any byte stream legal.

Let's compare with a well-known semantic language, English. Many utterances in English contain syntax errors. Many such utterances will still be understood correctly by a native speaker, and indeed the listener will not bother to flag the error most of the time. But that does not mean that all utterances are correct English.

> It's an interesting idea, and might even work (though I'm skeptical)
> but it very much raises the bar for implementing parsers, and is
> contrary to the design of XML at a very deep level. In essence, it
> is a fundamental rejection of one of the core values of XML. It is
> the polar opposite of draconian error handling.

I agree that defining detailed error handling sounds like it would greatly increase implementation complexity. However, it should be kept in mind that most software will consume HTML5 content using an off-the-shelf parser, much as it does for XML content. Such parsers have been written already in at least Python, Java and Ruby, so it is not an infeasible task.

I'll go further and say it is not only common for software to use off-the-shelf parsers, but a good idea. Most software has no reason to write its own XML or HTML parser or serializer, and doing so is more likely to lead to mistakes than to be in any way helpful.
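As a rough illustration of the contrast under discussion, the sketch below feeds the same mis-nested markup to two off-the-shelf parsers from the Python standard library. Note the assumptions: `html.parser` is merely a tolerant HTML parser and does not implement the HTML5 parsing algorithm, and it is not one of the parsers referred to above, which are not named. The sample string and the helper names `TagCollector`, `parse_leniently` and `parse_strictly` are invented for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample input: the <b> element is never closed.
MALFORMED = "<p>Hello <b>world</p>"


class TagCollector(HTMLParser):
    """Records start tags and text without complaining about errors."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse_leniently(markup):
    """Tolerant parse: returns (start tags seen, concatenated text)."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags, "".join(collector.text)


def parse_strictly(markup):
    """Draconian parse: returns True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(parse_leniently(MALFORMED))
    print(parse_strictly(MALFORMED))
```

In both cases the calling code is only a few lines long; the difference in error policy lives inside the library, not in the software that uses it.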
And finally, in my experience, it is not necessarily even true that the error handling of HTML makes it harder to implement parsing than for XML. In WebKit, the pieces of code implementing HTML and XML parsing are close to the same size, and that is not even including the libxml library that does most of the heavy lifting in XML parsing. If you include that, then the XML parsing code is several times bigger.

There are a few reasons for this. First, support for the internal subset adds a lot of complexity to XML parsing. Second, an XML parser is required to detect and report many error conditions; this is much of the same complexity that results from HTML error handling, but in fact it is sometimes worse, because an HTML parser can treat a number of error conditions exactly the same way as non-error conditions if it is not seeking specifically to report the error.

So overall, I would say that in practice it is not a big problem that the parsing algorithm handles so many different errors. Indeed, for many applications, this is outweighed by the benefit of being able to process content more like browsers do.

Regards,
Maciej
Received on Monday, 17 November 2008 06:29:02 UTC