- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Mon, 17 Nov 2008 10:06:14 -0500
- To: elharo@metalab.unc.edu
- CC: public-html <public-html@w3.org>, www-tag@w3.org
Elliotte Harold wrote:
> Yes. Error handling is fine. Error correction is much more problematic.

If we're going to be pedantic, there are really four things to consider. In order of "effort" (quoted because in practice this isn't the case, as Maciej says) they are: error detection, error handling, error recovery, error correction.

Error detection:  Noticing that there is an error.
Error handling:   Deciding what to do with the error.
Error recovery:   Skipping over the error and continuing with parsing.
Error correction: Doing what the author "really meant".

Examples: Error detection would be noticing a mismatched closing tag in XML or an unknown property name in CSS. Error handling would be terminating the parser in XML or performing error recovery in CSS. Error recovery in CSS would be skipping over the declaration. Error correction in both cases would be figuring out that the author made a typo and pretending that it didn't happen.

These are the things that all need to be defined. Here's how typical web languages handle these:

XML: Error detection is well-defined. UAs MUST NOT perform error correction. UAs MUST NOT perform error recovery (though there is weasel wording about reporting errors even after the first one). Error handling is not well defined, which is a major problem.

CSS: Aims to have error detection, handling, and recovery well-defined (though people keep finding ambiguities in the definitions). UAs MUST NOT perform error correction.

HTML4: Defines error detection (DTD, prose). Doesn't define anything else, though makes suggestions about error handling, recovery, and correction.

HTML5: Aims to define all four.

Note that a lot of what you seem to call "error correction" is in fact "error recovery". It's complicated by having to recover in ways that are consistent with what browsers actually do, which makes it look like correction. But there is plenty of correction that doesn't happen. If you use an <h7>, no one will correct it to <h6> for you.
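The distinction between the four stages can be sketched with a toy parser for CSS-like declarations. This is only a minimal illustration under stated assumptions: the set of known properties, the `policy` switch, and the names `parse_declarations` and `FatalParseError` are all hypothetical, and no real CSS or XML parser is structured this simply. The "halt" policy loosely mimics XML-style handling (stop at the first error), while "recover" loosely mimics CSS-style handling (skip the bad declaration and continue). Note that neither policy performs error correction: a misspelled property is never rewritten into the one the author probably meant.

```python
# Hypothetical illustration only: the property list and all names here are
# assumptions made for this sketch, not taken from any specification.
KNOWN_PROPERTIES = {"color", "margin", "padding"}


class FatalParseError(Exception):
    """Raised when the 'halt' policy stops parsing at the first error."""


def parse_declarations(text, policy="recover"):
    """Parse 'name: value; ...' and return (declarations, errors)."""
    declarations = {}
    errors = []
    for raw in text.split(";"):
        raw = raw.strip()
        if not raw:
            continue
        name, sep, value = raw.partition(":")
        name = name.strip()
        value = value.strip()

        # Error detection: notice a missing colon or an unknown property.
        if not sep or name not in KNOWN_PROPERTIES:
            errors.append(raw)

            # Error handling: decide what to do with the detected error.
            if policy == "halt":
                raise FatalParseError(raw)

            # Error recovery: skip this declaration and keep parsing.
            # No error correction: "colr" is never rewritten to "color".
            continue

        declarations[name] = value
    return declarations, errors
```

With the "recover" policy, `parse_declarations("color: red; colr: blue; margin: 0")` keeps the `color` and `margin` declarations and reports `colr: blue` as an error; with `policy="halt"` the same input raises `FatalParseError` at the second declaration.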
> It makes the spec far harder to understand and implement.

There is no effect on the authoring aspect (in that, the parsing algorithm doesn't affect the definition of what is valid). It makes the parsing specification harder to understand, I agree. That's a necessary cost of making it implementable in this case.

> In essence, the path taken by HTML 5 is that there is no such thing as a document
> which is in error.

No, the path taken is that error recovery must be compatible with what UAs actually do and what HTML4 recommends. Just because a document is in error doesn't mean that you don't use as much of it as you can. This is similar to what CSS does.

> All byte streams become legal HTML documents. That's
> not how they phrase it, but that's the effect.

Note that CSS aims at something similar (modulo the business about the lexer there). That is, all byte streams are parseable (as opposed to legal!) documents.

> It's an interesting idea, and might even work (though I'm skeptical)

It's working OK for CSS so far, no? It's working OK for HTML4 (modulo interop problems due to everything being undefined).

> but it very much raises the bar for implementing parsers

I'd like to confirm what Maciej said here: the XML and HTML parsers in Gecko are of about equal size (and complexity, imo). The big difference is that we're using an off-the-shelf XML parser (with some local changes), which reduces the maintenance burden significantly. That's the model we'd like to move to with HTML as well.

-Boris
Received on Monday, 17 November 2008 15:07:01 UTC