Re: Comments on HTML WG face to face meetings in France Oct 08 from Boris Zbarsky on 2008-11-17 (public-html@w3.org from November 2008)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Mon, 17 Nov 2008 10:06:14 -0500
To: elharo@metalab.unc.edu
CC: public-html <public-html@w3.org>, www-tag@w3.org
Message-ID: <49218866.8060304@mit.edu>

Elliotte Harold wrote:
> Yes. Error handling is fine. Error correction is much more problematic. 

If we're going to be pedantic, there are really four things to consider.
In order of "effort" (quoted because in practice this ordering doesn't
hold, as Maciej says) they are: error detection, error handling, error
recovery, error correction.

  Error detection: Noticing that there is an error.
  Error handling: Deciding what to do with the error.
  Error recovery: Skipping over the error and continuing with parsing.
  Error correction: Doing what the author "really meant".

Examples: Error detection would be noticing a mismatched closing tag in 
XML or an unknown property name in CSS.  Error handling would be 
terminating the parser in XML or performing error recovery in CSS. 
Error recovery in CSS would be skipping over the declaration.  Error 
correction in both cases would be figuring out that the author made a 
typo and pretending that it didn't happen.
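
To make the CSS example concrete, here is a minimal Python sketch of the first
three stages for a toy declaration list. It is not how any real CSS parser
works: the KNOWN_PROPERTIES set and the parse_declarations function are
hypothetical names made up for illustration, and splitting on ";" ignores
strings, comments, and nested blocks that real CSS has to deal with.

```python
# Hypothetical, tiny property list; real CSS defines far more.
KNOWN_PROPERTIES = {"color", "margin", "width"}


def parse_declarations(block):
    """Parse 'name: value; ...' and return (accepted, errors)."""
    accepted = {}
    errors = []
    for raw in block.split(";"):
        decl = raw.strip()
        if not decl:
            continue
        name, sep, value = decl.partition(":")
        name = name.strip().lower()
        value = value.strip()
        # Error detection: notice a malformed or unknown declaration.
        if not sep or name not in KNOWN_PROPERTIES:
            # Error handling: the decision here is to record the problem
            # and recover, rather than abort the whole parse.
            errors.append(decl)
            # Error recovery: skip this declaration and keep parsing.
            continue
        accepted[name] = value
    return accepted, errors


accepted, errors = parse_declarations("color: red; colr: blue; width: 10px")
```

Note what the sketch does not do: it never guesses that "colr" was meant to be
"color". That guess would be error correction, which is the stage CSS forbids.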

These are the things that all need to be defined.  Here's how typical 
web languages handle these:

   XML: Error detection is well-defined.  UAs MUST NOT perform
        error correction.  UAs MUST NOT perform error recovery
        (though there is weasel wording about reporting errors
        even after the first one).  Error handling is not well
        defined, which is a major problem.

   CSS: Aims to have error detection, handling, and recovery
        well-defined (though people keep finding ambiguities in
        the definitions).  UAs MUST NOT perform error correction.

   HTML4: Defines error detection (DTD, prose).  Doesn't define
          anything else, though it makes suggestions about error
          handling, recovery, and correction.

   HTML5: Aims to define all four.  Note that a lot of what you seem to
          call "error correction" is in fact "error recovery".  It's
          complicated by having to recover in ways that are consistent
          with what browsers actually do, which makes it look like
          correction.  But there is plenty of correction that doesn't
          happen.  If you use an <h7>, no one will correct it to <h6>
          for you.
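
The contrast between the XML and HTML rows can be seen with Python's standard
library, as a rough illustration only. xml.etree.ElementTree stops at the first
well-formedness error, while html.parser keeps going. Note the assumption here:
html.parser is a lenient tokenizer, not an implementation of the HTML5 parsing
algorithm, so it only demonstrates "recover and continue, but don't correct".
The helper names xml_parses, TagCollector, and html_start_tags are made up for
this sketch.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def xml_parses(text):
    """Return True if text is well-formed XML, False on a fatal error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Detection plus termination: no tree is produced at all.
        return False


class TagCollector(HTMLParser):
    """Record start tags exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags
```

With these helpers, xml_parses("<a><b></a>") is False because of the mismatched
closing tag, while html_start_tags("<h7>Title</h7><p>text") reports "h7" and
"p": the unknown <h7> is passed through as written, not rewritten to <h6>.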

> It makes the spec far harder to understand and implement.

There is no effect on the authoring aspect (in that the parsing
algorithm doesn't affect the definition of what is valid).  It makes the
parsing specification harder to understand, I agree.  That's a necessary
cost of making it implementable in this case.

> In essence, the path taken by HTML 5 is that there is no such thing as a document 
> which is in error.

No, the path taken is that error recovery must be compatible with what 
UAs actually do and what HTML4 recommends.  Just because a document is 
in error doesn't mean that you don't use as much of it as you can.  This 
is similar to what CSS does.

> All byte streams become legal HTML documents. That's 
> not how they phrase it, but that's the effect.

Note that CSS aims at something similar (modulo the business about the 
lexer there).  That is, all byte streams are parseable (as opposed to 
legal!) documents.

> It's an interesting idea, and might even work (though I'm skeptical)

It's working OK for CSS so far, no?  It's working OK for HTML4 (modulo 
interop problems due to everything being undefined).

> but it very much raises the bar for implementing parsers

I'd like to confirm what Maciej said here: the XML and HTML parsers in 
Gecko are of about equal size (and complexity, imo).  The big difference 
is that we're using an off-the-shelf XML parser (with some local 
changes), which reduces the maintenance burden significantly.  That's 
the model we'd like to move to with HTML as well.

-Boris
Received on Monday, 17 November 2008 15:07:07 UTC