Re: An HTML language specification vs. a browser specification from Boris Zbarsky on 2008-11-14 (public-html@w3.org from November 2008)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Fri, 14 Nov 2008 18:45:40 -0500
To: Robert J Burns <rob@robburns.com>
CC: public-html@w3.org
Message-ID: <491E0DA4.9020103@mit.edu>
Robert J Burns wrote:
> I'm not really clear what your questions are directed at in my previous 
> message.

They certainly are.

> Certainly, that would need to be part of a parsing spec. With SGML we 
> had DTDs.

DTDs aren't a parsing spec.  DTDs are a way to specify what markup is 
valid (and some details like inferring opening/closing tags).  SGML + 
DTDs is closer to a parsing spec (mostly on the SGML side).  But it's 
not really sufficiently well-defined to handle arbitrary byte streams.

> With HTML5 we have prose along with specific error 
> handling for ill-formed/invalid markup.

Right.  Though there's plenty of perfectly valid markup that requires 
behavior that looks suspiciously like error handling, as a result of the 
SGML legacy described above.

> What I'm suggesting is that this 
> part of the HTML5 spec suffers from not having some specialized 
> expertise applied to this.

Specialized expertise in what?  Language design?  Parser design? 
Parsing HTML?  I think a good bit of HTML parsing expertise has been 
applied to writing this part of the spec.  Unless by "this" you mean 
something other than "prose along with specific error handling".

> Ideally I think we could have a parsing 
> specification that applied to HTML and SGML equally, but with the 
> possibility of specifying error handling for other DTD specified SGML. 
> Think of it as an SGML parser with a built-in HTML5 DTD.

That's not particularly compatible with the way HTML actually needs to 
be parsed....  And a DTD can't specify the behavior that's needed out of 
an HTML parser, I should note.  If you're using "DTD" as a shorthand for 
"machine-readable format", there's no reason one couldn't create a 
machine-readable definition of the HTML5 state machine.  I'm just not 
sure that's what you're looking for.
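
To make "machine-readable definition of a state machine" concrete, here is a minimal sketch in Python. The state names echo those of the HTML5 tokenizer, but the table, the `classify` helper and the `run` driver are illustrative assumptions covering only a toy slice; this is not the spec's actual state machine.

```python
# Toy transition table: (current state, input class) -> next state.
# Hypothetical and heavily simplified; a real table would be far larger
# and would name an explicit error-handling transition for every input.
TRANSITIONS = {
    ("data", "<"): "tag_open",
    ("tag_open", "letter"): "tag_name",
    ("tag_name", "letter"): "tag_name",
    ("tag_name", ">"): "data",
}


def classify(ch):
    """Map a character to the input class used as a table key."""
    if ch.isalpha():
        return "letter"
    if ch in "<>":
        return ch
    return "other"


def run(text, state="data"):
    """Feed text through the table and return the list of states visited."""
    trace = [state]
    for ch in text:
        # In this toy, pairs missing from the table leave the state unchanged.
        state = TRANSITIONS.get((state, classify(ch)), state)
        trace.append(state)
    return trace
```

The point of the sketch is only that the transitions live in data rather than in prose, so a tool could read or check them. Whether that is the kind of "DTD-like" artifact being asked for is a separate question.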

> Parsing only depends on the HTML language with respect to the schema 
> handling.

It depends on the language because of the wide variety of tags that have 
to be handled in "weird" ways.

> Valid well-formed markup can be specified by the language 
> schema and leave error-handling specifications to the parsing algorithm. 

I'm not sure what the first part of that sentence means, to be honest, 
but I agree with the second part.  The parsing algorithm needs to be 
aware of the error handling, and hence of the HTML vocabulary and the 
various properties different HTML tags have in terms of parsing.

> Perhaps it would be better to say this is the specification of the HTML 
> vocabulary (elements, attributes, and content models) and DOM as 
> opposed to the HTML 'language' and DOM.

OK.  So this would basically be a list of elements, corresponding 
attributes, DOM interface, and the behavior of said DOM interfaces, 
without reference to where these elements come from or how they relate 
to each other other than that some elements may contain other elements 
in some cases?

>> Note that in practice parsing might need to depend on attribute 
>> values....
> 
> Could you give an example where parsing depends on attribute values?

Sure.  Compare the DOM produced by browsers for:

   <table>
     <input type="text">
   </table>

To that for:

   <table>
     <input type="hidden">
   </table>

In practice, either you submit your form controls in DOM order and parse 
differently depending on the type attribute of the control or you parse 
the same way no matter what the type, but submit in an order that has 
nothing to do with DOM order.
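
The first option (parse differently depending on the type attribute) can be sketched as a single decision, shown here in Python. This follows the rule as the HTML Standard's "in table" insertion mode describes it, to the best of my understanding: a hidden input stays inside the table, and any other input is moved out in front of it. The function name and the returned strings are hypothetical, chosen only for illustration.

```python
def place_input_in_table(attrs):
    """Say where an <input> start tag seen directly inside <table> ends up.

    attrs is a dict of attribute names to values for the input element.
    The return values are illustrative labels, not spec terminology.
    """
    type_value = attrs.get("type")
    # The comparison is meant to be case-insensitive; lower() is used here
    # as an approximation of ASCII case-insensitive matching.
    if type_value is not None and type_value.lower() == "hidden":
        return "inside table"
    # A missing type attribute, or any other value, takes the general path
    # for misplaced content in a table.
    return "foster-parented before table"
```

With the two snippets above, `place_input_in_table({"type": "text"})` takes the second branch and `place_input_in_table({"type": "hidden"})` takes the first, so the same tag yields a different tree position based purely on an attribute value.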

> Still there's an independence. We can allow scripts to call the parser 
> and we can have parsers produce scripts while still keeping the 
> definition separate.

The definition of which?

> The point of my post (and what I read Roy Fielding saying) is that the 
> current HTML5 specification's strength is in its web browser behavior 
> specification.

"web browser behavior" includes the parsing algorithm.  In fact, that's 
one of the most important parts of the current specification from 
Mozilla's point of view.

> The parsing algorithm and the HTML vocabulary parts of 
> the spec suffer because we don't have spec editors who sufficiently 
> understand those parts.

Uh...  We have spec editors who understand the parsing algorithm far 
better than anyone else I can think of, since they've spent a good bit 
of time studying how browsers actually parse HTML.  So I don't know 
where the "don't have spec editors who sufficiently understand those 
parts" meme comes from.  It more or less looks like a passive-aggressive 
accusation of incompetence to me.

Cheers,
Boris
Received on Friday, 14 November 2008 23:46:27 UTC