Re: An HTML language specification vs. a browser specification

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Fri, 14 Nov 2008 21:26:40 -0500
Message-ID: <491E3360.1040309@mit.edu>
To: Robert J Burns <rob@robburns.com>
CC: public-html@w3.org

Robert J Burns wrote:
> They may be, but the seeming points of contention you raise are not 
> contrary to anything I'm saying, so I'm not sure what the thrust of your 
> intervention is here.

You're proposing that the current spec be split up into multiple pieces.
I'm trying to understand your proposal so that I can decide what I
think of it.

> Agree, that's why I'm saying a separate parsing spec could be made 
> independent of the HTML5 vocabulary

Doing it independent of the HTML5 vocabulary is very difficult, as I 
said before...

Unless you're proposing that there be a generic (non-HTML-specific) 
parsing spec that defines all sorts of error-recovery behaviors, and 
then the actual HTML5 vocabulary spec defines parsing of HTML, by 
reference to those behaviors?  So to understand how to parse HTML one
would have to read both specs?
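
To make that split concrete, here is a toy sketch in Python. It is not the HTML5 tree-construction algorithm, only an illustration under stated assumptions: `build_tree` stands in for the generic spec, and the `HTML_LIKE` table stands in for what the vocabulary spec would have to supply. All names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def build_tree(tags, implied_end):
    """Generic part: build a tree from start/end tag tokens.

    `tags` is a list such as ["p", "/p"]. `implied_end` is the
    vocabulary-specific table: start tag -> set of open elements
    that the start tag implicitly closes.
    """
    root = Node("#root")
    stack = [root]
    for tag in tags:
        if tag.startswith("/"):
            name = tag[1:]
            # Generic recovery: ignore an end tag with no open match.
            if any(n.name == name for n in stack[1:]):
                while stack.pop().name != name:
                    pass
        else:
            # Vocabulary-driven recovery: close whatever this start
            # tag implicitly ends (an "optional end tag" rule).
            while len(stack) > 1 and stack[-1].name in implied_end.get(tag, ()):
                stack.pop()
            node = Node(tag)
            stack[-1].children.append(node)
            stack.append(node)
    return root


def show(node):
    """Render a tree as name(child child ...)."""
    if not node.children:
        return node.name
    return f"{node.name}({' '.join(show(c) for c in node.children)})"


# Hypothetical, heavily simplified vocabulary table.
HTML_LIKE = {"p": {"p"}, "li": {"li"}}

tokens = ["ul", "li", "li", "/ul", "p", "p"]
print(show(build_tree(tokens, HTML_LIKE)))  # #root(ul(li li) p p)
print(show(build_tree(tokens, {})))         # #root(ul(li(li)) p(p))
```

The same input yields two different trees depending on the table, so the generic half alone does not say how HTML parses; a reader would need both halves.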

> By this I assume you mean the handling of DTD specified optional tag 
> omissions

Indeed, as well as SGML-defined handling of missing '>' at the very 
least.  I seem to recall a few other things, but I'm not an SGML expert 
by any means.

>> Specialized expertise in what?  Language design?  Parser design? 
>> Parsing HTML?  I think a good bit of HTML parsing expertise has been 
>> applied to writing this part of the spec.  Unless by "this" you mean 
>> something other than "prose along with specific error handling".
> 
> Certainly the reverse-engineering for the parsing of HTML is in the 
> current draft. The problem is the draft lacks any forward-looking 
> approach and leaves rendering engines (mozilla among them) in a state 
> that hinders language improvements in the future (like adding new 
> non-void head elements

Isn't the problem that current browsers don't allow "new non-void head 
elements" and that adding them would therefore not work in those 
browsers anyway?  So it seems to me that the real thing hindering this 
particular language improvement is not so much the current parsing draft 
as the general philosophy (expressed in this group's charter, I must 
add) that you don't break existing content and that HTML must degrade 
gracefully in downrev browsers.
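
A toy Python sketch of the degradation problem, assuming a legacy parser that ends the head at the first start tag it does not recognise as head content. `KNOWN_HEAD` and `place_elements` are hypothetical names, and real browsers differ in detail.

```python
# Simplified set of elements a legacy parser accepts inside <head>.
KNOWN_HEAD = {"title", "meta", "link", "style", "script", "base"}


def place_elements(tags):
    """Say where each start tag that follows <head> ends up."""
    section = "head"
    placement = []
    for tag in tags:
        if section == "head" and tag not in KNOWN_HEAD:
            # An unknown element implicitly ends the head; the parser
            # never returns to it afterwards.
            section = "body"
        placement.append((tag, section))
    return placement


print(place_elements(["title", "newthing", "meta"]))
# [('title', 'head'), ('newthing', 'body'), ('meta', 'body')]
```

In this sketch the new element, and everything after it, lands in the body no matter what a newer parsing spec says.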

> or not even incorporating Gecko's in span insertion mode).

I'm not sure what you're talking about here.

> Again, there's no point of contention here. All I'm saying is that the 
> parsing algorithm draft includes error-recovery that could be equally 
> applied to non-html DTD defined content (and the algorithm applies 
> error-recovery to HTML beyond what a DTD can define).

OK, agreed.

> It is that parsing 
> algorithm which would be better as a separate spec (one that could 
> incorporate improvements that allowed the HTML element and content model 
> vocabulary to change).

But the parsing algorithm's error recovery depends on details of the 
HTML vocabulary.  That's what I'm not understanding: in the absence of 
reference to HTML, what would your separate spec actually define?

> That again is the fancy error-recovery I spoke about. However, as we 
> advance the HTML5 error-recovery, there shouldn't be any need to add new 
> fancy error-recovery (like moving input elements outside of tables).

I'm not sure what you mean here.  Shouldn't be any need to add it in 
HTML5?  Or shouldn't be any need to add it post-HTML5?  In any case, I'm 
confused by your distinction between "fancy" error recovery and the 
other kind. Why is this distinction relevant, exactly?  HTML needs both.

> It only needs to be aware of the legacy HTML vocabulary and the error 
> recovery applied to that. The introduction of new vocabulary to HTML 
> need not be addressed in changes to error-recovery (there's no need for 
> new fancy error recovery moving forward if every browser adheres to the 
> parsing spec).

Ah, I think I see the issue.  You seem to be most interested in what one 
can do with HTML once all browsers are following the HTML5 parsing spec.
I, on the other hand, am more interested in having the HTML5 parsing
spec be such that current pages do not break in browsers that implement 
it, such that valid HTML5 documents downgrade properly in non-HTML5 
parsers, and such that language improvement could continue, in that 
order.  I agree that future extensibility of the language is important, 
but there's no point in creating a parsing spec that can't be 
implemented given existing content.

That said, I'm not clear on why you think that having multiple documents 
here would aid creating a parsing specification that allows new 
vocabulary to be added.

>> OK.  So this would basically be a list of elements, corresponding 
>> attributes, DOM interface, and the behavior of said DOM interfaces, 
>> without reference to where these elements come from or how they relate 
>> to each other other than that some elements may contain other elements 
>> in some cases?
> 
> No, it can be a complete specification of the vocabulary of HTML.

How does that differ from what I said?

> What is left out is how that vocabulary gets serialized or how such 
> serializations get parsed (except that the result of the parsing should 
> be a valid HTML object).

I'm not sure one can easily define <script>'s behavior without reference 
to how HTML gets parsed... not if you want to keep document.write(). 
Sad, and I wish no one had ever thought of document.write, but we're 
stuck with it now.

> Yes, but this is the fancy legacy error-recovery I was speaking about. 
> This fancy stuff was introduced by browsers because they didn't have a 
> predefined parsing algorithm and authors used invalid content models 
> where they were trying to solve certain problems.

That's all fine, but I'm not sure how that matters to the situation we 
have now.

> There's no reason that the parsing algorithm going forward needs to 
> introduce any new fancy error-recovery.

I agree. It just needs to support all the existing fancy error-recovery.

> Just to clarify, this strangeness is due to some browsers 
> introducing fancy error-recovery to repair poorly authored tables, while 
> authors simultaneously took advantage of other browser error-recovery to 
> hide input value inside tables. If browsers adhere to a newly specified 
> parsing algorithm we shouldn't need to introduce any new such 
> error-recovery in the future.

Fully agreed.  I'm just not sure what bearing that has on the discussion 
at hand.

>>> Still there's an independence. We can allow scripts to call the 
>>> parser and we can have parsers produce scripts while still keeping 
>>> the definition separate.
>>
>> The definition of which?
> 
> The definition of both. In other words we can specify HTML (and others 
> specify javascript) in a way that allows authors to call a parser and 
> parse a string or bytes into content without either HTML or javascript 
> specifying the algorithm for that parsing (e.g., scripting could allow 
> the parsing of a vcard into a microformats hcard, but hcard and HTML do 
> not need to define the algorithm for that parsing).

I would love to see this done for document.write as a proof-of-concept.

> On the other hand, 
> the parsing algorithm doesn't have to necessarily know anything about 
> the document schema that it is going to de-serialize content into.

The simple fact that the parser needs to block on <script> tags makes me 
doubt that, unless the parser is defined to have hooks that support the 
document schema altering parser behavior on the fly.

Which can be done; I'm just worried about the complexity.
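
A rough Python sketch of what such hooks might look like, to show where the complexity comes from. Everything here is hypothetical: the `Parser` class, its two hook-facing methods, and the `SCRIPTS` table, which stands in for actually running a script that calls document.write().

```python
import re

# Toy tokenizer: <tag>, </tag>, or a run of text.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


class Parser:
    """Vocabulary-agnostic toy parser; it knows no element names."""

    def __init__(self, source, start_hooks=None):
        self.source = source
        self.pos = 0
        self.events = []
        self.start_hooks = start_hooks or {}

    def read_raw_until(self, marker):
        """Hook API: consume raw text up to and including `marker`."""
        end = self.source.find(marker, self.pos)
        if end == -1:
            end = len(self.source)
        raw = self.source[self.pos:end]
        self.pos = min(len(self.source), end + len(marker))
        return raw

    def insert_input(self, markup):
        """Hook API: splice markup in at the current parse position."""
        self.source = self.source[:self.pos] + markup + self.source[self.pos:]

    def run(self):
        while self.pos < len(self.source):
            m = TOKEN.match(self.source, self.pos)
            if m is None:
                self.pos += 1  # toy recovery: skip a stray "<"
                continue
            self.pos = m.end()
            closing, name, text = m.groups()
            if text is not None:
                self.events.append(("text", text))
            elif closing:
                self.events.append(("end", name))
            else:
                self.events.append(("start", name))
                hook = self.start_hooks.get(name)
                if hook:
                    hook(self)  # parsing is blocked until the hook returns
        return self.events


# Schema side. Hypothetical table: script body -> markup it "writes".
SCRIPTS = {"greet": "<b>hi</b>"}


def script_hook(parser):
    body = parser.read_raw_until("</script>").strip()
    parser.events.append(("end", "script"))
    parser.insert_input(SCRIPTS.get(body, ""))


doc = "<p>a<script>greet</script>c</p>"
print(Parser(doc, {"script": script_hook}).run())
print(Parser(doc).run())
```

Even in this toy the hook needs to read raw input, move the parse position, and rewrite the input stream, so the "generic" parser has to expose a good deal of its internal state to the vocabulary.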

>> "web browser behavior" includes the parsing algorithm.  In fact, 
>> that's one of the most important parts of the current specification 
>> from Mozilla's point of view.
> 
> However, it's an easily compartmentalized piece of what a browser does 
> (not only easily compartmentalized, but better for spec designing and 
> application design when it's abstracted in that way).

Sadly, it's not that easily compartmentalized.  Somewhat 
compartmentalized, true.  But not easily.

>> Uh...  We have spec editors who understand the parsing algorithm far 
>> better than anyone else I can think of, since they've spent a good bit 
>> of time studying how browsers actually parse HTML.  So I don't know 
>> where the "don't have spec editors who sufficiently understand those 
>> parts" meme comes from.
> 
> From many months of discussing these topics.

I've read every single mail on this list in the last 10 months or so. 
Has the discussion been elsewhere?  Do you have particular people you'd 
like to nominate who "understand the parsing algorithm" well enough for 
your liking?

-Boris
Received on Saturday, 15 November 2008 02:27:28 UTC
