Re: An HTML language specification vs. a browser specification from Robert J Burns on 2008-11-15 (public-html@w3.org from November 2008)

From: Robert J Burns <rob@robburns.com>
Date: Fri, 14 Nov 2008 18:35:29 -0600
To: Boris Zbarsky <bzbarsky@MIT.EDU>
Cc: public-html@w3.org
Message-Id: <B2899169-17C0-4F03-A80D-448A620E8A10@robburns.com>
Hi Boris,


On Nov 14, 2008, at 5:45 PM, Boris Zbarsky wrote:

>
> Robert J Burns wrote:
>> I'm not really clear what your questions are directed at in my  
>> previous message.
>
> They certainly are.

They may be, but the seeming points of contention you raise are not  
contrary to anything I'm saying, so I'm not sure what the thrust of  
your intervention is here.

>
>
>> Certainly, that would need to be part of a parsing spec. With  
>> SGML we had DTDs.
>
> DTDs aren't a parsing spec.  DTDs are a way to specify what markup  
> is valid (and some details like inferring opening/closing tags).   
> SGML + DTDs is closer to a parsing spec (mostly on the SGML side).   
> But it's not really sufficiently well-defined to handle arbitrary  
> byte streams.

Agreed. That's why I'm saying a separate parsing spec could be made  
independent of the HTML5 vocabulary / DOM and Web browser behavior  
specs. What the HTML5 parsing algorithm adds is the specifics of  
error-recovery, both for arbitrary general markup (like that which  
could be defined in a DTD) and for HTML-specific cases (the fancy  
recovery I wrote about earlier), which could only be applied to HTML.

>> With HTML5 we have prose along with specific error handling for ill- 
>> formed/invalid markup.
>
> Right.  Though there's plenty of perfectly valid markup that  
> requires behavior that looks suspiciously like error handling, as a  
> result of the SGML legacy described above.

By this I assume you mean the handling of DTD-specified optional tag  
omissions. This is the part of a new parsing spec that I think could  
be applied to any schema-defined markup. Contrast this handling of  
pre-specified optional tag omission with the 'fancy' error recovery  
applied by browsers to HTML. That too needs to be in the parsing spec,  
but it's a task independent of the handling of pre-specified  
optional tag omission.
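
To make the distinction concrete, here is a minimal Python sketch of
schema-driven tag omission. It is only an illustration: the table
IMPLIED_END and the function build_tree are hypothetical names, the
entries in the table are a small illustrative subset, and none of this
is taken from the HTML5 draft. The point is that the tree builder knows
nothing about any vocabulary; every omission rule comes from the table.

```python
# Hypothetical schema table: for each element, the start tags that
# implicitly close it when its end tag was omitted.
IMPLIED_END = {
    "p": {"p", "ul", "table"},
    "li": {"li"},
    "td": {"td", "tr"},
    "tr": {"tr"},
}


def build_tree(tokens, implied_end):
    """Build a nested (name, children) tree from (kind, value) tokens.

    kind is "start", "end", or "text". All tag-omission rules come
    from the implied_end table, not from the function itself.
    """
    root = ("#root", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "start":
            # Close open elements whose end tag this start tag implies.
            while len(stack) > 1 and value in implied_end.get(stack[-1][0], ()):
                stack.pop()
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif kind == "end":
            # Pop up to and including the matching open element, if any.
            names = [name for name, _ in stack]
            if value in names[1:]:
                while stack.pop()[0] != value:
                    pass
        else:
            stack[-1][1].append(value)
    return root
```

Swapping in a different table would give the same omission handling
for a different schema, which is the sense in which this part need not
be HTML-specific. The 'fancy' recovery would be a separate layer.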

>> What I'm suggesting is that this part of the HTML5 spec suffers  
>> from not having some specialized expertise applied to it.
>
> Specialized expertise in what?  Language design?  Parser design?  
> Parsing HTML?  I think a good bit of HTML parsing expertise has been  
> applied to writing this part of the spec.  Unless by "this" you mean  
> something other than "prose along with specific error handling".

Certainly the reverse-engineering for the parsing of HTML is in the  
current draft. The problem is that the draft lacks any forward-looking  
approach and leaves rendering engines (Mozilla's among them) in a state  
that hinders language improvements in the future (like adding new  
non-void head elements or not even incorporating Gecko's in span  
insertion mode).

>> Ideally I think we could have a parsing specification that applied  
>> to HTML and SGML equally, but with the possibility of specifying  
>> error handling for other DTD specified SGML. Think of it as an SGML  
>> parser with a built-in HTML5 DTD.
>
> That's not particularly compatible with the way HTML actually needs  
> to be parsed....  And a DTD can't specify the behavior that's needed  
> out of an HTML parser, I should note.  If you're using "DTD" as a  
> shorthand for "machine-readable format", there's no reason one  
> couldn't create a machine-readable definition of the HTML5 state  
> machine.  I'm just not sure that's what you're looking for.

Again, there's no point of contention here. All I'm saying is that the  
parsing algorithm draft includes error-recovery that could be equally  
applied to non-HTML, DTD-defined content (and the algorithm applies  
error-recovery to HTML beyond what a DTD can define). It is that  
parsing algorithm which would be better as a separate spec (one that  
could incorporate improvements that allowed the HTML element and  
content model vocabulary to change).

>> Parsing only depends on the HTML language with respect to the  
>> schema handling.
>
> It depends on the language because of the wide variety of tags that  
> have to be handled in "weird" ways.

That again is the fancy error-recovery I spoke about. However, as we  
advance the HTML5 error-recovery, there shouldn't be any need to add  
new fancy error-recovery (like moving input elements outside of tables).

>> Valid well-formed markup can be specified by the language schema  
>> and leave error-handling specifications to the parsing algorithm.
>
> I'm not sure what the first part of that sentence means, to be  
> honest, but I agree with the second part.  The parsing algorithm  
> needs to be aware of the error handling, and hence of the HTML  
> vocabulary and the various properties different HTML tags have in  
> terms of parsing.

It only needs to be aware of the legacy HTML vocabulary and the error  
recovery applied to that. The introduction of new vocabulary to HTML  
need not be addressed in changes to error-recovery (there's no need  
for new fancy error recovery moving forward if every browser adheres  
to the parsing spec).

>> Perhaps it would be better to say this is the specification of the  
>> HTML vocabulary  (elements, attributes, and content models) and DOM  
>> as opposed to the HTML 'language' and DOM.
>
> OK.  So this would basically be a list of elements, corresponding  
> attributes, DOM interface, and the behavior of said DOM interfaces,  
> without reference to where these elements come from or how they  
> relate to each other other than that some elements may contain other  
> elements in some cases?

No, it can be a complete specification of the vocabulary of HTML. What  
is left out is how that vocabulary gets serialized or how such  
serializations get parsed (except that the result of the parsing  
should be a valid HTML object).

>>> Note that in practice parsing might need to depend on attribute  
>>> values....
>> Could you give an example where parsing depends on attribute values?
>
> Sure.  Compare the DOM produced by browsers for:
>
>  <table>
>    <input type="text">
>  </table>
>
> To that for:
>
>  <table>
>    <input type="hidden">
>  </table>
>
> In practice, either you submit your form controls in DOM order and  
> parse differently depending on the type attribute of the control or  
> you parse the same way no matter what the type, but submit in an  
> order that has nothing to do with DOM order.

Yes, but this is the fancy legacy error-recovery I was speaking about.  
This fancy stuff was introduced by browsers because they didn't have a  
predefined parsing algorithm and authors used invalid content models  
where they were trying to solve certain problems. There's no reason  
that the parsing algorithm going forward needs to introduce any new  
fancy error-recovery. Just to clarify, this strangeness is due to some  
browsers introducing fancy error-recovery to repair poorly authored  
tables, while authors simultaneously took advantage of other browser  
error-recovery to hide input values inside tables. If browsers adhere  
to a newly specified parsing algorithm, we shouldn't need to introduce  
any new such error-recovery in the future.
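
For readers following the table example, the attribute-dependent
decision can be sketched in a few lines of Python. This is a toy
illustration, not text from the draft: the function name and the
returned labels are invented here, and it assumes that hidden inputs
are left inside the table while any other input is moved out in front
of it, which is how I understand the browser behavior in question.

```python
def place_input_in_table(attrs):
    """Say where an <input> start tag seen directly inside <table> ends up.

    attrs is a dict of attribute name -> value for the input token.
    The returned strings are labels invented for this sketch.
    """
    input_type = attrs.get("type", "")
    if input_type.lower() == "hidden":
        # A hidden control renders nothing, so it can stay where it is.
        return "kept inside table"
    # Any other input is treated like stray content and relocated.
    return "moved before table"
```

The single branch on the type attribute is the whole of the legacy
special case; nothing in it would need to grow as new elements are
added to the vocabulary.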


>> Still there's an independence. We can allow scripts to call the  
>> parser and we can have parsers produce scripts while still keeping  
>> the definition separate.
>
> The definition of which?

The definition of both. In other words, we can specify HTML (and others  
can specify JavaScript) in a way that allows authors to call a parser  
and parse a string or bytes into content without either HTML or  
JavaScript specifying the algorithm for that parsing (e.g., scripting  
could allow the parsing of a vCard into a microformats hCard, but hCard  
and HTML do not need to define the algorithm for that parsing). On the  
other hand, the parsing algorithm doesn't necessarily have to know  
anything about the document schema that it is going to de-serialize  
content into.
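
As a rough illustration of the vCard example, here is a short Python
sketch of such a script-level parser. The function name, the mapping
table, and the choice of plain span elements are assumptions made for
this sketch; it handles only simple, unfolded "NAME:value" lines and
is not a complete vCard or hCard implementation.

```python
from html import escape

# Assumed mapping from vCard property names to hCard class names.
VCARD_TO_HCARD = {"FN": "fn", "ORG": "org", "TEL": "tel", "EMAIL": "email"}


def vcard_to_hcard(text):
    """Convert a simple, unfolded vCard string into hCard markup."""
    spans = []
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "NAME:value" line
        # Drop parameters such as ";TYPE=work" and normalize case.
        prop = name.split(";")[0].upper()
        cls = VCARD_TO_HCARD.get(prop)
        if cls:
            spans.append('<span class="%s">%s</span>' % (cls, escape(value)))
    return '<div class="vcard">%s</div>' % "".join(spans)
```

The conversion lives entirely in the script. Neither the hCard
vocabulary nor HTML has to say anything about how the vCard text is
tokenized, only about what valid output looks like.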

>> The point of my post (and what I read Roy Fielding saying) is that  
>> the current HTML5 specification's strength is in its web browser  
>> behavior specification.
>
> "web browser behavior" includes the parsing algorithm.  In fact,  
> that's one of the most important parts of the current specification  
> from Mozilla's point of view.

However, it's an easily compartmentalized piece of what a browser does  
(not only easily compartmentalized, but better for spec design and  
application design when it's abstracted in that way).

>> The parsing algorithm and the HTML vocabulary parts of the spec  
>> suffer because we don't have spec editors who sufficiently  
>> understand those parts.
>
> Uh...  We have spec editors who understand the parsing algorithm far  
> better than anyone else I can think of, since they've spent a good  
> bit of time studying how browsers actually parse HTML.  So I don't  
> know where the "don't have spec editors who sufficiently understand  
> those parts" meme comes from.

From many months of discussing these topics.

Take care,
Rob
Received on Saturday, 15 November 2008 00:36:08 UTC