Re: Comments on HTML WG face to face meetings in France Oct 08 from Henri Sivonen on 2008-11-18 (public-html@w3.org from November 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 18 Nov 2008 23:33:29 +0200
To: elharo@metalab.unc.edu
Cc: Boris Zbarsky <bzbarsky@MIT.EDU>, Ian Hickson <ian@hixie.ch>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, public-html <public-html@w3.org>, www-tag@w3.org
Message-Id: <6550A5AA-C69C-44FF-B81C-91C72694CD9B@iki.fi>

On Nov 18, 2008, at 15:24, Elliotte Harold wrote:

> Henri Sivonen wrote:
>
>> This means that agents that do not support scripting may use a  
>> different object model. For example, it's conforming to implement a  
>> no-scripting agent with XOM as the internal object model. The  
>> Validator.nu HTML Parser even supports XOM out-of-the-box.
>
> As you point out XOM instead of DOM is not a big leap. They're both  
> tree model after all. I'm more concerned about more radical changes  
> like SAX or other streaming APIs or document specific data bound  
> models or even stranger things. Is it plausible to extend the HTML 5  
> parsing model to cover this?

Yes, and I've got proof by implementation. :-)

The Validator.nu HTML Parser supports SAX in two different modes:  
streaming and tree-buffered.

In the streaming mode, the parser emits SAX events as it proceeds in  
the input stream. However, there are some types of authoring errors  
for which the error recovery is not streamable. These errors are  
treated like XML well-formedness errors. I'd like to emphasize that  
this behavior is conforming per spec:
http://www.whatwg.org/specs/web-apps/current-work/#parse-error
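
As a rough illustration of the streaming mode (a toy sketch, not the Validator.nu code): events are forwarded to a SAX handler as soon as they are seen, and an error whose recovery would need to change already-emitted output is made fatal. The `stream` function, the `NonStreamableError` class, the token tuples, and the "text directly inside table" rule are all assumptions made up for this example. The rule merely stands in for a recovery that would have to move content ahead of a start tag that has already been reported.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class NonStreamableError(Exception):
    """Hypothetical fatal error: recovery would alter already-emitted events."""


class Recorder(ContentHandler):
    """Collects the events it receives so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def stream(tokens, handler):
    """Forward each token to the handler immediately, without buffering.

    Assumes the (kind, value) tokens are already well nested.
    """
    open_elements = []
    for kind, value in tokens:
        if kind == "start":
            open_elements.append(value)
            handler.startElement(value, AttributesImpl({}))
        elif kind == "end":
            open_elements.pop()
            handler.endElement(value)
        elif kind == "text":
            # Toy stand-in for a non-streamable case: this text would have
            # to be placed before <table>, whose start event is already out.
            if open_elements and open_elements[-1] == "table" and value.strip():
                raise NonStreamableError("text directly inside <table>")
            handler.characters(value)
```

The point of the design is that the handler never sees a retroactive correction: either the event order is final as emitted, or parsing stops.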

In the tree-buffered mode, the parser builds a tree using a
purpose-optimized tree model (which is neither DOM nor XOM and outperforms
Xerces2 DOM and XOM for this use case) and, after the input stream has
been exhausted, fires SAX events corresponding to the tree.
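
A minimal sketch of the tree-buffered idea, again as a toy and not the actual Validator.nu tree model: nodes are kept in a small private structure that can still be rearranged during parsing, and SAX events are fired only once the tree is final. `Node`, `insert_before`, `fire_sax_events`, and `Recorder` are hypothetical names introduced for this example.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class Node:
    """Tiny element node; children are Node instances or plain strings."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []

    def append(self, child):
        self.children.append(child)
        return child

    def insert_before(self, child, ref):
        # Late rearrangement is possible because nothing has been emitted yet.
        self.children.insert(self.children.index(ref), child)


class Recorder(ContentHandler):
    """Collects element and text events so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def _emit(node, handler):
    if isinstance(node, str):
        handler.characters(node)
        return
    handler.startElement(node.name, AttributesImpl(node.attrs))
    for child in node.children:
        _emit(child, handler)
    handler.endElement(node.name)


def fire_sax_events(root, handler):
    """Walk the finished tree in document order and report it as SAX events."""
    handler.startDocument()
    _emit(root, handler)
    handler.endDocument()


# Usage: move text in front of a table after the table was already created,
# then emit events for the final tree.
body = Node("body")
table = body.append(Node("table"))
body.insert_before("oops", table)
recorder = Recorder()
fire_sax_events(body, recorder)
```

The consumer still sees an ordinary SAX event sequence; the cost is that the whole document is held in memory before the first event is delivered.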

It is unfortunate that there are classes of errors for which
spec-compliant recovery is non-streamable. The legacy restricts us here. :-(
Note that implementing streamable ad hoc error recovery for these
cases is *not* conforming per spec.

> I also strongly question the wisdom of locking in one of the  
> absolute worst APIs we have. If there's one thing that needs  
> replacing in the HTML ecosystem, it's DOM. Sooner or later DOM will  
> be replaced, and if HTML 5 is standing in the way when that day  
> comes, then HTML 5 is going to come up the loser. Were the object  
> model separable from the syntax and semantics, then the sensible  
> parts of HTML 5 would have a better chance of surviving the  
> transition.

It's extremely unlikely that the DOM would go away in browsers. It's  
semi-plausible that a better API will be introduced for the same data  
model (E4X has been failing so far...), but it isn't feasible to  
remove the DOM API, since there's so much existing content depending  
on it.

As for the DOM going away in non-browser agents that don't run
scripts, the SAX and XOM modes of the Validator.nu HTML Parser and the
ElementTree (etc.) APIs for html5lib-created trees are proof that
doing without the DOM already works quite well with the kind of spec
we have.

(html5.validator.nu doesn't operate on a DOM or in fact on any kind of  
in-memory tree model, BTW.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Tuesday, 18 November 2008 21:34:15 UTC