Re: HTML interpreter vs. HTML user agent

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 29 May 2009 14:07:18 +0300
Cc: "Sam Ruby" <rubys@intertwingly.net>, "HTML WG" <public-html@w3.org>
Message-Id: <8CC6B607-A33B-4E3A-BD65-365944C4BEDD@iki.fi>
To: Anne van Kesteren <annevk@opera.com>

On May 29, 2009, at 13:05, Anne van Kesteren wrote:

> On Thu, 28 May 2009 15:42:56 +0200, Henri Sivonen <hsivonen@iki.fi>  
> wrote:
>> On May 28, 2009, at 16:15, Sam Ruby wrote:
>>> Anybody care to identify any more specifics?
>>
>> My understanding is that search engines that process massive  
>> amounts of data may want to do so with a streaming parser that  
>> doesn't abort on errors for which compliant recovery isn't  
>> streamable. It seems possible to perform indexing usefully without  
>> complying with the spec in the non-streamable cases.
>>
>> I don't have first-hand experience of working on a search engine,  
>> I'm not sure how much of a concern full streamability actually is,  
>> and I'm not sure if it's worthwhile to address this case in the spec.
>>
>> (It's inconceivable to expect browsers to switch to streamable  
>> recovery, so that's not an option.)
>
> Yeah, I recall this being discussed on IRC at some point.
>
> I think it was also discussed to actually define what exactly  
> streaming APIs would have to do that do not have some tree-like  
> representation and do not want to abort on errors for which a  
> tree-like representation is required to "recover".

I used to have the beginnings of such a feature in the Validator.nu  
HTML Parser, but I removed the code some time ago when it seemed clear  
that the feature didn't have immediate demand.

Anyway, here's a quick list of what needs to be different to get full  
streamability:

  * Adding attributes to 'html' or 'body' needs to be represented as  
an additional empty 'html' or 'body' element (at whatever point in the  
document) in standard SAX (or as a non-standard special-purpose event).

  * The </head> tag should be ignored (and inferred by <body>).

  * Instead of foster parenting, junk in a table should just get  
inserted into the table, and the app should deal with it.

  * </body> and </html> tags should be ignored (and inferred by EOF).

  * Removing an element from the stack (when it doesn't degenerate to  
a pop) should leave some kind of poppable shadow node (for generating  
the right endElement event) that doesn't participate in further  
searches for a node for a given element name in the stack.

  * Instead of doing the reparenting thing, the adoption agency  
algorithm (AAA) should just reopen elements based on the list of  
formatting elements without moving previously-inserted nodes.  
(Details left as an exercise to the reader. :-)

(There may be some additional <frameset> craziness that I forgot about.)
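
To make the shadow-node and "inferred by EOF" items above a bit more  
concrete, here is a minimal sketch in Python. It is not code from the  
Validator.nu HTML Parser; the class name StreamingStack, its methods,  
and the tuple-shaped events are all hypothetical, and the policy of  
leaving a shadow on the stack until a later pop reaches it is an  
assumption of the sketch.

```python
class StreamingStack:
    """Hypothetical open-element stack for a SAX-style streaming emitter."""

    def __init__(self):
        self._stack = []   # entries are [element_name, is_shadow]
        self.events = []   # SAX-like events, recorded as tuples

    def push(self, name):
        self._stack.append([name, False])
        self.events.append(("startElement", name))

    def pop(self):
        # Shadows are popped like any other entry, so the matching
        # endElement is still emitted in properly nested order.
        name, _is_shadow = self._stack.pop()
        self.events.append(("endElement", name))

    def find(self, name):
        """Index of the nearest live entry called `name`, or -1.

        Shadow entries are skipped: they no longer take part in searches.
        """
        for i in range(len(self._stack) - 1, -1, -1):
            entry_name, is_shadow = self._stack[i]
            if entry_name == name and not is_shadow:
                return i
        return -1

    def remove(self, name):
        """Remove an element that may sit below the top of the stack."""
        i = self.find(name)
        if i == -1:
            return
        if i == len(self._stack) - 1:
            self.pop()                 # removal degenerates to a pop
        else:
            self._stack[i][1] = True   # leave a poppable shadow behind

    def end_tag(self, name):
        if name in ("body", "html"):
            return                     # ignored here; inferred at EOF
        i = self.find(name)
        if i == -1:
            return                     # stray end tag: nothing to close
        while len(self._stack) > i:
            self.pop()

    def eof(self):
        while self._stack:
            self.pop()
```

For example, after pushing 'html', 'body', 'b' and 'p' and then  
calling remove("b"), no event is emitted and find("b") no longer sees  
the element, but its endElement is still generated (after the one for  
'p') once the entries above it have been popped.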

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 29 May 2009 11:08:02 GMT
