- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Fri, 29 May 2009 14:07:18 +0300
- To: Anne van Kesteren <annevk@opera.com>
- Cc: "Sam Ruby" <rubys@intertwingly.net>, "HTML WG" <public-html@w3.org>
On May 29, 2009, at 13:05, Anne van Kesteren wrote:

> On Thu, 28 May 2009 15:42:56 +0200, Henri Sivonen <hsivonen@iki.fi>
> wrote:
>> On May 28, 2009, at 16:15, Sam Ruby wrote:
>>> Anybody care to identify any more specifics?
>>
>> My understanding is that search engines that process massive
>> amounts of data may want to do so with a streaming parser that
>> doesn't abort on errors for which compliant recovery isn't
>> streamable. It seems possible to perform indexing usefully without
>> complying with the spec in the non-streamable cases.
>>
>> I don't have first-hand experience of working on a search engine,
>> I'm not sure how much of a concern full streamability actually is,
>> and I'm not sure if it's worthwhile to address this case in the spec.
>>
>> (It's inconceivable to expect browsers to switch to streamable
>> recovery, so that's not an option.)
>
> Yeah, I recall this being discussed on IRC at some point.
>
> I think it was also discussed to actually define what exactly
> streaming APIs would have to do that do not have some tree-like
> representation and do not want to abort on errors for which a
> tree-like representation is required to "recover".

I used to have the beginnings of such a feature in the Validator.nu
HTML Parser, but I removed the code some time ago when it seemed clear
that the feature didn't have immediate demand.

Anyway, here's a quick list of what needs to be different to get full
streamability:

 * Adding attributes to 'html' or 'body' needs to be represented as an
   additional empty 'html' or 'body' element (at whatever point in the
   document) in standard SAX (or as a non-standard special-purpose
   event).

 * The </head> tag should be ignored (and inferred by <body>).

 * Instead of foster parenting, junk in a table should just get
   inserted into the table and the app should deal.

 * </body> and </html> tags should be ignored (and inferred by EOF).

 * Removing an element from the stack (when it doesn't degenerate to
   pop) should leave some kind of poppable shadow node (for generating
   the right endElement event) that doesn't participate in further
   searches for a node for a given element name in the stack.

 * Instead of doing the reparenting thing, the adoption agency
   algorithm (AAA) should just reopen elements based on the list of
   formatting elements without moving previously-inserted nodes.
   (Details left as exercise to reader. :-)

(There may be some additional <frameset> craziness that I forgot about.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
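The "poppable shadow node" item in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the Validator.nu HTML Parser: the names `Node`, `StreamingStack`, `find`, `remove` and `close_all` are hypothetical, and the list of event strings stands in for SAX startElement/endElement callbacks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    shadow: bool = False  # True once the element was "removed" mid-stack


class StreamingStack:
    """Hypothetical stack of open elements for a streaming tree builder."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.events: List[str] = []  # stand-in for SAX callbacks

    def push(self, name: str) -> None:
        self.nodes.append(Node(name))
        self.events.append(f"start:{name}")

    def pop(self) -> None:
        # Shadow nodes are popped like any other node, so the
        # end event is still emitted at the properly nested point.
        node = self.nodes.pop()
        self.events.append(f"end:{node.name}")

    def find(self, name: str) -> Optional[int]:
        """Index of the nearest live (non-shadow) node with this name."""
        for i in range(len(self.nodes) - 1, -1, -1):
            node = self.nodes[i]
            if node.name == name and not node.shadow:
                return i
        return None

    def remove(self, index: int) -> None:
        """A removal at the top is a plain pop; otherwise leave a shadow."""
        if index == len(self.nodes) - 1:
            self.pop()
        else:
            self.nodes[index].shadow = True

    def close_all(self) -> None:
        """Pop everything that is still open, e.g. at EOF."""
        while self.nodes:
            self.pop()


stack = StreamingStack()
for name in ("html", "body", "b", "p"):
    stack.push(name)
stack.remove(stack.find("b"))  # 'b' sits below 'p', so it becomes a shadow
stack.close_all()
print(stack.events)
```

In this sketch the shadowed 'b' is invisible to later `find` calls, but its end event is still emitted after the one for 'p', so the event stream stays well nested without moving or retracting anything already reported.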
Received on Friday, 29 May 2009 11:08:02 UTC