Re: Understanding HTML5 parsing

On Jan 6, 2011, at 18:26, Sam Ruby wrote:

> Here are two good utilities, the first will use the browser you are currently running:
> 
> http://software.hixie.ch/utilities/js/live-dom-viewer/

Since this tool shows the output of the built-in parser of whichever browser you view the page in, a nightly build of Firefox gives the results that are closest to the spec. Firefox 4 beta 8 is slightly behind the spec (it doesn't know about the track element), and Chrome reflects the state of the spec from a few months ago. For example, it doesn't support HTML in annotation-xml.

> I would also normally suggest:
> 
> http://livedom.validator.nu/
> 
> ... but at the moment, it doesn't seem to be working for me.

Thanks for pointing out the brokenness. I had migrated to a fresh Ubuntu VM, and the Apache defaults had an AddType for .gz. (The page relies on AddEncoding for .gz in the virtual host config actually taking effect.)

Now fixed. I also recompiled the parser using GWT, so http://livedom.validator.nu/ now reflects the latest code from hg.

This tool runs, as JavaScript, the same implementation of the HTML parsing algorithm that is in Firefox 4 nightlies. The JS version runs inside whatever browser you use to view the page (including older Firefox, Opera, Safari, Chrome or IE9; IE8 doesn't work).

However, in some cases, the results differ from running Hixie's original tool in Firefox: there are some DOMs that the browser's built-in parser can generate but that can't be built using the DOM APIs exposed to JavaScript, because the DOM APIs are required to throw when given certain invalid inputs. In order to avoid throwing, the http://livedom.validator.nu/ version of the parser has infoset coercion enabled. This affects inputs that seem to be of particular interest to this TF, e.g. http://livedom.validator.nu/?%3Cfoo%3Abar%20xmlns%3Dbaz%3E Illegal characters in names are replaced with the Uhhhhh notation. Form feeds are replaced with spaces. xmlns attributes are dropped. Comment nodes aren't shown (bug!).

Also, the script execution semantics for <script> elements aren't quite right, even though the http://livedom.validator.nu/ tool makes some effort to support document.write in trivial cases involving inline scripts without nested scripts.
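
To make the infoset coercion rules mentioned above concrete, here is a rough Python sketch of the three transformations (name escaping, form feed replacement, dropping xmlns). It is not the code the tool runs. The function names are made up for illustration, the name check is a simplified ASCII-only approximation of the XML name rules (non-ASCII characters are passed through unchecked), and the exact escape format, "U" followed by six uppercase hex digits, is an assumption based on the example in the HTML spec's section on coercing to an infoset.

```python
import string

# Simplified, ASCII-only approximation of XML NCName rules (assumption):
# the first character must be a letter or "_"; later characters may also
# be digits, "-" or ".". A colon is never allowed in a local name.
NAME_START = set(string.ascii_letters + "_")
NAME_CHAR = NAME_START | set(string.digits + "-.")


def coerce_name(name):
    """Replace characters that are illegal in a name with Uxxxxxx escapes."""
    out = []
    for i, ch in enumerate(name):
        allowed = NAME_START if i == 0 else NAME_CHAR
        if ord(ch) > 0x7F or ch in allowed:
            # Non-ASCII is passed through unchecked in this sketch.
            out.append(ch)
        else:
            # "U" + six uppercase hex digits (assumed format).
            out.append("U%06X" % ord(ch))
    return "".join(out)


def coerce_text(text):
    """Replace form feeds with spaces."""
    return text.replace("\f", " ")


def coerce_attributes(attrs):
    """Drop the xmlns attribute and coerce the remaining names and values."""
    return {
        coerce_name(name): coerce_text(value)
        for name, value in attrs.items()
        if name != "xmlns"
    }


print(coerce_name("foo:bar"))               # fooU00003Abar
print(coerce_attributes({"xmlns": "baz"}))  # {}
```

With the input from the URL above, <foo:bar xmlns=baz>, this sketch turns the element name into fooU00003Abar and leaves the element with no attributes.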

Note that in Firefox, both tools suffer from https://bugzilla.mozilla.org/show_bug.cgi?id=618737

Also, Hixie's visualization of the tree predates foreign content support, so neither tool visualizes namespaces clearly. However, Firefox, WebKit and IE9 show names of HTML elements in upper case and the names of foreign elements in the canonical case, which makes it easy to distinguish HTML nodes from foreign nodes.
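
As a rough illustration of the case heuristic, here is a small Python sketch that classifies a node name as it appears in the tree dump. The function name is hypothetical, and the sketch assumes the names come from an HTML parser, which lowercases ASCII letters in tag names, so a foreign element name always contains at least one lowercase ASCII letter, while a browser that upper-cases HTML element names shows none.

```python
import string


def looks_foreign(node_name):
    """Guess whether a displayed node name belongs to a foreign element.

    Heuristic only: HTML element names are assumed to be shown in upper
    case, foreign (SVG/MathML) names in their canonical case.
    """
    return any(ch in string.ascii_lowercase for ch in node_name)


for name in ["DIV", "H1", "svg", "foreignObject", "math"]:
    kind = "foreign" if looks_foreign(name) else "HTML"
    print(name, "->", kind)
```

This only works on the displayed names; inside a browser, checking the element's namespaceURI would be the reliable way to tell the two apart.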

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 12 January 2011 15:40:31 UTC