Re: [nodejs] Re: [ANN] W3C standards C++ XML DOM parser for NodeJS from Edward O'Connor on 2010-10-06 (www-archive@w3.org from October 2010)

From: Edward O'Connor <hober0@gmail.com>
Date: Wed, 6 Oct 2010 11:54:03 -0700
To: Marco Rogers <marco.rogers@gmail.com>
Cc: www-archive@w3.org
Message-ID: <AANLkTikkzOnQwei60H87yuqG_-HKsO49d2d3guKkjDBf@mail.gmail.com>

Hi,

[Taken off-list as this isn't really node-specific anymore.]

> @Edward, the html parser in libxml2 is very good.  In some preliminary
> tests I've done, it does pretty well even with crappy markup.

Fundamentally, I'm interested in DOM consistency. Given the same
sequence of bytes, does the libxml2 HTML parser generate the same DOM
that the major browsers do?
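
To make "the same DOM" concrete, here is a minimal sketch of one way to
compare two parsers' output: dump each tree as indented lines and check
that the dumps match. The nested-tuple tree representation and the names
dump and same_dom are assumptions made for illustration; they are not
part of libxml2 or of any browser. The second tree is what a
hypothetical parser that nests the paragraphs would build, not a claim
about libxml2's actual behaviour.

```python
# Hypothetical tree representation (an assumption for this sketch):
# a text node is a str; an element is a (tag, [children]) tuple.

def dump(node, depth=0):
    """Return one indented line per node, in document order."""
    indent = "  " * depth
    if isinstance(node, str):
        return ['%s"%s"' % (indent, node)]
    tag, children = node
    lines = ["%s<%s>" % (indent, tag)]
    for child in children:
        lines.extend(dump(child, depth + 1))
    return lines

def same_dom(a, b):
    """Two trees count as the same DOM when their dumps are identical."""
    return dump(a) == dump(b)

# Input bytes: b"<p>one<p>two"
# Under the HTML5 algorithm the second <p> implicitly closes the first,
# so the two paragraphs are siblings (only the body subtree is shown).
spec_tree = ("body", [("p", ["one"]), ("p", ["two"])])

# A hypothetical parser that nests the second <p> inside the first.
nested_tree = ("body", [("p", ["one", ("p", ["two"])])])

if __name__ == "__main__":
    print("\n".join(dump(spec_tree)))
    print("same DOM:", same_dom(spec_tree, nested_tree))
```

Comparing dumps rather than the tuples directly means a mismatch can be
fed straight to a line-based diff to see where the two trees diverge.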

> When you say "browser-compatible"

When I say "browser-compatible," I mean

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parsing

> that doesn't mean much because each browser has their own parser, and
> when you dig in you'll find that there is quite a bit of difference
> between them.

All four browser engines are converging on the same parsing algorithm,
linked above. Which means that, going forward, all five major browsers
will produce the same DOM from the same arbitrary-pile-of-bytes that
passes for HTML on the web.

Which means that there's really no reason for people to implement or use
other tag soup parsing algorithms.


Ted

Received on Wednesday, 6 October 2010 18:54:57 UTC