Re: The non-polyglot elephant in the room

Sam Ruby <>, 2013-01-21 09:56 -0500:

> On 01/21/2013 09:24 AM, Michael[tm] Smith wrote:
> My experience is that HTML parsers vary wildly in quality and performance,
> and that high performance quality HTML5 compliant parsers are far from
> ubiquitous.

I get your point but it's also worth noting that we actually do have four
different independent implementations of high performance quality HTML5
parsers that have been shipping in production browsers for a long time now
-- plus Henri's Java parser, the performance and quality and compliance of
which is just as high as any browser parser (and in fact as you know is the
same source from which the Gecko parser is built).

But yeah that's not exactly ubiquity.

> I've been told that that will be solved over time.  I've been told that over
> a long period of time.  So far, that has not proven to be true.

True, there's not been a lot of progress for a long time as far as anybody
making HTML5 parsers for existing programming languages. There is still
html5lib for Python, the compliance and (I think) quality of which are
just as high as those of any browser parser (depending on how you measure
quality). Yeah, it's not high-performance, but I think its performance
is fine for many or maybe even most tasks that anybody would want to do
with a scripting language (especially if you run it under PyPy).

> People talk at length about the "wasted" developer time that is spent on
> polyglot.  From my perspective, if but a fraction of the energy spent on
> trying to stop this effort were instead spent on either improving parsing
> tools like libxml2

I agree that it'd be great to have a conforming HTML parser in libxml2,
or one that could be used in place of libxml2 in programming environments
that have existing means to use libxml2 as a parsing library. I believe
Henri has actually been planning to develop something like that, but I
think he's had other day-job priorities get in the way.

Like a lot of other things I guess it comes down to somebody being able to
get the time free to work on it. 

> or on determining what a simplified and more robust subset of HTML5 would
> look like, then we could make better progress on this issue.

While I think I get what you mean by more robust, I'd say that
the existing HTML5 parsing requirements are already plenty robust. What's
not robust is the broken parsing behavior of legacy ad-hoc HTML parsers --
like the one in libxml2 -- whose parsing behavior doesn't match the
behavior of parsers in browsers (and which in fact have never matched the
parsing behavior in browsers, even long before HTML5).

I assume what you mean by robust is actually robustness of document
instances -- in the sense that they'll get parsed as expected even in
legacy/broken non-browser parsers.

If so I really don't think that defining a subset of HTML5 markup to work
around the deficiencies in those parsers is actually helpful. I think it's
in fact somewhat harmful because it gives us all even less incentive to
actually put time into fixing parsing tools like libxml2.


Michael[tm] Smith

Received on Tuesday, 22 January 2013 04:42:19 UTC