- From: Michael[tm] Smith <mike@w3.org>
- Date: Tue, 22 Jan 2013 13:42:07 +0900
- To: Sam Ruby <rubys@intertwingly.net>
- Cc: public-html@w3.org
Sam Ruby <rubys@intertwingly.net>, 2013-01-21 09:56 -0500:

> On 01/21/2013 09:24 AM, Michael[tm] Smith wrote:
>
> My experience is that HTML parsers vary wildly in quality and performance,
> and that high performance quality HTML5 compliant parsers are far from
> ubiquitous.

I get your point, but it's also worth noting that we actually do have four
different independent implementations of high performance quality HTML5
parsers that have been shipping in production browsers for a long time now
-- plus Henri's Java parser, the performance and quality and compliance of
which is just as high as any browser parser (and in fact, as you know, is
the same source from which the Gecko parser is built). But yeah, that's not
exactly ubiquity.

> I've been told that that will be solved over time. I've been told that over
> a long period of time. So far, that has not proven to be true.

True, there's not been a lot of progress for a long time as far as anybody
making HTML5 parsers for existing programming languages. There is still
html5lib for Python, the compliance and (I think) quality of which are just
as high as any browser parser (depending on how you measure quality). Yeah,
it's not high-performance, but I think its performance is fine for many or
maybe even most tasks that anybody would want to do with a scripting
language (especially if you run it under PyPy).

> People talk at length about the "wasted" developer time that is spent on
> polyglot. From my perspective, if but a fraction of the energy spent on
> trying to stop this effort were instead spent on either improving parsing
> tools like libxml2

I agree that it'd be great to have a conforming HTML parser in libxml2, or
one that could be used in place of libxml2 in programming environments that
have existing means to use libxml2 as a parsing library. I believe Henri has
actually been planning to develop something like that, but I think he's had
other day-job priorities get in the way.
Like a lot of other things, I guess it comes down to somebody being able to
get the time free to work on it.

> or on determining what a simplified and more robust subset of HTML5 would
> look like, then we could make better progress on this issue.

While I think I get what you mean by more robust, I'd say that the existing
HTML5 parsing requirements are already plenty robust. What's not robust is
the broken parsing behavior of legacy ad-hoc HTML parsers -- like the one in
libxml2 -- whose parsing behavior doesn't match the behavior of parsers in
browsers (and which in fact has never matched the parsing behavior in
browsers, even long before HTML5).

I assume what you mean by robust is actually robustness of document
instances -- in the sense that they'll get parsed as expected even in
legacy/broken non-browser parsers. If so, I really don't think that defining
a subset of HTML5 markup to work around the deficiencies in those parsers is
actually helpful. I think it's in fact somewhat harmful, because it gives us
all even less incentive to actually put time into fixing parsing tools like
libxml2.

  --Mike

-- 
Michael[tm] Smith http://people.w3.org/mike
Received on Tuesday, 22 January 2013 04:42:19 UTC