Re: The non-polyglot elephant in the room

On 01/22/2013 08:16 AM, Henri Sivonen wrote:
> On Mon, Jan 21, 2013 at 4:56 PM, Sam Ruby <rubys@intertwingly.net> wrote:
>> I've been told that that will be solved over time.  I've been told that over
>> a long period of time.  So far, that has not proven to be true.
> ...
>>  From my perspective, if but a fraction of the energy spent on
>> trying to stop this effort were instead spent on either improving parsing
>> tools like libxml2 or on determining what a simplified and more robust
>> subset of HTML5 would look like, then we could make better progress on this
>> issue.
>
> Mozilla and I don't have pressing own needs for that piece of
> software, so time for writing it has been starved by higher-priority
> items over and over again. (And now an HTML parser in Rust made its
> way into the work queue ahead of the libxml2-compatible thing…)
>
> If you need it solved faster, the best bet for making is to write code
> for a libxml2-compatible HTML-compliant parser yourself instead of
> spending time promoting polyglot. (In the case of the Validator.nu
> HTML Parser code base, the Gecko-specific stuff is already factored
> into CppType.java in the translator, so you could subclass CppTypes
> with something that returns value suitable for a libxml2
> API-compatible translation. Support for UTF-8 as the internal encoding
> will likely emerge as the side effect of the Rust effort.)

FWIW, if you don't want to have a dependency on Henri's Java code (which 
I feel would be a reasonable position to take for something like 
libxml2), writing a spec-compliant, non-scripting HTML parser from 
scratch just isn't that hard. There is quite a lot of work, perhaps on the 
order of a handful of man-weeks, but fortunately there is an *excellent* 
specification :) In my experience, when implementing in a browser, the 
*vast* majority of the complexity comes from supporting scripting and 
document.write, which isn't needed for libxml2.

Received on Tuesday, 22 January 2013 08:36:46 UTC