Re: The non-polyglot elephant in the room

On Mon, Jan 21, 2013 at 4:56 PM, Sam Ruby <rubys@intertwingly.net> wrote:
> I've been told that that will be solved over time.  I've been told that over
> a long period of time.  So far, that has not proven to be true.
...
> From my perspective, if but a fraction of the energy spent on
> trying to stop this effort were instead spent on either improving parsing
> tools like libxml2 or on determining what a simplified and more robust
> subset of HTML5 would look like, then we could make better progress on this
> issue.

Neither Mozilla nor I have a pressing need of our own for that piece
of software, so time for writing it has been starved by higher-priority
items over and over again. (And now an HTML parser in Rust has made its
way into the work queue ahead of the libxml2-compatible thing…)

If you need it solved faster, your best bet is to write the code for a
libxml2-compatible, HTML-compliant parser yourself instead of
spending time promoting polyglot. (In the case of the Validator.nu
HTML Parser code base, the Gecko-specific stuff is already factored
into CppTypes.java in the translator, so you could subclass CppTypes
with something that returns values suitable for a libxml2
API-compatible translation. Support for UTF-8 as the internal encoding
will likely emerge as a side effect of the Rust effort.)

Failing that, there is the option of piping stuff through the HTML2XML
sample program that comes with the Validator.nu HTML Parser before the
data reaches your non-Java program. If the JVM startup time is a
problem, gcj can probably compile HTML2XML.

Failing even that, one way to help would be implementing in C a set of
MIT-licensed Encoding Standard-compliant decoders that output UTF-8.
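
To give a rough idea of the shape such a decoder could take, here is a
minimal sketch for one single-byte encoding, windows-1252. The function
name w1252_to_utf8 and its buffer-based signature are made up for
illustration and are not an existing API; the 0x80..0x9F table follows
the windows-1252 index in the Encoding Standard, which maps the five
otherwise unassigned bytes to the corresponding C1 controls. A real
library would need the multi-byte encodings and a streaming interface
as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points for bytes 0x80..0x9F in windows-1252 (Encoding Standard
 * index). Bytes 0x00..0x7F and 0xA0..0xFF map to the same-valued code
 * point, so they need no table. */
static const uint16_t w1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Number of UTF-8 bytes needed for a BMP code point (1..3). */
static size_t utf8_len(uint32_t cp) {
    return cp < 0x80 ? 1 : (cp < 0x800 ? 2 : 3);
}

/* Hypothetical helper: decode in_len bytes of windows-1252 from `in`
 * and write UTF-8 to `out` (capacity out_cap bytes). Returns the number
 * of bytes written, or (size_t)-1 if `out` is too small. The worst case
 * is 3 output bytes per input byte. */
size_t w1252_to_utf8(const unsigned char *in, size_t in_len,
                     unsigned char *out, size_t out_cap) {
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char b = in[i];
        uint32_t cp = (b >= 0x80 && b <= 0x9F) ? w1252_high[b - 0x80] : b;
        size_t need = utf8_len(cp);
        if (need > out_cap - o) {
            return (size_t)-1;
        }
        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

A caller would allocate 3 * in_len bytes for the output and check the
return value against (size_t)-1. Single-byte encodings can never
produce a decoding error in this scheme; the multi-byte ones would
additionally have to emit U+FFFD for malformed sequences.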

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Tuesday, 22 January 2013 07:17:13 UTC