- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Tue, 22 Jan 2013 09:16:45 +0200
- To: public-html@w3.org
On Mon, Jan 21, 2013 at 4:56 PM, Sam Ruby <rubys@intertwingly.net> wrote:
> I've been told that that will be solved over time. I've been told that over
> a long period of time. So far, that has not proven to be true.
...
> From my perspective, if but a fraction of the energy spent on
> trying to stop this effort were instead spent on either improving parsing
> tools like libxml2 or on determining what a simplified and more robust
> subset of HTML5 would look like, then we could make better progress on this
> issue.

Mozilla and I don't have pressing needs of our own for that piece of software, so time for writing it has been starved by higher-priority items over and over again. (And now an HTML parser in Rust has made its way into the work queue ahead of the libxml2-compatible thing…)

If you need it solved faster, the best bet for making that happen is to write the code for a libxml2-compatible, HTML-compliant parser yourself instead of spending time promoting polyglot. (In the case of the Validator.nu HTML Parser code base, the Gecko-specific stuff is already factored into CppTypes.java in the translator, so you could subclass CppTypes with something that returns values suitable for a libxml2 API-compatible translation. Support for UTF-8 as the internal encoding will likely emerge as a side effect of the Rust effort.)

Failing that, there is the option of piping stuff through the HTML2XML sample program that comes with the Validator.nu HTML Parser before the data reaches your non-Java program. If the JVM startup time is a problem, gcj can probably compile HTML2XML.

Failing even that, one way to help would be implementing in C a set of MIT-licensed, Encoding Standard-compliant decoders that output UTF-8.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 22 January 2013 07:17:13 UTC