- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Wed, 11 Jul 2007 13:29:29 +0300
There's now a limping development version of an HTML5 parser in Java that interested parties may try out:

    svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser

Warning: This isn't at all ready for any kind of production use. The purpose of this email is just to let interested parties know the status of the project.

Dependencies:
* JDK 5
* ICU4J (compile time--needed at run time only if normalization checking of source text is enabled)
* json-tools, which in turn depends on Antlr 2 (needed for compiling and running the tokenizer test harness--not needed for normal use of the parser)
* The Apache XML serializer, a.k.a. serializer.jar, which comes with Xerces and Xalan (needed for testing the associated tree model with XML)

Classes with main() methods are in the *.test packages.

Goals:
* Provide an HTML5 parser that works as a drop-in replacement for an XML parser in non-browser Java apps that expect XML APIs.
* Make the parser strict enough for conformance checking (including encoding errors, etc.).

Known bugs:
* Foster parenting doesn't work.
* The JDK UTF-8 decoder leaves some bad byte sequences unreported.

Unknown bugs:
* Probably lots, especially in the tree builder.

Roadmap:
* Test harness for html5lib tree construction test cases.
* Pass those tests.
* Buffered SAX as drop-in replacement for an XML parser.
* Streaming SAX (fatal errors on the AAA and foster parenting, etc.) as drop-in replacement for an XML parser.
* DOM as drop-in replacement for an XML parser.
* XOM as drop-in replacement for an XML parser.
* Configurability regarding XML 1.0 infoset violations.

Later on roadmap:
* JDOM as drop-in replacement for an XML parser.
* Performance improvements.

(dom4j is not explicitly on the roadmap, because DOM support is expected to work with dom4j.)

Doable but not on the roadmap:
* Buffered StAX.

Not doable within the architecture:
* True streaming StAX. (Use SAX instead.)

License: MIT/expat. Patches welcome under the same license.
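To make the "drop-in replacement" goal concrete, here is a minimal sketch of application code written only against the standard SAX XMLReader interface. The class name DropInSketch and the helper elementNames are made up for illustration. The JDK's own XML parser stands in as the XMLReader here, and the sketch assumes (this is not confirmed by the current code) that the HTML5 parser would eventually be handed to such code as another XMLReader implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DropInSketch {

    // Hypothetical application code: it depends only on the SAX XMLReader
    // interface, so it does not care which parser sits behind it.
    static List<String> elementNames(XMLReader reader, String markup) throws Exception {
        final List<String> names = new ArrayList<String>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Record each start tag in document order.
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in parser: the JDK's XML parser. Under the drop-in goal,
        // only this line would change (assumption, not the project's API).
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println(elementNames(reader, "<html><body><p>Hello</p></body></html>"));
    }
}
```

The point of the design is that swapping parsers touches only the line that obtains the XMLReader; the content handler and the rest of the application stay as they are.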
Acknowledgments: Thanks to the Mozilla Foundation for funding this project. Thanks to the html5lib team and Philip Taylor (of the lazyilluminati fame) for test cases and bug reports.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 11 July 2007 03:29:29 UTC