[Ann] A limping development version of an HTML5 parser in Java

There's now a limping development version of an HTML5 parser in Java  
that interested parties may try out:
svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser

Warning: This isn't at all ready for any kind of production use. The  
purpose of this email is just to let interested parties know the  
status of the project.

Dependencies:
  * JDK 5
  * ICU4J (compile time--needed at run time only if normalization
    checking of source text is enabled)
  * json-tools, which in turn depends on Antlr 2 (needed for compiling
    and running the tokenizer test harness--not needed for normal use
    of the parser)
  * The Apache XML serializer, a.k.a. serializer.jar, which comes with
    Xerces and Xalan (needed for testing the associated tree model
    with XML)

Classes with main() methods are in the *.test packages.

Goals:
Provide an HTML5 parser that works as a drop-in replacement for an  
XML parser in non-browser Java apps that expect XML APIs. Make the  
parser strict enough for conformance checking (including encoding  
errors, etc.).
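
To illustrate the drop-in goal, here is a minimal sketch of application
code that depends only on the standard SAX XMLReader interface. The
JDK's built-in XML parser stands in for the reader. The names
DropInSketch and countElements are made up for this example, and
nothing here is the actual API of this parser. The point is only that
an HTML5 parser exposing the same interface could be passed in without
touching the application code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DropInSketch {

    /** Counts start tags using whatever XMLReader the caller supplies. */
    public static int countElements(XMLReader reader, String markup) throws Exception {
        final int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Stand-in: the JDK's XML parser. An HTML5 parser implementing
        // XMLReader would be constructed here instead (assumption).
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Three elements: html, body, p.
        System.out.println(countElements(reader, "<html><body><p>Hi</p></body></html>"));
    }
}
```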

Known bugs:
  * Foster parenting doesn't work.
  * The JDK UTF-8 decoder leaves some bad byte sequences unreported.
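
For context on the decoder bug, the sketch below shows the standard way
to ask the JDK for strict decoding: a CharsetDecoder configured with
CodingErrorAction.REPORT, which throws CharacterCodingException instead
of silently substituting U+FFFD. It is not taken from the parser's
source, and the class name StrictUtf8Sketch is made up. It only shows
how errors are requested. It does not show which bad sequences a given
JDK version fails to report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictUtf8Sketch {

    /** Returns true if the JDK decoder accepts the bytes as UTF-8 without reporting an error. */
    public static boolean decodesCleanly(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC3, (byte) 0xA4}; // U+00E4
        byte[] bad = {(byte) 0xC3, (byte) 0x28};  // lead byte, then a non-continuation byte
        System.out.println(decodesCleanly(good));
        System.out.println(decodesCleanly(bad));
    }
}
```

A conformance checker cannot rely on this alone if the decoder accepts
sequences it should reject, which is the bug noted above.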

Unknown bugs:
  * Probably lots, especially in the tree builder.

Roadmap:
  * Test harness for html5lib tree construction test cases.
  * Pass those tests.
  * Buffered SAX as drop-in replacement for an XML parser.
  * Streaming SAX (fatal errors on the AAA and foster parenting,  
etc.) as drop-in replacement for an XML parser.
  * DOM as drop-in replacement for an XML parser.
  * XOM as drop-in replacement for an XML parser.
  * Configurability regarding XML 1.0 infoset violations.
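
The difference between the buffered and streaming SAX items above can
be sketched as follows. Foster parenting and the adoption agency
algorithm (AAA) can move nodes that were created earlier, which is not
possible once the corresponding SAX events have been delivered. A
buffered mode can build a tree first and emit events only when the tree
is final. The code below is an illustration in plain SAX, with a
simplified stand-in for foster parenting. BufferedSaxSketch and Node
are hypothetical names, not this parser's design.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class BufferedSaxSketch {

    /** A minimal element-only tree node (no attributes or text). */
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<Node>();
        Node(String name) { this.name = name; }
    }

    /** Replays a finished tree as SAX events; safe because the tree no longer changes. */
    static void emit(Node node, ContentHandler handler) throws SAXException {
        handler.startElement("", node.name, node.name, new AttributesImpl());
        for (Node child : node.children) {
            emit(child, handler);
        }
        handler.endElement("", node.name, node.name);
    }

    /** Collects element names in the order their start events arrive. */
    static List<String> namesInOrder(Node root) throws SAXException {
        final List<String> names = new ArrayList<String>();
        emit(root, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(localName);
            }
        });
        return names;
    }

    public static void main(String[] args) throws SAXException {
        Node body = new Node("body");
        Node table = new Node("table");
        body.children.add(table);
        // A misplaced <b> found inside <table> is moved in front of the
        // table. A streaming parser would already have emitted the start
        // event for <table> and could not do this.
        Node b = new Node("b");
        body.children.add(body.children.indexOf(table), b);
        System.out.println(namesInOrder(body)); // [body, b, table]
    }
}
```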

Later on roadmap:
  * JDOM as drop-in replacement for an XML parser.
  * Performance improvements.

(dom4j is not explicitly on the roadmap because DOM support is
expected to work with dom4j.)

Doable but not on the roadmap:
  * Buffered StAX.

Not doable within the architecture:
  * True streaming StAX. (Use SAX instead.)

License:
MIT/expat. Patches welcome under the same license.

Acknowledgments:
Thanks to the Mozilla Foundation for funding this project. Thanks to  
the html5lib team and Philip Taylor (of lazyilluminati fame) for
test cases and bug reports.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 11 July 2007 10:29:46 UTC