- From: Sam Ruby <rubys@intertwingly.net>
- Date: Thu, 18 Jun 2009 06:36:53 -0400
- To: Henri Sivonen <hsivonen@iki.fi>
- CC: Jonathan Rees <jar@creativecommons.org>, Anne van Kesteren <annevk@opera.com>, Dan Connolly <connolly@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-archive@w3.org
Henri Sivonen wrote:
>
> The Validator.nu HTML Parser comes with a sample application called
> HTML2XML. When the input is a conforming HTML5 document, the output is
> the semantically equivalent XHTML5 document. HTML2XML doesn't repair
> non-conforming documents.
>
> You can obtain the Java version from http://about.validator.nu/htmlparser/
>
> Sam Ruby is working on a version that doesn't require the JVM invocation
> overhead
> http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C
>
> If your pipeline is in Java, you don't need HTML2XML but you should just
> use the Validator.nu HTML Parser directly, which optimizes away the
> steps of serializing as XML and reparsing it.

Update: I'm working on that too:

http://intertwingly.net/blog/2009/06/17/Calling-JAXP-from-Ruby

Jonathan: I will echo what Henri says. Except for edge cases, HTML5
parsers and serializers can simply be considered a 'drop in' replacement
for XML parsers and serializers. Every effort has been made to ensure
that the edge cases are as small as possible. And the cases where the
differences are unavoidable are clearly documented.

Apparently Henri's favorite example is form feed characters. Mine is
consecutive dashes in comments.

- Sam Ruby
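
A minimal sketch of the two edge cases named above, assuming Python's standard-library xml.etree.ElementTree (backed by expat) as a stand-in XML parser. The helper name is_well_formed_xml and the sample strings are hypothetical, chosen only for illustration. XML 1.0 forbids "--" inside a comment and does not allow U+000C (form feed) as a character, so content that an HTML parser can hand over with either of these cannot be written out verbatim as XML:

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample inputs, one per edge case.
plain_comment = "<p>ok<!-- a comment --></p>"
double_dash_comment = "<p>ok<!-- a -- b --></p>"  # "--" inside a comment
form_feed_text = "<p>a\x0cb</p>"                  # U+000C in character data

for label, sample in [
    ("plain comment", plain_comment),
    ("consecutive dashes in comment", double_dash_comment),
    ("form feed character", form_feed_text),
]:
    print(label, "->", is_well_formed_xml(sample))
```

Only the first sample is reported as well-formed. This checks the XML side only; it does not exercise the Validator.nu HTML Parser or HTML2XML.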
Received on Thursday, 18 June 2009 10:37:33 UTC