
Re: question about XML and HTML5

From: Sam Ruby <rubys@intertwingly.net>
Date: Thu, 18 Jun 2009 06:36:53 -0400
Message-ID: <4A3A18C5.2090209@intertwingly.net>
To: Henri Sivonen <hsivonen@iki.fi>
CC: Jonathan Rees <jar@creativecommons.org>, Anne van Kesteren <annevk@opera.com>, Dan Connolly <connolly@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-archive@w3.org
Henri Sivonen wrote:
> 
> The Validator.nu HTML Parser comes with a sample application called 
> HTML2XML. When the input is a conforming HTML5 document, the output is 
> the semantically equivalent XHTML5 document. HTML2XML doesn't repair 
> non-conforming documents.
> 
> You can obtain the Java version from http://about.validator.nu/htmlparser/
> 
> Sam Ruby is working on a version that doesn't require the JVM invocation 
> overhead
> http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C
> 
> If your pipeline is in Java, you don't need HTML2XML but you should just 
> use the Validator.nu HTML Parser directly, which optimizes away the 
> steps of serializing as XML and reparsing it.

Update: I'm working on that too: 
http://intertwingly.net/blog/2009/06/17/Calling-JAXP-from-Ruby

Jonathan: I will echo what Henri says.  Except for edge cases, HTML5 
parsers and serializers can simply be considered a 'drop-in' replacement 
for XML parsers and serializers.  Every effort has been made to keep 
those edge cases as few and as narrow as possible, and the cases where 
differences are unavoidable are clearly documented.  Apparently Henri's 
favorite example is form feed characters; mine is consecutive dashes in 
comments.
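
A minimal sketch of those two edge cases, assuming Python's standard
library as a stand-in: xml.etree.ElementTree (expat) plays the XML parser,
and html.parser plays the HTML side. The latter is only a tokenizer, not
the Validator.nu HTML Parser and not a full HTML5 parser, and the helper
names (Collector, xml_accepts, html_collect) are hypothetical. It shows
that an XML parser refuses "--" inside a comment and a literal form feed
(U+000C), while an HTML tokenizer passes both through.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Records the comments and text seen by the stdlib HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.text.append(data)


def xml_accepts(markup):
    """Return True if the expat-based XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_collect(markup):
    """Feed the markup to the HTML tokenizer and return what it recorded."""
    parser = Collector()
    parser.feed(markup)
    parser.close()
    return parser


# Illustrative inputs (assumptions, not taken from the thread).
DASHES = "<p><!-- a -- b --></p>"   # "--" inside a comment
FORM_FEED = "<p>a\x0cb</p>"         # literal U+000C in text

if __name__ == "__main__":
    for label, markup in (("consecutive dashes", DASHES),
                          ("form feed", FORM_FEED)):
        seen = html_collect(markup)
        print(label,
              "| XML accepts:", xml_accepts(markup),
              "| HTML comments:", seen.comments,
              "| HTML text:", seen.text)
```

Both inputs are rejected by the XML parser, so a serializer that emits
XML from an HTML parse has to rewrite or drop such content somehow; how
a given tool does that is up to the tool and is not shown here.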

- Sam Ruby
Received on Thursday, 18 June 2009 10:37:33 GMT
