Re: question about XML and HTML5 from Sam Ruby on 2009-06-18 (www-archive@w3.org from June 2009)

From: Sam Ruby <rubys@intertwingly.net>
Date: Thu, 18 Jun 2009 06:36:53 -0400
To: Henri Sivonen <hsivonen@iki.fi>
CC: Jonathan Rees <jar@creativecommons.org>, Anne van Kesteren <annevk@opera.com>, Dan Connolly <connolly@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-archive@w3.org
Message-ID: <4A3A18C5.2090209@intertwingly.net>

Henri Sivonen wrote:
> 
> The Validator.nu HTML Parser comes with a sample application called 
> HTML2XML. When the input is a conforming HTML5 document, the output is 
> the semantically equivalent XHTML5 document. HTML2XML doesn't repair 
> non-conforming documents.
> 
> You can obtain the Java version from http://about.validator.nu/htmlparser/
> 
> Sam Ruby is working on a version that doesn't require the JVM invocation 
> overhead
> http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C
> 
> If your pipeline is in Java, you don't need HTML2XML but you should just 
> use the Validator.nu HTML Parser directly, which optimizes away the 
> steps of serializing as XML and reparsing it.

Update: I'm working on that too: 
http://intertwingly.net/blog/2009/06/17/Calling-JAXP-from-Ruby

Jonathan: I will echo what Henri says.  Except for edge cases, HTML5 
parsers and serializers can be treated as a 'drop in' replacement for 
XML parsers and serializers.  Every effort has been made to keep those 
edge cases as few and as narrow as possible, and the cases where a 
difference is unavoidable are clearly documented.  Apparently Henri's 
favorite example is form feed characters.  Mine is consecutive dashes in 
comments.
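
To make those two examples concrete, here is a minimal sketch in Python 
that feeds each one to a stock XML parser (the standard library's 
xml.etree.ElementTree, which is not one of the tools discussed in this 
thread).  The helper name is_well_formed_xml and the sample strings are 
made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if `text` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Consecutive dashes inside a comment: XML does not allow "--" there.
double_dash = "<p><!-- a -- b --></p>"

# Form feed (U+000C): not a legal character in an XML 1.0 document.
form_feed = "<p>one\x0ctwo</p>"

# Control case: a single dash inside a comment is fine in XML.
single_dash = "<p><!-- a - b --></p>"

for name, doc in [
    ("double dash", double_dash),
    ("form feed", form_feed),
    ("single dash", single_dash),
]:
    verdict = "well-formed" if is_well_formed_xml(doc) else "rejected"
    print(name, "->", verdict)
```

The first two inputs should be rejected and the control case accepted, 
so a conversion from HTML to XML has to deal with such content 
explicitly rather than pass it through unchanged.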

- Sam Ruby

Received on Thursday, 18 June 2009 10:37:33 UTC