
Re: XML Calabash 0.9.36 released

From: Norman Walsh <ndw@nwalsh.com>
Date: Wed, 05 Oct 2011 11:18:25 -0400
To: XProc Dev <xproc-dev@w3.org>
Message-ID: <m2ehyro4am.fsf@nwalsh.com>

Zearin <zearin@gonk.net> writes:
> What I want more than anything is something to convert HTML5 to
> XHTML5. (+100 points if there’s an option to convert it to polyglot
> XHTML5!)

Well, Henri's HTML parser turns random characters into HTML5, I think.
And the output is definitely XML, but it's an object model, not a
serialized representation, so some of the HTML5/XHTML5 differences
aren't detectable. For example, I don't think that polyglot really
comes into play.

FWIW, I set up the parser to be maximally forgiving of markup errors by
choosing "AlterInfoset" as the XML violation policy. I think that
means the parser will fix up stuff even if it has to go back into the
tree to correct the error.

> Back in the day, htmltidy could convert uncivilized HTML into XHTML.
> Sure—it wasn’t always perfect, but it got you 90% of the way there.
> And once it was in XHTML form, cleaning up any remaining cruft was
> (usually) trivial. Best of all, after I was done using the power of
> XML tools to work on the document, it was simple to transform it back
> into plain HTML again (for example, if I was working on something for
> somebody else who wanted vanilla HTML).
> Is there any hope for this?

I think Henri's parser gets you most of the way there. Uncivilized
HTML into HTML5 for sure. And then you can use any XProc steps you'd
like to massage it. The only tricky part will be getting exactly the
right serialization. For example, I don't know off the top of my head
how to get

  <!DOCTYPE html>

in the serialized form. 
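
One possible workaround, sketched in plain Java rather than as an XProc
step: serialize the tree as XML without an XML declaration and write
the short doctype by hand in front of it. This is only an illustration
under stated assumptions. The class name DoctypeSketch and the method
serialize are made up for this example, and nothing here is part of
XML Calabash or of Henri's parser; it uses only the JDK's
javax.xml.transform API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of any existing tool.
final class DoctypeSketch {

    // Serializes well-formed XHTML as XML, preceded by the short HTML5 doctype.
    static String serialize(String xhtml) {
        try {
            // An identity transform: copies the input to the output unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            // Suppress <?xml ...?> so the doctype is the first thing in the output.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            // The XML serializer cannot emit a doctype with no identifiers,
            // so it is written manually before the serialized tree.
            out.write("<!DOCTYPE html>\n");
            identity.transform(new StreamSource(new StringReader(xhtml)),
                               new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }
}
```

If hand-writing the doctype is not an option, HTML5 also accepts
<!DOCTYPE html SYSTEM "about:legacy-compat">, which an XML serializer
can produce by setting only a doctype system identifier.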

                                        Be seeing you,

Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676

Received on Wednesday, 5 October 2011 15:19:27 UTC
