- From: Norman Walsh <ndw@nwalsh.com>
- Date: Wed, 05 Oct 2011 11:18:25 -0400
- To: XProc Dev <xproc-dev@w3.org>
- Message-ID: <m2ehyro4am.fsf@nwalsh.com>
Zearin <zearin@gonk.net> writes:

> What I want more than anything is something to convert HTML5 to
> XHTML5. (+100 points if there’s an option to convert it to polyglot
> XHTML5!)

Well, Henri's HTML parser turns random characters into HTML5, I
think. And the output is definitely XML, but it's an object model, not a
serialized representation, so some of the HTML5/XHTML5 differences
aren't detectable. For example, I don't think that polyglot really
comes into play.

FWIW, I set up the parser to be maximally forgiving of markup errors by
choosing "AlterInfoset" as the XML violation policy. I think that
means the parser will "fixup" stuff even if it has to go back into the
tree to correct the error.

> Back in the day, htmltidy could convert uncivilized HTML into XHTML.
> Sure—it wasn’t always perfect, but it got you 90% of the way there.
> And once it was in XHTML form, cleaning up any remaining cruft was
> (usually) trivial. Best of all, after I was done using the power of
> XML tools to work on the document, it was simple to transform it back
> into plain HTML again (for example, if I was working on something for
> somebody else who wanted vanilla HTML).
>
> Is there any hope for this?

I think Henri's parser gets you most of the way there. Uncivilized
HTML into HTML5 for sure. And then you can use any XProc steps you'd
like to massage it.

The only tricky part will be getting exactly the right serialization.
For example, I don't know off the top of my head how to get
<!DOCTYPE html> in the serialized form.

                                        Be seeing you,
                                          norm

--
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676
www.marklogic.com
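
[Illustration, not from the original message.] The serialization point above can be sketched outside XProc. The tree a parser hands back is an object model with no doctype in it, so one workaround is to serialize the tree as XML and prepend `<!DOCTYPE html>` by hand. The sketch below uses Python's standard `xml.etree.ElementTree` and builds a tiny tree by hand as a stand-in for parser output. The helper name `serialize_xhtml5` is hypothetical, and this is not how Henri's parser or any XProc step does it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def serialize_xhtml5(root: ET.Element) -> str:
    """Serialize a tree as XML and prepend the HTML5 doctype (hypothetical helper)."""
    # Map the XHTML namespace to the empty prefix so elements are written
    # as <html xmlns="..."> rather than <ns0:html ...>.
    ET.register_namespace("", XHTML_NS)
    markup = ET.tostring(root, encoding="unicode", method="xml")
    # The doctype is not a node in the element tree, so it is added as text.
    return "<!DOCTYPE html>\n" + markup


def q(name: str) -> str:
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{XHTML_NS}}}{name}"


# A hand-built tree standing in for whatever an HTML parser would return.
html = ET.Element(q("html"))
head = ET.SubElement(html, q("head"))
title = ET.SubElement(head, q("title"))
title.text = "Example"
body = ET.SubElement(html, q("body"))
para = ET.SubElement(body, q("p"))
para.text = "Hello"

print(serialize_xhtml5(html))
```

This only covers the doctype. Other polyglot concerns, such as how empty elements are written, would still depend on the serializer in use.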
Received on Wednesday, 5 October 2011 15:19:27 UTC