- From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
- Date: Thu, 30 Aug 2001 13:11:24 +1200 (NZST)
- To: html-tidy@w3.org, jelks@jelks.nu, mattg@vguild.com
"Matt G" <mattg@vguild.com> wrote: I need to turn really bad HTML into parse-able XML at any cost; that the result may be complete gibberish with respect to the XHTML DTD's is of no concern. I am having trouble imagining an application that extracts useful information from botched-HTML and cares that the tags are balanced without caring which or in what order those tags might be. I suggest that the results will be of less utility than might appear. There is the obvious method, is there not? <gibberish><![CDATA[ original document ]]></gibberish> More precisely, Add "<?xml version='1.0'?><html><head>" "<title>Gibberish</title></head><body><![CDATA[" at the front. Replace each occurrence of "]]>" by "]]]><!CDATA[]>". Add "]]></body></html>" at the end. Presto chango, really bad HTML turned into parsable XML. If you want to use XSLT (but why? It is an extremely clumsy language, and all the XSLT implementations I've tried, which is most of the free ones, are *stunningly* slow) then garbled HTML input is pretty much guaranteed to lead to incorrect output.
Received on Wednesday, 29 August 2001 21:11:28 UTC