Re: to XML, not XHTML from Richard A. O'Keefe on 2001-08-30 (html-tidy@w3.org from July to September 2001)

From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
Date: Thu, 30 Aug 2001 13:11:24 +1200 (NZST)
To: html-tidy@w3.org, jelks@jelks.nu, mattg@vguild.com
Message-Id: <200108300111.NAA187797@atlas.otago.ac.nz>

"Matt G" <mattg@vguild.com> wrote:
	I need to turn really bad HTML into parse-able XML at any cost;
	that the result may be complete gibberish with respect to the
	XHTML DTD's is of no concern.
	
I am having trouble imagining an application that extracts useful information
from botched-HTML and cares that the tags are balanced without caring which
or in what order those tags might be.  I suggest that the results will be of
less utility than might appear.

There is the obvious method, is there not?

<gibberish><![CDATA[  original document ]]></gibberish>

More precisely,
    Add "<?xml version='1.0'?><html><head>"
        "<title>Gibberish</title></head><body><![CDATA["
    at the front.
    Replace each occurrence of "]]>" by "]]]><![CDATA[]>".
    Add "]]></body></html>" at the end.

Presto chango, really bad HTML turned into parsable XML.
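
For concreteness, here is a minimal Python sketch of the three steps above.
The names wrap_as_cdata and body_text are made up for illustration, and the
use of the standard library's minidom parser to check the result is my
assumption, not part of the recipe. Note that it does nothing about
characters XML 1.0 forbids outright (NUL and most other control characters),
which would still need stripping first.

```python
from xml.dom import minidom

# Step 1 and step 3: the fixed wrapper placed around the original document.
PREFIX = ("<?xml version='1.0'?><html><head>"
          "<title>Gibberish</title></head><body><![CDATA[")
SUFFIX = "]]></body></html>"


def wrap_as_cdata(bad_html: str) -> str:
    """Wrap arbitrary text in a CDATA section inside a skeleton document."""
    # Step 2: split every "]]>" across two CDATA sections so that it
    # cannot terminate the section early.
    escaped = bad_html.replace("]]>", "]]]><![CDATA[]>")
    return PREFIX + escaped + SUFFIX


def body_text(xml_text: str) -> str:
    """Parse the wrapped document and return the text held in <body>."""
    doc = minidom.parseString(xml_text)
    body = doc.getElementsByTagName("body")[0]
    # Adjacent CDATA sections arrive as separate child nodes; rejoin them.
    return "".join(node.data for node in body.childNodes)


if __name__ == "__main__":
    sample = "<p>unclosed <b>bold & x ]]> y </i>"
    wrapped = wrap_as_cdata(sample)
    print(wrapped)
    print(body_text(wrapped) == sample)
```

The parser sees one <body> element containing only character data, so the
original markup survives verbatim but none of its tags are visible as XML
elements.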
				 
If you want to use XSLT (but why?  It is an extremely clumsy language,
and all the XSLT implementations I've tried, which is most of the free
ones, are *stunningly* slow) then garbled HTML input is pretty much
guaranteed to lead to incorrect output.

Received on Wednesday, 29 August 2001 21:11:28 UTC