- From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
- Date: Fri, 31 Aug 2001 11:39:32 +1200 (NZST)
- To: KlausRusch@atmedia.net, html-tidy@w3.org
> "Matt G" <mattg@vguild.com> wrote: > I am having trouble imagining an application that extracts useful information > from botched-HTML and cares that the tags are balanced without caring which > or in what order those tags might be. I suggest that the results will be of > less utility than might appear. Why would anyone want to turn bad HTML into XML? Nononononono. That wasn't the question. Why would anyone want to convert stuff that is so hopelessly botched that Tidy can't make sense of it into XML? THAT'S the question. Even the most unstructured HTML, with invalid or misplaced tags, may contain some information that can be reused, and XSL is not necessarily a bad tool for that. But yes, if the input is as garbled as all that, the structure that XSL relies on *can't* be relied on. That's my point, really. There comes a point where the input is so garbled that "converting it to XML ... at any cost" isn't particularly useful; all the information you can reliably extract can be best extracted by treating the file as plain text. We've seen examples in this mailing list where it required human-level intelligence to figure out which characters were supposed to be part of the tags and which characters were supposed to be running text. I have done enough with XSLT to be certain that it is a bad tool for just about anything. It is a declarative language, which is _good_, but it is syntactically the second worst language I've seen (Intercal is the only thing I can think of that beats it), and the available implementations are stunningly slow (except libxslt, which is plain slow, not stunningly slow). There are plenty of declarative languages which let you express XML transformations much faster in fewer characters, even some with strict polymorphic typing.
Received on Thursday, 30 August 2001 19:39:54 UTC