- From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
- Date: Fri, 31 Aug 2001 11:39:32 +1200 (NZST)
- To: KlausRusch@atmedia.net, html-tidy@w3.org
> "Matt G" <mattg@vguild.com> wrote: > I am having trouble imagining an application that extracts useful information > from botched-HTML and cares that the tags are balanced without caring which > or in what order those tags might be. I suggest that the results will be of > less utility than might appear. Why would anyone want to turn bad HTML into XML? Nononononono. That wasn't the question. Why would anyone want to convert stuff that is so hopelessly botched that Tidy can't make sense of it into XML? THAT'S the question. Even the most unstructured HTML, with invalid or misplaced tags, may contain some information that can be reused, and XSL is not necessarily a bad tool for that. But yes, if the input is as garbled as all that, the structure that XSL relies on *can't* be relied on. That's my point, really. There comes a point where the input is so garbled that "converting it to XML ... at any cost" isn't particularly useful; all the information you can reliably extract can be best extracted by treating the file as plain text. We've seen examples in this mailing list where it required human-level intelligence to figure out which characters were supposed to be part of the tags and which characters were supposed to be running text. I have done enough with XSLT to be certain that it is a bad tool for just about anything. It is a declarative language, which is _good_, but it is syntactically the second worst language I've seen (Intercal is the only thing I can think of that beats it), and the available implementations are stunningly slow (except libxslt, which is plain slow, not stunningly slow). There are plenty of declarative languages which let you express XML transformations much faster in fewer characters, even some with strict polymorphic typing.
Received on Thursday, 30 August 2001 19:39:54 UTC