Re: to XML, not XHTML

If you want to extract certain data published in HTML format, how would you
do it?

a) Parse the data out with text parsing functions and/or regular expressions
b) Parse the HTML into a tree and crawl the tree to find the data
c) Convert the HTML to XML and use XSLT templates to extract the data
d) Other

The extracted data is going to a database, so why should I care what happens
to the bad presentation markup? All I care about is getting the data. And if
the HTML format changes, I can just modify the XSLT templates rather than
rewriting parsing functions.
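
Here is a minimal sketch of option (c), assuming the HTML has already been
run through Tidy (or something similar) and comes out as well-formed XML.
Python's standard library has no XSLT processor, so ElementTree path
expressions stand in for the XSLT templates here. The sample markup, the
RULES table, and extract() are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of an HTML-to-XML cleanup pass: the presentation
# markup is still ugly, but every tag is balanced.
CLEANED = """<html><body>
<table>
<tr><td><font color="red"><b>Widget</b></font></td><td>4.50</td></tr>
<tr><td><font color="red"><b>Gadget</b></font></td><td>12.00</td></tr>
</table>
</body></html>"""

# Extraction rules kept as data, playing the role of the XSLT templates.
# If the page layout changes, only these path expressions need editing.
RULES = {
    "row": ".//tr",        # one record per table row
    "name": "td[1]//b",    # bold text anywhere inside the first cell
    "price": "td[2]",      # second cell holds the price
}


def extract(xml_text, rules):
    """Return (name, price) tuples for every row matching the rules."""
    root = ET.fromstring(xml_text)
    records = []
    for row in root.findall(rules["row"]):
        name = row.find(rules["name"])
        price = row.find(rules["price"])
        if name is None or price is None:
            continue  # skip rows that do not match the expected shape
        records.append((name.text.strip(), float(price.text)))
    return records


records = extract(CLEANED, RULES)
```

The <font> wrapper never appears in the rules, so the presentation markup is
ignored rather than parsed, and the records are ready to go into a database.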

Is this obtuse?

    Matt

----- Original Message -----
From: "Klaus Johannes Rusch" <KlausRusch@atmedia.net>
To: <html-tidy@w3.org>
Sent: Thursday, August 30, 2001 5:57 AM
Subject: Re: to XML, not XHTML


In <200108300111.NAA187797@atlas.otago.ac.nz>, "Richard A. O'Keefe"
<ok@atlas.otago.ac.nz> writes:
> "Matt G" <mattg@vguild.com> wrote:
> I am having trouble imagining an application that extracts useful
> information from botched-HTML and cares that the tags are balanced
> without caring which or in what order those tags might be.  I suggest
> that the results will be of less utility than might appear.

Why would anyone want to turn bad HTML into XML? Even the most unstructured
HTML, with invalid or misplaced tags, may contain some information that can
be reused, and XSL is not necessarily a bad tool for that.

--
Klaus Johannes Rusch
KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/

Received on Thursday, 30 August 2001 17:20:48 UTC