- From: Matt G <mattg@vguild.com>
- Date: Thu, 30 Aug 2001 15:19:50 -0600
- To: "Klaus Johannes Rusch" <KlausRusch@atmedia.net>, <html-tidy@w3.org>
If you want to extract certain data published in HTML format, how would you do it?

a) Parse the data out with text parsing functions and/or regular expressions
b) Parse the HTML into a tree and crawl the tree to find the data
c) Convert the HTML to XML and use XSLT templates to extract the data
d) Other

The extracted data is going to a database, so why should I care what happens to the bad presentation markup? All I care about is getting the data. And if the HTML format changes, I can just modify the XSLT templates rather than rewriting parsing functions.

Is this obtuse?

Matt

----- Original Message -----
From: "Klaus Johannes Rusch" <KlausRusch@atmedia.net>
To: <html-tidy@w3.org>
Sent: Thursday, August 30, 2001 5:57 AM
Subject: Re: to XML, not XHTML

In <200108300111.NAA187797@atlas.otago.ac.nz>, "Richard A. O'Keefe" <ok@atlas.otago.ac.nz> writes:
> "Matt G" <mattg@vguild.com> wrote:
> I am having trouble imagining an application that extracts useful information
> from botched-HTML and cares that the tags are balanced without caring which
> or in what order those tags might be. I suggest that the results will be of
> less utility than might appear.

Why would anyone want to turn bad HTML into XML? Even the most unstructured HTML, with invalid or misplaced tags, may contain some information that can be reused, and XSL is not necessarily a bad tool for that.

--
Klaus Johannes Rusch
KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/
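[Editor's note: a minimal sketch of Matt's option (c). It uses Python's lxml as a stand-in for the Tidy-plus-XSLT pipeline the thread discusses (an assumption; any lenient HTML parser feeding an XSLT 1.0 processor would work the same way). The input markup and the stylesheet are hypothetical examples, not anything from the thread.]

    from io import StringIO
    from lxml import etree, html

    # Lenient parse of deliberately broken HTML (unclosed td/tr tags).
    # lxml's HTML parser balances the tags much as Tidy would.
    doc = html.parse(StringIO("""
    <html><body>
    <table>
    <tr><td>Widget<td>9.95
    <tr><td>Gadget<td>4.50
    </table>
    """))

    # A tiny XSLT template that extracts each row as "name,price" text,
    # ready to load into a database. If the page layout changes, only
    # this stylesheet needs editing, not the parsing code.
    xslt = etree.XML(b"""\
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <xsl:template match="/">
        <xsl:for-each select="//tr">
          <xsl:value-of select="td[1]"/>,<xsl:value-of select="td[2]"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>""")

    # Prints: Widget,9.95 / Gadget,4.50 (one record per line)
    print(etree.XSLT(xslt)(doc))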
Received on Thursday, 30 August 2001 17:20:48 UTC