- From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
- Date: Fri, 31 Aug 2001 12:03:57 +1200 (NZST)
- To: KlausRusch@atmedia.net, html-tidy@w3.org, mattg@vguild.com
"Matt G" <mattg@vguild.com> wrote:

> If you want to extract certain data published in HTML format,
> how would you do it?
> b) Parse the HTML into a tree and crawl the tree to find the data
> c) Convert the HTML to XML and use XSLT templates to extract the data

Most XSLT processors do not require an XML document. They will work off
- a DOM object that you build any way you want, or
- a sequence of SAX events that you fire off, doing it any way you like, or
- an actual XML object.

If starting with JTidy, perhaps the most efficient way of getting the information into an XSLT processor would be by writing a little glue code to walk over JTidy's "home-brew DOM" and fire off SAX events.

Me, I would definitely go for alternative (b). I'd use Scheme, or Prolog, or Mercury, or Haskell, or Clean, or OCAML, or ... to do the tree-walking.

> The extracted data is going to a database, so why should I care
> what happens to the bad presentation markup?

The question is, HOW DO YOU KNOW which parts of the input file go into what fields? If the markup isn't giving you any help at all, then why try to tidy it? Why not just strip out tags completely?

> All I care about is getting the data.

That's fine, *if* the input is clean enough for you to know which part is the date and which part is the price. (Or whatever.)

> And if the HTML format changes,

What do you mean "if the HTML format changes"? Are you talking about things like HTML 2.0 -> HTML 3.2 -> HTML 4.01? Those were upwards-compatible extensions. Are you talking about some stylised use of CLASS attributes to tag information semantically?

WHO is controlling the format of these documents?
WHAT kinds of "format" must they satisfy over and above being HTML?
WHY aren't they cleaned by their creators?
WHAT kinds of mess do you have to cope with?
WHAT kinds of structural properties guide your information extraction process?
HOW do you know that the mess isn't bad enough to destroy the structure you expect to rely on?

(I've seen documents with two heads, documents with a head inside a body, you name the monstrosity, and some commercial HTML editor will happily generate it.)

> I can just modify the XSLT templates rather than rewriting
> parsing functions.

"Parsing" in the sense of turning XML into trees is pretty trivial. The *real* job of "parsing" is precisely what you write XSLT code to do.
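
To make the "glue code" idea concrete, here is a rough Python sketch of walking a DOM-like tree and firing SAX events at a handler. It is only an illustration under stated assumptions: the standard library's xml.dom.minidom stands in for JTidy's tree, the names fire_sax_events and PriceCollector are made up for this example, and the class="price" convention is a hypothetical instance of using CLASS attributes to tag information semantically.

```python
from xml.dom import minidom, Node
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def fire_sax_events(node, handler):
    """Walk a DOM subtree depth-first and report it as SAX events."""
    if node.nodeType == Node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            fire_sax_events(child, handler)
        handler.endDocument()
    elif node.nodeType == Node.ELEMENT_NODE:
        # Copy the element's attributes into a plain dict for SAX.
        attrs = {name: value for name, value in node.attributes.items()}
        handler.startElement(node.tagName, AttributesImpl(attrs))
        for child in node.childNodes:
            fire_sax_events(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        handler.characters(node.data)
    # Comments, processing instructions, etc. are ignored in this sketch.


class PriceCollector(ContentHandler):
    """Hypothetical consumer: collects the text of each <td class="price">."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._buffer = None  # a list while inside a price cell, else None

    def startElement(self, name, attrs):
        if name == "td" and attrs.get("class") == "price":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "td" and self._buffer is not None:
            self.prices.append("".join(self._buffer).strip())
            self._buffer = None


# Usage: the input here is already well-formed; real input would be tidied first.
doc = minidom.parseString(
    '<table><tr><td class="date">2001-08-31</td>'
    '<td class="price"> 9.95 </td></tr></table>'
)
collector = PriceCollector()
fire_sax_events(doc, collector)
print(collector.prices)
```

The handler could just as well be the input side of an XSLT processor that accepts SAX events. Either way the walker is the trivial part; knowing that class="price" reliably marks the price is the part that depends on how clean the input is.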
Received on Thursday, 30 August 2001 20:04:02 UTC