Re: to XML, not XHTML from Richard A. O'Keefe on 2001-08-31 (html-tidy@w3.org from July to September 2001)

From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
Date: Fri, 31 Aug 2001 12:03:57 +1200 (NZST)
To: KlausRusch@atmedia.net, html-tidy@w3.org, mattg@vguild.com
Message-Id: <200108310003.MAA194644@atlas.otago.ac.nz>

"Matt G" <mattg@vguild.com> wrote:
	If you want to extract certain data published in HTML format,
	how would you do it?
	
	b) Parse the HTML into a tree and crawl the tree to find the data
	c) Convert the HTML to XML and use XSLT templates to extract the data

Most of the XSLT processors do not require an XML document.
They will work off
 - a DOM object that you build any way you want, or
 - a sequence of SAX events that you fire off, doing it any way you like, or
 - an actual XML document.
If starting with JTidy, perhaps the most efficient way of getting the
information into an XSLT processor would be by writing a little glue code
to walk over JTidy's "home-brew DOM" and fire off SAX events.
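
A minimal sketch of that kind of glue code, written in Python rather than Java and using the standard library's minidom as a stand-in for JTidy's tree (an assumption for illustration; JTidy's own node classes are not shown here). The names fire_sax_events and EventLog are hypothetical. The handler just records events; in practice the receiving end would be whatever SAX input the XSLT processor accepts.

```python
from xml.dom import Node
from xml.dom.minidom import parseString
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def fire_sax_events(node, handler):
    """Walk a DOM subtree depth-first and report it to a SAX ContentHandler."""
    if node.nodeType == Node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            fire_sax_events(child, handler)
        handler.endDocument()
    elif node.nodeType == Node.ELEMENT_NODE:
        # Copy the DOM attribute map into a plain dict for AttributesImpl.
        attrs = {}
        if node.attributes is not None:
            for i in range(node.attributes.length):
                attr = node.attributes.item(i)
                attrs[attr.name] = attr.value
        handler.startElement(node.tagName, AttributesImpl(attrs))
        for child in node.childNodes:
            fire_sax_events(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        handler.characters(node.data)
    # Comments and processing instructions are ignored in this sketch.


class EventLog(ContentHandler):
    """Hypothetical consumer: records events instead of running a stylesheet."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


doc = parseString('<p class="price">42</p>')
log = EventLog()
fire_sax_events(doc, log)
```

The point of the design is that no XML text is ever serialised and re-parsed: the tree is handed over as events directly.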

Me, I would definitely go for alternative (b).  I'd use Scheme, or Prolog,
or Mercury, or Haskell, or Clean, or OCAML, or ... to do the tree-walking.
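
For concreteness, here is roughly what alternative (b) looks like, sketched in Python with only the standard library rather than any of the languages above. It assumes the markup carries a usable hint (here a hypothetical class="price" on a table cell) and that the input is already reasonably well nested; TreeNode, TreeBuilder, text_of and find_all are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Elements that never have content, so they are not pushed as "open".
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeNode:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # TreeNode instances or plain strings


class TreeBuilder(HTMLParser):
    """Builds a naive tree; it does not infer omitted end tags."""

    def __init__(self):
        super().__init__()
        self.root = TreeNode("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TreeNode(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_of(node):
    """Concatenate all text beneath a node."""
    return "".join(c if isinstance(c, str) else text_of(c)
                   for c in node.children)


def find_all(node, tag, cls=None):
    """Crawl the tree, yielding elements with the given tag (and class)."""
    for child in node.children:
        if isinstance(child, str):
            continue
        classes = (child.attrs.get("class") or "").split()
        if child.tag == tag and (cls is None or cls in classes):
            yield child
        yield from find_all(child, tag, cls)


html = ('<table><tr><td class="date">2001-08-31</td>'
        '<td class="price"><b>12.50</b></td></tr></table>')
builder = TreeBuilder()
builder.feed(html)
builder.close()
prices = [text_of(n) for n in find_all(builder.root, "td", "price")]
dates = [text_of(n) for n in find_all(builder.root, "td", "date")]
```

The crawl itself is a few lines; everything interesting is in deciding what to look for.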

	The extracted data is going to a database, so why should I care
	what happens to the bad presentation markup?

The question is, HOW DO YOU KNOW which parts of the input file go into
what fields?  If the markup isn't giving you any help at all, then why
try to tidy it?  Why not just strip out tags completely?
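
Stripping tags outright takes only a few lines. A sketch in Python using the standard html.parser module; TagStripper and strip_tags are hypothetical names, and note that this version also keeps the text content of script and style elements, which you would probably want to drop.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects character data and discards every tag."""

    def __init__(self):
        # convert_charrefs=True turns &amp; and friends into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace so the result is a single line of text.
    return " ".join("".join(stripper.chunks).split())


plain = strip_tags("<p>Price: <b>12.50</b> &amp; up</p>")
```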

	All I care about is getting the data.

That's fine, *if* the input is clean enough for you to know which part
is the date and which part is the price.  (Or whatever.)

	And if the HTML format changes,

What do you mean "if the HTML format changes"?
Are you talking about things like HTML 2.0 -> HTML 3.2 -> HTML 4.01?
Those were upwards-compatible extensions. 
Are you talking about some stylised use of CLASS attributes to tag
information semantically?

WHO is controlling the format of these documents?
WHAT kinds of "format" must they satisfy over and above being HTML?
WHY aren't they cleaned by their creators?
WHAT kinds of mess do you have to cope with?
WHAT kinds of structural properties guide your information extraction
     process?
HOW do you know that the mess isn't bad enough to destroy the structure
    you expect to rely on?  (I've seen documents with two heads, documents
    with a head inside a body, you name the monstrosity, and some
    commercial HTML editor will happily generate it.)
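
Checking for the two monstrosities just mentioned before trusting the structure is cheap. A rough sketch in Python with the standard html.parser module; StructureCheck and check are hypothetical names, and the two tests shown are only examples of the kind of sanity check meant, not a complete validator.

```python
from html.parser import HTMLParser


class StructureCheck(HTMLParser):
    """Flags documents with more than one head, or a head inside a body."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # names of elements opened and not yet closed
        self.head_count = 0
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.head_count += 1
            if self.head_count > 1:
                self.problems.append("more than one <head>")
            if "body" in self.open_tags:
                self.problems.append("<head> inside <body>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def check(html):
    checker = StructureCheck()
    checker.feed(html)
    checker.close()
    return checker.problems


bad = check("<html><head></head><body><head></head></body></html>")
good = check("<html><head><title>t</title></head>"
             "<body><p>ok</p></body></html>")
```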

	I can just modify the XSLT templates rather than rewriting
	parsing functions.
	
"Parsing" in the sense of turning XML into trees is pretty trivial.
The *real* job of "parsing" is precisely what you write XSLT code
to do.

Received on Thursday, 30 August 2001 20:04:02 UTC