Connecting to Prolog and SGML/HTML trouble from Jan Wielemaker on 2000-03-03 (www-lib@w3.org from January to March 2000)

From: Jan Wielemaker <jan@swi.psy.uva.nl>
Date: Fri, 3 Mar 2000 10:26:09 +0100
To: www-lib@w3.org
Message-Id: <00030310562700.17016@gollem>
Hi,

I've been playing around with libwww for about a week, trying to figure
out how I can bind it to SWI-Prolog and how useful it is.

I'm starting to make some progress, though I'm still wondering about the
best level at which to interact.  I don't want to provide a Prolog
interface for every interface function, but rather an easy-to-program,
extensible interface layer that can be used to quickly incorporate what
is needed for some project.

Getting data from a URI and feeding it into a Prolog I/O stream works
fine now.  I've got the impression that the protocol layer is generally
done neatly.

Then I started connecting the HTML stuff.  A first connection
through the HText stuff was easily done: it collects the data from a URI,
transforms it into a structured Prolog term and, on completing the
document, makes a call-back to Prolog passing that structure.

Only, when I saw the result I was totally disappointed.  We have been
playing here with the SP parser, which deals pretty well with not too
badly formed HTML, but the libwww parser doesn't handle omitted tags!?

I browsed the mailing list and saw various queries on this topic,
including some remarks from people working on a proper parser.  So far,
I have concluded:

	* The libwww SGML parser as it stands only recognises tags and
	  entities based on some C-defined structure.

	  It cannot deal with omitted tags, and knows nothing about
	  content models or attribute types.

	  I presume it cannot deal with anything but ISO-latin-1.

	* I took a brief look at the Amaya parser, but this parser, like
	  SP, appears to be `sucking' in the input data.  I used the `x
	  closes a b c' definitions in my Prolog binding, which makes
	  things a lot better, but due to the lack of a content model
	  in the SGML parser I still get <HR> with content inside it.

	  A closer look also suggests that the Amaya parser itself most
	  likely doesn't return anything that looks like SP's output.
	  Probably their engine doesn't need that, but I do.
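
For readers unfamiliar with the `x closes a b c' approach, the following
is a minimal Python sketch of the idea and of the <HR> problem described
above.  It is not the Prolog binding or any libwww/Amaya code: the CLOSES
table, the event tuples and the function name apply_closes are all
assumptions made up for illustration.

```python
# Illustrative "x closes a b c" table (hypothetical, not taken from Amaya):
# a start tag for the key implicitly closes any listed element that is
# currently the innermost open one.
CLOSES = {
    "P": {"P"},
    "LI": {"LI", "P"},
    "HR": {"P"},
}


def apply_closes(events):
    """Insert implied end tags into a list of (kind, value) events."""
    open_elements = []  # stack of currently open element names
    out = []
    for kind, value in events:
        if kind == "start":
            # Pop every innermost element that the new tag closes.
            while open_elements and open_elements[-1] in CLOSES.get(value, ()):
                out.append(("end", open_elements.pop()))
            open_elements.append(value)
        elif kind == "end":
            if value not in open_elements:
                continue  # stray end tag: drop it
            # Close intervening elements whose end tags were omitted.
            while open_elements[-1] != value:
                out.append(("end", open_elements.pop()))
            open_elements.pop()
        out.append((kind, value))
    return out


sample = [
    ("start", "P"), ("text", "one"),
    ("start", "P"), ("text", "two"),
    ("start", "HR"), ("text", "three"),
]
result = apply_closes(sample)
```

The two paragraphs are closed correctly, but nothing in the table says
that HR is empty, so the text "three" still ends up inside the open HR
element.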

Is this a correct reflection of the current status?  I assume things
are much better on the XML side, right?

What now?  I like the libwww architecture and I'm not too keen on
connecting SP to it in some odd way.  Is handling omitted tags the
only real problem?  Has anybody written a DTD parser (handling at
least the content model, attribute declarations and parameter
entities)?  Using that and the existing libwww SGML parser, it might
not be too hard to do a reasonable job of building a second-stage
structure stream that translates the output of the first stage by
inserting the required tags.
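
A rough Python sketch of what such a second stage could look like is
given here.  It assumes the first stage delivers ("start", name),
("text", data) and ("end", name) events, and it uses a small hand-written
MODEL table as a stand-in for what a DTD parser would supply.  The table
contents, the event format and the name insert_omitted_tags are all
hypothetical; real SGML content models are considerably richer than a
set of allowed children.

```python
TEXT = "#PCDATA"

# Hypothetical stand-in for DTD information:
# name -> (end tag omissible?, allowed children, or None if declared EMPTY)
MODEL = {
    "BODY": (False, {"P", "UL", "HR"}),
    "P": (True, {TEXT}),
    "UL": (False, {"LI"}),
    "LI": (True, {TEXT, "P", "UL"}),
    "HR": (True, None),
}


def allows(parent, child):
    """True if `parent` is not EMPTY and lists `child` as allowed content."""
    children = MODEL[parent][1]
    return children is not None and child in children


def insert_omitted_tags(events):
    """Second-stage filter: yield events with implied end tags made explicit.

    Unknown element names raise KeyError in this sketch.
    """
    stack = []  # currently open elements, innermost last

    def close_until_allowed(child):
        # Close open elements that cannot contain `child`, but only while
        # their end tag is declared omissible.
        while stack and not allows(stack[-1], child) and MODEL[stack[-1]][0]:
            yield ("end", stack.pop())

    for kind, value in events:
        if kind == "start":
            yield from close_until_allowed(value)
            yield (kind, value)
            if MODEL[value][1] is None:
                yield ("end", value)  # EMPTY element: close immediately
            else:
                stack.append(value)
        elif kind == "text":
            yield from close_until_allowed(TEXT)
            yield (kind, value)
        elif kind == "end" and value in stack:
            while stack[-1] != value:
                yield ("end", stack.pop())
            stack.pop()
            yield (kind, value)
    while stack:  # end of document: close whatever is still open
        yield ("end", stack.pop())


fixed = list(insert_omitted_tags([
    ("start", "BODY"), ("start", "P"), ("text", "a"),
    ("start", "HR"), ("start", "P"), ("text", "b"), ("end", "BODY"),
]))
```

Because HR is marked EMPTY, it is closed as soon as it is opened, so the
following paragraph becomes a sibling of HR instead of its content.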

I'm considering doing it myself, but first I'd like to know whether
there is someone with a decent starting point.  A DTD parser would
be just great.
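
To give an idea of the minimum such a DTD parser would have to extract,
here is a deliberately simplified Python sketch.  It only handles
parameter entities with literal values and <!ELEMENT ...> declarations
with both tag-omission flags present; it ignores attribute declarations,
inclusions/exclusions, marked sections and external entities.  The
function names and the result format are assumptions for illustration
only.

```python
import re

COMMENT_RE = re.compile(r"--.*?--", re.S)
ENTITY_RE = re.compile(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>')
PEREF_RE = re.compile(r"%([\w.-]+);?")
ELEMENT_RE = re.compile(
    r"<!ELEMENT\s+(\([^)]*\)|[\w.-]+)\s+([-O])\s+([-O])\s+([^>]*)>", re.I)
NAME_RE = re.compile(r"#?[A-Za-z][\w.-]*")


def expand(text, entities, max_depth=20):
    """Replace %name; references repeatedly until nothing changes."""
    for _ in range(max_depth):  # bound the loop in case of cyclic entities
        expanded = PEREF_RE.sub(
            lambda m: entities.get(m.group(1), m.group(0)), text)
        if expanded == text:
            break
        text = expanded
    return text


def parse_dtd(dtd):
    """Return {NAME: (end_tag_omissible, children or None if EMPTY)}."""
    dtd = COMMENT_RE.sub("", dtd)  # drop -- ... -- comments
    entities = {}
    for name, value in ENTITY_RE.findall(dtd):
        entities.setdefault(name, value)  # in SGML the first declaration wins
    elements = {}
    for names, _start, end, content in ELEMENT_RE.findall(expand(dtd, entities)):
        content = content.strip()
        if content.upper() == "EMPTY":
            children = None
        else:
            # Flatten the content model to the set of names it mentions.
            children = {n.upper() for n in NAME_RE.findall(content)}
        for name in NAME_RE.findall(names):  # handles (B|I) name groups
            elements[name.upper()] = (end.upper() == "O", children)
    return elements


SAMPLE = """
<!ENTITY % inline "B | I">
<!ELEMENT P  - O (#PCDATA | %inline;)*  -- paragraph -->
<!ELEMENT HR - O EMPTY>
<!ELEMENT (B|I) - - (#PCDATA)*>
"""
elements = parse_dtd(SAMPLE)
```

Flattening the content model to a set of names throws away ordering and
occurrence indicators, which is enough to decide where an omitted end
tag must be inserted in simple cases but not for full validation.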

	Regards --- Jan
Received on Friday, 3 March 2000 04:56:28 UTC