- From: Jan Wielemaker <jan@swi.psy.uva.nl>
- Date: Fri, 3 Mar 2000 10:26:09 +0100
- To: www-lib@w3.org
Hi,

I've been playing around with libwww for about a week, trying to figure out how I can bind it to SWI-Prolog and how useful it is. I'm starting to make some progress, though I'm still wondering about the best level at which to interact. I don't want to provide a Prolog interface for all interface functions, but rather an easy-to-program, extensible interface layer that can be used to quickly incorporate whatever is needed for some project.

Getting data from a URI and feeding it into a Prolog I/O stream works fine now. I've got the impression that the protocol layer is generally done neatly.

Then I started connecting the HTML stuff. A first connection through the HText stuff was easily done: collecting the data from a URI, transforming it into a Prolog structured term and, on completing the document, making a callback to Prolog passing the structure. Only, when I saw the result I was totally disappointed. We have been playing here with the SP parser, which deals pretty well with not too badly formed HTML, but the libwww parser doesn't handle omitted tags!?

I browsed the mailing list, seeing various queries on this topic, including some remarks from people working on a proper parser. So far, I have concluded:

  * The libwww SGML parser as it stands only recognises tags and entities based on some C-defined structure. It cannot deal with omitted tags, and does not know about content models or attribute types. I presume it cannot deal with anything but ISO-Latin-1.

  * I took a brief look at the Amaya parser, but this parser, like SP, appears to be `sucking' the input data. I used the `x closes a b c' definitions in my Prolog binding, which makes things a lot better, but due to the lack of a content model in the SGML parser I still get <HR> with content inside it. A closer look also suggests the Amaya parser itself most likely doesn't return anything that looks like SP's output. Probably their engine doesn't need that, but I do.

Is this a correct reflection of the status?
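
To make the <HR> problem concrete, here is a minimal sketch in Python of a second-stage filter driven by `x closes a b c' rules. It is only an illustration, not the Prolog binding itself: the rule table, the (kind, value) event format and the name insert_end_tags are all assumptions, and the optional `empty' set stands in for the one piece of content-model knowledge (EMPTY elements) that the closes rules alone do not supply.

```python
# Hypothetical "x closes a b c" table: a start tag x implicitly closes
# the listed elements when one of them is the innermost open element.
CLOSES = {
    "p": {"p"},
    "li": {"li", "p"},
    "hr": {"p"},
}


def insert_end_tags(events, closes=CLOSES, empty=frozenset()):
    """Insert omitted end tags into a stream of (kind, value) events.

    kind is "start", "end" or "text"; the event format is an assumption
    standing in for whatever a first-stage tokenizer reports.
    `empty` lists elements declared EMPTY (content-model knowledge).
    """
    stack = []  # currently open elements, innermost last
    out = []
    for kind, value in events:
        if kind == "start":
            # Close open elements that this start tag implicitly ends.
            while stack and stack[-1] in closes.get(value, ()):
                out.append(("end", stack.pop()))
            out.append(("start", value))
            if value in empty:
                out.append(("end", value))  # EMPTY: close immediately
            else:
                stack.append(value)
        elif kind == "end":
            if value in stack:
                # Close everything up to and including the named element.
                while stack:
                    top = stack.pop()
                    out.append(("end", top))
                    if top == value:
                        break
            # Otherwise it is a stray end tag and is dropped.
        else:
            out.append((kind, value))
    while stack:  # close whatever is still open at end of input
        out.append(("end", stack.pop()))
    return out


events = [
    ("start", "p"), ("text", "one"),
    ("start", "p"), ("text", "two"),
    ("start", "hr"), ("text", "three"),
]
without_model = insert_end_tags(events)
with_model = insert_end_tags(events, empty={"hr"})
```

With the closes rules alone, both <P> elements are closed correctly, but the text "three" still ends up inside <HR>. Passing empty={"hr"} closes <HR> immediately, so the text lands after it. A real solution would take that information from the DTD rather than from a hand-written set.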

I assume things are much better on the XML side, right?

What now? I like the libwww architecture and I'm not too keen on connecting SP to it in some odd way. Is handling omitted tags the only real problem? Is there somebody who has written a DTD parser (covering at least the content model, attribute declarations and parameter entities)? Using that and the existing libwww SGML parser, it might not be too hard to do a reasonable job of building a second-stage structure stream that translates the output of the first stage, inserting the required tags.

I'm considering doing it myself, but first I'd like to know whether there is someone with a decent starting point. A DTD parser would be just great.

	Regards --- Jan

Received on Friday, 3 March 2000 04:56:28 UTC