- From: Corne Beerse <beerse@ats.nld.alcatel.nl>
- Date: Mon, 16 Nov 1998 09:24:16 +0100
- To: Howard Rubin <hrubin@nyx.net>
- CC: hrubin@disc.com, www-amaya@w3.org
Howard Rubin wrote:
> I need to extract text from HTML documents and do this from
> a platform portable C program. I've been all over the web --
> dejanews, yahoo etc., and the closest thing I've found is libwww.
> However, I notice in libwww (http://www.w3.org/Library/User/Start.html)
> that libwww isn't recommended for use as an HTML parser. It
> recommends Amaya as a full HTML parser.

You might try to write a Perl (or sed/awk) script for the purpose.

>
> Is there some part of the Amaya source code that would be suitable
> for extracting the text from HTML documents from a C program?
> Any tips, hints, etc would be greatly appreciated.

I would have a look at the print code. There is a special executable
in the bin directory. Strip all the PostScript code it generates
and there you have your text.

CB
--
Is reading in the bathroom considered Multi-Tasking?
Corne' Beerse                    | Alcatel Telecom Nederland
mailto:beerse@ats.nld.alcatel.nl | Postbus 3292
talkto:+31(70)3079108 faxto:+31(70)3079191 | NL-2280 GG Rijswijk
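
For the script route suggested above, the same idea can be sketched in portable C, since that is what the question asks for. This is only a naive illustration, not Amaya or libwww code: `strip_tags` is a hypothetical helper name, and it assumes that everything between `<` and `>` is markup. It does not decode entities such as `&amp;`, and it does not treat comments, scripts, or a `>` inside an attribute value specially.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of Amaya or libwww):
 * copy src into dst, skipping everything between '<' and '>'.
 * At most dstsize - 1 characters are written, followed by '\0'.
 * Returns the number of characters written, excluding the '\0'.
 */
size_t strip_tags(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;
    int in_tag = 0;   /* nonzero while we are inside <...> */

    if (dstsize == 0)
        return 0;

    for (; *src != '\0' && n + 1 < dstsize; src++) {
        if (*src == '<')
            in_tag = 1;            /* tag starts: stop copying */
        else if (*src == '>' && in_tag)
            in_tag = 0;            /* tag ends: resume copying */
        else if (!in_tag)
            dst[n++] = *src;       /* plain text: keep it */
    }
    dst[n] = '\0';
    return n;
}
```

Calling `strip_tags("<p>Hello, <b>world</b>!</p>", buf, sizeof buf)` with a large enough `buf` leaves `Hello, world!` in `buf`. For real documents a proper parser is still the safer choice, which is why the print code is worth a look.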
Received on Monday, 16 November 1998 03:25:01 UTC