Re: HTMLToPlain and libwww 5.2 from Henrik Frystyk Nielsen on 1998-11-21 (www-lib@w3.org from October to December 1998)

From: Henrik Frystyk Nielsen <frystyk@w3.org>
Date: Sat, 21 Nov 1998 13:32:30 -0500
To: kent@iastate.edu, www-lib@w3.org
Message-Id: <3.0.5.32.19981121133230.02f7bb90@localhost>

At 00:07 11/21/98 CST, Kent Vander Velden wrote:
>
>  I have been trying to convert the returned html to plain text.  So
>far I have not been able to do this.  Using the w3c program I can
>retrieve the remote file in "text/latex" format but not in
>"text/x-c" or "text/plain".  I have added the extra converters
>with a call to HTMLInit() and can see, when maximum debug
>is enabled, that the converter is found.  It is also clear from
>the parser output that the converter is running; there just is
>no output.
>
>  In short, 
>    this works:
>      ./w3c -to "text/latex" http://www.w3.org/ -o w3home.txt
>    this does not:
>      ./w3c -to "text/plain" http://www.w3.org/ -o w3home.txt

I don't think either of these works: the command line tool [1] doesn't
have an HTML parser integrated. I only added the HTML parser to the webbot
[2] (which needs it for finding links) and the line mode browser [3]
(because it's a browser!).

With the line mode browser, the following should work as intended:

	./www -to "text/latex" http://www.w3.org/ -o w3home.tex

	./www -to "text/plain" http://www.w3.org/ -o w3home.txt

(The "text/latex" output may not be fully compliant TeX, though.) You can
remove the [n] link references by using the "-na" command line option.
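
If a text file was already written without "-na", something like the
following could strip the markers afterwards. This is only a rough sketch,
not part of libwww: it assumes the references show up as bracketed integers
such as "[12]", and the function name strip_link_refs is made up for the
example. Note that it would also remove any genuine bracketed numbers in
the page text.

```python
import re
import sys

# Assumption: link references look like "[1]", "[23]", and so on.
_LINK_REF = re.compile(r"\[\d+\]")


def strip_link_refs(text: str) -> str:
    """Remove "[n]" markers; also removes genuine bracketed integers."""
    return _LINK_REF.sub("", text)


if __name__ == "__main__":
    # Filter stdin to stdout so the script can be used in a pipeline.
    sys.stdout.write(strip_link_refs(sys.stdin.read()))
```

Saved under a hypothetical name such as strip_refs.py, it could be run as
"python strip_refs.py < w3home.txt > w3home-clean.txt". Using "-na" when
generating the file is still the simpler route.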

Henrik

[1] http://www.w3.org/ComLine/
[2] http://www.w3.org/Robot/
[3] http://www.w3.org/LineMode/

--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk

Received on Saturday, 21 November 1998 13:32:32 UTC