- From: Sami Lempinen <lempinen@iki.fi>
- Date: Tue, 28 Nov 2000 08:09:14 +0200
- To: #VIKRAM BALKRISHNAN NATARAJAN# <U903506@ntu.edu.sg>
- Cc: html-tidy@w3.org
Greetings,
[cc'd to the list]
On Tue, Nov 28, 2000 at 11:49:13AM +0800, #VIKRAM BALKRISHNAN NATARAJAN# wrote:
> Thanks a lot for your prompt reply.
> I wanted to ask you a few fundamental questions before I can start using
> JTidy to know that I am on the right track.
>
> 1: Can JTidy be easily used with my Java program to parse and structure HTML
> pages?
Yes, although it depends on your definition of "easily": you have to
know a bit of DOM to do this. See my article at
<http://lempinen.net:8180/Forum/975361475/>
for an introduction. The project documentation at
<http://sourceforge.net/docman/?group_id=13153>
contains more code fragments.
> 2: Other than parsing, can JTidy be used to retrieve only the text from, say, a
> web page of www.cnn.com, i.e. retrieve the text of a news site article?
I would do this *after* parsing: first open a stream from a URL, pass
the stream to JTidy and extract the DOM tree. Then, use the DOM to
extract the textual contents.
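As a rough sketch of the last step (walking the DOM to collect the text),
something like the following might do. It uses only the standard
org.w3c.dom interfaces, so it should work on any DOM tree, not just one
built by JTidy. The class name TextExtractor and the choice to skip
<script> and <style> elements are my own assumptions for illustration.

```java
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TextExtractor {

    /** Returns the text of all text nodes below the given node, in document order. */
    public static String extractText(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        return out.toString().trim();
    }

    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Keep non-blank text, separated by single spaces.
            String value = node.getNodeValue();
            if (value != null && value.trim().length() > 0) {
                out.append(value.trim()).append(' ');
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // Assumption: script and style contents are not wanted as article text.
            String name = node.getNodeName();
            if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
                return;
            }
        }
        // Recurse into the children of elements and of the document node itself.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }
}
```

With JTidy on the classpath, the tree to pass in would presumably come
from something along the lines of
new Tidy().parseDOM(new URL(address).openStream(), null),
where address is a placeholder for the page you want. Please check the
exact calls against the project documentation before relying on them.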
Yours,
-Sami
--
lempinen@iki.fi http://www.iki.fi/lempinen/
ICQ:19002710 ************* apt-get a life
Received on Tuesday, 28 November 2000 01:09:24 UTC