
Re: your mail

From: Sami Lempinen <lempinen@iki.fi>
Date: Tue, 28 Nov 2000 08:09:14 +0200
Cc: html-tidy@w3.org
Message-ID: <20001128080914.A2716@koti1-user114.adsl.tpo.fi>

[cc'd to the list]

On Tue, Nov 28, 2000 at 11:49:13AM +0800, #VIKRAM BALKRISHNAN NATARAJAN# wrote:

> Thanks a lot for your prompt reply.
> I wanted to ask you a few fundamental questions before I can start using
> JTidy to know that I am on the right track.
> 1: Can JTidy be easily used with my java program to parse and structure HTML
> pages. 

Yes, although it depends on your definition of "easily": you have to
know a bit of DOM to do this. See my article at 


for an introduction. The project documentation at


contains more code fragments.

> 2: Other than parsing can JTidy be used to retrieve only the text from say a
> web page of www.cnn.com i.e. retrieve the text of a news site article.

I would do this *after* parsing: first open a stream from a URL, pass
the stream to JTidy and get the DOM tree back. Then walk the DOM to
extract the textual contents.
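
A minimal sketch of the second step (walking the DOM and collecting the
text nodes) might look like the following. The class and method names
(DomText, collect, extractText, parseWellFormed) are made up for
illustration. To keep the example runnable without JTidy on the classpath,
the JDK's own XML parser stands in for JTidy here; with JTidy, the Document
would instead come from something like Tidy.parseDOM(InputStream,
OutputStream), fed with the stream opened from the URL. Skipping script
and style elements is my assumption about what "only the text" should mean.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DomText {

    // Recursively appends the value of every text node below 'node',
    // each followed by a space so that adjacent elements do not run together.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            // Assumption: script and style contents are not wanted as text.
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Returns the text below 'root' with runs of whitespace collapsed.
    static String extractText(Node root) {
        StringBuilder out = new StringBuilder();
        collect(root, out);
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // Stand-in for the JTidy step: parses well-formed markup with the JDK parser.
    // Real-world HTML is rarely well-formed, which is why JTidy is needed there.
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>");
        System.out.println(extractText(doc));
    }
}
```

The tree is walked by hand with getFirstChild/getNextSibling and
getNodeValue, which only needs basic DOM Level 1 calls. Note that the
space added after each text node also splits words that are broken up by
inline markup, so treat the output as approximate.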



lempinen@iki.fi http://www.iki.fi/lempinen/
ICQ:19002710  *************  apt-get a life
Received on Tuesday, 28 November 2000 01:09:24 UTC
