Re: your mail

From: Sami Lempinen <lempinen@iki.fi>
Date: Tue, 28 Nov 2000 08:09:14 +0200
To: #VIKRAM BALKRISHNAN NATARAJAN# <U903506@ntu.edu.sg>
Cc: html-tidy@w3.org
Message-ID: <20001128080914.A2716@koti1-user114.adsl.tpo.fi>

Greetings,

[cc'd to the list]

On Tue, Nov 28, 2000 at 11:49:13AM +0800, #VIKRAM BALKRISHNAN NATARAJAN# wrote:

> Thanks a lot for your prompt reply.
> I wanted to ask you a few fundamental questions before I can start using
> JTidy to know that I am on the right track.
> 
> 1: Can JTidy be easily used with my java program to parse and structure HTML
> pages. 

Yes, although it depends on your definition of "easily": you have to
know a bit of DOM to do this. See my article at 

     <http://lempinen.net:8180/Forum/975361475/>

for an introduction. The project documentation at

    <http://sourceforge.net/docman/?group_id=13153>

contains more code fragments.

> 2: Other than parsing can JTidy be used to retrieve only the text from say a
> web page of www.cnn.com i.e. retrieve the text of a news site article.

I would do this *after* parsing: first open a stream from a URL, pass
the stream to JTidy and get the DOM tree back. Then, walk the DOM to
extract the textual contents.
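
A minimal sketch of that last step, walking a DOM tree and collecting
its text nodes, might look like the following. The class and method
names (DomText, textOf, parseWellFormed) are made up for illustration
and are not part of JTidy. So that the sketch runs without the JTidy
jar, it builds the tree from a well-formed string with the JDK's own
XML parser; with JTidy you would get the Document from
Tidy.parseDOM() instead, as the comment in main() indicates. Skipping
script and style elements is my own assumption about what "only the
text" should mean.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Hypothetical helper; the names are illustrative, not part of JTidy. */
final class DomText {

    /** Returns the text under a node, using only basic DOM calls. */
    static String textOf(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        // Collapse the whitespace left over from markup indentation.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // A trailing space keeps adjacent blocks from running together.
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            // Assumption: script and style bodies are not wanted as text.
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    /** Stand-in for JTidy: parses markup that is already well-formed. */
    static Document parseWellFormed(String xhtml) {
        try {
            byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("input was not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // With JTidy on the classpath, the Document would instead come from:
        //   InputStream in = new java.net.URL("http://www.cnn.com/").openStream();
        //   Document doc = new org.w3c.tidy.Tidy().parseDOM(in, null);
        Document doc = parseWellFormed(
                "<html><body><h1>Headline</h1><p>Some <b>bold</b> text</p></body></html>");
        System.out.println(textOf(doc));
    }
}
```

The walk sticks to getNodeType(), getNodeValue(), getFirstChild() and
getNextSibling(), so it does not depend on newer DOM features that a
given JTidy release may or may not implement. For a real news page you
would probably start from a narrower element than the whole document.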

Yours,

-Sami

-- 
lempinen@iki.fi http://www.iki.fi/lempinen/
ICQ:19002710  *************  apt-get a life
Received on Tuesday, 28 November 2000 01:09:24 GMT