- From: Sami Lempinen <lempinen@iki.fi>
- Date: Tue, 28 Nov 2000 08:09:14 +0200
- To: #VIKRAM BALKRISHNAN NATARAJAN# <U903506@ntu.edu.sg>
- Cc: html-tidy@w3.org
Greetings,

[cc'd to the list]

On Tue, Nov 28, 2000 at 11:49:13AM +0800, #VIKRAM BALKRISHNAN NATARAJAN# wrote:
> Thanks a lot for your prompt reply.
> I wanted to ask you a few fundamental questions before I can start using
> JTidy to know that I am on the right track.
>
> 1: Can JTidy be easily used with my java program to parse and structure HTML
> pages.

Yes, although it depends on your definition of "easily": you have to know
a bit of DOM to do this. See my article at
<http://lempinen.net:8180/Forum/975361475/> for an introduction. The
project documentation at <http://sourceforge.net/docman/?group_id=13153>
contains more code fragments.

> 2: Other than parsing can JTidy be used to retrieve only the text from say a
> web page of www.cnn.com i.e. retrieve the text of a news site article.

I would do this *after* parsing: first open a stream from a URL, pass the
stream to JTidy and extract the DOM tree. Then use the DOM to extract the
textual contents.

Yours,
-Sami

-- 
lempinen@iki.fi http://www.iki.fi/lempinen/ ICQ:19002710
************* apt-get a life
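
A rough sketch of the second step described in the reply above (walking a DOM tree to collect its text) might look like the following. It is an illustration under stated assumptions, not code from the article or the project documentation. The class name `TextExtractor` and its methods are hypothetical. So that the example runs with the standard Java library alone, the demo builds its DOM tree from a small well-formed string; with JTidy on the classpath, the tree would instead come from something like `new org.w3c.tidy.Tidy().parseDOM(url.openStream(), null)`, as noted in the comments.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: collects the text content of a DOM tree.
class TextExtractor {

    // Returns all text below root, with runs of whitespace collapsed.
    static String extractText(Node root) {
        StringBuilder out = new StringBuilder();
        collect(root, out);
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // A space after each text node keeps adjacent blocks apart;
            // this can also add a space inside inline markup.
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            // Assumption: script and style contents are not wanted as text.
            if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
                return;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    // Stand-in for the JTidy step so the demo needs no extra library.
    // With JTidy available, the Document would instead come from:
    //   new org.w3c.tidy.Tidy().parseDOM(new java.net.URL(address).openStream(), null)
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><head><title>News</title></head>"
                + "<body><h1>Headline</h1><p>Story text.</p></body></html>");
        System.out.println(extractText(doc));
    }
}
```

For a real news page, the walk would probably start at the element that holds the article body rather than at the document root, since the root also yields the title, navigation and other page furniture.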
Received on Tuesday, 28 November 2000 01:09:24 UTC