
Re: your mail

From: Russell Gold <russgold@acm.org>
Date: Tue, 28 Nov 2000 07:07:04 -0500
Message-Id: <v0311070ab6494fa87e43@[207.106.60.109]>
To: Sami Lempinen <lempinen@iki.fi>, #VIKRAM BALKRISHNAN NATARAJAN# <U903506@ntu.edu.sg>
Cc: html-tidy@w3.org

At 1:09 AM -0500 11/28/00, Sami Lempinen wrote:
>[cc'd to the list]
>
>On Tue, Nov 28, 2000 at 11:49:13AM +0800, #VIKRAM BALKRISHNAN NATARAJAN# wrote:
>
>> 1: Can JTidy be easily used with my Java program to parse and structure HTML
>> pages?
>
>Yes, although it depends on your definition of "easily": you have to
>know a bit of DOM to do this. See my article at
>
>     <http://lempinen.net:8180/Forum/975361475/>
>
>for an introduction. The project documentation at
>
>    <http://sourceforge.net/docman/?group_id=13153>
>
>contains more code fragments.

I would add that HttpUnit does exactly this, using JTidy. It is open source, so you can either use it as is or study it for examples of how to do this with JTidy:

    <http://httpunit.sourceforge.net>

>
>> 2: Other than parsing, can JTidy be used to retrieve only the text from, say, a
>> web page of www.cnn.com, i.e. retrieve the text of a news site article?
>
>I would do this *after* parsing: first open a stream from a URL, pass
>the stream to JTidy and extract the DOM tree. Then, use the DOM to
>extract the textual contents.

It's a bit tricky, since you have to accumulate the contents of many DOM nodes, but once again, HttpUnit provides methods to do this.
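
To give a rough idea of what "accumulating the contents of many DOM nodes" involves, here is a minimal sketch that uses only the standard org.w3c.dom interfaces. It is not HttpUnit's code. The class and method names (DomTextExtractor, parse, extractText) are made up for illustration, and the JDK's XML parser stands in for JTidy so that the example runs without extra jars. In a real program the Document would presumably come from JTidy instead (for example via Tidy.parseDOM on a stream opened from the URL), since the JDK parser only accepts well-formed markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DomTextExtractor {

    // Stand-in for the JTidy step (an assumption for this sketch):
    // parses already well-formed markup with the JDK's own parser.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    // Walks the tree below root and returns its text with whitespace collapsed.
    static String extractText(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    private static void collect(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // A space after each text node keeps neighbouring blocks from
            // running together; it may add a stray space around inline tags.
            sb.append(node.getNodeValue()).append(' ');
            return;
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            // Script and style bodies are not readable page text, so skip them.
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, sb);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><head><style>p{}</style></head>"
                + "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>");
        System.out.println(extractText(doc));
    }
}
```

The recursion visits every node once and keeps only text and CDATA content. Deciding which elements to skip, and where to put spaces or line breaks, is the tricky part and depends on the pages you are reading.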

------------------------------------------------------------------------
Russell Gold                     | "... society is tradition and order
russgold@acm.org    (preferred)  | and reverence, not a series of cheap
russgold@netaxs.com              | bargains between selfish interests."
rgold@thesycamoregroup.com       |   - Poul Anderson, "Iron"
Received on Tuesday, 28 November 2000 09:00:35 GMT
