W3C home > Mailing lists > Public > html-tidy@w3.org > October to December 2000


From: Philippe QUERREC <pquerrec@cosmosbay.com>
Date: Tue, 28 Nov 2000 16:29:06 +0100
Message-ID: <704251321BC8D311AC37525400E861421504C9@graham.cosmosbay.com>
To: "'html-tidy@w3.org'" <html-tidy@w3.org>
When we extract a DOM tree of a stream from an URL, we can't use it with
Xerces, because we have org.w3c.tidy.DOMElementImpl element instead of Node
element. Is there any cast available in any code fragments ?
Actually, I write the DOM on disk, and re-open it as a XML file, it works
but use too many i/o instructions.

Thanks for your help
-----Message d'origine-----
De : html-tidy-request@w3.org [mailto:html-tidy-request@w3.org]De la
part de Russell Gold
Envoyé : mardi 28 novembre 2000 13:07
Cc : html-tidy@w3.org
Objet : Re: your mail

At 1:09 AM -0500 11/28/00, Sami Lempinen wrote:
>[cc'd to the list]
>On Tue, Nov 28, 2000 at 11:49:13AM +0800, #VIKRAM BALKRISHNAN NATARAJAN#
>> 1: Can JTidy be easily used with my java program to parse and structure
>> pages. 
>Yes (although it depends on your definition of "easily" -- you have to
>know a bit of DOM to do this. See my article at 
>     <http://lempinen.net:8180/Forum/975361475/>
>for an introduction. The project documentation at
>    <http://sourceforge.net/docman/?group_id=13153>
>contains more code fragments.

I would add that HttpUnit does exactly this, using JTidy. It is open source
so that you can either use it as is, or study it for examples of how to do
this with JTidy:
>> 2: Other than parsing can JTidy be used to retrieve only the text from
say a
>> web page of www.cnn.com i.e. retrieve the text of a news site article.
>I would do this *after* parsing: first open a stream from an URL, pass
>the stream to JTidy and extract the DOM tree. Then, use the DOM to
>extract the textual contents.
 It's a bit tricky, since you have to accumulate the contents of many DOM
 nodes, but once again, HttpUnit provides methods to do this.
 Russell Gold                     | "... society is tradition and order
 russgold@acm.org    (preferred)  | and reverence, not a series of cheap
 russgold@netaxs.com              | bargains between selfish interests."
 rgold@thesycamoregroup.com       |   - Poul Anderson, "Iron"
Received on Tuesday, 28 November 2000 10:30:16 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 21:38:49 UTC