- From: Philippe QUERREC <pquerrec@cosmosbay.com>
- Date: Tue, 28 Nov 2000 16:29:06 +0100
- To: "'html-tidy@w3.org'" <html-tidy@w3.org>
When we extract a DOM tree of a stream from an URL, we can't use it with Xerces, because we have org.w3c.tidy.DOMElementImpl element instead of Node element. Is there any cast available in any code fragments ? Actually, I write the DOM on disk, and re-open it as a XML file, it works but use too many i/o instructions. Thanks for your help Pils -----Message d'origine----- De : html-tidy-request@w3.org [mailto:html-tidy-request@w3.org]De la part de Russell Gold Envoyé : mardi 28 novembre 2000 13:07 À : Sami Lempinen; #VIKRAM BALKRISHNAN NATARAJAN# Cc : html-tidy@w3.org Objet : Re: your mail At 1:09 AM -0500 11/28/00, Sami Lempinen wrote: >[cc'd to the list] > >On Tue, Nov 28, 2000 at 11:49:13AM +0800, #VIKRAM BALKRISHNAN NATARAJAN# wrote: > >> 1: Can JTidy be easily used with my java program to parse and structure HTML >> pages. > >Yes (although it depends on your definition of "easily" -- you have to >know a bit of DOM to do this. See my article at > > <http://lempinen.net:8180/Forum/975361475/> > >for an introduction. The project documentation at > > <http://sourceforge.net/docman/?group_id=13153> > >contains more code fragments. I would add that HttpUnit does exactly this, using JTidy. It is open source so that you can either use it as is, or study it for examples of how to do this with JTidy: <http://httpunit.sourceforge.net> > >> 2: Other than parsing can JTidy be used to retrieve only the text from say a >> web page of www.cnn.com i.e. retrieve the text of a news site article. > >I would do this *after* parsing: first open a stream from an URL, pass >the stream to JTidy and extract the DOM tree. Then, use the DOM to >extract the textual contents. It's a bit tricky, since you have to accumulate the contents of many DOM nodes, but once again, HttpUnit provides methods to do this. ------------------------------------------------------------------------ Russell Gold | "... society is tradition and order russgold@acm.org (preferred) | and reverence, not a series of cheap russgold@netaxs.com | bargains between selfish interests." rgold@thesycamoregroup.com | - Poul Anderson, "Iron"
Received on Tuesday, 28 November 2000 10:30:16 UTC