
Re: HELP!! (parsing HTML with Java)

From: Dan Connolly <connolly@w3.org>
Date: Mon, 24 Jul 2000 16:48:34 -0500
Message-ID: <397CB9B2.D1764F1E@w3.org>
To: "Simon St.Laurent" <simonstl@simonstl.com>
CC: Christian Stone <chris.stone@virgin.net>, www-html@w3.org

"Simon St.Laurent" wrote:
> At 09:53 PM 7/24/00 +0100, Christian Stone wrote:
> >Does anybody out there in the ether have any suggestions about where I
> >can get some information on how to use the HTML parser in Java?
> >
> >I am trying to parse an HTML page and then iterate over the
> >parse tree to extract all the <a> tags to create a table of links.
> I don't know how much documentation is included, but David Brownell has a
> tool that lets you use the Java Swing HTML parser to generate
> XML-parser-like SAX events, which would at least get you into a
> well-documented parsing environment.

A similar approach is to use the Tidy Java Bean.
It seems to have reasonable documentation and
it seems to be actively maintained:

Java HTML Tidy
Updated 22 Jul 2000

<- Andy Quick
<- HTML Tidy
<- HTML Home page

> See:
> http://home.pacbell.net/david-b/xml/
> It's in the SAX2 Utilities package.
> Information on the SAX2 API is at:
> http://www.megginson.com/SAX/
> You could collect all the a elements and their attributes in the
> startElement method of your ContentHandler.
> I hope that helps...
> Simon St.Laurent
> XML Elements of Style / XML: A Primer, 2nd Ed.
> http://www.simonstl.com - XML essays and books
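
To make the startElement suggestion above concrete, here is a minimal sketch. It is an illustration only, not code from David Brownell's package: the class name LinkCollector and its collect method are made up for this example, and it assumes the page is already well-formed XML (for instance, output from Tidy), so the JDK's own SAX parser stands in for the Swing-based one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of SAX or the SAX2 Utilities.
class LinkCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so the tag name arrives in qName.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                hrefs.add(href); // skip named anchors that carry no href
            }
        }
    }

    // Parses a well-formed (X)HTML string and returns the href values in document order.
    static List<String> collect(String xhtml) throws Exception {
        LinkCollector handler = new LinkCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.hrefs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href='a.html'>A</a>"
                + "<p><a href='b.html'>B</a></p><a name='x'>anchor</a></body></html>";
        for (String href : collect(page)) {
            System.out.println(href);
        }
    }
}
```

Real-world HTML is rarely well-formed, which is why a front end such as the Swing parser bridge or Tidy is needed before a handler like this can be used; the handler itself would stay the same.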

Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Monday, 24 July 2000 17:49:35 UTC