Re: XML Parse from Dave Raggett on 2000-07-24 (html-tidy@w3.org from July to September 2000)

From: Dave Raggett <dsr@w3.org>
Date: Mon, 24 Jul 2000 11:59:32 +0100 (GMT Daylight Time)
To: "Dickey, Will" <wdickey@gettuit.com>
cc: "'html-tidy@w3.org'" <html-tidy@w3.org>, gerald@w3.org
Message-ID: <Pine.WNT.4.10.10007241156030.-775019@hazel.hpl.hp.com>

On Fri, 21 Jul 2000, Dickey, Will wrote:

> Hello.  I would like to parse the results of a tidy operation
> into a DOM. I'm not sure if this is possible, and it apparently
> is not with MSXML, as it raises numerous errors on any HTML
> document I tidy and then try to parse.
> 
> Is my premise wrong - parsing HTML into an XML DOM can't be
> done, or am I using the wrong parser?  Any help would be greatly
> appreciated.

The simplest approach is to use Tidy to clean up the markup and
convert it into well-formed XML, and then pass the result to an
off-the-shelf XML tool, e.g. the IBM Java toolkit for XML.
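
As a rough illustration of that two-step approach, here is a minimal
sketch in Python. It assumes the markup has already been run through
Tidy with XML output requested (for example `tidy -asxml`), and uses
the standard library's xml.dom.minidom as a stand-in for whichever
XML parser you prefer. The TIDIED string and the two helper functions
are hypothetical names made up for the example; the string is
hand-written and not real Tidy output.

```python
from xml.dom import minidom

# Hypothetical stand-in for Tidy's XML output (e.g. from `tidy -asxml`).
# A real program would read the file that Tidy wrote instead.
TIDIED = """<html>
<head><title>Example</title></head>
<body>
<p>First paragraph</p>
<p>Second <b>paragraph</b></p>
</body>
</html>"""


def collect_text(node):
    """Concatenate all text nodes below `node`, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(collect_text(child) for child in node.childNodes)


def paragraph_texts(xml_text):
    """Parse well-formed XML into a DOM and return the text of each <p>."""
    dom = minidom.parseString(xml_text)
    return [collect_text(p) for p in dom.getElementsByTagName("p")]


if __name__ == "__main__":
    print(paragraph_texts(TIDIED))
```

If the parser rejects the input at this stage, the input is not
well-formed XML, and the place to look is the Tidy step rather than
the DOM code.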

Alternatively, you could add code to Tidy to do what you want.
Tidy provides a simple interface for walking markup trees. It
doesn't conform to the DOM, but this is hardly surprising given
that work on Tidy started before the DOM.
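
To give a feel for what walking a markup tree without the DOM can
look like, here is a small sketch in Python. It is not Tidy's actual
code or API (Tidy itself is written in C). The Node class, its
`content` and `next` links, and the walk and build_example functions
are all hypothetical, chosen only to show a first-child/next-sibling
traversal.

```python
class Node:
    """Hypothetical markup-tree node using first-child/next-sibling links."""

    def __init__(self, element, text=None):
        self.element = element  # tag name, or None for a text node
        self.text = text        # text content for text nodes
        self.content = None     # first child
        self.next = None        # next sibling

    def append(self, child):
        """Add `child` after the last existing child and return it."""
        if self.content is None:
            self.content = child
        else:
            last = self.content
            while last.next is not None:
                last = last.next
            last.next = child
        return child


def walk(node, depth=0, out=None):
    """Depth-first walk; returns a list of (depth, label) pairs."""
    if out is None:
        out = []
    label = node.element if node.element is not None else repr(node.text)
    out.append((depth, label))
    child = node.content
    while child is not None:
        walk(child, depth + 1, out)
        child = child.next
    return out


def build_example():
    """Build the tree for <html><body><p>Hello</p></body></html>."""
    html = Node("html")
    body = html.append(Node("body"))
    p = body.append(Node("p"))
    p.append(Node(None, "Hello"))
    return html


if __name__ == "__main__":
    for depth, label in walk(build_example()):
        print("  " * depth + label)
```

A walker of this kind is enough to emit or transform markup, but
code written against the DOM interfaces would need an adapter layer
on top of it.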

Regards,

-- Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
tel/fax: +44 122 578 3011 (or 2521) +44 778 532 0444 (mobile)
World Wide Web Consortium (on assignment from HP Labs)

Received on Monday, 24 July 2000 06:59:39 UTC