
Re: to XML, not XHTML

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 30 Aug 2001 03:27:06 +0200
To: "Matt G" <mattg@vguild.com>
Cc: <html-tidy@w3.org>
Message-ID: <hl4rot0lcmqcfsotrnq0l78sld10urum5s@4ax.com>
* Matt G wrote:
>Is there a way to force Tidy to ignore "HTML good/bad-ness" and only convert
>badly formed HTML into well-formed XML (which should be much more
>efficient)?

No, and there won't be such an option.

>Or is there another utility (COM interface preferred,
>command-line okay, no GUI allowed) that will do this?

The Gnome XML library is able to do so, see http://xmlsoft.org/

>I don't care about producing good HTML/XHTML, all I need is to produce
>something I can shove into an XML parser and use XPath/XSLT to extract data.

The mentioned library comes with all you need to do so.
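
To illustrate the general idea (tag soup in, well-formed XML out, then
path queries on the result), here is a minimal sketch. It does not use
the Gnome XML library; it swaps in Python's standard html.parser and
xml.etree.ElementTree modules so that it runs without extra packages.
The names SoupToTree and html_to_xml and the synthetic "root" wrapper
element are made up for this example. It applies no HTML rules beyond
treating void elements such as br as empty, so the resulting nesting
may differ from what a browser would build, and attribute names that
are not legal XML names are not repaired.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Hypothetical helper: build a well-formed tree from tag soup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []              # stack of currently open elements
        self.builder.start("root", {})   # synthetic single root (assumption)

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input checked>) get their name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        self.builder.start(tag, attrib)
        if tag in VOID:
            self.builder.end(tag)        # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return                       # stray close tag: ignore it
        # Close everything that was left open inside the matching element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def to_tree(self):
        self.close()                     # flush any buffered text
        while self.open_tags:            # close whatever is still open
            self.builder.end(self.open_tags.pop())
        self.builder.end("root")
        return self.builder.close()


def html_to_xml(html):
    """Return an ElementTree root element for arbitrary HTML input."""
    parser = SoupToTree()
    parser.feed(html)
    return parser.to_tree()


if __name__ == "__main__":
    root = html_to_xml("<p>One<p>Two <b>bold<br><a href=x.html>link</p>")
    # Serialized form is well-formed XML that any XML parser accepts.
    print(ET.tostring(root, encoding="unicode"))
    # ElementTree supports a limited XPath subset for extraction.
    for a in root.findall(".//a"):
        print(a.get("href"), a.text)
```

For full XPath and XSLT support, the serialized output can be handed
to any XML toolkit.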

>It will be used by automation scripts and robots.

There is also HTML::TreeBuilder for Perl, but it does care about some
HTML flaws.
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Received on Wednesday, 29 August 2001 21:27:46 GMT
