Re: to XML, not XHTML

From: Klaus Johannes Rusch <KlausRusch@atmedia.net>
Date: Wed, 29 Aug 2001 19:39:01 CET
Message-Id: <200108291838.OAA03919@tux.w3.org>
To: <html-tidy@w3.org>
In <003501c1305a$da510fe0$6703a8c0@nb100>, "Matt G" <mattg@vguild.com> writes:
> Yes, but XML isn't XHTML. Understand?
> The following is not valid XHTML. It *is* valid XML.
> <input><form /><foobar /><tr /></input>
> I need to turn really bad HTML into parse-able XML at any cost; that the
> result may be complete gibberish with respect to the XHTML DTD's is of no
> concern.

Try the HTML::TreeBuilder Perl module. It reads an HTML page, builds a tree representation,
and outputs HTML (as_HTML method) or XML (as_XML method, experimental according to the documentation).

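use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
do {
    local $/ = undef;               # slurp mode: read the whole input in one go
    $tree->parse(scalar <STDIN>);   # assumption: this line was lost in the archive; read HTML from standard input
};
$tree->eof;                         # tell the parser the document is complete
print $tree->as_XML, "\n";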

The output should be parseable XML.
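
To confirm that, one option is to run the result through an XML parser. What follows is only a minimal sketch, assuming the XML::Parser module is installed (it is not mentioned above). The helper name is_well_formed and the second sample string are made up for illustration; the first sample is the fragment quoted earlier in this thread.

```perl
use strict;
use warnings;
use XML::Parser;

# Return 1 if $xml is well-formed XML, otherwise 0.
# XML::Parser's parse method dies on the first well-formedness
# error, so the call is wrapped in eval.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

# First sample: the fragment from the quoted message.
# Second sample: a hypothetical broken fragment (unclosed <form>).
my @samples = (
    '<input><form /><foobar /><tr /></input>',
    '<input><form></input>',
);

for my $doc (@samples) {
    my $verdict = is_well_formed($doc) ? 'well-formed' : 'not well-formed';
    print "$verdict: $doc\n";
}
```

In practice the string returned by as_XML would be passed to is_well_formed in place of the samples. This only checks well-formedness, not validity against any DTD, which matches the stated goal.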

Klaus Johannes Rusch
Received on Wednesday, 29 August 2001 14:38:11 UTC
