Re: to XML, not XHTML

From: Klaus Johannes Rusch <KlausRusch@atmedia.net>
Date: Wed, 29 Aug 2001 19:39:01 CET
Message-Id: <200108291838.OAA03919@tux.w3.org>
To: <html-tidy@w3.org>
In <003501c1305a$da510fe0$6703a8c0@nb100>, "Matt G" <mattg@vguild.com> writes:
> Yes, but XML isn't XHTML. Understand?
> 
> The following is not valid XHTML. It *is* valid XML.
> 
> <input><form /><foobar /><tr /></input>
> 
> I need to turn really bad HTML into parse-able XML at any cost; that the
> result may be complete gibberish with respect to the XHTML DTD's is of no
> concern.

Try the HTML::TreeBuilder Perl module. It reads an HTML page, builds a tree representation,
and outputs HTML (as_HTML method) or XML (as_XML method, experimental according to the documentation).


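use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
{
    # Slurp mode: read the whole input (STDIN or the file named on the command line) at once.
    local $/ = undef;
    $tree->parse(scalar <>);
}
$tree->eof;       # signal end of input so the tree is completed
print $tree->as_XML, "\n";
$tree->delete;    # release the tree's memory when done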

The output should be parseable XML.
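
If you want to confirm that, one option is to feed the result to an XML parser and see whether it complains. The following is only a minimal sketch: it assumes the XML::Parser module (an expat wrapper from CPAN) is installed, and the helper name is_well_formed is made up for illustration. It checks well-formedness only and does no validation against any DTD, which matches what was asked for.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if $xml is well-formed XML, 0 otherwise.
sub is_well_formed {
    my ($xml) = @_;
    # XML::Parser dies on the first well-formedness error, so trap it with eval.
    return eval { XML::Parser->new->parse($xml); 1 } ? 1 : 0;
}

# Matt's example: meaningless as XHTML, but well-formed XML.
print is_well_formed('<input><form /><foobar /><tr /></input>') ? "ok\n" : "not ok\n";

# Typical tag soup: <br> and <p> are never closed, so this is not well-formed.
print is_well_formed('<p>line one<br>line two') ? "ok\n" : "not ok\n";
```

In practice you would pass the string returned by $tree->as_XML to the helper instead of the literal strings used here.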

-- 
Klaus Johannes Rusch
KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/
Received on Wednesday, 29 August 2001 14:38:11 GMT
