Re: to XML, not XHTML from Matt G on 2001-08-29 (html-tidy@w3.org from July to September 2001)

From: Matt G <mattg@vguild.com>
Date: Wed, 29 Aug 2001 01:18:49 -0600
To: "Jelks Cabaniss" <jelks@jelks.nu>, <html-tidy@w3.org>
Message-ID: <003501c1305a$da510fe0$6703a8c0@nb100>

Yes, but XML isn't XHTML. Understand?

The following is not valid XHTML. It *is* valid XML.

<input><form /><foobar /><tr /></input>

I need to turn really bad HTML into parse-able XML at any cost; that the
result may be complete gibberish with respect to the XHTML DTD's is of no
concern.

I am using TidyCOM.

I have sucessfully accomplished HTML=>XML using the Trident (IE) web browser
control (shdocvw.dll) and iterating the HTML DOM tree. The problem with this
method is that it is extremely slow and processor intensive, and completely
unsuitable for server-side automated robots.

    Matt

----- Original Message -----
From: "Jelks Cabaniss" <jelks@jelks.nu>
To: <html-tidy@w3.org>
Sent: Wednesday, August 29, 2001 12:26 AM
Subject: RE: to XML, not XHTML

Matt G wrote:

> Is their a way to force Tidy to ignore "HTML good/bad-ness"
> and only convert badly formed HTML into well-formed XML
> (which should be much more efficient). Or is there another
> utility (COM interface preferred, command-line okay, no GUI
> allowed) that will do this?
>
> I don't care about producing good HTML/XHTML, all I need is
> to produce something I can shove into an XML parser and use
> XPath/XSLT to extract data. It will be used by automation
> scripts and robots.

XHTML *is* well-formed XML.

As to a Tidy COM interface, see

http://perso.wanadoo.fr/ablavier/TidyCOM/

/Jelks

Received on Wednesday, 29 August 2001 03:19:07 UTC