- From: Matt G <mattg@vguild.com>
- Date: Wed, 29 Aug 2001 01:18:49 -0600
- To: "Jelks Cabaniss" <jelks@jelks.nu>, <html-tidy@w3.org>
Yes, but XML isn't XHTML. Understand? The following is not valid XHTML. It *is* well-formed XML.

<input><form /><foobar /><tr /></input>

I need to turn really bad HTML into parseable XML at any cost; that the result may be complete gibberish with respect to the XHTML DTDs is of no concern.

I am using TidyCOM.

I have successfully accomplished HTML=>XML using the Trident (IE) web browser control (shdocvw.dll) and iterating the HTML DOM tree. The problem with this method is that it is extremely slow and processor-intensive, and completely unsuitable for server-side automated robots.

Matt

----- Original Message -----
From: "Jelks Cabaniss" <jelks@jelks.nu>
To: <html-tidy@w3.org>
Sent: Wednesday, August 29, 2001 12:26 AM
Subject: RE: to XML, not XHTML

Matt G wrote:

> Is there a way to force Tidy to ignore "HTML good/bad-ness"
> and only convert badly formed HTML into well-formed XML
> (which should be much more efficient). Or is there another
> utility (COM interface preferred, command-line okay, no GUI
> allowed) that will do this?
>
> I don't care about producing good HTML/XHTML, all I need is
> to produce something I can shove into an XML parser and use
> XPath/XSLT to extract data. It will be used by automation
> scripts and robots.

XHTML *is* well-formed XML.

As to a Tidy COM interface, see http://perso.wanadoo.fr/ablavier/TidyCOM/

/Jelks
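The conversion asked for above (tag soup in, well-formed XML out, with no regard for the XHTML DTDs) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not TidyCOM or Tidy itself; `SoupToXml` and `soup_to_xml` are hypothetical names invented for this example. It assumes tag and attribute names are already legal XML names and that the text contains no characters XML forbids, so truly hostile input would need more cleanup.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


class SoupToXml(HTMLParser):
    """Hypothetical helper: re-emits tag soup as well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        # dict() drops duplicate attribute names, which XML forbids.
        # Attributes without a value (e.g. "checked") become "".
        attr_text = "".join(
            f" {name}={quoteattr(value or '')}"
            for name, value in dict(attrs).items()
        )
        self.out.append(f"<{tag}{attr_text}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag with no matching open tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Re-escape &, < and > so the text is safe inside XML.
        self.out.append(escape(data))


def soup_to_xml(html):
    """Return a well-formed XML string wrapped in a single <root> element."""
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    # Close whatever the input left open.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "<root>" + "".join(parser.out) + "</root>"


if __name__ == "__main__":
    xml = soup_to_xml("<p>one<b>two</p><i>three")
    root = ET.fromstring(xml)         # parses, so the output is well-formed
    print(xml)
    print(root.find(".//b").text)     # XPath-style lookup on the result
```

The result carries no HTML semantics at all: unclosed elements such as `<br>` simply swallow whatever follows until an ancestor closes, which is acceptable only because the stated goal is something an XML parser and XPath/XSLT will accept.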
Received on Wednesday, 29 August 2001 03:19:07 UTC