- From: <fe.sola@infomed.sld.cu>
- Date: Sat, 12 Jul 2003 00:23:14 -0400
- To: html-tidy@w3.org
Hello list,

I'm currently working on a project where I need to convert HTML files into well-formed XML so that I can extract information from them. I have been using TidyCOM and the C# wrapper made by Matthew Stanfield. Unfortunately, after several attempts at adjusting the configuration file, the Tidy output file is still not well-formed XML.

I have also searched this list's archives for similar requests and solutions. I recall an email sent by Bjoern Hoehrmann:

// Matt G wrote:
//>Is their a way to force Tidy to ignore "HTML good/bad-ness" and only convert
//>badly formed HTML into well-formed XML (which should be much more
//>efficient).
//No and there won't be such an option.
//>Or is there another utility (COM interface preferred,
//>command-line okay, no GUI allowed) that will do this?
//The Gnome XML library is able to do so, see http://xmlsoft.org/
//>I don't care about producing good HTML/XHTML, all I need is to produce
//>something I can shove into an XML parser and use XPath/XSLT to extract data.
//The mentioned library comes with all you need to do so.

In my case the input HTML is already well formed. My option file looks like this:

add-xml-decl=yes
bare=yes
clean=yes
drop-font-tags=yes
drop-propietary-attributes=yes
indent=auto
indent-spaces=2
wrap=72
markup=yes
output-xml=yes
input-xml=no
show-warnings=yes
numeric-entities=yes
quote-marks=yes
quote-nbsp=yes
quote-ampersand=no
break-before-br=no
uppercase-tags=no
uppercase-attributes=no
smart-indent=no
output-xhtml=yes
char-encoding=latin1
join-styles=yes
word-2000=yes

but I still get several unclosed tags in the output files, such as the img and link tags.

Do I have to use the xmlsoft libraries to accomplish this, or can Tidy do the work? If Tidy can, can anyone tell me what is wrong with my configuration file?

Thanks a lot!
Lizet.

-------------------------------------------------
This message was sent using Infomed's webmail service
http://webmail.sld.cu
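
For anyone trying to reproduce the symptom described above, one way to confirm whether a given output is well-formed is to hand it to any XML parser. The following is only a minimal sketch in Python using the standard library. It is not part of Tidy or TidyCOM, and the helper name `is_well_formed` and the two sample strings are hypothetical, chosen to mirror the unclosed img and link tags mentioned in the message.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample: HTML-style empty elements left unclosed.
unclosed = (
    "<html><head><link rel='stylesheet' href='a.css'></head>"
    "<body><img src='a.png'></body></html>"
)

# The same markup with the empty elements self-closed, as XML requires.
closed = (
    "<html><head><link rel='stylesheet' href='a.css'/></head>"
    "<body><img src='a.png'/></body></html>"
)

if __name__ == "__main__":
    print("unclosed:", is_well_formed(unclosed))
    print("closed:  ", is_well_formed(closed))
```

Running the same check on the real output file would show quickly whether a change to the option file had any effect.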
Received on Saturday, 12 July 2003 00:29:09 UTC