HTML to XML

From: <fe.sola@infomed.sld.cu>
Date: Sat, 12 Jul 2003 00:23:14 -0400
Message-ID: <1057983794.3f0f8d3294f02@webmail.sld.cu>
To: html-tidy@w3.org

Hello list,
I'm currently working on a project where I need to convert HTML files into well-formed XML to 
extract information from them.
I have been using TidyCOM and the C# wrapper made by Matthew Stanfield. 
Unfortunately, even after several attempts at adjusting the configuration file, the Tidy 
output is still not well-formed XML.
I have also searched this list's archives for similar requests and solutions. I 
recall an email sent by Bjoern Hoehrmann:

// Matt G wrote:
//>Is there a way to force Tidy to ignore "HTML good/bad-ness" and only convert
//>badly formed HTML into well-formed XML (which should be much more
//>efficient).

//No and there won't be such an option.

//>Or is there another utility (COM interface preferred,
//>command-line okay, no GUI allowed) that will do this?

//The Gnome XML library is able to do so, see http://xmlsoft.org/

//>I don't care about producing good HTML/XHTML, all I need is to produce
//>something I can shove into an XML parser and use XPath/XSLT to extract data.

//The mentioned library comes with all you need to do so.
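
To make the goal in the quoted request concrete (well-formed XML that can be queried with XPath), here is a minimal Python sketch. It assumes the markup is already well formed; the sample document and the function name `image_sources` are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: well-formed, XHTML-like markup (note the self-closed tags).
SAMPLE = """<html>
  <head><link rel="stylesheet" href="site.css" /></head>
  <body><p>Logo: <img src="logo.png" alt="Logo" /></p></body>
</html>"""


def image_sources(xml_text):
    """Return the src attribute of every img element in the document."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset; ".//img" matches img at any depth.
    return [img.get("src") for img in root.findall(".//img")]


print(image_sources(SAMPLE))
```

If the real input declares the XHTML namespace on the root element, the element names in the query would need to be namespace-qualified.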


In my case I have well-formed HTML. My option file looks like this:

add-xml-decl=yes
bare=yes
clean=yes
drop-font-tags=yes
drop-proprietary-attributes=yes
indent=auto
indent-spaces=2
wrap=72
markup=yes
output-xml=yes
input-xml=no
show-warnings=yes
numeric-entities=yes
quote-marks=yes
quote-nbsp=yes
quote-ampersand=no
break-before-br=no
uppercase-tags=no
uppercase-attributes=no
smart-indent=no
output-xhtml=yes
char-encoding=latin1
join-styles=yes
word-2000=yes

but I still get several unclosed tags in the output files, such as the img and link tags.
Do I have to use the xmlsoft libraries to accomplish this, or can Tidy do the work? If 
Tidy can, could anyone tell me what is wrong with my configuration file?
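
For reference, a small sketch of how the symptom can be checked independently of Tidy: an XML parser rejects an unclosed img tag but accepts the self-closed form. This uses Python's standard library rather than TidyCOM, and the helper name `is_well_formed` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# An unclosed img element is fine in HTML but is not well-formed XML.
print(is_well_formed('<p><img src="a.png"></p>'))    # False
# The self-closed form is well formed.
print(is_well_formed('<p><img src="a.png" /></p>'))  # True
```

Running such a check on each output file would show which documents still contain unclosed elements.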

Thanks a lot!

Lizet.


Received on Saturday, 12 July 2003 00:29:09 UTC
