- From: Vaclav Barta <vbar@comp.cz>
- Date: Mon, 23 Jun 2008 10:22:12 +0200
- To: html-tidy@w3.org
Hi,

I'd like to convert some not-entirely-HTML to XML (so that I can scrape a tree without worrying about unpaired tags and other details), and HTML Tidy mostly does that, but...

Say I have the following tag soup:

    <html>
    <body>
    provede registraci online
    <span style="FONT-SIZE: 12pt; FONT-FAMILY: "Times New Roman"; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: CS; mso-fareast-language: CS; mso-bidi-language: AR-SA"><a href="http://www.ibm.com/services/servicepac"><strong>na adrese</strong></a></span>
    </body>
    </html>

(which is simplified from http://www.alza.cz/lenovo-thinkplus-service-pack-d94476.htm ). I run

    tidy -asxml

on it (where tidy is compiled from today's CVS) and get

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="generator" content=
    "HTML Tidy for Linux/x86 (vers 18 June 2008), see www.w3.org" />
    <title></title>
    </head>
    <body>
    provede registraci online <span style=
    "FONT-SIZE: 12pt; FONT-FAMILY:" times="" mso-fareast-font-family:=
    "Times" new="" mso-ansi-language:="" mso-fareast-language:=""
    mso-bidi-language:=""><a href=
    "http://www.ibm.com/services/servicepac"><strong>na
    adrese</strong></a></span>
    </body>
    </html>

This obviously isn't valid XHTML (and tidy knows that: it warns about proprietary attributes, yet insists on the doctype and namespace declarations), but it isn't even XML, because some of the synthesized attribute names end with a colon.

I admit the input isn't valid either, but I still think it should be manageable. What do people who know something about HTML Tidy think? Is the above a bug, or a feature request? :-)

Bye

Vasek
--
http://www.mangrove.cz/ Open Source integration
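P.S. One way to confirm that a namespace-aware XML parser rejects the output quoted above is the minimal Python sketch below. The helper name `is_well_formed` and the shortened `TIDY_OUTPUT` fragment are introduced here for illustration only; the check uses the standard library's `xml.etree.ElementTree`, not anything that ships with tidy.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <span> element from the tidy output quoted above
# (hypothetical test fixture; only a few of the synthesized attributes kept).
TIDY_OUTPUT = (
    '<span style="FONT-SIZE: 12pt; FONT-FAMILY:" times="" '
    'mso-fareast-font-family:="Times" new="" mso-ansi-language:="">'
    'na adrese</span>'
)


def is_well_formed(xml_text):
    """Return True if a namespace-aware parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised, for example, for attribute names that end with a colon.
        return False


if __name__ == "__main__":
    print(is_well_formed(TIDY_OUTPUT))
    print(is_well_formed('<span style="FONT-SIZE: 12pt">na adrese</span>'))
```

Running a check like this after tidy would at least let a scraper detect such output before trying to build a tree from it.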
Received on Monday, 23 June 2008 15:20:01 UTC