asxml produces invalid XML from Vaclav Barta on 2008-06-23 (html-tidy@w3.org from April to June 2008)

From: Vaclav Barta <vbar@comp.cz>
Date: Mon, 23 Jun 2008 10:22:12 +0200
To: html-tidy@w3.org
Message-Id: <200806231022.13919.vbar@comp.cz>

Hi,

I'd like to convert some not-entirely-HTML to XML (so that I can scrape a 
tree, without worrying about unpaired tags and other details), and HTML Tidy 
mostly does that, but... Say I have the following tag soup:

<html>
<body>
provede registraci online <span style="FONT-SIZE: 12pt; FONT-FAMILY: "Times 
New Roman"; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: 
CS; mso-fareast-language: CS; mso-bidi-language: AR-SA"><a 
href="http://www.ibm.com/services/servicepac"><strong>na 
adrese</strong></a></span>
</body>
</html>

(which is simplified from 
http://www.alza.cz/lenovo-thinkplus-service-pack-d94476.htm ). I run

tidy -asxml

on it (where tidy is compiled from today's CVS) and get

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 18 June 2008), see www.w3.org" />
<title></title>
</head>
<body>
provede registraci online <span style=
"FONT-SIZE: 12pt; FONT-FAMILY:" times="" mso-fareast-font-family:=
"Times" new="" mso-ansi-language:="" mso-fareast-language:=""
mso-bidi-language:=""><a href=
"http://www.ibm.com/services/servicepac"><strong>na
adrese</strong></a></span>
</body>
</html>

which obviously isn't valid XHTML (tidy knows that: it warns about
proprietary attributes, yet still insists on the doctype and namespace
declarations), and isn't even XML: some of the synthesized attribute names
end with a colon. I admit the input isn't valid either, but I still think it
should be manageable. What do people who know something about HTML Tidy
think? Is the above a bug, or a feature request? :-)
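
The colon problem can be checked mechanically. What follows is only a rough
Python sketch, not anything taken from Tidy itself. It assumes
namespace-aware processing (which the xmlns declaration in the output
implies), where an attribute name has to be a QName: either "local" or
"prefix:local", with both parts non-empty. The pattern is a deliberately
simplified, ASCII-only approximation of the real production, and the names
is_qname, span_attrs and bad_attrs are made up for illustration.

```python
import re

# Simplified, ASCII-only approximation of an NCName (a name without colons).
# The real production in "Namespaces in XML" also admits many non-ASCII
# characters; this is an assumption made to keep the sketch short.
NCNAME = r"[A-Za-z_][A-Za-z0-9._-]*"

# A QName is either "local" or "prefix:local"; both parts must be non-empty.
QNAME_RE = re.compile(rf"{NCNAME}(?::{NCNAME})?")


def is_qname(name: str) -> bool:
    """Return True if name looks like a namespace-legal attribute name."""
    return QNAME_RE.fullmatch(name) is not None


# Attribute names copied from the <span> in the Tidy output above.
span_attrs = [
    "style", "times", "mso-fareast-font-family:", "new",
    "mso-ansi-language:", "mso-fareast-language:", "mso-bidi-language:",
]

# Names that a namespace-aware parser would be expected to reject.
bad_attrs = [name for name in span_attrs if not is_qname(name)]
print(bad_attrs)
```

Under those assumptions the four names with a trailing colon are flagged,
while "style", "times" and "new" pass (they are still not XHTML attributes,
but they are at least legal names).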

	Bye
		Vasek
--
http://www.mangrove.cz/
Open Source integration

Received on Monday, 23 June 2008 15:20:01 UTC