- From: Vaclav Barta <vbar@comp.cz>
- Date: Mon, 23 Jun 2008 10:22:12 +0200
- To: html-tidy@w3.org
Hi,
I'd like to convert some not-entirely-HTML to XML (so that I can scrape a
tree, without worrying about unpaired tags and other details), and HTML Tidy
mostly does that, but... Say I have the following tag soup:
<html>
<body>
provede registraci online <span style="FONT-SIZE: 12pt; FONT-FAMILY: "Times
New Roman"; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language:
CS; mso-fareast-language: CS; mso-bidi-language: AR-SA"><a
href="http://www.ibm.com/services/servicepac"><strong>na
adrese</strong></a></span>
</body>
</html>
(which is simplified from
http://www.alza.cz/lenovo-thinkplus-service-pack-d94476.htm ). I run
tidy -asxml
on it (where tidy is compiled from today's CVS) and get
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 18 June 2008), see www.w3.org" />
<title></title>
</head>
<body>
provede registraci online <span style=
"FONT-SIZE: 12pt; FONT-FAMILY:" times="" mso-fareast-font-family:=
"Times" new="" mso-ansi-language:="" mso-fareast-language:=""
mso-bidi-language:=""><a href=
"http://www.ibm.com/services/servicepac"><strong>na
adrese</strong></a></span>
</body>
</html>
which is obviously not valid XHTML (tidy knows that: it warns about
proprietary attributes, yet still emits the XHTML doctype and namespace
declarations). Worse, it isn't even XML that a namespace-aware parser will
accept, because some of the synthesized attribute names end with a colon.
I admit the input isn't valid either, but I still think it should be
manageable. What do people who know HTML Tidy think? Is the above a bug,
or a feature request? :-)
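
For anyone who wants to reproduce the XML problem without my whole setup,
here is a minimal sketch. It assumes Python 3 and its standard
xml.etree.ElementTree module, whose expat-based parser does namespace
processing. The helper name is_well_formed and the two constants are just
illustrative, and the span is cut down from the tidy output above:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if ElementTree's namespace-aware parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Attribute list copied from the tidy output (note the trailing colons).
TIDY_SPAN = (
    '<span style="FONT-SIZE: 12pt; FONT-FAMILY:" times="" '
    'mso-fareast-font-family:="Times" new="" mso-ansi-language:="" '
    'mso-fareast-language:="" mso-bidi-language:="">na adrese</span>'
)

# Same element with the colon-terminated attributes dropped, for comparison.
CLEAN_SPAN = (
    '<span style="FONT-SIZE: 12pt; FONT-FAMILY:" times="" new="">'
    'na adrese</span>'
)

if __name__ == "__main__":
    print("tidy output accepted:", is_well_formed(TIDY_SPAN))
    print("without colon attributes:", is_well_formed(CLEAN_SPAN))
```

The comparison string only shows that the colon-terminated names are what
the parser objects to; it is not a proposal for what tidy should emit.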
Bye
Vasek
--
http://www.mangrove.cz/
Open Source integration
Received on Monday, 23 June 2008 15:20:01 UTC