
asxml produces invalid XML

From: Vaclav Barta <vbar@comp.cz>
Date: Mon, 23 Jun 2008 10:22:12 +0200
To: html-tidy@w3.org
Message-Id: <200806231022.13919.vbar@comp.cz>


I'd like to convert some not-entirely-HTML to XML (so that I can scrape a 
tree, without worrying about unpaired tags and other details), and HTML Tidy 
mostly does that, but... Say I have the following tag soup:

provede registraci online <span style="FONT-SIZE: 12pt; FONT-FAMILY: "Times 
New Roman"; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: 
CS; mso-fareast-language: CS; mso-bidi-language: AR-SA"><a 

(which is simplified from 
http://www.alza.cz/lenovo-thinkplus-service-pack-d94476.htm ). I run

tidy -asxml

on it (where tidy is compiled from today's CVS) and get

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<html xmlns="http://www.w3.org/1999/xhtml">
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 18 June 2008), see www.w3.org" />
provede registraci online <span style=
"FONT-SIZE: 12pt; FONT-FAMILY:" times="" mso-fareast-font-family:=
"Times" new="" mso-ansi-language:="" mso-fareast-language:=""
mso-bidi-language:=""><a href=

which obviously isn't valid XHTML (tidy knows that: it warns about 
proprietary attributes, yet still insists on the doctype and namespace 
declarations), and isn't even XML - some of the synthesized attribute names 
end with a colon. I admit the input isn't valid either, but I still think it 
should be manageable - what do people who know something about HTML Tidy 
think? Is the above a bug, or a feature request? :-)
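
To make the complaint concrete, here is a minimal sketch of the check a namespace-aware XML consumer effectively applies to attribute names. It is not Tidy's code. The pattern is a simplified, ASCII-only approximation of the QName production from Namespaces in XML, and the helper name is_qname is made up for illustration. The attribute names are copied from the output above.

```python
import re

# Simplified, ASCII-only approximation of NCName / QName
# (assumption: the real productions also allow many non-ASCII characters).
NCNAME = r"[A-Za-z_][A-Za-z0-9._-]*"
QNAME = re.compile(rf"{NCNAME}(?::{NCNAME})?")


def is_qname(name: str) -> bool:
    """Return True if name is either 'local' or 'prefix:local'."""
    return QNAME.fullmatch(name) is not None


# Attribute names as they appear in the tidy -asxml output quoted above.
names = [
    "style",
    "times",
    "mso-fareast-font-family:",
    "new",
    "mso-ansi-language:",
    "mso-fareast-language:",
    "mso-bidi-language:",
]

# Names with a trailing colon have a prefix but an empty local part.
bad = [n for n in names if not is_qname(n)]
print(bad)
```

Under these assumptions, the four names ending in a colon are the ones rejected, because each has a prefix with nothing after it. A downstream scraper could run a check like this to detect the problem, though the real fix would have to happen where Tidy builds the attribute names.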

Received on Monday, 23 June 2008 15:20:01 UTC
