Re: Study of XML quality on the Web

From: Michael Kay <mike@saxonica.com>
Date: Sat, 19 Nov 2011 22:31:08 +0000
Message-ID: <4EC82E2C.1040003@saxonica.com>
To: xmlschema-dev@w3.org, Jeni Tennison <jeni@jenitennison.com>
On 19/11/2011 17:27, Noah Mendelsohn wrote:
> I'd think readers of this list would be interested in a very nice 
> study of the quality of XML documents on the Web. [1] The study was 
> done by Steven Grijzenhout and Maarten Marx.
> Noah
> [1] 
> http://ilps.science.uva.nl/PoliticalMashup/uploads/2011/08/cikm2011-6pages.pdf

They don't say how they decided that a file is purporting to be XML: is 
it the file extension, the MIME type, the presence of an XML 
declaration, or what? The results depend heavily on how you distinguish 
a "non XML" document from a "bad XML" document, and this isn't really 
explained. One might ask: if an XML document isn't well-formed, it's 
not much use to anyone, so why put it on the web? 
My guess would be that many of these documents are (quasi-)XHTML. It 
would be nice to see a breakdown of these documents by vocabulary or 
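
To make the "non XML" versus "bad XML" distinction concrete, here is a rough sketch of one way a crawler might classify a resource from the three signals mentioned above (file extension, MIME type, XML declaration). This is not the method used in the study, which the paper does not spell out. The function names, the list of MIME types, and the rule "any one signal is enough" are all assumptions made for illustration; changing that rule would change the reported proportions.

```python
import xml.etree.ElementTree as ET

# Assumed list of media types that count as "claiming to be XML".
XML_MIME_TYPES = {"application/xml", "text/xml"}
UTF8_BOM = b"\xef\xbb\xbf"


def xml_signals(filename, mime_type, body):
    """Return the set of signals by which a resource purports to be XML."""
    signals = set()
    if filename.lower().endswith(".xml"):
        signals.add("extension")
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_type = mime_type.split(";")[0].strip().lower()
    if base_type in XML_MIME_TYPES or base_type.endswith("+xml"):
        signals.add("mime")
    start = body[len(UTF8_BOM):] if body.startswith(UTF8_BOM) else body
    if start.startswith(b"<?xml"):
        signals.add("declaration")
    return signals


def is_well_formed(body):
    """True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False


def classify(filename, mime_type, body):
    """Hypothetical policy: any single signal means 'purports to be XML'."""
    if not xml_signals(filename, mime_type, body):
        return "non-XML"
    return "well-formed XML" if is_well_formed(body) else "bad XML"
```

Under this policy, tag-soup HTML served as text/html is "non-XML", but the very same bytes served as application/xhtml+xml become "bad XML", which is the kind of effect a single misconfigured server could have on the totals.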

I'd be interested in a response from Jeni on why .gov.uk scores badly. I 
suspect there's a systematic and probably quite simple explanation. 
Could be as simple as one server serving HTML with an incorrect MIME type.

On XSD more specifically, I would observe that it's not an error to have 
an XML document with an xsi:schemaLocation that isn't resolvable. In 
fact, it's probably a quite common accident of the publishing process 
for documents that have been validated in the course of the publication 
workflow to end up in this state.
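
As a small illustration of that point, the sketch below parses a document whose xsi:schemaLocation points at a URL that cannot be resolved. The sample document, its namespace, and the helper name are made up for the example. Python's xml.etree.ElementTree only checks well-formedness and never dereferences the hint, so the parse succeeds and the hint can be read back as plain namespace/location pairs.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical document; the ".invalid" host can never be resolved.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<order xmlns="http://example.org/orders"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://example.org/orders http://example.invalid/orders.xsd">
  <item>widget</item>
</order>"""


def schema_location_hints(body):
    """Return (namespace, location) pairs from the root's xsi:schemaLocation."""
    root = ET.fromstring(body)  # well-formedness check only, no network access
    value = root.get("{%s}schemaLocation" % XSI_NS, "")
    tokens = value.split()
    # The attribute value is a whitespace-separated list of namespace/URI pairs.
    return list(zip(tokens[0::2], tokens[1::2]))


hints = schema_location_hints(DOC)
```

A study that counts an unresolvable location as a defect in the document is therefore measuring the state of the publishing workflow rather than the quality of the XML itself.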

One more observation: like many others, the authors talk of "XML on the 
web" to mean "visible XML on the public web". Most of the XML on the web 
sits in databases behind an application server that delivers the content 
as HTML. Hopefully that XML has rather better quality than is observed here.

Michael Kay
Received on Saturday, 19 November 2011 22:31:34 UTC