- From: Michael Kay <mike@saxonica.com>
- Date: Sat, 19 Nov 2011 22:31:08 +0000
- To: xmlschema-dev@w3.org, Jeni Tennison <jeni@jenitennison.com>
On 19/11/2011 17:27, Noah Mendelsohn wrote:
> I'd think readers of this list would be interested in a very nice
> study of the quality of XML documents on the Web. [1] The study was
> done by Steven Grijzenhout and Maarten Marx.
>
> Noah
>
> [1] http://ilps.science.uva.nl/PoliticalMashup/uploads/2011/08/cikm2011-6pages.pdf

Interesting. They don't say how they decided that a file purports to be XML: is it the file extension, the MIME type, the presence of an XML declaration, or what? The results depend heavily on how you distinguish a "non XML" document from a "bad XML" document, and this isn't really explained.

One might ask: if an XML document isn't well-formed, it's not much use to anyone, so why put it on the web? My guess would be that many of these documents are (quasi-)XHTML. It would be nice to see a breakdown of these documents by vocabulary or namespace.

I'd be interested in a response from Jeni on why .gov.uk scores badly. I suspect there's a systematic and probably quite simple explanation. It could be as simple as one server serving HTML with an incorrect MIME type.

On XSD more specifically, I would observe that it's not an error to have an XML document with an xsi:schemaLocation that isn't resolvable. In fact, it's probably a quite common accident of the publishing process for documents that have been validated in the course of the publication workflow to end up in this state.

One more observation: like many others, the authors talk of "XML on the web" to mean "visible XML on the public web". Most of the XML on the web sits in databases behind an application server that delivers the content as HTML. Hopefully that XML has rather better quality than is observed here.

Michael Kay
Saxonica
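To illustrate the distinction raised above between "non XML" and "bad XML", here is a minimal sketch of how a crawler might first decide that a document purports to be XML and then test it for well-formedness. The heuristics (MIME type, file extension, XML declaration) and the placeholder URL are assumptions for illustration only; the paper does not say which criteria, if any, were actually used.

    # Hypothetical sketch, not taken from the paper: classify a URL as
    # "not XML", "well-formed XML", or "bad XML".
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen
    from urllib.parse import urlparse

    XML_MIME_TYPES = {"text/xml", "application/xml"}   # plus any */*+xml type
    XML_EXTENSIONS = {".xml", ".xsd", ".rdf", ".rss", ".atom", ".svg", ".xhtml"}

    def purports_to_be_xml(url: str, content_type: str, body: bytes) -> bool:
        """Assumed heuristic: MIME type, file extension, or XML declaration."""
        mime = content_type.split(";")[0].strip().lower()
        if mime in XML_MIME_TYPES or mime.endswith("+xml"):
            return True
        path = urlparse(url).path.lower()
        if any(path.endswith(ext) for ext in XML_EXTENSIONS):
            return True
        return body.lstrip()[:5] == b"<?xml"

    def classify(url: str) -> str:
        with urlopen(url) as resp:
            body = resp.read()
            content_type = resp.headers.get("Content-Type", "")
        if not purports_to_be_xml(url, content_type, body):
            return "not XML"
        try:
            ET.fromstring(body)        # well-formedness check only, not validity
            return "well-formed XML"
        except ET.ParseError:
            return "bad XML"

    if __name__ == "__main__":
        print(classify("https://example.org/data.xml"))   # placeholder URL

Note that such a check says nothing about schema validity; as the message above points out, an unresolvable xsi:schemaLocation hint is not an error, so a document can be perfectly well-formed yet impossible to validate from the public web.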
Received on Saturday, 19 November 2011 22:31:34 UTC