Re: XML on the web

From: Liam R E Quin <liam@w3.org>
Date: Fri, 14 Sep 2012 13:20:06 -0400
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
Cc: public-xml-core-wg <public-xml-core-wg@w3.org>
Message-ID: <1347643206.560.33.camel@localhost.localdomain>

On Fri, 2012-09-14 at 10:15 +0100, Henry S. Thompson wrote:
> Does anyone have, or have a pointer to, a database of URIs of XML
> documents visible on the Web?  I don't have the time right now to
> find, install and configure a crawler, but if someone else has already
> done so. . .

The University of Amsterdam XML on the Web Corpus is available -- I
mentioned it in my Balisage talk at
http://www.w3.org/2012/Talks/08-quin-xml-web-corpus/

Be warned that you need to check the HTTP header MySQL table that's
included, because some (I think 25,000) of the files were not served with
an XML content type header and are mostly also not trying to be XML.

About 13% of the remaining files were not well-formed.
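
The two checks above could be sketched roughly as follows in Python. This is
only an illustration under assumptions: the schema of the header table is not
described here, so the sketch starts from a Content-Type string that has
already been read out of it, and the function names and the set of media
types treated as XML are hypothetical choices, not part of the corpus. The
well-formedness test uses the standard library's non-validating parser, which
rejects references to entities it has not seen declared, so it may
under-count documents that rely on an external DTD.

```python
import xml.etree.ElementTree as ET

# Assumption: these media types (plus any "+xml" suffix type) count as XML.
XML_TYPES = {"text/xml", "application/xml"}


def is_xml_content_type(header_value):
    """Return True if a Content-Type header value names an XML media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in XML_TYPES or media_type.endswith("+xml")


def is_well_formed(data):
    """Return True if the document parses without a well-formedness error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False
```

A file would then be kept only if is_xml_content_type() accepts its recorded
header, and counted as broken if is_well_formed() rejects its contents.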

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
Received on Friday, 14 September 2012 17:20:39 UTC
