Re: XML on the web

On Fri, 2012-09-14 at 10:15 +0100, Henry S. Thompson wrote:
> Does anyone have, or have a pointer to, a database of URIs of XML
> documents visible on the Web?  I don't have the time right now to
> find, install and configure a crawler, but if someone else has already
> done so. . .

The University of Amsterdam XML on the Web Corpus is available -- I
mentioned it in my Balisage talk at
http://www.w3.org/2012/Talks/08-quin-xml-web-corpus/

Be warned that you need to check the HTTP header MySQL table that's
included, because some (I think 25,000) of the files were not served with
an XML content type header and are mostly also not trying to be XML.

About 13% of the remaining files were not well-formed.
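
For anyone wanting to repeat the two checks above on their own copy, here
is a rough sketch in Python. It assumes you have already pulled each
file's Content-Type header value and its bytes out of the corpus yourself;
the function names are made up for illustration, and nothing here
reflects the actual schema of the MySQL table. The well-formedness test
uses the standard library's expat-based parser, which may not agree in
every case with whatever tool produced the 13% figure.

```python
import xml.etree.ElementTree as ET


def is_xml_media_type(content_type):
    """Return True if a Content-Type header value names an XML media type.

    Treats text/xml, application/xml and anything ending in +xml
    (e.g. application/rss+xml) as XML. Parameters such as charset
    are ignored.
    """
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/xml", "application/xml") or media_type.endswith("+xml")


def is_well_formed(data):
    """Return True if the bytes parse as well-formed XML with ElementTree."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


def classify(content_type, data):
    """Hypothetical helper: sort one file into one of three buckets."""
    if not is_xml_media_type(content_type):
        return "not-served-as-xml"
    if is_well_formed(data):
        return "well-formed"
    return "not-well-formed"
```

Calling classify() on each (header, bytes) pair and counting the three
results should give numbers comparable to the ones quoted above, with the
caveat about parser differences.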

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml

Received on Friday, 14 September 2012 17:20:39 UTC