
Re: Billion Triples Challenge Crawl 2014

From: Aidan Hogan <aidan.hogan@deri.org>
Date: Sat, 22 Feb 2014 18:52:19 -0300
Message-ID: <53091C13.9040708@deri.org>
To: semantic-web@w3.org

A little late to the party, but as a past collaborator of Andreas and 
Tobias, I'd just like to publicly extend my thanks to them for again 
crawling the BTC data. The data have been used in a wide variety of 
papers [1] and have led to important insights as to how research 
techniques perform against real-world, diverse data (moving away from 
the all-too-comfortable world of synthetic benchmarks or 
single-dataset corpora). Collecting the BTC data is not an easy job, 
and it is done with little personal benefit to those who organise it.

Unfortunately, for many "LOD datasets" there is a huge deficit between 
what is reported as available and what is actually available through 
crawls such as the BTC. Though the results are a bit old (referring to 
BTC 11), in [2] we wrote briefly about the reasons for the 
under-representation of some of the largest LOD datasets in the BTC. 
These datasets were hosted on rpi.edu, linkedgeodata.org, wright.edu, 
concordia.ca, rdfabout.com, unime.it, uriburner.com, openlibrary.org, 
sudoc.fr, viaf.org, europeana.eu, moreways.net, rkbexplorer.com, 
opencorporates.com, uberblic.org and geonames.org.  (For example, 
GeoNames bans most access through robots.txt.) Though it wasn't the 
focus of the paper, some more details are given in Section 4.3 of [2], 
where Tables 1 & 2 are most relevant.
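On the robots.txt point: publishers can check locally what a given 
exclusion rule means for a particular crawler user agent. The sketch 
below uses Python's standard urllib.robotparser; the host and the rules 
are hypothetical examples for illustration, not GeoNames' actual policy.

```python
# Sketch: test how a robots.txt exclusion affects a crawler user agent,
# using only Python's standard library. Host and rules are hypothetical.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the robots.txt lines directly, so no network
# request is needed to test a policy.
rp.parse([
    "User-agent: ldspider",
    "Disallow: /",
])

# With "Disallow: /" for ldspider, that agent is blocked everywhere,
# while agents with no matching record remain allowed.
print(rp.can_fetch("ldspider", "http://example.org/resource/x"))   # False
print(rp.can_fetch("googlebot", "http://example.org/resource/x"))  # True
```

Running the same check against your own live robots.txt (via 
set_url() and read()) shows exactly what a BTC-style crawler would be 
allowed to fetch.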

In summary, there is a huge discrepancy between what the LOD cloud 
promises and what is available in reality through crawls like the BTC. 
Some of these discrepancies were down to weaknesses of the crawler (for 
example, in past editions BTC would only crawl RDF/XML, but other 
formats have been supported since 2012). However, most are now down to 
how data are hosted (robots.txt exclusions, broken content negotiation, 
500 errors, dead links, datasets not linked externally, cyclical 
redirects, broken syntax ...).
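For those curious what a self-check against some of these hosting 
issues might look like, here is a rough sketch in Python (standard 
library only). The URI, helper names and accepted media types are my 
own illustrative assumptions, not part of any BTC tooling.

```python
# Sketch of a publisher self-check: does a Linked Data URI answer a
# content-negotiated request for RDF with a 2xx status and an RDF
# media type? The URI below is a hypothetical placeholder.
import urllib.request

RDF_ACCEPT = "application/rdf+xml, text/turtle, application/n-triples"

def make_conneg_request(uri):
    """Build a GET request asking for RDF serialisations via conneg."""
    req = urllib.request.Request(uri)
    req.add_header("Accept", RDF_ACCEPT)
    req.add_header("User-Agent", "lod-selfcheck/0.1")
    return req

def looks_healthy(status, content_type):
    """A 2xx status and an RDF media type suggest the URI is crawlable;
    a 500, a dead link or an HTML-only answer would fail this check."""
    return 200 <= status < 300 and any(
        mt.strip() in content_type for mt in RDF_ACCEPT.split(","))

req = make_conneg_request("http://example.org/resource/x")
print(req.get_header("Accept"))
```

A real check would also issue the request (urllib follows redirects by 
default and raises on 5xx), watch for redirect loops, and parse the 
returned document to catch broken syntax.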

If you have a LOD dataset (particularly hosted on one of the mentioned 
servers), it might be a good time to think about why your data might be 
difficult to access for potential consumers. If you resolve such issues 
now, your data can be included in BTC'14 and can, e.g., influence the 
evaluation of future research papers. As a bonus, other folks will be 
able to access your Linked Data in future.

And thanks to Andreas and Tobias again!

Best,
Aidan

P.S., on a side note, for a few years, a variety of us tried to directly 
contact publishers about problems we found with such crawls through the 
"Pedantic Web" group [3,4]. We had mixed responses. :/

[1] http://scholar.google.com/scholar?hl=en&q=btc+semantic+web
[2] http://aidanhogan.com/docs/dyldo_ldow12.pdf
[3] https://groups.google.com/forum/#!forum/pedantic-web
[4] http://pedantic-web.org/

On 11/02/2014 09:36, Andreas Harth wrote:
> Hello,
>
> we are about to start crawling for the Billion Triples Challenge
> 2014 dataset in the next couple of weeks.
>
> A few things you can do to prepare:
>
> 1. If you do not want our crawler to visit your site, provide
> a robots.txt [1].  For example:
>
> User-agent: ldspider
> Disallow: /
>
> 2. If you want our crawler to find your site, make sure you
> link your data from well-connected files (e.g. FOAF files).
> You can also send me an entry URI to your dataset which I will
> include in my FOAF file.
>
> 3. Even with an external link (more are better!), your dataset
> might not be well interlinked internally.  Make sure you have
> enough internal links so that the crawler can traverse your dataset.
>
> 4. We plan to wait two seconds between lookups per pay-level
> domain.  If your server cannot sustain that pace, please see #1.
>
> We are aware that LDSpider [2], our crawler, has a number of
> limitations:
>
> * The crawler supports only basic robots.txt features
> (i.e., exclusion)
> * The crawler does not understand sitemaps
> * The crawler does not handle zipped or gzipped data dump files
>
> The Billion Triple Challenge datasets are a valuable resource and
> have been used in benchmarking, data analysis and as a basis for
> applications.  Your data is appreciated.
>
> Many thanks in advance for your support!
>
> Cheers,
> Andreas Harth and Tobias Kaefer.
>
> [1] http://www.robotstxt.org/robotstxt.html
> [2] http://code.google.com/p/ldspider/
>
Received on Saturday, 22 February 2014 21:52:52 UTC
