- From: Aidan Hogan <aidan.hogan@deri.org>
- Date: Sat, 22 Feb 2014 18:52:19 -0300
- To: semantic-web@w3.org
A little late to the party, but as a past collaborator of Andreas and Tobias, I'd just like to publicly extend my thanks to them for again crawling the BTC data. The data have been used in a wide variety of papers [1] and have led to important insights into how research techniques perform against real-world, diverse data (moving away from the all-too-comfortable world of synthetic benchmarks or single-dataset corpora). Collecting the BTC data is not an easy job and it is done with little personal benefit to those who organise it.

Unfortunately, for many "LOD datasets" there is a huge deficit between what is reported as available and what is actually available through crawls such as the BTC. Though the results are a bit old (referring to BTC 11), in [2] we wrote briefly about the reasons for the under-representation of some of the largest LOD datasets in the BTC. These datasets were hosted on rpi.edu, linkedgeodata.org, wright.edu, concordia.ca, rdfabout.com, unime.it, uriburner.com, openlibrary.org, sudoc.fr, viaf.org, europeana.eu, moreways.net, rkbexplorer.com, opencorporates.com, uberblic.org and geonames.org. (For example, GeoNames bans most access through robots.txt.) Though it wasn't the focus of the paper, some more details are given in Section 4.3 of [2], where Tables 1 & 2 are most relevant.

In summary, there is a huge discrepancy between what the LOD cloud promises and what is available in reality through crawls like the BTC. Some of these issues were down to weaknesses of the crawler (for example, in past editions BTC would only crawl RDF/XML, but other formats have been supported since 2012). However, most of the discrepancy is now down to how data are hosted (robots.txt exclusions, content-negotiation problems, 500s, dead links, data that are not linked, cyclical redirects, broken syntax, ...).

If you have a LOD dataset (particularly one hosted on one of the servers mentioned above), it might be a good time to think about why your data might be difficult to access for potential consumers. If you resolve such issues now, your data can be included in BTC'14 and can, e.g., influence the evaluation of future research papers. As a bonus, other folks will be able to access your Linked Data in future.

And thanks to Andreas and Tobias again!

Best,
Aidan

P.S., on a side note, for a few years a variety of us tried to directly contact publishers about problems we found with such crawls through the "Pedantic Web" group [3,4]. We had mixed responses. :/

[1] http://scholar.google.com/scholar?hl=en&q=btc+semantic+web
[2] http://aidanhogan.com/docs/dyldo_ldow12.pdf
[3] https://groups.google.com/forum/#!forum/pedantic-web
[4] http://pedantic-web.org/
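For illustration, here is a minimal sketch of the kind of self-check described above, assuming Python 3 and only the standard library; the example URI and the "ldspider" user-agent string are placeholders rather than anything taken from the actual BTC crawler configuration. It probes a single URI for robots.txt exclusions, content-negotiation problems, error codes, dead links and redirect loops:

#!/usr/bin/env python3
# Rough self-check for a Linked Data URI, illustrating the kinds of
# hosting problems mentioned above. A sketch only: the target URI and
# the "ldspider" user-agent string are assumed placeholders.

import sys
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

RDF_TYPES = ("application/rdf+xml", "text/turtle", "application/n-triples")

def check(uri, agent="ldspider"):
    # 1. Does robots.txt let the crawler in?
    parts = urlsplit(uri)
    rp = urllib.robotparser.RobotFileParser(
        "%s://%s/robots.txt" % (parts.scheme, parts.netloc))
    try:
        rp.read()
        if not rp.can_fetch(agent, uri):
            print("robots.txt disallows '%s' for %s" % (agent, uri))
    except OSError as e:
        print("could not read robots.txt: %s" % e)

    # 2. Does content negotiation return an RDF serialisation?
    req = urllib.request.Request(
        uri, headers={"Accept": ", ".join(RDF_TYPES), "User-Agent": agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
            if ctype not in RDF_TYPES:
                print("unexpected Content-Type for %s: %s" % (uri, ctype))
            else:
                print("OK: %s -> %s (%d)" % (uri, ctype, resp.status))
    except urllib.error.HTTPError as e:
        # Covers 4xx/5xx responses and cyclical redirects
        # (urllib gives up after too many redirections).
        print("HTTP error for %s: %s" % (uri, e))
    except OSError as e:
        # Dead links, DNS failures, timeouts, connection problems.
        print("dead link or network problem for %s: %s" % (uri, e))

if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "http://example.org/resource/x")

Running something like this against a few of your dataset's entity URIs should flag most of the hosting issues listed above; broken syntax in the returned documents would still need a separate step with an RDF parser.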
On 11/02/2014 09:36, Andreas Harth wrote:
> Hello,
>
> we are about to start crawling for the Billion Triples Challenge
> 2014 dataset in the next couple of weeks.
>
> A few things you can do to prepare:
>
> 1. If you do not want our crawler to visit your site, provide
> a robots.txt [1]. For example:
>
>    User-agent: ldspider
>    Disallow: /
>
> 2. If you want our crawler to find your site, make sure you
> link your data from well-connected files (e.g. FOAF files).
> You can also send me an entry URI to your dataset which I will
> include in my FOAF file.
>
> 3. Even with an external link (more are better!), your dataset
> might not be well interlinked internally. Make sure you have
> enough internal links so that the crawler can traverse your dataset.
>
> 4. We plan to wait two seconds between lookups per pay-level
> domain. If your server cannot sustain that pace, please see #1.
>
> We are aware that LDSpider [2], our crawler, has a number of
> limitations:
>
> * The crawler supports only basic robots.txt features
>   (i.e., exclusion)
> * The crawler does not understand sitemaps
> * The crawler does not handle zipped or gzipped data dump files
>
> The Billion Triple Challenge datasets are a valuable resource and
> have been used in benchmarking, data analysis and as a basis for
> applications. Your data is appreciated.
>
> Many thanks in advance for your support!
>
> Cheers,
> Andreas Harth and Tobias Kaefer
>
> [1] http://www.robotstxt.org/robotstxt.html
> [2] http://code.google.com/p/ldspider/