- From: Hugh Glaser <hugh@glasers.org>
- Date: Sat, 15 Feb 2014 16:00:05 +0000
- To: Andreas Harth <andreas@harth.org>
- Cc: Semantic Web <semantic-web@w3.org>
Hi Andreas and Tobias,

Good luck!

Actually, I think essentially ignoring dumps and doing a "real" crawl is a feature rather than a bug. Having the Crawl contain only stuff that has been got as the result of a URI fetch (Linked Data principle 3) is excellent. If you used dumps, you would in any case need to resolve the URIs you got to check that they resolved (if you want it to be "real" Linked Data). Sitemaps that give URI lists are possibly reasonable, and might give you better coverage.

It does mean that users will need to think carefully about the validity of any implications they draw - for example, I trust that people won't use the Crawl data to try to deduce figures about the linkage in datasets, since it will be skewed against stuff with little linkage.

Best
Hugh

On 15 Feb 2014, at 15:31, Andreas Harth <andreas@harth.org> wrote:

> Michel,
>
> On 02/14/2014 08:46 AM, Michel Dumontier wrote:
>> I'd like to help by getting Bio2RDF data into the crawl, really. But
>> we gzip all of our files, and they are in N-Quads format.
>>
>> http://download.bio2rdf.org/release/3/
>>
>> Think you can add gzip/bzip2 support?
>
> We have to start with the crawl soon, and adding support for data dumps
> will take some time. We can put that feature on our list and see if
> we can make it into the 2015 crawl.
>
> As only 3.81% of Linking Open Data cloud datasets provide a data dump
> ([1], page 38), we have to decide whether supporting data dumps in the
> crawler is a high-priority feature.
>
> Please also note that data dumps do not include 303 redirect
> information, something we record when we do lookups during crawling.
>
> There are means to link to data dumps in VoID files, which would
> enable a crawler to automatically discover the data dump URIs.
> However, I suspect that will be true for less than 3.81% of the
> datasets.
>
> My goal is to carry out the crawling without manual intervention (i.e.
> I point the crawling agent to a set of seed URIs and off it goes).
> The 4 Linked Data principles are nice as they provide an elegant
> minimal set of conventions that an automated agent needs to support.
>
> Cheers,
> Andreas.
>
> [1] http://wiki.planet-data.eu/uploads/7/79/D2.4.pdf

-- 
Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
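[Editor's note] Andreas's point that dumps lack the 303-redirect information a crawler records during lookups can be illustrated with a minimal, self-contained sketch (this is not the actual crawler's code; server paths and the `RecordingRedirectHandler` name are invented for the example). A local test server answers a "thing" URI with 303 See Other, per the Linked Data convention of redirecting a non-information resource to a document describing it, and the client records the redirect while following it:

```python
# Hypothetical sketch: record 303 redirects while dereferencing a URI,
# using only the Python standard library. A local server stands in for
# a real Linked Data site; all URIs here are illustrative.
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/id/alice":
            # 303-redirect the "thing" URI to a document about it.
            self.send_response(303)
            self.send_header("Location", "/doc/alice")
            self.end_headers()
        else:
            # Serve the describing document.
            self.send_response(200)
            self.send_header("Content-Type", "text/turtle")
            self.end_headers()
            self.wfile.write(b"</id/alice> a </Person> .\n")

    def log_message(self, *args):  # silence per-request logging
        pass

class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Record each redirect (status, from, to) while still following it."""
    def __init__(self):
        self.redirects = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.redirects.append((code, req.full_url, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

recorder = RecordingRedirectHandler()
opener = urllib.request.build_opener(recorder)
with opener.open(f"http://127.0.0.1:{port}/id/alice") as resp:
    body = resp.read()
server.shutdown()

# recorder.redirects now holds the 303 hop a data dump would not contain,
# e.g. (303, 'http://127.0.0.1:<port>/id/alice',
#            'http://127.0.0.1:<port>/doc/alice')
print(recorder.redirects)
```

A crawler built this way keeps the URI-to-document mapping as crawl metadata, which is exactly what is lost when data is ingested from a dump instead.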
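[Editor's note] The VoID-based dump discovery Andreas mentions can also be sketched briefly (a toy illustration, not the crawler's implementation; the dataset and dump URIs are made up). A VoID description can assert `void:dataDump` triples, and an agent that parses them learns the dump URLs automatically:

```python
# Hypothetical sketch: pull void:dataDump object URIs out of a VoID
# description serialized as N-Triples, using only stdlib string handling.
# The dataset URI and dump URL below are illustrative.
VOID_DATA_DUMP = "<http://rdfs.org/ns/void#dataDump>"

def find_data_dumps(ntriples: str) -> list:
    """Return the object URIs of void:dataDump triples."""
    dumps = []
    for line in ntriples.splitlines():
        parts = line.strip().split(None, 2)  # subject, predicate, rest
        if len(parts) == 3 and parts[1] == VOID_DATA_DUMP:
            obj = parts[2].rstrip(" .")  # drop the trailing " ."
            if obj.startswith("<") and obj.endswith(">"):
                dumps.append(obj[1:-1])
    return dumps

void_doc = """\
<http://example.org/dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<http://example.org/dataset> <http://rdfs.org/ns/void#dataDump> <http://example.org/dumps/data.nq.gz> .
"""

print(find_data_dumps(void_doc))  # ['http://example.org/dumps/data.nq.gz']
```

As Andreas notes, this only helps for the (small) fraction of datasets that publish both a VoID description and a dump, which is why dereferencing seed URIs remains the lowest-common-denominator approach.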
Received on Saturday, 15 February 2014 16:00:30 UTC