W3C home > Mailing lists > Public > semantic-web@w3.org > August 2007

Re: Size of the Semantic Web was: Semantic Web Ontology Map

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Wed, 1 Aug 2007 23:32:39 +0100
Message-Id: <5BFF5C12-031D-440C-A3C5-BF4F49C6F071@cs.man.ac.uk>
Cc: Semantic Web <semantic-web@w3.org>
To: Joshua Tauberer <jt@occams.info>

On Jul 28, 2007, at 5:44 PM, Joshua Tauberer wrote:

> Chris Bizer wrote:
>> The datasources in the Linking Open Data project are all  
>> interlinked with RDF links. So it is possible to crawl all 30  
>> million documents by following these links.
>
> ::shudder::
>
> When I finally am able to serve up the 500M-to-1B triples of U.S.  
> census data myself, I can't wait until some dozen crawlers start  
> looking for more information on each resource one by one... (I say  
> each resource because if resources are dereferencable to documents  
> about them, then it seems inevitable that crawlers will start  
> dereferencing them.)

This can be annoying for both parties (server and consumer). I  
personally find it annoying to have to whip out a crawler for data I  
*know* is dumpable. (My most recent example was clinicaltrials.gov,  
though apparently they have a search parameter to retrieve all  
records. I had to email them to figure that out, though :))

It's generally cheaper and easier to supply a (gzipped) dump of the  
entire dataset. I'm quite surprised that, afaik, no one does this for  
HTML sites. But for RDF-serving sites I see no reason not to provide  
(and to use) a big dump link to acquire all the data. It's easier  
for everyone. Perhaps we could extend, e.g., robots.txt with a  
"here's the big dump of data if you want it all" bit.
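To make the robots.txt idea concrete, here is a minimal sketch of what such an extension might look like and how a crawler could consume it. Note that the "Dataset-dump" directive, the example URL, and the file name are all hypothetical assumptions for illustration; no such directive exists in the actual robots.txt convention.

```python
# Sketch of a hypothetical robots.txt extension pointing crawlers at a
# gzipped dump of the whole dataset. The "Dataset-dump" directive and
# the URL below are illustrative assumptions, not a real standard.

ROBOTS_TXT = """\
User-agent: *
Disallow: /search

# Hypothetical extension: full-dataset dump, so crawlers need not
# dereference each resource one by one.
Dataset-dump: http://example.org/data/all-triples.rdf.gz
"""

def find_dump_links(robots_txt):
    """Return any Dataset-dump URLs declared in a robots.txt body."""
    links = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("dataset-dump:"):
            # split only on the first colon so the URL's "://" survives
            links.append(line.split(":", 1)[1].strip())
    return links

print(find_dump_links(ROBOTS_TXT))
# → ['http://example.org/data/all-triples.rdf.gz']
```

A crawler that finds such a link could fetch the one gzipped file instead of issuing millions of per-resource GETs, which is the whole point of the proposal.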

Cheers,
Bijan.
Received on Wednesday, 1 August 2007 22:32:45 GMT
