W3C home > Mailing lists > Public > semantic-web@w3.org > August 2007

Re: Size of the Semantic Web was: Semantic Web Ontology Map

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Wed, 1 Aug 2007 23:32:39 +0100
Message-Id: <5BFF5C12-031D-440C-A3C5-BF4F49C6F071@cs.man.ac.uk>
Cc: Semantic Web <semantic-web@w3.org>
To: Joshua Tauberer <jt@occams.info>

On Jul 28, 2007, at 5:44 PM, Joshua Tauberer wrote:

> Chris Bizer wrote:
>> The datasources in the Linking Open Data project are all  
>> interlinked with RDF links. So it is possible to crawl all 30  
>> million documents by following these links.
>
> ::shudder::
>
> When I finally am able to serve up the 500M-to-1B triples of U.S.  
> census data myself, I can't wait until some dozen crawlers start  
> looking for more information on each resource one by one... (I say  
> each resource because if resources are dereferencable to documents  
> about them, then it seems inevitable that crawlers will start  
> dereferencing them.)

This can be annoying for both parties (server and consumer). I  
personally find it annoying to have to whip out a crawler for data I  
*know* is dumpable. (My most recent example was clinicaltrials.gov,  
though apparently they have a search parameter to retrieve all  
records. I had to email them to figure that out, though :))

It's generally cheaper and easier to supply a (gzipped) dump of the  
entire dataset. I'm quite surprised that, afaik, no one does this for  
HTML sites. But for RDF-serving sites I see no reason not to provide  
(and to use) a big dump link to acquire all the data. It's easier  
for everyone. Perhaps we could extend, e.g., robots.txt with a  
"here's the big dump of data if you want it all" bit.
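To make the robots.txt idea concrete, here is a minimal sketch of what such an extension might look like and how a crawler could consume it. Note that the "Dataset-dump" directive, the example URL, and the file name are all hypothetical assumptions for illustration; no such directive exists in the actual robots.txt convention.

```python
# Sketch of a hypothetical robots.txt extension pointing crawlers at a
# gzipped dump of the whole dataset. The "Dataset-dump" directive and
# the URL below are illustrative assumptions, not a real standard.

ROBOTS_TXT = """\
User-agent: *
Disallow: /search

# Hypothetical extension: full-dataset dump, so crawlers need not
# dereference each resource one by one.
Dataset-dump: http://example.org/data/all-triples.rdf.gz
"""

def find_dump_links(robots_txt):
    """Return any Dataset-dump URLs declared in a robots.txt body."""
    links = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("dataset-dump:"):
            # split only on the first colon so the URL's "://" survives
            links.append(line.split(":", 1)[1].strip())
    return links

print(find_dump_links(ROBOTS_TXT))
# → ['http://example.org/data/all-triples.rdf.gz']
```

A crawler that finds such a link could fetch the one gzipped file instead of issuing millions of per-resource GETs, which is the whole point of the proposal.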

Cheers,
Bijan.
Received on Wednesday, 1 August 2007 22:32:45 GMT
