Re: Size of the Semantic Web was: Semantic Web Ontology Map from Dan Brickley on 2007-08-01 (semantic-web@w3.org from August 2007)

From: Dan Brickley <danbri@danbri.org>
Date: Thu, 02 Aug 2007 00:11:09 +0100
To: Bijan Parsia <bparsia@cs.man.ac.uk>
Cc: Joshua Tauberer <jt@occams.info>, Semantic Web <semantic-web@w3.org>
Message-ID: <46B1130D.9080905@danbri.org>

Bijan Parsia wrote:
> 
> On Jul 28, 2007, at 5:44 PM, Joshua Tauberer wrote:
> 
>> Chris Bizer wrote:
>>> The datasources in the Linking Open Data project are all interlinked 
>>> with RDF links. So it is possible to crawl all 30 million documents 
>>> by following these links.
>>
>> ::shudder::
>>
>> When I finally am able to serve up the 500M-to-1B triples of U.S. 
>> census data myself, I can't wait until some dozen crawlers start 
>> looking for more information on each resource one by one... (I say 
>> each resource because if resources are dereferencable to documents 
>> about them, then it seems inevitable that crawlers will start 
>> dereferencing them.)
> 
> This can be annoying for both parties (server and consumer). I 
> personally find it annoying to have to whip out a crawler for data I 
> *know* is dumpable. (My most recent example was clinicaltrials.gov, 
> though apparently they have a search parameter to retrieve all records. 
> Had to email them to figure that out though :))
> 
> It's generally cheaper and easier to supply a (gzipped) dump of the 
> entire dataset. I'm quite surprised that, afaik, no one does this for 
> HTML sites. But for RDF serving sites I see no reason not to provide 
> (and to use) the big dump link to acquire all the data. It's easier for 
> everyone. Perhaps we could extend e.g., robots.txt with a "here's the 
> big dump of data if you want it all" bit.

Yes, especially with huge sites, and in cases (e.g. social networking 
sites) where each small file often partially describes things that are 
fully described in their own file. Lots of redundancy.

One issue might be that many sites are essentially HTML views into 
(typically SQL) databases, and since they're already making per-record 
HTML pages, per-record RDF is pretty easy, while a total database dump 
takes a bit more thought and coding.
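
A minimal sketch of that difference, assuming a hypothetical SQLite 
schema (a `person` table and a `knows` table) and a made-up 
http://example.org/person/ URI space; none of these names come from a 
real site. The per-record view is one small query, while the dump has to 
walk every record and stream the result into a gzipped N-Triples file:

```python
import gzip
import sqlite3

# Hypothetical URI space and schema, for illustration only.
BASE = "http://example.org/person/"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"


def escape(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def record_triples(conn, person_id):
    """Yield N-Triples lines for one record (the easy, per-page case)."""
    row = conn.execute(
        "SELECT name FROM person WHERE id = ?", (person_id,)).fetchone()
    if row is None:
        return
    subject = f"<{BASE}{person_id}>"
    yield f'{subject} <{FOAF_NAME}> "{escape(row[0])}" .'
    friends = conn.execute(
        "SELECT friend_id FROM knows WHERE person_id = ? ORDER BY friend_id",
        (person_id,))
    for (friend_id,) in friends:
        yield f"{subject} <{FOAF_KNOWS}> <{BASE}{friend_id}> ."


def write_dump(conn, path):
    """Write every record to a gzipped N-Triples file; return the triple count."""
    count = 0
    ids = [r[0] for r in conn.execute("SELECT id FROM person ORDER BY id")]
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for person_id in ids:
            for line in record_triples(conn, person_id):
                out.write(line + "\n")
                count += 1
    return count


def build_demo_db():
    """Create a tiny in-memory database so the sketch can be run as-is."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("CREATE TABLE knows (person_id INTEGER, friend_id INTEGER)")
    conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
    conn.executemany("INSERT INTO knows VALUES (?, ?)", [(1, 2), (2, 1)])
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for triple in record_triples(demo, 1):
        print(triple)
    print(write_dump(demo, "dump.nt.gz"), "triples written")
```

Here the dump simply reuses the per-record function, which is the lazy 
route; a real site would probably want one streaming query and some 
thought about not emitting the same facts from several records.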

How's that for a seamless transition into plugging the upcoming "W3C 
Workshop on RDF Access to Relational Databases", 25-26 October, 2007 — 
Boston, MA, USA? Details at http://www.w3.org/2007/03/RdfRDB/cfp

cheers,

Dan

Received on Wednesday, 1 August 2007 23:11:30 UTC