- From: Richard Cyganiak <richard@cyganiak.de>
- Date: Thu, 2 Aug 2007 01:37:59 +0200
- To: Bijan Parsia <bparsia@cs.man.ac.uk>
- Cc: Joshua Tauberer <jt@occams.info>, Semantic Web <semantic-web@w3.org>
On 2 Aug 2007, at 00:32, Bijan Parsia wrote:

> I personally find it annoying to have to whip out a crawler for
> data I *know* is dumpable. (My most recent example was
> clinicaltrials.gov, though apparently they have a search parameter
> to retrieve all records. Had to email them to figure that out
> though :))
>
> It's generally cheaper and easier to supply a (gzipped) dump of the
> entire dataset. I'm quite surprised that, afaik, no one does this
> for HTML sites.

So why don't HTML sites provide gzipped dumps of all pages? The answers could be illuminating for RDF publishing. I offer a few thoughts:

1. With dynamic web sites, the publisher must of course serve the individual pages over HTTP; there is no way around that. Providing a dump as a second option is extra work. Search engines sprang up using what was available (the individual pages), and since that worked for them, and publishers in general didn't want the extra work, the option of providing dumps never really went anywhere.

2. With sites where individual pages change often, creating a reasonably up-to-date dump can be technically challenging and can consume quite a lot of computing resources.

3. Web sites grow and evolve over time, and the implementation can accumulate quite a bit of complexity and cruft. The easiest way for a webmaster to create a dump of a complex site might be to crawl the site himself. And at that point, he might just as well say, "Why bother; let Googlebot and the other crawlers do the job."

> But for RDF serving sites I see no reason not to provide (and to
> use) the big dump link to acquire all the data. It's easier for
> everyone.

It's not necessarily easier for everyone. The RDF Book Mashup [1] is a counter-example. It is a wrapper around the Amazon and Google APIs, and it creates a URI and an associated RDF description for every book and author on Amazon. As far as we know, only a couple of hundred of these documents have been accessed since the service went online, and only a few are linked to from somewhere else. So providing a dump of *all* documents would not be easier for us. YMMV.

Richard

[1] http://sites.wiwiss.fu-berlin.de/suhl/bizer/bookmashup/

> Perhaps we could extend e.g., robots.txt with a "here's the big
> dump of data if you want it all" bit.
>
> Cheers,
> Bijan.
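[Editor's note: as a rough sketch of the robots.txt idea Bijan floats above, a publisher-advertised dump pointer might look like the following. The `Sitemap` directive is a real, widely supported robots.txt extension; the `DataDump` directive and the example URLs are purely hypothetical and not part of any standard.]

```
# Hypothetical sketch only -- "DataDump" is not an existing
# robots.txt directive; "Sitemap" is a real, widely supported one.
# All URLs below are made-up examples.
User-agent: *
Sitemap: http://example.org/sitemap.xml

# Hypothetical pointer to a complete gzipped dump of the dataset:
DataDump: http://example.org/dumps/all-data.nt.gz
```

A single well-known pointer of this kind would let a consumer fetch the whole dataset in one request instead of crawling every document, which is exactly the trade-off debated in the thread.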
Received on Wednesday, 1 August 2007 23:38:54 UTC