- From: Bijan Parsia <bparsia@cs.man.ac.uk>
- Date: Wed, 1 Aug 2007 23:32:39 +0100
- To: Joshua Tauberer <jt@occams.info>
- Cc: Semantic Web <semantic-web@w3.org>
On Jul 28, 2007, at 5:44 PM, Joshua Tauberer wrote:

> Chris Bizer wrote:
>> The datasources in the Linking Open Data project are all
>> interlinked with RDF links. So it is possible to crawl all 30
>> million documents by following these links.
>
> ::shudder::
>
> When I finally am able to serve up the 500M-to-1B triples of U.S.
> census data myself, I can't wait until some dozen crawlers start
> looking for more information on each resource one by one... (I say
> each resource because if resources are dereferencable to documents
> about them, then it seems inevitable that crawlers will start
> dereferencing them.)

This can be annoying for both parties (server and consumer). I personally find it annoying to have to whip out a crawler for data I *know* is dumpable. (My most recent example was clinicaltrials.gov, though apparently they have a search parameter to retrieve all records. Had to email them to figure that out, though. :))

It's generally cheaper and easier to supply a (gzipped) dump of the entire dataset. I'm quite surprised that, afaik, no one does this for HTML sites. But for RDF-serving sites I see no reason not to provide (and to use) a big "dump" link to acquire all the data. It's easier for everyone.

Perhaps we could extend, e.g., robots.txt with a "here's the big dump of data if you want it all" bit. (A rough sketch of what I mean is in the P.S.)

Cheers,
Bijan
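P.S. Purely as a sketch -- the "Data-dump" field name and the URL are made up, nothing like it exists in robots.txt today -- such an extension might look like:

    User-agent: *
    Disallow: /sparql
    # Hypothetical extension: full gzipped dump of the whole dataset
    Data-dump: http://example.org/dumps/all-triples.nt.gz

A crawler that understands the extra field fetches the one archive instead of dereferencing resources one by one; everything else just ignores it, the way unrecognized robots.txt fields are already ignored.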
Received on Wednesday, 1 August 2007 22:32:45 UTC