On Jul 28, 2007, at 5:44 PM, Joshua Tauberer wrote:

> Chris Bizer wrote:
>> The datasources in the Linking Open Data project are all
>> interlinked with RDF links. So it is possible to crawl all 30
>> million documents by following these links.
>
> ::shudder::
>
> When I finally am able to serve up the 500M-to-1B triples of U.S.
> census data myself, I can't wait until some dozen crawlers start
> looking for more information on each resource one by one... (I say
> each resource because if resources are dereferencable to documents
> about them, then it seems inevitable that crawlers will start
> dereferencing them.)

This can be annoying for both parties (server and consumer). I personally find it annoying to have to whip out a crawler for data I *know* is dumpable. (My most recent example was clinicaltrials.gov, though apparently they have a search parameter to retrieve all records. Had to email them to figure that out though :))

It's generally cheaper and easier to supply a (gzipped) dump of the entire dataset. I'm quite surprised that, afaik, no one does this for HTML sites. But for RDF serving sites I see no reason not to provide (and to use) the big dump link to acquire all the data. It's easier for everyone.

Perhaps we could extend e.g., robots.txt with a "here's the big dump of data if you want it all" bit.

Cheers,
Bijan.

Received on Wednesday, 1 August 2007 22:32:45 GMT