- From: Richard Cyganiak <richard@cyganiak.de>
- Date: Thu, 2 Aug 2007 01:37:59 +0200
- To: Bijan Parsia <bparsia@cs.man.ac.uk>
- Cc: Joshua Tauberer <jt@occams.info>, Semantic Web <semantic-web@w3.org>
On 2 Aug 2007, at 00:32, Bijan Parsia wrote:

> I personally find it annoying to have to whip out a crawler for
> data I *know* is dumpable. (My most recent example was
> clinicaltrials.gov, though apparently they have a search parameter
> to retrieve all records. Had to email them to figure that out
> though :))
>
> It's generally cheaper and easier to supply a (gzipped) dump of the
> entire dataset. I'm quite surprised that, afaik, no one does this
> for HTML sites.

So why don't HTML sites provide gzipped dumps of all pages? The answers could be illuminating for RDF publishing. I offer a few thoughts:

1. With dynamic web sites, the publisher must of course serve the individual pages over HTTP; there is no way around that. Providing a dump as a second option is extra work. Search engines sprang up using what was available (the individual pages), and since that worked for them, and publishers in general didn't want the extra work, the option of providing dumps never really went anywhere.

2. With sites where individual pages change often, creating a reasonably up-to-date dump can be technically challenging and can consume quite a lot of computing resources.

3. Web sites grow and evolve over time, and the implementation can accumulate quite a bit of complexity and cruft. The easiest way for a webmaster to create a dump of a complex site might be to crawl the site himself. And at that point, he might just as well say, "Why bother; let Googlebot and the other crawlers do the job."

> But for RDF serving sites I see no reason not to provide (and to
> use) the big dump link to acquire all the data. It's easier for
> everyone.

It's not necessarily easier for everyone. The RDF Book Mashup [1] is a counter-example. It is a wrapper around the Amazon and Google APIs, and it creates a URI and an associated RDF description for every book and author on Amazon. As far as we know, only a couple of hundred of these documents have been accessed since the service went online, and only a few are linked to from somewhere else. So providing a dump of *all* documents would not be easier for us. YMMV.

Richard

[1] http://sites.wiwiss.fu-berlin.de/suhl/bizer/bookmashup/

> Perhaps we could extend e.g., robots.txt with a "here's the big
> dump of data if you want it all" bit.
>
> Cheers,
> Bijan.
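[Editor's note: as a rough sketch of the robots.txt idea Bijan floats above, a publisher-advertised dump pointer might look like the following. The `Sitemap` directive is a real, widely supported robots.txt extension; the `DataDump` directive and the example URLs are purely hypothetical and not part of any standard.]

```
# Hypothetical sketch only -- "DataDump" is not an existing
# robots.txt directive; "Sitemap" is a real, widely supported one.
# All URLs below are made-up examples.
User-agent: *
Sitemap: http://example.org/sitemap.xml

# Hypothetical pointer to a complete gzipped dump of the dataset:
DataDump: http://example.org/dumps/all-data.nt.gz
```

A single well-known pointer of this kind would let a consumer fetch the whole dataset in one request instead of crawling every document, which is exactly the trade-off debated in the thread.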
Received on Wednesday, 1 August 2007 23:38:54 UTC