- From: Bijan Parsia <bparsia@cs.man.ac.uk>
- Date: Thu, 2 Aug 2007 01:47:10 +0100
- To: Richard Cyganiak <richard@cyganiak.de>
- Cc: Joshua Tauberer <jt@occams.info>, Semantic Web <semantic-web@w3.org>
On Aug 2, 2007, at 12:37 AM, Richard Cyganiak wrote:

> On 2 Aug 2007, at 00:32, Bijan Parsia wrote:
>> I personally find it annoying to have to whip out a crawler for
>> data I *know* is dumpable. (My most recent example was
>> clinicaltrials.gov, though apparently they have a search parameter
>> to retrieve all records. Had to email them to figure that out
>> though :))
>>
>> It's generally cheaper and easier to supply a (gzipped) dump of
>> the entire dataset. I'm quite surprised that, afaik, no one does
>> this for HTML sites.
>
> So why don't HTML sites provide gzipped dumps of all pages?

My hypothesis is that there is little user demand for this. The
primary mode of interaction is human, a page at a time. Crawlers are
generally run by people with a lot of expertise in crawling, and they
have, by and large, tuned things so it doesn't hurt too much. So the
marginal gain in efficiency probably isn't worth it (though it might
be worth it for big sites; but then again, google *owns* a lot of the
big sites :))

> The answers could be illuminating for RDF publishing.
>
> I offer a few thoughts:
>
> 1. With dynamic web sites, the publisher must of course serve the
> individual pages over HTTP, no way around that. Providing a dump as
> a second option is extra work.

And less obviously useful to non-expert crawlers. With RDF data,
however, I might *want* all the data. Actually, I often feel that way
about blogs. One of my old ones used to provide a full-article RSS
feed of the entire site. Quite nice for some things.

[snip]

>> But for RDF serving sites I see no reason not to provide (and to
>> use) the big dump link to acquire all the data. It's easier for
>> everyone.
>
> It's not necessarily easier for everyone. The RDF Book Mashup [1]
> is a counter-example. It's a wrapper around the Amazon and Google
> APIs, and creates a URI and associated RDF description for every
> book and author on Amazon. As far as we know, only a couple
> hundred of these documents have been accessed since the service
> went online, and only a few are linked to from somewhere else. So,
> providing a dump of *all* documents would not be easier for us.

[snip]

Good point. I really just meant that when you have in fact created a
big data dump in the first place, it can be very helpful to serve it
up that way. The RDF DBLP stuff is a good example. (Several
reasonably big data sites *do* let you download their data, e.g.,
CiteSeer, but in a pretty icky XML form :))

But then this is clear... I suspect you don't want people crawling
your mashup! (How would Amazon or Google react if someone tried to
crawl *their* database via your site?)

Anyway, a big dump can be a solution. If it's easy (e.g., you have
all your data in a single store), I think it's a good idea to provide
it. If you don't want your site crawled, then robots.txt is the way
to go.

Cheers,
Bijan.
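
For concreteness, here is a minimal sketch of the two mechanisms
mentioned above: the robots.txt exclusion and the gzipped dump.
Everything in it is an assumption rather than something from the
thread; the paths and filenames are made up.

A robots.txt that asks well-behaved crawlers to stay away from a
site's generated documents (the /books/ path is hypothetical):

    User-agent: *
    Disallow: /books/

And producing the downloadable dump, assuming the data has already
been serialized to a single N-Triples file called dump.nt:

    import gzip
    import shutil

    # Compress an existing N-Triples dump (dump.nt is a made-up
    # filename) into a single gzipped file that can be served for
    # download, e.g. as dump.nt.gz.
    with open("dump.nt", "rb") as src, gzip.open("dump.nt.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)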
Received on Thursday, 2 August 2007 00:47:11 UTC