- From: Dan Brickley <danbri@danbri.org>
- Date: Thu, 02 Aug 2007 01:23:29 +0100
- To: Richard Cyganiak <richard@cyganiak.de>
- Cc: Bijan Parsia <bparsia@cs.man.ac.uk>, Joshua Tauberer <jt@occams.info>, Semantic Web <semantic-web@w3.org>
Richard Cyganiak wrote:
>
> On 2 Aug 2007, at 00:32, Bijan Parsia wrote:
>> I personally find it annoying to have to whip out a crawler for data I
>> *know* is dumpable. (My most recent example was clinicaltrials.gov,
>> though apparently they have a search parameter to retrieve all
>> records. Had to email them to figure that out though :))
>>
>> It's generally cheaper and easier to supply a (gzipped) dump of the
>> entire dataset. I'm quite surprised that, afaik, no one does this for
>> HTML sites.
>
> So why don't HTML sites provide gzipped dumps of all pages? The answers
> could be illuminating for RDF publishing.
>
> I offer a few thoughts:
>
> 1. With dynamic web sites, the publisher must of course serve the
> individual pages over HTTP, no way around that. Providing a dump as a
> second option is extra work. Search engines sprang up using what was
> available (the individual pages), and since it worked for them, and the
> publishers in general didn't want extra work, the option of providing
> dumps never really went anywhere.
>
> 2. With sites where individual pages change often, creating a fairly
> up-to-date dump can be technically challenging and consume quite some
> computing resources.
>
> 3. Web sites grow and evolve over time, and the implementation can
> accumulate quite a bit of complexity and cruft. The easiest way for a
> webmaster to create a dump of a complex site might be to crawl the site
> himself. And at this point, he might just as well say, “Why bother; let
> Googlebot and the other crawlers do the job.”

Yep, ten years or so back, there was a general expectation in metadata circles that such crawlers would be part of a decentralising network of aggregators, crawling and indexing and sharing parts of the Web. The "this thing is too big for anyone to crawl/index alone" line seemed persuasive then, ... but somehow it hasn't happened that way!

The Harvest system (which was pretty ahead of its time) was the best thing going in that direction. You could run a crawler over a set of target sites, with custom per-format summarisers. Then the aggregated set of summaries could be exposed for others to collect, index and make sharable using a simple (SOIF) metadata record syntax; there's a rough sketch of the idea below. eg. see

http://web.archive.org/web/20010606092559/http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
http://web.archive.org/web/19971221220012/http://harvest.transarc.com/
http://www.w3.org/Search/9605-Indexing-Workshop/index.html
http://www.w3.org/Search/9605-Indexing-Workshop/Papers/Allen@Bunyip.html
http://www.w3.org/TandS/QL/QL98/pp/distributed.html
etc.

For example, at ILRT in Bristol we had a library-like catalogue of social science Web sites (SOSIG), ... and made some experiments with a subject-themed search engine focussed on a thematic area of the Web (full text of selected high quality sites). Similarly, the central Uni Web team ran a crawler against *.bris.ac.uk sites. Various others made similar experiments, eg. Dave Beckett and friends had an experimental *.ac.uk search engine using Harvest gatherers and brokers, see http://www.ariadne.ac.uk/issue3/acdc/ (heh, "The index for the gathered data is getting rather large, around 200 Mbytes" :)

Over time, most of this kind of medium-size / specialist search engine activity seems to have died out (I'd be happy to be proved wrong here!). People just used Google instead.
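To make the gatherer idea a bit more concrete, here is a back-of-envelope sketch in Python of the kind of per-page summariser a Harvest gatherer ran. The SOIF details here are from memory, so treat the exact attribute names and layout as approximate rather than gospel:

# Toy Harvest-style "gatherer": fetch one page, summarise it, and print
# a SOIF-ish summary record.  The SOIF layout used here is roughly
# "@TYPE { url", then "Attribute{byte-count}: value" lines, then "}" --
# approximate, from memory.
import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def soif_record(url):
    """Fetch url and return a SOIF-ish summary record for it."""
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    parser = TitleParser()
    parser.feed(html)
    title = parser.title.strip() or "(untitled)"
    lines = ["@FILE { %s" % url]
    for name, value in (("Type", "HTML"), ("Title", title)):
        lines.append("%s{%d}:\t%s" % (name, len(value.encode("utf-8")), value))
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(soif_record("http://example.org/"))

A real gatherer kept rather more (timestamps, sizes, keywords, full-text extracts) and wrote many such records to a single file that a broker could fetch in one go -- which is the "dump the summaries, not the site" trick again.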
Running a crawler is a pain, running a search engine is a pain, and even at a relatively small scale it is hard to compete with the usability of the commercial services. The only area in which decentralised search has really flourished is the P2P filesharing scene, where centralised commercial services risk legal exposure.

But maybe it is worth revisiting this space; things have moved on a bit since 1996, after all. You can do a lot more with a cheap PC, and things like Lucene seem more mature than the old text indexers...

Dan
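PS. for completeness, the "broker" half of the picture in the same back-of-envelope style: a trivial inverted index over gathered records, standing in for what Lucene (or the old Harvest indexers) would do far better.

# Toy Harvest-style "broker": parse a pile of gathered SOIF-ish records
# (as produced by the sketch above), build an inverted index over their
# Title attributes, and answer simple AND keyword queries.
import re
from collections import defaultdict

def parse_records(text):
    """Yield (url, attributes) pairs from concatenated SOIF-ish records."""
    for url, body in re.findall(r"@\w+ \{ (\S+)\n(.*?)\n\}", text, re.S):
        attrs = {}
        for line in body.splitlines():
            name, _, value = line.partition(":")
            attrs[name.split("{")[0]] = value.strip()
        yield url, attrs

def build_index(records):
    """Map each lowercased word of a record's Title to the URLs containing it."""
    index = defaultdict(set)
    for url, attrs in records:
        for word in re.findall(r"\w+", attrs.get("Title", "").lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs whose titles contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    results = index[words[0]].copy() if words else set()
    for word in words[1:]:
        results &= index[word]
    return results

Obviously not something you'd want pointed at 200 Mbytes of gathered data, but it makes the shape of the thing clear.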
Received on Thursday, 2 August 2007 00:23:54 UTC