Re: Size of the Semantic Web was: Semantic Web Ontology Map

Richard Cyganiak wrote:
> 
> 
> On 2 Aug 2007, at 00:32, Bijan Parsia wrote:
>> I personally find it annoying to have to whip out a crawler for data I 
>> *know* is dumpable. (My most recent example was clinicaltrials.gov, 
>> though apparently they have a search parameter to retrieve all 
>> records. Had to email them to figure that out though :))
>>
>> It's generally cheaper and easier to supply a (gzipped) dump of the 
>> entire dataset. I'm quite surprised that, afaik, no one does this for 
>> HTML sites.
> 
> So why don't HTML sites provide gzipped dumps of all pages? The answers 
> could be illuminating for RDF publishing.
> 
> I offer a few thoughts:
> 
> 1. With dynamic web sites, the publisher must of course serve the 
> individual pages over HTTP; no way around that. Providing a dump as a 
> second option is extra work. Search engines sprang up using what was 
> available (the individual pages), and since that worked for them and 
> publishers generally didn't want the extra work, the option of 
> providing dumps never really went anywhere.
> 
> 2. With sites where individual pages change often, creating a 
> reasonably up-to-date dump can be technically challenging and can 
> consume considerable computing resources.
> 
> 3. Web sites grow and evolve over time, and the implementation can 
> accumulate quite a bit of complexity and cruft. The easiest way for a 
> webmaster to create a dump of a complex site might be to crawl the site 
> himself. And at this point, he might just as well say, “Why bother; let 
> Googlebot and the other crawlers do the job.”

Yep, ten years or so back there was a general expectation in metadata 
circles that such crawlers would be part of a decentralised network of 
aggregators, each crawling, indexing and sharing parts of the Web. The 
"this thing is too big for anyone to crawl/index alone" line seemed 
persuasive then... but somehow it hasn't happened that way!

The Harvest system (which was well ahead of its time) was the best 
thing going in that direction. You could run a gatherer (Harvest's 
crawler) over a set of target sites, with custom per-format 
summarisers. The aggregated summaries could then be exposed for others 
to collect, index and re-share, using a simple metadata record syntax 
(SOIF, the Summary Object Interchange Format).
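
(For anyone who never met SOIF: a record is just a typed URL followed 
by byte-counted attribute/value pairs. A rough sketch of emitting one, 
written from memory of the format; the attribute names here are 
illustrative rather than canonical:)

    import java.nio.charset.StandardCharsets;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SoifRecord {
        // Render one SOIF record: an "@TYPE { url" header, then
        // "Name{byte-count}:<TAB>value" lines, then a closing "}".
        static String record(String url, Map<String, String> attrs) {
            StringBuilder sb = new StringBuilder("@FILE { " + url + "\n");
            for (Map.Entry<String, String> e : attrs.entrySet()) {
                byte[] value = e.getValue().getBytes(StandardCharsets.UTF_8);
                // The {N} byte count lets a broker parse values that
                // contain newlines, without any escaping scheme.
                sb.append(e.getKey()).append('{').append(value.length)
                  .append("}:\t").append(e.getValue()).append('\n');
            }
            return sb.append("}\n").toString();
        }

        public static void main(String[] args) {
            Map<String, String> attrs = new LinkedHashMap<>();
            attrs.put("Title", "Social Science Information Gateway");
            attrs.put("Type", "HTML");
            System.out.print(record("http://sosig.ac.uk/", attrs));
        }
    }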

eg. see
http://web.archive.org/web/20010606092559/http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
http://web.archive.org/web/19971221220012/http://harvest.transarc.com/
http://www.w3.org/Search/9605-Indexing-Workshop/index.html
http://www.w3.org/Search/9605-Indexing-Workshop/Papers/Allen@Bunyip.html
http://www.w3.org/TandS/QL/QL98/pp/distributed.html
etc.

For example, at ILRT in Bristol we had a library-like catalogue of 
social science Web sites (SOSIG), and experimented with a 
subject-themed search engine focussed on that area of the Web (full 
text of selected high-quality sites). Similarly, the central University 
Web team ran a crawler against *.bris.ac.uk sites. Various others made 
similar experiments; eg. Dave Beckett and friends had an experimental 
*.ac.uk search engine using Harvest gatherers and brokers, see 
http://www.ariadne.ac.uk/issue3/acdc/ (heh, "The index for the gathered 
data is getting rather large, around 200 Mbytes" :)

Over time most of this kind of medium-sized / specialist search engine 
activity seems to have died out (I'd be happy to be proved wrong 
here!). People just used Google instead. Running a crawler is a pain, 
running a search engine is a pain, and even at a relatively small scale 
it is hard to compete with the usability of the commercial services. 
The only area where decentralised search has really flourished is the 
P2P filesharing scene, where centralised commercial services risk 
legal exposure.

But maybe it is worth revisiting this space; things have moved on a bit 
since 1996 after all. You can do a lot more with a cheap PC, and things 
like Lucene seem more mature than the old text indexers...
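
To give a flavour: a rough sketch of the indexing side against Lucene's 
Java API (class names follow the newer Lucene releases, so details will 
vary by version; the crawler and the HTML-to-text step are assumed to 
live elsewhere):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class TinyIndexer {
        public static void main(String[] args) throws Exception {
            // An on-disk index under ./crawl-index; no server required.
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("crawl-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                // One Document per crawled page: URL stored verbatim,
                // page text tokenised for full-text search.
                Document doc = new Document();
                doc.add(new StringField("url", "http://www.bris.ac.uk/",
                        Field.Store.YES));
                doc.add(new TextField("body",
                        "...text extracted by the crawler...",
                        Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }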

Dan

Received on Thursday, 2 August 2007 00:23:54 UTC