- From: Hugh Glaser <hg@ecs.soton.ac.uk>
- Date: Fri, 03 Aug 2007 10:26:10 +0100
- To: Richard Cyganiak <richard@cyganiak.de>
- CC: Semantic Web <semantic-web@w3.org>
Isn't this the Semantic Web analogue of the Deep Web (formerly "Invisible Web") problem? I hesitate to point at Wikipedia(!) http://en.wikipedia.org/wiki/Deep_web but that points at http://qprober.cs.columbia.edu/publications/sigmod2001.pdf

Re-running their PubMed "cancer" example today (although I can't get Google to search PubMed alone):

Google: 559,000 results from www.ncbi.nlm.nih.gov for "cancer".
NIH: 1,957,409 for PubMed alone; a full search for "cancer" across the whole www.ncbi.nlm.nih.gov site gives more than 9,000,000 results, coming from nucleotide, geotide, etc.

It may be that they report duplicate entries differently, but the whole problem is an active area, as some people have said, pointing at extending current proposals.

So all we need to do is solve the Deep Web problem, and our specific problem will be solved? A little :-) here.

So what I think we should be asking is whether there is something structurally different about our problem of publishing KBs to search engines, compared with publishing DBs.

Cheers
Hugh (and Afraz)

On 2/8/07 01:47, "Bijan Parsia" <bparsia@cs.man.ac.uk> wrote:

>
> On Aug 2, 2007, at 12:37 AM, Richard Cyganiak wrote:
>
>> On 2 Aug 2007, at 00:32, Bijan Parsia wrote:
>>> I personally find it annoying to have to whip out a crawler for
>>> data I *know* is dumpable. (My most recent example was
>>> clinicaltrials.gov, though apparently they have a search parameter
>>> to retrieve all records. Had to email them to figure that out
>>> though :))
>>>
>>> It's generally cheaper and easier to supply a (gzipped) dump of
>>> the entire dataset. I'm quite surprised that, afaik, no one does
>>> this for HTML sites.
>>
>> So why don't HTML sites provide gzipped dumps of all pages?
>
> My hypothesis is that there is little user demand for this. The
> primary mode of interaction is human, a page at a time. Crawlers are
> generally run by people with a lot of expertise in crawling, and
> they have, by and large, tuned things so it doesn't hurt too much. So
> the marginal gain in efficiency probably isn't worth it (though it
> might be worth it for big sites; but then again, Google *owns* a lot
> of the big sites :))
>
>> The answers could be illuminating for RDF publishing.
>>
>> I offer a few thoughts:
>>
>> 1. With dynamic web sites, the publisher must of course serve the
>> individual pages over HTTP, no way around that. Providing a dump as
>> a second option is extra work.
>
> And less obviously useful to non-expert crawlers. With RDF data,
> however, I might *want* all the data.
>
> Actually, I often feel that way about blogs. One of my old ones used
> to provide a full-article RSS feed of the entire site. Quite nice for
> some things.
>
> [snip]
>>> But for RDF-serving sites I see no reason not to provide (and to
>>> use) the big dump link to acquire all the data. It's easier for
>>> everyone.
>>
>> It's not necessarily easier for everyone. The RDF Book Mashup [1]
>> is a counter-example. It's a wrapper around the Amazon and Google
>> APIs, and creates a URI and an associated RDF description for every
>> book and author on Amazon. As far as we know, only a couple of
>> hundred of these documents have been accessed since the service
>> went online, and only a few are linked to from somewhere else. So
>> providing a dump of *all* documents would not be easier for us.
> [snip]
>
> Good point. I really just meant that when you have in fact created a
> big data dump in the first place, it can be very helpful to serve it
> up that way.
>
> The RDF DBLP stuff is a good example.
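For what it's worth, a minimal sketch of what "serve the big dump" can look like in practice, assuming the whole dataset already sits in a single rdflib Graph; the file names here are placeholders, not anything from the thread:

    import gzip
    from rdflib import Graph

    g = Graph()
    g.parse("dataset.rdf")           # load the full dataset from a local file

    nt = g.serialize(format="nt")    # N-Triples: one triple per line
    if isinstance(nt, str):          # rdflib versions differ on str vs. bytes
        nt = nt.encode("utf-8")

    with gzip.open("dump.nt.gz", "wb") as out:
        out.write(nt)                # publish this file at a stable, linked URL

N-Triples is a convenient dump format here because it is line-oriented, so consumers can stream or split the decompressed file without a full RDF parser.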
>
> (Several reasonably big data sites *do* let you download their data,
> e.g., CiteSeer, but in a pretty icky XML form :))
>
> But then this is clear... I suspect you don't want people crawling
> your mashup! (How would Amazon or Google react if you had someone try
> to crawl *their* database via your site?)
>
> Anyway, a big dump can be a solution. If it's easy (e.g., you have
> all your data in a single store), I think it's a good idea to provide
> it. If you don't want your site crawled, then robots.txt is the way
> to go.
>
> Cheers,
> Bijan.
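On the robots.txt point, a minimal example that asks well-behaved crawlers to stay out of the whole site:

    User-agent: *
    Disallow: /

Disallow rules are path prefixes, so a site could instead block only its dynamically generated areas (for example "Disallow: /books/", a placeholder path) and leave a static dump URL crawlable.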
Received on Friday, 3 August 2007 09:27:14 UTC