- From: Bijan Parsia <bparsia@cs.man.ac.uk>
- Date: Thu, 2 Aug 2007 01:47:10 +0100
- To: Richard Cyganiak <richard@cyganiak.de>
- Cc: Joshua Tauberer <jt@occams.info>, Semantic Web <semantic-web@w3.org>
On Aug 2, 2007, at 12:37 AM, Richard Cyganiak wrote:

> On 2 Aug 2007, at 00:32, Bijan Parsia wrote:
>> I personally find it annoying to have to whip out a crawler for
>> data I *know* is dumpable. (My most recent example was
>> clinicaltrials.gov, though apparently they have a search parameter
>> to retrieve all records. Had to email them to figure that out
>> though :))
>>
>> It's generally cheaper and easier to supply a (gzipped) dump of
>> the entire dataset. I'm quite surprised that, afaik, no one does
>> this for HTML sites.
>
> So why don't HTML sites provide gzipped dumps of all pages?

My hypothesis is that there is little user demand for this. The
primary mode of interaction is human, a page at a time. Crawlers are
generally run by people with a lot of expertise in crawling, and they
have, by and large, tuned things so it doesn't hurt too much. So the
marginal gain in efficiency probably isn't worth it (though it might
be worth it for big sites; but then again, google *owns* a lot of the
big sites :))

> The answers could be illuminating for RDF publishing.
>
> I offer a few thoughts:
>
> 1. With dynamic web sites, the publisher must of course serve the
> individual pages over HTTP, no way around that. Providing a dump as
> a second option is extra work.

And less obviously useful to non-expert crawlers. With RDF data,
however, I might *want* all the data. Actually, I often feel that way
about blogs. One of my old ones used to provide a full-article RSS
feed of the entire site. Quite nice for some things.

[snip]

>> But for RDF serving sites I see no reason not to provide (and to
>> use) the big dump link to acquire all the data. It's easier for
>> everyone.
>
> It's not necessarily easier for everyone. The RDF Book Mashup [1]
> is a counter-example. It's a wrapper around the Amazon and Google
> APIs, and creates a URI and associated RDF description for every
> book and author on Amazon. As far as we know, only a couple
> hundred of these documents have been accessed since the service
> went online, and only a few are linked to from somewhere else. So,
> providing a dump of *all* documents would not be easier for us.

[snip]

Good point. I really just meant that when you have in fact created a
big data dump in the first place, it can be very helpful to serve it
up that way. The RDF DBLP stuff is a good example. (Several
reasonably big data sites *do* let you download their data, e.g.,
CiteSeer, but in a pretty icky XML form :))

But then this is clear... I suspect you don't want people crawling
your mashup! (How would Amazon or Google react if someone tried to
crawl *their* database via your site?)

Anyway, a big dump can be a solution. If it's easy (e.g., you have
all your data in a single store), I think it's a good idea to provide
it. If you don't want your site crawled, then robots.txt is the way
to go.

Cheers,
Bijan.
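
For concreteness, here is a minimal sketch of the two mechanisms
mentioned above: the robots.txt exclusion and the gzipped dump.
Everything in it is an assumption rather than something from the
thread; the paths and filenames are made up.

A robots.txt that asks well-behaved crawlers to stay away from a
site's generated documents (the /books/ path is hypothetical):

    User-agent: *
    Disallow: /books/

And producing the downloadable dump, assuming the data has already
been serialized to a single N-Triples file called dump.nt:

    import gzip
    import shutil

    # Compress an existing N-Triples dump (dump.nt is a made-up
    # filename) into a single gzipped file that can be served for
    # download, e.g. as dump.nt.gz.
    with open("dump.nt", "rb") as src, gzip.open("dump.nt.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)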
Received on Thursday, 2 August 2007 00:47:11 UTC