
DBpedia hosting burden

From: Dan Brickley <danbri@danbri.org>
Date: Wed, 14 Apr 2010 19:58:37 +0200
Message-ID: <h2qeb19f3361004141058w2d405433u3df0b29c37592c4@mail.gmail.com>
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: public-lod <public-lod@w3.org>, dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>
(trimming cc: list to LOD and DBPedia)

On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:

> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
> the crux of the matter i.e., bandwidth consumption and its effects on
> other DBpedia users (as well as our own non-DBpedia related Web properties).
(Leigh)
>> I was just curious about usage volumes. We all talk about how central
>> dbpedia is in the LOD cloud picture, and wondered if there were any
>> publicly accessible metrics to help add some detail to that.
>>
> Well here is the critical detail: people typically crawl DBpedia. They
> crawl it more than any other Data Space in the LOD cloud. They do so
> because DBpedia is still quite central to the burgeoning Web of
> Linked Data.

Have you considered blocking DBpedia crawlers more aggressively, and
nudging them towards alternative ways of accessing the data? While it
is a shame to say 'no' to people trying to use linked data, this would
be more a case of saying 'yes, but not like that...'.
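For what it's worth, one lightweight nudge would be a stricter
robots.txt that slows polite crawlers down and points them at the bulk
dumps instead. This is purely a sketch, not DBpedia's actual policy,
and Crawl-delay is a non-standard directive that some crawlers honour
and others ignore:

```
# Hypothetical robots.txt for dbpedia.org (illustrative values only)
User-agent: *
Crawl-delay: 10
Sitemap: http://dbpedia.org/sitemap.xml
```

Impolite crawlers will of course ignore this entirely, so it only
addresses the well-behaved end of the problem.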

> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
> via SPARQL, which is still ultimately Export from DBpedia and Import to
> my data space mindset.

That's useful to know, thanks. Do you have the impression that these
folks are typically trying to copy the entire thing, or to make some
filtered subset (by geographical view, topic, property, etc.)? Could
studying these logs help provide different downloadable dumps that
would discourage crawlers?
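To make that concrete, a topical subset could be pulled with a single
CONSTRUCT rather than a page-by-page crawl. A sketch only; the class
name and limit are illustrative, not a recommended query:

```
PREFIX dbo: <http://dbpedia.org/ontology/>

# Hypothetical example: copy just the city descriptions,
# rather than crawling every resource page individually.
CONSTRUCT { ?city ?p ?o }
WHERE {
  ?city a dbo:City ;
        ?p ?o .
}
LIMIT 10000
```

If the logs show that most CONSTRUCT traffic clusters around a few
such shapes, pre-cut dumps matching those shapes might absorb much of
the load.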

> That's as simple and precise as this matter is.
>
>  From a SPARQL perspective, DBpedia is quite microscopic; it's when
> you factor in Crawler mentality and network bandwidth that issues
> arise, and we deliberately have protection in place for Crawlers.

Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
anything discouraging crawlers. Where is the 'best practice' or
'acceptable use' advice we should all be following, to avoid putting
needless burden on your servers and bandwidth?

As you mention, DBpedia is an important and central resource, thanks
both to the work of the Wikipedia community, and those in the DBpedia
project who enrich and make available all that information. It's
therefore important that the SemWeb / Linked Data community takes care
to remember that these things don't come for free, that bills need
paying, and that de-referencing is a privilege, not a right. If there
are things we can do as a technology community to lower the cost of
hosting / distributing such data, or to nudge consumers of it in the
direction of more sustainable habits, we should do so. If there's not
so much the rest of us can do but say 'thanks!', ... then, ...er,
'thanks!'. Much appreciated!

Are there any scenarios around e.g. BitTorrent that could be explored?
What if each of the static files in http://dbpedia.org/sitemap.xml
were available as torrents (or magnet: URIs)? I realise that would
only address part of the problem/cost, but it's a widely used
technology for distributing large files; can we bend it to our needs?
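For concreteness, a magnet link for one of the dump files might be
built like this. A sketch only: the info-hash, filename and tracker
below are made up, not a real DBpedia torrent (a real info-hash is the
SHA-1 of the torrent's bencoded 'info' dictionary):

```python
from urllib.parse import quote

def magnet_uri(info_hash, display_name, trackers=()):
    """Build a BitTorrent magnet URI from a hex info-hash.

    xt = the exact topic (the torrent's info-hash),
    dn = a display name, tr = optional tracker URLs.
    """
    uri = "magnet:?xt=urn:btih:%s&dn=%s" % (info_hash, quote(display_name))
    for tr in trackers:
        uri += "&tr=" + quote(tr, safe="")
    return uri

# Hypothetical values -- not a real DBpedia torrent.
link = magnet_uri(
    "c12fe1c06bba254a9dc9f519b335aa7c1367a88a",
    "dbpedia_infobox_properties_en.nt.bz2",
    trackers=["udp://tracker.example.org:6969/announce"],
)
print(link)
```

Publishing one such link per sitemap entry would cost almost nothing,
and would let heavy downloaders share the bandwidth burden among
themselves.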

cheers,

Dan
Received on Wednesday, 14 April 2010 17:59:11 UTC
