W3C home > Mailing lists > Public > public-lod@w3.org > April 2010

Re: DBpedia hosting burden

From: Nathan <nathan@webr3.org>
Date: Wed, 14 Apr 2010 19:08:15 +0100
Message-ID: <4BC6048F.2000205@webr3.org>
To: Dan Brickley <danbri@danbri.org>
CC: Kingsley Idehen <kidehen@openlinksw.com>, public-lod <public-lod@w3.org>, dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>
Dan Brickley wrote:
> (trimming cc: list to LOD and DBPedia)
> On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
>> the crux of the matter i.e., bandwidth consumption and its effects on
>> other DBpedia users (as well as our own non-DBpedia related Web properties).
> (Leigh)
>>> I was just curious about usage volumes. We all talk about how central
>>> dbpedia is in the LOD cloud picture, and wondered if there was any
>>> publicly accessible metrics to help add some detail to that.
>> Well here is the critical detail: people typically crawl DBpedia. They
>> crawl it more than any other Data Space in the LOD cloud. They do so
>> because DBpedia is still quite central to the burgeoning Web of
>> Linked Data.
> Have you considered blocking DBpedia crawlers more aggressively, and
> nudging them to alternative ways of accessing the data? While it is a
> shame to say 'no' to people trying to use linked data, this would be
> more saying 'yes, but not like that...'.
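
For what it's worth, a robots.txt along these lines could throttle
well-behaved crawlers and point bulk consumers at the dumps instead.
(Crawl-delay is non-standard but honoured by several major bots, and the
paths below are purely illustrative, not DBpedia's actual layout:)

```
# hypothetical sketch for dbpedia.org -- paths are illustrative
User-agent: *
Crawl-delay: 10          # non-standard, but honoured by some crawlers
Disallow: /sparql        # discourage bulk crawling of the endpoint
Sitemap: http://dbpedia.org/sitemap.xml
```
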
>> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
>> via SPARQL, which still ultimately reflects an "export from DBpedia and
>> import into my data space" mindset.
> That's useful to know, thanks. Do you have the impression that these
> folk are typically trying to copy the entire thing, or to make some
> filtered subset (by geographical view, topic, property, etc.)? Can
> studying these logs help provide different downloadable dumps that
> would discourage crawlers?
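
As a concrete illustration of the filtered-subset idea: a CONSTRUCT along
these lines would pull just, say, the English labels of populated places
into a small graph one could republish as a dump. (The exact class and
property names here are my assumption about the DBpedia ontology, not
taken from the logs being discussed:)

```
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>

CONSTRUCT { ?place rdfs:label ?label }
WHERE {
  ?place a dbo:PopulatedPlace ;
         rdfs:label ?label .
  FILTER ( lang(?label) = "en" )
}
```
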
>> That's as simple and precise as this matter is.
>>  From a SPARQL perspective, DBpedia is quite microscopic; it's when you
>> factor in Crawler mentality and network bandwidth that issues arise, and
>> we deliberately have protection in place for Crawlers.
> Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
> anything discouraging crawlers. Where is the 'best practice' or
> 'acceptable use' advice we should all be following, to avoid putting
> needless burden on your servers and bandwidth?
> As you mention, DBpedia is an important and central resource, thanks
> both to the work of the Wikipedia community, and those in the DBpedia
> project who enrich and make available all that information. It's
> therefore important that the SemWeb / Linked Data community takes care
> to remember that these things don't come for free, that bills need
> paying and that de-referencing is a privilege, not a right. If there
> are things we can do as a technology community to lower the cost of
> hosting / distributing such data, or to nudge consumers of it in the
> direction of more sustainable habits, we should do so. If there's not
> so much the rest of us can do but say 'thanks!', ... then, ...er,
> 'thanks!'. Much appreciated!
> Are there any scenarios around eg. BitTorrent that could be explored?
> What if each of the static files in http://dbpedia.org/sitemap.xml
> were available as torrents (or magnet: URIs)? I realise that would
> only address part of the problem/cost, but it's a widely used
> technology for distributing large files; can we bend it to our needs?
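
On the torrent idea: each static dump file could simply be referenced by
a magnet link alongside its HTTP URL, something like the line below. (The
info-hash, filename and tracker are placeholders of mine, not real
values:)

```
magnet:?xt=urn:btih:INFOHASH&dn=dbpedia_en.nt.bz2&tr=http://tracker.example/announce
```
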

I'd like to add: could the /data/* and /page/* resources all be made
static files (if they are not already), making use of HTTP caching etc.?

Perhaps even the non-SPARQL-dependent parts could be hosted on another
machine purely for static content? Or perhaps an interim proxy which
caches said resources permanently (with a cache rebuild on request when a
new dataset is loaded)?
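
To make the proxy idea concrete, something like the nginx sketch below
would sit in front of the live server and effectively serve /data/* and
/page/* as static content. (The hostnames and cache sizes are invented
for illustration; I have no knowledge of DBpedia's actual setup:)

```
# hypothetical nginx front-end caching /data/ and /page/ "forever"
proxy_cache_path /var/cache/dbpedia levels=1:2 keys_zone=dbpedia:50m
                 max_size=20g inactive=365d;

server {
    listen 80;
    server_name dbpedia.example.org;            # illustrative host

    location ~ ^/(data|page)/ {
        proxy_pass        http://backend.example.org;  # assumed origin
        proxy_cache       dbpedia;
        proxy_cache_valid 200 365d;   # keep successful responses a year
        expires           max;        # aggressive client-side caching too
    }
}
```

Purging or rebuilding the cache when a new dataset is loaded would then
be a one-off operation on the proxy, rather than load on the SPARQL box.
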

Received on Wednesday, 14 April 2010 18:09:01 UTC
