
Re: DBpedia hosting burden

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 14 Apr 2010 14:19:57 -0400
Message-ID: <4BC6074D.8030006@openlinksw.com>
To: nathan@webr3.org
CC: public-lod <public-lod@w3.org>, dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>
Nathan wrote:
> Dan Brickley wrote:
>> (trimming cc: list to LOD and DBPedia)
>> On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
>>> the crux of the matter i.e., bandwidth consumption and its effects on
>>> other DBpedia users (as well as our own non-DBpedia related Web properties).
>> (Leigh)
>>>> I was just curious about usage volumes. We all talk about how central
>>>> dbpedia is in the LOD cloud picture, and wondered if there were any
>>>> publicly accessible metrics to help add some detail to that.
>>> Well here is the critical detail: people typically crawl DBpedia. They
>>> crawl it more than any other Data Space in the LOD cloud. They do so
>>> because DBpedia is still quite central to the burgeoning Web of
>>> Linked Data.
>> Have you considered blocking DBpedia crawlers more aggressively, and
>> nudging them to alternative ways of accessing the data? While it is a
>> shame to say 'no' to people trying to use linked data, this would be
>> more saying 'yes, but not like that...'.
>>> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
>>> via SPARQL, which is still ultimately an "Export from DBpedia and Import
>>> to my data space" mindset.
>> That's useful to know, thanks. Do you have the impression that these
>> folk are typically trying to copy the entire thing, or to make some
>> filtered subset (by geographical view, topic, property etc.)? Can
>> studying these logs help provide different downloadable dumps that
>> would discourage crawlers?
>>> That's as simple and precise as this matter is.
>>> From a SPARQL perspective, DBpedia is quite microscopic; it's when you
>>> factor in Crawler mentality and network bandwidth that issues arise, and
>>> we deliberately have protection in place for Crawlers.
>> Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
>> anything discouraging crawlers. Where is the 'best practice' or
>> 'acceptable use' advice we should all be following, to avoid putting
>> needless burden on your servers and bandwidth?
>> As you mention, DBpedia is an important and central resource, thanks
>> both to the work of the Wikipedia community, and those in the DBpedia
>> project who enrich and make available all that information. It's
>> therefore important that the SemWeb / Linked Data community takes care
>> to remember that these things don't come for free, that bills need
>> paying and that de-referencing is a privilege not a right. If there
>> are things we can do as a technology community to lower the cost of
>> hosting / distributing such data, or to nudge consumers of it in the
>> direction of more sustainable habits, we should do so. If there's not
>> so much the rest of us can do but say 'thanks!', ... then, ...er,
>> 'thanks!'. Much appreciated!
>> Are there any scenarios around e.g. BitTorrent that could be explored?
>> What if each of the static files in http://dbpedia.org/sitemap.xml
>> were available as torrents (or magnet: URIs)? I realise that would
>> only address part of the problem/cost, but it's a widely used
>> technology for distributing large files; can we bend it to our needs?
> I'd like to add: could the /data/* and /page/* resources all be made
> static files (if they are not already), and make use of HTTP caching etc.?

> Perhaps even the parts that don't depend on SPARQL could be hosted on
> another machine purely for static content? Or perhaps an interim proxy
> which caches said resources permanently (with a cache rebuild on request
> when the dataset is upgraded)?

> regards!
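
A minimal sketch of the caching idea Nathan raises above, written in Python purely for illustration. It assumes a hypothetical `render` function that builds a /data/* or /page/* document from the store, and a hypothetical dataset version label; none of these names come from DBpedia or Virtuoso, and the max-age value is an arbitrary assumption. Rendered documents are kept until the dataset version changes and are then rebuilt on the next request; a request whose If-None-Match value matches the stored ETag gets a 304 with an empty body.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], bytes]  # (status, headers, body)


class StaticResourceCache:
    """Keep rendered resources until the dataset version changes."""

    def __init__(self, render: Callable[[str], bytes], dataset_version: str) -> None:
        self._render = render            # hypothetical: builds a document for a path
        self._version = dataset_version  # hypothetical: label of the loaded dataset
        # path -> (dataset version it was rendered for, ETag, body)
        self._entries: Dict[str, Tuple[str, str, bytes]] = {}

    def set_dataset_version(self, dataset_version: str) -> None:
        # Nothing is purged here; stale entries are rebuilt lazily on request.
        self._version = dataset_version

    def get(self, path: str, if_none_match: Optional[str] = None) -> Response:
        entry = self._entries.get(path)
        if entry is None or entry[0] != self._version:
            # Cache miss, or entry rendered for an older dataset: rebuild it.
            body = self._render(path)
            etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
            entry = (self._version, etag, body)
            self._entries[path] = entry
        _, etag, body = entry
        # max-age is an assumed value, chosen only for the example.
        headers = {"ETag": etag, "Cache-Control": "public, max-age=86400"}
        if if_none_match == etag:
            return 304, headers, b""     # client copy is still current
        return 200, headers, body


if __name__ == "__main__":
    def fake_render(path: str) -> bytes:
        # Stand-in for the real (expensive) document generation.
        return ("description of " + path).encode("utf-8")

    cache = StaticResourceCache(fake_render, "release-1")
    status, headers, _ = cache.get("/data/Berlin")
    print(status, headers["ETag"])
    status, _, _ = cache.get("/data/Berlin", if_none_match=headers["ETag"])
    print(status)
```

The point of the sketch is only that repeat requests from crawlers need neither a fresh render nor a full response body; whether this fits in front of the actual DBpedia deployment is not something the thread establishes.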



Kingsley Idehen	      
President & CEO 
OpenLink Software     
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 
Received on Wednesday, 14 April 2010 18:20:25 UTC
