Re: DBpedia hosting burden

Dan Brickley wrote:
> (trimming cc: list to LOD and DBPedia)
>
> On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>
>   
>> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
>> the crux of the matter, i.e., bandwidth consumption and its effects on
>> other DBpedia users (as well as our own non-DBpedia related Web properties).
>>     
> (Leigh)
>   
>>> I was just curious about usage volumes. We all talk about how central
>>> dbpedia is in the LOD cloud picture, and wondered if there were any
>>> publicly accessible metrics to help add some detail to that.
>>>
>>>       
>> Well here is the critical detail: people typically crawl DBpedia. They
>> crawl it more than any other Data Space in the LOD cloud. They do so
>> because DBpedia is still quite central to the burgeoning Web of
>> Linked Data.
>>     
>
> Have you considered blocking DBpedia crawlers more aggressively, and
> nudging them to alternative ways of accessing the data? 

Yes.

Some have cleaned up their act for sure.

The problem is that there are others still doing the same thing, who then 
complain about the instance in a very generic fashion.
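
Roughly, the sort of nudge in question (a minimal Python/WSGI sketch for 
illustration only, not our actual Virtuoso configuration; the per-IP 
threshold, window, and dump URL are placeholder values):

  # Rough sketch only: throttle heavy crawlers and point them at the dumps.
  import time
  from collections import defaultdict

  WINDOW = 60        # seconds
  MAX_HITS = 100     # requests per IP per window (placeholder value)
  DUMP_URL = "http://wiki.dbpedia.org/Downloads"   # where bulk users belong

  hits = defaultdict(list)

  def throttle(app):
      """Wrap a WSGI app; answer 503 + Retry-After once an IP exceeds the limit."""
      def wrapped(environ, start_response):
          ip = environ.get("REMOTE_ADDR", "unknown")
          now = time.time()
          hits[ip] = [t for t in hits[ip] if now - t < WINDOW] + [now]
          if len(hits[ip]) > MAX_HITS:
              start_response("503 Service Unavailable",
                             [("Retry-After", "300"),
                              ("Content-Type", "text/plain")])
              msg = "Too many requests; please use the dumps at %s\n" % DUMP_URL
              return [msg.encode("utf-8")]
          return app(environ, start_response)
      return wrapped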

> While it is a
> shame to say 'no' to people trying to use linked data, this would be
> more saying 'yes, but not like that...'.
>   

We have an outstanding blog post / technical note about the DBpedia 
instance that hasn't yet been published (possibly due to the 3.5 and 
DBpedia-Live work we are doing); that note will cover how to work with 
the instance.
>   
>> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
>> via SPARQL, which is still ultimately an "export from DBpedia, import to
>> my data space" mindset.
>>     
>
> That's useful to know, thanks. Do you have the impression that these
> folk are typically trying to copy the entire thing, or to make some
> filtered subset (by geographical view, topic, property etc).
Many (and to some degree quite naturally) attempt to export the whole 
thing. Even when they're nudged to use OFFSET and LIMIT, the end result 
is multiple hits en route to a complete export.
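
For illustration, the pattern looks roughly like this (a sketch only; the 
page size and the "format" parameter value are arbitrary choices here, and 
this is exactly the kind of bulk export better served by the dumps):

  # Sketch of the OFFSET/LIMIT export pattern described above.
  import urllib.parse
  import urllib.request

  ENDPOINT = "http://dbpedia.org/sparql"
  PAGE = 10000   # triples per request (arbitrary)

  def fetch_page(offset):
      """Fetch one LIMIT/OFFSET page of the whole graph as Turtle."""
      query = ("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } "
               "LIMIT %d OFFSET %d" % (PAGE, offset))
      params = urllib.parse.urlencode({"query": query, "format": "text/turtle"})
      with urllib.request.urlopen(ENDPOINT + "?" + params) as resp:
          return resp.read()

  # Every page is another hit on the endpoint, so a "complete export" this
  # way means thousands of such requests.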
>  Can
> studying these logs help provide different downloadable dumps that
> would discourage crawlers?
>   

We do have a solution in mind: basically, we are going to host the 
descriptor resources in a separate place and redirect crawlers there 
via 303s.
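
A minimal sketch of that idea (the descriptor hostname and the User-Agent 
test below are placeholders, not the actual rules; the real thing would 
live in the server configuration):

  # Sketch: send identified crawlers a 303 to a separate descriptor-resource host.
  CRAWLER_HINTS = ("bot", "crawler", "spider")
  DESCRIPTOR_HOST = "http://descriptors.example.org"   # placeholder host

  def redirect_crawlers(app):
      """Wrap a WSGI app; 303-redirect crawler User-Agents to the descriptor host."""
      def wrapped(environ, start_response):
          ua = environ.get("HTTP_USER_AGENT", "").lower()
          if any(hint in ua for hint in CRAWLER_HINTS):
              location = DESCRIPTOR_HOST + environ.get("PATH_INFO", "/")
              start_response("303 See Other", [("Location", location)])
              return [b""]
          return app(environ, start_response)
      return wrapped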
>   
>> That's as simple and precise as this matter is.
>>
>> From a SPARQL perspective, DBpedia is quite microscopic; it's when you
>> factor in crawler mentality and network bandwidth that issues arise, and
>> we deliberately have protection in place for crawlers.
>>     
>
> Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
> anything discouraging crawlers. Where is the 'best practice' or
> 'acceptable use' advice we should all be following, to avoid putting
> needless burden on your servers and bandwidth?
>   

We'll get the guide out.
> As you mention, DBpedia is an important and central resource, thanks
> both to the work of the Wikipedia community, and those in the DBpedia
> project who enrich and make available all that information. It's
> therefore important that the SemWeb / Linked Data community takes care
> to remember that these things don't come for free, that bills need
> paying and that de-referencing is a privilege not a right.

"Bills" the major operative word in a world where the "Bill Payer" and 
"Database Maintainer" is a footnote (at best) re. perception of what 
constitutes the DBpedia Project.

Our own ISPs even had to contact us (in the last quarter of 2009) about 
the amount of bandwidth being consumed by DBpedia.

>  If there
> are things we can do as a technology community to lower the cost of
> hosting / distributing such data, or to nudge consumers of it in the
> direction of more sustainable habits, we should do so. If there's not
> so much the rest of us can do but say 'thanks!', ... then, ...er,
> 'thanks!'. Much appreciated!
>   

For us, the most important thing is perspective. DBpedia is another 
space on a public network; it can't magically rewrite the underlying 
physics of wide-area networking where access is open to the world. What 
we can do is publish a note about proper behavior and explain how we 
protect the instance so that everyone has a chance of using it (rather 
than a select few resource guzzlers).
> Are there any scenarios around eg. BitTorrent that could be explored?
> What if each of the static files in http://dbpedia.org/sitemap.xml
> were available as torrents (or magnet: URIs)?
When we set up the Descriptor Resource host, these would certainly be 
considered.
>  I realise that would
> only address part of the problem/cost, but it's a widely used
> technology for distributing large files; can we bend it to our needs?
>   
Also, we encourage the use of gzip over HTTP :-)
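
For example (a plain urllib sketch; the resource URL is just an example 
descriptor document, and most HTTP client libraries can negotiate this 
automatically):

  # Sketch: ask for a gzip-compressed response and decompress it client-side.
  import gzip
  import urllib.request

  req = urllib.request.Request(
      "http://dbpedia.org/data/Berlin.ntriples",
      headers={"Accept-Encoding": "gzip"})
  with urllib.request.urlopen(req) as resp:
      body = resp.read()
      if resp.headers.get("Content-Encoding") == "gzip":
          body = gzip.decompress(body)
  print(len(body), "bytes after decompression")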

Kingsley
> cheers,
>
> Dan
>
>   


-- 

Regards,

Kingsley Idehen	      
President & CEO 
OpenLink Software     
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 
