- From: Dan Brickley <danbri@danbri.org>
- Date: Wed, 14 Apr 2010 19:58:37 +0200
- To: Kingsley Idehen <kidehen@openlinksw.com>
- Cc: public-lod <public-lod@w3.org>, dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>
(trimming cc: list to LOD and DBpedia)

On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:

> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
> the crux of the matter, i.e., bandwidth consumption and its effects on
> other DBpedia users (as well as our own non-DBpedia-related Web properties).

(Leigh)
>> I was just curious about usage volumes. We all talk about how central
>> DBpedia is in the LOD cloud picture, and wondered if there were any
>> publicly accessible metrics to help add some detail to that.
>>
> Well, here is the critical detail: people typically crawl DBpedia. They
> crawl it more than any other data space in the LOD cloud. They do so
> because DBpedia is still quite central to the burgeoning Web of Linked Data.

Have you considered blocking DBpedia crawlers more aggressively, and nudging them towards alternative ways of accessing the data? While it is a shame to say 'no' to people trying to use linked data, this would be more a matter of saying 'yes, but not like that...'.

> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
> via SPARQL, which is still ultimately an "export from DBpedia, import into
> my data space" mindset.

That's useful to know, thanks. Do you have the impression that these folk are typically trying to copy the entire thing, or to make some filtered subset (by geographical view, topic, property, etc.)? Could studying these logs help you provide different downloadable dumps that would discourage crawlers?

> That's as simple and precise as this matter is.
>
> From a SPARQL perspective, DBpedia is quite microscopic; it's when you
> factor in the crawler mentality and network bandwidth that issues arise,
> and we deliberately have protection in place for crawlers.

Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see anything discouraging crawlers. Where is the 'best practice' or 'acceptable use' advice we should all be following, to avoid putting needless burden on your servers and bandwidth?

As you mention, DBpedia is an important and central resource, thanks both to the work of the Wikipedia community and to those in the DBpedia project who enrich and make available all that information. It's therefore important that the SemWeb / Linked Data community takes care to remember that these things don't come for free, that bills need paying, and that de-referencing is a privilege, not a right. If there are things we can do as a technology community to lower the cost of hosting and distributing such data, or to nudge its consumers towards more sustainable habits, we should do so. And if there's not much the rest of us can do but say 'thanks!', ... then, ... er, 'thanks!'. Much appreciated!

Are there any scenarios around e.g. BitTorrent that could be explored? What if each of the static files in http://dbpedia.org/sitemap.xml were available as torrents (or magnet: URIs)? I realise that would only address part of the problem/cost, but it's a widely used technology for distributing large files; can we bend it to our needs?

(A few rough sketches of these ideas follow after the sign-off: politer crawling, a filtered CONSTRUCT export, and magnet links for the dump files.)

cheers,

Dan
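On the crawler-nudging point, here is a minimal sketch, in Python, of what politer client-side behaviour could look like: checking robots.txt and honouring any advertised Crawl-delay before each fetch. The user-agent string, the fallback delay, and the example resource URI are illustrative assumptions, not DBpedia policy.

```python
# A hypothetical polite-crawler sketch: consult robots.txt before fetching
# DBpedia resources, and pause between requests using any Crawl-delay hint.
import time
import urllib.robotparser
import urllib.request

USER_AGENT = "polite-lod-crawler/0.1"  # hypothetical user-agent string

robots = urllib.robotparser.RobotFileParser()
robots.set_url("http://dbpedia.org/robots.txt")
robots.read()

# Fall back to a 5-second pause if no Crawl-delay is advertised (assumption).
delay = robots.crawl_delay(USER_AGENT) or 5

def fetch(uri):
    """Fetch one resource, but only if robots.txt allows it."""
    if not robots.can_fetch(USER_AGENT, uri):
        raise RuntimeError("robots.txt disallows %s" % uri)
    req = urllib.request.Request(uri, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Example resource URI, for illustration only.
for uri in ["http://dbpedia.org/data/Berlin.ttl"]:
    data = fetch(uri)
    time.sleep(delay)  # pause between requests instead of hammering the server
```

robots.txt is already the conventional place to publish this kind of acceptable-use hint, so no new infrastructure would be needed on the publisher's side.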
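On the CONSTRUCT point, a minimal sketch of the kind of filtered export being described, using only the Python standard library against the public endpoint. The query (a small subset of city descriptions) and the result limit are purely illustrative, and the `Accept: text/turtle` content negotiation is a reasonable assumption about the endpoint rather than a guarantee.

```python
# A minimal sketch of a filtered CONSTRUCT export from the public DBpedia
# SPARQL endpoint, rather than crawling resource pages one by one.
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: a small topical subset (city descriptions).
query = """
CONSTRUCT { ?city ?p ?o }
WHERE {
  ?city a <http://dbpedia.org/ontology/City> ;
        ?p ?o .
}
LIMIT 1000
"""

url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(req) as resp:
    turtle = resp.read().decode("utf-8")

print(turtle[:500])  # show the start of the exported subset
```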
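And on the BitTorrent question, a rough sketch of what generating magnet: URIs for the existing dump files might involve: computing the BitTorrent info-hash for a single-file torrent (per BEP 3) and attaching the current HTTP download URL as a web seed via the `ws=` parameter. The dump file name and download URL below are placeholders, not actual DBpedia locations.

```python
# A rough sketch (not a production tool): compute the BitTorrent info-hash
# for one dump file and emit a magnet: link, with the existing HTTP download
# URL attached as a web seed ("ws=").
import hashlib
import os
from urllib.parse import quote

def bencode(obj):
    """Minimal bencoding (BEP 3) for ints, bytes/str, lists and dicts."""
    if isinstance(obj, int):
        return b"i" + str(obj).encode() + b"e"
    if isinstance(obj, str):
        obj = obj.encode("utf-8")
    if isinstance(obj, bytes):
        return str(len(obj)).encode() + b":" + obj
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError("cannot bencode %r" % type(obj))

def info_hash(path, piece_length=2 ** 18):
    """SHA-1 of the bencoded single-file 'info' dict, i.e. the info-hash."""
    pieces, length = b"", 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(piece_length)
            if not chunk:
                break
            length += len(chunk)
            pieces += hashlib.sha1(chunk).digest()
    info = {"length": length,
            "name": os.path.basename(path),
            "piece length": piece_length,
            "pieces": pieces}
    return hashlib.sha1(bencode(info)).hexdigest()

def magnet_link(path, web_seed_base):
    name = os.path.basename(path)
    return ("magnet:?xt=urn:btih:" + info_hash(path)
            + "&dn=" + quote(name)
            + "&ws=" + quote(web_seed_base + name, safe=""))

# Hypothetical usage; this file name and URL are placeholders.
print(magnet_link("dumps/infobox_en.nt.bz2",
                  "http://downloads.dbpedia.org/3.4/en/"))
```

With web seeds, the existing download server keeps working as a seed of last resort, while repeat downloads increasingly come from peers rather than from DBpedia's bandwidth.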
Received on Wednesday, 14 April 2010 17:59:11 UTC