- From: Nathan <nathan@webr3.org>
- Date: Wed, 14 Apr 2010 19:08:15 +0100
- To: Dan Brickley <danbri@danbri.org>
- CC: Kingsley Idehen <kidehen@openlinksw.com>, public-lod <public-lod@w3.org>, dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>
Dan Brickley wrote:
> (trimming cc: list to LOD and DBPedia)
>
> On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
>> the crux of the matter, i.e., bandwidth consumption and its effects on
>> other DBpedia users (as well as our own non-DBpedia-related Web properties).
>
> (Leigh)
>>> I was just curious about usage volumes. We all talk about how central
>>> dbpedia is in the LOD cloud picture, and wondered if there were any
>>> publicly accessible metrics to help add some detail to that.
>>
>> Well, here is the critical detail: people typically crawl DBpedia. They
>> crawl it more than any other Data Space in the LOD cloud. They do so
>> because DBpedia is still quite central to the burgeoning Web of
>> Linked Data.
>
> Have you considered blocking DBpedia crawlers more aggressively, and
> nudging them towards alternative ways of accessing the data? While it is
> a shame to say 'no' to people trying to use linked data, this would be
> more like saying 'yes, but not like that...'.
>
>> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
>> via SPARQL, which is still ultimately an export-from-DBpedia,
>> import-to-my-data-space mindset.
>
> That's useful to know, thanks. Do you have the impression that these
> folk are typically trying to copy the entire thing, or to make some
> filtered subset (by geographical view, topic, property, etc.)? Can
> studying these logs help provide different downloadable dumps that
> would discourage crawlers?
>
>> That's as simple and precise as this matter is.
>>
>> From a SPARQL perspective, DBpedia is quite microscopic; it's when you
>> factor in crawler mentality and network bandwidth that issues arise, and
>> we deliberately have protection in place for crawlers.
>
> Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
> anything discouraging crawlers.
> Where is the 'best practice' or
> 'acceptable use' advice we should all be following, to avoid putting
> needless burden on your servers and bandwidth?
>
> As you mention, DBpedia is an important and central resource, thanks
> both to the work of the Wikipedia community, and to those in the DBpedia
> project who enrich and make available all that information. It's
> therefore important that the SemWeb / Linked Data community takes care
> to remember that these things don't come for free, that bills need
> paying, and that de-referencing is a privilege, not a right. If there
> are things we can do as a technology community to lower the cost of
> hosting / distributing such data, or to nudge consumers of it in the
> direction of more sustainable habits, we should do so. If there's not
> so much the rest of us can do but say 'thanks!', ... then, ...er,
> 'thanks!'. Much appreciated!
>
> Are there any scenarios around e.g. BitTorrent that could be explored?
> What if each of the static files in http://dbpedia.org/sitemap.xml
> were available as torrents (or magnet: URIs)? I realise that would
> only address part of the problem/cost, but it's a widely used
> technology for distributing large files; can we bend it to our needs?

I'd like to add: could the /data/* and /page/* resources all be made
static files (if they are not already), and make use of HTTP caching,
etc.? Perhaps even the non-SPARQL-dependent parts could be hosted on
another machine purely for static content, or behind an interim proxy
which caches said resources permanently (with a cache rebuild on request
when a new dataset is upgraded).

regards!
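[Editor's note: the "interim proxy that caches permanently, rebuilding only when a new dataset is loaded" idea above can be sketched as follows. This is an illustrative toy, not DBpedia's actual setup: the paths, the `CachingProxy` class, and the version-keyed invalidation scheme are all assumptions introduced here for clarity.]

```python
# Sketch of an "interim caching proxy" for /data/* and /page/* resources:
# entries are cached permanently and only rebuilt after a new dataset
# version is loaded. All names here are illustrative, not DBpedia's API.

class CachingProxy:
    def __init__(self, origin, dataset_version):
        self.origin = origin            # callable: path -> response body
        self.version = dataset_version  # bumped when a new dump is loaded
        self.cache = {}                 # path -> (version, body)

    def get(self, path):
        entry = self.cache.get(path)
        if entry and entry[0] == self.version:
            return entry[1], "HIT"      # served without touching the origin
        body = self.origin(path)        # e.g. a SPARQL-backed page renderer
        self.cache[path] = (self.version, body)
        return body, "MISS"

    def load_new_dataset(self, new_version):
        # Invalidate lazily: stale entries are rebuilt on the next request.
        self.version = new_version

# Usage: repeat requests for the same resource hit the origin only once,
# until a new dataset version forces a rebuild.
proxy = CachingProxy(origin=lambda p: f"<rdf for {p}>", dataset_version="3.4")
print(proxy.get("/data/Berlin"))   # ('<rdf for /data/Berlin>', 'MISS')
print(proxy.get("/data/Berlin"))   # ('<rdf for /data/Berlin>', 'HIT')
proxy.load_new_dataset("3.5")
print(proxy.get("/data/Berlin"))   # ('<rdf for /data/Berlin>', 'MISS')
```

In a real deployment the same effect would more likely come from standard HTTP validators (ETag / Last-Modified) on a front-end cache, so generic clients and proxies benefit too.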
Received on Wednesday, 14 April 2010 18:09:01 UTC