- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Wed, 14 Apr 2010 14:19:57 -0400
- To: nathan@webr3.org
- CC: public-lod <public-lod@w3.org>, dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>
Nathan wrote:
> Dan Brickley wrote:
>
>> (trimming cc: list to LOD and DBPedia)
>>
>> On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>
>>> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
>>> the crux of the matter i.e., bandwidth consumption and its effects on
>>> other DBpedia users (as well as our own non-DBpedia related Web properties).
>>>
>> (Leigh)
>>
>>>> I was just curious about usage volumes. We all talk about how central
>>>> dbpedia is in the LOD cloud picture, and wondered if there were any
>>>> publicly accessible metrics to help add some detail to that.
>>>>
>>> Well here is the critical detail: people typically crawl DBpedia. They
>>> crawl it more than any other Data Space in the LOD cloud. They do so
>>> because DBpedia is still quite central to the burgeoning Web of
>>> Linked Data.
>>>
>> Have you considered blocking DBpedia crawlers more aggressively, and
>> nudging them to alternative ways of accessing the data? While it is a
>> shame to say 'no' to people trying to use linked data, this would be
>> more saying 'yes, but not like that...'.
>>
>>> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
>>> via SPARQL, which is still ultimately Export from DBpedia and Import to
>>> my data space mindset.
>>>
>> That's useful to know, thanks. Do you have the impression that these
>> folk are typically trying to copy the entire thing, or to make some
>> filtered subset (by geographical view, topic, property etc). Can
>> studying these logs help provide different downloadable dumps that
>> would discourage crawlers?
>>
>>> That's as simple and precise as this matter is.
>>>
>>> From a SPARQL perspective, DBpedia is quite microscopic, it's when you
>>> factor in Crawler mentality and network bandwidth that issues arise, and
>>> we deliberately have protection in place for Crawlers.
>>>
>> Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
>> anything discouraging crawlers. Where is the 'best practice' or
>> 'acceptable use' advice we should all be following, to avoid putting
>> needless burden on your servers and bandwidth?
>>
>> As you mention, DBpedia is an important and central resource, thanks
>> both to the work of the Wikipedia community, and those in the DBpedia
>> project who enrich and make available all that information. It's
>> therefore important that the SemWeb / Linked Data community takes care
>> to remember that these things don't come for free, that bills need
>> paying and that de-referencing is a privilege not a right. If there
>> are things we can do as a technology community to lower the cost of
>> hosting / distributing such data, or to nudge consumers of it in the
>> direction of more sustainable habits, we should do so. If there's not
>> so much the rest of us can do but say 'thanks!', ... then, ...er,
>> 'thanks!'. Much appreciated!
>>
>> Are there any scenarios around eg. BitTorrent that could be explored?
>> What if each of the static files in http://dbpedia.org/sitemap.xml
>> were available as torrents (or magnet: URIs)? I realise that would
>> only address part of the problem/cost, but it's a widely used
>> technology for distributing large files; can we bend it to our needs?
>>
>
> I'd like to add; could the /data/* and /page/* resources all be made
> static files? (if they are not already) + make use of http caching etc.
>
Yes.

> perhaps even the non-sparql dependent parts could be hosted on another
> machine purely for static content? perhaps an interim proxy which
> caches said resources permanently (then cache rebuild on request when a
> new dataset is upgraded)
>
Yes.

Kingsley

> regards!
>

-- 
Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
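
For illustration only, the interim caching proxy suggested in the quoted text (cache /data/* and /page/* responses permanently, rebuild on request after a dataset upgrade) might be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a description of how DBpedia is actually served: the `DatasetCache` class, the `render` callback, and the release labels are all hypothetical names introduced here.

```python
from typing import Callable, Dict


class DatasetCache:
    """Keeps rendered resources until the dataset release changes.

    Hypothetical sketch: `render` stands in for whatever expensive
    backend call produces a /data/* or /page/* document.
    """

    def __init__(self, render: Callable[[str], bytes], release: str) -> None:
        self._render = render
        self._release = release            # hypothetical release label
        self._store: Dict[str, bytes] = {}
        self.backend_hits = 0              # counts calls that reached the backend

    def etag(self, path: str) -> str:
        # Validator derived from release label and path, so clients and
        # downstream HTTP caches can revalidate cheaply after an upgrade.
        return '"%s:%s"' % (self._release, path)

    def get(self, path: str) -> bytes:
        # Entries never expire on their own ("cached permanently");
        # a miss is filled lazily on request.
        if path not in self._store:
            self.backend_hits += 1
            self._store[path] = self._render(path)
        return self._store[path]

    def upgrade(self, new_release: str) -> None:
        # A new dataset release invalidates everything; entries are
        # rebuilt on the next request for each path.
        self._release = new_release
        self._store.clear()


if __name__ == "__main__":
    cache = DatasetCache(lambda p: ("doc for " + p).encode(), "release-A")
    cache.get("/data/Berlin")
    cache.get("/data/Berlin")      # served from cache, backend not touched
    cache.upgrade("release-B")     # dataset upgraded: cache is emptied
    cache.get("/data/Berlin")      # rebuilt on request
    print(cache.backend_hits)
```

Tying the validator to the release label rather than to a timestamp is one possible design choice: it would let crawlers and intermediate caches keep their copies for as long as the dataset itself is unchanged.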
Received on Wednesday, 14 April 2010 18:20:25 UTC