Re: [semanticweb] ANN: DBpedia 3.5 released from Kingsley Idehen on 2010-04-14 (semantic-web@w3.org from April 2010)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 14 Apr 2010 13:09:11 -0400
To: Leigh Dodds <leigh.dodds@talis.com>
CC: Ivan Mikhailov <imikhailov@openlinksw.com>, baran <baran@goldmail.de>, semanticweb <semanticweb@yahoogroups.com>, public-lod <public-lod@w3.org>, SW-forum <semantic-web@w3.org>, dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>, dbpedia-announcements <dbpedia-announcements@lists.sourceforge.net>, Chris Bizer <chris@bizer.de>
Message-ID: <4BC5F6B7.7000507@openlinksw.com>

Leigh Dodds wrote:
> Hi,
>
> 2010/4/14 Kingsley Idehen <kidehen@openlinksw.com>:
>   
>> When we refer to an "option" we are talking about a mirror rather than
>> an alternative place where DBpedia data sets have been loaded.
>>     
>
> I deliberately didn't use the word "mirror" as that sets expectations
> around offering same features, using same technology, etc. So I meant
> what I said: there are other SPARQL endpoints that provide live,
> public access to the dbpedia data.
>
>   
Fine, but Ivan specifically commented about "Mirror".

Do understand that the issues aren't about SPARQL per se. It's about 
what's happening around the instance at http://dbpedia.org. Crawling 
the Descriptor Resources is chewing up "across the wire" bandwidth.
>> As for usage levels, the issues have very little to do with sane SPARQL
>> queries and everything to do with crawlers that actually attempt to
>> perform wholesale imports of the entire data set (many attempt this, as
>> we can see from the HTTP logs and the payload sizes). In addition,
>> remember, we are serving up actual RDF based descriptor resources, and
>> these too are crawled wholesale with the intent of populating other data
>> spaces (these are also crawled aggressively via LOD and non LOD crawlers).
>>
>> We are not just providing a SPARQL endpoint, we are also serving RDF
>> descriptor resources in a variety of representation formats. And as I've
>> stated above, the dominant use pattern is crawling the RDF descriptor
>> resources, which (without protection) simply obliterates "across the
>> wire bandwidth" as is the case with any document server on a public
>> network such as the World Wide Web.
>>     
>
> Yes I'm aware of what dbpedia is, and also the challenges of running a
> live operational service :)
>   
My comment wasn't a "what is DBpedia?" lecture. It was about clarifying 
the crux of the matter i.e., bandwidth consumption and its effects on 
other DBpedia users (as well as our own non-DBpedia related Web properties).
> I was just curious about usage volumes. We all talk about how central
> dbpedia is in the LOD cloud picture, and wondered if there were any
> publicly accessible metrics to help add some detail to that.
>   
Well here is the critical detail: people typically crawl DBpedia. They 
crawl it more than any other Data Space in the LOD cloud. They do so 
because DBpedia is still quite central to the burgeoning Web of 
Linked Data.

When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs 
via SPARQL, which is still ultimately an "export from DBpedia, import 
into my data space" mindset.
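
For illustration only, here is a minimal sketch of what such an export
request looks like on the wire. The endpoint URL http://dbpedia.org/sparql
and the helper names are assumptions made for this example; only the
"query" parameter comes from the SPARQL Protocol. The code just builds
the request URLs and sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint location; it is not stated in this message.
ENDPOINT = "http://dbpedia.org/sparql"


def describe_url(resource_uri, endpoint=ENDPOINT):
    """Build a GET URL asking the endpoint to DESCRIBE a single resource."""
    query = "DESCRIBE <%s>" % resource_uri
    return "%s?%s" % (endpoint, urlencode({"query": query}))


def construct_url(resource_uri, limit=100, endpoint=ENDPOINT):
    """Build a GET URL for a bounded CONSTRUCT over one resource.

    The LIMIT keeps the payload small, in contrast to a wholesale export.
    """
    query = (
        "CONSTRUCT { <%(s)s> ?p ?o } WHERE { <%(s)s> ?p ?o } LIMIT %(n)d"
        % {"s": resource_uri, "n": limit}
    )
    return "%s?%s" % (endpoint, urlencode({"query": query}))
```

Looping such calls over every resource in the data set is, in effect,
the crawl described above, just expressed through SPARQL.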


That's as simple and precise as this matter is.

From a SPARQL perspective, DBpedia is quite microscopic; it's when you 
factor in crawler mentality and network bandwidth that issues arise, and 
we deliberately have protection in place for crawlers.
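
To make "protection" concrete, here is a hypothetical sketch of one
common approach: a per-client token bucket that caps request rates. This
is an assumption for illustration, not a description of what is actually
deployed on dbpedia.org; the class names, rate, and capacity values are
all invented for the example.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self):
        # Refill in proportion to elapsed time, never above capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CrawlerGuard:
    """Keep one bucket per client identifier (for example, an IP address)."""

    def __init__(self, rate=1.0, capacity=5, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

A server would call guard.allow(client_id) per request and reject or
delay the request when it returns False, so one aggressive crawler cannot
consume the bandwidth that other users depend on.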

Kingsley


> Cheers,
>
> L.
>
>   


-- 

Regards,

Kingsley Idehen	      
President & CEO 
OpenLink Software     
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Received on Wednesday, 14 April 2010 17:09:49 UTC