Linked Data and IRI dereferencing (scale limits?) from Jörn Hees on 2010-08-05 (public-lod@w3.org from August 2010)

From: Jörn Hees <j_hees@cs.uni-kl.de>
Date: Thu, 5 Aug 2010 02:37:44 +0200
To: public-lod@w3.org
Message-Id: <201008050237.45452.j_hees@cs.uni-kl.de>
Hi all,

I have a question related to IRI dereferencing best practices. For a project I 
need to run a couple of cross-domain SPARQL queries like this one:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?p ?lp ?o ?lo
WHERE {
  <http://dbpedia.org/resource/Barack_Obama> ?p ?o.
  OPTIONAL { ?p rdfs:label ?lp. }
  OPTIONAL { ?o rdfs:label ?lo. }
}

(I need a human readable version of all triples starting at 
dbpedia:Barack_Obama.)

For now I need this not only for Barack_Obama but for the 1000 DBpedia 
resources corresponding to the 1000 most visited pages of this month in the 
English Wikipedia.

Actually quite simple I thought:
I don't need the whole DBpedia, but only a reasonably small subset. For every 
distinct ?p and ?o I should be able to acquire the information I need simply 
by looking up the URI and retrieving the data.
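
The lookup described above could be sketched roughly like this in Python. It
is only a minimal sketch under assumptions: the function names are made up for
illustration, and it assumes the server answers an
"Accept: application/rdf+xml" header with RDF/XML, which has to be checked per
site.

```python
from urllib.request import Request, urlopen

# Assumed media type; a server may prefer Turtle or N-Triples instead.
RDF_ACCEPT = "application/rdf+xml"


def build_rdf_request(iri):
    """Build a GET request that asks for RDF/XML rather than HTML."""
    return Request(iri, headers={"Accept": RDF_ACCEPT})


def fetch_rdf(iri, timeout=30.0):
    """Dereference the IRI (redirects are followed) and return the raw body."""
    with urlopen(build_rdf_request(iri), timeout=timeout) as response:
        return response.read()
```

Calling fetch_rdf("http://dbpedia.org/resource/Barack_Obama") would then
return whatever document the server chooses to send, which still has to be
parsed with an RDF library.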


But there are some problems:

1. DBpedia still uses skos:subject quite often, even though it's deprecated. 
If I look up the URI http://www.w3.org/2004/02/skos/core#subject, I'm silently 
redirected to the current SKOS definition at
http://www.w3.org/TR/skos-reference/skos.html#subject, but there is no 
#subject in it anymore. This means: no rdfs:label for a property which is 
ubiquitous in DBpedia.
Am I missing some header option for the content negotiation, or is this a 
problem on the w3.org end?
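
One detail that matters for this lookup: an HTTP client does not send the
fragment (#subject) to the server, so only the namespace document is requested
and the term has to be located in whatever comes back. A small sketch of that
split (the function name is made up for illustration; whether w3.org returns
RDF for this document when asked via an Accept header is not verified here):

```python
from urllib.parse import urldefrag


def split_for_lookup(iri):
    """Split an IRI into the document to request and the fragment to look for."""
    document, fragment = urldefrag(iri)
    return document, fragment


document, fragment = split_for_lookup(
    "http://www.w3.org/2004/02/skos/core#subject"
)
# Only `document` goes over the wire; `fragment` stays on the client side.
```

So any redirect is decided for http://www.w3.org/2004/02/skos/core as a whole,
independent of which term was asked for.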


2. When dereferencing DBpedia URIs I repeatedly found a suspiciously similar 
number of triples per fetched IRI in the local cache: 2001 triples, sometimes 
2002. I remembered: ah, yes... you don't have to return all triples, but just 
"useful ones". I think what currently happens is that on the DBpedia side a cut 
is made after 2000 triples, probably to reduce the traffic overhead.
Still, please go to http://dbpedia.org/data/United_States.rdf (you get there 
by content negotiation from http://dbpedia.org/resource/United_States; the 
HTML page shows different content). Notice that you get almost exclusively 
inverse triples (?s ?p <http://dbpedia.org/resource/United_States>).

There is no rdfs:label, no rdf:type, etc. in it, while all these useful things 
are in the HTML version.
I'm not pointing this out to say that there is a problem in DBpedia. I think 
this is a serious problem of scale. How do you decide what is useful for 
someone dereferencing your URIs? How do you keep unnecessary traffic low at the 
same time?
I think maybe a few standard triples should be included in any case (e.g., 
rdfs:label, rdf:type, ...). Luckily I'm not the one to decide about these 
first-class properties ;) . But is there perhaps a property or header field 
called "responsibleSPARQLendpoint"? With such a property one could at some 
point say "this is what you get; if you need anything else, please contact 
this SPARQL endpoint".
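
The two ideas in this point (checking whether a dereferenced document contains
the basic forward triples, and falling back to a SPARQL endpoint when it does
not) could be sketched as follows. This is only an illustration under
assumptions: triples are plain (subject, predicate, object) string tuples that
some RDF parser has already produced, and the property IRI
http://example.org/vocab#responsibleSPARQLendpoint is invented for this
sketch, since no such property is confirmed to exist.

```python
from urllib.parse import urlencode

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical property, invented for this sketch only.
RESPONSIBLE_ENDPOINT = "http://example.org/vocab#responsibleSPARQLendpoint"


def summarize(triples, resource):
    """Count forward and inverse triples and check for label/type."""
    triples = list(triples)
    forward = [t for t in triples if t[0] == resource]
    inverse = [t for t in triples if t[2] == resource and t[0] != resource]
    predicates = {p for _, p, _ in forward}
    return {
        "forward": len(forward),
        "inverse": len(inverse),
        "has_label": RDFS_LABEL in predicates,
        "has_type": RDF_TYPE in predicates,
    }


def find_endpoint(triples, resource):
    """Return the endpoint advertised for the resource, or None."""
    for s, p, o in triples:
        if s == resource and p == RESPONSIBLE_ENDPOINT:
            return o
    return None


def label_query_url(endpoint, resource):
    """Build a SPARQL protocol GET URL asking for the resource's labels."""
    query = f"SELECT ?label WHERE {{ <{resource}> <{RDFS_LABEL}> ?label }}"
    return endpoint + "?" + urlencode({"query": query})
```

A client would call summarize() on the fetched triples and, if has_label is
False and find_endpoint() returns something, send the URL from
label_query_url() to that endpoint instead of giving up.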


Would love to get some feedback on these problems or on whether my whole 
approach is on the wrong track.

Cheers,
Jörn
Received on Thursday, 5 August 2010 00:38:29 UTC