Oracle Uniprot RDF data set and benchmarks

Hi Susie,

Thanks for the discussion regarding Oracle's RDF support.

One of the topics of conversation at the recent F2F was the 
desire to use real-world data sets, as opposed to the LUBM 
graphs, when performing benchmarks. Others mentioned this as 
well, but I specifically recall my conversations with Sean 
Martin.

In Oracle's recent VLDB paper [1], the authors mention that a 
subset of Uniprot, consisting of 80 million triples, was used 
for benchmarking purposes. I was not able to find a pointer to 
this data, though. Is this graph available for download? Since 
the paper also proposes several queries for the Uniprot graph, 
I think it would make sense to build on this work for future 
benchmarks.

How the Uniprot subgraph is derived can introduce a lot of 
variability into results (e.g. by minimizing literals). The 
Uniprot RDF is also updated as the Uniprot database changes, 
so it is a moving target. We will therefore want to maintain a 
local copy of this extract (on the wiki?) so that changes in 
the graph don't change the benchmarking results.
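
As a rough sketch of what I mean by pinning an extract, the 
Python below takes a deterministic subset of a triple dump and 
computes a checksum that anyone could use to confirm they are 
benchmarking the same graph. Everything here is an assumption 
on my part, not how the VLDB extract was built: it presumes the 
data has already been converted to N-Triples (one triple per 
line) with stable blank node labels, and the function names 
are made up for illustration.

```python
import hashlib


def extract_triples(lines, limit):
    """Return up to `limit` distinct triple lines, in sorted order.

    Assumes N-Triples style input (one triple per line). Sorting
    makes the result independent of the order of the input dump.
    Note: this holds all distinct lines in memory, so it is only
    a sketch for modest inputs.
    """
    seen = set()
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        seen.add(line)
    return sorted(seen)[:limit]


def fingerprint(triples):
    """SHA-256 hex digest over the triples, one per line."""
    digest = hashlib.sha256()
    for triple in triples:
        digest.update(triple.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Publishing the digest alongside the extract would let each of 
us check that a downloaded copy matches before running any 
queries against it.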

I suspect the entire Uniprot graph is not practical for most 
of us: so far, I have been unable to load even the main file 
on its own (~300 million triples).

I am currently using my own extract of the Uniprot data, ~125 
million triples, to benchmark several triplestores in main 
memory, but I would rather we share one common extract for 
benchmarking purposes in our community. Since your group has 
already published on the capabilities of the 10g product, this 
seemed a logical starting point.

Curious what others think.

Thanks,
Ian

[1] http://www.oracle.com/technology/tech/semantic_technologies/pdf/vldb_2005.pdf

Received on Wednesday, 8 February 2006 10:24:51 UTC