RE: Uniprot RDF in RDF Gateway from Geoff Chappell on 2005-07-05 (public-semweb-lifesci@w3.org from July 2005)

From: Geoff Chappell <gchappell@intellidimension.com>
Date: Mon, 4 Jul 2005 20:26:08 -0400
To: "'Eric Jain'" <Eric.Jain@isb-sib.ch>
Cc: <public-semweb-lifesci@w3.org>
Message-ID: <004201c580f8$2510a620$6501a8c0@gsclaptop>

> -----Original Message-----
> From: Eric Jain [mailto:Eric.Jain@isb-sib.ch]
> Sent: Monday, July 04, 2005 12:33 PM
> To: Geoff Chappell
> Cc: public-semweb-lifesci@w3.org
> Subject: Re: Uniprot RDF in RDF Gateway
> 
> Geoff Chappell wrote:
> > I've added an experimental sparql interface - details at:
> >
> > 	http://labs.intellidimension.com/uniprot/query2.rsp
> 
> Great, quite impressive!
> 
> Is it possible (and efficient) to use this system to retrieve large data
> sets (thousands to millions of triples)?

To some degree... query results and intermediate products are currently
in-memory only (something we're addressing in a fall release) - so you're
somewhat limited by the characteristics of your machine, the complexity of
your query and rules, and the amount of data you have. 

That said, the scripting language gives you some flexibility in retrieving
massive datasets. For example, you could do something like this to obtain
concise bounded descriptions of all human proteins (reasonably efficiently):


use UNIPROT;

import "/std/ns.rql";
import "/std/cbd.rql";

session.namespaces["uni"] = "urn:lsid:uniprot.org:ontology:";

rulebase trans{
	infer {[rdfs:subClassOf] ?a ?c} from {[rdfs:subClassOf] ?a ?b} 
		and {[rdfs:subClassOf] ?b ?c};
}

var dsUni = datasource("uniprot");
var rsSub = (select ?c using uniprot rulebase trans 
	where {[rdfs:subClassOf] [urn:lsid:uniprot.org:taxonomy:9606] ?c} 
		or ?c=[urn:lsid:uniprot.org:taxonomy:9606]);
for (;!rsSub.EOF; rsSub.moveNext())
{
	//for each superclass of human get a simple cursor 
	//(just walks an index - no memory usage)
	var rs = dsUni.getCursor(resource("uni:organism"), null, rsSub[0]);
	for (;!rs.EOF;rs.moveNext())
	{
		//get a concise bounded description for the resource 
		//(includes reifications about resource)
		var ds = datasource((select ?p ?s ?o using uniprot rulebase cbd
			where description(?p ?s ?o #(rs[2]))));
		var s = ds.format("application/ntriples");
		
		//append it to a file, write it out, etc.
		//...		
	}
}
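
For anyone who wants to see what the "trans" rulebase and the rsSub query
compute without RDF Gateway at hand, here is a rough sketch in plain Python
(not RDFQL). It only illustrates the idea: start from one class and collect
it plus everything reachable over subClassOf. The function name and the toy
"taxon:" identifiers are made up for illustration and are not part of the
Uniprot data or of the script above.

```python
from collections import defaultdict

def superclasses(start, sub_class_of):
    """Return `start` plus every class reachable from it via subClassOf.

    `sub_class_of` is an iterable of (child, parent) pairs.
    """
    # Index the direct parents of each class.
    parents = defaultdict(set)
    for child, parent in sub_class_of:
        parents[child].add(parent)

    # Walk upwards; `seen` doubles as the result and as a cycle guard.
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical toy hierarchy (placeholder identifiers, not real LSIDs).
edges = [
    ("taxon:human", "taxon:primates"),
    ("taxon:primates", "taxon:mammals"),
    ("taxon:mouse", "taxon:mammals"),
]
result = superclasses("taxon:human", edges)
```

Each member of the result would then play the role of rsSub[0] in the outer
loop, i.e. one index walk per class rather than one large in-memory join.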

> Many people are interested in obtaining subsets of our data (e.g. only
> human proteins), so that's another interesting use case.

See above example.

Of course, it'd probably make sense to set a query governor - e.g.:

	Session.maxQueryComplexity = 5000000;

and let the original query rip as one - there's a good chance that for this
example it would be ok on a reasonable box.

Best,

Geoff

Received on Tuesday, 5 July 2005 01:05:11 UTC