Re: An application of the Semantic Web for finding alternative drug applications from Peter Ansell on 2008-09-11 (public-semweb-lifesci@w3.org from September 2008)

From: Peter Ansell <ansell.peter@gmail.com>
Date: Fri, 12 Sep 2008 08:49:14 +1000 (EST)
To: Amit Sheth <amitpsheth@gmail.com>
Cc: w3c semweb hcls <public-semweb-lifesci@w3.org>
Message-ID: <7963383.211221173349422.JavaMail.peter@Macintosh-2.local>
----- "Amit Sheth" <amitpsheth@gmail.com> wrote:

> Finding "potentially interesting" paths, subgraphs, and patterns in
> semantic web data (e.g. those
> created from complex entity and relationship extraction from
> biomedical literature [1],
> semantic annotation and provenance of experimental data, and of course
> structured databases) is
> very useful in biomedical research
> and requires SPARQL extensions. One of several examples along this
> line is the
> support for path queries as in SPARQ2L [2]. Other interesting examples
> are
> supporting spatio-temporal thematic queries and corresponding
> extensions such as SPARQ-ST
> [3] albeit we have not applied these extensions to sensor data so far
> and not (yet) to biomedical domain.
> 
> Amit
> 
> [1]
> http://knoesis.wright.edu/research/semweb/projects/textMining/ekaw2008/
> [2] http://knoesis.wright.edu/library/resource.php?id=00060
> [3] http://knoesis.org/research/semweb/projects/stt/

The queries that can be done with the path finding are interesting. 

One of them in particular caught my attention:

SELECT ??p 
WHERE  {  ?x   ??p   ?y .  
?x   bio:name   "MTB Surface Molecule" .
?y   rdf:type   bio:Cellular_Response_Event .  
?z   rdf:type   bio:PI3K_Enzyme .  
PathFilter(containsAny(??p, ?z) && cost(??p) == 3 ) } 

I have already been doing things like that using queries similar to the following. They are complex, but they seem to work. Admittedly, these might be just as hard for a SPARQL engine to process as your path-finding algorithm is with in-memory RDF graphs.

CONSTRUCT
{
 ?s1 ?p1 ?o1 .
 ?o1 ?p2 ?o2 .
 ?o2 ?p3 ?o3 .
}
WHERE {
?s1 ?p1 ?o1 .
?s1 bio:name "MTB Surface Molecule" .
?o1 ?p2 ?o2 .
?o2 ?p3 ?o3 .
?o3 rdf:type bio:Cellular_Response_Event .

FILTER( ?s1 != ?o1 && ?s1 != ?o2 && ?s1 != ?o3 && ?o1 != ?o2 && ?o1 != ?o3 && ?o2 != ?o3)

OPTIONAL
{ 
   ?containsAnySubject1 rdf:type   bio:PI3K_Enzyme .
   FILTER(?containsAnySubject1 = ?o1)
}

OPTIONAL
{ 
   ?containsAnySubject2 rdf:type   bio:PI3K_Enzyme .
   FILTER(?containsAnySubject2 = ?o2)
}

FILTER(bound(?containsAnySubject1) || bound(?containsAnySubject2))
}


Of course, each of these queries has a fixed cardinality, and I haven't collected any metrics yet to determine their performance in different scenarios.

I have doubts about the performance of an unconstrained, generalised path-finding algorithm on very large datasets (i.e., hundreds of millions to billions of triples), although it could still be reasonably effective if people use the cost function to limit the path length and possibly name some intermediates that need to be traversed. Having said that, it is a nice abstraction to be able to refer to the path without knowing beforehand how long it is going to be.
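To make the effect of those two constraints concrete, here is a minimal Python sketch of a depth-limited search over an in-memory list of triples. It is only an illustration of the idea, not the SPARQ2L algorithm: the function name find_paths, its parameters, and the toy triples are all hypothetical, and max_len stands in for an upper bound on the cost rather than the exact cost of 3 used in the query above.

```python
def find_paths(triples, start, is_goal, max_len, must_contain=None):
    """Enumerate simple paths of at most max_len edges from start to a goal node.

    must_contain is an optional set of nodes; a path is kept only if it
    touches at least one of them (a rough analogue of containsAny).
    """
    # Index outgoing edges by subject for quick expansion.
    out_edges = {}
    for s, p, o in triples:
        out_edges.setdefault(s, []).append((p, o))

    results = []

    def walk(node, path, visited):
        if path and is_goal(node):
            nodes_on_path = {start} | {o for _, _, o in path}
            if must_contain is None or nodes_on_path & must_contain:
                results.append(list(path))
        if len(path) == max_len:
            return  # the length bound is what keeps the search finite in practice
        for p, o in out_edges.get(node, ()):
            if o in visited:
                continue  # never visit a node twice on the same path
            visited.add(o)
            path.append((node, p, o))
            walk(o, path, visited)
            path.pop()
            visited.discard(o)

    walk(start, [], {start})
    return results


# Toy data, invented purely for illustration.
toy_triples = [
    ("mol", "binds", "rec"),
    ("rec", "activates", "pi3k"),
    ("pi3k", "triggers", "resp"),
    ("mol", "near", "other"),
    ("other", "triggers", "resp"),
    ("rec", "loops", "mol"),
]
toy_paths = find_paths(toy_triples, "mol", lambda n: n == "resp", 3,
                       must_contain={"pi3k"})
```

Without max_len the recursion has to exhaust every simple path reachable from the start node, which is where I would expect the cost to blow up on large graphs.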

For the moment my own queries are cross-dataset, where each of the datasets does not have a large number of interlinks inside, which are easier to optimise than the inter-dataset queries that are in the samples. By that I mean that I know the starting triples will come from UniProt, for example, followed by NCBI GeneID, then NCBI PubMed, followed by MeSH terms, and I can optimise the SPARQL query because I know the order.

Discovering a different path between UniProt and MeSH might be interesting for initial analysis, but you wouldn't have the performance guarantees that you have with a predicate/subject-constrained graph matching pattern like the example above, if only because the whole web of links must be traversed until you know conclusively that you have hit a dead end, or until you have used up your path-length quota.

Some example queries for Bio2RDF (which anyone can add to) can be found at [4], although they are elementary path-finding queries, and in some cases not restricted. Incidentally, I haven't put up any of my own queries yet, as I have a feeling they won't complete within the time allowed on the public Bio2RDF SPARQL endpoints, but I will in future, for reference by people who run their own mirrors.

Have you implemented a direct SPARQ2L->SPARQL converter which generates sets of SPARQL queries that fit a simple type of SPARQ2L query, i.e., one with the path length constrained to n and a single containsAny statement?
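To show what I mean by such a converter, here is a rough Python sketch that emits one fixed-length CONSTRUCT query in the same shape as my hand-written one above. It is not an existing tool: path_query and its parameters are hypothetical names, the PREFIX declarations are left out, and it handles only the single-containsAny case for a path of exactly n edges.

```python
from itertools import combinations


def path_query(n, start_pattern, end_type, contains_type):
    """Build a CONSTRUCT query matching a simple path of exactly n edges (n >= 2).

    start_pattern is a predicate-object pair for the first node, end_type is the
    rdf:type of the last node, and contains_type is the rdf:type that at least
    one intermediate node must have.
    """
    if n < 2:
        raise ValueError("need at least one intermediate node")
    nodes = ["?s1"] + [f"?o{i}" for i in range(1, n + 1)]
    edges = [f"{nodes[i]} ?p{i + 1} {nodes[i + 1]} ." for i in range(n)]
    # Pairwise inequality keeps every node on the path distinct.
    distinct = " && ".join(f"{a} != {b}" for a, b in combinations(nodes, 2))

    lines = ["CONSTRUCT {"]
    lines += [f"  {e}" for e in edges]
    lines += ["}", "WHERE {"]
    lines += [f"  {e}" for e in edges]
    lines.append(f"  ?s1 {start_pattern} .")
    lines.append(f"  {nodes[-1]} rdf:type {end_type} .")
    lines.append(f"  FILTER({distinct})")
    marks = []
    for i in range(1, n):  # intermediate nodes only: ?o1 .. ?o(n-1)
        mark = f"?containsAny{i}"
        marks.append(mark)
        lines.append(
            f"  OPTIONAL {{ {mark} rdf:type {contains_type} . "
            f"FILTER({mark} = ?o{i}) }}"
        )
    lines.append("  FILTER(" + " || ".join(f"bound({m})" for m in marks) + ")")
    lines.append("}")
    return "\n".join(lines)


example_query = path_query(
    3,
    'bio:name "MTB Surface Molecule"',
    "bio:Cellular_Response_Event",
    "bio:PI3K_Enzyme",
)
```

Calling it once for each length from 2 up to n would give the set of queries for "cost at most n". Note that the inequality FILTER grows quadratically, with n(n+1)/2 comparisons for a path of n edges.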

Is there a theoretical reason why FILTERs wouldn't be appropriate in the translation to ensure that nodes are not traversed more than once in a given path?

Cheers,

Peter

[4] http://bio2rdf.wiki.sourceforge.net/Demo+queries
Received on Thursday, 11 September 2008 22:49:57 UTC