- From: M. Scott Marshall <mscottmarshall@gmail.com>
- Date: Mon, 8 Nov 2010 16:41:36 +0100
- To: HCLS <public-semweb-lifesci@w3.org>
- Cc: Christoph Grabmueller <grabmuel@ebi.ac.uk>
[Forwarding Christoph's comments on the BioRDF Provenance article draft for discussion. Cheers, Scott]

Hello Scott,

I finally had a look at the provenance draft. A few points:

1) Why is provenance so important? Is the lab where the data came from so significant? Or should I understand provenance more generally as metadata?

2) Many 'things' are represented as strings (e.g. genes), which often makes it impossible to run a federated query against another endpoint. Gene names might be somewhat consistent for HUGO, but what about other species? Also, just the simple variance between 'STEAP2' and 'Steap2' makes a (direct) federated query impossible. Speaking of federation, it is not possible to query over EFO and DO (using their OWL representations), because EFO uses funny strings like 'DOID:14691' instead of the namespace of DO.

3) I like the Excel to RDF converter, but it relies on the user entering correct namespaces, names and database ids from various places in a syntactically correct way. This requires knowing which databases to choose and which is the 'correct' URI (there are many variants to choose from). If people just enter strings, we are not all that far away from MAGE-TAB.

4) It's not possible to 'turn around' some of the queries. In query 4, if I wanted to input a disease instead of the gene into the Diseasome endpoint, where would I get the correct string 'Gastric cancer, 137215' to query with? The federation examples conveniently only use gene labels as input to other endpoints *clears throat* :)

5) A more technical comment on query 4: why is 'FILTER (?brainRegion = neurolex:Entorhinal_cortex )' used? '?sampleList biordf:derives_from_region neurolex:Entorhinal_cortex' would be much faster, since the number of possible result triples is greatly reduced. If ?brainRegion is used in the middle of a query and is not bound, all possible values and combinations are carried through the rest of the query. Only at the end is a huge number of tuples removed by the FILTER.
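
The effect described in point 5 can be illustrated with a toy evaluator. This is only a minimal sketch in Python, assuming a naive engine that joins triple patterns left to right and applies the FILTER at the very end; real SPARQL engines may reorder or rewrite such queries. The triples, the `ex:` names and the `ex:has_gene` predicate are invented for illustration; only `biordf:derives_from_region` and `neurolex:Entorhinal_cortex` are taken from query 4.

```python
# Toy triple store: (subject, predicate, object). All data is invented.
TRIPLES = [
    ("ex:list1", "biordf:derives_from_region", "neurolex:Entorhinal_cortex"),
    ("ex:list2", "biordf:derives_from_region", "neurolex:Hippocampus"),
    ("ex:list3", "biordf:derives_from_region", "neurolex:Visual_cortex"),
    ("ex:list1", "ex:has_gene", "ex:geneA"),
    ("ex:list1", "ex:has_gene", "ex:geneB"),
    ("ex:list2", "ex:has_gene", "ex:geneC"),
    ("ex:list2", "ex:has_gene", "ex:geneD"),
    ("ex:list3", "ex:has_gene", "ex:geneE"),
    ("ex:list3", "ex:has_gene", "ex:geneF"),
]


def match(pattern, binding):
    """Yield extensions of `binding` for every triple matching `pattern`.

    Terms starting with '?' are variables; everything else is a constant.
    """
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its existing value.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def evaluate(patterns, filter_fn=None):
    """Join patterns left to right and apply the filter only at the end.

    Returns (results, total number of intermediate bindings produced).
    """
    bindings = [{}]
    produced = 0
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
        produced += len(bindings)
    if filter_fn is not None:
        bindings = [b for b in bindings if filter_fn(b)]
    return bindings, produced


REGION = "neurolex:Entorhinal_cortex"

# Variant 1: unbound ?brainRegion, restricted by a FILTER at the end.
with_filter, cost_filter = evaluate(
    [("?sampleList", "biordf:derives_from_region", "?brainRegion"),
     ("?sampleList", "ex:has_gene", "?gene")],
    filter_fn=lambda b: b["?brainRegion"] == REGION,
)

# Variant 2: the region is written directly into the triple pattern.
with_constant, cost_constant = evaluate(
    [("?sampleList", "biordf:derives_from_region", REGION),
     ("?sampleList", "ex:has_gene", "?gene")],
)

print(cost_filter, cost_constant)
```

Both variants return the same genes, but in this toy setup the FILTER variant builds bindings for every brain region before discarding most of them, while the constant in the triple pattern prunes the join from the first step.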

I know that most of these points are not specific to the project, but apply to the semantic web in general. There is almost no consistency across datasets, databases or cross references.

Generally I like the data model; it seems intuitive to me. Btw, in the graph at http://biordfmicroarray.googlecode.com/hg/sparql_endpoint.html the edge 'experimentSet dct:isPartOf microarray_experiment' is missing.

What I have so far is a simplified subset of this model, and I don't see any conflicts. The main difference is that I'm using the EFO directly and then linking to DO (only possible after 'fixing' the EFO myself), and that I'm not using gene name strings but official UniProt URIs (e.g. http://purl.uniprot.org/uniprot/P30089).

Regards,
Christoph
Received on Monday, 8 November 2010 15:42:04 UTC