- From: mdmiller <mdmiller53@comcast.net>
- Date: Mon, 8 Nov 2010 08:02:28 -0800
- To: "M. Scott Marshall" <mscottmarshall@gmail.com>, "Christoph Grabmueller" <grabmuel@ebi.ac.uk>
- Cc: "HCLS" <public-semweb-lifesci@w3.org>
hi all,

my comments in line,

cheers,
michael

----- Original Message -----
From: "M. Scott Marshall" <mscottmarshall@gmail.com>
To: "HCLS" <public-semweb-lifesci@w3.org>
Cc: "Christoph Grabmueller" <grabmuel@ebi.ac.uk>
Sent: Monday, November 08, 2010 7:41 AM
Subject: [BioRDF] Comments from Christoph Grabmuller on BioRDF microarray provenance

[Forwarding Christoph's comments on the BioRDF Provenance article draft for discussion. Cheers, Scott]

Hello Scott,

I finally had a look at the provenance draft, a few points:

1) Why is provenance so important? Is the lab where the data came from so significant? Or should I understand provenance more generally as metadata?

* as one of the levels of provenance, the lab is extremely important. keith baggerly at the MD Anderson Cancer Center and his group have shown that the data from some published microarray experiments not only cannot be reproduced but is actually wrong. if follow-up shows this for one experiment from a lab, then any other experiment from that lab would be suspect. --mm

2) Many 'things' are represented as strings (e.g. genes), which often makes it impossible to run a federated query against another endpoint. Gene names might be somewhat consistent for HUGO, but what about other species? Also, just the simple variance between 'STEAP2' and 'Steap2' makes a (direct) federated query impossible.

* actually, HGNC Gene Symbols and entrez accessions are very stable. for ArrayExpress, the ADF file will usually map to one or both of these identifiers. in practice, i've not seen this to be a problem but for the paper we didn't go far enough. --mm

Speaking of federation, it is not possible to query over EFO and DO (using their OWL representations), because EFO uses funny strings like 'DOID:14691' instead of the namespace of DO.

* yes, this type of thing is annoying but if the transformation is straight-forward, it works well as a rule.
--mm

3) I like the Excel to RDF converter, but it relies on the user entering correct namespaces, names and database ids from various places in a syntactically correct way. This requires knowledge of the correct databases to choose and the 'correct' URI (many variants to choose from). If people just enter strings we are not all that far away from MAGE-TAB.

* i'm involved in an open source project, Annotare, that seeks to put a nice UI on top of creating MAGE-TAB documents for a bench scientist. part of that is use of the NCBO tools to make it easy for the creator of the document to go fetch the appropriate term from the appropriate ontology/vocabulary. version one has support for EFO built-in, one of the main goals for version 2 is to make this much easier and much broader. --mm

4) It's not possible to 'turn around' some of the queries. In query 4, if I wanted to input a disease instead of the gene into the Diseasome endpoint, where would I get the correct string 'Gastric cancer, 137215' from to query? The federation examples conveniently only use gene labels as input to other endpoints *clears throat* :)

5) A more technical comment to query 4: Why is 'FILTER (?brainRegion = neurolex:Entorhinal_cortex )' used? '?sampleList biordf:derives_from_region neurolex:Entorhinal_cortex' would be much faster, since the number of possible result triples is reduced greatly. If ?brainRegion is used in the middle of a query and is not bound, all possible values and combinations are carried through the query. Only at the end is a huge number of tuples removed by FILTER.

I know that most of the points are not specific to the project, but to the semantic web in general. There is almost no consistency across datasets, databases or cross references.

Generally I like the data model, seems intuitive to me. Btw in the graph at http://biordfmicroarray.googlecode.com/hg/sparql_endpoint.html the edge 'experimentSet dct:isPartOf microarray_experiment' is missing.
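
The difference described in point 5 can be sketched with a toy evaluator. This is a minimal illustration only, assuming a naive engine that evaluates patterns in order and applies FILTER last; many real SPARQL engines may rewrite an equality FILTER into a bound pattern themselves. The triples and the `lists_gene` predicate are made up for the example and are not taken from the BioRDF endpoint.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy data; only derives_from_region and Entorhinal_cortex
# echo names from the message, everything else is invented.
TRIPLES: List[Triple] = [
    ("sampleList1", "derives_from_region", "Entorhinal_cortex"),
    ("sampleList2", "derives_from_region", "Hippocampus"),
    ("sampleList3", "derives_from_region", "Entorhinal_cortex"),
    ("sampleList4", "derives_from_region", "Frontal_cortex"),
    ("sampleList1", "lists_gene", "geneA"),
    ("sampleList1", "lists_gene", "geneB"),
    ("sampleList2", "lists_gene", "geneC"),
    ("sampleList3", "lists_gene", "geneD"),
    ("sampleList4", "lists_gene", "geneE"),
    ("sampleList4", "lists_gene", "geneF"),
]

TARGET = "Entorhinal_cortex"


def genes_of(sample_list: str) -> List[str]:
    """Return all genes attached to one sample list."""
    return [o for s, p, o in TRIPLES if s == sample_list and p == "lists_gene"]


def filter_late() -> Tuple[List[str], int]:
    """Leave ?brainRegion unbound, join everything, then filter at the end."""
    rows = [
        (s, region, gene)
        for s, p, region in TRIPLES
        if p == "derives_from_region"
        for gene in genes_of(s)
    ]
    genes = sorted(gene for _, region, gene in rows if region == TARGET)
    return genes, len(rows)


def bind_early() -> Tuple[List[str], int]:
    """Put the region constant directly in the triple pattern before joining."""
    rows = [
        (s, gene)
        for s, p, region in TRIPLES
        if p == "derives_from_region" and region == TARGET
        for gene in genes_of(s)
    ]
    return sorted(gene for _, gene in rows), len(rows)


if __name__ == "__main__":
    print(filter_late())
    print(bind_early())
```

Both functions return the same genes; the second element of each result is the number of intermediate rows the naive evaluator had to build, which is smaller when the region is bound in the pattern.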
What I have so far is a simplified subset of this model, I don't see any conflicts. The main difference is that I'm using the EFO directly, and then link to DO (only possible after 'fixing' the EFO myself); and I'm not using gene name strings but official UniProt URIs (http://purl.uniprot.org/uniprot/P30089).

Regards,
Christoph
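
The kind of 'fixing' mentioned for the EFO strings might look roughly like the sketch below. It is not Christoph's actual procedure, which the message does not describe. It assumes that DO terms are identified by OBO-style PURLs of the form http://purl.obolibrary.org/obo/DOID_<number>; the namespace used by the DO OWL file at the time may have differed. The function names are hypothetical.

```python
import re

# Assumption: DO term URIs follow the OBO PURL pattern DOID_<number>.
DO_NAMESPACE = "http://purl.obolibrary.org/obo/DOID_"

_DOID_PATTERN = re.compile(r"^DOID:(\d+)$")


def doid_string_to_uri(value: str) -> str:
    """Turn a string such as 'DOID:14691' into a DO term URI."""
    match = _DOID_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a DOID string: {value!r}")
    return DO_NAMESPACE + match.group(1)


def same_gene_label(a: str, b: str) -> bool:
    """Compare two gene labels ignoring case and surrounding whitespace.

    Caveat: this treats 'STEAP2' and 'Steap2' as equal, which also hides
    any species distinction the capitalisation may have carried.
    """
    return a.strip().casefold() == b.strip().casefold()


if __name__ == "__main__":
    print(doid_string_to_uri("DOID:14691"))
    print(same_gene_label("STEAP2", "Steap2"))
```

A rule like this only helps when the mapping is purely syntactic; identifiers such as UniProt URIs avoid the label-matching problem altogether.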
Received on Monday, 8 November 2010 16:03:10 UTC