Re: [BioRDF] Comments from Christoph Grabmuller on BioRDF microarray provenance from mdmiller on 2010-11-08 (public-semweb-lifesci@w3.org from November 2010)

From: mdmiller <mdmiller53@comcast.net>
Date: Mon, 8 Nov 2010 08:02:28 -0800
To: "M. Scott Marshall" <mscottmarshall@gmail.com>, "Christoph Grabmueller" <grabmuel@ebi.ac.uk>
Cc: "HCLS" <public-semweb-lifesci@w3.org>
Message-ID: <D549649665B54BC89D619AB82D9D61DC@mmPC>
hi all,

my comments in line,

cheers,
michael

----- Original Message ----- 
From: "M. Scott Marshall" <mscottmarshall@gmail.com>
To: "HCLS" <public-semweb-lifesci@w3.org>
Cc: "Christoph Grabmueller" <grabmuel@ebi.ac.uk>
Sent: Monday, November 08, 2010 7:41 AM
Subject: [BioRDF] Comments from Christoph Grabmuller on BioRDF microarray 
provenance


[Forwarding Christoph's comments on the BioRDF Provenance article
draft for discussion.  Cheers, Scott]

Hello Scott,

I finally had a look at the provenance draft, a few points:
1) Why is provenance so important? Is the lab where the data came from
so significant? Or should I understand provenance more generally as
metadata?

* as one of the levels of provenance, the lab is extremely important. keith
baggerly at the MD Anderson Cancer Center and his group have shown that some
published microarray experiment data not only cannot be reproduced but is
actually wrong.  if followup shows this for one experiment from a lab, then
any other experiment from that lab would be suspect.
--mm


2) Many 'things' are represented as strings (e.g. genes), which makes
it often impossible to run a federated query against another endpoint.
Gene names might be somewhat consistent for HUGO, but what about other
species? Also, just the simple variance between 'STEAP2' and 'Steap2'
makes a (direct) federated query impossible.

* actually, HGNC Gene Symbols and entrez accessions are very stable.  for
ArrayExpress, the ADF file will usually map to one or both of these
identifiers.  in practice, i've not seen this to be a problem but for the
paper we didn't go far enough.
--mm
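
To illustrate the 'STEAP2' vs 'Steap2' point, here is a minimal sketch,
assuming the gene labels from two endpoints have already been fetched as
plain strings. The function names and sample labels are hypothetical, and
case-folding is only one possible normalization: human and mouse symbols
conventionally differ only in case, so folding would also merge orthologs.

```python
def normalize_symbol(label: str) -> str:
    """Fold a gene label to a comparison key (trim whitespace, ignore case)."""
    return label.strip().casefold()


def join_on_symbol(left_labels, right_labels):
    """Pair labels from two result sets that agree after normalization."""
    index = {}
    for label in right_labels:
        index.setdefault(normalize_symbol(label), []).append(label)
    pairs = []
    for label in left_labels:
        for match in index.get(normalize_symbol(label), []):
            pairs.append((label, match))
    return pairs


# Hypothetical labels as two endpoints might return them.
endpoint_a = ["STEAP2", "TP53"]
endpoint_b = ["Steap2", "Brca1"]

# A direct string comparison finds nothing in common.
exact = [(a, b) for a in endpoint_a for b in endpoint_b if a == b]
# Joining on the folded key recovers the STEAP2 / Steap2 pair.
folded = join_on_symbol(endpoint_a, endpoint_b)

print(exact)
print(folded)
```

Mapping both sides to a stable identifier (HGNC symbol, entrez accession or
a URI) avoids the problem altogether; the folding above is only a fallback
for when all you have is labels.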

Speaking of federation, it is not possible to query over EFO and DO
(using their OWL representations), because EFO uses funny strings like
'DOID:14691' instead of the namespace of DO.

* yes, this type of thing is annoying but if the transformation is
straight-forward, it works well as a rule.
--mm
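
A minimal sketch of such a transformation rule, turning an identifier
string like 'DOID:14691' into a URI. The namespace below is an assumption
(the OBO PURL form); substitute whatever namespace the DO OWL file in use
actually declares. The function name is hypothetical.

```python
import re

# Assumed namespace for Disease Ontology terms; not taken from the thread.
DO_NAMESPACE = "http://purl.obolibrary.org/obo/DOID_"

# Matches strings of the form 'DOID:' followed by digits only.
_DOID_PATTERN = re.compile(r"^DOID:(\d+)$")


def doid_to_uri(curie: str, namespace: str = DO_NAMESPACE) -> str:
    """Rewrite a 'DOID:<number>' string as a URI in the given namespace."""
    match = _DOID_PATTERN.match(curie.strip())
    if match is None:
        raise ValueError(f"not a DOID identifier: {curie!r}")
    return namespace + match.group(1)


print(doid_to_uri("DOID:14691"))
```

Anything that does not look like a DOID string raises an error instead of
being passed through, so that a bad mapping does not silently end up in a
query.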

3) I like the Excel to RDF converter, but it relies on the user
entering correct namespaces, names and database ids from various
places in a syntactically correct way. This requires knowledge of the
correct databases to choose and the 'correct' uri (many variants to
choose from).
If people just enter strings we are not all that far away from MAGE-TAB.

* i'm involved in an open source project, Annotare, that seeks to put a nice
UI on top of creating MAGE-TAB documents for a bench scientist.  part of
that is use of the NCBO tools to make it easy for the creator of the
document to go fetch the appropriate term from the appropriate
ontology/vocabulary.  version one has support for EFO built-in, one of the
main goals for version 2 is to make this much easier and much broader.
--mm

4) It's not possible to 'turn around' some of the queries. In query 4, if
I wanted to input a disease instead of the gene into the Diseasome
endpoint, where would I get the correct string 'Gastric cancer,
137215' from to query? The federation examples conveniently only use
gene labels as input to other endpoints *clears throat* :)

5) A more technical comment to query 4:
Why is 'FILTER (?brainRegion = neurolex:Entorhinal_cortex )' used?
'?sampleList biordf:derives_from_region neurolex:Entorhinal_cortex'
would be much faster, since the number of possible result triples is
reduced greatly. If ?brainRegion is used in the middle of a query and
is not bound, all possible values and combinations are carried throughout
the query. Only at the end is a huge number of tuples removed by
FILTER.
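
The effect can be sketched with a toy evaluator that joins triple patterns
naively from left to right and counts the intermediate bindings. This is an
illustration under stated assumptions, not how any particular SPARQL engine
works (an optimizer may push the FILTER down by itself). The data, the
'ex:' names and the 'ex:has_gene' predicate are made up.

```python
TARGET = "neurolex:Entorhinal_cortex"
REGION = "biordf:derives_from_region"
HAS_GENE = "ex:has_gene"  # hypothetical predicate, for illustration only

# Toy data: 20 sample lists, only the first 2 from the entorhinal cortex.
triples = []
for i in range(20):
    region = TARGET if i < 2 else "neurolex:Hippocampus"
    triples.append((f"ex:sample{i}", REGION, region))
    triples.append((f"ex:sample{i}", HAS_GENE, f"ex:gene{i}"))


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def evaluate(patterns, data):
    """Naive left-to-right join; returns (solutions, intermediate rows)."""
    solutions, intermediate = [{}], 0
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in data:
                new = match_pattern(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        solutions = extended
        intermediate += len(solutions)
    return solutions, intermediate


# Variant 1: unbound ?brainRegion, filtered only at the end.
rows, filter_cost = evaluate(
    [("?sampleList", REGION, "?brainRegion"), ("?sampleList", HAS_GENE, "?gene")],
    triples,
)
filtered = [r for r in rows if r["?brainRegion"] == TARGET]

# Variant 2: the region is given directly in the triple pattern.
bound, bound_cost = evaluate(
    [("?sampleList", REGION, TARGET), ("?sampleList", HAS_GENE, "?gene")],
    triples,
)

print(len(filtered), filter_cost)
print(len(bound), bound_cost)
```

Both variants return the same sample lists and genes; the difference is only
in how many intermediate rows are carried along before the unwanted ones are
dropped.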


I know that most of the points are not specific to the project, but to
the semantic web in general. There is almost no consistency across
datasets, databases or cross references.

Generally I like the data model, seems intuitive to me. Btw in the
graph at http://biordfmicroarray.googlecode.com/hg/sparql_endpoint.html
the edge 'experimentSet dct:isPartOf microarray_experiment' is
missing.
What I have so far is a simplified subset of this model, I don't see
any conflicts. The main difference is that I'm using the EFO directly,
and then link to DO (only possible after 'fixing' the EFO myself); and
I'm not using gene name strings but official uniprot URIs
(http://purl.uniprot.org/uniprot/P30089).

Regards,
Christoph
Received on Monday, 8 November 2010 16:03:10 UTC