[BioRDF] Comments from Christoph Grabmueller on BioRDF microarray provenance from M. Scott Marshall on 2010-11-08 (public-semweb-lifesci@w3.org from November 2010)

From: M. Scott Marshall <mscottmarshall@gmail.com>
Date: Mon, 8 Nov 2010 16:41:36 +0100
To: HCLS <public-semweb-lifesci@w3.org>
Cc: Christoph Grabmueller <grabmuel@ebi.ac.uk>
Message-ID: <AANLkTikG6FfnUCcctPVtDz+OLGVG6Wb3NQCnyvckyHRq@mail.gmail.com>

[Forwarding Christoph's comments on the BioRDF Provenance article
draft for discussion.  Cheers, Scott]

Hello Scott,

I finally had a look at the provenance draft, a few points:
1) Why is provenance so important? Is the lab where the data came from
so significant? Or should I understand provenance more generally as
metadata?

2) Many 'things' are represented as strings (e.g. genes), which makes
it often impossible to run a federated query against another endpoint.
Gene names might be somewhat consistent for HUGO, but what about other
species? Also, just the simple variance between 'STEAP2' and 'Steap2'
makes a (direct) federated query impossible.

Speaking of federation, it is not possible to query over EFO and DO
(using their OWL representations), because EFO uses funny strings like
'DOID:14691' instead of the namespace of DO.

3) I like the Excel to RDF converter, but it relies on the user
entering correct namespaces, names and database ids from various
places in a syntactically correct way. This requires knowledge of the
correct databases to choose and the 'correct' URI (many variants to
choose from).
If people just enter strings we are not all that far away from MAGE-TAB.

4) It's not possible to 'turn around' some of the queries. In query 4, if
I wanted to input a disease instead of the gene into the Diseasome
endpoint, where would I get the correct string 'Gastric cancer,
137215' from to query? The federation examples conveniently only use
gene labels as input to other endpoints *clears throat* :)

5) A more technical comment to query 4:
Why is 'FILTER (?brainRegion = neurolex:Entorhinal_cortex )' used?
'?sampleList    biordf:derives_from_region      neurolex:Entorhinal_cortex'
would be much faster, since the number of possible result triples is
reduced greatly. If ?brainRegion is used in the middle of a query and
is not bound, all possible values and combinations are used throughout
the query. Only at the end is a huge number of tuples removed by
FILTER.
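
The difference in point 5 can be sketched with a toy in-memory triple store. This is only an illustration under stated assumptions: the data is made up, the predicate `has_gene` is hypothetical, and the evaluator is a naive left-to-right join. Real SPARQL engines may push the filter down themselves, so the gap can be smaller in practice.

```python
# Toy triple store: (subject, predicate, object) tuples. All data is invented
# for illustration; "has_gene" is a hypothetical predicate, not from the model.
REGIONS = ["Entorhinal_cortex", "Hippocampus", "Frontal_cortex", "Cerebellum"]

triples = []
for i in range(40):
    sample = f"sampleList{i}"
    triples.append((sample, "derives_from_region", REGIONS[i % len(REGIONS)]))
    for g in range(5):
        triples.append((sample, "has_gene", f"gene{i}_{g}"))


def match(pattern, binding):
    """Yield extended bindings for one triple pattern ('?x' terms are variables)."""
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def evaluate(patterns):
    """Naively join patterns left to right.

    Returns (solutions, number of intermediate bindings produced).
    """
    solutions, produced = [{}], 0
    for pattern in patterns:
        solutions = [b for s in solutions for b in match(pattern, s)]
        produced += len(solutions)
    return solutions, produced


target = "Entorhinal_cortex"

# Variant A: ?brainRegion stays unbound, rows are discarded at the end (FILTER).
all_rows, cost_filter = evaluate([
    ("?sampleList", "derives_from_region", "?brainRegion"),
    ("?sampleList", "has_gene", "?gene"),
])
filtered = [row for row in all_rows if row["?brainRegion"] == target]

# Variant B: the constant is written directly into the triple pattern.
bound, cost_bound = evaluate([
    ("?sampleList", "derives_from_region", target),
    ("?sampleList", "has_gene", "?gene"),
])

print(len(filtered), len(bound), cost_filter, cost_bound)
```

Both variants return the same rows, but the bound pattern carries far fewer intermediate bindings through the join.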


I know that most of the points are not specific to the project, but to
the semantic web in general. There is almost no consistency across
datasets, databases or cross references.

Generally I like the data model, seems intuitive to me. Btw in the
graph at http://biordfmicroarray.googlecode.com/hg/sparql_endpoint.html
the edge 'experimentSet dct:isPartOf    microarray_experiment' is
missing.
What I have so far is a simplified subset of this model, I don't see
any conflicts. The main difference is that I'm using the EFO directly,
and then link to DO (only possible after 'fixing' the EFO myself); and
I'm not using gene name strings but official UniProt URIs
(http://purl.uniprot.org/uniprot/P30089).
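
A minimal sketch of the kind of identifier 'fixing' discussed above, not the actual code used. It assumes the OBO PURL pattern as the DO namespace (the namespace in a given DO release may differ), and the function names `doid_to_uri` and `same_label` are made up for this example.

```python
import re

# Assumption: OBO PURL pattern for Disease Ontology terms. Check the namespace
# of the DO release you actually link against.
DO_NAMESPACE = "http://purl.obolibrary.org/obo/DOID_"

_DOID_PATTERN = re.compile(r"^DOID:(\d+)$")


def doid_to_uri(curie):
    """Turn a string such as 'DOID:14691' into a URI in the assumed DO namespace."""
    m = _DOID_PATTERN.match(curie.strip())
    if m is None:
        raise ValueError(f"not a DOID string: {curie!r}")
    return DO_NAMESPACE + m.group(1)


def same_label(a, b):
    """Compare two gene labels while ignoring case and surrounding whitespace."""
    return a.strip().casefold() == b.strip().casefold()


print(doid_to_uri("DOID:14691"))
print("STEAP2" == "Steap2", same_label("STEAP2", "Steap2"))
```

Case folding only papers over the 'STEAP2' vs 'Steap2' variance; it does not tell apart genes of different species that share a label, which is why URIs remain the safer join key.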

Regards,
Christoph

Received on Monday, 8 November 2010 15:42:04 UTC