Re: RDF Provenance for microarray data from M. Scott Marshall on 2010-08-20 (public-semweb-lifesci@w3.org from August 2010)

From: M. Scott Marshall <mscottmarshall@gmail.com>
Date: Thu, 19 Aug 2010 17:30:40 -0700
To: Tom Morris <tfmorris@gmail.com>
Cc: HCLS <public-semweb-lifesci@w3.org>, ostolop@ebi.ac.uk, James Malone <malone@ebi.ac.uk>, Helen Parkinson <parkinson@ebi.ac.uk>
Message-ID: <AANLkTimFMjBfJnjZkSxbqrmVnMEL-4tVhLAoPbdNwqL0@mail.gmail.com>

I agree with Tom on the need to record whatever provenance you can.
That's why it's particularly useful that James Malone (CC'd) told us
about the Software Ontology http://www.ebi.ac.uk/efo/swo that is being
developed in the context of the Experimental Factor Ontology work at
EBI, quoting from the web page:

<QUOTE>
The software ontology (SWO) was a project initiated by Dr Helen
Parkinson and Dr James Malone at EBI and implemented by Nandini
Badarinarayan to describe software used in bioinformatics. The SWO
describes components of software such as the software type, the
manufacturer, the data inputs and outputs and the objectives of the
software.

The SWO uses a slim version of the Basic Formal Ontology (BFO) upper
ontology and subclasses and relations from the Experimental Factor
Ontology (EFO), the Ontology of Biomedical Investigations (OBI) and
the Information Artifact Ontology (IAO).
</QUOTE>

For example, the identifier for (the BioConductor implementation of)
LIMMA in SWO is http://www.ebi.ac.uk/efo/swo/SWO_0000593 .

If you find the above identifier along with a gene list that is
associated with a microarray study article (imagine for a moment that
a gene list is provided in the associated MAGE-TAB of the data), it is
far better than having to guess at the gene list yourself or having to
read the article to decide if you want to use the gene list. Suppose
that you 1) have access to gene lists and 2) prefer to only make use
of gene lists produced by LIMMA. Then, you can encode your inclusion
criteria into a SPARQL query.
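
As a rough sketch of what such an inclusion criterion could look like
(not an actual Atlas or ArrayExpress query): only the SWO identifier
for LIMMA comes from the text above. The ex: namespace, the
ex:GeneList class and the ex:producedBy predicate are hypothetical
placeholders, since no agreed vocabulary for linking a gene list to
its software exists in this thread. The small Python function mirrors
the same pattern over an in-memory list of triples, so the sketch runs
without an RDF library.

```python
# Only the LIMMA identifier below comes from SWO; everything under EX
# is a made-up placeholder vocabulary (an assumption for illustration).
LIMMA = "http://www.ebi.ac.uk/efo/swo/SWO_0000593"
EX = "http://example.org/provenance#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# SPARQL form of the criterion "gene lists produced by LIMMA".
QUERY = f"""
PREFIX ex: <{EX}>
SELECT ?geneList WHERE {{
  ?geneList a ex:GeneList ;
            ex:producedBy <{LIMMA}> .
}}
"""


def gene_lists_produced_by(triples, software_iri):
    """Return subjects typed ex:GeneList that have ex:producedBy software_iri."""
    triples = set(triples)
    typed = {s for (s, p, o) in triples
             if p == RDF_TYPE and o == EX + "GeneList"}
    made = {s for (s, p, o) in triples
            if p == EX + "producedBy" and o == software_iri}
    return sorted(typed & made)


# Toy data: two gene lists, only one of them attributed to LIMMA.
TRIPLES = [
    ("http://example.org/list/1", RDF_TYPE, EX + "GeneList"),
    ("http://example.org/list/1", EX + "producedBy", LIMMA),
    ("http://example.org/list/2", RDF_TYPE, EX + "GeneList"),
    ("http://example.org/list/2", EX + "producedBy", EX + "SomeOtherTool"),
]

print(QUERY)
print(gene_lists_produced_by(TRIPLES, LIMMA))
```

With real data, the QUERY string would be sent to a SPARQL endpoint or
evaluated by an RDF library instead of the toy filter, and the
placeholder terms would be replaced by whatever vocabulary the data
provider actually uses.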

BTW, Gene Atlas http://www.ebi.ac.uk/gxa/ provides gene lists that
have been uniformly selected using LIMMA from a subset of
ArrayExpress. Currently, it is possible to access some of the service
output as, for example, a list of strings in JSON format. Misha
Kapushesky (CC'd) and colleagues are interested in eventually
providing RDF renderings of the data as well.

-Scott

On Thu, Jul 22, 2010 at 1:49 PM, Tom Morris <tfmorris@gmail.com> wrote:
> This discussion about provenance:
>
>  "Lena: But software packages change so any reference to the software
> will be stale over the years.
>
>  "Scott: Many types of provenance will go stale but essential
> information about the origins of the information (provenance), such as
> the method used to produce the p-values, is important to anyone
> reusing the data. They want to know whether it's from LIMMA or MANOVA,
> just as they want to know Affy vs. other types of arrays."
>
> seems to assume that provenance information will unavoidably get stale.
>
> I don't think that needs to be the case.  With a little forethought, I
> think one can collect enough information that you have a good chance
> of unambiguously identifying something like a software package.  If
> rather than "LIMMA" you record something like "LIMMA v3.4.4 Windows
> 64-bit" (or even better, a structured version of that), you should be
> able to trace even things which are version specific or
> platform/compiler specific.  If the package has multiple methods that
> might have been used for a task, include a reference to the
> method/process also.
>
> Tom

-- 
M. Scott Marshall, W3C HCLS IG co-chair
Leiden University Medical Center / University of Amsterdam
http://staff.science.uva.nl/~marshall

Received on Friday, 20 August 2010 00:31:08 UTC