Re: Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges from Paolo Romano on 2008-08-21 (public-semweb-lifesci@w3.org from August 2008)

From: Paolo Romano <paolo.romano@istge.it>
Date: Thu, 21 Aug 2008 11:53:55 +0200
To: Phillip Lord <phillip.lord@newcastle.ac.uk>
Cc: Kei Cheung <kei.cheung@yale.edu>, Peter Ansell <ansell.peter@gmail.com>, Marco Roos <M.Roos1@uva.nl>, public-semweb-lifesci <public-semweb-lifesci@w3.org>, mygrid@listserv.manchester.ac.uk, myexperiment-discuss@nongnu.org, Matthias Samwald <samwald@gmx.at>
Message-Id: <200808211008.m7LA8Yac022996@ibm43p.biotech.ist.unige.it>

At 11:31 21/08/2008, Phillip Lord wrote:

> >>>>> "KC" == Kei Cheung <kei.cheung@yale.edu> writes:
>
>   KC> If some journals are requiring raw data (e.g., microarray data) to be
>   KC> submitted to a public data repository, I wonder if workflows that are
>   KC> used to analyze the data should also be submitted to a public workflow
>   KC> repository.
>
>It's a nice idea but doesn't quite allow the same level of repeatability. Most
>Taverna workflows need updating periodically, as the services go offline or
>change their interfaces. Even if they don't, they return different results as
>the implementation changes.
>
>Ultimately, you need to store more than the workflow to allow any degree of
>repeatability. Still, it would be a good step forward which is no bad thing.

You are right, and I think this is a serious problem not only for the
workflow approach to data analysis, but for all bioinformatics
procedures.

We should find a way to fully describe a bioinformatics data analysis
by specifying not only the tools used (software programs, databases
involved, parameters used, I/O), but also meta-information about them,
such as software version and implementation, host operating system,
database version, server software with its version and implementation,
site accessed, date of access, etc.
All this information would at least support a better specification of
the procedure. Repeating the analysis would still be difficult, because
databases are updated frequently and previous releases are hard to keep
on-line.
At the same time, it would be nice to see how the results of an
analysis change over time, as new data become available in the
databases.
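
As a rough illustration of the kind of description meant above, here is
a minimal Python sketch of one analysis step recorded together with its
meta-information. The class AnalysisStep, its field names, and the
helpers to_json and changed_fields are hypothetical, not part of Taverna
or of any existing standard, and all values are made up.

```python
from dataclasses import dataclass, field, asdict, replace
import json
import platform


@dataclass(frozen=True)
class AnalysisStep:
    """One step of an analysis plus its context (hypothetical schema)."""
    tool: str                # software program used
    tool_version: str        # software version / implementation
    database: str            # database involved
    database_release: str    # database version
    accessed_site: str       # site where the service was accessed
    accessed_on: str         # date of access, ISO 8601
    parameters: dict = field(default_factory=dict)
    # Operating system of the machine that creates the record.
    operating_system: str = field(default_factory=platform.platform)


def to_json(step: AnalysisStep) -> str:
    """Serialize a step so it can be stored next to the results."""
    return json.dumps(asdict(step), sort_keys=True, indent=2)


def changed_fields(old: AnalysisStep, new: AnalysisStep) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    a, b = asdict(old), asdict(new)
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}


# Made-up example: the same step run again after a database update.
first_run = AnalysisStep(
    tool="example-aligner",
    tool_version="1.0",
    database="example-db",
    database_release="release-1",
    accessed_site="https://example.org/service",
    accessed_on="2008-07-21",
    parameters={"threshold": 0.01},
)
later_run = replace(first_run, database_release="release-2",
                    accessed_on="2008-08-21")

print(to_json(first_run))
print(changed_fields(first_run, later_run))
```

Comparing two such records shows which parts of the context differ
between runs, which gives a starting point for explaining why the
results differ. It does not make the older run repeatable unless the
older database release is still available.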

Paolo

Paolo Romano (paolo.romano@istge.it)
Bioinformatics
National Cancer Research Institute (IST)
Largo Rosanna Benzi, 10, I-16132, Genova, Italy
Tel: +39-010-5737-288  Fax: +39-010-5737-295

Received on Thursday, 21 August 2008 09:55:39 UTC