Re: Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges from Kei Cheung on 2008-08-21 (public-semweb-lifesci@w3.org from August 2008)

From: Kei Cheung <kei.cheung@yale.edu>
Date: Thu, 21 Aug 2008 18:07:27 -0400
To: Paolo Romano <paolo.romano@istge.it>
CC: Phillip Lord <phillip.lord@newcastle.ac.uk>, Peter Ansell <ansell.peter@gmail.com>, Marco Roos <M.Roos1@uva.nl>, public-semweb-lifesci <public-semweb-lifesci@w3.org>, mygrid@listserv.manchester.ac.uk, myexperiment-discuss@nongnu.org, Matthias Samwald <samwald@gmx.at>
Message-ID: <48ADE71F.3060004@yale.edu>

Paolo Romano wrote:

> At 11:31 21/08/2008, Phillip Lord wrote:
>
>> >>>>> "KC" == Kei Cheung <kei.cheung@yale.edu> writes:
>>
>>   KC> If some journals are requiring raw data (e.g., microarray data) to be
>>   KC> submitted to a public data repository, I wonder if workflows that are
>>   KC> used to analyze the data should also be submitted to a public workflow
>>   KC> repository.
>>
>> It's a nice idea but doesn't quite allow the same level of repeatability.
>> Most Taverna workflows need updating periodically, as the services go
>> offline or change their interfaces. Even if they don't, they return
>> different results as the implementation changes.
>>
>> Ultimately, you need to store more than the workflow to allow any degree
>> of repeatability. Still, it would be a good step forward, which is no bad
>> thing.
>
>
> You are right, and I think this really is a serious problem, not only
> with the workflow approach to data analysis, but with all bioinformatics
> procedures.
>
> We should find a way to fully describe a bioinformatics data analysis by
> specifying not only the tools used (software programme, databases
> involved, parameters used, I/O), but also a lot of meta information on
> them, such as software version and implementation, host operating system,
> database version, server software and related version and implementation,
> accessed site, date of access, etc.
> All this information would support at least a better specification of the
> procedure, while repeatability of the analysis would still be difficult,
> due to the frequent update of databases and the difficulty in keeping
> previous releases on-line.
> At the same time, it would be nice to see how results of analysis can
> change after some time, when new data is available in databases.
>
> Paolo
>
> Paolo Romano (paolo.romano@istge.it)
> Bioinformatics
> National Cancer Research Institute (IST)
> Largo Rosanna Benzi, 10, I-16132, Genova, Italy
> Tel: +39-010-5737-288  Fax: +39-010-5737-295
>
Since the data (e.g., genome annotation) used in an analysis pipeline
(workflow) may evolve over time, the provenance of the workflow may need
to include the version of each data source involved in the analysis, in
addition to the raw data.
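
As a rough illustration only, the kind of record discussed in this thread
(tool versions, operating system, database version, accessed site, date of
access) could be sketched in Python as below. All class and field names
(DataSource, ToolRun, WorkflowProvenance, fingerprint, and the example
values) are hypothetical and do not come from Taverna, myExperiment, or any
existing standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataSource:
    """A database or annotation set consulted during the analysis."""
    name: str
    version: str        # release of the data, e.g. an annotation build
    accessed_site: str  # where the data was retrieved from
    accessed_on: str    # date of access, as an ISO 8601 string


@dataclass
class ToolRun:
    """One analysis step: the software and the parameters it ran with."""
    name: str
    version: str
    operating_system: str
    parameters: dict = field(default_factory=dict)


@dataclass
class WorkflowProvenance:
    """Hypothetical provenance record stored alongside a workflow."""
    workflow_id: str
    tools: list
    data_sources: list

    def to_json(self) -> str:
        # sort_keys makes the serialisation stable across runs
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        # Any change in a tool or data version changes the digest
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


# Example values are invented for illustration.
blast = ToolRun("blast", "2.2.18", "linux", {"evalue": 1e-5})
old = WorkflowProvenance(
    "wf-001", [blast],
    [DataSource("annotation-db", "release-1", "example.org", "2008-08-21")],
)
new = WorkflowProvenance(
    "wf-001", [blast],
    [DataSource("annotation-db", "release-2", "example.org", "2008-08-21")],
)
same_inputs = old.fingerprint() == new.fingerprint()
```

Here same_inputs is False because only the data version differs, which is
the case that storing the workflow alone would not detect.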

-Kei

Received on Thursday, 21 August 2008 22:08:49 UTC