- From: Michel Dumontier <michel.dumontier@gmail.com>
- Date: Wed, 20 Aug 2014 09:58:47 -0700
- To: Michael Miller <Michael.Miller@systemsbiology.org>
- Cc: Joachim Baran <joachim.baran@gmail.com>, w3c semweb hcls <public-semweb-lifesci@w3.org>, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
excellent. thank you Michael.

m.

On Wed, Aug 20, 2014 at 9:30 AM, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
> hi all,
>
> finally finished my initial attempt at adding a section on datasets in
> workflows. had some brief discussions with melissa and she and carole may
> have some additions. i tried to keep it short and sweet; to really get into
> it could be an entire note on its own.
>
> kim, you should see my pull request. hope to make the call next week.
>
> cheers,
> michael
>
> Michael Miller
> Software Engineer
> Institute for Systems Biology
>
> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> Sent: Monday, August 11, 2014 8:00 AM
> To: Michael Miller
> Cc: Stian Soiland-Reyes; w3c semweb hcls
> Subject: Re: hcls dataset description comments--Dataset Descriptions vs. PROV
>
> Great! Before you send the pull request, please make sure that W3C's HTML
> validation passes: http://validator.w3.org/#validate_by_input
>
> Kim
>
> On 9 August 2014 10:15, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
>
> hi kim,
>
> i've made decent progress and expect to have something mid-week, if all goes
> well (as a pull request, tho no guarantee on the formatting!)
>
> cheers,
> michael
>
> Michael Miller
> Software Engineer
> Institute for Systems Biology
>
>> -----Original Message-----
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> Sent: Friday, August 08, 2014 6:46 PM
>> To: Michael Miller
>> Cc: Stian Soiland-Reyes; w3c semweb hcls
>> Subject: Re: hcls dataset description comments--Dataset Descriptions vs. PROV
>>
>> Hello,
>>
>> Has there been an update to this? Preferably a pull request?
>>
>> Thanks,
>>
>> Kim
>>
>> > On Aug 5, 2014, at 8:13 AM, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
>> >
>> > hi stian,
>> >
>> > thanks much, very useful!
>> >
>> > cheers,
>> > michael
>> >
>> > Michael Miller
>> > Software Engineer
>> > Institute for Systems Biology
>> >
>> >> -----Original Message-----
>> >> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of Stian Soiland-Reyes
>> >> Sent: Tuesday, August 05, 2014 5:38 AM
>> >> To: Michael Miller
>> >> Cc: Joachim Baran; w3c semweb hcls
>> >> Subject: Re: hcls dataset description comments--Dataset Descriptions vs. PROV
>> >>
>> >> Just some inputs:
>> >>
>> >> PROV defines prov:wasDerivedFrom which in a broad sense describes such a
>> >> relationship between datasets. However you do not know anything more
>> >> about what kind of derivation we are talking about.
>> >>
>> >> In PAV we found the need to specialize three types of derivation:
>> >>
>> >> pav:retrievedFrom -
>> >> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
>> >> .. a byte-for-byte download
>> >>
>> >> pav:importedFrom -
>> >> http://purl.org/pav/html#http://purl.org/pav/importedFrom
>> >> .. a somewhat equivalent form of the source, but after some kind of
>> >> transformation or selection (e.g. CSV -> XML)
>> >>
>> >> pav:derivedFrom -
>> >> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
>> >> .. when the new resource has been further refined or modified
>> >> (somewhat adding additional knowledge)
>> >>
>> >> If you are simply concatenating several datasets, then multiple
>> >> pav:importedFrom statements would make sense. If further knowledge is
>> >> added, say by reasoning or calculation, then pav:derivedFrom would
>> >> make sense.
>> >>
>> >> Now if you want to detail exactly how those datasets have been
>> >> combined, I think you are right that it would make sense to break down
>> >> the derivation using PROV statements, e.g. a series of activities,
>> >> generation and usage. How to describe these activities (e.g.
>> >> subclasses and properties) will be specific to each case.
>> >>
>> >> If the process you generated the dataset with somewhat resembles a
>> >> dataflow, you might be interested in the wfprov and wfdesc ontologies
>> >> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
>> >> which can be related to a common workflow description (e.g. a
>> >> prov:Plan):
>> >>
>> >> http://purl.org/wf4ever/model#wfprov
>> >>
>> >> OPMW is a similar approach:
>> >> http://www.opmw.org/model/OPMW/
>> >>
>> >> On 4 August 2014 17:44, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
>> >>> hi all,
>> >>>
>> >>> as you are all undoubtedly aware, a major, if not the major, TCGA dataset
>> >>> use case revolves around taking the 3rd level data from the TCGA dcc
>> >>> repository and doing analysis, producing 4th level data such as clusters,
>> >>> pca, etc. one of the things we do here at ISB is produce an intermediate
>> >>> data step that combines the different platforms (mRNA, miRNA, RPPA, METH,
>> >>> etc.) into one feature matrix so that the analysis can use all the
>> >>> platforms together. the Broad firehose pipeline also has this as one of
>> >>> its outputs.
>> >>>
>> >>> as some of my comments allude to, it doesn't seem that Dataset
>> >>> Descriptions deal with the use case of describing a dataset that is
>> >>> specifically derived from other datasets, which is what we are looking at
>> >>> ways we might describe our data when we publish it. i took a look at PROV
>> >>> and, i've got a bit more mapping to do, but it seems like PROV provides
>> >>> the terms we need.
>> >>>
>> >>> but this has led me to ask the question of what is the relation of
>> >>> Dataset Descriptions and PROV and how should they/should they be used
>> >>> together? i think the above use case is quite common for datasets being
>> >>> published so might deserve a discussion in the Dataset Descriptions note
>> >>>
>> >>> cheers,
>> >>> michael
>> >>>
>> >>> Michael Miller
>> >>> Software Engineer
>> >>> Institute for Systems Biology
>> >>>
>> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> >>> Sent: Thursday, July 31, 2014 3:43 PM
>> >>> To: Michael Miller
>> >>> Cc: w3c semweb hcls
>> >>> Subject: Re: hcls dataset description comments
>> >>>
>> >>> Hi!
>> >>>
>> >>> I will ponder about your edit suggestion of your first bullet point. I am
>> >>> not sure at the moment if it would have wider implications.
>> >>>
>> >>> You are right that the use cases were written by the groups themselves. I
>> >>> do not know how to improve the use cases without rewriting them, which
>> >>> might not be agreeable to all parties involved. C'est la vie.
>> >>>
>> >>> The role of Data Catalogs should then be discussed during our next conf
>> >>> call. Thanks for highlighting that this might be unclear to readers.
>> >>>
>> >>> Kim
>> >>>
>> >>> On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
>> >>>
>> >>> hi kim,
>> >>>
>> >>> 'For other edits, please fork the repository and create a pull request
>> >>> with your changes'
>> >>>
>> >>> of the four general comments, the first is really the only 'edit', i
>> >>> didn't put it in the minor edits because it had some implications that
>> >>> the group might not agree with. if the change makes sense, it might be
>> >>> easier for you to make the edit.
>> >>>
>> >>> the other three are general comments and i'm not sure what the solution
>> >>> might be, they were mainly points, as a reader, that weren't clear or
>> >>> were a bit confusing. these were all from the use case section so were
>> >>> probably written by the groups themselves? if i have permission, i can
>> >>> certainly add them as issues.
>> >>>
>> >>> cheers,
>> >>> michael
>> >>>
>> >>> Michael Miller
>> >>> Software Engineer
>> >>> Institute for Systems Biology
>> >>>
>> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> >>> Sent: Tuesday, July 29, 2014 11:56 AM
>> >>> To: Michael Miller
>> >>> Cc: w3c semweb hcls
>> >>> Subject: Re: hcls dataset description comments
>> >>>
>> >>> Hi!
>> >>>
>> >>> Thanks for the suggestions. I have incorporated your minor edits.
>> >>> Unbelievable how those slipped through after so many re-readings still.
>> >>>
>> >>> For other edits, please fork the repository and create a pull request
>> >>> with your changes.
>> >>>
>> >>> Best wishes,
>> >>>
>> >>> Kim
>> >>>
>> >>> On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
>> >>>
>> >>> hi kim,
>> >>>
>> >>> thanks for the pointer, i've updated my comments based on this newer
>> >>> draft below. many fewer and i especially like the complete example in
>> >>> 10.1!
>> >>>
>> >>> cheers,
>> >>> michael
>> >>>
>> >>> Michael Miller
>> >>> Software Engineer
>> >>> Institute for Systems Biology
>> >>>
>> >>> general comments:
>> >>>
>> >>> · s4.4 'Dataset Linking': might mention also that datasets are derived
>> >>>   from other datasets?
>> >>>   'A dataset may incorporate, or link to, data in other datasets, e.g. in
>> >>>   the creation of a data mashup ' --> 'A dataset may incorporate, be
>> >>>   derived from, or link to, data in other datasets, e.g. in the analysis
>> >>>   of original datasets or in the creation of a data mashup '
>> >>>
>> >>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are individual
>> >>>   organizations but three (8.4, 8.8, 8.9) have subsections for different
>> >>>   organizations. maybe organize so all top level sections define a type
>> >>>   of organization with subsections beneath or make all top-level?
>> >>>
>> >>> · s8: some of the use cases could be more focused on how this note will
>> >>>   help them (8.5-8.7)
>> >>>
>> >>> · s8.9: how do Data Catalogs fit into this note? wasn't clear to me how
>> >>>   this note is relevant to them
>> >>>
>> >>> our use case questions:
>> >>>
>> >>> · how to reference 3rd party datasets that aren't described by this
>> >>>   standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>> >>>   with the IRI being the URL into the repository?
>> >>>
>> >>> · we have a lot of intermediary files that we won't publish, the software
>> >>>   specified in creating our published datasets from its sources form a
>> >>>   (branching) workflow with the input being from the previous step(s) in
>> >>>   the workflow. how best to represent this? this note doesn't seem to
>> >>>   cover how the dataset is created so any recommendations?
>> >>>
>> >>> minor edits:
>> >>>
>> >>> · there are two s6.2.3 sections
>> >>>
>> >>> · s8.8.1: '... what period it is updated. To know when to...' should be
>> >>>   '...what period it is updated to know when to...'?
>> >>>
>> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> >>> Sent: Tuesday, July 22, 2014 3:43 PM
>> >>> To: Michael Miller
>> >>> Cc: w3c semweb hcls
>> >>> Subject: Re: hcls dataset description comments
>> >>>
>> >>> Hello,
>> >>>
>> >>> I believe you were looking at an old document. There is currently only
>> >>> one Figure in the note.
>> >>>
>> >>> Please check the actual draft at:
>> >>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>> >>>
>> >>> Best wishes,
>> >>>
>> >>> Kim
>> >>>
>> >>> On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
>> >>>
>> >>> hi all,
>> >>>
>> >>> tremendous work, very clear and well-written. my group at ISB, the
>> >>> Shmulevich lab is looking to provide provenance for the analysis datasets
>> >>> we are producing for TCGA. we're not sure if we'll be able to 'go all the
>> >>> way' but we want to make sure we have at hand all the information so that
>> >>> we could, at least in theory, be compliant. as long as i was reading the
>> >>> document, below are some notes.
>> >>>
>> >>> general comments:
>> >>>
>> >>> · s4.4 'Dataset Linking': might mention also that datasets are derived
>> >>>   from other datasets?
>> >>>   'A dataset may incorporate, or link to, data in other datasets, e.g. in
>> >>>   the creation of a data mashup ' --> 'A dataset may incorporate, be
>> >>>   derived from, or link to, data in other datasets, e.g. in the analysis
>> >>>   of original datasets or in the creation of a data mashup '
>> >>>
>> >>> · the chembl example in s5 is not compliant to the property table below,
>> >>>   it probably is only supposed to show the relationship of the three
>> >>>   terms but that could be clarified
>> >>>
>> >>> · s6.2.12 could use the example filled in
>> >>>
>> >>> · 6.3.2: not sure what an 'X level description' is
>> >>>
>> >>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are individual
>> >>>   organizations but three (8.4, 8.8, 8.9) have subsections for different
>> >>>   organizations. maybe organize so all top level sections define a type
>> >>>   of organization with subsections beneath or make all top-level?
>> >>>
>> >>> · s8: many of the use cases could be more focused on how this note will
>> >>>   help them
>> >>>
>> >>> · s8.9: how do Data Catalogs fit into this note? wasn't clear to me how
>> >>>   this note is relevant to them
>> >>>
>> >>> · would be nice to have a 'complete' example put together, maybe based on
>> >>>   chembl?
>> >>>
>> >>> our use case questions:
>> >>>
>> >>> · how to reference 3rd party datasets that aren't described by this
>> >>>   standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>> >>>   with the IRI being the URL into the repository?
>> >>>
>> >>> · we have a lot of intermediary files that we won't publish, the software
>> >>>   specified in creating our published datasets from its sources form a
>> >>>   (branching) workflow with the input being from the previous step(s) in
>> >>>   the workflow. how best to represent this? this note doesn't seem to
>> >>>   cover how the dataset is created so any recommendations?
>> >>>
>> >>> text issues:
>> >>>
>> >>> · Figure 1: 'Overview of dataset description level metadata profiles and
>> >>>   their relationships': reference not resolved, image doesn't show
>> >>>
>> >>> · Figure 2: 'Improve diagram. Multiple appearance of concepts/description
>> >>>   levels unclear.': reference not resolved, image doesn't show. add
>> >>>   actual label
>> >>>
>> >>> minor edits:
>> >>>
>> >>> · bottom of s.3: 'placeholde' should be 'placeholder'
>> >>>
>> >>> · use straight quotes rather than slant quotes in s6.2.2 example (and
>> >>>   elsewhere)?
>> >>>
>> >>> · the text runs out of the box in s6.2.3 'Description'
>> >>>
>> >>> · s6.2.3: 'Dates of Creation and Issuance': 'state the date the dataset
>> >>>   was generated using dct:created and/or the date the dataset was made
>> >>>   public using dct:created' should be 'state the date the dataset was
>> >>>   generated using dct:created and/or the date the dataset was made public
>> >>>   using dct:issued'?
>> >>>
>> >>> · there are two s6.2.3 sections
>> >>>
>> >>> · s6.2.4: 'Creation: ... The date of authorship' should be '...The date
>> >>>   of creation' and 'Curation:... The date of authorship' should be
>> >>>   '...The date of curation'?
>> >>>
>> >>> · s8.5: the author list has end parenthesis without beginning parenthesis
>> >>>
>> >>> · s8.8.1: '... what period it is updated. To know when to...' should be
>> >>>   '...what period it is updated to know when to...'
>> >>>
>> >>> cheers,
>> >>> michael
>> >>>
>> >>> Michael Miller
>> >>> Software Engineer
>> >>> Institute for Systems Biology
>> >>
>> >> --
>> >> Stian Soiland-Reyes, myGrid team
>> >> School of Computer Science
>> >> The University of Manchester
>> >> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
Received on Wednesday, 20 August 2014 16:59:35 UTC
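
The choice among the derivation properties discussed in this thread (the general prov:wasDerivedFrom and the three PAV specializations) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the dataset IRIs under example.org, the `Derivation` enum, and the helper names are hypothetical and do not come from the thread or from the HCLS note; only the property IRIs are taken from PAV and PROV. It uses plain Python and emits N-Triples-style lines rather than relying on an RDF library.

```python
from enum import Enum

PAV = "http://purl.org/pav/"
PROV = "http://www.w3.org/ns/prov#"


class Derivation(Enum):
    """Kinds of derivation, following the distinctions made in the thread."""
    UNSPECIFIED = PROV + "wasDerivedFrom"  # generic: nothing known about the kind
    BYTE_COPY = PAV + "retrievedFrom"      # byte-for-byte download
    TRANSFORMED = PAV + "importedFrom"     # equivalent form after transformation/selection
    REFINED = PAV + "derivedFrom"          # further refined, knowledge added


def derivation_triples(target, sources, kind):
    """Return one (subject, predicate, object) IRI triple per source dataset."""
    return [(target, kind.value, source) for source in sources]


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples-style lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical IRIs: a combined feature matrix built from two platform datasets.
FEATURE_MATRIX = "http://example.org/isb/feature-matrix"
PLATFORM_DATASETS = [
    "http://example.org/tcga/mrna",
    "http://example.org/tcga/mirna",
]

# Combining platforms without adding knowledge: multiple pav:importedFrom statements.
triples = derivation_triples(FEATURE_MATRIX, PLATFORM_DATASETS, Derivation.TRANSFORMED)
print(to_ntriples(triples))
```

If the combined dataset were instead the result of reasoning or calculation, passing `Derivation.REFINED` would yield pav:derivedFrom statements; a step-by-step account of how the inputs were combined would need PROV activities, which this sketch does not attempt.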