- From: Joachim Baran <joachim.baran@gmail.com>
- Date: Fri, 8 Aug 2014 18:46:27 -0700
- To: Michael Miller <Michael.Miller@systemsbiology.org>
- Cc: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>, w3c semweb hcls <public-semweb-lifesci@w3.org>
Hello,

Has there been an update to this? Preferably a pull request?

Thanks,

Kim

> On Aug 5, 2014, at 8:13 AM, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
>
> hi stian,
>
> thanks much, very useful!
>
> cheers,
> michael
>
> Michael Miller
> Software Engineer
> Institute for Systems Biology
>
>> -----Original Message-----
>> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of Stian Soiland-Reyes
>> Sent: Tuesday, August 05, 2014 5:38 AM
>> To: Michael Miller
>> Cc: Joachim Baran; w3c semweb hcls
>> Subject: Re: hcls dataset description comments--Dataset Descriptions vs. PROV
>>
>> Just some inputs:
>>
>> PROV defines prov:wasDerivedFrom which in a broad sense describes such a
>> relationship between datasets. However you do not know anything more
>> about what kind of derivation we are talking about.
>>
>> In PAV we found the need to specialize three types of derivation:
>>
>> pav:retrievedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
>> .. a byte-for-byte download
>>
>> pav:importedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/importedFrom
>> .. a somewhat equivalent form of the source, but after some kind of
>> transformation or selection (e.g. CSV -> XML)
>>
>> pav:derivedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
>> .. when the new resource has been further refined or modified
>> (somewhat adding additional knowledge)
>>
>> If you are simply concatenating several datasets, then multiple
>> pav:importedFrom statements would make sense. If further knowledge is
>> added, say by reasoning or calculation, then pav:derivedFrom would
>> make sense.
>>
>> Now if you want to detail exactly how those datasets have been
>> combined, I think you are right that it would make sense to break down
>> the derivation using PROV statements, e.g. a series of activities,
>> generation and usage. How to describe these activities (e.g.
>> subclasses and properties) will be specific to each case.
>>
>> If the process you generated the dataset with somewhat resembles a
>> dataflow, you might be interested in the wfprov and wfdesc ontologies
>> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
>> which can be related to a common workflow description (e.g. a
>> prov:Plan):
>>
>> http://purl.org/wf4ever/model#wfprov
>>
>> OPMW is a similar approach:
>> http://www.opmw.org/model/OPMW/
>>
>> On 4 August 2014 17:44, Michael Miller
>> <Michael.Miller@systemsbiology.org> wrote:
>>> hi all,
>>>
>>> as you are all undoubtedly aware, a major, if not the major TCGA dataset
>>> use cases revolve around taking the 3rd level data from the TCGA dcc
>>> repository and doing analysis, producing 4th level data such as clusters,
>>> pca, etc. one of the things we do here at ISB is produce an intermediate
>>> data step that combines the different platforms (mRNA, miRNA, RPPA, METH,
>>> etc.) into one feature matrix so that the analysis can use all the
>>> platforms together. the Broad firehose pipeline also has this as one of
>>> its outputs.
>>>
>>> as some of my comments allude to, it doesn't seem that Dataset
>>> Descriptions deal with the use case of describing a dataset that is
>>> specifically derived from other datasets, which is what we are looking at
>>> ways we might describe our data when we publish it. i took a look at PROV
>>> and, i've got a bit more mapping to do, but it seems like PROV provides
>>> the terms we need.
>>>
>>> but this has led me to ask the question of what is the relation of
>>> Dataset Descriptions and PROV and how should they/should they be used
>>> together?
>>> i think the above use case is quite common for datasets being published
>>> so might deserve a discussion in the Dataset Descriptions note
>>>
>>> cheers,
>>> michael
>>>
>>> Michael Miller
>>> Software Engineer
>>> Institute for Systems Biology
>>>
>>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>>> Sent: Thursday, July 31, 2014 3:43 PM
>>> To: Michael Miller
>>> Cc: w3c semweb hcls
>>> Subject: Re: hcls dataset description comments
>>>
>>> Hi!
>>>
>>> I will ponder about your edit suggestion of your first bullet point. I
>>> am not sure at the moment if it would have wider implications.
>>>
>>> You are right that the use cases were written by the groups themselves.
>>> I do not know how to improve the use cases without rewriting them, which
>>> might not be agreeable to all parties involved. C'est la vie.
>>>
>>> The role of Data Catalogs should then be discussed during our next conf
>>> call. Thanks for highlighting that this might be unclear to readers.
>>>
>>> Kim
>>>
>>> On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org>
>>> wrote:
>>>
>>> hi kim,
>>>
>>> 'For other edits, please fork the repository and create a pull request
>>> with your changes'
>>>
>>> of the four general comments, the first is really the only 'edit', i
>>> didn't put it in the minor edits because it had some implications that
>>> the group might not agree with. if the change makes sense, it might be
>>> easier for you to make the edit.
>>>
>>> the other three are general comments and i'm not sure what the solution
>>> might be, they were mainly points, as a reader, that weren't clear or
>>> were a bit confusing. these were all from the use case section so were
>>> probably written by the groups themselves? if i have permission, i can
>>> certainly add them as issues.
>>>
>>> cheers,
>>> michael
>>>
>>> Michael Miller
>>> Software Engineer
>>> Institute for Systems Biology
>>>
>>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>>> Sent: Tuesday, July 29, 2014 11:56 AM
>>> To: Michael Miller
>>> Cc: w3c semweb hcls
>>> Subject: Re: hcls dataset description comments
>>>
>>> Hi!
>>>
>>> Thanks for the suggestions. I have incorporated your minor edits.
>>> Unbelievable how those slipped through after so many re-readings still.
>>>
>>> For other edits, please fork the repository and create a pull request
>>> with your changes.
>>>
>>> Best wishes,
>>>
>>> Kim
>>>
>>> On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org>
>>> wrote:
>>>
>>> hi kim,
>>>
>>> thanks for the pointer, i've updated my comments based on this newer
>>> draft below. many fewer and i especially like the complete example in
>>> 10.1!
>>>
>>> cheers,
>>> michael
>>>
>>> Michael Miller
>>> Software Engineer
>>> Institute for Systems Biology
>>>
>>> general comments:
>>>
>>> · s4.4 'Dataset Linking': might mention also that datasets are derived
>>> from other datasets?
>>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
>>> the creation of a data mashup ' --> 'A dataset may incorporate, be
>>> derived from, or link to, data in other datasets, e.g. in the analysis
>>> of original datasets or in the creation of a data mashup '
>>>
>>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are individual
>>> organizations but three (8.4, 8.8, 8.9) have subsections for different
>>> organizations. maybe organize so all top level sections define a type of
>>> organization with subsections beneath or make all top-level?
>>>
>>> · s8: some of the use cases could be more focused on how this note will
>>> help them (8.5-8.7)
>>>
>>> · s8.9: how do Data Catalogs fit into this note? wasn't clear to me how
>>> this note is relevant to them
>>>
>>> our use case questions:
>>>
>>> · how to reference 3rd party datasets that aren't described by this
>>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>>> with the IRI being the URL into the repository?
>>>
>>> · we have a lot of intermediary files that we won't publish, the
>>> software specified in creating our published datasets from its sources
>>> form a (branching) workflow with the input being from the previous
>>> step(s) in the workflow. how best to represent this? this note doesn't
>>> seem to cover how the dataset is created so any recommendations?
>>>
>>> minor edits:
>>>
>>> · there are two s6.2.3 sections
>>>
>>> · s8.8.1: '... what period it is updated. To know when to...' should be
>>> '...what period it is updated to know when to...'?
>>>
>>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>>> Sent: Tuesday, July 22, 2014 3:43 PM
>>> To: Michael Miller
>>> Cc: w3c semweb hcls
>>> Subject: Re: hcls dataset description comments
>>>
>>> Hello,
>>>
>>> I believe you were looking at an old document. There is currently only
>>> one Figure in the note.
>>>
>>> Please check the actual draft at:
>>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>>>
>>> Best wishes,
>>>
>>> Kim
>>>
>>> On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org>
>>> wrote:
>>>
>>> hi all,
>>>
>>> tremendous work, very clear and well-written. my group at ISB, the
>>> Shmulevich lab is looking to provide provenance for the analysis
>>> datasets we are producing for TCGA.
>>> we're not sure if we'll be able to 'go all the way' but we want to make
>>> sure we have at hand all the information that we could, at least in
>>> theory, be compliant. as long as i was reading the document, below are
>>> some notes.
>>>
>>> general comments:
>>>
>>> · s4.4 'Dataset Linking': might mention also that datasets are derived
>>> from other datasets?
>>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
>>> the creation of a data mashup ' --> 'A dataset may incorporate, be
>>> derived from, or link to, data in other datasets, e.g. in the analysis
>>> of original datasets or in the creation of a data mashup '
>>>
>>> · the chembl example in s5 is not compliant to the property table below,
>>> it probably is only supposed to show the relationship of the three terms
>>> but that could be clarified
>>>
>>> · s6.2.12 could use the example filled in
>>>
>>> · 6.3.2: not sure what an 'X level description' is
>>>
>>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are individual
>>> organizations but three (8.4, 8.8, 8.9) have subsections for different
>>> organizations. maybe organize so all top level sections define a type of
>>> organization with subsections beneath or make all top-level?
>>>
>>> · s8: many of the use cases could be more focused on how this note will
>>> help them
>>>
>>> · s8.9: how do Data Catalogs fit into this note? wasn't clear to me how
>>> this note is relevant to them
>>>
>>> · would be nice to have a 'complete' example put together, maybe based
>>> on chembl?
>>>
>>> our use case questions:
>>>
>>> · how to reference 3rd party datasets that aren't described by this
>>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>>> with the IRI being the URL into the repository?
>>>
>>> · we have a lot of intermediary files that we won't publish, the
>>> software specified in creating our published datasets from its sources
>>> form a (branching) workflow with the input being from the previous
>>> step(s) in the workflow. how best to represent this? this note doesn't
>>> seem to cover how the dataset is created so any recommendations?
>>>
>>> text issues:
>>>
>>> · Figure 1: 'Overview of dataset description level metadata profiles and
>>> their relationships': reference not resolved, image doesn't show
>>>
>>> · Figure 2: 'Improve diagram. Multiple appearance of
>>> concepts/description levels unclear.': reference not resolved, image
>>> doesn't show. add actual label
>>>
>>> minor edits:
>>>
>>> · bottom of s.3: 'placeholde' should be 'placeholder'
>>>
>>> · use straight quotes rather than slant quotes in s6.2.2 example (and
>>> elsewhere)?
>>>
>>> · the text runs out of the box in s6.2.3 'Description'
>>>
>>> · s6.2.3: 'Dates of Creation and Issuance': 'state the date the dataset
>>> was generated using dct:created and/or the date the dataset was made
>>> public using dct:created' should be 'state the date the dataset was
>>> generated using dct:created and/or the date the dataset was made public
>>> using dct:issued'?
>>>
>>> · there are two s6.2.3 sections
>>>
>>> · s6.2.4: 'Creation: ... The date of authorship' should be '...The date
>>> of creation' and 'Curation:... The date of authorship' should be '...The
>>> date of curation'?
>>>
>>> · s8.5: the author list has end parenthesis without beginning
>>> parenthesis
>>>
>>> · s8.8.1: '... what period it is updated. To know when to...' should be
>>> '...what period it is updated to know when to...'
>>>
>>> cheers,
>>> michael
>>>
>>> Michael Miller
>>> Software Engineer
>>> Institute for Systems Biology
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
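
A minimal sketch of the choice between the three PAV properties Stian describes in the quoted message, written in Python with only the standard library. The PAV property IRIs are the ones linked in the thread and the PROV namespace is the W3C one; the example.org dataset IRIs and the helper names derivation_property and triple are hypothetical, and printing N-Triples text is only one possible way to write the statements down, not something the note or PAV prescribes.

```python
# PAV property IRIs as linked in the thread; PROV namespace from the W3C spec.
PAV = "http://purl.org/pav/"
PROV = "http://www.w3.org/ns/prov#"


def derivation_property(byte_identical: bool, adds_knowledge: bool) -> str:
    """Map the three cases from the message onto a PAV property IRI.

    Hypothetical helper: a byte-identical copy takes precedence over
    the other flag, since the two cases are meant to be exclusive.
    """
    if byte_identical:
        return PAV + "retrievedFrom"  # byte-for-byte download
    if adds_knowledge:
        return PAV + "derivedFrom"  # refined or modified, knowledge added
    return PAV + "importedFrom"  # transformed or selected, e.g. CSV -> XML


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one N-Triples statement whose subject and object are IRIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical IRIs, for illustration only.
FEATURE_MATRIX = "http://example.org/dataset/feature-matrix"
SOURCES = [
    "http://example.org/source/mrna",
    "http://example.org/source/mirna",
]

# Plain concatenation of several sources: one pav:importedFrom per source.
statements = [
    triple(FEATURE_MATRIX, derivation_property(False, False), source)
    for source in SOURCES
]

# Broad fallback when nothing more is known about the kind of derivation.
generic = triple(FEATURE_MATRIX, PROV + "wasDerivedFrom", SOURCES[0])

for line in statements + [generic]:
    print(line)
```

For a dataset where further knowledge is added by reasoning or calculation, calling derivation_property(False, True) would select pav:derivedFrom instead, in line with the distinction drawn in the message.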
Received on Saturday, 9 August 2014 01:46:57 UTC