- From: Michael Miller <Michael.Miller@systemsbiology.org>
- Date: Tue, 5 Aug 2014 08:13:55 -0700
- To: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
- Cc: Joachim Baran <joachim.baran@gmail.com>, w3c semweb hcls <public-semweb-lifesci@w3.org>
hi stian,

thanks much, very useful!

cheers,
michael

Michael Miller
Software Engineer
Institute for Systems Biology

> -----Original Message-----
> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of Stian Soiland-Reyes
> Sent: Tuesday, August 05, 2014 5:38 AM
> To: Michael Miller
> Cc: Joachim Baran; w3c semweb hcls
> Subject: Re: hcls dataset description comments--Dataset Descriptions vs. PROV
>
> Just some inputs:
>
> PROV defines prov:wasDerivedFrom, which in a broad sense describes such a
> relationship between datasets. However, you do not know anything more
> about what kind of derivation we are talking about.
>
> In PAV we found the need to specialize three types of derivation:
>
> pav:retrievedFrom -
> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
> .. a byte-for-byte download
>
> pav:importedFrom -
> http://purl.org/pav/html#http://purl.org/pav/importedFrom
> .. a somewhat equivalent form of the source, but after some kind of
> transformation or selection (e.g. CSV -> XML)
>
> pav:derivedFrom -
> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
> .. when the new resource has been further refined or modified
> (somewhat adding additional knowledge)
>
> If you are simply concatenating several datasets, then multiple
> pav:importedFrom statements would make sense. If further knowledge is
> added, say by reasoning or calculation, then pav:derivedFrom would
> make sense.
>
> Now if you want to detail exactly how those datasets have been
> combined, I think you are right that it would make sense to break down
> the derivation using PROV statements, e.g. a series of activities,
> generation and usage. How to describe these activities (e.g.
> subclasses and properties) will be specific to each case.
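
As a rough illustration of the advice above, here is a minimal Python sketch that picks one of the three PAV properties for a derivation and, optionally, spells the same derivation out as a PROV activity with usage and generation statements. It is only a sketch under stated assumptions: the helper names (`triple`, `describe_derivation`), the labels "download", "transform" and "refine", and all example.org IRIs are hypothetical and are not part of PAV, PROV or the Dataset Descriptions note. Only the property IRIs themselves come from the PAV and PROV namespaces.

```python
PAV = "http://purl.org/pav/"
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical labels mapped to the three PAV derivation properties.
PAV_PROPERTY = {
    "download": PAV + "retrievedFrom",   # byte-for-byte copy
    "transform": PAV + "importedFrom",   # equivalent content, new form or selection
    "refine": PAV + "derivedFrom",       # knowledge added by reasoning or calculation
}


def triple(s, p, o):
    """Format one N-Triples statement whose object is an IRI."""
    return f"<{s}> <{p}> <{o}> ."


def describe_derivation(target, sources, kind, activity=None):
    """Return N-Triples lines linking target to each of its sources.

    If an activity IRI is given, the derivation is also broken down into
    a prov:Activity with prov:used and prov:wasGeneratedBy statements.
    """
    prop = PAV_PROPERTY[kind]
    lines = [triple(target, prop, src) for src in sources]
    if activity is not None:
        lines.append(triple(activity, RDF_TYPE, PROV + "Activity"))
        lines.extend(triple(activity, PROV + "used", src) for src in sources)
        lines.append(triple(target, PROV + "wasGeneratedBy", activity))
    return lines


if __name__ == "__main__":
    # Hypothetical IRIs, for illustration only.
    base = "http://example.org/"
    for line in describe_derivation(
        base + "feature-matrix",
        [base + "mrna", base + "mirna"],
        "transform",
        activity=base + "merge-run-1",
    ):
        print(line)
```

Unpublished intermediate files could be given IRIs of their own and chained through further activities in the same way. How those activities are typed would still be specific to each case.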
>
> If the process you generated the dataset with somewhat resembles a
> dataflow, you might be interested in the wfprov and wfdesc ontologies
> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
> which can be related to a common workflow description (e.g. a
> prov:Plan):
>
> http://purl.org/wf4ever/model#wfprov
>
> OPMW is a similar approach:
> http://www.opmw.org/model/OPMW/
>
> On 4 August 2014 17:44, Michael Miller
> <Michael.Miller@systemsbiology.org> wrote:
> > hi all,
> >
> > as you are all undoubtedly aware, a major, if not the major, TCGA dataset
> > use case revolves around taking the 3rd level data from the TCGA dcc
> > repository and doing analysis, producing 4th level data such as clusters,
> > pca, etc. one of the things we do here at ISB is produce an intermediate
> > data step that combines the different platforms (mRNA, miRNA, RPPA, METH,
> > etc.) into one feature matrix so that the analysis can use all the
> > platforms together. the Broad firehose pipeline also has this as one of
> > its outputs.
> >
> > as some of my comments allude to, it doesn't seem that Dataset
> > Descriptions deal with the use case of describing a dataset that is
> > specifically derived from other datasets, which is what we are looking at
> > ways we might describe our data when we publish it. i took a look at PROV
> > and, i've got a bit more mapping to do, but it seems like PROV provides
> > the terms we need.
> >
> > but this has led me to ask the question of what is the relation of
> > Dataset Descriptions and PROV and how should they/should they be used
> > together?
> > i think the above use case is quite common for datasets being published so
> > might deserve a discussion in the Dataset Descriptions note
> >
> > cheers,
> > michael
> >
> > Michael Miller
> > Software Engineer
> > Institute for Systems Biology
> >
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Thursday, July 31, 2014 3:43 PM
> > To: Michael Miller
> > Cc: w3c semweb hcls
> > Subject: Re: hcls dataset description comments
> >
> > Hi!
> >
> > I will ponder the edit suggestion in your first bullet point. I am not
> > sure at the moment if it would have wider implications.
> >
> > You are right that the use cases were written by the groups themselves. I
> > do not know how to improve the use cases without rewriting them, which
> > might not be agreeable to all parties involved. C'est la vie.
> >
> > The role of Data Catalogs should then be discussed during our next conf
> > call. Thanks for highlighting that this might be unclear to readers.
> >
> > Kim
> >
> > On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org>
> > wrote:
> >
> > hi kim,
> >
> > 'For other edits, please fork the repository and create a pull request
> > with your changes'
> >
> > of the four general comments, the first is really the only 'edit', i
> > didn't put it in the minor edits because it had some implications that the
> > group might not agree with. if the change makes sense, it might be easier
> > for you to make the edit.
> >
> > the other three are general comments and i'm not sure what the solution
> > might be, they were mainly points, as a reader, that weren't clear or were
> > a bit confusing. these were all from the use case section so were probably
> > written by the groups themselves? if i have permission, i can certainly
> > add them as issues.
> >
> > cheers,
> > michael
> >
> > Michael Miller
> > Software Engineer
> > Institute for Systems Biology
> >
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Tuesday, July 29, 2014 11:56 AM
> > To: Michael Miller
> > Cc: w3c semweb hcls
> > Subject: Re: hcls dataset description comments
> >
> > Hi!
> >
> > Thanks for the suggestions. I have incorporated your minor edits.
> > Unbelievable how those slipped through after so many re-readings still.
> >
> > For other edits, please fork the repository and create a pull request
> > with your changes.
> >
> > Best wishes,
> >
> > Kim
> >
> > On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org>
> > wrote:
> >
> > hi kim,
> >
> > thanks for the pointer, i've updated my comments based on this newer draft
> > below. many fewer and i especially like the complete example in 10.1!
> >
> > cheers,
> > michael
> >
> > Michael Miller
> > Software Engineer
> > Institute for Systems Biology
> >
> > general comments:
> >
> > · s4.4 'Dataset Linking': might mention also that datasets are derived
> >   from other datasets?
> >   'A dataset may incorporate, or link to, data in other datasets, e.g. in
> >   the creation of a data mashup ' --> 'A dataset may incorporate, be
> >   derived from, or link to, data in other datasets, e.g. in the analysis
> >   of original datasets or in the creation of a data mashup '
> >
> > · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are individual
> >   organizations but three (8.4, 8.8, 8.9) have subsections for different
> >   organizations. maybe organize so all top level sections define a type
> >   of organization with subsections beneath or make all top-level?
> >
> > · s8: some of the use cases could be more focused on how this note will
> >   help them (8.5-8.7)
> >
> > · s8.9: how do Data Catalogs fit into this note? wasn't clear to me how
> >   this note is relevant to them
> >
> > our use case questions:
> >
> > · how to reference 3rd party datasets that aren't described by this
> >   standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
> >   with the IRI being the URL into the repository?
> >
> > · we have a lot of intermediary files that we won't publish, the software
> >   specified in creating our published datasets from its sources form a
> >   (branching) workflow with the input being from the previous step(s) in
> >   the workflow. how best to represent this? this note doesn't seem to
> >   cover how the dataset is created so any recommendations?
> >
> > minor edits:
> >
> > · there are two s6.2.3 sections
> >
> > · s8.8.1: '... what period it is updated. To know when to...' should be
> >   '...what period it is updated to know when to...'?
> >
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Tuesday, July 22, 2014 3:43 PM
> > To: Michael Miller
> > Cc: w3c semweb hcls
> > Subject: Re: hcls dataset description comments
> >
> > Hello,
> >
> > I believe you were looking at an old document. There is currently only
> > one Figure in the note.
> >
> > Please check the actual draft at:
> > http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
> >
> > Best wishes,
> >
> > Kim
> >
> > On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org>
> > wrote:
> >
> > hi all,
> >
> > tremendous work, very clear and well-written. my group at ISB, the
> > Shmulevich lab is looking to provide provenance for the analysis datasets
> > we are producing for TCGA.
> > we're not sure if we'll be able to 'go all the way' but we want to make
> > sure we have at hand all the information so that we could, at least in
> > theory, be compliant. as long as i was reading the document, below are
> > some notes.
> >
> > general comments:
> >
> > · s4.4 'Dataset Linking': might mention also that datasets are derived
> >   from other datasets?
> >   'A dataset may incorporate, or link to, data in other datasets, e.g. in
> >   the creation of a data mashup ' --> 'A dataset may incorporate, be
> >   derived from, or link to, data in other datasets, e.g. in the analysis
> >   of original datasets or in the creation of a data mashup '
> >
> > · the chembl example in s5 is not compliant with the property table
> >   below, it probably is only supposed to show the relationship of the
> >   three terms but that could be clarified
> >
> > · s6.2.12 could use the example filled in
> >
> > · 6.3.2: not sure what an 'X level description' is
> >
> > · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are individual
> >   organizations but three (8.4, 8.8, 8.9) have subsections for different
> >   organizations. maybe organize so all top level sections define a type
> >   of organization with subsections beneath or make all top-level?
> >
> > · s8: many of the use cases could be more focused on how this note will
> >   help them
> >
> > · s8.9: how do Data Catalogs fit into this note? wasn't clear to me how
> >   this note is relevant to them
> >
> > · would be nice to have a 'complete' example put together, maybe based on
> >   chembl?
> >
> > our use case questions:
> >
> > · how to reference 3rd party datasets that aren't described by this
> >   standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
> >   with the IRI being the URL into the repository?
> >
> > · we have a lot of intermediary files that we won't publish, the software
> >   specified in creating our published datasets from its sources form a
> >   (branching) workflow with the input being from the previous step(s) in
> >   the workflow. how best to represent this? this note doesn't seem to
> >   cover how the dataset is created so any recommendations?
> >
> > text issues:
> >
> > · Figure 1: 'Overview of dataset description level metadata profiles and
> >   their relationships': reference not resolved, image doesn't show
> >
> > · Figure 2: 'Improve diagram. Multiple appearance of
> >   concepts/description levels unclear.': reference not resolved, image
> >   doesn't show. add actual label
> >
> > minor edits:
> >
> > · bottom of s.3: 'placeholde' should be 'placeholder'
> >
> > · use straight quotes rather than slant quotes in s6.2.2 example (and
> >   elsewhere)?
> >
> > · the text runs out of the box in s6.2.3 'Description'
> >
> > · s6.2.3: 'Dates of Creation and Issuance': 'state the date the dataset
> >   was generated using dct:created and/or the date the dataset was made
> >   public using dct:created' should be 'state the date the dataset was
> >   generated using dct:created and/or the date the dataset was made public
> >   using dct:issued'?
> >
> > · there are two s6.2.3 sections
> >
> > · s6.2.4: 'Creation: ... The date of authorship' should be '...The date
> >   of creation' and 'Curation:... The date of authorship' should be
> >   '...The date of curation'?
> >
> > · s8.5: the author list has end parenthesis without beginning parenthesis
> >
> > · s8.8.1: '... what period it is updated. To know when to...' should be
> >   '...what period it is updated to know when to...'
> >
> > cheers,
> > michael
> >
> > Michael Miller
> > Software Engineer
> > Institute for Systems Biology
> >
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
Received on Tuesday, 5 August 2014 15:14:24 UTC