- From: Joachim Baran <joachim.baran@gmail.com>
- Date: Mon, 11 Aug 2014 08:00:03 -0700
- To: Michael Miller <Michael.Miller@systemsbiology.org>
- Cc: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>, w3c semweb hcls <public-semweb-lifesci@w3.org>
- Message-ID: <CAObSwHU4Ue5Tiyg0mL2tWz8dHv=e=itY6TWUY4MTPb2P4CGyyA@mail.gmail.com>
Great! Before you send the pull request, please make sure that W3C's HTML
validation passes: http://validator.w3.org/#validate_by_input

Kim

On 9 August 2014 10:15, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
> hi kim,
>
> i've made decent progress and expect to have something mid-week, if all goes
> well (as a pull request, tho no guarantee on the formatting!)
>
> cheers,
> michael
>
> Michael Miller
> Software Engineer
> Institute for Systems Biology
>
> > -----Original Message-----
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Friday, August 08, 2014 6:46 PM
> > To: Michael Miller
> > Cc: Stian Soiland-Reyes; w3c semweb hcls
> > Subject: Re: hcls dataset description comments--Dataset Descriptions vs. PROV
> >
> > Hello,
> >
> > Has there been an update to this? Preferably a pull request?
> >
> > Thanks,
> >
> > Kim
> >
> > > On Aug 5, 2014, at 8:13 AM, Michael Miller
> > > <Michael.Miller@systemsbiology.org> wrote:
> > >
> > > hi stian,
> > >
> > > thanks much, very useful!
> > >
> > > cheers,
> > > michael
> > >
> > > Michael Miller
> > > Software Engineer
> > > Institute for Systems Biology
> > >
> > >> -----Original Message-----
> > >> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of
> > >> Stian Soiland-Reyes
> > >> Sent: Tuesday, August 05, 2014 5:38 AM
> > >> To: Michael Miller
> > >> Cc: Joachim Baran; w3c semweb hcls
> > >> Subject: Re: hcls dataset description comments--Dataset Descriptions
> > >> vs. PROV
> > >>
> > >> Just some inputs:
> > >>
> > >> PROV defines prov:wasDerivedFrom which in broad sense describes such a
> > >> relationship between datasets. However you do not know anything more
> > >> about what kind of derivation we are talking about.
> > >>
> > >> In PAV we found the need to specialize three types of derivation:
> > >>
> > >> pav:retrievedFrom -
> > >> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
> > >> .. a byte-for-byte download
> > >>
> > >> pav:importedFrom -
> > >> http://purl.org/pav/html#http://purl.org/pav/importedFrom
> > >> .. a somewhat equivalent form of the source, but after some kind of
> > >> transformation or selection (e.g. CSV -> XML)
> > >>
> > >> pav:derivedFrom -
> > >> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
> > >> .. when the new resource has been further refined or modified
> > >> (somewhat adding additional knowledge)
> > >>
> > >> If you are simply concatenating several datasets, then multiple
> > >> pav:importedFrom statements would make sense. If further knowledge is
> > >> added, say by reasoning or calculation, then pav:derivedFrom would
> > >> make sense.
> > >>
> > >> Now if you want to detail exactly how those datasets have been
> > >> combined, I think you are right that would make sense to break down
> > >> the derivation using PROV statements, e.g. a series of activities,
> > >> generation and usage. How to describe these activities (e.g.
> > >> subclasses and properties) will be specific to each case.
> > >>
> > >> If the process you generated the dataset with somewhat resembles a
> > >> dataflow, you might be interested in the wfprov and wfdesc ontologies
> > >> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
> > >> which can be related to a common workflow description (e.g. a
> > >> prov:Plan):
> > >>
> > >> http://purl.org/wf4ever/model#wfprov
> > >>
> > >> OPMW is a similar approach:
> > >> http://www.opmw.org/model/OPMW/
> > >>
> > >> On 4 August 2014 17:44, Michael Miller
> > >> <Michael.Miller@systemsbiology.org> wrote:
> > >>> hi all,
> > >>>
> > >>> as you are all undoubtedly aware, a major, if not the major TCGA
> > >>> dataset use cases revolve around taking the 3rd level data from the
> > >>> TCGA dcc repository and doing analysis, producing 4th level data such
> > >>> as clusters, pca, etc.
> > >>> one of the things we do here at ISB is produce an intermediate data
> > >>> step that combines the different platforms (mRNA, miRNA, RPPA, METH,
> > >>> etc.) into one feature matrix so that the analysis can use all the
> > >>> platforms together.
> > >>> the Broad firehose pipeline also has this as one of its outputs.
> > >>>
> > >>> as some of my comments allude to, it doesn't seem that Dataset
> > >>> Descriptions deal with the use case of describing a dataset that is
> > >>> specifically derived from other datasets, which is what we are looking
> > >>> at ways we might describe our data when we publish it. i took a look
> > >>> at PROV and, i've got a bit more mapping to do, but it seems like PROV
> > >>> provides the terms we need.
> > >>>
> > >>> but this has led me to ask the question of what is the relation of
> > >>> Dataset Descriptions and PROV and how should they/should they be used
> > >>> together? i think the above use case is quite common for datasets
> > >>> being published so might deserve a discussion in the Dataset
> > >>> Descriptions note
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>> Michael Miller
> > >>> Software Engineer
> > >>> Institute for Systems Biology
> > >>>
> > >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > >>> Sent: Thursday, July 31, 2014 3:43 PM
> > >>> To: Michael Miller
> > >>> Cc: w3c semweb hcls
> > >>> Subject: Re: hcls dataset description comments
> > >>>
> > >>> Hi!
> > >>>
> > >>> I will ponder about your edit suggestion of your first bullet point.
> > >>> I am not sure at the moment if it would have wider implications.
> > >>>
> > >>> You are right that the use cases were written by the groups
> > >>> themselves. I do not know how to improve the use cases without
> > >>> rewriting them, which might not be agreeable to all parties involved.
> > >>> C'est la vie.
> > >>>
> > >>> The role of Data Catalogs should then be discussed during our next
> > >>> conf call. Thanks for highlighting that this might be unclear to
> > >>> readers.
> > >>>
> > >>> Kim
> > >>>
> > >>> On 30 July 2014 10:41, Michael Miller
> > >>> <Michael.Miller@systemsbiology.org> wrote:
> > >>>
> > >>> hi kim,
> > >>>
> > >>> 'For other edits, please fork the repository and create a pull request
> > >>> with your changes'
> > >>>
> > >>> of the four general comments, the first is really the only 'edit', i
> > >>> didn't put it in the minor edits because it had some implications that
> > >>> the group might not agree with. if the change makes sense, it might
> > >>> be easier for you to make the edit.
> > >>>
> > >>> the other three are general comments and i'm not sure what the
> > >>> solution might be, they were mainly points, as a reader, that weren't
> > >>> clear or were a bit confusing. these were all from the use case
> > >>> section so were probably written by the groups themselves? if i have
> > >>> permission, i can certainly add them as issues.
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>> Michael Miller
> > >>> Software Engineer
> > >>> Institute for Systems Biology
> > >>>
> > >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > >>> Sent: Tuesday, July 29, 2014 11:56 AM
> > >>> To: Michael Miller
> > >>> Cc: w3c semweb hcls
> > >>> Subject: Re: hcls dataset description comments
> > >>>
> > >>> Hi!
> > >>>
> > >>> Thanks for the suggestions. I have incorporated your minor edits.
> > >>> Unbelievable how those slipped through after so many re-readings
> > >>> still.
> > >>>
> > >>> For other edits, please fork the repository and create a pull request
> > >>> with your changes.
> > >>>
> > >>> Best wishes,
> > >>>
> > >>> Kim
> > >>>
> > >>> On 23 July 2014 08:53, Michael Miller
> > >>> <Michael.Miller@systemsbiology.org> wrote:
> > >>>
> > >>> hi kim,
> > >>>
> > >>> thanks for the pointer, i've updated my comments based on this newer
> > >>> draft below. many fewer and i especially like the complete example in
> > >>> 10.1!
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>> Michael Miller
> > >>> Software Engineer
> > >>> Institute for Systems Biology
> > >>>
> > >>> general comments:
> > >>>
> > >>> · s4.4 'Dataset Linking': might mention also that datasets are
> > >>> derived from other datasets?
> > >>> 'A dataset may incorporate, or link to, data in other datasets, e.g.
> > >>> in the creation of a data mashup ' --> 'A dataset may incorporate, be
> > >>> derived from, or link to, data in other datasets, e.g. in the analysis
> > >>> of original datasets or in the creation of a data mashup '
> > >>>
> > >>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
> > >>> individual organizations but three (8.4, 8.8, 8.9) have subsections
> > >>> for different organizations. maybe organize so all top level sections
> > >>> define a type of organization with subsections beneath or make all
> > >>> top-level?
> > >>>
> > >>> · s8: some of the use cases could be more focused on how this note
> > >>> will help them (8.5-8.7)
> > >>>
> > >>> · s8.9: how do Data Catalogs fit into this note? wasn't clear to me
> > >>> how this note is relevant to them
> > >>>
> > >>> our use case questions:
> > >>>
> > >>> · how to reference 3rd party datasets that aren't described by this
> > >>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
> > >>> with the IRI being the URL into the repository?
> > >>>
> > >>> · we have a lot of intermediary files that we won't publish, the
> > >>> software specified in creating our published datasets from its sources
> > >>> form a (branching) workflow with the input being from the previous
> > >>> step(s) in the workflow. how best to represent this? this note
> > >>> doesn't seem to cover how the dataset is created so any
> > >>> recommendations?
> > >>>
> > >>> minor edits:
> > >>>
> > >>> · there are two s6.2.3 sections
> > >>>
> > >>> · s8.8.1: '... what period it is updated. To know when to...' should
> > >>> be '...what period it is updated to know when to...'?
> > >>>
> > >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > >>> Sent: Tuesday, July 22, 2014 3:43 PM
> > >>> To: Michael Miller
> > >>> Cc: w3c semweb hcls
> > >>> Subject: Re: hcls dataset description comments
> > >>>
> > >>> Hello,
> > >>>
> > >>> I believe you were looking at an old document. There is currently
> > >>> only one Figure in the note.
> > >>>
> > >>> Please check the actual draft at:
> > >>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
> > >>>
> > >>> Best wishes,
> > >>>
> > >>> Kim
> > >>>
> > >>> On 22 July 2014 15:36, Michael Miller
> > >>> <Michael.Miller@systemsbiology.org> wrote:
> > >>>
> > >>> hi all,
> > >>>
> > >>> tremendous work, very clear and well-written. my group at ISB, the
> > >>> Shmulevich lab is looking to provide provenance for the analysis
> > >>> datasets we are producing for TCGA. we're not sure if we'll be able
> > >>> to 'go all the way' but we want to make sure we have at hand all the
> > >>> information that we could, at least in theory, be compliant. as long
> > >>> as i was reading the document, below are some notes.
> > >>>
> > >>> general comments:
> > >>>
> > >>> · s4.4 'Dataset Linking': might mention also that datasets are
> > >>> derived from other datasets?
> > >>> 'A dataset may incorporate, or link to, data in other datasets, e.g.
> > >>> in the creation of a data mashup ' --> 'A dataset may incorporate, be
> > >>> derived from, or link to, data in other datasets, e.g. in the analysis
> > >>> of original datasets or in the creation of a data mashup '
> > >>>
> > >>> · the chembl example in s5 is not compliant to the property table
> > >>> below, it probably is only supposed to show the relationship of the
> > >>> three terms but that could be clarified
> > >>>
> > >>> · s6.2.12 could use the example filled in
> > >>>
> > >>> · 6.3.2: not sure what an 'X level description' is
> > >>>
> > >>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
> > >>> individual organizations but three (8.4, 8.8, 8.9) have subsections
> > >>> for different organizations. maybe organize so all top level sections
> > >>> define a type of organization with subsections beneath or make all
> > >>> top-level?
> > >>>
> > >>> · s8: many of the use cases could be more focused on how this note
> > >>> will help them
> > >>>
> > >>> · s8.9: how do Data Catalogs fit into this note? wasn't clear to me
> > >>> how this note is relevant to them
> > >>>
> > >>> · would be nice to have a 'complete' example put together, maybe
> > >>> based on chembl?
> > >>>
> > >>> our use case questions:
> > >>>
> > >>> · how to reference 3rd party datasets that aren't described by this
> > >>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
> > >>> with the IRI being the URL into the repository?
> > >>>
> > >>> · we have a lot of intermediary files that we won't publish, the
> > >>> software specified in creating our published datasets from its sources
> > >>> form a (branching) workflow with the input being from the previous
> > >>> step(s) in the workflow. how best to represent this? this note
> > >>> doesn't seem to cover how the dataset is created so any
> > >>> recommendations?
> > >>>
> > >>> text issues:
> > >>>
> > >>> · Figure 1: 'Overview of dataset description level metadata profiles
> > >>> and their relationships': reference not resolved, image doesn't show
> > >>>
> > >>> · Figure 2: 'Improve diagram. Multiple appearance of
> > >>> concepts/description levels unclear.': reference not resolved, image
> > >>> doesn't show. add actual label
> > >>>
> > >>> minor edits:
> > >>>
> > >>> · bottom of s.3: 'placeholde' should be 'placeholder'
> > >>>
> > >>> · use straight quotes rather than slant quotes in s6.2.2 example
> > >>> (and elsewhere)?
> > >>>
> > >>> · the text runs out of the box in s6.2.3 'Description'
> > >>>
> > >>> · s6.2.3: 'Dates of Creation and Issuance': 'state the date the
> > >>> dataset was generated using dct:created and/or the date the dataset
> > >>> was made public using dct:created' should be 'state the date the
> > >>> dataset was generated using dct:created and/or the date the dataset
> > >>> was made public using dct:issued'?
> > >>>
> > >>> · there are two s6.2.3 sections
> > >>>
> > >>> · s6.2.4: 'Creation: ... The date of authorship' should be '...The
> > >>> date of creation' and 'Curation:... The date of authorship' should be
> > >>> '...The date of curation'?
> > >>>
> > >>> · s8.5: the author list has end parenthesis without beginning
> > >>> parenthesis
> > >>>
> > >>> · s8.8.1: '... what period it is updated. To know when to...' should
> > >>> be '...what period it is updated to know when to...'
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>> Michael Miller
> > >>> Software Engineer
> > >>> Institute for Systems Biology
> > >>
> > >> --
> > >> Stian Soiland-Reyes, myGrid team
> > >> School of Computer Science
> > >> The University of Manchester
> > >> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
Received on Monday, 11 August 2014 15:00:35 UTC