Re: hcls dataset description comments--Dataset Descriptions vs. PROV from Joachim Baran on 2014-08-09 (public-semweb-lifesci@w3.org from August 2014)

From: Joachim Baran <joachim.baran@gmail.com>
Date: Fri, 8 Aug 2014 18:46:27 -0700
To: Michael Miller <Michael.Miller@systemsbiology.org>
Cc: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>, w3c semweb hcls <public-semweb-lifesci@w3.org>
Message-Id: <055A2CA7-6BE9-4220-B3ED-C0C7874D311F@gmail.com>
Hello,

  Has there been an update to this? Preferably a pull request?

Thanks,

Kim



> On Aug 5, 2014, at 8:13 AM, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
> 
> hi stian,
> 
> thanks much, very useful!
> 
> cheers,
> michael
> 
> Michael Miller
> Software Engineer
> Institute for Systems Biology
> 
> 
>> -----Original Message-----
>> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of Stian
>> Soiland-Reyes
>> Sent: Tuesday, August 05, 2014 5:38 AM
>> To: Michael Miller
>> Cc: Joachim Baran; w3c semweb hcls
>> Subject: Re: hcls dataset description comments--Dataset Descriptions vs.
>> PROV
>> 
>> Just some inputs:
>> 
>> 
>> PROV defines prov:wasDerivedFrom, which in a broad sense describes such a
>> relationship between datasets. However, it tells you nothing more
>> about what kind of derivation we are talking about.
>> 
>> 
>> In PAV we found the need to specialize three types of derivation:
>> 
>> pav:retrievedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
>> .. a byte-for-byte download
>> 
>> pav:importedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/importedFrom
>> .. a somewhat equivalent form of the source, but after some kind of
>> transformation or selection (e.g. CSV -> XML)
>> 
>> pav:derivedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
>> .. when the new resource has been further refined or modified
>> (somewhat adding additional knowledge)
>> 
>> 
>> If you are simply concatenating several datasets, then multiple
>> pav:importedFrom statements would make sense. If further knowledge is
>> added, say by reasoning or calculation, then pav:derivedFrom would
>> make sense.
>> 
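As a rough illustration of the three PAV cases above, the following sketch picks a property by kind of derivation and emits N-Triples. The example.org IRIs, the "download"/"transform"/"refine" labels and the helper name pav_triples are hypothetical, made up for this sketch; only the three pav: property names come from the message.

```python
PAV = "http://purl.org/pav/"

# Assumed labels for the three kinds of derivation described above.
PAV_PROPERTY = {
    "download": "retrievedFrom",   # byte-for-byte copy of the source
    "transform": "importedFrom",   # converted or selected, e.g. CSV -> XML
    "refine": "derivedFrom",       # refined or modified, knowledge added
}


def pav_triples(dataset_iri, sources, kind):
    """Return one N-Triples line per source, linking the dataset to it."""
    prop = PAV + PAV_PROPERTY[kind]
    return [f"<{dataset_iri}> <{prop}> <{src}> ." for src in sources]


# Hypothetical IRIs: a feature matrix concatenated from two platform files.
matrix = "http://example.org/dataset/feature-matrix"
sources = [
    "http://example.org/source/mrna",
    "http://example.org/source/mirna",
]
lines = pav_triples(matrix, sources, "transform")
print("\n".join(lines))
```

A plain concatenation therefore yields one pav:importedFrom statement per source; passing "refine" instead would give pav:derivedFrom statements for results that add knowledge.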
>> 
>> Now if you want to detail exactly how those datasets have been
>> combined, I think you are right that it would make sense to break down
>> the derivation using PROV statements, e.g. a series of activities,
>> generation and usage. How to describe these activities (e.g.
>> subclasses and properties) will be specific to each case.
>> 
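One possible way to break a derivation down into PROV activities, generation and usage is sketched below for a two-step workflow. The example.org IRIs, the step names and the helper describe_step are hypothetical; the sketch uses only the core PROV terms prov:Activity, prov:used, prov:wasGeneratedBy and prov:wasDerivedFrom and leaves out any case-specific subclasses.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(s, p, o):
    """Format one N-Triples statement whose object is an IRI."""
    return f"<{s}> <{p}> <{o}> ."


def describe_step(activity, inputs, output):
    """Describe one workflow step: its usage, generation and derivation."""
    out = [triple(activity, RDF_TYPE, PROV + "Activity")]
    out += [triple(activity, PROV + "used", i) for i in inputs]
    out.append(triple(output, PROV + "wasGeneratedBy", activity))
    out += [triple(output, PROV + "wasDerivedFrom", i) for i in inputs]
    return out


# Hypothetical IRIs: merge two platform files, then analyse the result.
EX = "http://example.org/"
triples = describe_step(
    EX + "activity/merge",
    [EX + "source/mrna", EX + "source/mirna"],
    EX + "dataset/feature-matrix",
)
triples += describe_step(
    EX + "activity/cluster",
    [EX + "dataset/feature-matrix"],
    EX + "dataset/clusters",
)
print("\n".join(triples))
```

Because the output of the first step is the input of the second, the intermediate dataset gets an IRI and appears in the chain even if the file itself is never published.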
>> 
>> 
>> If the process you generated the dataset with somewhat resembles a
>> dataflow, you might be interested in the wfprov and wfdesc ontologies
>> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
>> which can be related to a common workflow description (e.g. a
>> prov:Plan):
>> 
>> http://purl.org/wf4ever/model#wfprov
>> 
>> OPMW is a similar approach:
>> http://www.opmw.org/model/OPMW/
>> 
>> 
>> 
>> On 4 August 2014 17:44, Michael Miller
>> <Michael.Miller@systemsbiology.org> wrote:
>>> hi all,
>>> 
>>> 
>>> 
>>> as you are all undoubtedly aware, a major, if not the major, TCGA dataset
>>> use case revolves around taking the 3rd level data from the TCGA dcc
>>> repository and doing analysis, producing 4th level data such as clusters,
>>> pca, etc.  one of the things we do here at ISB is produce an intermediate
>>> data step that combines the different platforms (mRNA, miRNA, RPPA, METH,
>>> etc.) into one feature matrix so that the analysis can use all the
>>> platforms together.  the Broad firehose pipeline also has this as one of
>>> its outputs.
>>> 
>>> 
>>> 
>>> as some of my comments allude to, it doesn't seem that Dataset
>>> Descriptions deal with the use case of describing a dataset that is
>>> specifically derived from other datasets, which is what we are looking
>>> at ways to describe when we publish our data.  i took a look at PROV and,
>>> though i've got a bit more mapping to do, it seems like PROV provides the
>>> terms we need.
>>> 
>>> 
>>> 
>>> but this has led me to ask what the relation of Dataset Descriptions and
>>> PROV is, and how, or whether, they should be used together.  i think the
>>> above use case is quite common for datasets being published so might
>>> deserve a discussion in the Dataset Descriptions note.
>>> 
>>> 
>>> 
>>> cheers,
>>> 
>>> michael
>>> 
>>> 
>>> 
>>> Michael Miller
>>> 
>>> Software Engineer
>>> 
>>> Institute for Systems Biology
>>> 
>>> 
>>> 
>>> 
>>> 
>>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>>> Sent: Thursday, July 31, 2014 3:43 PM
>>> To: Michael Miller
>>> Cc: w3c semweb hcls
>>> Subject: Re: hcls dataset description comments
>>> 
>>> 
>>> 
>>> Hi!
>>> 
>>> 
>>> 
>>>  I will ponder the edit suggested in your first bullet point. I am not
>>> sure at the moment if it would have wider implications.
>>> 
>>> 
>>> 
>>>  You are right that the use cases were written by the groups themselves.
>>> I do not know how to improve the use cases without rewriting them, which
>>> might not be agreeable to all parties involved. C'est la vie.
>>> 
>>> 
>>> 
>>>  The role of Data Catalogs should then be discussed during our next conf
>>> call. Thanks for highlighting that this might be unclear to readers.
>>> 
>>> 
>>> 
>>> Kim
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 30 July 2014 10:41, Michael Miller
>>> <Michael.Miller@systemsbiology.org>
>>> wrote:
>>> 
>>> hi kim,
>>> 
>>> 
>>> 
>>> 'For other edits, please fork the repository and create a pull request
>>> with
>>> your changes'
>>> 
>>> 
>>> 
>>> of the four general comments, the first is really the only 'edit', i
>>> didn't
>>> put it in the minor edits because it had some implications that the
>>> group
>>> might not agree with.  if the change makes sense, it might be easier for
>>> you
>>> to make the edit.
>>> 
>>> 
>>> 
>>> the other three are general comments and i'm not sure what the solution
>>> might be, they were mainly points, as a reader, that weren't clear or
>>> were a
>>> bit confusing.  these were all from the use case section so were
>>> probably
>>> written by the groups themselves?  if i have permission, i can certainly
>>> add
>>> them as issues.
>>> 
>>> 
>>> 
>>> cheers,
>>> 
>>> michael
>>> 
>>> 
>>> 
>>> Michael Miller
>>> 
>>> Software Engineer
>>> 
>>> Institute for Systems Biology
>>> 
>>> 
>>> 
>>> 
>>> 
>>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>>> Sent: Tuesday, July 29, 2014 11:56 AM
>>> 
>>> 
>>> To: Michael Miller
>>> Cc: w3c semweb hcls
>>> Subject: Re: hcls dataset description comments
>>> 
>>> 
>>> 
>>> Hi!
>>> 
>>> 
>>> 
>>>  Thanks for the suggestions. I have incorporated your minor edits.
>>> Unbelievable how those slipped through after so many re-readings still.
>>> 
>>> 
>>> 
>>>  For other edits, please fork the repository and create a pull request
>>> with
>>> your changes.
>>> 
>>> 
>>> 
>>> Best wishes,
>>> 
>>> 
>>> 
>>> Kim
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 23 July 2014 08:53, Michael Miller
>>> <Michael.Miller@systemsbiology.org>
>>> wrote:
>>> 
>>> hi kim,
>>> 
>>> 
>>> 
>>> thanks for the pointer, i've updated my comments based on this newer
>>> draft below.  many fewer and i especially like the complete example in
>>> 10.1!
>>> 
>>> 
>>> 
>>> cheers,
>>> 
>>> michael
>>> 
>>> 
>>> 
>>> Michael Miller
>>> 
>>> Software Engineer
>>> 
>>> Institute for Systems Biology
>>> 
>>> 
>>> 
>>> general comments:
>>> 
>>> ·         s4.4 'Dataset Linking': might mention also that datasets are
>>> derived from other datasets?
>>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
>>> the creation of a data mashup ' --> 'A dataset may incorporate, be
>>> derived from, or link to, data in other datasets, e.g. in the analysis of
>>> original datasets or in the creation of a data mashup '
>>> 
>>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>>> individual organizations but three (8.4, 8.8, 8.9) have subsections for
>>> different organizations.  maybe organize so all top level sections
>>> define a
>>> type of organization with subsections beneath or make all top-level?
>>> 
>>> ·         s8: some of the use cases could be more focused on how this
>>> note
>>> will help them (8.5-8.7)
>>> 
>>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
>>> to me
>>> how this note is relevant to them
>>> 
>>> our use case questions:
>>> 
>>> ·         how to reference 3rd party datasets that aren't described by
>>> this
>>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>>> with
>>> the IRI being the URL into the repository?
>>> 
>>> ·         we have a lot of intermediary files that we won't publish, the
>>> software specified in creating our published datasets from its sources
>>> form a (branching) workflow with the input being from the previous
>>> step(s) in the workflow.  how best to represent this?  this note doesn't
>>> seem to cover how the dataset is created so any recommendations?
>>> 
>>> minor edits:
>>> 
>>> ·         there are two s6.2.3 sections
>>> 
>>> ·         s8.8.1: '... what period it is updated. To know when to...'
>>> should
>>> be '...what period it is updated to know when to...'?
>>> 
>>> 
>>> 
>>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>>> Sent: Tuesday, July 22, 2014 3:43 PM
>>> To: Michael Miller
>>> Cc: w3c semweb hcls
>>> Subject: Re: hcls dataset description comments
>>> 
>>> 
>>> 
>>> Hello,
>>> 
>>> 
>>> 
>>>  I believe you were looking at an old document. There is currently only
>>> one
>>> Figure in the note.
>>> 
>>> 
>>> 
>>>  Please check the actual draft at:
>>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>>> 
>>> 
>>> 
>>> Best wishes,
>>> 
>>> 
>>> 
>>> Kim
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 22 July 2014 15:36, Michael Miller
>>> <Michael.Miller@systemsbiology.org>
>>> wrote:
>>> 
>>> hi all,
>>> 
>>> 
>>> 
>>> tremendous work, very clear and well-written.  my group at ISB, the
>>> Shmulevich lab, is looking to provide provenance for the analysis
>>> datasets we are producing for TCGA.  we're not sure if we'll be able to
>>> 'go all the way' but we want to make sure we have at hand all the
>>> information so that we could, at least in theory, be compliant.  as long
>>> as i was reading the document, below are some notes.
>>> 
>>> 
>>> 
>>> general comments:
>>> 
>>> ·         s4.4 'Dataset Linking': might mention also that datasets are
>>> derived from other datasets?
>>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
>>> the creation of a data mashup ' --> 'A dataset may incorporate, be
>>> derived from, or link to, data in other datasets, e.g. in the analysis of
>>> original datasets or in the creation of a data mashup '
>>> 
>>> ·         the chembl example in s5 is not compliant to the property
>>> table
>>> below, it probably is only supposed to show the relationship of the
>>> three
>>> terms but that could be clarified
>>> 
>>> ·         s6.2.12 could use the example filled in
>>> 
>>> ·         6.3.2: not sure what an 'X level description' is
>>> 
>>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>>> individual organizations but three (8.4, 8.8, 8.9) have subsections for
>>> different organizations.  maybe organize so all top level sections
>>> define a
>>> type of organization with subsections beneath or make all top-level?
>>> 
>>> ·         s8: many of the use cases could be more focused on how this
>>> note
>>> will help them
>>> 
>>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
>>> to me
>>> how this note is relevant to them
>>> 
>>> ·         would be nice to have a 'complete' example put together, maybe
>>> based on chembl?
>>> 
>>> 
>>> 
>>> our use case questions:
>>> 
>>> ·         how to reference 3rd party datasets that aren't described by
>>> this
>>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>>> with
>>> the IRI being the URL into the repository?
>>> 
>>> ·         we have a lot of intermediary files that we won't publish, the
>>> software specified in creating our published datasets from its sources
>>> form a (branching) workflow with the input being from the previous
>>> step(s) in the workflow.  how best to represent this?  this note doesn't
>>> seem to cover how the dataset is created so any recommendations?
>>> 
>>> 
>>> 
>>> text issues:
>>> 
>>> ·         Figure 1: 'Overview of dataset description level metadata
>>> profiles
>>> and their relationships': reference not resolved, image doesn't show
>>> 
>>> ·         Figure 2: 'Improve diagram. Multiple appearance of
>>> concepts/description levels unclear.': reference not resolved, image
>>> doesn't show.  add actual label
>>> 
>>> 
>>> 
>>> minor edits:
>>> 
>>> ·         bottom of s.3: 'placeholde' should be 'placeholder'
>>> 
>>> ·         use straight quotes rather than slant quotes in s6.2.2 example
>>> (and elsewhere)?
>>> 
>>> ·         the text runs out of the box in s6.2.3 'Description'
>>> 
>>> ·         s6.2.3: 'Dates of Creation and Issuance': 'state the date the
>>> dataset was generated using dct:created and/or the date the dataset was
>>> made public using dct:created' should be 'state the date the dataset was
>>> generated using dct:created and/or the date the dataset was made public
>>> using dct:issued'?
>>> 
>>> ·         there are two s6.2.3 sections
>>> 
>>> ·         s6.2.4: 'Creation: ... The date of authorship' should be
>>> '...The
>>> date of creation' and 'Curation:... The date of authorship' should be
>>> '...The date of curation'?
>>> 
>>> ·         s8.5: the author list has end parenthesis without beginning
>>> parenthesis
>>> 
>>> ·         s8.8.1: '... what period it is updated. To know when to...'
>>> should
>>> be '...what period it is updated to know when to...'
>>> 
>>> 
>>> 
>>> cheers,
>>> 
>>> michael
>>> 
>>> 
>>> 
>>> Michael Miller
>>> 
>>> Software Engineer
>>> 
>>> Institute for Systems Biology
>> 
>> 
>> 
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
Received on Saturday, 9 August 2014 01:46:57 UTC