RE: hcls dataset description comments--Dataset Descriptions vs. PROV from Michael Miller on 2014-08-05 (public-semweb-lifesci@w3.org from August 2014)

From: Michael Miller <Michael.Miller@systemsbiology.org>
Date: Tue, 5 Aug 2014 08:13:55 -0700
To: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Cc: Joachim Baran <joachim.baran@gmail.com>, w3c semweb hcls <public-semweb-lifesci@w3.org>
Message-ID: <4b39f62dca8f464888693a2ad2b4e8d2@mail.gmail.com>

hi stian,

thanks much, very useful!

cheers,
michael

Michael Miller
Software Engineer
Institute for Systems Biology


> -----Original Message-----
> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of Stian
> Soiland-Reyes
> Sent: Tuesday, August 05, 2014 5:38 AM
> To: Michael Miller
> Cc: Joachim Baran; w3c semweb hcls
> Subject: Re: hcls dataset description comments--Dataset Descriptions vs.
> PROV
>
> Just some inputs:
>
>
> PROV defines prov:wasDerivedFrom, which in a broad sense describes such a
> relationship between datasets. However, it tells you nothing more
> about what kind of derivation we are talking about.
>
>
> In PAV we found the need to specialize three types of derivation:
>
> pav:retrievedFrom -
> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
> .. a byte-for-byte download
>
> pav:importedFrom -
> http://purl.org/pav/html#http://purl.org/pav/importedFrom
> .. a somewhat equivalent form of the source, but after some kind of
> transformation or selection (e.g. CSV -> XML)
>
> pav:derivedFrom -
> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
>  .. when the new resource has been further refined or modified
> (somewhat adding additional knowledge)
>
>
> If you are simply concatenating several datasets, then multiple
> pav:importedFrom statements would make sense. If further knowledge is
> added, say by reasoning or calculation, then pav:derivedFrom would
> make sense.
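
As a rough illustration of the choice described above, here is a minimal Python sketch that maps the three cases to their PAV properties and prints one N-Triples statement per source. The `kind` labels, the function name and the example.org IRIs are assumptions made up for this example; only the PAV namespace and the three property names come from the message.

```python
PAV = "http://purl.org/pav/"

# Hypothetical labels for the three kinds of derivation described above.
PAV_PROPERTY = {
    "byte_copy": "retrievedFrom",    # byte-for-byte download
    "transformed": "importedFrom",   # equivalent content after transformation or selection
    "refined": "derivedFrom",        # refined or modified, adding knowledge
}


def derivation_triples(target, sources, kind):
    """Return one N-Triples line per source, using the PAV property for `kind`."""
    prop = PAV + PAV_PROPERTY[kind]
    return [f"<{target}> <{prop}> <{src}> ." for src in sources]


if __name__ == "__main__":
    # Example IRIs are placeholders, not real dataset locations.
    lines = derivation_triples(
        "http://example.org/dataset/feature-matrix",
        ["http://example.org/dataset/mrna", "http://example.org/dataset/mirna"],
        "transformed",
    )
    print("\n".join(lines))
```

A simple concatenation of several sources would use "transformed" with all of them listed, which gives multiple pav:importedFrom statements on the same target.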
>
>
> Now if you want to detail exactly how those datasets have been
> combined, I think you are right that it would make sense to break down
> the derivation using PROV statements, e.g. a series of activities,
> generation and usage. How to describe these activities (e.g.
> subclasses and properties) will be specific to each case.
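
To make the breakdown into activities, usage and generation a little more concrete, here is a minimal Python sketch that writes each processing step as a prov:Activity with prov:used and prov:wasGeneratedBy statements in N-Triples. The step names, helper functions and example.org IRIs are hypothetical; only the PROV terms themselves are taken from the PROV-O vocabulary.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def step_triples(activity, inputs, output):
    """N-Triples for one step: its type, the entities it used, the entity it generated."""
    triples = [f"<{activity}> <{RDF_TYPE}> <{PROV}Activity> ."]
    triples += [f"<{activity}> <{PROV}used> <{src}> ." for src in inputs]
    triples.append(f"<{output}> <{PROV}wasGeneratedBy> <{activity}> .")
    return triples


def workflow_triples(steps):
    """Flatten a list of (activity, inputs, output) steps into N-Triples lines."""
    lines = []
    for activity, inputs, output in steps:
        lines.extend(step_triples(activity, inputs, output))
    return lines


if __name__ == "__main__":
    base = "http://example.org/"  # placeholder namespace
    steps = [
        (base + "run/merge", [base + "data/mrna", base + "data/mirna"], base + "data/merged"),
        (base + "run/cluster", [base + "data/merged"], base + "data/clusters"),
    ]
    print("\n".join(workflow_triples(steps)))
```

The output of one step appears as an input of the next, so an unpublished intermediate file can still be named as an entity in the chain without being described as a dataset in its own right.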
>
>
>
> If the process you generated the dataset with somewhat resembles a
> dataflow, you might be interested in the wfprov and wfdesc ontologies
> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
> which can be related to a common workflow description (e.g. a
> prov:Plan):
>
> http://purl.org/wf4ever/model#wfprov
>
> OPMW is a similar approach:
> http://www.opmw.org/model/OPMW/
>
>
>
> On 4 August 2014 17:44, Michael Miller
> <Michael.Miller@systemsbiology.org> wrote:
> > hi all,
> >
> >
> >
> > as you are all undoubtedly aware, a major, if not the major, TCGA
> > dataset use case revolves around taking the 3rd level data from the
> > TCGA dcc repository and doing analysis, producing 4th level data such
> > as clusters, pca, etc.  one of the things we do here at ISB is produce
> > an intermediate data step that combines the different platforms (mRNA,
> > miRNA, RPPA, METH, etc.) into one feature matrix so that the analysis
> > can use all the platforms together.  the Broad firehose pipeline also
> > has this as one of its outputs.
> >
> >
> >
> > as some of my comments allude to, it doesn't seem that Dataset
> > Descriptions deal with the use case of describing a dataset that is
> > specifically derived from other datasets, which is what we need as we
> > look at ways to describe our data when we publish it.  i took a look at
> > PROV and, although i've got a bit more mapping to do, it seems like
> > PROV provides the terms we need.
> >
> >
> >
> > but this has led me to ask what the relation between Dataset
> > Descriptions and PROV is, and how (or whether) they should be used
> > together.  i think the above use case is quite common for datasets
> > being published, so it might deserve a discussion in the Dataset
> > Descriptions note.
> >
> >
> >
> > cheers,
> >
> > michael
> >
> >
> >
> > Michael Miller
> >
> > Software Engineer
> >
> > Institute for Systems Biology
> >
> >
> >
> >
> >
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Thursday, July 31, 2014 3:43 PM
> > To: Michael Miller
> > Cc: w3c semweb hcls
> > Subject: Re: hcls dataset description comments
> >
> >
> >
> > Hi!
> >
> >
> >
> >   I will ponder the edit suggested in your first bullet point. I am
> > not sure at the moment if it would have wider implications.
> >
> >
> >
> >   You are right that the use cases were written by the groups
> > themselves. I do not know how to improve the use cases without
> > rewriting them, which might not be agreeable to all parties involved.
> > C'est la vie.
> >
> >
> >
> >   The role of Data Catalogs should then be discussed during our next
> > conf call. Thanks for highlighting that this might be unclear to
> > readers.
> >
> >
> >
> > Kim
> >
> >
> >
> >
> >
> >
> >
> > On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
> >
> > hi kim,
> >
> >
> >
> > 'For other edits, please fork the repository and create a pull request
> > with your changes'
> >
> >
> >
> > of the four general comments, the first is really the only 'edit'.  i
> > didn't put it in the minor edits because it had some implications that
> > the group might not agree with.  if the change makes sense, it might be
> > easier for you to make the edit.
> >
> >
> >
> > the other three are general comments and i'm not sure what the solution
> > might be.  they were mainly points that, as a reader, i found unclear
> > or a bit confusing.  these were all from the use case section so were
> > probably written by the groups themselves?  if i have permission, i can
> > certainly add them as issues.
> >
> >
> >
> > cheers,
> >
> > michael
> >
> >
> >
> > Michael Miller
> >
> > Software Engineer
> >
> > Institute for Systems Biology
> >
> >
> >
> >
> >
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Tuesday, July 29, 2014 11:56 AM
> > To: Michael Miller
> > Cc: w3c semweb hcls
> > Subject: Re: hcls dataset description comments
> >
> >
> >
> > Hi!
> >
> >
> >
> >   Thanks for the suggestions. I have incorporated your minor edits.
> > Unbelievable how those still slipped through after so many re-readings.
> >
> >
> >
> >   For other edits, please fork the repository and create a pull request
> > with your changes.
> >
> >
> >
> > Best wishes,
> >
> >
> >
> > Kim
> >
> >
> >
> >
> >
> > On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
> >
> > hi kim,
> >
> >
> >
> > thanks for the pointer, i've updated my comments below based on this
> > newer draft.  many fewer, and i especially like the complete example in
> > 10.1!
> >
> >
> >
> > cheers,
> >
> > michael
> >
> >
> >
> > Michael Miller
> >
> > Software Engineer
> >
> > Institute for Systems Biology
> >
> >
> >
> > general comments:
> >
> > ·         s4.4 'Dataset Linking': might mention also that datasets are
> > derived from other datasets?
> > 'A dataset may incorporate, or link to, data in other datasets, e.g. in
> > the creation of a data mashup ' --> 'A dataset may incorporate, be
> > derived from, or link to, data in other datasets, e.g. in the analysis
> > of original datasets or in the creation of a data mashup '
> >
> > ·         s8: odd that some of the top sections (8.1-8.3, 8.5-8.7) are
> > individual organizations but three (8.4, 8.8, 8.9) have subsections for
> > different organizations.  maybe organize so all top-level sections
> > define a type of organization with subsections beneath, or make all
> > top-level?
> >
> > ·         s8: some of the use cases could be more focused on how this
> > note will help them (8.5-8.7)
> >
> > ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
> > to me how this note is relevant to them
> >
> > our use case questions:
> >
> > ·         how to reference 3rd party datasets that aren't described by
> > this standard, i.e. TCGA data from the DCC?  simply use
> > 'pav:retrievedFrom' with the IRI being the URL into the repository?
> >
> > ·         we have a lot of intermediary files that we won't publish.  the
> > software specified in creating our published datasets from their
> > sources forms a (branching) workflow with the input being from the
> > previous step(s) in the workflow.  how best to represent this?  this
> > note doesn't seem to cover how the dataset is created so any
> > recommendations?
> >
> > minor edits:
> >
> > ·         there are two s6.2.3 sections
> >
> > ·         s8.8.1: '... what period it is updated. To know when to...'
> > should be '...what period it is updated to know when to...'?
> >
> >
> >
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Tuesday, July 22, 2014 3:43 PM
> > To: Michael Miller
> > Cc: w3c semweb hcls
> > Subject: Re: hcls dataset description comments
> >
> >
> >
> > Hello,
> >
> >
> >
> >   I believe you were looking at an old document. There is currently only
> > one Figure in the note.
> >
> >
> >
> >   Please check the actual draft at:
> >
> > http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
> >
> >
> >
> > Best wishes,
> >
> >
> >
> > Kim
> >
> >
> >
> >
> >
> > On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org> wrote:
> >
> > hi all,
> >
> >
> >
> > tremendous work, very clear and well-written.  my group at ISB, the
> > Shmulevich lab, is looking to provide provenance for the analysis
> > datasets we are producing for TCGA.  we're not sure if we'll be able to
> > 'go all the way' but we want to make sure we have at hand all the
> > information so that we could, at least in theory, be compliant.  since
> > i was reading the document anyway, below are some notes.
> >
> >
> >
> > general comments:
> >
> > ·         s4.4 'Dataset Linking': might mention also that datasets are
> > derived from other datasets?
> > 'A dataset may incorporate, or link to, data in other datasets, e.g. in
> > the creation of a data mashup ' --> 'A dataset may incorporate, be
> > derived from, or link to, data in other datasets, e.g. in the analysis
> > of original datasets or in the creation of a data mashup '
> >
> > ·         the chembl example in s5 is not compliant with the property
> > table below.  it probably is only supposed to show the relationship of
> > the three terms but that could be clarified
> >
> > ·         s6.2.12 could use the example filled in
> >
> > ·         6.3.2: not sure what an 'X level description' is
> >
> > ·         s8: odd that some of the top sections (8.1-8.3, 8.5-8.7) are
> > individual organizations but three (8.4, 8.8, 8.9) have subsections for
> > different organizations.  maybe organize so all top-level sections
> > define a type of organization with subsections beneath, or make all
> > top-level?
> >
> > ·         s8: many of the use cases could be more focused on how this
> > note will help them
> >
> > ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
> > to me how this note is relevant to them
> >
> > ·         would be nice to have a 'complete' example put together, maybe
> > based on chembl?
> >
> >
> >
> > our use case questions:
> >
> > ·         how to reference 3rd party datasets that aren't described by
> > this standard, i.e. TCGA data from the DCC?  simply use
> > 'pav:retrievedFrom' with the IRI being the URL into the repository?
> >
> > ·         we have a lot of intermediary files that we won't publish.  the
> > software specified in creating our published datasets from their
> > sources forms a (branching) workflow with the input being from the
> > previous step(s) in the workflow.  how best to represent this?  this
> > note doesn't seem to cover how the dataset is created so any
> > recommendations?
> >
> >
> >
> > text issues:
> >
> > ·         Figure 1: 'Overview of dataset description level metadata
> > profiles and their relationships': reference not resolved, image
> > doesn't show
> >
> > ·         Figure 2: 'Improve diagram. Multiple appearance of
> > concepts/description levels unclear.': reference not resolved, image
> > doesn't show.  add actual label
> >
> >
> >
> > minor edits:
> >
> > ·         bottom of s.3: 'placeholde' should be 'placeholder'
> >
> > ·         use straight quotes rather than slant quotes in s6.2.2 example
> > (and elsewhere)?
> >
> > ·         the text runs out of the box in s6.2.3 'Description'
> >
> > ·         s6.2.3: 'Dates of Creation and Issuance': 'state the date the
> > dataset was generated using dct:created and/or the date the dataset was
> > made public using dct:created' should be 'state the date the dataset
> > was generated using dct:created and/or the date the dataset was made
> > public using dct:issued'?
> >
> > ·         there are two s6.2.3 sections
> >
> > ·         s6.2.4: 'Creation: ... The date of authorship' should be
> > '...The date of creation' and 'Curation:... The date of authorship'
> > should be '...The date of curation'?
> >
> > ·         s8.5: the author list has end parenthesis without beginning
> > parenthesis
> >
> > ·         s8.8.1: '... what period it is updated. To know when to...'
> > should be '...what period it is updated to know when to...'
> >
> >
> >
> > cheers,
> >
> > michael
> >
> >
> >
> > Michael Miller
> >
> > Software Engineer
> >
> > Institute for Systems Biology
> >
>
>
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
Received on Tuesday, 5 August 2014 15:14:24 UTC