RE: hcls dataset description comments--Dataset Descriptions vs. PROV from Michael Miller on 2014-08-20 (public-semweb-lifesci@w3.org from August 2014)

From: Michael Miller <Michael.Miller@systemsbiology.org>
Date: Wed, 20 Aug 2014 09:30:26 -0700
To: Joachim Baran <joachim.baran@gmail.com>, w3c semweb hcls <public-semweb-lifesci@w3.org>
Cc: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Message-ID: <0b7e3655f1e37658a8aa7013ce3c9dc0@mail.gmail.com>
hi all,



finally finished my initial attempt at adding a section on datasets in
workflows.  had some brief discussions with melissa and she and carole may
have some additions.  i tried to keep it short and sweet; really getting
into it could be an entire note on its own.



kim, you should see my pull request.  hope to make the call next week.



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology





*From:* Joachim Baran [mailto:joachim.baran@gmail.com]
*Sent:* Monday, August 11, 2014 8:00 AM
*To:* Michael Miller
*Cc:* Stian Soiland-Reyes; w3c semweb hcls
*Subject:* Re: hcls dataset description comments--Dataset Descriptions vs.
PROV



Great! Before you send the pull request, please make sure that W3C's HTML
validation passes: http://validator.w3.org/#validate_by_input



Kim



On 9 August 2014 10:15, Michael Miller <Michael.Miller@systemsbiology.org>
wrote:

hi kim,

i've made decent progress and expect to have something mid-week, if all goes
well (as a pull request, tho no guarantee on the formatting!)


cheers,
michael

Michael Miller
Software Engineer
Institute for Systems Biology

> -----Original Message-----

> From: Joachim Baran [mailto:joachim.baran@gmail.com]

> Sent: Friday, August 08, 2014 6:46 PM
> To: Michael Miller

> Cc: Stian Soiland-Reyes; w3c semweb hcls
> Subject: Re: hcls dataset description comments--Dataset Descriptions vs.
> PROV
>
> Hello,
>
>   Has there been an update to this? Preferably a pull request?
>
> Thanks,
>
> Kim
>
>
>
> > On Aug 5, 2014, at 8:13 AM, Michael Miller
> > <Michael.Miller@systemsbiology.org> wrote:
> >
> > hi stian,
> >
> > thanks much, very useful!
> >
> > cheers,
> > michael
> >
> > Michael Miller
> > Software Engineer
> > Institute for Systems Biology
> >
> >
> >> -----Original Message-----
> >> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of
> >> Stian Soiland-Reyes
> >> Sent: Tuesday, August 05, 2014 5:38 AM
> >> To: Michael Miller
> >> Cc: Joachim Baran; w3c semweb hcls
> >> Subject: Re: hcls dataset description comments--Dataset Descriptions
> >> vs. PROV
> >>
> >> Just some inputs:
> >>
> >>
> >> PROV defines prov:wasDerivedFrom, which in a broad sense describes such
> >> a relationship between datasets. However, you do not know anything more
> >> about what kind of derivation we are talking about.
> >>
> >>
> >> In PAV we found the need to specialize three types of derivation:
> >>
> >> pav:retrievedFrom -
> >> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
> >> .. a byte-for-byte download
> >>
> >> pav:importedFrom -
> >> http://purl.org/pav/html#http://purl.org/pav/importedFrom
> >> .. a somewhat equivalent form of the source, but after some kind of
> >> transformation or selection (e.g. CSV -> XML)
> >>
> >> pav:derivedFrom -
> >> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
> >> .. when the new resource has been further refined or modified
> >> (somewhat adding additional knowledge)
> >>
> >>
> >> If you are simply concatenating several datasets, then multiple
> >> pav:importedFrom statements would make sense. If further knowledge is
> >> added, say by reasoning or calculation, then pav:derivedFrom would
> >> make sense.
> >>
> >>
> >> Now if you want to detail exactly how those datasets have been
> >> combined, I think you are right that it would make sense to break down
> >> the derivation using PROV statements, e.g. a series of activities,
> >> generation and usage. How to describe these activities (e.g.
> >> subclasses and properties) will be specific to each case.
> >>
> >>
> >>
> >> If the process you generated the dataset with somewhat resembles a
> >> dataflow, you might be interested in the wfprov and wfdesc ontologies
> >> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
> >> which can be related to a common workflow description (e.g. a
> >> prov:Plan):
> >>
> >> http://purl.org/wf4ever/model#wfprov
> >>
> >> OPMW is a similar approach:
> >> http://www.opmw.org/model/OPMW/
> >>
> >>
> >>
> >> On 4 August 2014 17:44, Michael Miller
> >> <Michael.Miller@systemsbiology.org> wrote:
> >>> hi all,
> >>>
> >>>
> >>>
> >>> as you are all undoubtedly aware, a major, if not the major, TCGA dataset
> >>> use case revolves around taking the 3rd level data from the TCGA dcc
> >>> repository and doing analysis, producing 4th level data such as clusters,
> >>> pca, etc.  one of the things we do here at ISB is produce an intermediate
> >>> data step that combines the different platforms (mRNA, miRNA, RPPA, METH,
> >>> etc.) into one feature matrix so that the analysis can use all the
> >>> platforms together.  the Broad firehose pipeline also has this as one of
> >>> its outputs.
> >>>
> >>>
> >>>
> >>> as some of my comments allude to, it doesn't seem that Dataset
> >>> Descriptions deal with the use case of describing a dataset that is
> >>> specifically derived from other datasets, which is what we need as we
> >>> look at ways we might describe our data when we publish it.  i took a
> >>> look at PROV and, though i've got a bit more mapping to do, it seems like
> >>> PROV provides the terms we need.
> >>>
> >>>
> >>>
> >>> but this has led me to ask: what is the relation between Dataset
> >>> Descriptions and PROV, and how (or whether) should they be used
> >>> together?  i think the above use case is quite common for datasets being
> >>> published, so it might deserve a discussion in the Dataset Descriptions
> >>> note.
> >>>
> >>>
> >>>
> >>> cheers,
> >>>
> >>> michael
> >>>
> >>>
> >>>
> >>> Michael Miller
> >>>
> >>> Software Engineer
> >>>
> >>> Institute for Systems Biology
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> >>> Sent: Thursday, July 31, 2014 3:43 PM
> >>> To: Michael Miller
> >>> Cc: w3c semweb hcls
> >>> Subject: Re: hcls dataset description comments
> >>>
> >>>
> >>>
> >>> Hi!
> >>>
> >>>
> >>>
> >>>  I will ponder the edit suggestion in your first bullet point. I am
> >>> not sure at the moment if it would have wider implications.
> >>>
> >>>
> >>>
> >>>  You are right that the use cases were written by the groups
> >>> themselves. I do not know how to improve the use cases without rewriting
> >>> them, which might not be agreeable to all parties involved. C'est la vie.
> >>>
> >>>
> >>>
> >>>  The role of Data Catalogs should then be discussed during our next
> >>> conf call. Thanks for highlighting that this might be unclear to readers.
> >>>
> >>>
> >>>
> >>> Kim
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 30 July 2014 10:41, Michael Miller
> >>> <Michael.Miller@systemsbiology.org>
> >>> wrote:
> >>>
> >>> hi kim,
> >>>
> >>>
> >>>
> >>> 'For other edits, please fork the repository and create a pull request
> >>> with
> >>> your changes'
> >>>
> >>>
> >>>
> >>> of the four general comments, the first is really the only 'edit'.  i
> >>> didn't put it in the minor edits because it had some implications that
> >>> the group might not agree with.  if the change makes sense, it might be
> >>> easier for you to make the edit.
> >>>
> >>>
> >>>
> >>> the other three are general comments and i'm not sure what the solution
> >>> might be.  they were mainly points that, as a reader, weren't clear or
> >>> were a bit confusing.  these were all from the use case section so were
> >>> probably written by the groups themselves?  if i have permission, i can
> >>> certainly add them as issues.
> >>>
> >>>
> >>>
> >>> cheers,
> >>>
> >>> michael
> >>>
> >>>
> >>>
> >>> Michael Miller
> >>>
> >>> Software Engineer
> >>>
> >>> Institute for Systems Biology
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> >>> Sent: Tuesday, July 29, 2014 11:56 AM
> >>>
> >>>
> >>> To: Michael Miller
> >>> Cc: w3c semweb hcls
> >>> Subject: Re: hcls dataset description comments
> >>>
> >>>
> >>>
> >>> Hi!
> >>>
> >>>
> >>>
> >>>  Thanks for the suggestions. I have incorporated your minor edits.
> >>> Unbelievable how those slipped through after so many re-readings
> >>> still.
> >>>
> >>>
> >>>
> >>>  For other edits, please fork the repository and create a pull request
> >>> with
> >>> your changes.
> >>>
> >>>
> >>>
> >>> Best wishes,
> >>>
> >>>
> >>>
> >>> Kim
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 23 July 2014 08:53, Michael Miller
> >>> <Michael.Miller@systemsbiology.org>
> >>> wrote:
> >>>
> >>> hi kim,
> >>>
> >>>
> >>>
> >>> thanks for the pointer, i've updated my comments below based on this
> >>> newer draft.  many fewer, and i especially like the complete example in
> >>> 10.1!
> >>>
> >>>
> >>>
> >>> cheers,
> >>>
> >>> michael
> >>>
> >>>
> >>>
> >>> Michael Miller
> >>>
> >>> Software Engineer
> >>>
> >>> Institute for Systems Biology
> >>>
> >>>
> >>>
> >>> general comments:
> >>>
> >>> ·         s4.4 'Dataset Linking': might mention also that datasets are
> >>> derived from other datasets?
> >>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
> >>> the creation of a data mashup ' --> 'A dataset may incorporate, be
> >>> derived from, or link to, data in other datasets, e.g. in the analysis
> >>> of original datasets or in the creation of a data mashup '
> >>>
> >>> ·         s8: odd that some of the top sections (8.1-8.3, 8.5-8.7) are
> >>> individual organizations but three (8.4, 8.8, 8.9) have subsections for
> >>> different organizations.  maybe organize so all top level sections
> >>> define a type of organization with subsections beneath or make all
> >>> top-level?
> >>>
> >>> ·         s8: some of the use cases could be more focused on how this
> >>> note will help them (8.5-8.7)
> >>>
> >>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
> >>> to me how this note is relevant to them
> >>>
> >>> our use case questions:
> >>>
> >>> ·         how to reference 3rd party datasets that aren't described by
> >>> this standard, i.e. TCGA data from the DCC?  simply use
> >>> 'pav:retrievedFrom' with the IRI being the URL into the repository?
> >>>
> >>> ·         we have a lot of intermediary files that we won't publish.  the
> >>> software specified in creating our published datasets from their sources
> >>> forms a (branching) workflow with the input being from the previous
> >>> step(s) in the workflow.  how best to represent this?  this note doesn't
> >>> seem to cover how the dataset is created so any recommendations?
> >>>
> >>> minor edits:
> >>>
> >>> ·         there are two s6.2.3 sections
> >>>
> >>> ·         s8.8.1: '... what period it is updated. To know when to...'
> >>> should be '...what period it is updated to know when to...'?
> >>>
> >>>
> >>>
> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> >>> Sent: Tuesday, July 22, 2014 3:43 PM
> >>> To: Michael Miller
> >>> Cc: w3c semweb hcls
> >>> Subject: Re: hcls dataset description comments
> >>>
> >>>
> >>>
> >>> Hello,
> >>>
> >>>
> >>>
> >>>  I believe you were looking at an old document. There is currently only
> >>> one Figure in the note.
> >>>
> >>>
> >>>
> >>>  Please check the current draft at:
> >>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
> >>>
> >>>
> >>>
> >>> Best wishes,
> >>>
> >>>
> >>>
> >>> Kim
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 22 July 2014 15:36, Michael Miller
> >>> <Michael.Miller@systemsbiology.org>
> >>> wrote:
> >>>
> >>> hi all,
> >>>
> >>>
> >>>
> >>> tremendous work, very clear and well-written.  my group at ISB, the
> >>> Shmulevich lab, is looking to provide provenance for the analysis
> >>> datasets we are producing for TCGA.  we're not sure if we'll be able to
> >>> 'go all the way' but we want to make sure we have at hand all the
> >>> information so that we could, at least in theory, be compliant.  since i
> >>> was reading the document anyway, below are some notes.
> >>>
> >>>
> >>>
> >>> general comments:
> >>>
> >>> ·         s4.4 'Dataset Linking': might mention also that datasets are
> >>> derived from other datasets?
> >>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
> >>> the creation of a data mashup ' --> 'A dataset may incorporate, be
> >>> derived from, or link to, data in other datasets, e.g. in the analysis
> >>> of original datasets or in the creation of a data mashup '
> >>>
> >>> ·         the chembl example in s5 is not compliant with the property
> >>> table below.  it is probably only supposed to show the relationship of
> >>> the three terms but that could be clarified
> >>>
> >>> ·         s6.2.12 could use the example filled in
> >>>
> >>> ·         s6.3.2: not sure what an 'X level description' is
> >>>
> >>> ·         s8: odd that some of the top sections (8.1-8.3, 8.5-8.7) are
> >>> individual organizations but three (8.4, 8.8, 8.9) have subsections for
> >>> different organizations.  maybe organize so all top level sections
> >>> define a type of organization with subsections beneath or make all
> >>> top-level?
> >>>
> >>> ·         s8: many of the use cases could be more focused on how this
> >>> note will help them
> >>>
> >>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
> >>> to me how this note is relevant to them
> >>>
> >>> ·         would be nice to have a 'complete' example put together, maybe
> >>> based on chembl?
> >>>
> >>>
> >>>
> >>> our use case questions:
> >>>
> >>> ·         how to reference 3rd party datasets that aren't described by
> >>> this standard, i.e. TCGA data from the DCC?  simply use
> >>> 'pav:retrievedFrom' with the IRI being the URL into the repository?
> >>>
> >>> ·         we have a lot of intermediary files that we won't publish.  the
> >>> software specified in creating our published datasets from their sources
> >>> forms a (branching) workflow with the input being from the previous
> >>> step(s) in the workflow.  how best to represent this?  this note doesn't
> >>> seem to cover how the dataset is created so any recommendations?
> >>>
> >>>
> >>>
> >>> text issues:
> >>>
> >>> ·         Figure 1: 'Overview of dataset description level metadata
> >>> profiles and their relationships': reference not resolved, image
> >>> doesn't show
> >>>
> >>> ·         Figure 2: 'Improve diagram. Multiple appearance of
> >>> concepts/description levels unclear.': reference not resolved, image
> >>> doesn't show.  add actual label
> >>>
> >>>
> >>>
> >>> minor edits:
> >>>
> >>> ·         bottom of s.3: 'placeholde' should be 'placeholder'
> >>>
> >>> ·         use straight quotes rather than slant quotes in s6.2.2 example
> >>> (and elsewhere)?
> >>>
> >>> ·         the text runs out of the box in s6.2.3 'Description'
> >>>
> >>> ·         s6.2.3: 'Dates of Creation and Issuance': 'state the date the
> >>> dataset was generated using dct:created and/or the date the dataset was
> >>> made public using dct:created' should be 'state the date the dataset
> >>> was generated using dct:created and/or the date the dataset was made
> >>> public using dct:issued'?
> >>>
> >>> ·         there are two s6.2.3 sections
> >>>
> >>> ·         s6.2.4: 'Creation: ... The date of authorship' should be
> >>> '...The date of creation' and 'Curation:... The date of authorship'
> >>> should be '...The date of curation'?
> >>>
> >>> ·         s8.5: the author list has end parenthesis without beginning
> >>> parenthesis
> >>>
> >>> ·         s8.8.1: '... what period it is updated. To know when to...'
> >>> should be '...what period it is updated to know when to...'
> >>>
> >>>
> >>>
> >>> cheers,
> >>>
> >>> michael
> >>>
> >>>
> >>>
> >>> Michael Miller
> >>>
> >>> Software Engineer
> >>>
> >>> Institute for Systems Biology
> >>
> >>
> >>
> >> --
> >> Stian Soiland-Reyes, myGrid team
> >> School of Computer Science
> >> The University of Manchester
> >> http://soiland-reyes.com/stian/work/
> >> http://orcid.org/0000-0001-9842-9718
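
A minimal sketch, in Python, of the distinction Stian draws earlier in the thread between the three PAV derivation properties (pav:retrievedFrom, pav:importedFrom, pav:derivedFrom). The PAV namespace and property names are taken from the links in his message; the Derivation enum, the pav_statements helper and the example.org IRIs are hypothetical and only there for illustration. It assumes a dataset description can be reduced to plain N-Triples lines and is not meant as the way the note or PAV prescribes doing this.

```python
from enum import Enum

# PAV namespace, as it appears in the property links quoted in the thread.
PAV = "http://purl.org/pav/"


class Derivation(Enum):
    """Kinds of derivation described in the thread (enum names are illustrative)."""
    BYTE_COPY = "retrievedFrom"    # byte-for-byte download of the source
    TRANSFORMED = "importedFrom"   # equivalent content after transformation or selection
    REFINED = "derivedFrom"        # further refined or modified, knowledge added


def pav_statements(dataset_iri, sources):
    """Return one N-Triples line per (source_iri, Derivation) pair."""
    return [
        f"<{dataset_iri}> <{PAV}{kind.value}> <{source_iri}> ."
        for source_iri, kind in sources
    ]


# Hypothetical IRIs: a merged feature matrix concatenated from two platform files,
# so each source is linked with its own pav:importedFrom statement.
triples = pav_statements(
    "http://example.org/dataset/feature-matrix",
    [
        ("http://example.org/source/mrna", Derivation.TRANSFORMED),
        ("http://example.org/source/mirna", Derivation.TRANSFORMED),
    ],
)
print("\n".join(triples))
```

If the matrix also carried computed results (say, from reasoning or calculation), the same helper would be called with Derivation.REFINED for those sources. Describing the individual workflow steps would still need PROV activities, which this sketch does not attempt.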
Received on Wednesday, 20 August 2014 16:30:56 UTC