Re: hcls dataset description comments--Dataset Descriptions vs. PROV from Michel Dumontier on 2014-08-20 (public-semweb-lifesci@w3.org from August 2014)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Wed, 20 Aug 2014 09:58:47 -0700
To: Michael Miller <Michael.Miller@systemsbiology.org>
Cc: Joachim Baran <joachim.baran@gmail.com>, w3c semweb hcls <public-semweb-lifesci@w3.org>, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Message-ID: <CALcEXf53=MudLTDUkrkuF7MHRifedE9mSE9Agz8qAFkzpCyV=g@mail.gmail.com>
excellent. thank you Michael.

m.

On Wed, Aug 20, 2014 at 9:30 AM, Michael Miller
<Michael.Miller@systemsbiology.org> wrote:
> hi all,
>
>
>
> finally finished my initial attempt at adding a section on datasets in
> workflows.  had some brief discussions with melissa and she and carole may
> have some additions.  i tried to keep it short and sweet, to really get into
> it could be an entire note on its own.
>
>
>
> kim, you should see my pull request.  hope to make the call next week.
>
>
>
> cheers,
>
> michael
>
>
>
> Michael Miller
>
> Software Engineer
>
> Institute for Systems Biology
>
>
>
>
>
> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> Sent: Monday, August 11, 2014 8:00 AM
> To: Michael Miller
>
>
> Cc: Stian Soiland-Reyes; w3c semweb hcls
> Subject: Re: hcls dataset description comments--Dataset Descriptions vs.
> PROV
>
>
>
> Great! Before you send the pull request, please make sure that W3C's HTML
> validation passes: http://validator.w3.org/#validate_by_input
>
>
>
> Kim
>
>
>
> On 9 August 2014 10:15, Michael Miller <Michael.Miller@systemsbiology.org>
> wrote:
>
> hi kim,
>
> i've made decent progress and expect to have something mid-week, if all goes
> well (as a pull request, tho no guarantee on the formatting!)
>
>
> cheers,
> michael
>
> Michael Miller
> Software Engineer
> Institute for Systems Biology
>
>> -----Original Message-----
>
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>
>> Sent: Friday, August 08, 2014 6:46 PM
>> To: Michael Miller
>
>> Cc: Stian Soiland-Reyes; w3c semweb hcls
>> Subject: Re: hcls dataset description comments--Dataset Descriptions vs.
>> PROV
>>
>> Hello,
>>
>>   Has there been an update to this? Preferably a pull request?
>>
>> Thanks,
>>
>> Kim
>>
>>
>>
>> > On Aug 5, 2014, at 8:13 AM, Michael Miller
>> <Michael.Miller@systemsbiology.org> wrote:
>> >
>> > hi stian,
>> >
>> > thanks much, very useful!
>> >
>> > cheers,
>> > michael
>> >
>> > Michael Miller
>> > Software Engineer
>> > Institute for Systems Biology
>> >
>> >
>> >> -----Original Message-----
>> >> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of
>> >> Stian Soiland-Reyes
>> >> Sent: Tuesday, August 05, 2014 5:38 AM
>> >> To: Michael Miller
>> >> Cc: Joachim Baran; w3c semweb hcls
>> >> Subject: Re: hcls dataset description comments--Dataset Descriptions
>> >> vs. PROV
>> >>
>> >> Just some inputs:
>> >>
>> >>
>> >> PROV defines prov:wasDerivedFrom, which in a broad sense describes such
>> >> a relationship between datasets. However, it tells you nothing more
>> >> about what kind of derivation is involved.
>> >>
>> >>
>> >> In PAV we found the need to specialize three types of derivation:
>> >>
>> >> pav:retrievedFrom -
>> >> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
>> >> .. a byte-for-byte download
>> >>
>> >> pav:importedFrom -
>> >> http://purl.org/pav/html#http://purl.org/pav/importedFrom
>> >> .. a somewhat equivalent form of the source, but after some kind of
>> >> transformation or selection (e.g. CSV -> XML)
>> >>
>> >> pav:derivedFrom -
>> >> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
>> >> .. when the new resource has been further refined or modified
>> >> (somewhat adding additional knowledge)
>> >>
>> >>
>> >> If you are simply concatenating several datasets, then multiple
>> >> pav:importedFrom statements would make sense. If further knowledge is
>> >> added, say by reasoning or calculation, then pav:derivedFrom would
>> >> make sense.
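
A minimal sketch of the choice described above, assuming a simple three-way classification of how a dataset was produced. Only the three pav: property names come from the message; the `example.org` IRIs, the `kind` labels, and the helper `derivation_triples` are hypothetical and used purely for illustration.

```python
# Namespace of the PAV properties linked in the message above.
PAV = "http://purl.org/pav/"

# Hypothetical labels for the three kinds of derivation, mapped to PAV properties.
PAV_PROPERTY = {
    "download": PAV + "retrievedFrom",   # byte-for-byte copy of the source
    "transform": PAV + "importedFrom",   # equivalent content in a new form (e.g. CSV -> XML)
    "refine": PAV + "derivedFrom",       # knowledge added by reasoning or calculation
}


def derivation_triples(dataset, sources, kind):
    """Return one N-Triples line per source, linking `dataset` to it."""
    prop = PAV_PROPERTY[kind]
    return [f"<{dataset}> <{prop}> <{src}> ." for src in sources]


if __name__ == "__main__":
    # Hypothetical IRIs: a combined matrix concatenated from two platform datasets.
    matrix = "http://example.org/dataset/feature-matrix"
    platforms = [
        "http://example.org/dataset/mrna",
        "http://example.org/dataset/mirna",
    ]
    for line in derivation_triples(matrix, platforms, "transform"):
        print(line)
```

Concatenating several sources yields one pav:importedFrom statement per source; switching `kind` to `"refine"` would emit pav:derivedFrom instead.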
>> >>
>> >>
>> >> Now if you want to detail exactly how those datasets have been
>> >> combined, I think you are right that it would make sense to break down
>> >> the derivation using PROV statements, e.g. a series of activities,
>> >> generation and usage. How to describe these activities (e.g.
>> >> subclasses and properties) will be specific to each case.
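
One way to picture the breakdown into activities, generation and usage is sketched here, assuming each processing step is modelled as a single prov:Activity. The PROV terms (prov:Activity, prov:used, prov:wasGeneratedBy, prov:wasDerivedFrom) are standard PROV-O; the `example.org` IRIs, the step names, and the helpers `step_triples` and `to_ntriples` are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def step_triples(activity, inputs, outputs):
    """Describe one processing step as (subject, predicate, object) IRI triples."""
    triples = [(activity, RDF_TYPE, PROV + "Activity")]
    for src in inputs:
        triples.append((activity, PROV + "used", src))            # usage
    for out in outputs:
        triples.append((out, PROV + "wasGeneratedBy", activity))  # generation
        for src in inputs:
            # Coarse dataset-to-dataset link, readable without the activity.
            triples.append((out, PROV + "wasDerivedFrom", src))
    return triples


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples text."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    base = "http://example.org/"  # hypothetical namespace
    # Step 1: combine per-platform data into one feature matrix.
    merge = step_triples(
        base + "run/merge",
        [base + "dataset/mrna", base + "dataset/mirna"],
        [base + "dataset/feature-matrix"],
    )
    # Step 2: analyse the matrix to produce clusters.
    cluster = step_triples(
        base + "run/cluster",
        [base + "dataset/feature-matrix"],
        [base + "dataset/clusters"],
    )
    print(to_ntriples(merge + cluster))
```

Intermediate files that are never published can still appear as inputs and outputs of steps, so the chain from source to published dataset stays traceable.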
>> >>
>> >>
>> >>
>> >> If the process you generated the dataset with somewhat resembles a
>> >> dataflow, you might be interested in the wfprov and wfdesc ontologies
>> >> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
>> >> which can be related to a common workflow description (e.g. a
>> >> prov:Plan):
>> >>
>> >> http://purl.org/wf4ever/model#wfprov
>> >>
>> >> OPMW is a similar approach:
>> >> http://www.opmw.org/model/OPMW/
>> >>
>> >>
>> >>
>> >> On 4 August 2014 17:44, Michael Miller
>> >> <Michael.Miller@systemsbiology.org> wrote:
>> >>> hi all,
>> >>>
>> >>>
>> >>>
>> >>> as you are all undoubtedly aware, a major, if not the major, TCGA
>> >>> dataset use case revolves around taking the 3rd level data from the
>> >>> TCGA dcc repository and doing analysis, producing 4th level data such
>> >>> as clusters, pca, etc.  one of the things we do here at ISB is produce
>> >>> an intermediate data step that combines the different platforms (mRNA,
>> >>> miRNA, RPPA, METH, etc.) into one feature matrix so that the analysis
>> >>> can use all the platforms together.  the Broad firehose pipeline also
>> >>> has this as one of its outputs.
>> >>>
>> >>>
>> >>>
>> >>> as some of my comments allude to, it doesn't seem that Dataset
>> >>> Descriptions deal with the use case of describing a dataset that is
>> >>> specifically derived from other datasets, which is how we are looking
>> >>> to describe our data when we publish it.  i took a look at PROV and,
>> >>> though i've got a bit more mapping to do, it seems like PROV provides
>> >>> the terms we need.
>> >>>
>> >>>
>> >>>
>> >>> but this has led me to ask what the relation of Dataset Descriptions
>> >>> and PROV is, and how (or whether) they should be used together.  i
>> >>> think the above use case is quite common for datasets being published,
>> >>> so it might deserve a discussion in the Dataset Descriptions note.
>> >>>
>> >>>
>> >>>
>> >>> cheers,
>> >>>
>> >>> michael
>> >>>
>> >>>
>> >>>
>> >>> Michael Miller
>> >>>
>> >>> Software Engineer
>> >>>
>> >>> Institute for Systems Biology
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> >>> Sent: Thursday, July 31, 2014 3:43 PM
>> >>> To: Michael Miller
>> >>> Cc: w3c semweb hcls
>> >>> Subject: Re: hcls dataset description comments
>> >>>
>> >>>
>> >>>
>> >>> Hi!
>> >>>
>> >>>
>> >>>
>> >>>  I will ponder about your edit suggestion of your first bullet point.
>> >>> I
>> >>> am
>> >>> not sure at the moment if it would have wider implications.
>> >>>
>> >>>
>> >>>
>> >>>  You are right that the use cases were written by the groups
>> >>> themselves. I
>> >>> do not know how to improve the use cases without rewriting them,
>> which
>> >> might
>> >>> not be agreeable to all parties involved. C'est la vie.
>> >>>
>> >>>
>> >>>
>> >>>  The role of Data Catalogs should then be discussed during our next
>> >>> conf call. Thanks for highlighting that this might be unclear to readers.
>> >>>
>> >>>
>> >>>
>> >>> Kim
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On 30 July 2014 10:41, Michael Miller
>> >>> <Michael.Miller@systemsbiology.org>
>> >>> wrote:
>> >>>
>> >>> hi kim,
>> >>>
>> >>>
>> >>>
>> >>> 'For other edits, please fork the repository and create a pull request
>> >>> with
>> >>> your changes'
>> >>>
>> >>>
>> >>>
>> >>> of the four general comments, the first is really the only 'edit', i
>> >>> didn't
>> >>> put it in the minor edits because it had some implications that the
>> >>> group
>> >>> might not agree with.  if the change makes sense, it might be easier
>> >>> for
>> >>> you
>> >>> to make the edit.
>> >>>
>> >>>
>> >>>
>> >>> the other three are general comments and i'm not sure what the
>> solution
>> >>> might be, they were mainly points, as a reader, that weren't clear or
>> >>> were a
>> >>> bit confusing.  these were all from the use case section so were
>> >>> probably
>> >>> written by the groups themselves?  if i have permission, i can
>> >>> certainly
>> >>> add
>> >>> them as issues.
>> >>>
>> >>>
>> >>>
>> >>> cheers,
>> >>>
>> >>> michael
>> >>>
>> >>>
>> >>>
>> >>> Michael Miller
>> >>>
>> >>> Software Engineer
>> >>>
>> >>> Institute for Systems Biology
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> >>> Sent: Tuesday, July 29, 2014 11:56 AM
>> >>>
>> >>>
>> >>> To: Michael Miller
>> >>> Cc: w3c semweb hcls
>> >>> Subject: Re: hcls dataset description comments
>> >>>
>> >>>
>> >>>
>> >>> Hi!
>> >>>
>> >>>
>> >>>
>> >>>  Thanks for the suggestions. I have incorporated your minor edits.
>> >>> Unbelievable how those slipped through after so many re-readings
>> >>> still.
>> >>>
>> >>>
>> >>>
>> >>>  For other edits, please fork the repository and create a pull request
>> >>> with
>> >>> your changes.
>> >>>
>> >>>
>> >>>
>> >>> Best wishes,
>> >>>
>> >>>
>> >>>
>> >>> Kim
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On 23 July 2014 08:53, Michael Miller
>> >>> <Michael.Miller@systemsbiology.org>
>> >>> wrote:
>> >>>
>> >>> hi kim,
>> >>>
>> >>>
>> >>>
>> >>> thanks for the pointer, i've updated my comments based on this newer
>> >> draft
>> >>> below.  many fewer and i especially like the complete example in 10.1!
>> >>>
>> >>>
>> >>>
>> >>> cheers,
>> >>>
>> >>> michael
>> >>>
>> >>>
>> >>>
>> >>> Michael Miller
>> >>>
>> >>> Software Engineer
>> >>>
>> >>> Institute for Systems Biology
>> >>>
>> >>>
>> >>>
>> >>> general comments:
>> >>>
>> >>> ·         s4.4 'Dataset Linking': might mention also that datasets are
>> >>> derived from other datasets?
>> >>> 'A dataset may incorporate, or link to, data in other datasets, e.g.
>> >>> in
>> >>> the
>> >>> creation of a data mashup ' --> 'A dataset may incorporate, be derived
>> >> from,
>> >>> or link to, data in other datasets, e.g. in the analysis of original
>> >>> datasets or in the creation of a data mashup '
>> >>>
>> >>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>> >>> individual organizations but three (8.4, 8.8, 8.9) have subsections
>> >>> for
>> >>> different organizations.  maybe organize so all top level sections
>> >>> define a
>> >>> type of organization with subsections beneath or make all top-level?
>> >>>
>> >>> ·         s8: some of the use cases could be more focused on how this
>> >>> note
>> >>> will help them (8.5-8.7)
>> >>>
>> >>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
>> >>> to me
>> >>> how this note is relevant to them
>> >>>
>> >>> our use case questions:
>> >>>
>> >>> ·         how to reference 3rd party datasets that aren't described by
>> >>> this
>> >>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>> >>> with
>> >>> the IRI being the URL into the repository?
>> >>>
>> >>> ·         we have a lot of intermediary files that we won't publish,
>> >>> the
>> >>> software specified in creating our published datasets from its sources
>> >>> form
>> >>> a (branching) workflow with the input being from the previous step(s)
>> >>> in
>> >> the
>> >>> workflow.  how best to represent this?  this note doesn't seem to
>> >>> cover
>> >> how
>> >>> the dataset is created so any recommendations?
>> >>>
>> >>> minor edits:
>> >>>
>> >>> ·         there are two s6.2.3 sections
>> >>>
>> >>> ·         s8.8.1: '... what period it is updated. To know when to...'
>> >>> should
>> >>> be '...what period it is updated to know when to...'?
>> >>>
>> >>>
>> >>>
>> >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> >>> Sent: Tuesday, July 22, 2014 3:43 PM
>> >>> To: Michael Miller
>> >>> Cc: w3c semweb hcls
>> >>> Subject: Re: hcls dataset description comments
>> >>>
>> >>>
>> >>>
>> >>> Hello,
>> >>>
>> >>>
>> >>>
>> >>>  I believe you were looking at an old document. There is currently
>> >>> only
>> >>> one
>> >>> Figure in the note.
>> >>>
>> >>>
>> >>>
>> >>>  Please check the actual draft at:
>> >>
>>
>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>> >>>
>> >>>
>> >>>
>> >>> Best wishes,
>> >>>
>> >>>
>> >>>
>> >>> Kim
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On 22 July 2014 15:36, Michael Miller
>> >>> <Michael.Miller@systemsbiology.org>
>> >>> wrote:
>> >>>
>> >>> hi all,
>> >>>
>> >>>
>> >>>
>> >>> tremendous work, very clear and well-written.  my group at ISB, the
>> >>> Shmulevich lab is looking to provide provenance for the analysis
>> >>> datasets
>> >> we
>> >>> are producing for TCGA.  we're not sure if we'll be able to 'go all
>> >>> the
>> >>> way'
>> >>> but we want to make sure we have at hand all the information that we
>> >> could,
>> >>> at least in theory, be compliant.  as long as i was reading the
>> >>> document,
>> >>> below are some notes.
>> >>>
>> >>>
>> >>>
>> >>> general comments:
>> >>>
>> >>> ·         s4.4 'Dataset Linking': might mention also that datasets are
>> >>> derived from other datasets?
>> >>> 'A dataset may incorporate, or link to, data in other datasets, e.g.
>> >>> in
>> >>> the
>> >>> creation of a data mashup ' --> 'A dataset may incorporate, be derived
>> >> from,
>> >>> or link to, data in other datasets, e.g. in the analysis of original
>> >>> datasets or in the creation of a data mashup '
>> >>>
>> >>> ·         the chembl example in s5 is not compliant to the property
>> >>> table
>> >>> below, it probably is only supposed to show the relationship of the
>> >>> three
>> >>> terms but that could be clarified
>> >>>
>> >>> ·         s6.2.12 could use the example filled in
>> >>>
>> >>> ·         6.3.2: not sure what an 'X level description' is
>> >>>
>> >>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>> >>> individual organizations but three (8.4, 8.8, 8.9) have subsections
>> >>> for
>> >>> different organizations.  maybe organize so all top level sections
>> >>> define a
>> >>> type of organization with subsections beneath or make all top-level?
>> >>>
>> >>> ·         s8: many of the use cases could be more focused on how this
>> >>> note
>> >>> will help them
>> >>>
>> >>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear
>> >>> to me
>> >>> how this note is relevant to them
>> >>>
>> >>> ·         would be nice to have a 'complete' example put together,
>> >>> maybe based on chembl?
>> >>>
>> >>>
>> >>>
>> >>> our use case questions:
>> >>>
>> >>> ·         how to reference 3rd party datasets that aren't described by
>> >>> this
>> >>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom'
>> >>> with
>> >>> the IRI being the URL into the repository?
>> >>>
>> >>> ·         we have a lot of intermediary files that we won't publish,
>> >>> the
>> >>> software specified in creating our published datasets from its sources
>> >>> form
>> >>> a (branching) workflow with the input being from the previous step(s)
>> >>> in
>> >> the
>> >>> workflow.  how best to represent this?  this note doesn't seem to
>> >>> cover
>> >> how
>> >>> the dataset is created so any recommendations?
>> >>>
>> >>>
>> >>>
>> >>> text issues:
>> >>>
>> >>> ·         Figure 1: 'Overview of dataset description level metadata
>> >>> profiles
>> >>> and their relationships': reference not resolved, image doesn't show
>> >>>
>> >>> ·         Figure 2: 'Improve diagram. Multiple appearance of
>> >>> concepts/description levels unclear.': reference not resolved, image
>> >> doesn't
>> >>> show.  add actual label
>> >>>
>> >>>
>> >>>
>> >>> minor edits:
>> >>>
>> >>> ·         bottom of s.3: 'placeholde' should be 'placeholder'
>> >>>
>> >>> ·         use straight quotes rather than slant quotes in s6.2.2
>> >>> example
>> >>> (and elsewhere)?
>> >>>
>> >>> ·         the text runs out of the box in s6.2.3 'Description'
>> >>>
>> >>> ·         s6.2.3: 'Dates of Creation and Issuance': 'state the date
>> >>> the
>> >>> dataset was generated using dct:created and/or the date the dataset
>> was
>> >> made
>> >>> public using dct:created' should be 'state the date the dataset was
>> >>> generated using dct:created and/or the date the dataset was made
>> public
>> >>> using dct:issued'?
>> >>>
>> >>> ·         there are two s6.2.3 sections
>> >>>
>> >>> ·         s6.2.4: 'Creation: ... The date of authorship' should be
>> >>> '...The
>> >>> date of creation' and 'Curation:... The date of authorship' should be
>> >>> '...The date of curation'?
>> >>>
>> >>> ·         s8.5: the author list has end parenthesis without beginning
>> >>> parenthesis
>> >>>
>> >>> ·         s8.8.1: '... what period it is updated. To know when to...'
>> >>> should
>> >>> be '...what period it is updated to know when to...'
>> >>>
>> >>>
>> >>>
>> >>> cheers,
>> >>>
>> >>> michael
>> >>>
>> >>>
>> >>>
>> >>> Michael Miller
>> >>>
>> >>> Software Engineer
>> >>>
>> >>> Institute for Systems Biology
>> >>
>> >>
>> >>
>> >> --
>> >> Stian Soiland-Reyes, myGrid team
>> >> School of Computer Science
>> >> The University of Manchester
>> >> http://soiland-reyes.com/stian/work/
>> >> http://orcid.org/0000-0001-9842-9718
>
>
Received on Wednesday, 20 August 2014 16:59:35 UTC