Re: hcls dataset description comments--Dataset Descriptions vs. PROV

Great! Before you send the pull request, please make sure that W3C's HTML
validation passes: http://validator.w3.org/#validate_by_input

Kim


On 9 August 2014 10:15, Michael Miller <Michael.Miller@systemsbiology.org>
wrote:

> hi kim,
>
> i've made decent progress and expect to have something mid-week, if all
> goes well (as a pull request, tho no guarantee on the formatting!)
>
> cheers,
> michael
>
> Michael Miller
> Software Engineer
> Institute for Systems Biology
>
> > -----Original Message-----
> > From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > Sent: Friday, August 08, 2014 6:46 PM
> > To: Michael Miller
> > Cc: Stian Soiland-Reyes; w3c semweb hcls
> > Subject: Re: hcls dataset description comments--Dataset Descriptions vs.
> > PROV
> >
> > Hello,
> >
> >   Has there been an update to this? Preferably a pull request?
> >
> > Thanks,
> >
> > Kim
> >
> >
> >
> > > On Aug 5, 2014, at 8:13 AM, Michael Miller
> > > <Michael.Miller@systemsbiology.org> wrote:
> > >
> > > hi stian,
> > >
> > > thanks much, very useful!
> > >
> > > cheers,
> > > michael
> > >
> > > Michael Miller
> > > Software Engineer
> > > Institute for Systems Biology
> > >
> > >
> > >> -----Original Message-----
> > >> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of
> > >> Stian Soiland-Reyes
> > >> Sent: Tuesday, August 05, 2014 5:38 AM
> > >> To: Michael Miller
> > >> Cc: Joachim Baran; w3c semweb hcls
> > >> Subject: Re: hcls dataset description comments--Dataset Descriptions
> > >> vs. PROV
> > >>
> > >> Just some inputs:
> > >>
> > >>
> > >> PROV defines prov:wasDerivedFrom, which in a broad sense describes such
> > >> a relationship between datasets. However, it does not tell you anything
> > >> more about what kind of derivation we are talking about.
> > >>
> > >>
> > >> In PAV we found the need to specialize three types of derivation:
> > >>
> > >> pav:retrievedFrom -
> > >> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
> > >> .. a byte-for-byte download
> > >>
> > >> pav:importedFrom -
> > >> http://purl.org/pav/html#http://purl.org/pav/importedFrom
> > >> .. a somewhat equivalent form of the source, but after some kind of
> > >> transformation or selection (e.g. CSV -> XML)
> > >>
> > >> pav:derivedFrom -
> > >> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
> > >> .. when the new resource has been further refined or modified
> > >> (somewhat adding additional knowledge)
> > >>
> > >>
> > >> If you are simply concatenating several datasets, then multiple
> > >> pav:importedFrom statements would make sense. If further knowledge is
> > >> added, say by reasoning or calculation, then pav:derivedFrom would
> > >> make sense.
> > >>
> > >>
> > >> Now if you want to detail exactly how those datasets have been
> > >> combined, I think you are right that it would make sense to break down
> > >> the derivation using PROV statements, e.g. a series of activities,
> > >> generation and usage. How to describe these activities (e.g.
> > >> subclasses and properties) will be specific to each case.
> > >>
> > >>
> > >>
> > >> If the process you generated the dataset with somewhat resembles a
> > >> dataflow, you might be interested in the wfprov and wfdesc ontologies
> > >> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
> > >> which can be related to a common workflow description (e.g. a
> > >> prov:Plan):
> > >>
> > >> http://purl.org/wf4ever/model#wfprov
> > >>
> > >> OPMW is a similar approach:
> > >> http://www.opmw.org/model/OPMW/
> > >>
> > >>
> > >>
> > >> On 4 August 2014 17:44, Michael Miller
> > >> <Michael.Miller@systemsbiology.org> wrote:
> > >>> hi all,
> > >>>
> > >>>
> > >>>
> > >>> as you are all undoubtedly aware, a major, if not the major, TCGA
> > >>> dataset use case revolves around taking the 3rd level data from the
> > >>> TCGA dcc repository and doing analysis, producing 4th level data such
> > >>> as clusters, pca, etc.  one of the things we do here at ISB is produce
> > >>> an intermediate data step that combines the different platforms (mRNA,
> > >>> miRNA, RPPA, METH, etc.) into one feature matrix so that the analysis
> > >>> can use all the platforms together.  the Broad firehose pipeline also
> > >>> has this as one of its outputs.
> > >>>
> > >>>
> > >>>
> > >>> as some of my comments allude to, it doesn't seem that Dataset
> > >>> Descriptions deal with the use case of describing a dataset that is
> > >>> specifically derived from other datasets, which is how we are looking
> > >>> to describe our data when we publish it.  i took a look at PROV and,
> > >>> although i've got a bit more mapping to do, it seems like PROV provides
> > >>> the terms we need.
> > >>>
> > >>>
> > >>>
> > >>> but this has led me to ask what the relation between Dataset
> > >>> Descriptions and PROV is, and how, or whether, they should be used
> > >>> together.  i think the above use case is quite common for datasets
> > >>> being published, so it might deserve a discussion in the Dataset
> > >>> Descriptions note
> > >>>
> > >>>
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>>
> > >>>
> > >>> Michael Miller
> > >>>
> > >>> Software Engineer
> > >>>
> > >>> Institute for Systems Biology
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > >>> Sent: Thursday, July 31, 2014 3:43 PM
> > >>> To: Michael Miller
> > >>> Cc: w3c semweb hcls
> > >>> Subject: Re: hcls dataset description comments
> > >>>
> > >>>
> > >>>
> > >>> Hi!
> > >>>
> > >>>
> > >>>
> > >>>  I will ponder the edit suggested in your first bullet point. I am
> > >>> not sure at the moment if it would have wider implications.
> > >>>
> > >>>
> > >>>
> > >>>  You are right that the use cases were written by the groups
> > >>> themselves. I do not know how to improve the use cases without
> > >>> rewriting them, which might not be agreeable to all parties involved.
> > >>> C'est la vie.
> > >>>
> > >>>
> > >>>
> > >>>  The role of Data Catalogs should then be discussed during our next
> > >>> conf call. Thanks for highlighting that this might be unclear to
> > >>> readers.
> > >>>
> > >>>
> > >>>
> > >>> Kim
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 30 July 2014 10:41, Michael Miller
> > >>> <Michael.Miller@systemsbiology.org>
> > >>> wrote:
> > >>>
> > >>> hi kim,
> > >>>
> > >>>
> > >>>
> > >>> 'For other edits, please fork the repository and create a pull
> > >>> request with your changes'
> > >>>
> > >>>
> > >>>
> > >>> of the four general comments, the first is really the only 'edit', i
> > >>> didn't put it in the minor edits because it had some implications that
> > >>> the group might not agree with.  if the change makes sense, it might be
> > >>> easier for you to make the edit.
> > >>>
> > >>>
> > >>>
> > >>> the other three are general comments and i'm not sure what the
> > >>> solution might be, they were mainly points, as a reader, that weren't
> > >>> clear or were a bit confusing.  these were all from the use case
> > >>> section so were probably written by the groups themselves?  if i have
> > >>> permission, i can certainly add them as issues.
> > >>>
> > >>>
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>>
> > >>>
> > >>> Michael Miller
> > >>>
> > >>> Software Engineer
> > >>>
> > >>> Institute for Systems Biology
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > >>> Sent: Tuesday, July 29, 2014 11:56 AM
> > >>>
> > >>>
> > >>> To: Michael Miller
> > >>> Cc: w3c semweb hcls
> > >>> Subject: Re: hcls dataset description comments
> > >>>
> > >>>
> > >>>
> > >>> Hi!
> > >>>
> > >>>
> > >>>
> > >>>  Thanks for the suggestions. I have incorporated your minor edits.
> > >>> Unbelievable how those slipped through after so many re-readings
> > >>> still.
> > >>>
> > >>>
> > >>>
> > >>>  For other edits, please fork the repository and create a pull
> > >>> request with your changes.
> > >>>
> > >>>
> > >>>
> > >>> Best wishes,
> > >>>
> > >>>
> > >>>
> > >>> Kim
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 23 July 2014 08:53, Michael Miller
> > >>> <Michael.Miller@systemsbiology.org>
> > >>> wrote:
> > >>>
> > >>> hi kim,
> > >>>
> > >>>
> > >>>
> > >>> thanks for the pointer, i've updated my comments based on this newer
> > >>> draft below.  many fewer and i especially like the complete example in
> > >>> 10.1!
> > >>>
> > >>>
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>>
> > >>>
> > >>> Michael Miller
> > >>>
> > >>> Software Engineer
> > >>>
> > >>> Institute for Systems Biology
> > >>>
> > >>>
> > >>>
> > >>> general comments:
> > >>>
> > >>> ·         s4.4 'Dataset Linking': might mention also that datasets
> > >>> are derived from other datasets?
> > >>> 'A dataset may incorporate, or link to, data in other datasets, e.g.
> > >>> in the creation of a data mashup ' --> 'A dataset may incorporate, be
> > >>> derived from, or link to, data in other datasets, e.g. in the analysis
> > >>> of original datasets or in the creation of a data mashup '
> > >>>
> > >>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
> > >>> individual organizations but three (8.4, 8.8, 8.9) have subsections
> > >>> for different organizations.  maybe organize so all top level sections
> > >>> define a type of organization with subsections beneath or make all
> > >>> top-level?
> > >>>
> > >>> ·         s8: some of the use cases could be more focused on how this
> > >>> note will help them (8.5-8.7)
> > >>>
> > >>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't
> > >>> clear to me how this note is relevant to them
> > >>>
> > >>> our use case questions:
> > >>>
> > >>> ·         how to reference 3rd party datasets that aren't described
> > >>> by this standard, i.e. TCGA data from the DCC, simply use
> > >>> 'pav:retrievedFrom' with the IRI being the URL into the repository?
> > >>>
> > >>> ·         we have a lot of intermediary files that we won't publish,
> > >>> the software specified in creating our published datasets from its
> > >>> sources form a (branching) workflow with the input being from the
> > >>> previous step(s) in the workflow.  how best to represent this?  this
> > >>> note doesn't seem to cover how the dataset is created so any
> > >>> recommendations?
> > >>>
> > >>> minor edits:
> > >>>
> > >>> ·         there are two s6.2.3 sections
> > >>>
> > >>> ·         s8.8.1: '... what period it is updated. To know when to...'
> > >>> should be '...what period it is updated to know when to...'?
> > >>>
> > >>>
> > >>>
> > >>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> > >>> Sent: Tuesday, July 22, 2014 3:43 PM
> > >>> To: Michael Miller
> > >>> Cc: w3c semweb hcls
> > >>> Subject: Re: hcls dataset description comments
> > >>>
> > >>>
> > >>>
> > >>> Hello,
> > >>>
> > >>>
> > >>>
> > >>>  I believe you were looking at an old document. There is currently
> > >>> only
> > >>> one
> > >>> Figure in the note.
> > >>>
> > >>>
> > >>>
> > >>>  Please check the actual draft at:
> > >>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
> > >>>
> > >>>
> > >>>
> > >>> Best wishes,
> > >>>
> > >>>
> > >>>
> > >>> Kim
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 22 July 2014 15:36, Michael Miller
> > >>> <Michael.Miller@systemsbiology.org>
> > >>> wrote:
> > >>>
> > >>> hi all,
> > >>>
> > >>>
> > >>>
> > >>> tremendous work, very clear and well-written.  my group at ISB, the
> > >>> Shmulevich lab, is looking to provide provenance for the analysis
> > >>> datasets we are producing for TCGA.  we're not sure if we'll be able to
> > >>> 'go all the way' but we want to make sure we have at hand all the
> > >>> information needed so that we could, at least in theory, be compliant.
> > >>> as long as i was reading the document, below are some notes.
> > >>>
> > >>>
> > >>>
> > >>> general comments:
> > >>>
> > >>> ·         s4.4 'Dataset Linking': might mention also that datasets
> > >>> are derived from other datasets?
> > >>> 'A dataset may incorporate, or link to, data in other datasets, e.g.
> > >>> in the creation of a data mashup ' --> 'A dataset may incorporate, be
> > >>> derived from, or link to, data in other datasets, e.g. in the analysis
> > >>> of original datasets or in the creation of a data mashup '
> > >>>
> > >>> ·         the chembl example in s5 is not compliant to the property
> > >>> table below, it probably is only supposed to show the relationship of
> > >>> the three terms but that could be clarified
> > >>>
> > >>> ·         s6.2.12 could use the example filled in
> > >>>
> > >>> ·         6.3.2: not sure what an 'X level description' is
> > >>>
> > >>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
> > >>> individual organizations but three (8.4, 8.8, 8.9) have subsections
> > >>> for different organizations.  maybe organize so all top level sections
> > >>> define a type of organization with subsections beneath or make all
> > >>> top-level?
> > >>>
> > >>> ·         s8: many of the use cases could be more focused on how this
> > >>> note will help them
> > >>>
> > >>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't
> > >>> clear to me how this note is relevant to them
> > >>>
> > >>> ·         would be nice to have a 'complete' example put together,
> > >>> maybe based on chembl?
> > >>>
> > >>>
> > >>>
> > >>> our use case questions:
> > >>>
> > >>> ·         how to reference 3rd party datasets that aren't described
> > >>> by this standard, i.e. TCGA data from the DCC, simply use
> > >>> 'pav:retrievedFrom' with the IRI being the URL into the repository?
> > >>>
> > >>> ·         we have a lot of intermediary files that we won't publish,
> > >>> the software specified in creating our published datasets from its
> > >>> sources form a (branching) workflow with the input being from the
> > >>> previous step(s) in the workflow.  how best to represent this?  this
> > >>> note doesn't seem to cover how the dataset is created so any
> > >>> recommendations?
> > >>>
> > >>>
> > >>>
> > >>> text issues:
> > >>>
> > >>> ·         Figure 1: 'Overview of dataset description level metadata
> > >>> profiles and their relationships': reference not resolved, image
> > >>> doesn't show
> > >>>
> > >>> ·         Figure 2: 'Improve diagram. Multiple appearance of
> > >>> concepts/description levels unclear.': reference not resolved, image
> > >>> doesn't show.  add actual label
> > >>>
> > >>>
> > >>>
> > >>> minor edits:
> > >>>
> > >>> ·         bottom of s.3: 'placeholde' should be 'placeholder'
> > >>>
> > >>> ·         use straight quotes rather than slant quotes in s6.2.2
> > >>> example (and elsewhere)?
> > >>>
> > >>> ·         the text runs out of the box in s6.2.3 'Description'
> > >>>
> > >>> ·         s6.2.3: 'Dates of Creation and Issuance': 'state the date
> > >>> the dataset was generated using dct:created and/or the date the
> > >>> dataset was made public using dct:created' should be 'state the date
> > >>> the dataset was generated using dct:created and/or the date the
> > >>> dataset was made public using dct:issued'?
> > >>>
> > >>> ·         there are two s6.2.3 sections
> > >>>
> > >>> ·         s6.2.4: 'Creation: ... The date of authorship' should be
> > >>> '...The date of creation' and 'Curation:... The date of authorship'
> > >>> should be '...The date of curation'?
> > >>>
> > >>> ·         s8.5: the author list has end parenthesis without beginning
> > >>> parenthesis
> > >>>
> > >>> ·         s8.8.1: '... what period it is updated. To know when to...'
> > >>> should be '...what period it is updated to know when to...'
> > >>>
> > >>>
> > >>>
> > >>> cheers,
> > >>>
> > >>> michael
> > >>>
> > >>>
> > >>>
> > >>> Michael Miller
> > >>>
> > >>> Software Engineer
> > >>>
> > >>> Institute for Systems Biology
> > >>
> > >>
> > >>
> > >> --
> > >> Stian Soiland-Reyes, myGrid team
> > >> School of Computer Science
> > >> The University of Manchester
> > >> http://soiland-reyes.com/stian/work/
> > >> http://orcid.org/0000-0001-9842-9718
>
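
The PAV guidance quoted in Stian's message above (several datasets merely combined: one pav:importedFrom statement per source; knowledge added by reasoning or calculation: pav:derivedFrom) could be sketched roughly as follows. This is only an illustration using the Python standard library: the example.org dataset IRIs, the helper name derivation_triples and its adds_knowledge flag are hypothetical and do not come from the thread or from the draft note; only the PAV property IRIs are those cited above.

```python
PAV = "http://purl.org/pav/"


def derivation_triples(target, sources, adds_knowledge=False):
    """Return N-Triples lines linking target to each source dataset.

    Rule of thumb from the thread: plain combination of sources uses
    pav:importedFrom; reasoning or calculation that adds knowledge
    uses pav:derivedFrom.
    """
    prop = PAV + ("derivedFrom" if adds_knowledge else "importedFrom")
    return [f"<{target}> <{prop}> <{src}> ." for src in sources]


# Hypothetical IRIs: a combined feature matrix built from per-platform data.
matrix = "http://example.org/dataset/feature-matrix"
platforms = [
    "http://example.org/dataset/mrna",
    "http://example.org/dataset/mirna",
]

for line in derivation_triples(matrix, platforms):
    print(line)
```

Passing adds_knowledge=True would switch every statement to pav:derivedFrom, which seems the closer fit for 4th level analysis results such as clusters.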

Received on Monday, 11 August 2014 15:00:35 UTC
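
For the branching-workflow question raised in the thread (unpublished intermediary files, each step consuming the output of earlier steps), the PROV breakdown into activities, usage and generation that Stian suggests might look something like the sketch below. It is an assumption-laden illustration, not a recommendation from the note: the example.org IRIs, the step names and the helper workflow_triples are hypothetical; prov:used and prov:wasGeneratedBy are the standard PROV-O properties.

```python
PROV = "http://www.w3.org/ns/prov#"


def workflow_triples(steps):
    """Return N-Triples lines for a list of (activity, inputs, outputs) steps.

    Each step yields one prov:used statement per input and one
    prov:wasGeneratedBy statement per output, so a branching workflow
    is simply several steps that share entity IRIs.
    """
    lines = []
    for activity, inputs, outputs in steps:
        for entity in inputs:
            lines.append(f"<{activity}> <{PROV}used> <{entity}> .")
        for entity in outputs:
            lines.append(f"<{entity}> <{PROV}wasGeneratedBy> <{activity}> .")
    return lines


# Hypothetical IRIs for a two-step workflow with an unpublished intermediate.
EX = "http://example.org/"
steps = [
    (EX + "run/merge", [EX + "data/mrna", EX + "data/mirna"], [EX + "data/merged"]),
    (EX + "run/cluster", [EX + "data/merged"], [EX + "data/clusters"]),
]

for line in workflow_triples(steps):
    print(line)
```

The intermediate entity (data/merged here) still gets an IRI even if the file itself is never published, which keeps the chain from sources to published dataset connected.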