Re: hcls dataset description comments--Dataset Descriptions vs. PROV

Just some inputs:


PROV defines prov:wasDerivedFrom, which in a broad sense describes such a
relationship between datasets. However, it tells you nothing more
about what kind of derivation is involved.


In PAV we found the need to specialize three types of derivation:

pav:retrievedFrom - http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
.. a byte-for-byte download

pav:importedFrom - http://purl.org/pav/html#http://purl.org/pav/importedFrom
.. a somewhat equivalent form of the source, but after some kind of
transformation or selection (e.g. CSV -> XML)

pav:derivedFrom - http://purl.org/pav/html#http://purl.org/pav/derivedFrom
.. when the new resource has been further refined or modified
(typically adding some further knowledge)


If you are simply concatenating several datasets, then multiple
pav:importedFrom statements would make sense. If further knowledge is
added, say by reasoning or calculation, then pav:derivedFrom would
make sense.
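
As a rough sketch of that choice (not taken from the PAV documentation), the
snippet below maps an informal "kind" of derivation to one of the three PAV
properties above and prints N-Triples-style statements. The example.org
dataset IRIs and the "copy"/"transform"/"refine" labels are made up for
illustration; only the PAV property names come from this mail.

```python
PAV = "http://purl.org/pav/"

# Informal labels (hypothetical) mapped to the PAV properties listed above.
KIND_TO_PROPERTY = {
    "copy": "retrievedFrom",      # byte-for-byte download
    "transform": "importedFrom",  # equivalent content after transformation/selection
    "refine": "derivedFrom",      # further refined, knowledge added
}


def pav_property(kind):
    """Return the full IRI of the PAV property for an informal derivation kind."""
    return PAV + KIND_TO_PROPERTY[kind]


def describe(target, sources, kind):
    """Build one N-Triples-style statement per source dataset."""
    prop = pav_property(kind)
    return ["<%s> <%s> <%s> ." % (target, prop, src) for src in sources]


# Hypothetical IRIs: a merged dataset concatenated from two platform datasets.
merged = "http://example.org/dataset/feature-matrix"
sources = [
    "http://example.org/dataset/mrna",
    "http://example.org/dataset/mirna",
]
lines = describe(merged, sources, "transform")
print("\n".join(lines))
```

A simple concatenation gets one pav:importedFrom statement per source; passing
"refine" instead would emit pav:derivedFrom for a dataset that adds knowledge.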


Now if you want to detail exactly how those datasets have been
combined, I think you are right that it would make sense to break down
the derivation using PROV statements, e.g. a series of activities,
generation and usage. How to describe these activities (e.g.
subclasses and properties) will be specific to each case.
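
To make the activity/generation/usage breakdown concrete, here is a minimal
sketch in plain Python (no RDF library). It records each step as a
prov:Activity with prov:used and prov:wasGeneratedBy statements, then walks
those statements back from the published dataset to everything upstream,
including intermediate files. The example.org IRIs and the step names are
hypothetical; only the PROV terms are real.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical base IRI


def record_step(triples, activity, inputs, output):
    """Record one processing step: the activity, what it used, what it generated."""
    triples.add((activity, RDF_TYPE, PROV + "Activity"))
    for entity in inputs:
        triples.add((activity, PROV + "used", entity))
    triples.add((output, PROV + "wasGeneratedBy", activity))


def upstream(triples, entity):
    """Return every entity reachable by following wasGeneratedBy then used."""
    found = set()
    todo = [entity]
    while todo:
        current = todo.pop()
        for s, p, activity in triples:
            if s != current or p != PROV + "wasGeneratedBy":
                continue
            for s2, p2, used in triples:
                if s2 == activity and p2 == PROV + "used" and used not in found:
                    found.add(used)
                    todo.append(used)
    return found


triples = set()
# Step 1: normalize one source into an intermediate file (not published).
record_step(triples, EX + "run/normalize", [EX + "mrna-level3"], EX + "mrna-normalized")
# Step 2: merge the intermediate with a second source into the published dataset.
record_step(
    triples,
    EX + "run/merge",
    [EX + "mrna-normalized", EX + "mirna-level3"],
    EX + "feature-matrix",
)

sources = upstream(triples, EX + "feature-matrix")
for iri in sorted(sources):
    print(iri)
```

The intermediate file still shows up as an entity in the trace even though it
is never published; whether to expose it is a case-by-case decision.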



If the process that generated the dataset resembles a dataflow, you
might be interested in the wfprov and wfdesc ontologies, which
specialize PROV to define a WorkflowRun made up of ProcessRun steps.
These can be related to a common workflow description (e.g. a
prov:Plan):

http://purl.org/wf4ever/model#wfprov
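
A hedged sketch of what such a description might look like, again in plain
Python that prints N-Triples-style lines. The wfprov/wfdesc term names and
namespaces are written from memory of the wf4ever model, so please check them
against the ontology before relying on them; the example.org IRIs and the two
steps are invented for illustration.

```python
WFPROV = "http://purl.org/wf4ever/wfprov#"
WFDESC = "http://purl.org/wf4ever/wfdesc#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical base IRI


def describe_run(run, workflow, steps):
    """Build triples for a workflow run.

    steps is a list of (process_run, process_description, inputs, outputs).
    """
    triples = [
        (workflow, RDF_TYPE, WFDESC + "Workflow"),
        (run, RDF_TYPE, WFPROV + "WorkflowRun"),
        (run, WFPROV + "describedByWorkflow", workflow),
    ]
    for process_run, process, inputs, outputs in steps:
        triples.append((process_run, RDF_TYPE, WFPROV + "ProcessRun"))
        triples.append((process_run, WFPROV + "wasPartOfWorkflowRun", run))
        triples.append((process_run, WFPROV + "describedByProcess", process))
        for artifact in inputs:
            triples.append((process_run, WFPROV + "usedInput", artifact))
        for artifact in outputs:
            triples.append((artifact, WFPROV + "wasOutputFrom", process_run))
    return triples


steps = [
    (EX + "run1/normalize", EX + "wf/normalize",
     [EX + "mrna-level3"], [EX + "mrna-normalized"]),
    (EX + "run1/merge", EX + "wf/merge",
     [EX + "mrna-normalized", EX + "mirna-level3"], [EX + "feature-matrix"]),
]
triples = describe_run(EX + "run1", EX + "wf", steps)
for s, p, o in triples:
    print("<%s> <%s> <%s> ." % (s, p, o))
```

Each later run of the same pipeline would get its own WorkflowRun and
ProcessRun IRIs while pointing at the same workflow description.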

OPMW is a similar approach:
http://www.opmw.org/model/OPMW/



On 4 August 2014 17:44, Michael Miller
<Michael.Miller@systemsbiology.org> wrote:
> hi all,
>
>
>
> as you are all undoubtedly aware, a major, if not the major TCGA dataset use
> cases revolve around taking the 3rd level data from the TCGA dcc repository
> and doing analysis, producing 4th level data such as clusters, pca, etc.
> one of the things we do here at ISB is produce an intermediate data step
> that combines the different platforms (mRNA, miRNA, RPPA, METH, etc.) into
> one feature matrix so that the analysis can use all the platforms together.
> the Broad firehose pipeline also has this as one of its outputs.
>
>
>
> as some of my comments allude to, it doesn't seem that Dataset Descriptions
> deal with the use case of describing a dataset that is specifically derived
> from other datasets, which is what we are looking at ways we might describe
> our data when we publish it.  i took a look at PROV and, i've got a bit more
> mapping to do, but it seems like PROV provides the terms we need.
>
>
>
> but this has led me to ask the question of what is the relation of Dataset
> Descriptions and PROV and how should they/should they be used together?  i
> think the above use case is quite common for datasets being published so
> might deserve a discussion in the Dataset Descriptions note
>
>
>
> cheers,
>
> michael
>
>
>
> Michael Miller
>
> Software Engineer
>
> Institute for Systems Biology
>
>
>
>
>
> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> Sent: Thursday, July 31, 2014 3:43 PM
> To: Michael Miller
> Cc: w3c semweb hcls
> Subject: Re: hcls dataset description comments
>
>
>
> Hi!
>
>
>
>   I will ponder about your edit suggestion of your first bullet point. I am
> not sure at the moment if it would have wider implications.
>
>
>
>   You are right that the use cases were written by the groups themselves. I
> do not know how to improve the use cases without rewriting them, which might
> not be agreeable to all parties involved. C'est la vie.
>
>
>
>   The role of Data Catalogs should then be discussed during our next conf
> call. Thanks for highlighting that this might be unclear to readers.
>
>
>
> Kim
>
>
>
>
>
>
>
> On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org>
> wrote:
>
> hi kim,
>
>
>
> 'For other edits, please fork the repository and create a pull request with
> your changes'
>
>
>
> of the four general comments, the first is really the only 'edit', i didn't
> put it in the minor edits because it had some implications that the group
> might not agree with.  if the change makes sense, it might be easier for you
> to make the edit.
>
>
>
> the other three are general comments and i'm not sure what the solution
> might be, they were mainly points, as a reader, that weren't clear or were a
> bit confusing.  these were all from the use case section so were probably
> written by the groups themselves?  if i have permission, i can certainly add
> them as issues.
>
>
>
> cheers,
>
> michael
>
>
>
> Michael Miller
>
> Software Engineer
>
> Institute for Systems Biology
>
>
>
>
>
> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> Sent: Tuesday, July 29, 2014 11:56 AM
>
>
> To: Michael Miller
> Cc: w3c semweb hcls
> Subject: Re: hcls dataset description comments
>
>
>
> Hi!
>
>
>
>   Thanks for the suggestions. I have incorporated your minor edits.
> Unbelievable how those slipped through after so many re-readings still.
>
>
>
>   For other edits, please fork the repository and create a pull request with
> your changes.
>
>
>
> Best wishes,
>
>
>
> Kim
>
>
>
>
>
> On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org>
> wrote:
>
> hi kim,
>
>
>
> thanks for the pointer, i've updated my comments based on this newer draft
> below.  many fewer and i especially like the complete example in 10.1!
>
>
>
> cheers,
>
> michael
>
>
>
> Michael Miller
>
> Software Engineer
>
> Institute for Systems Biology
>
>
>
> general comments:
>
> ·         s4.4 'Dataset Linking': might mention also that datasets are
> derived from other datasets?
> 'A dataset may incorporate, or link to, data in other datasets, e.g. in the
> creation of a data mashup ' --> 'A dataset may incorporate, be derived from,
> or link to, data in other datasets, e.g. in the analysis of original
> datasets or in the creation of a data mashup '
>
> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
> individual organizations but three (8.4, 8.8, 8.9) have subsections for
> different organizations.  maybe organize so all top level sections define a
> type of organization with subsections beneath or make all top-level?
>
> ·         s8: some of the use cases could be more focused on how this note
> will help them (8.5-8.7)
>
> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear to me
> how this note is relevant to them
>
> our use case questions:
>
> ·         how to reference 3rd party datasets that aren't described by this
> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom' with
> the IRI being the URL into the repository?
>
> ·         we have a lot of intermediary files that we won't publish, the
> software specified in creating our published datasets from its sources form
> a (branching) workflow with the input being from the previous step(s) in the
> workflow.  how best to represent this?  this note doesn't seem to cover how
> the dataset is created so any recommendations?
>
> minor edits:
>
> ·         there are two s6.2.3 sections
>
> ·         s8.8.1: '... what period it is updated. To know when to...' should
> be '...what period it is updated to know when to...'?
>
>
>
> From: Joachim Baran [mailto:joachim.baran@gmail.com]
> Sent: Tuesday, July 22, 2014 3:43 PM
> To: Michael Miller
> Cc: w3c semweb hcls
> Subject: Re: hcls dataset description comments
>
>
>
> Hello,
>
>
>
>   I believe you were looking at an old document. There is currently only one
> Figure in the note.
>
>
>
>   Please check the actual draft at:
> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>
>
>
> Best wishes,
>
>
>
> Kim
>
>
>
>
>
> On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org>
> wrote:
>
> hi all,
>
>
>
> tremendous work, very clear and well-written.  my group at ISB, the
> Shmulevich lab is looking to provide provenance for the analysis datasets we
> are producing for TCGA.  we're not sure if we'll be able to 'go all the way'
> but we want to make sure we have at hand all the information that we could,
> at least in theory, be compliant.  as long as i was reading the document,
> below are some notes.
>
>
>
> general comments:
>
> ·         s4.4 'Dataset Linking': might mention also that datasets are
> derived from other datasets?
> 'A dataset may incorporate, or link to, data in other datasets, e.g. in the
> creation of a data mashup ' --> 'A dataset may incorporate, be derived from,
> or link to, data in other datasets, e.g. in the analysis of original
> datasets or in the creation of a data mashup '
>
> ·         the chembl example in s5 is not compliant to the property table
> below, it probably is only supposed to show the relationship of the three
> terms but that could be clarified
>
> ·         s6.2.12 could use the example filled in
>
> ·         6.3.2: not sure what an 'X level description' is
>
> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
> individual organizations but three (8.4, 8.8, 8.9) have subsections for
> different organizations.  maybe organize so all top level sections define a
> type of organization with subsections beneath or make all top-level?
>
> ·         s8: many of the use cases could be more focused on how this note
> will help them
>
> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear to me
> how this note is relevant to them
>
> ·         would be nice to have a 'complete' example put together, maybe
> based on chembl?
>
>
>
> our use case questions:
>
> ·         how to reference 3rd party datasets that aren't described by this
> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom' with
> the IRI being the URL into the repository?
>
> ·         we have a lot of intermediary files that we won't publish, the
> software specified in creating our published datasets from its sources form
> a (branching) workflow with the input being from the previous step(s) in the
> workflow.  how best to represent this?  this note doesn't seem to cover how
> the dataset is created so any recommendations?
>
>
>
> text issues:
>
> ·         Figure 1: 'Overview of dataset description level metadata profiles
> and their relationships': reference not resolved, image doesn't show
>
> ·         Figure 2: 'Improve diagram. Multiple appearance of
> concepts/description levels unclear.': reference not resolved, image doesn't
> show.  add actual label
>
>
>
> minor edits:
>
> ·         bottom of s.3: 'placeholde' should be 'placeholder'
>
> ·         use straight quotes rather than slant quotes in s6.2.2 example
> (and elsewhere)?
>
> ·         the text runs out of the box in s6.2.3 'Description'
>
> ·         s6.2.3: 'Dates of Creation and Issuance': 'state the date the
> dataset was generated using dct:created and/or the date the dataset was made
> public using dct:created' should be 'state the date the dataset was
> generated using dct:created and/or the date the dataset was made public
> using dct:issued'?
>
> ·         there are two s6.2.3 sections
>
> ·         s6.2.4: 'Creation: ... The date of authorship' should be '...The
> date of creation' and 'Curation:... The date of authorship' should be
> '...The date of curation'?
>
> ·         s8.5: the author list has end parenthesis without beginning
> parenthesis
>
> ·         s8.8.1: '... what period it is updated. To know when to...' should
> be '...what period it is updated to know when to...'
>
>
>
> cheers,
>
> michael
>
>
>
> Michael Miller
>
> Software Engineer
>
> Institute for Systems Biology



-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718

Received on Tuesday, 5 August 2014 12:38:47 UTC