RE: hcls dataset description comments--Dataset Descriptions vs. PROV from Michael Miller on 2014-08-28 (public-semweb-lifesci@w3.org from August 2014)

From: Michael Miller <Michael.Miller@systemsbiology.org>
Date: Thu, 28 Aug 2014 08:08:27 -0700
To: Joachim Baran <joachim.baran@gmail.com>, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Cc: Alasdair J G Gray <A.J.G.Gray@hw.ac.uk>, w3c semweb hcls <public-semweb-lifesci@w3.org>, pav-ontology@googlegroups.com
Message-ID: <14882860fa05243dee69164a6a6383af@mail.gmail.com>
hi kim and stian,



thanks for your comments, aggregation is indeed appropriate i think.   one
extra kicker is that the set of imported files has all the data, i.e. if
there were three columns in the original files, there will be three files,
one for each column.  so there is, in a sense, an aggregation of
imported files also.



i agree that it's hard to distinguish between 'clever and non-clever'
functions, but not if you change the criterion to the distinction between
transforming and mapping functions upon the data themselves.  here we have
a clear mapping function, so i would lean towards import.



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology





*From:* Joachim Baran [mailto:joachim.baran@gmail.com]
*Sent:* Wednesday, August 27, 2014 5:49 PM
*To:* Stian Soiland-Reyes
*Cc:* Michael Miller; Alasdair J G Gray; w3c semweb hcls;
pav-ontology@googlegroups.com
*Subject:* Re: hcls dataset description comments--Dataset Descriptions vs.
PROV



  I disagree.



  In my opinion it is subjective to use "import" when some unspecified
functions are applied to the data that is being incorporated in a new
dataset, but "derived" other times. I think there is no objective
discriminator that could distinguish between clever and non-clever
functions.



  If X is the source dataset and Y is the new dataset, then I would use
"import" if X is a subset of Y. I would use "derived" otherwise.
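
A minimal Python sketch of this subset criterion, assuming each dataset can be modelled as a set of hashable records. The function name `pav_relation` and the sample records are hypothetical and are not part of PAV.

```python
def pav_relation(source, new):
    """Pick a PAV property under the subset criterion.

    Assumption: both datasets are modelled as sets of hashable records.
    """
    # "import" if every record of the source X survives unchanged in Y
    if set(source) <= set(new):
        return "pav:importedFrom"
    # anything else counts as a derivation under this criterion
    return "pav:derivedFrom"


x = {("alice", "alice@example.com"), ("bob", "bob@example.com")}
y_superset = x | {("carol", "carol@example.com")}
y_filtered = {("alice", "alice@example.com")}

print(pav_relation(x, y_superset))  # pav:importedFrom
print(pav_relation(x, y_filtered))  # pav:derivedFrom
```

Under this reading, a new dataset that keeps only some columns of X would come out as "derived", because the source records are no longer present whole.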



Kim





On 27 August 2014 16:47, Stian Soiland-Reyes <
soiland-reyes@cs.manchester.ac.uk> wrote:

That is a nice use-case.

If you make a new file by selecting out a column from a single source file,
that is a narrow case of pav:importedFrom in my opinion. This is why we
added in http://purl.org/pav/html#http://purl.org/pav/importedFrom the
phrase:

> The imported resource does not have to be complete, but should be
consistent with the knowledge conveyed by the original resource.

e.g. if you extract a list of all the names and  email addresses from an
address book (but skipping phone and fax numbers) then that is still a case
of import (which is not complete, but consistent).

However if you do anything "clever", like importing only those addresses
that are in the UK, then you are deriving new information and cannot use
pav:importedFrom (pav:derivedFrom would be appropriate).
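
The address book distinction can be sketched in Python. This is only an illustration: the sample records, field names and helper functions are made up, and the property named in each comment follows the reading given above rather than anything the code can decide by itself.

```python
address_book = [
    {"name": "Ann", "email": "ann@example.org", "phone": "555-0100", "country": "UK"},
    {"name": "Bo", "email": "bo@example.org", "phone": "555-0101", "country": "NO"},
]


def select_columns(rows, columns):
    """Keep every row but only the named columns (incomplete, still consistent)."""
    return [{c: row[c] for c in columns} for row in rows]


def select_rows(rows, predicate):
    """Keep only rows matching a predicate; choosing the predicate adds knowledge."""
    return [row for row in rows if predicate(row)]


# Skipping phone numbers: still an import under the reading above.
names_and_emails = select_columns(address_book, ["name", "email"])  # pav:importedFrom

# Keeping only UK addresses: a derivation under the reading above.
uk_only = select_rows(address_book, lambda r: r["country"] == "UK")  # pav:derivedFrom
```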



In your case you are turning it around across multiple sources.. so if I
understand it right, it is as if your source is a set of vcard files, one
per person, and then you create new files, one that has all the email
addresses, another one with all the names, etc. I don't think it should
matter if the 'subject' is in the column or the row.

The new resource is one that is expressing a collection of email addresses.
So if we are to express a pav:importedFrom here, it had better go to
some collection of vcards, rather than multiple imports of many vcard files
(where did that list of vcard files come from? Who decides which is in or
out?)

So I think it sounds like it would be better to express that you are
importing the collection/aggregation of those files (e.g. an
ore:Aggregation) - rather than having hundred import edges. Then you have a
resource to describe where that selection of files came from rather than
them seemingly randomly meeting up in that import. :-)



e.g.

@base <http://www.example.com/> .
@prefix pav: <http://purl.org/pav/> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<merged.csv> pav:importedFrom <sourcefiles/> ;
  pav:importedBy <http://orcid.org/0000-0001-9842-9718> ;
  pav:createdWith </merger-tool> .

<sourcefiles/> a ore:Aggregation ;
  ore:aggregates <sourcefiles/file1.csv>, <sourcefiles/file2.csv>,
    <sourcefiles/file3.csv> ;
  pav:createdBy <http://orcid.org/0000-0001-9842-9718> ;
  pav:derivedFrom <https://tcga-data.nci.nih.gov/tcga/is-there-a-query-link> ;
  pav:providedBy <https://tcga-data.nci.nih.gov/tcga/> .

<sourcefiles/file1.csv>
  pav:retrievedFrom <https://tcga-data.nci.nih.gov/tcga/was-there-a-download-link> ;
  pav:createdWith <https://tcga-data.nci.nih.gov/tcga/> .

<http://orcid.org/0000-0001-9842-9718> a foaf:Person, prov:Person ;
  foaf:name "Stian Soiland-Reyes" .





For argument's sake I have stayed with PAV properties here as I think it
makes it rather clear. The above says that the <merged.csv> conveys the
same knowledge as the <sourcefiles/> aggregation which its content is
imported from. The CSV representation was made with /merger-tool. Stian
initiated the import - clicked the button so to speak (perhaps set some
parameters) - but did not (according to these statements alone) convey any
knowledge into the CSV.

The ORE aggregation (but not its files) was created by Stian. (but I didn't
author/contribute to the aggregation, unless I selected the files). It
contains 3 files. If there is a query link, then we can give its
pav:derivedFrom (or even pav:importedFrom) - but anyway we can give at
least pav:providedBy to indicate the original publisher of this collection
(e.g. the list was on the result page).

Each of the files has been retrieved - now if there is not a download-link
from tcga this gets a bit tricky, but again pav:providedBy can be a last
resort to at least indicate the service. Here we use pav:createdWith - I
don't know about the provenance of those files, are they uploaded verbatim
to tcga (just pav:providedBy) or created on demand based on the
query (pav:createdWith)?



On 21 August 2014 16:21, Michael Miller <Michael.Miller@systemsbiology.org>
wrote:

hi stian and alasdair,



there's a real use case along these lines that is part of the broad's TCGA
firehose pipeline[1].  for each tumor type and for each of the platforms
(gene expression, miRNA, methylation, etc.) the data is stored per
subject[2].  part of the broad pipeline 'merges'  all the values from the
subject files into one file per column from the set of original files per
subject.  so for miRNA[3], there is a file that merges all the raw values,
 a file that merges all the RPKM values, and a file that merges the values
from the cross-mapping column.  the gene names are not duplicated, they are
the row headers.  so no values are changed, just a bit of modest
reformatting and filtering.
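
The merge described above can be sketched in Python. This is not the Broad pipeline's code: it assumes the per-subject files have already been parsed into nested dictionaries, and the function name, subject ids, gene name and column names are hypothetical.

```python
def merge_by_column(subject_tables):
    """Turn {subject: {gene: {column: value}}} into {column: {gene: {subject: value}}}.

    Values are copied unchanged; only the layout is rearranged.
    """
    merged = {}
    for subject, table in subject_tables.items():
        for gene, columns in table.items():
            for column, value in columns.items():
                # one output table per column, genes as rows, subjects as columns
                merged.setdefault(column, {}).setdefault(gene, {})[subject] = value
    return merged


subject_tables = {
    "subject-1": {"mir-21": {"read_count": 120, "rpkm": 3.5}},
    "subject-2": {"mir-21": {"read_count": 98, "rpkm": 2.9}},
}
merged = merge_by_column(subject_tables)
print(merged["rpkm"])  # {'mir-21': {'subject-1': 3.5, 'subject-2': 2.9}}
```

Each key of `merged` corresponds to one merged output file, which is why no value changes between input and output.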



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology



[1] https://confluence.broadinstitute.org/display/GDAC/Dashboard-Stddata

[2] https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp and
https://tcga-data.nci.nih.gov/ccg-data-web/searchForTCGAData.htm

[3]
http://gdac.broadinstitute.org/runs/stddata__2014_07_15/data/STAD/20140715/
and the file
gdac.broadinstitute.org_STAD.Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2014071500.0.0.tar.gz
<http://gdac.broadinstitute.org/runs/stddata__2014_07_15/data/STAD/20140715/gdac.broadinstitute.org_STAD.Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2014071500.0.0.tar.gz>





*From:* stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] *On Behalf Of *Stian
Soiland-Reyes
*Sent:* Wednesday, August 20, 2014 5:46 PM
*To:* Alasdair J G Gray
*Cc:* w3c semweb hcls; Joachim Baran; Michael Miller
*Subject:* Re: hcls dataset description comments--Dataset Descriptions vs.
PROV



Hi, sorry for not replying earlier.



I think it depends on the nature of the concatenation. pav:importedFrom
with multiple resources would only make sense for 'pure' concatenation
where no additional knowledge is conceived, and the content of both sources
can be said to be preserved. So for instance, if two CSV files are simply
merged by adding the new rows at the bottom, it could work with multiple
pav:importedFrom. Anything more clever that goes beyond just changing
formats, like matching up foreign keys or heuristic matching on compound
names would mean you are adding new content/knowledge and must use
pav:derivedFrom instead.  Concatenating RDF graphs is a grey area here, as
nodes with the same URI automatically merge. Perhaps it also depends on
your reason for adding the second source - was that based on inspecting the
first resource (e.g. following links) or just something you always do
blindly?
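
The grey area around RDF graphs can be seen in a small Python sketch. It models triples as plain tuples rather than using an RDF library, and the IRIs, prefixes and values are made up for illustration.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
graph_a = {
    ("http://example.com/compound/1", "rdfs:label", "aspirin"),
}
graph_b = {
    ("http://example.com/compound/1", "ex:molecularWeight", "180.16"),
    ("http://example.com/compound/2", "rdfs:label", "caffeine"),
}

# Concatenating two graphs is a set union of their triples.
merged = graph_a | graph_b


def describe(graph, subject):
    """Collect every predicate/object pair stated about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


# Statements from both sources now sit on the same node.
print(describe(merged, "http://example.com/compound/1"))
```

No triple was changed by the union, yet compound/1 now carries a description that neither source contained in full, which is what makes it hard to call this a pure concatenation.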

So given that pav:importedFrom with multiple sources is a thin line that
can be hard to explain, perhaps it's better to just leave the above as an
unexplored edge-case, and rather recommend using pav:derivedFrom (or the
non-specific prov:wasDerivedFrom) when there are multiple sources.



On 11 Aug 2014 15:53, "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk> wrote:

Hi Stian,



On 5 Aug 2014, at 13:37, Stian Soiland-Reyes <
soiland-reyes@CS.MANCHESTER.AC.UK> wrote:



Just some inputs:


PROV defines prov:wasDerivedFrom which in a broad sense describes such a
relationship between datasets. However you do not know anything more
about what kind of derivation we are talking about.


In PAV we found the need to specialize three types of derivation:

pav:retrievedFrom -
http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
.. a byte-for-byte download

pav:importedFrom - http://purl.org/pav/html#http://purl.org/pav/importedFrom
.. a somewhat equivalent form of the source, but after some kind of
transformation or selection (e.g. CSV -> XML)

pav:derivedFrom - http://purl.org/pav/html#http://purl.org/pav/derivedFrom
.. when the new resource has been further refined or modified
(somewhat adding additional knowledge)


If you are simply concatenating several datasets, then multiple
pav:importedFrom statements would make sense. If further knowledge is
added, say by reasoning or calculation, then pav:derivedFrom would
make sense.
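
As a rough Python sketch of how the three properties relate: a byte-for-byte comparison can detect the pav:retrievedFrom case, but whether knowledge was added is a human judgement. The function name and the `adds_knowledge` flag are hypothetical and only serve this illustration.

```python
import hashlib


def suggest_pav_property(source_bytes, new_bytes, adds_knowledge):
    """Suggest one of the three PAV derivation properties.

    `adds_knowledge` is a human judgement that code cannot make
    (hypothetical parameter for this sketch).
    """
    same = hashlib.sha256(source_bytes).digest() == hashlib.sha256(new_bytes).digest()
    if same:
        return "pav:retrievedFrom"  # byte-for-byte copy
    if adds_knowledge:
        return "pav:derivedFrom"    # refined or modified content
    return "pav:importedFrom"       # transformed or selected, same knowledge


csv_bytes = b"name,email\nAnn,ann@example.org\n"
xml_bytes = b"<row><name>Ann</name><email>ann@example.org</email></row>"

print(suggest_pav_property(csv_bytes, csv_bytes, False))  # pav:retrievedFrom
print(suggest_pav_property(csv_bytes, xml_bytes, False))  # pav:importedFrom
print(suggest_pav_property(csv_bytes, xml_bytes, True))   # pav:derivedFrom
```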



Are you sure about this? I thought that pav:importedFrom meant that the
derived dataset was essentially the same data as the original modulo data
format, i.e. it is a 1:1 relationship (as near as possible). To my mind,
this would mean that you could not have more than one pav:importedFrom
statement for a dataset.



Alasdair





Now if you want to detail exactly how those datasets have been
combined, I think you are right that would make sense to break down
the derivation using PROV statements, e.g. a series of activities,
generation and usage. How to describe these activities (e.g.
subclasses and properties) will be specific to each case.
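
A minimal Python sketch of such a breakdown, which prints one prov:Activity per step together with its usage (prov:used) and generation (prov:wasGeneratedBy) statements. The step names and relative IRIs are hypothetical placeholders, and the output is meant to be read alongside a suitable @base and @prefix prov: declaration.

```python
def prov_statements(steps):
    """Render workflow steps as PROV-O statements, one Turtle-style line each.

    Each step is (activity_iri, input_iris, output_iris); all IRIs are
    hypothetical placeholders.
    """
    lines = []
    for activity, inputs, outputs in steps:
        lines.append(f"<{activity}> a prov:Activity .")
        for used in inputs:
            lines.append(f"<{activity}> prov:used <{used}> .")
        for generated in outputs:
            lines.append(f"<{generated}> prov:wasGeneratedBy <{activity}> .")
    return lines


steps = [
    ("run/merge", ["in/mirna.tsv", "in/mrna.tsv"], ["tmp/merged.tsv"]),
    ("run/filter", ["tmp/merged.tsv"], ["out/feature-matrix.tsv"]),
]
print("\n".join(prov_statements(steps)))
```

An intermediate file such as tmp/merged.tsv appears both as an output of one step and as an input of the next, so the chain stays traceable even if that file is never published.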



If the process you generated the dataset with somewhat resembles a
dataflow, you might be interested in the wfprov and wfdesc ontologies
that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
which can be related to a common workflow description (e.g. a
prov:Plan):

http://purl.org/wf4ever/model#wfprov

OPMW is a similar approach:
http://www.opmw.org/model/OPMW/



On 4 August 2014 17:44, Michael Miller
<Michael.Miller@systemsbiology.org> wrote:

hi all,



as you are all undoubtedly aware, a major, if not the major, TCGA dataset use
case revolves around taking the 3rd level data from the TCGA dcc repository
and doing analysis, producing 4th level data such as clusters, pca, etc.
one of the things we do here at ISB is produce an intermediate data step
that combines the different platforms (mRNA, miRNA, RPPA, METH, etc.) into
one feature matrix so that the analysis can use all the platforms together.
the Broad firehose pipeline also has this as one of its outputs.



as some of my comments allude to, it doesn't seem that Dataset Descriptions
deal with the use case of describing a dataset that is specifically derived
from other datasets, which is what we are looking at ways we might describe
our data when we publish it.  i took a look at PROV and, i've got a bit more
mapping to do, but it seems like PROV provides the terms we need.



but this has led me to ask the question of what is the relation of Dataset
Descriptions and PROV and how should they/should they be used together?  i
think the above use case is quite common for datasets being published so
might deserve a discussion in the Dataset Descriptions note



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology





From: Joachim Baran [mailto:joachim.baran@gmail.com]
Sent: Thursday, July 31, 2014 3:43 PM
To: Michael Miller
Cc: w3c semweb hcls
Subject: Re: hcls dataset description comments



Hi!



 I will ponder the edit suggested in your first bullet point. I am
not sure at the moment if it would have wider implications.



 You are right that the use cases were written by the groups themselves. I
do not know how to improve the use cases without rewriting them, which might
not be agreeable to all parties involved. C'est la vie.



 The role of Data Catalogs should then be discussed during our next conf
call. Thanks for highlighting that this might be unclear to readers.



Kim







On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org>
wrote:

hi kim,



'For other edits, please fork the repository and create a pull request with
your changes'



of the four general comments, the first is really the only 'edit', i didn't
put it in the minor edits because it had some implications that the group
might not agree with.  if the change makes sense, it might be easier for you
to make the edit.



the other three are general comments and i'm not sure what the solution
might be, they were mainly points, as a reader, that weren't clear or were a
bit confusing.  these were all from the use case section so were probably
written by the groups themselves?  if i have permission, i can certainly add
them as issues.



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology





From: Joachim Baran [mailto:joachim.baran@gmail.com]
Sent: Tuesday, July 29, 2014 11:56 AM


To: Michael Miller
Cc: w3c semweb hcls
Subject: Re: hcls dataset description comments



Hi!



 Thanks for the suggestions. I have incorporated your minor edits.
Unbelievable how those slipped through after so many re-readings still.



 For other edits, please fork the repository and create a pull request with
your changes.



Best wishes,



Kim





On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org>
wrote:

hi kim,



thanks for the pointer, i've updated my comments based on this newer draft
below.  many fewer and i especially like the complete example in 10.1!



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology



general comments:

·         s4.4 'Dataset Linking': might mention also that datasets are
derived from other datasets?
'A dataset may incorporate, or link to, data in other datasets, e.g. in the
creation of a data mashup ' --> 'A dataset may incorporate, be derived from,
or link to, data in other datasets, e.g. in the analysis of original
datasets or in the creation of a data mashup '

·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
individual organizations but three (8.4, 8.8, 8.9) have subsections for
different organizations.  maybe organize so all top level sections define a
type of organization with subsections beneath or make all top-level?

·         s8: some of the use cases could be more focused on how this note
will help them (8.5-8.7)

·         s8.9: how do Data Catalogs fit into this note?  wasn't clear to me
how this note is relevant to them

our use case questions:

·         how to reference 3rd party datasets that aren't described by this
standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom' with
the IRI being the URL into the repository?

·         we have a lot of intermediary files that we won't publish, the
software specified in creating our published datasets from its sources form
a (branching) workflow with the input being from the previous step(s) in the
workflow.  how best to represent this?  this note doesn't seem to cover how
the dataset is created so any recommendations?

minor edits:

·         there are two s6.2.3 sections

·         s8.8.1: '... what period it is updated. To know when to...' should
be '...what period it is updated to know when to...'?



From: Joachim Baran [mailto:joachim.baran@gmail.com]
Sent: Tuesday, July 22, 2014 3:43 PM
To: Michael Miller
Cc: w3c semweb hcls
Subject: Re: hcls dataset description comments



Hello,



 I believe you were looking at an old document. There is currently only one
Figure in the note.



 Please check the actual draft at:
http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html



Best wishes,



Kim





On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org>
wrote:

hi all,



tremendous work, very clear and well-written.  my group at ISB, the
Shmulevich lab is looking to provide provenance for the analysis datasets we
are producing for TCGA.  we're not sure if we'll be able to 'go all the way'
but we want to make sure we have at hand all the information that we could,
at least in theory, be compliant.  as long as i was reading the document,
below are some notes.



general comments:

·         s4.4 'Dataset Linking': might mention also that datasets are
derived from other datasets?
'A dataset may incorporate, or link to, data in other datasets, e.g. in the
creation of a data mashup ' --> 'A dataset may incorporate, be derived from,
or link to, data in other datasets, e.g. in the analysis of original
datasets or in the creation of a data mashup '

·         the chembl example in s5 is not compliant to the property table
below, it probably is only supposed to show the relationship of the three
terms but that could be clarified

·         s6.2.12 could use the example filled in

·         6.3.2: not sure what an 'X level description' is

·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
individual organizations but three (8.4, 8.8, 8.9) have subsections for
different organizations.  maybe organize so all top level sections define a
type of organization with subsections beneath or make all top-level?

·         s8: many of the use cases could be more focused on how this note
will help them

·         s8.9: how do Data Catalogs fit into this note?  wasn't clear to me
how this note is relevant to them

·         would be nice to have a 'complete' example put together, maybe
based on chembl?



our use case questions:

·         how to reference 3rd party datasets that aren't described by this
standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom' with
the IRI being the URL into the repository?

·         we have a lot of intermediary files that we won't publish, the
software specified in creating our published datasets from its sources form
a (branching) workflow with the input being from the previous step(s) in the
workflow.  how best to represent this?  this note doesn't seem to cover how
the dataset is created so any recommendations?



text issues:

·         Figure 1: 'Overview of dataset description level metadata profiles
and their relationships': reference not resolved, image doesn't show

·         Figure 2: 'Improve diagram. Multiple appearance of
concepts/description levels unclear.': reference not resolved, image doesn't
show.  add actual label



minor edits:

·         bottom of s.3: 'placeholde' should be 'placeholder'

·         use straight quotes rather than slant quotes in s6.2.2 example
(and elsewhere)?

·         the text runs out of the box in s6.2.3 'Description'

·         s6.2.3: 'Dates of Creation and Issuance': 'state the date the
dataset was generated using dct:created and/or the date the dataset was made
public using dct:created' should be 'state the date the dataset was
generated using dct:created and/or the date the dataset was made public
using dct:issued'?

·         there are two s6.2.3 sections

·         s6.2.4: 'Creation: ... The date of authorship' should be '...The
date of creation' and 'Curation:... The date of authorship' should be
'...The date of curation'?

·         s8.5: the author list has end parenthesis without beginning
parenthesis

·         s8.8.1: '... what period it is updated. To know when to...' should
be '...what period it is updated to know when to...'



cheers,

michael



Michael Miller

Software Engineer

Institute for Systems Biology











-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718



Alasdair J G Gray

Lecturer in Computer Science, Heriot-Watt University, UK.

Email: A.J.G.Gray@hw.ac.uk

Web: http://www.alasdairjggray.co.uk

ORCID: http://orcid.org/0000-0002-5711-4872

Telephone: +44 131 451 3429

Twitter: @gray_alasdair














Received on Thursday, 28 August 2014 15:08:59 UTC