Re: hcls dataset description comments--Dataset Descriptions vs. PROV from Joachim Baran on 2014-08-28 (public-semweb-lifesci@w3.org from August 2014)

From: Joachim Baran <joachim.baran@gmail.com>
Date: Wed, 27 Aug 2014 17:48:37 -0700
To: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Cc: Michael Miller <Michael.Miller@systemsbiology.org>, Alasdair J G Gray <A.J.G.Gray@hw.ac.uk>, w3c semweb hcls <public-semweb-lifesci@w3.org>, "pav-ontology@googlegroups.com" <pav-ontology@googlegroups.com>
Message-ID: <CAObSwHX=XBm_UB1kjJUE3z9okC+8_iT-MAdkKPaQg7OGd+sL9A@mail.gmail.com>
  I disagree.

  In my opinion it is subjective to use "import" when some unspecified
functions are applied to the data that is being incorporated in a new
dataset, but "derived" other times. I think there is no objective
discriminator that could distinguish between clever and non-clever
functions.

  If X is the source dataset and Y is the new dataset, then I would use
"import" if X is a subset of Y. I would use "derived" otherwise.
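
  As a minimal sketch of that rule, read literally: assume (hypothetically)
that a dataset can be modelled as a set of (row, column, value) cells. The
helper name pav_relation and the address-book data below are made up for
illustration; only the two PAV property names come from this thread.

```python
# Sketch of the subset rule: "import" if X is a subset of Y, else "derived".
# Datasets are modelled as frozensets of (row, column, value) cells, which is
# an assumption made for this example, not something PAV prescribes.

def pav_relation(source: frozenset, new: frozenset) -> str:
    """Return the PAV property suggested by the subset rule."""
    return "pav:importedFrom" if source <= new else "pav:derivedFrom"


# X: a tiny, invented address book.
address_book = frozenset({
    ("alice", "name", "Alice"),
    ("alice", "email", "alice@example.com"),
    ("alice", "country", "UK"),
    ("bob", "name", "Bob"),
    ("bob", "email", "bob@example.com"),
    ("bob", "country", "NO"),
})

# A second, invented source that gets concatenated with X.
other_book = frozenset({
    ("carol", "name", "Carol"),
    ("carol", "email", "carol@example.com"),
    ("carol", "country", "UK"),
})

# Y1: pure concatenation, every cell of X survives.
merged = address_book | other_book

# Y2: only the rows whose country is UK.
uk_rows = {row for (row, col, val) in address_book if col == "country" and val == "UK"}
uk_only = frozenset(cell for cell in address_book if cell[0] in uk_rows)

# Y3: only the name and email columns.
names_and_emails = frozenset(
    cell for cell in address_book if cell[1] in {"name", "email"}
)

for label, dataset in [("merged", merged),
                       ("uk_only", uk_only),
                       ("names_and_emails", names_and_emails)]:
    print(label, pav_relation(address_book, dataset))
```

  Under this reading the concatenation is an import, while both the UK-only
selection and the name-and-email extract count as derived, because each of
them drops cells of X.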

Kim



On 27 August 2014 16:47, Stian Soiland-Reyes <
soiland-reyes@cs.manchester.ac.uk> wrote:

> That is a nice use-case.
>
> If you make a new file by selecting out a column from a single source
> file, that is a narrow case of pav:importedFrom in my opinion. This is why
> we added in http://purl.org/pav/html#http://purl.org/pav/importedFrom the
> phrase:
>
> > The imported resource does not have to be complete, but should be
> consistent with the knowledge conveyed by the original resource.
>
> e.g. if you extract a list of all the names and  email addresses from an
> address book (but skipping phone and fax numbers) then that is still a case
> of import (which is not complete, but consistent).
>
> However if you do anything "clever", like importing only those addresses
> that are in the UK, then you are deriving new information and cannot use
> pav:importedFrom (pav:derivedFrom would be appropriate).
>
>
> In your case you are turning it around across multiple sources, so if I
> understand it right, it is as if your source is a set of vcard files, one
> per person, and then you create new files, one that has all the email
> addresses, another one with all the names, etc. I don't think it should
> matter if the 'subject' is in the column or the row.
>
> The new resource is one that is expressing a collection of email
> addresses. So if we are to express a pav:importedFrom here, it had
> better point to some collection of vcards, rather than multiple imports of
> many vcard files (where did that list of vcard files come from? Who decides
> which is in or out?).
>
> So I think it sounds like it would be better to express that you are
> importing the collection/aggregation of those files (e.g. an
> ore:Aggregation) - rather than having a hundred import edges. Then you have a
> resource to describe where that selection of files came from rather than
> them seemingly randomly meeting up in that import. :-)
>
> e.g.
>
>
> @prefix pav:  <http://purl.org/pav/> .
> @prefix ore:  <http://www.openarchives.org/ore/terms/> .
> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
> @prefix prov: <http://www.w3.org/ns/prov#> .
> @base <http://www.example.com/> .
>
> <merged.csv> pav:importedFrom <sourcefiles/> ;
>   pav:importedBy <http://orcid.org/0000-0001-9842-9718> ;
>   pav:createdWith </merger-tool> .
>
> <sourcefiles/> a ore:Aggregation ;
>   ore:aggregates <sourcefiles/file1.csv>, <sourcefiles/file2.csv>,
>     <sourcefiles/file3.csv> ;
>   pav:createdBy <http://orcid.org/0000-0001-9842-9718> ;
>   pav:derivedFrom
>     <https://tcga-data.nci.nih.gov/tcga/is-there-a-query-link> ;
>   pav:providedBy <https://tcga-data.nci.nih.gov/tcga/> .
>
> <sourcefiles/file1.csv> pav:retrievedFrom
>     <https://tcga-data.nci.nih.gov/tcga/was-there-a-download-link> ;
>   pav:createdWith <https://tcga-data.nci.nih.gov/tcga/> .
>
> <http://orcid.org/0000-0001-9842-9718> a foaf:Person, prov:Person ;
>   foaf:name "Stian Soiland-Reyes" .
>
>
>
> For argument's sake I have stayed with PAV properties here as I think it
> makes it rather clear. The above says that the <merged.csv> conveys the
> same knowledge as the <sourcefiles/> aggregation which its content is
> imported from. The CSV representation was made with /merger-tool. Stian
> initiated the import - clicked the button so to speak (perhaps set some
> parameters) - but did not (according to these statements alone) convey any
> knowledge into the CSV.
>
> The ORE aggregation (but not its files) was created by Stian, although I
> did not author or contribute to it unless I selected the files myself.
> It contains 3 files. If there is a query link, then we can give its
> pav:derivedFrom (or even pav:importedFrom) - but anyway we can give at
> least pav:providedBy to indicate the original publisher of this collection
> (e.g. the list was on the result page).
>
> Each of the files has been retrieved - now if there is no
> download link from tcga this gets a bit tricky, but again pav:providedBy
> can be a last resort to at least indicate the service. Here we use
> pav:createdWith - I don't know about the provenance of those files: are
> they uploaded verbatim to tcga (just pav:providedBy) or created on demand
> based on the query (pav:createdWith)?
>
>
>
>
> On 21 August 2014 16:21, Michael Miller <Michael.Miller@systemsbiology.org
> > wrote:
>
>> hi stian and alasdair,
>>
>>
>>
>> there's a real use case along these lines that is part of the broad's
>> TCGA firehose pipeline[1].  for each tumor type and for each of the
>> platforms (gene expression, miRNA, methylation, etc.) the data is stored
>> per subject[2].  part of the broad pipeline 'merges'  all the values from
>> the subject files into one file per column from the set of original files
>> per subject.  so for miRNA[3], there is a file that merges all the raw
>> values,  a file that merges all the RPKM values, and a file that merges the
>> values from the cross-mapping column.  the gene names are not duplicated,
>> they are the row headers.  so no values are changed, just a bit of modest
>> reformatting and filtering.
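
The merge described above can be sketched roughly as follows. This is an
illustration only, not the Firehose code: the function name merge_by_column,
the subject and gene identifiers, and all values are made up, and the
per-subject files are assumed to be already parsed into nested dictionaries.

```python
# Sketch: pivot per-subject tables into one table per column, with gene
# names as row keys and subjects as columns. No value is changed; cells are
# only rearranged.

def merge_by_column(per_subject: dict) -> dict:
    """Turn {subject: {gene: {column: value}}} into
    {column: {gene: {subject: value}}}."""
    merged: dict = {}
    for subject, genes in per_subject.items():
        for gene, columns in genes.items():
            for column, value in columns.items():
                merged.setdefault(column, {}).setdefault(gene, {})[subject] = value
    return merged


# Invented input: two subjects, two genes, two value columns each.
per_subject = {
    "subject-1": {
        "mir-21": {"raw": 120, "rpkm": 3.5},
        "mir-155": {"raw": 40, "rpkm": 1.2},
    },
    "subject-2": {
        "mir-21": {"raw": 95, "rpkm": 2.8},
        "mir-155": {"raw": 61, "rpkm": 1.9},
    },
}

merged = merge_by_column(per_subject)

# One merged table per original column; each row is a gene.
for column, table in merged.items():
    print(column)
    for gene, by_subject in table.items():
        print("  ", gene, by_subject)
```

Each merged table holds exactly the cells of one source column, so the
original values can be read back unchanged.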
>>
>>
>>
>> cheers,
>>
>> michael
>>
>>
>>
>> Michael Miller
>>
>> Software Engineer
>>
>> Institute for Systems Biology
>>
>>
>>
>> [1] https://confluence.broadinstitute.org/display/GDAC/Dashboard-Stddata
>>
>> [2] https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp and
>> https://tcga-data.nci.nih.gov/ccg-data-web/searchForTCGAData.htm
>>
>> [3]
>> http://gdac.broadinstitute.org/runs/stddata__2014_07_15/data/STAD/20140715/
>> and the file
>> gdac.broadinstitute.org_STAD.Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2014071500.0.0.tar.gz
>> <http://gdac.broadinstitute.org/runs/stddata__2014_07_15/data/STAD/20140715/gdac.broadinstitute.org_STAD.Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2014071500.0.0.tar.gz>
>>
>>
>>
>>
>>
>> *From:* stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] *On Behalf Of *Stian
>> Soiland-Reyes
>> *Sent:* Wednesday, August 20, 2014 5:46 PM
>> *To:* Alasdair J G Gray
>> *Cc:* w3c semweb hcls; Joachim Baran; Michael Miller
>> *Subject:* Re: hcls dataset description comments--Dataset Descriptions
>> vs. PROV
>>
>>
>>
>> Hi, sorry for not replying earlier.
>>
>>
>>
>> I think it depends on the nature of the concatenation. pav:importedFrom
>> with multiple resources would only make sense for 'pure' concatenation
>> where no additional knowledge is conceived, and the content of both sources
>> can be said to be preserved. So for instance, if two CSV files are simply
>> merged by adding the new rows at the bottom, it could work with multiple
>> pav:importedFrom. Anything more clever that goes beyond just changing
>> formats, like matching up foreign keys or heuristic matching on compound
>> names would mean you are adding new content/knowledge and must use
>> pav:derivedFrom instead.  Concatenating RDF graphs is a grey area here, as
>> nodes with the same URI automatically merge. Perhaps it also depends on
>> your reason for adding the second source - was that based on inspecting the
>> first resource (e.g. following links) or just something you always do
>> blindly?
>>
>> So given that pav:importedFrom with multiple sources is a thin line that
>> can be hard to explain, perhaps it's better to just leave the above as an
>> unexplored edge-case, and rather recommend using pav:derivedFrom (or the
>> non-specific prov:wasDerivedFrom) when there are multiple sources.
>>
>>
>>
>> On 11 Aug 2014 15:53, "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk> wrote:
>>
>> Hi Stian,
>>
>>
>>
>> On 5 Aug 2014, at 13:37, Stian Soiland-Reyes <
>> soiland-reyes@CS.MANCHESTER.AC.UK> wrote:
>>
>>
>>
>> Just some inputs:
>>
>>
>> PROV defines prov:wasDerivedFrom which in a broad sense describes such a
>> relationship between datasets. However you do not know anything more
>> about what kind of derivation we are talking about.
>>
>>
>> In PAV we found the need to specialize three types of derivation:
>>
>> pav:retrievedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
>> .. a byte-for-byte download
>>
>> pav:importedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/importedFrom
>> .. a somewhat equivalent form of the source, but after some kind of
>> transformation or selection (e.g. CSV -> XML)
>>
>> pav:derivedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
>> .. when the new resource has been further refined or modified
>> (somewhat adding additional knowledge)
>>
>>
>> If you are simply concatenating several datasets, then multiple
>> pav:importedFrom statements would make sense. If further knowledge is
>> added, say by reasoning or calculation, then pav:derivedFrom would
>> make sense.
>>
>>
>>
>> Are you sure about this? I thought that pav:importedFrom meant that the
>> derived dataset was essentially the same data as the original modulo data
>> format, i.e. it is a 1:1 relationship (as near as possible). To my mind,
>> this would mean that you could not have more than one pav:importedFrom
>> statement for a dataset.
>>
>>
>>
>> Alasdair
>>
>>
>>
>>
>>
>> Now if you want to detail exactly how those datasets have been
>> combined, I think you are right that would make sense to break down
>> the derivation using PROV statements, e.g. a series of activities,
>> generation and usage. How to describe these activities (e.g.
>> subclasses and properties) will be specific to each case.
>>
>>
>>
>> If the process you generated the dataset with somewhat resembles a
>> dataflow, you might be interested in the wfprov and wfdesc ontologies
>> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
>> which can be related to a common workflow description (e.g. a
>> prov:Plan):
>>
>> http://purl.org/wf4ever/model#wfprov
>>
>> OPMW is a similar approach:
>> http://www.opmw.org/model/OPMW/
>>
>>
>>
>> On 4 August 2014 17:44, Michael Miller
>> <Michael.Miller@systemsbiology.org> wrote:
>>
>> hi all,
>>
>>
>>
>> as you are all undoubtedly aware, a major, if not the major, TCGA dataset
>> use case revolves around taking the 3rd level data from the TCGA dcc
>> repository and doing analysis, producing 4th level data such as clusters,
>> pca, etc.
>> one of the things we do here at ISB is produce an intermediate data step
>> that combines the different platforms (mRNA, miRNA, RPPA, METH, etc.) into
>> one feature matrix so that the analysis can use all the platforms
>> together.
>> the Broad firehose pipeline also has this as one of its outputs.
>>
>>
>>
>> as some of my comments allude to, it doesn't seem that Dataset
>> Descriptions
>> deal with the use case of describing a dataset that is specifically
>> derived
>> from other datasets, which is what we are looking at ways we might
>> describe
>> our data when we publish it.  i took a look at PROV and, i've got a bit
>> more
>> mapping to do, but it seems like PROV provides the terms we need.
>>
>>
>>
>> but this has led me to ask the question of what is the relation of
>> Dataset
>> Descriptions and PROV and how should they/should they be used together?  i
>> think the above use case is quite common for datasets being published so
>> might deserve a discussion in the Dataset Descriptions note
>>
>>
>>
>> cheers,
>>
>> michael
>>
>>
>>
>> Michael Miller
>>
>> Software Engineer
>>
>> Institute for Systems Biology
>>
>>
>>
>>
>>
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> Sent: Thursday, July 31, 2014 3:43 PM
>> To: Michael Miller
>> Cc: w3c semweb hcls
>> Subject: Re: hcls dataset description comments
>>
>>
>>
>> Hi!
>>
>>
>>
>>  I will ponder the edit suggested in your first bullet point. I am
>> not sure at the moment if it would have wider implications.
>>
>>
>>
>>  You are right that the use cases were written by the groups themselves. I
>> do not know how to improve the use cases without rewriting them, which
>> might
>> not be agreeable to all parties involved. C'est la vie.
>>
>>
>>
>>  The role of Data Catalogs should then be discussed during our next conf
>> call. Thanks for highlighting that this might be unclear to readers.
>>
>>
>>
>> Kim
>>
>>
>>
>>
>>
>>
>>
>> On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org>
>> wrote:
>>
>> hi kim,
>>
>>
>>
>> 'For other edits, please fork the repository and create a pull request
>> with
>> your changes'
>>
>>
>>
>> of the four general comments, the first is really the only 'edit', i
>> didn't
>> put it in the minor edits because it had some implications that the group
>> might not agree with.  if the change makes sense, it might be easier for
>> you
>> to make the edit.
>>
>>
>>
>> the other three are general comments and i'm not sure what the solution
>> might be, they were mainly points, as a reader, that weren't clear or
>> were a
>> bit confusing.  these were all from the use case section so were probably
>> written by the groups themselves?  if i have permission, i can certainly
>> add
>> them as issues.
>>
>>
>>
>> cheers,
>>
>> michael
>>
>>
>>
>> Michael Miller
>>
>> Software Engineer
>>
>> Institute for Systems Biology
>>
>>
>>
>>
>>
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> Sent: Tuesday, July 29, 2014 11:56 AM
>>
>>
>> To: Michael Miller
>> Cc: w3c semweb hcls
>> Subject: Re: hcls dataset description comments
>>
>>
>>
>> Hi!
>>
>>
>>
>>  Thanks for the suggestions. I have incorporated your minor edits.
>> Unbelievable how those slipped through after so many re-readings still.
>>
>>
>>
>>  For other edits, please fork the repository and create a pull request
>> with
>> your changes.
>>
>>
>>
>> Best wishes,
>>
>>
>>
>> Kim
>>
>>
>>
>>
>>
>> On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org>
>> wrote:
>>
>> hi kim,
>>
>>
>>
>> thanks for the pointer, i've updated my comments based on this newer draft
>> below.  many fewer and i especially like the complete example in 10.1!
>>
>>
>>
>> cheers,
>>
>> michael
>>
>>
>>
>> Michael Miller
>>
>> Software Engineer
>>
>> Institute for Systems Biology
>>
>>
>>
>> general comments:
>>
>> ·         s4.4 'Dataset Linking': might mention also that datasets are
>> derived from other datasets?
>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
>> the
>> creation of a data mashup ' --> 'A dataset may incorporate, be derived
>> from,
>> or link to, data in other datasets, e.g. in the analysis of original
>> datasets or in the creation of a data mashup '
>>
>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>> individual organizations but three (8.4, 8.8, 8.9) have subsections for
>> different organizations.  maybe organize so all top level sections define
>> a
>> type of organization with subsections beneath or make all top-level?
>>
>> ·         s8: some of the use cases could be more focused on how this note
>> will help them (8.5-8.7)
>>
>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear to
>> me
>> how this note is relevant to them
>>
>> our use case questions:
>>
>> ·         how to reference 3rd party datasets that aren't described by
>> this
>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom' with
>> the IRI being the URL into the repository?
>>
>> ·         we have a lot of intermediary files that we won't publish, the
>> software specified in creating our published datasets from its sources
>> form
>> a (branching) workflow with the input being from the previous step(s) in
>> the
>> workflow.  how best to represent this?  this note doesn't seem to cover
>> how
>> the dataset is created so any recommendations?
>>
>> minor edits:
>>
>> ·         there are two s6.2.3 sections
>>
>> ·         s8.8.1: '... what period it is updated. To know when to...'
>> should
>> be '...what period it is updated to know when to...'?
>>
>>
>>
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> Sent: Tuesday, July 22, 2014 3:43 PM
>> To: Michael Miller
>> Cc: w3c semweb hcls
>> Subject: Re: hcls dataset description comments
>>
>>
>>
>> Hello,
>>
>>
>>
>>  I believe you were looking at an old document. There is currently only
>> one
>> Figure in the note.
>>
>>
>>
>>  Please check the actual draft at:
>>
>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>>
>>
>>
>> Best wishes,
>>
>>
>>
>> Kim
>>
>>
>>
>>
>>
>> On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org>
>> wrote:
>>
>> hi all,
>>
>>
>>
>> tremendous work, very clear and well-written.  my group at ISB, the
>> Shmulevich lab is looking to provide provenance for the analysis datasets
>> we
>> are producing for TCGA.  we're not sure if we'll be able to 'go all the
>> way'
>> but we want to make sure we have at hand all the information that we
>> could,
>> at least in theory, be compliant.  as long as i was reading the document,
>> below are some notes.
>>
>>
>>
>> general comments:
>>
>> ·         s4.4 'Dataset Linking': might mention also that datasets are
>> derived from other datasets?
>> 'A dataset may incorporate, or link to, data in other datasets, e.g. in
>> the
>> creation of a data mashup ' --> 'A dataset may incorporate, be derived
>> from,
>> or link to, data in other datasets, e.g. in the analysis of original
>> datasets or in the creation of a data mashup '
>>
>> ·         the chembl example in s5 is not compliant to the property table
>> below, it probably is only supposed to show the relationship of the three
>> terms but that could be clarified
>>
>> ·         s6.2.12 could use the example filled in
>>
>> ·         6.3.2: not sure what an 'X level description' is
>>
>> ·         s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>> individual organizations but three (8.4, 8.8, 8.9) have subsections for
>> different organizations.  maybe organize so all top level sections define
>> a
>> type of organization with subsections beneath or make all top-level?
>>
>> ·         s8: many of the use cases could be more focused on how this note
>> will help them
>>
>> ·         s8.9: how do Data Catalogs fit into this note?  wasn't clear to
>> me
>> how this note is relevant to them
>>
>> ·         would be nice to have a 'complete' example put together, maybe
>> based on chembl?
>>
>>
>>
>> our use case questions:
>>
>> ·         how to reference 3rd party datasets that aren't described by
>> this
>> standard, i.e. TCGA data from the DCC, simply use 'pav:retrievedFrom' with
>> the IRI being the URL into the repository?
>>
>> ·         we have a lot of intermediary files that we won't publish, the
>> software specified in creating our published datasets from its sources
>> form
>> a (branching) workflow with the input being from the previous step(s) in
>> the
>> workflow.  how best to represent this?  this note doesn't seem to cover
>> how
>> the dataset is created so any recommendations?
>>
>>
>>
>> text issues:
>>
>> ·         Figure 1: 'Overview of dataset description level metadata
>> profiles
>> and their relationships': reference not resolved, image doesn't show
>>
>> ·         Figure 2: 'Improve diagram. Multiple appearance of
>> concepts/description levels unclear.': reference not resolved, image
>> doesn't
>> show.  add actual label
>>
>>
>>
>> minor edits:
>>
>> ·         bottom of s.3: 'placeholde' should be 'placeholder'
>>
>> ·         use straight quotes rather than slant quotes in s6.2.2 example
>> (and elsewhere)?
>>
>> ·         the text runs out of the box in s6.2.3 'Description'
>>
>> ·         s6.2.3: 'Dates of Creation and Issuance': 'state the date the
>> dataset was generated using dct:created and/or the date the dataset was
>> made
>> public using dct:created' should be 'state the date the dataset was
>> generated using dct:created and/or the date the dataset was made public
>> using dct:issued'?
>>
>> ·         there are two s6.2.3 sections
>>
>> ·         s6.2.4: 'Creation: ... The date of authorship' should be '...The
>> date of creation' and 'Curation:... The date of authorship' should be
>> '...The date of curation'?
>>
>> ·         s8.5: the author list has end parenthesis without beginning
>> parenthesis
>>
>> ·         s8.8.1: '... what period it is updated. To know when to...'
>> should
>> be '...what period it is updated to know when to...'
>>
>>
>>
>> cheers,
>>
>> michael
>>
>>
>>
>> Michael Miller
>>
>> Software Engineer
>>
>> Institute for Systems Biology
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
>>
>>
>>
>> Alasdair J G Gray
>>
>> Lecturer in Computer Science, Heriot-Watt University, UK.
>>
>> Email: A.J.G.Gray@hw.ac.uk
>>
>> Web: http://www.alasdairjggray.co.uk <http://www.macs.hw.ac.uk/~ajg33>
>>
>> ORCID: http://orcid.org/0000-0002-5711-4872
>>
>> Telephone: +44 131 451 3429
>>
>> Twitter: @gray_alasdair
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
>
Received on Thursday, 28 August 2014 00:49:06 UTC