- From: Joachim Baran <joachim.baran@gmail.com>
- Date: Wed, 27 Aug 2014 17:48:37 -0700
- To: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
- Cc: Michael Miller <Michael.Miller@systemsbiology.org>, Alasdair J G Gray <A.J.G.Gray@hw.ac.uk>, w3c semweb hcls <public-semweb-lifesci@w3.org>, "pav-ontology@googlegroups.com" <pav-ontology@googlegroups.com>
- Message-ID: <CAObSwHX=XBm_UB1kjJUE3z9okC+8_iT-MAdkKPaQg7OGd+sL9A@mail.gmail.com>
I disagree.

In my opinion it is subjective to use "import" when some unspecified
functions are applied to the data that is being incorporated in a new
dataset, but "derived" at other times. I think there is no objective
discriminator that could distinguish between clever and non-clever
functions.

If X is the source dataset and Y is the new dataset, then I would use
"import" if X is a subset of Y. I would use "derived" otherwise.

Kim

On 27 August 2014 16:47, Stian Soiland-Reyes
<soiland-reyes@cs.manchester.ac.uk> wrote:

> That is a nice use-case.
>
> If you make a new file by selecting out a column from a single source
> file, that is a narrow case of pav:importedFrom in my opinion. This is why
> we added in http://purl.org/pav/html#http://purl.org/pav/importedFrom the
> phrase:
>
> > The imported resource does not have to be complete, but should be
> > consistent with the knowledge conveyed by the original resource.
>
> e.g. if you extract a list of all the names and email addresses from an
> address book (but skipping phone and fax numbers) then that is still a case
> of import (which is not complete, but consistent).
>
> However if you do anything "clever", like importing only those addresses
> that are in the UK, then you are deriving new information and cannot use
> pav:importedFrom (pav:derivedFrom would be appropriate).
>
> In your case you are turning it around across multiple sources.. so if I
> understand it right, it is as if your source is a set of vcard files, one
> per person, and then you create new files, one that has all the email
> addresses, another one with all the names, etc. I don't think it should
> matter if the 'subject' is in the column or the row.
>
> The new resource is one that is expressing a collection of email
> addresses. So if we are to express a pav:importedFrom here, it should
> better go to some collection of vcards, rather than multiple imports of
> many vcard files (where did that list of vcard files come from?
> Who decides which is in or out?)
>
> So I think it sounds like it would be better to express that you are
> importing the collection/aggregation of those files (e.g. an
> ore:Aggregation) - rather than having a hundred import edges. Then you have
> a resource to describe where that selection of files came from rather than
> them seemingly randomly meeting up in that import. :-)
>
> e.g.
>
>     # Prefix declarations added so that the snippet parses on its own.
>     @prefix pav:  <http://purl.org/pav/> .
>     @prefix ore:  <http://www.openarchives.org/ore/terms/> .
>     @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>     @prefix prov: <http://www.w3.org/ns/prov#> .
>     @base <http://www.example.com/> .
>
>     <merged.csv> pav:importedFrom <sourcefiles/> ;
>         pav:importedBy <http://orcid.org/0000-0001-9842-9718> ;
>         pav:createdWith </merger-tool> .
>
>     <sourcefiles/> a ore:Aggregation ;
>         ore:aggregates <sourcefiles/file1.csv>, <sourcefiles/file2.csv>,
>             <sourcefiles/file3.csv> ;
>         pav:createdBy <http://orcid.org/0000-0001-9842-9718> ;
>         pav:derivedFrom <https://tcga-data.nci.nih.gov/tcga/is-there-a-query-link> ;
>         pav:providedBy <https://tcga-data.nci.nih.gov/tcga/> .
>
>     <sourcefiles/file1.csv>
>         pav:retrievedFrom <https://tcga-data.nci.nih.gov/tcga/was-there-a-download-link> ;
>         pav:createdWith <https://tcga-data.nci.nih.gov/tcga/> .
>
>     <http://orcid.org/0000-0001-9842-9718> a foaf:Person, prov:Person ;
>         foaf:name "Stian Soiland-Reyes" .
>
> For argument's sake I have stayed with PAV properties here as I think it
> makes it rather clear. The above says that the <merged.csv> conveys the
> same knowledge as the <sourcefiles/> aggregation from which its content is
> imported. The CSV representation was made with /merger-tool. Stian
> initiated the import - clicked the button so to speak (perhaps set some
> parameters) - but did not (according to these statements alone) convey any
> knowledge into the CSV.
>
> The ORE aggregation (but not its files) was created by Stian. (But I
> didn't author/contribute to the aggregation, unless I selected the files.)
> It contains 3 files.
> If there is a query link, then we can give its
> pav:derivedFrom (or even pav:importedFrom) - but anyway we can give at
> least pav:providedBy to indicate the original publisher of this collection
> (e.g. the list was on the result page).
>
> Each of the files has been retrieved - now if there is not a
> download-link from tcga this gets a bit tricky, but again pav:providedBy
> can be a last resort to at least indicate the service. Here we use
> pav:createdWith - I don't know about the provenance of those files, are
> they uploaded verbatim to tcga (just pav:providedBy) or created on demand
> based on the query (pav:createdWith)?
>
> On 21 August 2014 16:21, Michael Miller
> <Michael.Miller@systemsbiology.org> wrote:
>
>> hi stian and alasdair,
>>
>> there's a real use case along these lines that is part of the broad's
>> TCGA firehose pipeline[1]. for each tumor type and for each of the
>> platforms (gene expression, miRNA, methylation, etc.) the data is stored
>> per subject[2]. part of the broad pipeline 'merges' all the values from
>> the subject files into one file per column from the set of original files
>> per subject. so for miRNA[3], there is a file that merges all the raw
>> values, a file that merges all the RPKM values, and a file that merges the
>> values from the cross-mapping column. the gene names are not duplicated,
>> they are the row headers. so no values are changed, just a bit of modest
>> reformatting and filtering.
>>
>> cheers,
>> michael
>>
>> Michael Miller
>> Software Engineer
>> Institute for Systems Biology
>>
>> [1] https://confluence.broadinstitute.org/display/GDAC/Dashboard-Stddata
>> [2] https://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp and
>>     https://tcga-data.nci.nih.gov/ccg-data-web/searchForTCGAData.htm
>> [3] http://gdac.broadinstitute.org/runs/stddata__2014_07_15/data/STAD/20140715/
>>     and the file
>>     gdac.broadinstitute.org_STAD.Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2014071500.0.0.tar.gz
>>     <http://gdac.broadinstitute.org/runs/stddata__2014_07_15/data/STAD/20140715/gdac.broadinstitute.org_STAD.Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3.2014071500.0.0.tar.gz>
>>
>> From: stian@mygrid.org.uk [mailto:stian@mygrid.org.uk] On Behalf Of
>>     Stian Soiland-Reyes
>> Sent: Wednesday, August 20, 2014 5:46 PM
>> To: Alasdair J G Gray
>> Cc: w3c semweb hcls; Joachim Baran; Michael Miller
>> Subject: Re: hcls dataset description comments--Dataset Descriptions
>>     vs. PROV
>>
>> Hi, sorry for not replying earlier.
>>
>> I think it depends on the nature of the concatenation. pav:importedFrom
>> with multiple resources would only make sense for 'pure' concatenation
>> where no additional knowledge is conceived, and the content of both sources
>> can be said to be preserved. So for instance, if two CSV files are simply
>> merged by adding the new rows at the bottom, it could work with multiple
>> pav:importedFrom. Anything more clever that goes beyond just changing
>> formats, like matching up foreign keys or heuristic matching on compound
>> names would mean you are adding new content/knowledge and must use
>> pav:derivedFrom instead. Concatenating RDF graphs is a grey area here, as
>> nodes with the same URI automatically merge.
>> Perhaps it also depends on
>> your reason for adding the second source - was that based on inspecting the
>> first resource (e.g. following links) or just something you always do
>> blindly?
>>
>> So given that pav:importedFrom with multiple sources is a thin line that
>> can be hard to explain, perhaps it's better to just leave the above as an
>> unexplored edge-case, and rather recommend using pav:derivedFrom (or the
>> non-specific prov:wasDerivedFrom) when there are multiple sources.
>>
>> On 11 Aug 2014 15:53, "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk> wrote:
>>
>> Hi Stian,
>>
>> On 5 Aug 2014, at 13:37, Stian Soiland-Reyes
>> <soiland-reyes@CS.MANCHESTER.AC.UK> wrote:
>>
>> Just some inputs:
>>
>> PROV defines prov:wasDerivedFrom which in a broad sense describes such a
>> relationship between datasets. However you do not know anything more
>> about what kind of derivation we are talking about.
>>
>> In PAV we found the need to specialize three types of derivation:
>>
>> pav:retrievedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/retrievedFrom
>> .. a byte-for-byte download
>>
>> pav:importedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/importedFrom
>> .. a somewhat equivalent form of the source, but after some kind of
>> transformation or selection (e.g. CSV -> XML)
>>
>> pav:derivedFrom -
>> http://purl.org/pav/html#http://purl.org/pav/derivedFrom
>> .. when the new resource has been further refined or modified
>> (somewhat adding additional knowledge)
>>
>> If you are simply concatenating several datasets, then multiple
>> pav:importedFrom statements would make sense. If further knowledge is
>> added, say by reasoning or calculation, then pav:derivedFrom would
>> make sense.
>>
>> Are you sure about this? I thought that pav:importedFrom meant that the
>> derived dataset was essentially the same data as the original modulo data
>> format, i.e. it is a 1:1 relationship (as near as possible).
>> To my mind,
>> this would mean that you could not have more than one pav:importedFrom
>> statement for a dataset.
>>
>> Alasdair
>>
>> Now if you want to detail exactly how those datasets have been
>> combined, I think you are right that would make sense to break down
>> the derivation using PROV statements, e.g. a series of activities,
>> generation and usage. How to describe these activities (e.g.
>> subclasses and properties) will be specific to each case.
>>
>> If the process you generated the dataset with somewhat resembles a
>> dataflow, you might be interested in the wfprov and wfdesc ontologies
>> that specialize PROV to define a WorkflowRun of steps of ProcessRuns,
>> which can be related to a common workflow description (e.g. a
>> prov:Plan):
>>
>> http://purl.org/wf4ever/model#wfprov
>>
>> OPMW is a similar approach:
>> http://www.opmw.org/model/OPMW/
>>
>> On 4 August 2014 17:44, Michael Miller
>> <Michael.Miller@systemsbiology.org> wrote:
>>
>> hi all,
>>
>> as you are all undoubtedly aware, a major, if not the major TCGA dataset
>> use cases revolve around taking the 3rd level data from the TCGA dcc
>> repository and doing analysis, producing 4th level data such as clusters,
>> pca, etc. one of the things we do here at ISB is produce an intermediate
>> data step that combines the different platforms (mRNA, miRNA, RPPA, METH,
>> etc.) into one feature matrix so that the analysis can use all the
>> platforms together. the Broad firehose pipeline also has this as one of
>> its outputs.
>>
>> as some of my comments allude to, it doesn't seem that Dataset
>> Descriptions deal with the use case of describing a dataset that is
>> specifically derived from other datasets, which is what we are looking at
>> ways we might describe our data when we publish it. i took a look at PROV
>> and, i've got a bit more mapping to do, but it seems like PROV provides
>> the terms we need.
>>
>> but this has led me to ask the question of what is the relation of
>> Dataset Descriptions and PROV and how should they/should they be used
>> together? i think the above use case is quite common for datasets being
>> published so might deserve a discussion in the Dataset Descriptions note
>>
>> cheers,
>> michael
>>
>> Michael Miller
>> Software Engineer
>> Institute for Systems Biology
>>
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> Sent: Thursday, July 31, 2014 3:43 PM
>> To: Michael Miller
>> Cc: w3c semweb hcls
>> Subject: Re: hcls dataset description comments
>>
>> Hi!
>>
>> I will ponder about your edit suggestion of your first bullet point. I am
>> not sure at the moment if it would have wider implications.
>>
>> You are right that the use cases were written by the groups themselves. I
>> do not know how to improve the use cases without rewriting them, which
>> might not be agreeable to all parties involved. C'est la vie.
>>
>> The role of Data Catalogs should then be discussed during our next conf
>> call. Thanks for highlighting that this might be unclear to readers.
>>
>> Kim
>>
>> On 30 July 2014 10:41, Michael Miller <Michael.Miller@systemsbiology.org>
>> wrote:
>>
>> hi kim,
>>
>> 'For other edits, please fork the repository and create a pull request
>> with your changes'
>>
>> of the four general comments, the first is really the only 'edit', i
>> didn't put it in the minor edits because it had some implications that the
>> group might not agree with. if the change makes sense, it might be easier
>> for you to make the edit.
>>
>> the other three are general comments and i'm not sure what the solution
>> might be, they were mainly points, as a reader, that weren't clear or
>> were a bit confusing. these were all from the use case section so were
>> probably written by the groups themselves?
>> if i have permission, i can certainly add them as issues.
>>
>> cheers,
>> michael
>>
>> Michael Miller
>> Software Engineer
>> Institute for Systems Biology
>>
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> Sent: Tuesday, July 29, 2014 11:56 AM
>> To: Michael Miller
>> Cc: w3c semweb hcls
>> Subject: Re: hcls dataset description comments
>>
>> Hi!
>>
>> Thanks for the suggestions. I have incorporated your minor edits.
>> Unbelievable how those slipped through after so many re-readings still.
>>
>> For other edits, please fork the repository and create a pull request
>> with your changes.
>>
>> Best wishes,
>>
>> Kim
>>
>> On 23 July 2014 08:53, Michael Miller <Michael.Miller@systemsbiology.org>
>> wrote:
>>
>> hi kim,
>>
>> thanks for the pointer, i've updated my comments based on this newer draft
>> below. many fewer and i especially like the complete example in 10.1!
>>
>> cheers,
>> michael
>>
>> Michael Miller
>> Software Engineer
>> Institute for Systems Biology
>>
>> general comments:
>>
>> · s4.4 'Dataset Linking': might mention also that datasets are
>>   derived from other datasets?
>>   'A dataset may incorporate, or link to, data in other datasets, e.g. in
>>   the creation of a data mashup ' --> 'A dataset may incorporate, be
>>   derived from, or link to, data in other datasets, e.g. in the analysis
>>   of original datasets or in the creation of a data mashup '
>>
>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>>   individual organizations but three (8.4, 8.8, 8.9) have subsections for
>>   different organizations. maybe organize so all top level sections
>>   define a type of organization with subsections beneath or make all
>>   top-level?
>>
>> · s8: some of the use cases could be more focused on how this note
>>   will help them (8.5-8.7)
>>
>> · s8.9: how do Data Catalogs fit into this note?
>>   wasn't clear to me how this note is relevant to them
>>
>> our use case questions:
>>
>> · how to reference 3rd party datasets that aren't described by
>>   this standard, i.e. TCGA data from the DCC, simply use
>>   'pav:retrievedFrom' with the IRI being the URL into the repository?
>>
>> · we have a lot of intermediary files that we won't publish, the
>>   software specified in creating our published datasets from its sources
>>   form a (branching) workflow with the input being from the previous
>>   step(s) in the workflow. how best to represent this? this note doesn't
>>   seem to cover how the dataset is created so any recommendations?
>>
>> minor edits:
>>
>> · there are two s6.2.3 sections
>>
>> · s8.8.1: '... what period it is updated. To know when to...'
>>   should be '...what period it is updated to know when to...'?
>>
>> From: Joachim Baran [mailto:joachim.baran@gmail.com]
>> Sent: Tuesday, July 22, 2014 3:43 PM
>> To: Michael Miller
>> Cc: w3c semweb hcls
>> Subject: Re: hcls dataset description comments
>>
>> Hello,
>>
>> I believe you were looking at an old document. There is currently only
>> one Figure in the note.
>>
>> Please check the actual draft at:
>>
>> http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html
>>
>> Best wishes,
>>
>> Kim
>>
>> On 22 July 2014 15:36, Michael Miller <Michael.Miller@systemsbiology.org>
>> wrote:
>>
>> hi all,
>>
>> tremendous work, very clear and well-written. my group at ISB, the
>> Shmulevich lab is looking to provide provenance for the analysis datasets
>> we are producing for TCGA. we're not sure if we'll be able to 'go all the
>> way' but we want to make sure we have at hand all the information that we
>> could, at least in theory, be compliant. as long as i was reading the
>> document, below are some notes.
>>
>> general comments:
>>
>> · s4.4 'Dataset Linking': might mention also that datasets are
>>   derived from other datasets?
>>   'A dataset may incorporate, or link to, data in other datasets, e.g. in
>>   the creation of a data mashup ' --> 'A dataset may incorporate, be
>>   derived from, or link to, data in other datasets, e.g. in the analysis
>>   of original datasets or in the creation of a data mashup '
>>
>> · the chembl example in s5 is not compliant to the property table
>>   below, it probably is only supposed to show the relationship of the
>>   three terms but that could be clarified
>>
>> · s6.2.12 could use the example filled in
>>
>> · 6.3.2: not sure what an 'X level description' is
>>
>> · s8: odd that some of the top sections (8.1-8.3,8.5-8.7) are
>>   individual organizations but three (8.4, 8.8, 8.9) have subsections for
>>   different organizations. maybe organize so all top level sections
>>   define a type of organization with subsections beneath or make all
>>   top-level?
>>
>> · s8: many of the use cases could be more focused on how this note
>>   will help them
>>
>> · s8.9: how do Data Catalogs fit into this note? wasn't clear to
>>   me how this note is relevant to them
>>
>> · would be nice to have a 'complete' example put together, maybe
>>   based on chembl?
>>
>> our use case questions:
>>
>> · how to reference 3rd party datasets that aren't described by
>>   this standard, i.e. TCGA data from the DCC, simply use
>>   'pav:retrievedFrom' with the IRI being the URL into the repository?
>>
>> · we have a lot of intermediary files that we won't publish, the
>>   software specified in creating our published datasets from its sources
>>   form a (branching) workflow with the input being from the previous
>>   step(s) in the workflow. how best to represent this? this note doesn't
>>   seem to cover how the dataset is created so any recommendations?
>>
>> text issues:
>>
>> · Figure 1: 'Overview of dataset description level metadata
>>   profiles and their relationships': reference not resolved, image
>>   doesn't show
>>
>> · Figure 2: 'Improve diagram. Multiple appearance of
>>   concepts/description levels unclear.': reference not resolved, image
>>   doesn't show. add actual label
>>
>> minor edits:
>>
>> · bottom of s.3: 'placeholde' should be 'placeholder'
>>
>> · use straight quotes rather than slant quotes in s6.2.2 example
>>   (and elsewhere)?
>>
>> · the text runs out of the box in s6.2.3 'Description'
>>
>> · s6.2.3: 'Dates of Creation and Issuance': 'state the date the
>>   dataset was generated using dct:created and/or the date the dataset was
>>   made public using dct:created' should be 'state the date the dataset
>>   was generated using dct:created and/or the date the dataset was made
>>   public using dct:issued'?
>>
>> · there are two s6.2.3 sections
>>
>> · s6.2.4: 'Creation: ... The date of authorship' should be '...The
>>   date of creation' and 'Curation:... The date of authorship' should be
>>   '...The date of curation'?
>>
>> · s8.5: the author list has end parenthesis without beginning
>>   parenthesis
>>
>> · s8.8.1: '... what period it is updated. To know when to...'
>>   should be '...what period it is updated to know when to...'
>>
>> cheers,
>> michael
>>
>> Michael Miller
>> Software Engineer
>> Institute for Systems Biology
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
>>
>> Alasdair J G Gray
>> Lecturer in Computer Science, Heriot-Watt University, UK.
>> Email: A.J.G.Gray@hw.ac.uk
>> Web: http://www.alasdairjggray.co.uk <http://www.macs.hw.ac.uk/~ajg33>
>> ORCID: http://orcid.org/0000-0002-5711-4872
>> Telephone: +44 131 451 3429
>> Twitter: @gray_alasdair
>>
>> ------------------------------
>>
>> Sunday Times Scottish University of the Year 2011-2013
>> Top in the UK for student experience
>> Fourth university in the UK and top in Scotland (National Student Survey
>> 2012)
>>
>> We invite research leaders and ambitious early career researchers to join
>> us in leading and driving research in key inter-disciplinary themes. Please
>> see www.hw.ac.uk/researchleaders for further information and how to
>> apply.
>>
>> Heriot-Watt University is a Scottish charity registered under charity
>> number SC000278.
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
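
The subset rule proposed at the top of this message ("import" if X is a
subset of Y, "derived" otherwise) can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: a dataset is modelled
as a collection of hashable records, and the function name pav_relation and
the sample address-book records are hypothetical, not part of PAV or of any
tool mentioned in the thread.

```python
from typing import Hashable, Iterable


def pav_relation(source: Iterable[Hashable], new: Iterable[Hashable]) -> str:
    """Pick a PAV property for `new` relative to `source` via the subset rule.

    Hypothetical helper: treating a dataset as a set of hashable records is
    an assumption made for this sketch only.
    """
    source_records = set(source)
    new_records = set(new)
    # "import" if X (the source) is a subset of Y (the new dataset).
    if source_records <= new_records:
        return "pav:importedFrom"
    # Anything else counts as "derived" under this rule.
    return "pav:derivedFrom"


# Made-up records: (name, country).
address_book = {("Ann", "UK"), ("Bob", "US")}
merged = address_book | {("Cy", "NO")}                  # source fully preserved
uk_only = {r for r in address_book if r[1] == "UK"}     # source filtered

print(pav_relation(address_book, merged))
print(pav_relation(address_book, uk_only))
```

Under this reading a partial extract, such as the names-and-emails example
in the quoted message, would also come out as "derived", because the source
is not contained in the result. That appears to be where the two positions
in the thread differ.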
Received on Thursday, 28 August 2014 00:49:06 UTC