Re: inDataset (was Notes from today's meeting) from Simon Jupp on 2013-06-04 (public-semweb-lifesci@w3.org from June 2013)

From: Simon Jupp <jupp@ebi.ac.uk>
Date: Tue, 4 Jun 2013 15:43:05 +0100
To: Michel Dumontier <michel.dumontier@gmail.com>
Cc: Jerven Bolleman <me@jerven.eu>, Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
Message-Id: <2AED0B73-7562-48FA-B6C4-2420E7FB0E2D@ebi.ac.uk>
We load the entire dataset into a named graph, then use the graph URI for provenance. I'm with Jerven in that we don't want to add void:inDataset links in the source data for the many millions of resources we index. However, I have no problem including this information when you request information about a resource from our server. Part of the problem is that most linked data browsers work directly off a basic DESCRIBE query. If you have more control over how you serve up the data, then it's very easy to add additional information like a void:inDataset triple.
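
A minimal sketch of that serve-time approach, in plain Python rather than against any particular triple store. Everything named here is an assumption made for illustration: the example.org URIs, the GRAPH_TO_DATASET mapping and the describe() helper are hypothetical and are not how any real server is implemented. The stored data carries no void:inDataset triples; the link is derived from the named graph only when a resource is requested.

```python
# Quads are (subject, predicate, object, graph). All example.org URIs are hypothetical.
VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from a named graph URI to its VoID dataset description.
GRAPH_TO_DATASET = {
    "http://example.org/graph/uniprot": "http://example.org/dataset/uniprot",
}

# The stored data: one quad, and no void:inDataset statements at all.
QUADS = [
    ("http://example.org/uniprot/P12345",
     RDF_TYPE,
     "http://example.org/core/Protein",
     "http://example.org/graph/uniprot"),
]


def describe(resource, quads, graph_to_dataset):
    """Return the stored triples about `resource`, plus one void:inDataset
    triple per dataset whose named graph mentions it as a subject."""
    triples = []
    datasets = set()
    for s, p, o, g in quads:
        if s == resource:
            triples.append((s, p, o))
            if g in graph_to_dataset:
                datasets.add(graph_to_dataset[g])
    # Backlinks are generated at request time, never written to the store.
    for dataset in sorted(datasets):
        triples.append((resource, VOID_IN_DATASET, dataset))
    return triples


if __name__ == "__main__":
    for triple in describe("http://example.org/uniprot/P12345", QUADS, GRAPH_TO_DATASET):
        print(triple)
```

The point of the design is that the data dump and the SPARQL endpoint stay free of per-resource backlinks, while the linked data interface can still point each resource at its dataset description.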
Simon 

On 4 Jun 2013, at 14:47, Michel Dumontier wrote:

> Hi Jerven,
> 
>  First: Bio2RDF's current datasets are listed here:
> 
> http://bio2rdf.org/datasets
> 
> as i mentioned, and as has been presented [1], those that are in Bio2RDF release 2 have provenance, e.g.
> http://bio2rdf.org/geneid:123
> 
> (we did not provide updates to uniprot, genbank, refseq, pubmed in release 2, and they won't have provenance associated with them).
> 
> 
> Yes, I agree that the provenance of an assertion is more interesting, and we are working towards implementing this for Bio2RDF. But if you were worried about adding 1.6 billion relations, you'll be more worried about adding 8 billion more to annotate each triple.
> 
> 
> m.
> 
> [1] http://www.slideshare.net/micheldumontier/bio2rdf-release-2-improved-coverage-interoperability-and-provenance-of-linked-data-for-the-life-sciences
> 
> 
> 
> 
> On Tue, Jun 4, 2013 at 3:23 PM, Jerven Bolleman <me@jerven.eu> wrote:
> 
> 
> 
> On Tue, Jun 4, 2013 at 3:05 PM, Michel Dumontier <michel.dumontier@gmail.com> wrote:
> 
> 
> 
> On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:
> 
> 
> 
> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk> wrote:
> 
> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com> wrote:
> 
>> About void:inDataset: I personally don't like it. I suspect it would cost me a 13% growth in triple size for negligible benefit. This also means that the dataset description starts to affect the data. I could present this only in the REST / linked data interface and not in the SPARQL endpoint, but I am worried that I cannot put it into the FTP data dump RDF, as the data item concept does not map 1:1 onto a set of triples that are atomic.
>> 
>> 
>> i'm not sure that i completely understand your objection. the primary use of void:inDataset is to link data items to the dataset description, and as such supports linked data applications without looking at the graph for a potential, but un-guaranteed provenance description. Using void:inDataset is normal practice in the RDF / linked data community. It would be strange to not include it in any RDF dataset if you have the dataset description.
>> 
>> http://www.w3.org/TR/void/#backlinks
>> 
>>  
>> e.g. someone can use just the UniProtKB sequences. Once they have done that, is it still the same dataset that I published? I don't think so. This means UniProt end users need to be careful to remove more triples, which is why I disagree with Alasdair's call for MUST.
>> 
>> 
>> if one wanted to know which version/issue of uniprot that the sequences came from, it would be necessary to provide access to the dataset description. if the void:inDataset predicate is used, the user need not even retrieve that to store locally, as you should provide resolution services to those dataset descriptions.
> 
> I also do not follow your objection. If you have created a file that contains a subset of the data, then you can declare this to be a subset of the parent-versioned-formatted dataset, ideally with some way of distinguishing the content of the dataset. 
> I will try to explain my objections. The first is that the dataset is a set of triples, while void:inDataset is a predicate on a single resource/entity/subject.
> So as I have 1.4 billion entities I would add 1.4 billion void:inDataset triples. Which to me seems like the incorrect thing to do.
> 
>  we would like to know the provenance of every data item. if you define 1.4 billion entities, then you should provide 1.4 billion links to their provenance.
>  
> Well, you say you should only add them to the "important" resources, and then we are down to 100 million of these statements.
> Yet for users who use slices of our data these void:inDataset triples are annoying/misleading especially if they merge them with their own sources.
> 
> e.g.
> 
> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
> uniprot:P12345 a up:Protein ;
>                        void:inDataset dataset:uniprot .
> dataset:uniprot dcterms:license cc:by-sa-v3 .
> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
> secretdrugchemical:1000 void:inDataset top:secret .
> 
> Given these triples what is the license for knowledge about secretdrugchemical:1000 activating uniprot:P12345? 
> 
> The dataset description is about a set of data, not single triples so single back links seem to me to be the incorrect solution?
> 
> 
> the focus on the assertion(s) is perfectly fine. several mechanisms have now been proposed; nanopublications [1], micropublications [2] and ovopubs [3]
> 
> [1] http://www.w3.org/wiki/images/c/c0/HCLSIG$$SWANSIOC$$Actions$$RhetoricalStructure$$meetings$$20100215$cwa-anatomy-nanopub-v3.pdf
> [2] http://arxiv.org/abs/1305.3506
> [3] http://arxiv.org/abs/1305.6800
> 
>  
>  
> From all the scenarios I have encountered, scientists (not just in the healthcare and life sciences) care about where their data has come from and what version it is. As such, we need some way to allow for the linking of data back to the description of the data.
> Of course I don't disagree with the usecase. I disagree with the chosen solution because it is on the wrong level of granularity. 
> 
> 
> it's not wrong, it's just at a level that you don't want to provide.  We do it in Bio2RDF, and now each of our data items from Release 2 are linked accordingly.
> No, you don't; see http://bio2rdf.org/page/beilstein:1900390
> 
> Also you put it on the entity/subject while what is interesting is the provenance of the triple.
> 
> The provenance is on the triple in your linked papers, not in the Bio2RDF case or the void:inDataset case.
> 
> Regards,
> Jerven 
> 
> m.
>  
> 
> Alasdair
> 
> Dr Alasdair J G Gray
> Research Associate
> Alasdair.Gray@manchester.ac.uk
> +44 161 275 0145
> 
> http://www.cs.man.ac.uk/~graya/
> 
> 
> 
> 
> 
> -- 
> Jerven Bolleman
> me@jerven.eu
> 
> 
> 
> -- 
> Michel Dumontier
> Associate Professor of Bioinformatics, Carleton University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
> http://dumontierlab.com
> 
Received on Tuesday, 4 June 2013 14:43:39 UTC