Re: inDataset (was Notes from today's meeting) from Michel Dumontier on 2013-06-04 (public-semweb-lifesci@w3.org from June 2013)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Tue, 4 Jun 2013 15:47:04 +0200
To: Jerven Bolleman <me@jerven.eu>
Cc: Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
Message-ID: <CALcEXf5QTVETAiYgOxi_sUJ+EnZQFm0MGde=1sorDj_N6y=1Cw@mail.gmail.com>
Hi Jerven,

 First: Bio2RDF's current datasets are listed here:

http://bio2rdf.org/datasets

As I mentioned, and as has been presented [1], those that are in Bio2RDF
release 2 have provenance, e.g.
http://bio2rdf.org/geneid:123

(We did not provide updates to UniProt, GenBank, RefSeq, or PubMed in release
2, so they won't have provenance associated with them.)


Yes, I agree that the provenance of an assertion is more interesting, and
we are working towards implementing this for Bio2RDF. But if you were
worried about adding 1.6 billion relations, you'll be more worried about
adding 8 billion more to annotate each triple.
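
To put rough numbers on that comparison, here is a minimal sketch in Python.
The function name is made up for illustration, the counts are only the round
figures mentioned above, and it assumes exactly one provenance statement per
annotated triple, which is a simplification (real statement-level schemes
may need more).

```python
# Hypothetical helper, not part of Bio2RDF or any library.
def added_statements(n_entities: int, n_triples: int) -> dict:
    """Extra statements needed by each provenance strategy."""
    return {
        # one void:inDataset backlink per entity (subject)
        "per_entity": n_entities,
        # assumption: one provenance statement per asserted triple
        "per_triple": n_triples,
    }

# Round figures from this thread, not measured values.
extra = added_statements(n_entities=1_600_000_000, n_triples=8_000_000_000)

# How many times more statements triple-level annotation would add.
ratio = extra["per_triple"] / extra["per_entity"]
print(extra, ratio)
```

Under those assumptions the triple-level approach adds several times as many
statements as entity-level backlinks do, which is the trade-off I mean.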


m.

[1]
http://www.slideshare.net/micheldumontier/bio2rdf-release-2-improved-coverage-interoperability-and-provenance-of-linked-data-for-the-life-sciences




On Tue, Jun 4, 2013 at 3:23 PM, Jerven Bolleman <me@jerven.eu> wrote:

>
>
>
> On Tue, Jun 4, 2013 at 3:05 PM, Michel Dumontier <
> michel.dumontier@gmail.com> wrote:
>
>>
>>
>>
>> On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:
>>
>>>
>>>
>>>
>>> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <
>>> Alasdair.Gray@manchester.ac.uk> wrote:
>>>
>>>>
>>>> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com>
>>>> wrote:
>>>>
>>>> About void:inDataset I personally don't like it. I suspect it would
>>>>> cost me a 13% growth in triple size for negligible benefits. This also
>>>>> means that the dataset description starts to affect the data. Although I
>>>>> could only present this in the rest / linked data interface and not in the
>>>>> sparql endpoint. I am worried that I cannot put it into the FTP data dump
>>>>> RDF, as the data item concept does not map 1:1 onto a set of triples that
>>>>> are atomic.
>>>>>
>>>>>
>>>> i'm not sure that i completely understand your objection. the primary
>>>> use of void:inDataset is to link data items to the dataset description, and
>>>> as such supports linked data applications without looking at the graph for
>>>> a potential, but un-guaranteed provenance description. Using void:inDataset
>>>> is normal practice in the RDF / linked data community. It would be strange
>>>> to not include it in any RDF dataset if you have the dataset description.
>>>>
>>>> http://www.w3.org/TR/void/#backlinks
>>>>
>>>>
>>>>
>>>>> e.g. someone can use just the UniProtKB sequences. Once they have done
>>>>> that, is it still the same dataset that I published? I don't think so.
>>>>> That means UniProt end users need to be careful to remove more triples,
>>>>> which is why I disagree with Alasdair's call for MUST.
>>>>>
>>>>>
>>>> if one wanted to know which version/issue of uniprot the sequences
>>>> came from, it would be necessary to provide access to the dataset
>>>> description. if the void:inDataset predicate is used, the user need not
>>>> even retrieve that to store locally, as you should provide resolution
>>>> services to those dataset descriptions.
>>>>
>>>>
>>>> I also do not follow your objection. If you have created a file that
>>>> contains a subset of the data, then you can declare this to be a subset of
>>>> the parent-versioned-formatted dataset, ideally with some way of
>>>> distinguishing the content of the dataset.
>>>>
>>> I will try to explain my objections. The first is that the dataset is a set of
>>> triples while the void:inDataset is a predicate on a single
>>> resource/entity/subject.
>>> So as I have 1.4 billion entities I would add 1.4 billion void:inDataset
>>> triples. Which to me seems like the incorrect thing to do.
>>>
>>
>>  we would like to know the provenance of every data item. if you define
>> 1.4 billion entities, then you should provide 1.4 billion links to their
>> provenance.
>>
>>
>>>  Well you say you should only add them to the "important" resources and
>>> then we are down to 100 million of these statements.
>>> Yet for users who use slices of our data these void:inDataset triples
>>> are annoying/misleading especially if they merge them with their own
>>> sources.
>>>
>>> e.g.
>>>
>>> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
>>> uniprot:P12345 a up:Protein ;
>>>                        void:inDataset dataset:uniprot .
>>> dataset:uniprot dcterms:license cc:by-sa-v3 .
>>> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
>>> secretdrugchemical:1000 void:inDataset top:secret .
>>>
>>> Given these triples what is the license for knowledge about
>>> secretdrugchemical:1000 activating uniprot:P12345?
>>>
>>> The dataset description is about a set of data, not single triples so
>>> single back links seem to me to be the incorrect solution?
>>>
>>>
>> the focus on the assertion(s) is perfectly fine. several mechanisms have
>> now been proposed; nanopublications [1], micropublications [2] and ovopubs
>> [3]
>>
>> [1]
>> http://www.w3.org/wiki/images/c/c0/HCLSIG$$SWANSIOC$$Actions$$RhetoricalStructure$$meetings$$20100215$cwa-anatomy-nanopub-v3.pdf
>> [2]http://arxiv.org/abs/1305.3506
>> [3]http://arxiv.org/abs/1305.6800
>>
>>
>>
>>>
>>>
>>>> From all the scenarios I have encountered, scientists (not just in the
>>>> healthcare and life sciences) care about where their data has come from and
>>>> what version it is. As such, we need some way to allow for the linking of
>>>> data back to the description of the data.
>>>>
>>> Of course I don't disagree with the usecase. I disagree with the chosen
>>> solution because it is on the wrong level of granularity.
>>>
>>>
>> it's not wrong, it's just at a level that you don't want to provide.  We
>> do it in Bio2RDF, and now each of our data items from Release 2 is linked
>> accordingly.
>>
> No, you don't; see http://bio2rdf.org/page/beilstein:1900390
>
> Also you put it on the entity/subject while what is interesting is the
> provenance of the triple.
>
> The provenance is on the triple in your linked papers, not in the Bio2RDF
> case or the void:inDataset case.
>
> Regards,
> Jerven
>
>>
>> m.
>>
>>
>>>
>>>> Alasdair
>>>>
>>>> Dr Alasdair J G Gray
>>>> Research Associate
>>>> Alasdair.Gray@manchester.ac.uk
>>>> +44 161 275 0145
>>>>
>>>> http://www.cs.man.ac.uk/~graya/
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jerven Bolleman
>>> me@jerven.eu
>>>
>>
>>
>>
>> --
>> Michel Dumontier
>> Associate Professor of Bioinformatics, Carleton University
>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>> Group
>> http://dumontierlab.com
>>
>
>
>
> --
> Jerven Bolleman
> me@jerven.eu
>



-- 
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com
Received on Tuesday, 4 June 2013 13:47:58 UTC