Re: inDataset (was Notes from today's meeting) from Jerven Bolleman on 2013-06-04 (public-semweb-lifesci@w3.org from June 2013)

From: Jerven Bolleman <me@jerven.eu>
Date: Tue, 4 Jun 2013 17:14:38 +0200
To: Michel Dumontier <michel.dumontier@gmail.com>
Cc: Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
Message-ID: <CAHM_hUNZE-GJBCtBUc3TPDtr=a511W4HzAn3vVKDtEK5VT4jGw@mail.gmail.com>
On Tue, Jun 4, 2013 at 3:47 PM, Michel Dumontier <michel.dumontier@gmail.com> wrote:

> Hi Jerven,
>
>  First: Bio2RDF's current datasets are listed here:
>
> http://bio2rdf.org/datasets
>
> As I mentioned, and as has been presented [1], those that are in Bio2RDF
> release 2 have provenance, e.g.
> http://bio2rdf.org/geneid:123
>
> (we did not provide updates to uniprot, genbank, refseq, pubmed in release
> 2, and they won't have provenance associated with them).
>
>
> Yes, I agree that the provenance of an assertion is more interesting, and
> we are working towards implementing this for Bio2RDF. But if you were
> worried about adding 1.6 billion relations, you'll be more worried about
> adding 8 billion more to annotate each triple.
>
No, it's only 16 graph IDs ;) at least on the SPARQL endpoint. Using
reification we would add 4*8 billion triples ...
But with TriX or N-Quads dumps we would not need these triples, as again we
would do provenance at the graph level.
To be clear, I disagree with the MUST qualification, not with a MAY
qualification in the standard.
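
A rough sketch of that arithmetic in Python, using the approximate figures
quoted in this thread rather than measured counts. The function names and the
example.org IRIs are made up for illustration. The one fixed fact is that
standard RDF reification describes a statement with four triples.

```python
from typing import Dict

# Approximate figures quoted in this thread, not measurements.
TRIPLES = 8_000_000_000      # triples that would each need annotating
ENTITIES = 1_400_000_000     # distinct entities (subjects)
GRAPHS = 16                  # named graphs on the SPARQL endpoint

# Standard RDF reification uses four triples per statement:
# rdf:type rdf:Statement, rdf:subject, rdf:predicate, rdf:object.
REIFICATION_TRIPLES_PER_STATEMENT = 4


def extra_statements() -> Dict[str, int]:
    """Return how many additional statements each provenance strategy needs."""
    return {
        "void:inDataset per entity": ENTITIES,
        "reification per triple": REIFICATION_TRIPLES_PER_STATEMENT * TRIPLES,
        "one description per named graph": GRAPHS,
    }


def to_nquad(subject: str, predicate: str, obj: str, graph: str) -> str:
    """Serialise one statement as an N-Quads line (graph IRI is the 4th term)."""
    return f"<{subject}> <{predicate}> <{obj}> <{graph}> ."


if __name__ == "__main__":
    for strategy, count in extra_statements().items():
        print(f"{strategy}: {count:,}")
    # Hypothetical IRIs, only to show where the graph label goes in a quad.
    print(to_nquad(
        "http://example.org/uniprot/P12345",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://example.org/core/Protein",
        "http://example.org/graph/uniprot",
    ))
```

With quads the graph label travels with every statement in the dump, so the
provenance only has to be stated once per graph.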


>
> m.
>
> [1]
> http://www.slideshare.net/micheldumontier/bio2rdf-release-2-improved-coverage-interoperability-and-provenance-of-linked-data-for-the-life-sciences
>
>
>
>
> On Tue, Jun 4, 2013 at 3:23 PM, Jerven Bolleman <me@jerven.eu> wrote:
>
>>
>>
>>
>> On Tue, Jun 4, 2013 at 3:05 PM, Michel Dumontier <
>> michel.dumontier@gmail.com> wrote:
>>
>>>
>>>
>>>
>>> On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <
>>>> Alasdair.Gray@manchester.ac.uk> wrote:
>>>>
>>>>>
>>>>> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com>
>>>>> wrote:
>>>>>
>>>>> About void:inDataset: I personally don't like it. I suspect it would
>>>>>> cost me a 13% growth in triple count for negligible benefit. This also
>>>>>> means that the dataset description starts to affect the data, although I
>>>>>> could present this only in the REST / linked data interface and not in the
>>>>>> SPARQL endpoint. I am worried that I cannot put it into the FTP data dump
>>>>>> RDF, as the data item concept does not map 1:1 onto an atomic set of
>>>>>> triples.
>>>>>>
>>>>>>
>>>>> i'm not sure that i completely understand your objection. the primary
>>>>> use of void:inDataset is to link data items to the dataset description, and
>>>>> as such supports linked data applications without looking at the graph for
>>>>> a potential, but un-guaranteed provenance description. Using void:inDataset
>>>>> is normal practice in the RDF / linked data community. It would be strange
>>>>> to not include it in any RDF dataset if you have the dataset description.
>>>>>
>>>>> http://www.w3.org/TR/void/#backlinks
>>>>>
>>>>>
>>>>>
>>>>>> e.g. someone can use just the UniProtKB sequences. Once they have done
>>>>>> that, is it still the same dataset that I published? I don't think so.
>>>>>> This means UniProt end users need to be careful to remove more triples,
>>>>>> which is why I disagree with Alasdair's call for MUST.
>>>>>>
>>>>>>
>>>>> if one wanted to know which version/issue of uniprot that the
>>>>> sequences came from, it would be necessary to provide access to the dataset
>>>>> description. if the void:inDataset predicate is used, the user need not
>>>>> even retrieve that to store locally, as you should provide resolution
>>>>> services to those dataset descriptions.
>>>>>
>>>>>
>>>>> I also do not follow your objection. If you have created a file that
>>>>> contains a subset of the data, then you can declare this to be a subset of
>>>>> the parent-versioned-formatted dataset, ideally with some way of
>>>>> distinguishing the content of the dataset.
>>>>>
>>>> I will try to explain my objections. The first is that the dataset is a
>>>> set of triples, while void:inDataset is a predicate on a single
>>>> resource/entity/subject.
>>>> So as I have 1.4 billion entities I would add 1.4 billion
>>>> void:inDataset triples, which to me seems like the incorrect thing to do.
>>>>
>>>
>>>  we would like to know the provenance of every data item. if you define
>>> 1.4 billion entities, then you should provide 1.4 billion links to their
>>> provenance.
>>>
>>>
>>>>  Well, you say one should only add them to the "important" resources,
>>>> and then we are down to 100 million of these statements.
>>>> Yet for users who use slices of our data these void:inDataset triples
>>>> are annoying/misleading, especially if they merge them with their own
>>>> sources.
>>>>
>>>> e.g.
>>>>
>>>> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
>>>> uniprot:P12345 a up:Protein ;
>>>>                        void:inDataset dataset:uniprot .
>>>> dataset:uniprot dcterms:license cc:by-sa-v3 .
>>>> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
>>>> secretdrugchemical:1000 void:inDataset top:secret .
>>>>
>>>> Given these triples what is the license for knowledge about
>>>> secretdrugchemical:1000 activating uniprot:P12345?
>>>>
>>>> The dataset description is about a set of data, not single triples so
>>>> single back links seem to me to be the incorrect solution?
>>>>
>>>>
>>> the focus on the assertion(s) is perfectly fine. several mechanisms have
>>> now been proposed; nanopublications [1], micropublications [2] and ovopubs
>>> [3]
>>>
>>> [1]
>>> http://www.w3.org/wiki/images/c/c0/HCLSIG$$SWANSIOC$$Actions$$RhetoricalStructure$$meetings$$20100215$cwa-anatomy-nanopub-v3.pdf
>>> [2]http://arxiv.org/abs/1305.3506
>>> [3]http://arxiv.org/abs/1305.6800
>>>
>>>
>>>
>>>>
>>>>
>>>>> From all the scenarios I have encountered, scientists (not just in the
>>>>> healthcare and life sciences) care about where their data has come from and
>>>>> what version it is. As such, we need some way to allow for the linking of
>>>>> data back to the description of the data.
>>>>>
>>>> Of course I don't disagree with the usecase. I disagree with the chosen
>>>> solution because it is on the wrong level of granularity.
>>>>
>>>>
>>> it's not wrong, it's just at a level that you don't want to provide.  We
>>> do it in Bio2RDF, and now each of our data items from Release 2 are linked
>>> accordingly.
>>>
>> No, you don't; see http://bio2rdf.org/page/beilstein:1900390
>>
>> Also, you put it on the entity/subject, while what is interesting is the
>> provenance of the triple.
>>
>> The provenance is on the triple in your linked papers, not in the Bio2RDF
>> case or the void:inDataset case.
>>
>> Regards,
>> Jerven
>>
>>>
>>> m.
>>>
>>>
>>>>
>>>>> Alasdair
>>>>>
>>>>> Dr Alasdair J G Gray
>>>>> Research Associate
>>>>> Alasdair.Gray@manchester.ac.uk
>>>>> +44 161 275 0145
>>>>>
>>>>> http://www.cs.man.ac.uk/~graya/
>>>>>
>>>>> Please consider the environment before printing this email.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jerven Bolleman
>>>> me@jerven.eu
>>>>
>>>
>>>
>>>
>>> --
>>> Michel Dumontier
>>> Associate Professor of Bioinformatics, Carleton University
>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>>> Group
>>> http://dumontierlab.com
>>>
>>
>>
>>
>> --
>> Jerven Bolleman
>> me@jerven.eu
>>
>
>
>
> --
> Michel Dumontier
> Associate Professor of Bioinformatics, Carleton University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>



-- 
Jerven Bolleman
me@jerven.eu
Received on Tuesday, 4 June 2013 15:15:10 UTC