Re: inDataset (was Notes from today's meeting) from Michel Dumontier on 2013-06-04 (public-semweb-lifesci@w3.org from June 2013)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Tue, 4 Jun 2013 17:26:21 +0200
To: Jerven Bolleman <me@jerven.eu>
Cc: Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
Message-ID: <CALcEXf4p2cASD+N1+3MaHMx=dTDQe=vZrX7QDuskEieow4YdTg@mail.gmail.com>
 i'm willing to compromise on "should", as it is generally seen as good
practice to use void:inDataset for linking data items to datasets.  We can
bring this discussion to the semantic-web mailing list, if you want
additional feedback.

m.
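
For readers unfamiliar with the backlink practice mentioned above, here is a minimal sketch of emitting one void:inDataset link per data item as N-Triples. The dataset IRI is a hypothetical placeholder and the helper name is an assumption for illustration; only the item IRI and the void:inDataset predicate come from the thread.

```python
# VoID namespace IRI for the inDataset predicate.
VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"


def backlink_triples(item_iris, dataset_iri):
    """Yield one N-Triples line per data item, linking it to its dataset description."""
    for item in item_iris:
        yield f"<{item}> <{VOID_IN_DATASET}> <{dataset_iri}> ."


# Item IRI taken from the thread; the dataset IRI is hypothetical.
items = ["http://bio2rdf.org/geneid:123"]
dataset = "http://example.org/dataset/geneid-release-2"

lines = list(backlink_triples(items, dataset))
for line in lines:
    print(line)
```

The cost is one extra triple per linked item, which is the overhead debated in the quoted messages.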


On Tue, Jun 4, 2013 at 5:14 PM, Jerven Bolleman <me@jerven.eu> wrote:

>
>
>
> On Tue, Jun 4, 2013 at 3:47 PM, Michel Dumontier <
> michel.dumontier@gmail.com> wrote:
>
>> Hi Jerven,
>>
>>  First: Bio2RDF's current datasets are listed here:
>>
>> http://bio2rdf.org/datasets
>>
>> as i mentioned, and as has been presented [1], those that are in Bio2RDF
>> release 2 have provenance, e.g.
>> http://bio2rdf.org/geneid:123
>>
>> (we did not provide updates to uniprot, genbank, refseq, pubmed in
>> release 2, and they won't have provenance associated with them).
>>
>>
>> Yes, I agree that the provenance of an assertion is more interesting, and
>> we are working towards implementing this for Bio2RDF. But if you were
>> worried about adding 1.6 billion relations, you'll be more worried about
>> adding 8 billion more to annotate each triple.
>>
> No, it's only 16 graph IDs ;) at least on the SPARQL endpoint. Using
> reification we would add 4*8 billion triples ...
> But with TriX or N-Quads dumps we would not need these triples, as again we
> would do provenance at the graph level.
> Which is why I disagree with the MUST qualification, not with a MAY
> qualification in the standard.
>
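
The arithmetic behind the reply above can be sketched as follows. The statement count and the number of graphs are the figures quoted in the thread; the assumption that graph-level provenance needs only one link per named graph is an illustration, not something the thread specifies.

```python
# Figures quoted in the thread; treated here as rough estimates.
STATEMENTS = 8_000_000_000
NAMED_GRAPHS = 16

# Standard RDF reification describes one statement with four triples:
# rdf:type rdf:Statement, rdf:subject, rdf:predicate, rdf:object.
REIFICATION_TRIPLES_PER_STATEMENT = 4


def reification_overhead(statements: int) -> int:
    """Extra triples needed to reify every statement."""
    return statements * REIFICATION_TRIPLES_PER_STATEMENT


def graph_level_overhead(graphs: int, links_per_graph: int = 1) -> int:
    """Extra triples if provenance is attached once per named graph (assumed one link each)."""
    return graphs * links_per_graph


extra_reified = reification_overhead(STATEMENTS)
extra_graph_level = graph_level_overhead(NAMED_GRAPHS)
print(extra_reified, extra_graph_level)
```

Under these assumptions the reified form adds 32 billion triples, while the graph-level form adds a number of links on the order of the graph count.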
>
>>
>> m.
>>
>> [1]
>> http://www.slideshare.net/micheldumontier/bio2rdf-release-2-improved-coverage-interoperability-and-provenance-of-linked-data-for-the-life-sciences
>>
>>
>>
>>
>> On Tue, Jun 4, 2013 at 3:23 PM, Jerven Bolleman <me@jerven.eu> wrote:
>>
>>>
>>>
>>>
>>> On Tue, Jun 4, 2013 at 3:05 PM, Michel Dumontier <
>>> michel.dumontier@gmail.com> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <
>>>>> Alasdair.Gray@manchester.ac.uk> wrote:
>>>>>
>>>>>>
>>>>>> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> About void:inDataset: I personally don't like it. I suspect it would
>>>>>>> cost me a 13% growth in triple size for negligible benefits. This also
>>>>>>> means that the dataset description starts to affect the data, although I
>>>>>>> could present this only in the REST / linked data interface and not in the
>>>>>>> SPARQL endpoint. I am worried that I cannot put it into the FTP data dump
>>>>>>> RDF, as the data item concept does not map 1:1 onto a set of triples that
>>>>>>> is atomic.
>>>>>>>
>>>>>>>
>>>>>> i'm not sure that i completely understand your objection. the primary
>>>>>> use of void:inDataset is to link data items to the dataset description, and
>>>>>> as such supports linked data applications without looking at the graph for
>>>>>> a potential, but un-guaranteed provenance description. Using void:inDataset
>>>>>> is normal practice in the RDF / linked data community. It would be strange
>>>>>> to not include it in any RDF dataset if you have the dataset description.
>>>>>>
>>>>>> http://www.w3.org/TR/void/#backlinks
>>>>>>
>>>>>>
>>>>>>
>>>>>>> e.g. someone can use just the UniProtKB sequences. Once they have done
>>>>>>> that, is it still the same dataset that I published? I don't think so.
>>>>>>> This means UniProt end users need to be careful to remove more triples,
>>>>>>> which is why I disagree with Alasdair's call for MUST.
>>>>>>>
>>>>>>>
>>>>>> if one wanted to know which version/issue of uniprot that the
>>>>>> sequences came from, it would be necessary to provide access to the dataset
>>>>>> description. if the void:inDataset predicate is used, the user need not
>>>>>> even retrieve that to store locally, as you should provide resolution
>>>>>> services to those dataset descriptions.
>>>>>>
>>>>>>
>>>>>> I also do not follow your objection. If you have created a file that
>>>>>> contains a subset of the data, then you can declare this to be a subset of
>>>>>> the parent-versioned-formatted dataset, ideally with some way of
>>>>>> distinguishing the content of the dataset.
>>>>>>
>>>>> I will try to explain my objections. The first is that the dataset is a
>>>>> set of triples, while void:inDataset is a predicate on a single
>>>>> resource/entity/subject.
>>>>> So as I have 1.4 billion entities I would add 1.4 billion
>>>>> void:inDataset triples. Which to me seems like the incorrect thing to do.
>>>>>
>>>>
>>>>  we would like to know the provenance of every data item. if you define
>>>> 1.4 billion entities, then you should provide 1.4 billion links to their
>>>> provenance.
>>>>
>>>>
>>>>>  Well, you say you should only add them to the "important" resources,
>>>>> and then we are down to 100 million of these statements.
>>>>> Yet for users who use slices of our data these void:inDataset triples
>>>>> are annoying/misleading especially if they merge them with their own
>>>>> sources.
>>>>>
>>>>> e.g.
>>>>>
>>>>> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
>>>>> uniprot:P12345 a up:Protein ;
>>>>>                        void:inDataset dataset:uniprot .
>>>>> dataset:uniprot dcterms:license cc:by-sa-v3 .
>>>>> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
>>>>> secretdrugchemical:1000 void:inDataset top:secret .
>>>>>
>>>>> Given these triples what is the license for knowledge about
>>>>> secretdrugchemical:1000 activating uniprot:P12345?
>>>>>
>>>>> The dataset description is about a set of data, not single triples so
>>>>> single back links seem to me to be the incorrect solution?
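
The ambiguity raised in the example above can be made concrete with a small sketch. Triples are modelled as plain tuples of prefixed names; the truncated uniref triple is left out. Using the dataset IRIs as graph names in the quad form is an assumption made for illustration, as are the helper names.

```python
# Entity-level backlinks, as in the example from the thread.
TRIPLES = [
    ("uniprot:P12345", "rdf:type", "up:Protein"),
    ("uniprot:P12345", "void:inDataset", "dataset:uniprot"),
    ("dataset:uniprot", "dcterms:license", "cc:by-sa-v3"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000"),
    ("secretdrugchemical:1000", "void:inDataset", "top:secret"),
]


def datasets_of(entity, triples):
    """Datasets an entity is linked to via void:inDataset."""
    return {o for s, p, o in triples if s == entity and p == "void:inDataset"}


def candidate_datasets(triple, triples):
    """Datasets a single triple could belong to, judged by its subject and object."""
    s, _, o = triple
    return datasets_of(s, triples) | datasets_of(o, triples)


claim = ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000")
candidates = candidate_datasets(claim, TRIPLES)  # more than one: ambiguous

# Graph-level alternative: each statement carries the graph it was published in.
# Graph names are assumed to be the dataset IRIs.
QUADS = [
    ("uniprot:P12345", "rdf:type", "up:Protein", "dataset:uniprot"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000", "top:secret"),
]


def graphs_of(triple, quads):
    """Named graphs that contain exactly this triple."""
    return {g for s, p, o, g in quads if (s, p, o) == triple}


print(sorted(candidates))
print(graphs_of(claim, QUADS))
```

With entity-level links the claim has two candidate datasets, so its license cannot be read off; with quads the statement itself points at a single graph.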
>>>>>
>>>>>
>>>> the focus on the assertion(s) is perfectly fine. several mechanisms
>>>> have now been proposed; nanopublications [1], micropublications [2] and
>>>> ovopubs [3]
>>>>
>>>> [1]
>>>> http://www.w3.org/wiki/images/c/c0/HCLSIG$$SWANSIOC$$Actions$$RhetoricalStructure$$meetings$$20100215$cwa-anatomy-nanopub-v3.pdf
>>>> [2]http://arxiv.org/abs/1305.3506
>>>> [3]http://arxiv.org/abs/1305.6800
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>> From all the scenarios I have encountered, scientists (not just in
>>>>>> the healthcare and life sciences) care about where their data has come from
>>>>>> and what version it is. As such, we need some way to allow for the linking
>>>>>> of data back to the description of the data.
>>>>>>
>>>>> Of course I don't disagree with the usecase. I disagree with the
>>>>> chosen solution because it is on the wrong level of granularity.
>>>>>
>>>>>
>>>> it's not wrong, it's just at a level that you don't want to provide.
>>>>  We do it in Bio2RDF, and now each of our data items from Release 2 is
>>>> linked accordingly.
>>>>
>>> No, you don't; see http://bio2rdf.org/page/beilstein:1900390
>>>
>>> Also you put it on the entity/subject while what is interesting is the
>>> provenance of the triple.
>>>
>>> The provenance is on the triple in your linked papers, not in the Bio2RDF
>>> case or the void:inDataset case.
>>>
>>> Regards,
>>> Jerven
>>>
>>>>
>>>> m.
>>>>
>>>>
>>>>>
>>>>>> Alasdair
>>>>>>
>>>>>> Dr Alasdair J G Gray
>>>>>> Research Associate
>>>>>> Alasdair.Gray@manchester.ac.uk
>>>>>> +44 161 275 0145
>>>>>>
>>>>>> http://www.cs.man.ac.uk/~graya/
>>>>>>
>>>>>> Please consider the environment before printing this email.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jerven Bolleman
>>>>> me@jerven.eu
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Michel Dumontier
>>>> Associate Professor of Bioinformatics, Carleton University
>>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>>>> Group
>>>> http://dumontierlab.com
>>>>



-- 
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com
Received on Tuesday, 4 June 2013 15:27:13 UTC