Re: inDataset (was Notes from today's meeting)

On Tue, Jun 4, 2013 at 3:05 PM, Michel Dumontier <michel.dumontier@gmail.com> wrote:

>
>
>
> On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:
>
>>
>>
>>
>> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <
>> Alasdair.Gray@manchester.ac.uk> wrote:
>>
>>>
>>> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com>
>>> wrote:
>>>
>>> About void:inDataset: I personally don't like it. I suspect it would cost
>>>> me a 13% growth in triple size for negligible benefit. This also means
>>>> that the dataset description starts to affect the data, although I could
>>>> present it only in the REST / linked data interface and not in the SPARQL
>>>> endpoint. I am worried that I cannot put it into the FTP data dump RDF, as
>>>> the data item concept does not map 1:1 onto an atomic set of triples.
>>>>
>>>>
>>> i'm not sure that i completely understand your objection. the primary
>>> use of void:inDataset is to link data items to the dataset description, and
>>> as such supports linked data applications without looking at the graph for
>>> a potential, but un-guaranteed provenance description. Using void:inDataset
>>> is normal practice in the RDF / linked data community. It would be strange
>>> to not include it in any RDF dataset if you have the dataset description.
>>>
>>> http://www.w3.org/TR/void/#backlinks
>>>
>>>
>>>
>>>> e.g. someone can use just the UniProtKB sequences. Once they have done
>>>> that, is it still the same dataset that I published? I don't think so.
>>>> Which means UniProt end users need to be careful to remove more triples.
>>>> Which is why I disagree with Alasdair's call for MUST.
>>>>
>>>>
>>> if one wanted to know which version/issue of uniprot that the sequences
>>> came from, it would be necessary to provide access to the dataset
>>> description. if the void:inDataset predicate is used, the user need not
>>> even retrieve that to store locally, as you should provide resolution
>>> services to those dataset descriptions.
>>>
>>>
>>> I also do not follow your objection. If you have created a file that
>>> contains a subset of the data, then you can declare this to be a subset of
>>> the parent-versioned-formatted dataset, ideally with some way of
>>> distinguishing the content of the dataset.
>>>
>> I will try to explain my objections. The first is that the dataset is a
>> set of triples, while void:inDataset is a predicate on a single
>> resource/entity/subject.
>> So as I have 1.4 billion entities, I would add 1.4 billion void:inDataset
>> triples, which to me seems like the incorrect thing to do.
>>
>
>  we would like to know the provenance of every data item. if you define
> 1.4 billion entities, then you should provide 1.4 billion links to their
> provenance.
>
>
>>  Well, you say you should only add them to the "important" resources, and
>> then we are down to 100 million of these statements.
>> Yet for users who use slices of our data, these void:inDataset triples are
>> annoying/misleading, especially if they merge them with their own sources.
>>
>> e.g.
>>
>> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
>> uniprot:P12345 a up:Protein ;
>>                        void:inDataset dataset:uniprot .
>> dataset:uniprot dcterms:license cc:by-sa-v3 .
>> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
>> secretdrugchemical:1000 void:inDataset top:secret .
>>
>> Given these triples what is the license for knowledge about
>> secretdrugchemical:1000 activating uniprot:P12345?
>>
>> The dataset description is about a set of data, not single triples, so
>> single back links seem to me to be the incorrect solution.
>>
>>
> the focus on the assertion(s) is perfectly fine. several mechanisms have
> now been proposed; nanopublications [1], micropublications [2] and ovopubs
> [3]
>
> [1]
> http://www.w3.org/wiki/images/c/c0/HCLSIG$$SWANSIOC$$Actions$$RhetoricalStructure$$meetings$$20100215$cwa-anatomy-nanopub-v3.pdf
> [2]http://arxiv.org/abs/1305.3506
> [3]http://arxiv.org/abs/1305.6800
>
>
>
>>
>>
>>> From all the scenarios I have encountered, scientists (not just in the
>>> healthcare and life sciences) care about where their data has come from and
>>> what version it is. As such, we need some way to allow for the linking of
>>> data back to the description of the data.
>>>
>> Of course I don't disagree with the usecase. I disagree with the chosen
>> solution because it is on the wrong level of granularity.
>>
>>
> it's not wrong, it's just at a level that you don't want to provide. We
> do it in Bio2RDF, and now each of our data items from Release 2 is linked
> accordingly.
>
No, you don't; see http://bio2rdf.org/page/beilstein:1900390

Also, you put it on the entity/subject, while what is interesting is the
provenance of the triple.

In your linked papers the provenance is on the triple; in the Bio2RDF
case and the void:inDataset case it is not.
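
To make the granularity point concrete, here is a minimal Python sketch (not
anything UniProt or Bio2RDF actually runs) that models the triples from the
example above as plain tuples. The helper names datasets_of and
candidate_sources, the quads mapping, and the choice of top:secret as the
source of the activation statement are all assumptions made for illustration.

```python
# Triples from the example above, as (subject, predicate, object) tuples.
IN_DATASET = "void:inDataset"

triples = [
    ("uniprot:P12345", "rdf:type", "up:Protein"),
    ("uniprot:P12345", IN_DATASET, "dataset:uniprot"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000"),
    ("secretdrugchemical:1000", IN_DATASET, "top:secret"),
]


def datasets_of(resource, triples):
    """Return the datasets a resource is linked to via void:inDataset."""
    return {o for s, p, o in triples if s == resource and p == IN_DATASET}


def candidate_sources(triple, triples):
    """Subject-level backlinks only say which datasets the subject and the
    object belong to; they do not say which dataset asserted the triple."""
    s, _, o = triple
    return datasets_of(s, triples) | datasets_of(o, triples)


# Hypothetical alternative: record the source per triple (a quad).
# Which dataset really asserted this statement is an assumption here.
activation = ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000")
quads = {activation: "top:secret"}

ambiguous = candidate_sources(activation, triples)  # two candidate datasets
exact = quads[activation]                           # one recorded source
```

With entity-level back links the lookup yields both datasets, so the license
question stays open; only the per-triple record gives a single answer.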

Regards,
Jerven

>
> m.
>
>
>>
>>> Alasdair
>>>
>>> Dr Alasdair J G Gray
>>> Research Associate
>>> Alasdair.Gray@manchester.ac.uk
>>> +44 161 275 0145
>>>
>>> http://www.cs.man.ac.uk/~graya/
>>>
>>>
>>>
>>
>>
>> --
>> Jerven Bolleman
>> me@jerven.eu
>>
>
>
>
> --
> Michel Dumontier
> Associate Professor of Bioinformatics, Carleton University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>



-- 
Jerven Bolleman
me@jerven.eu

Received on Tuesday, 4 June 2013 13:24:12 UTC