
Re: inDataset (was Notes from today's meeting)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Tue, 4 Jun 2013 15:05:08 +0200
Message-ID: <CALcEXf4AXYR-g_EK01wMKU0UxQ-eUcpZYBOM+w0wzGiYnS4zDQ@mail.gmail.com>
To: Jerven Bolleman <me@jerven.eu>
Cc: Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:

> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <
> Alasdair.Gray@manchester.ac.uk> wrote:
>> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com>
>> wrote:
>>> About void:inDataset I personally don't like it. I suspect it would cost
>>> me a 13% growth in triple size for negligible benefits. This also means
>>> that the dataset description starts to affect the data. Although I could
>>> present this only in the REST / linked data interface and not in the SPARQL
>>> endpoint, I am worried that I cannot put it into the FTP data dump RDF, as
>>> the data item concept does not map 1:1 onto an atomic set of triples.
>> i'm not sure that i completely understand your objection. the primary use
>> of void:inDataset is to link data items to the dataset description, and as
>> such supports linked data applications without looking at the graph for a
>> potential, but un-guaranteed provenance description. Using void:inDataset
>> is normal practice in the RDF / linked data community. It would be strange
>> to not include it in any RDF dataset if you have the dataset description.
>> http://www.w3.org/TR/void/#backlinks
>>> e.g. someone can use just the UniProtKB sequences. Once they do that, is
>>> it still the same dataset that I published? I don't think so. Which
>>> means UniProt end users need to be careful to remove more triples. Which
>>> is why I disagree with Alasdair's call for MUST.
>> if one wanted to know which version/issue of uniprot that the sequences
>> came from, it would be necessary to provide access to the dataset
>> description. if the void:inDataset predicate is used, the user need not
>> even retrieve that to store locally, as you should provide resolution
>> services to those dataset descriptions.
>> I also do not follow your objection. If you have created a file that
>> contains a subset of the data, then you can declare this to be a subset of
>> the parent-versioned-formatted dataset, ideally with some way of
>> distinguishing the content of the dataset.
> I will try to explain my objections. The first is that the dataset is a set of
> triples while the void:inDataset is a predicate on a single
> resource/entity/subject.
> So as I have 1.4 billion entities I would add 1.4 billion void:inDataset
> triples. Which to me seems like the incorrect thing to do.
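
For scale, the figures quoted in this thread (1.4 billion entities, roughly 13% growth, about 100 million "important" resources) can be related with simple arithmetic. The Python sketch below assumes exactly one void:inDataset triple per entity and treats the 13% as exact; the implied base triple count is a derived figure, not a number stated by anyone in the thread.

```python
# Back-of-envelope sketch; only the three inputs below come from the thread.
ENTITIES = 1_400_000_000   # entities, assuming one backlink triple each
GROWTH = 0.13              # "13% growth in triple size", treated as exact
IMPORTANT = 100_000_000    # backlinks if only "important" resources get one


def implied_base_triples(added: int, growth: float) -> float:
    """Base triple count for which `added` extra triples equal `growth`."""
    return added / growth


def growth_for(added: int, base: float) -> float:
    """Fractional growth caused by `added` extra triples on `base` triples."""
    return added / base


base = implied_base_triples(ENTITIES, GROWTH)
print(f"implied base: {base / 1e9:.1f} billion triples")
print(f"growth with {IMPORTANT:,} backlinks: {growth_for(IMPORTANT, base):.2%}")
```

Under these assumptions, restricting backlinks to the "important" resources brings the overhead down to under 1% of the base size.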

 we would like to know the provenance of every data item. if you define 1.4
billion entities, then you should provide 1.4 billion links to their

> Well, you say you should only add them to the "important" resources, and
> then we are down to 100 million of these statements.
> Yet for users who use slices of our data these void:inDataset triples are
> annoying/misleading especially if they merge them with their own sources.
> e.g.
> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
> uniprot:P12345 a up:Protein ;
>                        void:inDataset dataset:uniprot .
> dataset:uniprot dcterms:license cc:by-sa-v3 .
> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
> secretdrugchemical:1000 void:inDataset top:secret .
> Given these triples what is the license for knowledge about
> secretdrugchemical:1000 activating uniprot:P12345?
> The dataset description is about a set of data, not single triples so
> single back links seem to me to be the incorrect solution?
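
The ambiguity in the example above can be made concrete. The following Python sketch holds the six example triples as plain tuples (no RDF library is used; the helper names `objects` and `licenses_via_backlink` are hypothetical, and dcterms:license is assumed to be the intended property) and follows void:inDataset from each end of the roche:activatedBy statement to see which license it reaches.

```python
# The example triples as (subject, predicate, object) tuples.
# The truncated "uniref:UniRef100_" identifier is kept as it appears above.
TRIPLES = [
    ("uniref:UniRef100_", "up:sequenceFor", "uniprot:P12345"),
    ("uniprot:P12345", "rdf:type", "up:Protein"),
    ("uniprot:P12345", "void:inDataset", "dataset:uniprot"),
    ("dataset:uniprot", "dcterms:license", "cc:by-sa-v3"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000"),
    ("secretdrugchemical:1000", "void:inDataset", "top:secret"),
]


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def licenses_via_backlink(resource):
    """Licenses of every dataset the resource links to via void:inDataset."""
    found = set()
    for dataset in objects(resource, "void:inDataset"):
        found |= objects(dataset, "dcterms:license")
    return found


statement = ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000")
by_subject = licenses_via_backlink(statement[0])
by_object = licenses_via_backlink(statement[2])
print(by_subject)  # {'cc:by-sa-v3'}
print(by_object)   # set()
```

Following the subject's backlink yields cc:by-sa-v3, while the object's dataset states no license at all. Neither lookup describes the merged statement itself, which is the granularity concern raised above.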
the focus on the assertion(s) is perfectly fine. several mechanisms have
now been proposed; nanopublications [1], micropublications [2] and ovopubs


>> From all the scenarios I have encountered, scientists (not just in the
>> healthcare and life sciences) care about where their data has come from and
>> what version it is. As such, we need some way to allow for the linking of
>> data back to the description of the data.
> Of course I don't disagree with the usecase. I disagree with the chosen
> solution because it is on the wrong level of granularity.
it's not wrong, it's just at a level that you don't want to provide.  We do
it in Bio2RDF, and now each of our data items from Release 2 are linked


>> Alasdair
>> Dr Alasdair J G Gray
>> Research Associate
>> Alasdair.Gray@manchester.ac.uk
>> +44 161 275 0145
>> http://www.cs.man.ac.uk/~graya/
>> Please consider the environment before printing this email.
> --
> Jerven Bolleman
> me@jerven.eu

Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
Received on Tuesday, 4 June 2013 13:05:56 UTC
