Re: inDataset (was Notes from today's meeting)

On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <
Alasdair.Gray@manchester.ac.uk> wrote:

>
> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com>
> wrote:
>
>> About void:inDataset: I personally don't like it. I suspect it would cost
>> me a 13% growth in triple size for negligible benefit. This also means
>> that the dataset description starts to affect the data. Although I could
>> present this only in the REST / linked data interface and not in the SPARQL
>> endpoint, I am worried that I cannot put it into the FTP data dump RDF, as
>> the data item concept does not map 1:1 onto a set of atomic triples.
>>
>>
> I'm not sure that I completely understand your objection. The primary use
> of void:inDataset is to link data items to the dataset description, and as
> such it supports linked data applications without their having to look at the
> graph for a potential, but un-guaranteed, provenance description. Using
> void:inDataset is normal practice in the RDF / linked data community. It would
> be strange not to include it in any RDF dataset if you have the dataset
> description.
>
> http://www.w3.org/TR/void/#backlinks
>
>
>
>> e.g. someone can use just the UniProtKB sequences. Once they have done that,
>> is it still the same dataset that I published? I don't think so. Which
>> means UniProt end users need to be careful to remove more triples. Which
>> is why I disagree with Alasdair's call for MUST.
>>
>>
> If one wanted to know which version/issue of UniProt the sequences
> came from, it would be necessary to provide access to the dataset
> description. If the void:inDataset predicate is used, the user need not
> even retrieve that to store locally, as you should provide resolution
> services to those dataset descriptions.
>
>
> I also do not follow your objection. If you have created a file that
> contains a subset of the data, then you can declare this to be a subset of
> the parent-versioned-formatted dataset, ideally with some way of
> distinguishing the content of the dataset.
>
I will try to explain my objections. The first is that a dataset is a set of
triples, while void:inDataset is a predicate on a single
resource/entity/subject.
So as I have 1.4 billion entities, I would add 1.4 billion void:inDataset
triples, which to me seems like the incorrect thing to do.
Well, you might say I should only add them to the "important" resources, and
then we are down to 100 million of these statements.
Yet for users who use slices of our data, these void:inDataset triples are
annoying/misleading, especially if they merge them with their own sources.

e.g.

uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
uniprot:P12345 a up:Protein ;
                       void:inDataset dataset:uniprot .
dataset:uniprot dcterms:license cc:by-sa-v3 .
uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
secretdrugchemical:1000 void:inDataset top:secret .

Given these triples what is the license for knowledge about
secretdrugchemical:1000 activating uniprot:P12345?

The dataset description is about a set of data, not single triples, so
single back links seem to me to be the incorrect solution.
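
To make the ambiguity concrete, here is a minimal sketch in plain Python (no
RDF library). It models the relevant triples from the example as tuples and
follows void:inDataset from both ends of the roche:activatedBy statement. The
helper names (objects, candidate_licenses) are made up for illustration, and
dcterms:license is used as the licence property.

```python
# Triples from the merged example, as (subject, predicate, object) tuples.
TRIPLES = {
    ("uniprot:P12345", "rdf:type", "up:Protein"),
    ("uniprot:P12345", "void:inDataset", "dataset:uniprot"),
    ("dataset:uniprot", "dcterms:license", "cc:by-sa-v3"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000"),
    ("secretdrugchemical:1000", "void:inDataset", "top:secret"),
}


def objects(subject, predicate, triples):
    """Return all objects of triples matching the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def candidate_licenses(triple, triples):
    """Map each dataset linked from the triple's subject or object to its licenses.

    Hypothetical helper: void:inDataset hangs off resources, not triples, so
    all we can do is collect the datasets of both ends of the statement.
    """
    subject, _, obj = triple
    result = {}
    for resource in (subject, obj):
        for dataset in objects(resource, "void:inDataset", triples):
            result[dataset] = objects(dataset, "dcterms:license", triples)
    return result


statement = ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000")
licenses = candidate_licenses(statement, TRIPLES)
for dataset in sorted(licenses):
    print(dataset, sorted(licenses[dataset]) or "no license stated")
```

The lookup yields two candidate datasets for the one statement, and nothing in
the back links says which of them the statement itself belongs to.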



> From all the scenarios I have encountered, scientists (not just in the
> healthcare and life sciences) care about where their data has come from and
> what version it is. As such, we need some way to allow for the linking of
> data back to the description of the data.
>
Of course I don't disagree with the use case. I disagree with the chosen
solution because it is at the wrong level of granularity.


> Alasdair
>
> Dr Alasdair J G Gray
> Research Associate
> Alasdair.Gray@manchester.ac.uk
> +44 161 275 0145
>
> http://www.cs.man.ac.uk/~graya/
>


-- 
Jerven Bolleman
me@jerven.eu

Received on Tuesday, 4 June 2013 12:40:56 UTC