- From: Jerven Bolleman <me@jerven.eu>
- Date: Tue, 4 Jun 2013 15:23:43 +0200
- To: Michel Dumontier <michel.dumontier@gmail.com>
- Cc: Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
- Message-ID: <CAHM_hUPEVUa5wr2XOH4D8m0MGoi6en9isLbYEQWGjzKPRXU_iQ@mail.gmail.com>
On Tue, Jun 4, 2013 at 3:05 PM, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>
> On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:
>
>> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk> wrote:
>>
>>> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>>>
>>>> About void:inDataset I personally don't like it. I suspect it would cost
>>>> me a 13% growth in triple size for negligible benefits. This also means
>>>> that the dataset description starts to affect the data, although I could
>>>> present this only in the REST / linked data interface and not in the
>>>> SPARQL endpoint. I am worried that I cannot put it into the FTP RDF data
>>>> dump, as the data item concept does not map 1:1 onto an atomic set of
>>>> triples.
>>>>
>>> i'm not sure that i completely understand your objection. the primary
>>> use of void:inDataset is to link data items to the dataset description, and
>>> as such supports linked data applications without looking at the graph for
>>> a potential, but un-guaranteed provenance description. Using void:inDataset
>>> is normal practice in the RDF / linked data community. It would be strange
>>> to not include it in any RDF dataset if you have the dataset description.
>>>
>>> http://www.w3.org/TR/void/#backlinks
>>>
>>>> e.g. someone can use just the UniProtKB sequences. Once they have done
>>>> that, is it still the same dataset that I published? I don't think so.
>>>> This means UniProt end users need to be careful to remove more triples,
>>>> which is why I disagree with Alasdair's call for MUST.
>>>>
>>> if one wanted to know which version/issue of uniprot that the sequences
>>> came from, it would be necessary to provide access to the dataset
>>> description. if the void:inDataset predicate is used, the user need not
>>> even retrieve that to store locally, as you should provide resolution
>>> services to those dataset descriptions.
>>>
>>> I also do not follow your objection. If you have created a file that
>>> contains a subset of the data, then you can declare this to be a subset of
>>> the parent-versioned-formatted dataset, ideally with some way of
>>> distinguishing the content of the dataset.
>>>
>> I will try to explain my objections. The first is that the dataset is a
>> set of triples, while void:inDataset is a predicate on a single
>> resource/entity/subject.
>> So as I have 1.4 billion entities I would add 1.4 billion void:inDataset
>> triples, which to me seems like the incorrect thing to do.
>>
>
> we would like to know the provenance of every data item. if you define
> 1.4 billion entities, then you should provide 1.4 billion links to their
> provenance.
>
>> Well, you say you should only add them to the "important" resources, and
>> then we are down to 100 million of these statements.
>> Yet for users who use slices of our data these void:inDataset triples are
>> annoying/misleading, especially if they merge them with their own sources.
>>
>> e.g.
>>
>> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
>> uniprot:P12345 a up:Protein ;
>>     void:inDataset dataset:uniprot .
>> dataset:uniprot dcterms:licence cc:by-sa-v3 .
>> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
>> secretdrugchemical:1000 void:inDataset top:secret .
>>
>> Given these triples, what is the license for knowledge about
>> secretdrugchemical:1000 activating uniprot:P12345?
>>
>> The dataset description is about a set of data, not single triples, so
>> single back links seem to me to be the incorrect solution.
>>
>
> the focus on the assertion(s) is perfectly fine. several mechanisms have
> now been proposed: nanopublications [1], micropublications [2] and ovopubs
> [3]
>
> [1] http://www.w3.org/wiki/images/c/c0/HCLSIG$$SWANSIOC$$Actions$$RhetoricalStructure$$meetings$$20100215$cwa-anatomy-nanopub-v3.pdf
> [2] http://arxiv.org/abs/1305.3506
> [3] http://arxiv.org/abs/1305.6800
>
>>> From all the scenarios I have encountered, scientists (not just in the
>>> healthcare and life sciences) care about where their data has come from and
>>> what version it is. As such, we need some way to allow for the linking of
>>> data back to the description of the data.
>>>
>> Of course I don't disagree with the use case. I disagree with the chosen
>> solution because it is at the wrong level of granularity.
>>
>
> it's not wrong, it's just at a level that you don't want to provide. We
> do it in Bio2RDF, and now each of our data items from Release 2 are linked
> accordingly.
>

No you don't, see http://bio2rdf.org/page/beilstein:1900390

Also, you put it on the entity/subject, while what is interesting is the
provenance of the triple. In your linked papers the provenance is on the
triple; that is not so in the Bio2RDF case or the void:inDataset case.

Regards,
Jerven

>
> m.
>
>>> Alasdair
>>>
>>> Dr Alasdair J G Gray
>>> Research Associate
>>> Alasdair.Gray@manchester.ac.uk
>>> +44 161 275 0145
>>>
>>> http://www.cs.man.ac.uk/~graya/
>>>
>>> Please consider the environment before printing this email.
>>>
>>
>> --
>> Jerven Bolleman
>> me@jerven.eu
>>
>
> --
> Michel Dumontier
> Associate Professor of Bioinformatics, Carleton University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>

--
Jerven Bolleman
me@jerven.eu
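
The granularity argument in this message can be illustrated with a minimal Python sketch. It is only a sketch under stated assumptions: triples are plain tuples of prefixed-name strings copied from the example in the thread, no RDF library is involved, and the helper names (`datasets_of`, `candidate_datasets`, `graphs_of`) are hypothetical. The quad layout, and in particular the choice to place the roche:activatedBy statement in the `top:secret` graph, is an assumption made purely for illustration; the thread itself does not say which dataset that statement belongs to.

```python
# Hypothetical sketch: entity-level backlinks vs. triple-level provenance.
# Prefixed names are plain strings taken from the example in the thread.

IN_DATASET = "void:inDataset"

# Merged triples with entity-level void:inDataset backlinks.
triples = [
    ("uniprot:P12345", "rdf:type", "up:Protein"),
    ("uniprot:P12345", IN_DATASET, "dataset:uniprot"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000"),
    ("secretdrugchemical:1000", IN_DATASET, "top:secret"),
]


def datasets_of(entity, triples):
    """Return the datasets an entity is linked to via void:inDataset."""
    return {o for s, p, o in triples if s == entity and p == IN_DATASET}


def candidate_datasets(triple, triples):
    """Return every dataset reachable from the triple's subject or object."""
    s, _, o = triple
    return datasets_of(s, triples) | datasets_of(o, triples)


# The same statements as quads; the fourth element names the source graph.
# ASSUMPTION: the activation statement is placed in top:secret for illustration.
quads = [
    ("uniprot:P12345", "rdf:type", "up:Protein", "dataset:uniprot"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000", "top:secret"),
]


def graphs_of(triple, quads):
    """Return the graphs that contain exactly this triple."""
    return {g for s, p, o, g in quads if (s, p, o) == triple}


activation = ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000")

# Entity-level backlinks yield two candidate datasets for one statement.
print(sorted(candidate_datasets(activation, triples)))
# Triple-level provenance yields exactly the graph the statement was put in.
print(sorted(graphs_of(activation, quads)))
```

With backlinks on entities, the activation statement resolves to both `dataset:uniprot` and `top:secret`, so the applicable license cannot be determined from the merged data alone. With provenance attached to the statement itself, the lookup returns a single graph.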
Received on Tuesday, 4 June 2013 13:24:12 UTC