- From: Michel Dumontier <michel.dumontier@gmail.com>
- Date: Tue, 4 Jun 2013 17:26:21 +0200
- To: Jerven Bolleman <me@jerven.eu>
- Cc: Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
- Message-ID: <CALcEXf4p2cASD+N1+3MaHMx=dTDQe=vZrX7QDuskEieow4YdTg@mail.gmail.com>
i'm willing to compromise to "should", as it is generally seen as a good
practice to use void:inDataset for linking data items to datasets. We can
bring this discussion to the semantic-web mailing list, if you want
additional feedback.

m.

On Tue, Jun 4, 2013 at 5:14 PM, Jerven Bolleman <me@jerven.eu> wrote:

>
> On Tue, Jun 4, 2013 at 3:47 PM, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>
>> Hi Jerven,
>>
>> First: Bio2RDF's current datasets are listed here:
>>
>> http://bio2rdf.org/datasets
>>
>> as i mentioned, and has been presented [1] those that are in Bio2RDF
>> release 2 have provenance. e.g.
>> http://bio2rdf.org/geneid:123
>>
>> (we did not provide updates to uniprot, genbank, refseq, pubmed in
>> release 2, and they won't have provenance associated with them).
>>
>> Yes, I agree that the provenance of an assertion is more interesting, and
>> we are working towards implementing this for Bio2RDF. But if you were
>> worried about adding 1.6 billion relations, you'll be more worried about
>> adding 8 billion more to annotate each triple.
>>
> No it's only 16 graph id's ;) at least on the SPARQL endpoint. Using
> reification we would add 4*8 billion triples ...
> But with trix or n-quads dumps we would not need these triples as again we
> would do provenance on a graph level.
> Which I disagree with the MUST qualification not with a MAY qualification
> in the standard.
>
>>
>> m.
>>
>> [1]
>> http://www.slideshare.net/micheldumontier/bio2rdf-release-2-improved-coverage-interoperability-and-provenance-of-linked-data-for-the-life-sciences
>>
>> On Tue, Jun 4, 2013 at 3:23 PM, Jerven Bolleman <me@jerven.eu> wrote:
>>
>>> On Tue, Jun 4, 2013 at 3:05 PM, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>>>
>>>> On Tue, Jun 4, 2013 at 2:40 PM, Jerven Bolleman <me@jerven.eu> wrote:
>>>>
>>>>> On Tue, Jun 4, 2013 at 11:36 AM, Alasdair J G Gray <Alasdair.Gray@manchester.ac.uk> wrote:
>>>>>
>>>>>> On 3 Jun 2013, at 17:51, Michel Dumontier <michel.dumontier@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> About void:inDataset I personally don't like it. I suspect it would
>>>>>>> cost me a 13% growth in triple size for negligible benefits. This also
>>>>>>> means that the dataset description starts to affect the data. Although I
>>>>>>> could only present this in the rest / linked data interface and not in the
>>>>>>> sparql endpoint. I am worried that I can not put it into the FTP data dump
>>>>>>> rdf. As the data item concept does not map 1:1 on a set of triples that are
>>>>>>> atomic.
>>>>>>>
>>>>>> i'm not sure that i completely understand your objection. the primary
>>>>>> use of void:inDataset is to link data items to the dataset description, and
>>>>>> as such supports linked data applications without looking at the graph for
>>>>>> a potential, but un-guaranteed provenance description. Using void:inDataset
>>>>>> is normal practice in the RDF / linked data community. It would be strange
>>>>>> to not include it in any RDF dataset if you have the dataset description.
>>>>>>
>>>>>> http://www.w3.org/TR/void/#backlinks
>>>>>>
>>>>>>> e.g. someone can use just the UniProtKB sequences. Once they did
>>>>>>> that is it still the same dataset that I published it as? I don't think so.
>>>>>>> Which means uniprot end users need to be careful to remove more triples.
>>>>>>> Which is why I disagree with alasdair's call for MUST.
>>>>>>>
>>>>>> if one wanted to know which version/issue of uniprot that the
>>>>>> sequences came from, it would be necessary to provide access to the dataset
>>>>>> description. if the void:inDataset predicate is used, the user need not
>>>>>> even retrieve that to store locally, as you should provide resolution
>>>>>> services to those dataset descriptions.
>>>>>>
>>>>>> I also do not follow your objection. If you have created a file that
>>>>>> contains a subset of the data, then you can declare this to be a subset of
>>>>>> the parent-versioned-formatted dataset, ideally with some way of
>>>>>> distinguishing the content of the dataset.
>>>>>>
>>>>> I will try to explain my objections. The first is the dataset is a set
>>>>> of triples while the void:inDataset is a predicate on a single
>>>>> resource/entity/subject.
>>>>> So as I have 1.4 billion entities I would add 1.4 billion
>>>>> void:inDataset triples. Which to me seems like the incorrect thing to do.
>>>>>
>>>> we would like to know the provenance of every data item. if you define
>>>> 1.4 billion entities, then you should provide 1.4 billion links to their
>>>> provenance.
>>>>
>>>>> Well you say you should only add them to the "important" resources
>>>>> and then we are down to a 100 million of these statements.
>>>>> Yet for users who use slices of our data these void:inDataset triples
>>>>> are annoying/misleading especially if they merge them with their own
>>>>> sources.
>>>>>
>>>>> e.g.
>>>>>
>>>>> uniref:UniRef100_ up:sequenceFor uniprot:P12345 .
>>>>> uniprot:P12345 a up:Protein ;
>>>>>     void:inDataset dataset:uniprot .
>>>>> dataset:uniprot dcterms:licence cc:by-sa-v3 .
>>>>> uniprot:P12345 roche:activatedBy secretdrugchemical:1000 .
>>>>> secretdrugchemical:1000 void:inDataset top:secret .
>>>>>
>>>>> Given these triples what is the license for knowledge about
>>>>> secretdrugchemical:1000 activating uniprot:P12345?
>>>>>
>>>>> The dataset description is about a set of data, not single triples so
>>>>> single back links seem to me to be the incorrect solution?
>>>>>
>>>> the focus on the assertion(s) is perfectly fine. several mechanisms
>>>> have now been proposed; nanopublications [1], micropublications [2] and
>>>> ovopubs [3]
>>>>
>>>> [1]
>>>> http://www.w3.org/wiki/images/c/c0/HCLSIG$$SWANSIOC$$Actions$$RhetoricalStructure$$meetings$$20100215$cwa-anatomy-nanopub-v3.pdf
>>>> [2] http://arxiv.org/abs/1305.3506
>>>> [3] http://arxiv.org/abs/1305.6800
>>>>
>>>>>> From all the scenarios I have encountered, scientists (not just in
>>>>>> the healthcare and life sciences) care about where their data has come from
>>>>>> and what version it is. As such, we need some way to allow for the linking
>>>>>> of data back to the description of the data.
>>>>>>
>>>>> Of course I don't disagree with the usecase. I disagree with the
>>>>> chosen solution because it is on the wrong level of granularity.
>>>>>
>>>> it's not wrong, it's just at a level that you don't want to provide.
>>>> We do it in Bio2RDF, and now each of our data items from Release 2 are
>>>> linked accordingly.
>>>>
>>> No you don't see http://bio2rdf.org/page/beilstein:1900390
>>>
>>> Also you put it on the entity/subject while what is interesting is the
>>> provenance of the triple.
>>>
>>> The provenance is on the triple in your linked papers not in the bio2rdf
>>> case or the void:inDataset case.
>>>
>>> Regards,
>>> Jerven
>>>
>>>> m.
>>>>
>>>>>> Alasdair
>>>>>>
>>>>>> Dr Alasdair J G Gray
>>>>>> Research Associate
>>>>>> Alasdair.Gray@manchester.ac.uk
>>>>>> +44 161 275 0145
>>>>>>
>>>>>> http://www.cs.man.ac.uk/~graya/
>>>>>>
>>>>>> Please consider the environment before printing this email.
>>>>>>
>>>>>
>>>>> --
>>>>> Jerven Bolleman
>>>>> me@jerven.eu
>>>>>
>>>>
>>>> --
>>>> Michel Dumontier
>>>> Associate Professor of Bioinformatics, Carleton University
>>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>>>> Group
>>>> http://dumontierlab.com
>>>>
>>>
>>> --
>>> Jerven Bolleman
>>> me@jerven.eu
>>>
>>
>> --
>> Michel Dumontier
>> Associate Professor of Bioinformatics, Carleton University
>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>> Group
>> http://dumontierlab.com
>>
>
> --
> Jerven Bolleman
> me@jerven.eu
>

--
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com
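The granularity question debated in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something either party proposed. The graph names graph:uniprot and graph:private, the helper function names, and the tuple representation are hypothetical. The resource identifiers and the figures (1.4 billion entities, 8 billion triples, 16 graphs, 4 triples per reified statement) are taken from the messages above.

```python
# Entity-level backlinks, as in the quoted Turtle example (void:inDataset).
IN_DATASET = {
    "uniprot:P12345": "dataset:uniprot",
    "secretdrugchemical:1000": "top:secret",
}

# Graph-level provenance: each quad carries a graph name.
# The graph names are hypothetical, chosen only for this sketch.
QUADS = [
    ("uniref:UniRef100_", "up:sequenceFor", "uniprot:P12345", "graph:uniprot"),
    ("uniprot:P12345", "rdf:type", "up:Protein", "graph:uniprot"),
    ("uniprot:P12345", "roche:activatedBy", "secretdrugchemical:1000", "graph:private"),
]
GRAPH_DATASET = {
    "graph:uniprot": "dataset:uniprot",
    "graph:private": "top:secret",
}


def datasets_via_backlinks(subject, obj):
    """Datasets reachable from a triple's subject and object via backlinks."""
    return {IN_DATASET[term] for term in (subject, obj) if term in IN_DATASET}


def dataset_via_graph(graph):
    """Dataset of a statement when provenance is recorded per graph."""
    return GRAPH_DATASET[graph]


def extra_statements(entities, triples, graphs):
    """Rough count of added provenance statements for each approach."""
    return {
        "entity_backlinks": entities,  # one void:inDataset triple per entity
        "reification": 4 * triples,    # rdf:Statement plus subject/predicate/object
        "named_graphs": graphs,        # one dataset description per graph
    }


if __name__ == "__main__":
    s, p, o, g = QUADS[2]
    print(sorted(datasets_via_backlinks(s, o)))
    print(dataset_via_graph(g))
    print(extra_statements(1_400_000_000, 8_000_000_000, 16))
```

With backlinks, the roche:activatedBy statement resolves to two candidate datasets, which is the licence ambiguity raised in the merged-data example. With a graph name on the statement it resolves to exactly one.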
Received on Tuesday, 4 June 2013 15:27:13 UTC