Re: RDF Datasets with provenance data from Michel Dumontier on 2016-09-23 (semantic-web@w3.org from September 2016)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Fri, 23 Sep 2016 15:14:40 -0700
To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Cc: Mark Wallace <mwallace@modusoperandi.com>, David Booth <david@dbooth.org>, Kay Müller <kay.mueller@informatik.uni-leipzig.de>, "semantic-web@w3.org" <semantic-web@w3.org>, Johannes Frey <frey@informatik.uni-leipzig.de>
Message-ID: <CALcEXf7bKx+w2JgO69c4DcZSx5fyc6c4F7C-B4zhW6Fw0ib8kQ@mail.gmail.com>

Hi Sebastian,
 Bio2RDF provides its data as N-Quads, in which the graph name is
annotated with dataset metadata.
  For an example, see http://download.bio2rdf.org/release/3/drugbank/ ,
where the .nq file holds the provenance data.
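
As a rough illustration of that layout (not the actual Bio2RDF release
script or vocabulary), here is a minimal Python sketch. All example.org
IRIs, the choice of Dublin Core terms, and the helper names are
assumptions made for the example. Data quads share one graph name, and
that graph name is then used as the subject of the metadata statements.

```python
from collections import defaultdict

# Hypothetical IRIs, for illustration only (not taken from any Bio2RDF release).
EX = "http://example.org/"
GRAPH = EX + "graph/drugs-2016-09"
DCT_SOURCE = "http://purl.org/dc/terms/source"
DCT_CREATED = "http://purl.org/dc/terms/created"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Quads requires."""
    return "<" + value + ">"


def literal(value):
    """Serialize a plain string literal (only backslashes and quotes are escaped)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def quad(s, p, o, g):
    """Build one N-Quads line: subject predicate object graph ."""
    return " ".join([s, p, o, g, "."])


def build_dataset():
    """Return data quads in one named graph plus metadata about that graph."""
    g = iri(GRAPH)
    data = [
        quad(iri(EX + "drug/1"), iri(EX + "name"), literal("aspirin"), g),
        quad(iri(EX + "drug/2"), iri(EX + "name"), literal("ibuprofen"), g),
    ]
    # The graph name itself is the subject of the provenance statements,
    # which are kept in a separate (assumed) metadata graph.
    meta_graph = iri(EX + "graph/metadata")
    meta = [
        quad(g, iri(DCT_SOURCE), iri(EX + "source/dump.xml"), meta_graph),
        quad(g, iri(DCT_CREATED), literal("2016-09-23"), meta_graph),
    ]
    return data + meta


def provenance_for(lines, graph_iri):
    """Collect predicate/object pairs whose subject is the given graph name.

    This is a naive splitter that only handles lines produced by quad()
    above; it is not a general N-Quads parser.
    """
    found = defaultdict(list)
    for line in lines:
        s, p, rest = line.split(" ", 2)
        if s == iri(graph_iri):
            # rest is "<object> <graph> ."; drop the trailing graph and dot.
            found[p].append(rest.rsplit(" ", 2)[0])
    return dict(found)
```

Calling provenance_for(build_dataset(), GRAPH) gathers the metadata
attached to the graph. The same pattern works with one triple per
graph if per-triple provenance is wanted, at the cost of many more
graph names.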

m.
Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com


On Fri, Sep 23, 2016 at 2:58 PM,  <hellmann@informatik.uni-leipzig.de> wrote:
> Hi David and Mark,
> both your answers were not helpful, sorry.
> We are looking for triple datasets that have metadata, i.e. serialized
> downloadable files in any format (N3, N-Quads, TriX, etc.) that come with
> sensible metadata (provenance, last updated/update frequency) or, as an
> alternative, triples converted from a legacy source where we could easily
> extend the extractor software to spew out useful metadata per triple.
>
> An example would be the datasets in the meta section here:
> http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/
>
> Thanks,
> Sebastian
>
> On 23 September 2016 17:16:43 CEST, Mark Wallace
> <mwallace@modusoperandi.com> wrote:
>>
>> I like David's guidance.
>>
>> We have projects which require provenance on individual facts/triples (as
>> opposed to groups of them). As David mentions, one alternative is to use a
>> named graph for each triple (it acts like a statement ID in this case).
>> Another alternative is to use RDF Reification[1] to create a statement ID
>> (resource) to which provenance can be "attached." The reification approach
>> requires lots more triples, but it has the advantage in our case of leaving
>> named graphs for other uses. In such cases, the provenance triples can be
>> 10x larger than the data set. For performance reasons, we sometimes put the
>> provenance triples in a separate repository/store, and query/join them
>> (using federated queries) only when the provenance is needed.
>>
>> [1] https://www.w3.org/TR/rdf11-mt/#whatnot
>>
>> --
>> Mark Wallace
>> PRINCIPAL ENGINEER, SEMANTIC APPLICATIONS
>> MODUS OPERANDI, INC.
>>
>> -----Original Message-----
>> From: David Booth [mailto:david@dbooth.org]
>> Sent: Friday, September 23, 2016 10:45 AM
>> To: Kay Müller <kay.mueller@informatik.uni-leipzig.de>;
>> semantic-web@w3.org
>> Cc: Johannes Frey <frey@informatik.uni-leipzig.de>; Sebastian Hellmann
>> <hellmann@informatik.uni-leipzig.de>
>> Subject: Re: RDF Datasets with provenance data
>>
>> On 09/23/2016 10:07 AM, Kay Müller wrote:
>>>
>>>  Dear Sir/Madam,
>>>
>>>  My name is Kay Mueller and I am a researcher at the University of
>>>  Leipzig. Currently we are planning to evaluate whether it is feasible
>>>  to store provenance and metadata for each triple in a graph, hence we
>>>  are wondering whether you are aware of any dataset which either stores
>>>  data at the triple level or which could be converted into this format
>>>  (e.g. Yago, Wikidata).
>>
>>
>> The usual technique for associating provenance or other metadata with
>> certain triples is to put those triples into a named graph, and make the
>> provenance/metadata assertions about that named graph.  A named graph can
>> hold any number of triples, so it could hold a single triple if you want to
>> be that fine grained.  But triples are not usually created individually --
>> they are usually created in bunches -- so for efficiency one would usually
>> create a named graph containing multiple triples that all have the same
>> provenance.
>>
>> All major "triplestores" -- quad stores really -- and SPARQL servers
>> support named graphs.
>>
>> David Booth
>>
>>>
>>>  We would be very grateful, if you could give us any pointers to
>>>  datasets, related work, etc.
>>>
>>>  Thank you very much in advance.
>>>  --
>>>  Kind regards
>>>
>>>  Kay Müller
>>>
>>>  AKSW/KILT <http://aksw.org/Groups/KILT.html>
>>>   Office: InfAI e.V., Hainstr. 11, Room 101a, 04109 Leipzig, Germany
>>>  Homepage: http://aksw.org/KayMueller.html
>>>  My Twitter <https://twitter.com/mullekay>
>>>  My LinkedIn <https://de.linkedin.com/in/mullerkay>
>>>  My Xing <https://www.xing.com/profile/Kay_Mueller12>
>>>  My GitHub <https://github.com/mullekay>
>>>  My Google Scholar <https://scholar.google.de/citations?user=8tFijv0AAAAJ>
>>
>
> --
> This message was sent from my Android phone with K-9 Mail.

Received on Friday, 23 September 2016 22:15:28 UTC