Call for contributions
========================
ProvBench (co-located with Provenance Week (IPAW + TAPP))
https://sites.google.com/site/provbench/home/provbench-provenance-week-2014
Cologne, 13th of June, 2014
========================

ProvBench: Benchmarking Provenance Management Systems
2nd edition: Call for benchmarking datasets

Background
----------

Provenance metadata, or metadata that describes the origins of data, is now widely recognized as a key ingredient for numerous applications, both traditional and novel. For example, provenance can be used to inspect the quality of data provided by third parties, to identify active members in social network analytics, and to ensure correct attribution, citation, and licensing.

The growing number of provenance-related proposals and systems creates the need for a well-documented and impartial provenance corpus that researchers and system developers can use to test and validate their provenance management systems (ProvMS), including storage techniques for large provenance graphs, query models, and analysis algorithms. These systems are currently tested and assessed on proprietary provenance datasets, which makes it difficult to benchmark and compare different implementations. On the other hand, benchmark datasets are already available for a wide variety of generic DBMS, upon which many ProvMS implementations are based; these generic systems include RDF triple stores, native graph DBMS, relational DBMS, and more. Thus, the questions we aim to answer include:

- Is there in fact a need for new benchmark datasets that are specific to provenance data and that reflect its usages (for instance: system-level provenance, provenance of web pages (MediaWiki), provenance of a software project, provenance of scientific workflows, provenance of human processes)?
- Does provenance exhibit typical data or query patterns that may suggest ways to optimize either storage or query processing?
- To what realistic sizes, and at what rate, does provenance data accumulate in different settings, and when does size begin to pose a problem for storage and query processing?

Objective
---------

With these questions in mind, ProvBench seeks to build upon the tradition of database benchmarks (e.g. relational, RDF). Its purpose is to collect a corpus of provenance datasets, along with associated query workloads, that are at the same time:

- broad: representative of a variety of provenance usage scenarios;
- specific to provenance data (as opposed to general RDF, graph, or relational benchmarking datasets);
- challenging to provenance management systems (scalable storage, query performance).

Why do this?
------------

You will not get a formal paper publication out of this, as we cannot include your documentation in the TAPP/IPAW proceedings. However, you will get a data publication with an official DOI. The datasets will be cited by members of the community who make use of them in their publications. To encourage this practice, the datasets accepted in ProvBench will be assigned DOIs, allocated with the help of FigShare.

Submissions
-----------

Submissions can be entirely new, or they can be new versions or refinements of submissions to the first edition of ProvBench. Submissions should include a dataset and accompanying documentation, as specified below. Contributors should email Khalid Belhajjame (kbelhajj@googlemail.com) for access to the ProvBench GitHub repository[4].

Each submission shall consist of:

- A dataset (provenance trace). Multiple distinct datasets can be submitted; these, however, should be "similar" provenance traces at differing scale, derived from the same original data source. Traces can be serialized in any of the W3C PROV encodings[1], either official (PROV Notation, PROV-O) or unofficial (PROV-XML, PROV-JSON); a minimal serialization sketch is given at the end of this section.
- A query workload. Lacking a standard query language for provenance, queries are to be expressed in natural language and must be sufficiently precise to allow for unambiguous implementation (for example: "retrieve all activities that directly or transitively used a given entity").
- Metadata: size (number of entities, activities, relationships), format, authorship, etc.
- Rationale and documentation for the submission, including:
  - the type of scenario that the submission is representative of, along with any background information useful to understand the domain;
  - what the dataset and its accompanying queries can be used to test;
  - what makes the dataset distinct from generic DBMS benchmarks;
  - what makes the submission challenging;
  - how the dataset has been used to test specific properties of a ProvMS.

Note: The rationale document does not constitute a paper, and will not be published in proceedings. Companion papers, if desired, should be submitted to TAPP[2] or IPAW[3].
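To illustrate the accepted trace encodings, the sketch below builds a toy trace and serializes it in two of them. It uses the open-source "prov" Python package (pip install prov); this is one possible tooling choice rather than a requirement of this call, and the namespace and identifiers (ex:alice, ex:analysis1, ex:dataset1) are purely illustrative.

    # Minimal sketch, assuming the third-party "prov" package is installed.
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')

    # A toy trace: one agent runs one activity, which generates one entity.
    doc.agent('ex:alice')
    doc.activity('ex:analysis1', '2014-05-01T09:00:00', '2014-05-01T10:00:00')
    doc.entity('ex:dataset1')
    doc.wasAssociatedWith('ex:analysis1', 'ex:alice')
    doc.wasGeneratedBy('ex:dataset1', 'ex:analysis1', '2014-05-01T10:00:00')

    print(doc.get_provn())               # official PROV Notation (PROV-N)
    print(doc.serialize(format='json'))  # unofficial PROV-JSON

Generating the same trace programmatically at several sizes is also a convenient way to produce the "similar traces at differing scale" requested above.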
The Event
---------

The day of the event will be organised as a mixture of presentations, a mini-hackathon, and panel sessions, depending on the number of submissions and participants. A detailed agenda will be announced a few weeks prior to the event. Note that you have to register for Provenance Week 2014 in order to attend this event.

Important Dates (tentative)
---------------------------

Expression of interest: May 2nd, 2014
Submission deadline: May 9th, 2014
Notification: June 1st, 2014

Organisers
----------

Khalid Belhajjame, Université Paris-Dauphine
Adriane Chapman, The MITRE Corporation
Hugo Firth, Newcastle University
Paolo Missier, Newcastle University
Jun Zhao, Lancaster University

References
----------

[1]: http://www.w3.org/TR/prov-overview/
[2]: http://provenanceweek.dlr.de/tapp/call-participation/
[3]: http://provenanceweek.dlr.de/ipaw/call-participation/
[4]: https://github.com/provbench