Call for contributions
========================
ProvBench (co-located with Provenance Week (IPAW + TAPP))
https://sites.google.com/site/provbench/home/provbench-provenance-week-2014
Cologne, 13th of June, 2014
========================

ProvBench: Benchmarking Provenance Management Systems
2nd edition: Call for benchmarking datasets

Background
----------

Provenance metadata, or metadata that describes the origins of data, is now widely recognized as a key ingredient for numerous applications, both traditional and novel. For example, provenance can be used to inspect the quality of data provided by third parties, to identify active members in social network analytics, and to ensure correct attribution, citation, and licensing.

The growing number of provenance-related proposals and systems creates the need for a well-documented and impartial provenance corpus that researchers and system developers can use to test and validate their provenance management systems (ProvMS), including storage techniques for large provenance graphs, query models, and analysis algorithms. These systems are currently tested and assessed on proprietary provenance datasets, which makes it difficult to benchmark and compare different implementations. On the other hand, benchmark datasets are already available for a wide variety of generic DBMS, upon which many ProvMS implementations are based; these generic systems include RDF triple stores, native graph DBMS, relational DBMS, and more. Thus, the questions we aim to answer include:

- Is there in fact a need for new benchmark datasets that are specific to provenance data and that reflect its usages (for instance: system-level provenance, provenance of web pages (MediaWiki), provenance of a software project, provenance of scientific workflows, provenance of human processes)?
- Does provenance exhibit typical data or query patterns that may suggest ways to optimize either storage or query processing?
- To what realistic sizes, and at what rate, does provenance data accumulate in different settings, and when does size begin to pose a problem for storage and query processing?

Objective
---------

With these questions in mind, ProvBench seeks to build upon the tradition of database benchmarks (e.g. relational, RDF). Its purpose is to collect a corpus of provenance datasets, along with associated query workloads, that are at the same time:

- broad: representative of a variety of provenance usage scenarios;
- specific to provenance data (as opposed to general RDF, graph, or relational benchmarking datasets);
- challenging to provenance management systems (scalable storage, query performance).

Why do this?
------------

You will not get a formal paper publication out of this, as we cannot include your documentation in the TAPP/IPAW proceedings. However, you will get a data publication with an official DOI. The datasets will be cited by members of the community who make use of them in their publications. To encourage this practice, the datasets accepted in ProvBench will be assigned DOIs, allocated with the help of FigShare.

Submissions
-----------

Submissions can be entirely new, or they can be new versions or refinements of submissions to the first edition of ProvBench. Submissions should include a dataset and accompanying documentation, as specified below. Contributors should email Khalid Belhajjame (kbelhajj@googlemail.com) for access to the ProvBench GitHub repository[4].

Each submission shall consist of:

- A dataset (provenance trace). Multiple distinct datasets can be submitted; these, however, should be "similar" provenance traces at differing scale, derived from the same original data source. Traces can be serialized in any of the W3C PROV encodings[1], either official (PROV Notation, PROV-O) or unofficial (PROV-XML, PROV-JSON); a minimal serialization sketch is given at the end of this section.
- A query workload. Lacking a standard query language for provenance, queries are to be expressed in natural language and must be sufficiently precise to allow for unambiguous implementation (for example: "retrieve all activities that directly or transitively used a given entity").
- Metadata: size (number of entities, activities, relationships), format, authorship, etc.
- Rationale and documentation for the submission, including:
  - the type of scenario that the submission is representative of, along with any background information useful to understand the domain;
  - what the dataset and its accompanying queries can be used to test;
  - what makes the dataset distinct from generic DBMS benchmarks;
  - what makes the submission challenging;
  - how the dataset has been used to test specific properties of a ProvMS.

Note: The rationale document does not constitute a paper, and will not be published in proceedings. Companion papers, if desired, should be submitted to TAPP[2] or IPAW[3].
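To illustrate the accepted trace encodings, the sketch below builds a toy trace and serializes it in two of them. It uses the open-source "prov" Python package (pip install prov); this is one possible tooling choice rather than a requirement of this call, and the namespace and identifiers (ex:alice, ex:analysis1, ex:dataset1) are purely illustrative.

    # Minimal sketch, assuming the third-party "prov" package is installed.
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')

    # A toy trace: one agent runs one activity, which generates one entity.
    doc.agent('ex:alice')
    doc.activity('ex:analysis1', '2014-05-01T09:00:00', '2014-05-01T10:00:00')
    doc.entity('ex:dataset1')
    doc.wasAssociatedWith('ex:analysis1', 'ex:alice')
    doc.wasGeneratedBy('ex:dataset1', 'ex:analysis1', '2014-05-01T10:00:00')

    print(doc.get_provn())               # official PROV Notation (PROV-N)
    print(doc.serialize(format='json'))  # unofficial PROV-JSON

Generating the same trace programmatically at several sizes is also a convenient way to produce the "similar traces at differing scale" requested above.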
The Event
---------

The day of the event will be organised as a mixture of presentations, a mini-hackathon, and panel sessions, depending on the number of submissions and participants. A detailed agenda will be announced a few weeks prior to the event. Note that you have to register for Provenance Week 2014 in order to attend this event.

Important Dates (tentative)
---------------------------

Expression of interest: May 2nd, 2014
Submission deadline: May 9th, 2014
Notification: June 1st, 2014

Organisers
----------

Khalid Belhajjame, Université Paris-Dauphine
Adriane Chapman, The MITRE Corporation
Hugo Firth, Newcastle University
Paolo Missier, Newcastle University
Jun Zhao, Lancaster University

References
----------

[1]: http://www.w3.org/TR/prov-overview/
[2]: http://provenanceweek.dlr.de/tapp/call-participation/
[3]: http://provenanceweek.dlr.de/ipaw/call-participation/
[4]: https://github.com/provbench