Action-26 : PROV scenarios and quality

Dear all,

Before it's really too late, I want to report a bit on
http://www.w3.org/2013/dwbp/track/actions/26

The paper is
http://www.ijdc.net/index.php/ijdc/article/view/203

It summarizes the requirements from the PROV use cases. Large chunks of these use cases are relevant for us, as the scenarios put the emphasis on presenting credible, trusted content, with correct attribution (for credit) including affiliation, and content that is up-to-date. In other words, they are about enabling the assessment of the context in which the data was created, its quality and validity, and the appropriate conditions for use.

A specific issue is that the processes around data collection and management are not always clearly documented, although they are crucial for re-use. A notable point is that the source of the information is often not apparent from data aggregated from the Web. It is also important to determine whether content was modified, and by whom. In general, one would like to inspect how the content was constructed and from what sources. It may also be handy to have a trust rating, and a description of how that rating was derived from those sources...

The general goal is to get to a more principled (and, if possible, automated) way to:
- determine whether a Web document or resource can be used based on the original source of the content
- ascertain whether or not to trust it by examining the processes that created, aggregated, and delivered it, as well as who was responsible for these processes
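
To make these two goals a bit more concrete, here is a minimal sketch in Python. It is not taken from the paper or from PROV: the function names, the `derived_from` mapping and the resource names are all hypothetical, and "trust" is reduced to a plain set of trusted sources, which is a strong simplification.

```python
from typing import Dict, List, Set


def original_sources(resource: str, derived_from: Dict[str, List[str]]) -> Set[str]:
    """Follow derivation links back to the resources that have no recorded parent."""
    roots: Set[str] = set()
    seen: Set[str] = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the derivation links
        seen.add(current)
        parents = derived_from.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            roots.add(current)  # nothing recorded upstream: treat as an original source
    return roots


def can_be_used(resource: str,
                derived_from: Dict[str, List[str]],
                trusted_sources: Set[str]) -> bool:
    """Accept a resource only if all of its original sources are trusted."""
    roots = original_sources(resource, derived_from)
    return bool(roots) and roots <= trusted_sources


# Hypothetical example: an aggregated page built from two upstream sources.
derived_from = {
    "aggregated-page": ["cleaned-table", "news-feed"],
    "cleaned-table": ["census-2011"],
}

print(sorted(original_sources("aggregated-page", derived_from)))
print(can_be_used("aggregated-page", derived_from, {"census-2011"}))
```

In a real setting the decision would of course also look at the processes and the responsible agents, not only at the original sources.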

(NB: I've sometimes kept the original focus of the PROV scenarios on 'content'. But in our own scenarios, I believe PROV's 'content' corresponds straightforwardly to 'data' or 'dataset'.)

Table 1 in the paper presents, in a nicely formalized way, the aspects that provenance models should address. I think that the items in the 'content' category are the most relevant for us (for each item I add details that I took from other parts of the paper):

- object of the provenance. I'm emphasizing it because this object may be a specific part of a dataset, beyond the dataset itself (which is what we'd expect in most of our scenarios). We may want to emphasize here that a dataset is about a certain topic, helping to assert its relevance for given scenarios.

- attribution: source of the data, entities responsible for producing it (persons and/or organizations). This includes statements of responsibility or checks/endorsements by (independent) third parties, as well as mechanisms for representing and certifying (signatures) the origin of the attribution statements themselves.

- process: statements about the creation of data. The level of detail may vary. One possible outcome is to allow reproducibility: re-creating the data from primary sources. This includes details on the gathering and organization of data, including transcription and conversion from original data (for integration purposes).

- versioning and evolution. If the original source of some data is updated (e.g. by retracting claims or data), then re-users should be able to update their content too. Versions should have their own provenance info. The level of detail for representing updates may vary, from indicating that there's a new version, to indicating precisely what the changes (deltas) were.

- justification of decisions and entailments when facts (conclusions) are derived from other facts. To me both seem to be part of the 'process' category, and too detailed for us (as they could be domain-dependent). But as I'm not sure, I'm mentioning them nonetheless.
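
As a rough illustration of how the first four items could fit together in one description, here is a small Python sketch. It is only an assumption on my side, not something defined in the paper or in PROV: all class and field names (`ProvenanceRecord`, `Attribution`, `ProcessStep`, `revision_of`, `changes`...) are hypothetical, and the signature is just a placeholder string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribution:
    agent: str                       # person or organization responsible
    role: str                        # e.g. "creator" or "endorser" (third-party check)
    signature: Optional[str] = None  # placeholder for a certification mechanism


@dataclass
class ProcessStep:
    description: str                 # e.g. "transcription", "format conversion"
    inputs: List[str] = field(default_factory=list)  # primary sources used


@dataclass
class ProvenanceRecord:
    obj: str                         # object of the provenance: a dataset or a part of it
    attributions: List[Attribution] = field(default_factory=list)
    process: List[ProcessStep] = field(default_factory=list)
    version: str = "1"
    revision_of: Optional["ProvenanceRecord"] = None  # each version keeps its own record
    changes: Optional[str] = None    # free-text delta; may be left out entirely

    def history(self) -> List[str]:
        """List version labels from this record back to the first one."""
        labels: List[str] = []
        record: Optional[ProvenanceRecord] = self
        while record is not None:
            labels.append(record.version)
            record = record.revision_of
        return labels


# Hypothetical dataset with one update that retracts some data.
v1 = ProvenanceRecord(
    obj="survey-results",
    attributions=[Attribution(agent="Example Institute", role="creator")],
    process=[ProcessStep("conversion from original spreadsheets", ["raw-sheets"])],
)
v2 = ProvenanceRecord(
    obj="survey-results",
    attributions=v1.attributions,
    version="2",
    revision_of=v1,
    changes="retracted three erroneous rows",
)

print(v2.history())
```

The optional `changes` field reflects the point that the level of detail for updates may vary: a record can simply point to the version it revises, or also say what changed.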



The 'management' category lists items that are relevant to our efforts (licensing, publication, access). But to me they seem less relevant to data on 'quality' as we've started to scope it.

The 'use' category seems to include rather meta-level considerations (trust, levels of abstraction, understandability, interoperability) that apply to the provenance metadata itself. It is not about the provenance vocabulary trying to capture quality aspects of the content/dataset that is being tracked.
I'm not saying that such criteria should never be considered in the description of quality for data. On the contrary! It's just that I would recommend considering them at our own level, rather than at the meta-level of the provenance data itself.

Finally, the paper notes that publishing provenance information may be restricted for confidentiality reasons. I believe our own scenarios may come with similar constraints, but it's difficult for me to envision more specific details. The PROV work I've seen also leaves it at the general requirement of being able to publish (and exploit) only partial information.

I have further work to do on this, including trying to merge it with what is at
https://www.w3.org/2013/dwbp/wiki/Quality_and_Granularity_Description_Vocabulary
But this will have to wait a bit more!

Best,

Antoine

Received on Friday, 9 May 2014 10:27:03 UTC