Re: DQV, ISO 19115/19157 and GeoDCAT-AP

Dear Antoine,

Thanks for giving me the opportunity to comment, and sincere apologies 
for my late reply. We were busy with the final version of the GeoDCAT-AP 
specification, which was released just before Christmas, and I didn't 
manage to reply to your mail earlier.

Please find my comments inline. Sorry in advance for the long mail.

On 04/12/2015 20:08, Antoine Isaac wrote:
> Dear Andrea,
>
> We have owed you an update on your original feedback email at [1] for
> three months now. Besides an initial answer [2], you may have seen that
> we've been actively discussing your comments, notably around Issue-202,
> which we raised after your comments [3].
>
> The status is currently:
>
> 1. We have not followed the PROV paths to express conformance. While
> relevant for many cases, we believe it is too complex for our simple
> requirements, and would count on provenance-focused applications
> (including ones seeking to implement ISO or GeoDCAT-AP) to come up with
> their own PROV patterns for representing how a conformance statement has
> been produced.

I see the point. Indeed, the PROV-based pattern is quite cumbersome, 
especially compared with the original EARL-based one. However, this 
decision was taken in the framework of the GeoDCAT-AP WG because of 
PROV's more general-purpose nature and its widespread adoption.

Since the DWBP WG is discussing this issue, in GeoDCAT-AP we have marked 
the PROV-based solution as a "provisional" mapping, which may be 
replaced in the future by the approach defined in DQV, should it be 
able to address GeoDCAT-AP requirements.
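
To give a flavour of why the pattern is cumbersome: below is a rough 
sketch, in Turtle, of how a conformance result can be represented via 
PROV. Note that this is only an illustration of the general idea, not 
the normative GeoDCAT-AP mapping, and all URIs are made up.

@prefix dct:  <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

# The conformance test, modelled as an activity that used the dataset,
# followed a plan derived from the specification, and generated the
# conformance result.
<http://example.org/activity/conformance-test> a prov:Activity ;
    prov:used <http://example.org/dataset/1> ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:hadPlan [
            a prov:Plan ;
            prov:wasDerivedFrom <http://example.org/standard/some-spec>
        ]
    ] ;
    prov:generated [
        a prov:Entity ;
        dct:type <http://example.org/degree-of-conformity/conformant>
    ] .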

> 2. We have made amendments on the DQV specification for expressing
> conformance in a way that follows the simpler DCAT-AP pattern using
> dcterms:conformsTo [4].
> We would by the way welcome feedback on our example: is the way we
> introduce GeoDCAT as a dcterms:Standard appropriate, or would you prefer
> us to adapt the example on "COMMISSION REGULATION (EC) No 976/2009" at [5]?

Your example is perfectly fine with me, thank you.
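
For readers who don't have the drafts at hand, the pattern in question 
boils down to something like the following (a minimal sketch; all URIs 
are made up):

@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .

<http://example.org/dataset/1> a dcat:Dataset ;
    dct:conformsTo <http://example.org/standard/geodcat-ap> .

<http://example.org/standard/geodcat-ap> a dct:Standard ;
    dct:title "GeoDCAT-AP"@en .

Compared with the PROV-based sketch above, this is clearly much easier 
to produce and to query.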

> 3. We have discussed your suggestion of introducing "not evaluated" and
> "not conformant". We are convinced that adding "not conformant" would
> be very useful [5].
> However, if we implement it, are you aware of a property and a value
> (URI) to build such a statement in RDF?

None that I'm aware of.

About this, I think the issue is not only the ability to express 
whether data are conformant with a given (quality) standard, but also 
to record how the conformance test was carried out, by whom, and when.

Besides being the approach used in ISO, this information is important 
for a number of reasons.

One of them is that, in many cases, quality control is not something 
that is done once, but needs to be carried out on a regular basis. 
This happens, e.g., for datasets that are regularly updated (dynamic 
data).

Another one is the ability to control and verify how the quality check 
was carried out, for instance in order to be able to reproduce it. 
Providing all the information needed to "reproduce an experiment" is 
common practice in scientific publications, and the same principle can 
be applied to data. This is also a matter of transparency.

A final example is that, in some cases, the final conformance result 
may be determined by the aggregation of multiple conformance tests, 
each concerning a specific criterion. An example is provided in the 
EARL specification itself, whose examples refer to WCAG, where 
conformance depends on a set of criteria ("general techniques") being 
checked.


Based on that, I see two levels for the specification of data quality:


1. The first level is about dataset filtering in the discovery phase, 
where I might just want to retrieve the datasets conformant with a 
given (quality) standard. In such cases, properties expressing just 
whether data are conformant (dct:conformsTo) or not (??) can do the job.


2. The second level concerns two classes of actors:

(a) Those who manage data. This is about the ability to record the 
details of conformance tests and results, to use them in the data 
management workflow, and to expose the final results to users, 
possibly in an aggregated form (i.e., as in scenario #1).

(b) Users who would like to contribute feedback/reviews on data 
quality. Note that such users can also be third parties engaged 
directly by data custodians (as is the case, e.g., for quality 
certificates or peer reviews of data).

For these cases, EARL might be the appropriate tool in a SemWeb / LD 
context. Or, at least, EARL provides the vocabulary to specify the 
relevant information. This includes "outcome values" of test results 
that are not limited to conformant / not conformant (which would 
suffice in scenario #1). Notably, EARL defines five possible outcome 
values (see http://www.w3.org/TR/EARL10-Schema/#OutcomeValue) - quoting:

[[
earl:passed
   Passed - the subject passed the test.
earl:failed
   Failed - the subject failed the test.
earl:cantTell
   Cannot tell - it is unclear if the subject passed or failed the test.
earl:inapplicable
   Inapplicable - the test is not applicable to the subject.
earl:untested
   Untested - the test has not been carried out.
]]

In this scenario, values like "not evaluated" (earl:untested) are also 
relevant, since they can be used internally to plan quality checks.
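
As an illustration, a single test could be recorded along the 
following lines (a sketch in Turtle; all URIs are made up):

@prefix earl: <http://www.w3.org/ns/earl#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

[] a earl:Assertion ;
    earl:assertedBy <http://example.org/agent/validator> ; # by whom
    earl:subject <http://example.org/dataset/1> ;          # on what
    earl:test <http://example.org/criterion/completeness> ;
    earl:mode earl:automatic ;                             # how
    earl:result [
        a earl:TestResult ;
        earl:outcome earl:passed ;
        dct:date "2016-01-08"^^xsd:date                    # when
    ] .

This records by whom, on what, against which criterion, how, and when 
the test was carried out - i.e., exactly the information needed for 
the reproducibility and transparency purposes mentioned above.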


From what I can see, DQV addresses all these user scenarios, but it is 
unclear to me whether it can encode conformance test results with the 
level of detail described above.

> 4. The example of conformance is triggering a discussion on how to
> indicate that it's fine to use DQV to indicate the quality of metadata,
> not only 'original' datasets. This is less relevant to your original
> comment, but you may want to chime in.

This is actually one of the issues my colleagues and I have been 
dealing with in our work. So, +1 from me.

I can contribute an existing example from the INSPIRE Geoportal.

The INSPIRE Geoportal harvests metadata records (100K+) from catalogue 
services operated by EU Member States. As a post-processing step, the 
geoportal infrastructure carries out a validation test against a set 
of criteria and generates a validation report.

This procedure is not meant to decide whether a metadata record can be 
published or not: all records are published, irrespective of whether 
they have passed validation. Rather, these reports are meant to 
provide (meta)data providers with precise information on the issues 
identified.
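
Just to sketch how this might look in DQV terms - assuming the quality 
annotation pattern can be applied to catalogue records, which is 
precisely the open point; all URIs are made up:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix oa:   <http://www.w3.org/ns/oa#> .

# A harvested metadata record, annotated with its validation report.
<http://example.org/record/1> a dcat:CatalogRecord ;
    dqv:hasQualityAnnotation [
        a dqv:QualityAnnotation ;
        oa:hasTarget <http://example.org/record/1> ;
        oa:hasBody <http://example.org/validation-report/1> ;
        oa:motivatedBy dqv:qualityAssessment
    ] .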

Notably, this approach has proved effective in dramatically increasing 
the quality of metadata (currently, valid metadata are, on average, 
more than 90% of those harvested). But this is a process that needs to 
be carried out not only when a new metadata provider joins, but also 
whenever metadata records are re-harvested. In other words, it needs 
to be an integral part of the metadata management workflow.

This experience also provides an example of atomic/aggregated quality 
checks, in particular along a temporal dimension. E.g., this applies 
to the links included in metadata records, which can point to 
distributions and/or to services for data access and visualisation. 
In such cases, the validation results require aggregating link / 
service check results over a given time frame (e.g., 24 hours, one 
week).
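
To make this concrete, the atomic checks could be recorded as in the 
following sketch (again EARL in Turtle; made-up URIs; the aggregation 
step itself is left out):

@prefix earl: <http://www.w3.org/ns/earl#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Two atomic checks of the same link, carried out at different times;
# the aggregated result over the time frame is derived from these.
<http://example.org/check/1> a earl:Assertion ;
    earl:subject <http://example.org/record/1#access-url> ;
    earl:test <http://example.org/criterion/link-resolves> ;
    earl:result [ a earl:TestResult ;
                  earl:outcome earl:passed ;
                  dct:date "2016-01-07T08:00:00Z"^^xsd:dateTime ] .

<http://example.org/check/2> a earl:Assertion ;
    earl:subject <http://example.org/record/1#access-url> ;
    earl:test <http://example.org/criterion/link-resolves> ;
    earl:result [ a earl:TestResult ;
                  earl:outcome earl:failed ;
                  dct:date "2016-01-07T20:00:00Z"^^xsd:dateTime ] .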

So, IMO, both the scenarios described earlier also apply to metadata.


Cheers,

Andrea

Received on Monday, 11 January 2016 08:01:22 UTC