Re: DQV, ISO 19115/19157 and GeoDCAT-AP from Antoine Isaac on 2016-03-07 (public-dwbp-wg@w3.org from March 2016)

From: Antoine Isaac <aisaac@few.vu.nl>
Date: Mon, 7 Mar 2016 11:21:23 +0100
To: Andrea Perego <andrea.perego@jrc.ec.europa.eu>
CC: "public-dwbp-wg@w3.org" <public-dwbp-wg@w3.org>
Message-ID: <56DD5623.9010703@few.vu.nl>

Dear Andrea,

Once again, it's my turn to be late...

Since your email was very long, I'm going to split my reply.

On items #1 and #2 I believe we're almost set.
May I however ask you to check two last things?

- What is the proper way to refer to the GeoDCAT-AP document? Our reference at http://w3c.github.io/dwbp/vocab-dqg.html#bib-GeoDCAT-AP is a bit old. I see there's a PDF at [1], but it's a bit confusing, as the PDF itself mentions https://joinup.ec.europa.eu/node/148281

- Can/should we refer to the 'provisional' PROV pattern you mentioned for conformance? This could be done with an additional sentence at [2]. If so, are we indeed talking about Annex II.14 (p. 62 onwards) in the document at [1]?

Cheers,

[1] https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/geodcat-ap-v10
[2] http://w3c.github.io/dwbp/vocab-dqg.html#ExpressConformanceWithStandard


On 1/11/16 9:00 AM, Andrea Perego wrote:
> Dear Antoine,
>
> Thanks for giving me the opportunity to comment, and sincere apologies for my late reply. We were busy with the final version of the GeoDCAT-AP specification, which was released just before Christmas, and I didn't manage to reply to your mail earlier.
>
> Please find my comments inline. Sorry in advance for the long mail.
>
> On 04/12/2015 20:08, Antoine Isaac wrote:
>> Dear Andrea,
>>
>> For three months we have owed you an update on your original feedback
>> email at [1]. Besides an initial answer [2], you may have seen that we've
>> been actively discussing your comments, notably around Issue-202, which
>> we raised after your comments [3].
>>
>> The status is currently:
>>
>> 1. We have not followed the PROV paths to express conformance. While
>> relevant for many cases, we believe it is too complex for our simple
>> requirements, and would count on provenance-focused applications
>> (including ones seeking to implement ISO or GeoDCAT-AP) to come up with
>> their own PROV patterns for representing how a conformance statement has
>> been produced.
>
> I see the point. Indeed, the PROV-based pattern is quite cumbersome, especially compared with the original EARL-based one. However, this decision was taken in the framework of the GeoDCAT-AP WG because of the more general-purpose nature of PROV, and its widespread adoption.
>
> Since the DWBP WG is discussing this issue, in GeoDCAT-AP we have marked the PROV-based solution as a "provisional" mapping, which could be replaced in the future by the approach defined in DQV, provided it is able to address the GeoDCAT-AP requirements.
>
>> 2. We have made amendments on the DQV specification for expressing
>> conformance in a way that follows the simpler DCAT-AP pattern using
>> dcterms:conformsTo [4].
>> We would by the way welcome feedback on our example: is the way we
>> introduce GeoDCAT as a dcterms:Standard appropriate, or would you prefer
>> us to adapt the example on "COMMISSION REGULATION (EC) No 976/2009" at [5]?
>
> Your example is perfectly fine with me, thank you.
>
>> 3. We have discussed your suggestion of introducing "not evaluated" and
>> "not conformant". We are convinced that adding "not conformant" would be
>> very useful [5].
>> However, if we implement it, are you aware of a property and a value
>> (URI) to build such a statement in RDF?
>
> None I'm aware of.
>
> About this, I think the issue here is about the ability not only to express whether data are conformant or not with a given (quality) standard, but also how the conformance test has been done, by whom, and when.
>
> Besides this being the approach used in ISO, this information is important for a number of reasons.
>
> One of them is that, in many cases, quality control is not something that is done once, but needs to be carried out on a regular basis. This happens, e.g., for datasets that are regularly updated (dynamic data).
>
> Another one is about the ability to control and verify how the quality check has been carried out, for instance in order to be able to reproduce it. Providing all the information needed to "reproduce an experiment" is common practice in scientific publications, and the same principles can be applied to data. And here we are also talking about transparency.
>
> A final example is that, in some cases, the final conformance result may be related to conformance tests carried out against more than one criterion - i.e., the final conformance result is determined by the aggregation of multiple conformance tests, each concerning a specific criterion. An example is also provided in the EARL specification, which refers to WCAG, where conformance depends on a set of criteria ("general techniques") to be checked.
>
>
> Based on that, I see two levels for the specification of data quality:
>
>
> 1. The first is about dataset filtering in the discovery phase, where I might just want to get the datasets conformant with a given (quality) standard. In such cases, properties expressing just whether data are conformant (dct:conformsTo) or not (??) can do the job.
>
>
> 2. The second concerns two classes of actors:
>
> (a) Those who manage data. This is about the ability to record the details of conformance tests and results, to use them in the data management workflow, and to expose to users the final results, possibly in an aggregated form (i.e., as in scenario #1).
>
> (b) Users who would like to contribute feedback/reviews on data quality. Note that such users can also be third parties who are directly involved by data custodians for some reason (e.g., this is the case of quality certificates, or peer reviews of data).
>
> For these cases, EARL might be the appropriate tool in a SemWeb / LD context. Or, at least, EARL provides the vocabulary to specify the relevant information. This includes "outcome values" of test results that are not limited to conformant / not conformant (appropriate in scenario #1). Notably, EARL supports 5 possible outcome values (see http://www.w3.org/TR/EARL10-Schema/#OutcomeValue) - quoting:
>
> [[
> earl:passed
>    Passed - the subject passed the test.
> earl:failed
>    Failed - the subject failed the test.
> earl:cantTell
>    Cannot tell - it is unclear if the subject passed or failed the test.
> earl:inapplicable
>    Inapplicable - the test is not applicable to the subject.
> earl:untested
>    Untested - the test has not been carried out.
> ]]
>
> In this scenario, values like "not evaluated" (earl:untested) are also relevant, since they can be used internally to plan the quality checks.
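
A minimal sketch, not taken from the thread, of how per-criterion results using the five EARL outcome values quoted above might be aggregated into a single conformance result. The `Outcome` enum and the `aggregate` function are hypothetical names, and the precedence rule (failed, then cantTell, then untested, with inapplicable criteria ignored) is an assumption made for illustration only; neither EARL nor DQV prescribes it.

```python
from enum import Enum


class Outcome(Enum):
    # The five outcome values quoted from the EARL 1.0 Schema.
    PASSED = "earl:passed"
    FAILED = "earl:failed"
    CANT_TELL = "earl:cantTell"
    INAPPLICABLE = "earl:inapplicable"
    UNTESTED = "earl:untested"


def aggregate(outcomes):
    """Combine per-criterion outcomes into one overall outcome.

    ASSUMPTION: this precedence rule is illustrative only; it is not
    defined by EARL, DQV, or GeoDCAT-AP.
    """
    outcomes = list(outcomes)
    if not outcomes:
        # Nothing has been checked yet.
        return Outcome.UNTESTED
    # Inapplicable criteria count neither for nor against conformance.
    relevant = [o for o in outcomes if o is not Outcome.INAPPLICABLE]
    if not relevant:
        return Outcome.INAPPLICABLE
    # The first blocking outcome found, in this order, decides the result.
    for blocking in (Outcome.FAILED, Outcome.CANT_TELL, Outcome.UNTESTED):
        if blocking in relevant:
            return blocking
    return Outcome.PASSED
```

Under this assumed rule, an overall earl:untested result signals that at least one criterion still needs a quality check, which matches the planning use mentioned above.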
>
>
> Based on what I see, DQV addresses all these user scenarios, but it's unclear to me if it is able to encode conformance test results with the level of detail described above.
>
>> 4. The example of conformance is triggering a discussion on how to
>> indicate that it's fine to use DQV to indicate the quality of metadata,
>> not only 'original' datasets. This is less relevant to your original
>> comment, but you may want to chime in.
>
> This is actually one of the issues my colleagues and I have been dealing with in our work. So, +1 from me.
>
> I can contribute an existing example from the INSPIRE Geoportal.
>
> The INSPIRE Geoportal is harvesting metadata records (100K+) from catalogue services operated by EU Member States. As a post-processing step, the geoportal infrastructure carries out a validation test against a set of criteria, and it generates a validation report.
>
> This procedure is not meant to decide whether a metadata record can be published or not. All records are published, irrespective of whether they have passed the validation. Rather, these reports are meant to provide (meta)data providers with precise information on the issues identified.
>
> Notably, this approach proved to be effective in dramatically increasing the quality of metadata (currently, valid metadata are, on average, more than 90% of those harvested). But this is a process that needs to be carried out not only when a new metadata provider joins, but also whenever metadata records are re-harvested. In other words, this needs to be an integral part of the metadata management workflow.
>
> This experience also provides an example of atomic/aggregated quality checks, in particular along a temporal dimension. E.g., this applies to the links included in metadata records, which can point to distributions or to services for data access and/or visualisation. In such a case, the validation results require aggregating link / service check results over a given time frame (e.g., 24h, one week).
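
A minimal sketch, not taken from the thread, of the kind of temporal aggregation described above: individual link checks are combined over a time frame into one validity result. The function name `link_is_valid`, the `(timestamp, ok)` data shape, and the `min_ok_ratio` threshold are all hypothetical; the actual INSPIRE Geoportal logic is not described in this message.

```python
from datetime import datetime, timedelta


def link_is_valid(checks, now, window=timedelta(hours=24), min_ok_ratio=1.0):
    """Aggregate atomic link checks over a time window.

    checks: iterable of (timestamp, ok) pairs, one per atomic check.
    ASSUMPTION: a link counts as valid when the share of successful
    checks inside the window reaches min_ok_ratio (illustrative rule).
    Returns None when no check falls inside the window.
    """
    recent = [ok for ts, ok in checks if now - window <= ts <= now]
    if not recent:
        return None  # no evidence in this time frame
    return sum(recent) / len(recent) >= min_ok_ratio


# Hypothetical check history for one link in a metadata record.
now = datetime(2016, 1, 11, 9, 0)
history = [
    (now - timedelta(hours=30), False),
    (now - timedelta(hours=20), True),
    (now - timedelta(hours=2), True),
]
valid_last_day = link_is_valid(history, now)
valid_last_week = link_is_valid(history, now, window=timedelta(weeks=1))
```

With this history, the failed check is older than 24 hours, so the result differs between the one-day and one-week windows, which is why the chosen time frame would need to be recorded alongside the aggregated result.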
>
> So, IMO, both scenarios described earlier also apply to metadata.
>
>
> Cheers,
>
> Andrea
>
>
Received on Monday, 7 March 2016 10:21:57 UTC