Re: DQV - metrics related to the completeness dimension

From: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
Date: Wed, 30 Sep 2015 14:25:31 +0200
Message-ID: <CAAOEr1mcVLt47W6KYBeQVYgPpOqSVq5AOemrOgD6rn2805pT-w@mail.gmail.com>
To: Steven Adler <adler1@us.ibm.com>
Cc: "Debattista, Jeremy" <Jeremy.Debattista@iais.fraunhofer.de>, Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>, Makx Dekkers <mail@makxdekkers.com>
Hi Steve,

In that case, I think this fits well with the purpose of the
dqv:QualityFeedback. At the moment it is very open and it can be used for
providing any type of feedback about quality. This feedback may be
optionally related to a quality dimension either by the person who provided
the feedback or by a person who classifies this feedback later on.

However, I am with Makx that it is also possible to define some quality
aspects including completeness in a way that can be objectively measured
and we need to be able to represent both these types of quality indicators.
IMO, at the moment this is possible using different concepts provided in
DQV such as dqv:QualityFeedback, and dqv:QualityMeasure. I agree with you
that depending on the community and how they think about data quality, we
may see one adopted more widely than the other.

Best Regards,

On Wed, Sep 30, 2015 at 1:30 PM, Steven Adler <adler1@us.ibm.com> wrote:

> All of the above.  Feedback can be a check box or blank field with opinions.  This is public data and we want the public to participate.
> Best Regards,
> Steve
> From: "Nandana Mihindukulasooriya" <nmihindu@fi.upm.es>
> To: "Debattista, Jeremy" <Jeremy.Debattista@iais.fraunhofer.de>
> Cc: "Steven Adler" <adler1@us.ibm.com>, "Data on the Web Best Practices Working Group" <public-dwbp-wg@w3.org>
> Date: Wed, Sep 30, 2015 5:39 AM
> Subject: Re: DQV - metrics related to the completeness dimension
> ------------------------------
> Hi,
> I wonder whether measures of confidence and doubt in the form of product
> reviews would fit better as Quality Annotations. I guess there might be other
> cases where quality annotations are quite subjective. But thinking about
> the product review example, one can always generate metrics from them, such as
> average rating, average confidence levels, etc. Though they are subjective,
> I think they give a good indication of the users' perspective.
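
As a rough illustration of deriving such metrics from review-style feedback, here is a minimal Python sketch. The `Feedback` record, its 1-5 `rating` scale, and the 0-1 `confidence` field are assumptions made for this example; they are not part of DQV.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Feedback:
    """One piece of user feedback (hypothetical structure, not a DQV class)."""
    rating: int                      # assumed 1-5 star rating
    confidence: float                # assumed self-reported confidence in [0, 1]
    dimension: Optional[str] = None  # e.g. "completeness", if classified


def summarize(feedback: List[Feedback]) -> Dict[str, Dict[str, float]]:
    """Group feedback by quality dimension and average rating and confidence."""
    groups: Dict[str, List[Feedback]] = {}
    for item in feedback:
        # Feedback without a dimension is kept under "unclassified".
        groups.setdefault(item.dimension or "unclassified", []).append(item)
    return {
        dim: {
            "count": len(items),
            "avg_rating": mean(i.rating for i in items),
            "avg_confidence": mean(i.confidence for i in items),
        }
        for dim, items in groups.items()
    }


# Made-up sample feedback, for illustration only.
sample = [
    Feedback(rating=4, confidence=0.5, dimension="completeness"),
    Feedback(rating=2, confidence=1.0, dimension="completeness"),
    Feedback(rating=5, confidence=0.25),
]
summary = summarize(sample)
```

Keeping unclassified feedback in its own group reflects the idea that a quality dimension may be attached later by someone other than the original reviewer.
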
> To relate this more concretely to the completeness dimension: are we
> talking about the confidence that one has in the completeness of the data,
> or confidence in general, including accuracy, timeliness, etc.?
> Best Regards,
> Nandana
> On Wed, Sep 30, 2015 at 9:26 AM, Debattista, Jeremy <
> Jeremy.Debattista@iais.fraunhofer.de> wrote:
>> What you said is true Steven, and (in principle) I would agree on
>> avoiding universal completeness in favour of a more sustainable measure. On
>> the other hand your solution is highly subjective and thus very hard to
>> calculate. It would be nice to have such an index score, but I’m not quite
>> sure that this will work in practice as there are many factors that have to
>> be considered.
>> Cheers,
>> Jer
>> On 30 Sep 2015, at 03:42, Steven Adler <adler1@us.ibm.com> wrote:
>> You can avoid "universal" completeness by allowing publishers and
>> consumers to publish their confidence level in the data. The combination of
>> confidence attributes would be calculated as an index of confidence and
>> doubt, like a set of product reviews. This method is more organic to how
>> the data has been and is used.
>> Just a thought.
>> Best Regards,
>> Steve
>> Motto: "Do First, Think, Do it Again"
>> From: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
>> To: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
>> Date: 09/27/2015 08:07 PM
>> Subject: DQV - metrics related to the completeness dimension
>> ------------------------------
>> Hi all,
>> In the F2F (re: action-153), we talked about the difficulties of defining
>> metrics for measuring completeness and the need for examples. Here's some
>> input from a project we are working on at the moment.
>> TL;DR version
>> It's hard to define universal completeness metrics that suit everyone.
>> However, completeness metrics can be defined for concrete use cases or
>> specific contexts of use. In the case of RDF data, a closed world
>> assumption has to be applied to calculate completeness.
>> Longer version
>> Quality is generally defined as "fitness for *use*". Further,
>> completeness is defined as "The degree to which subject data associated
>> with an entity has values for all expected attributes and related entity
>> instances *in a specific context of use*" [ISO 25012]. It's important to
>> note that both definitions emphasize that the perceived quality depends on
>> the intended use. Thus, a dataset that is fully complete for one task might
>> be quite incomplete for another task.
>> For example, it's not easy to define a metric that universally measures
>> the completeness of a dataset. However, for a concrete use case such as
>> calculating some economic indicators of Spanish provinces, we can define a
>> set of completeness metrics.
>> In this case, we can define three metrics:
>> (i) Schema completeness, i.e. the degree to which required attributes are
>> not missing in the schema. In our use case, the attributes we are
>> interested in are the total population, unemployment level, and average
>> personal income of a province, and the schema completeness is calculated
>> using those attributes.
>> (ii) Population completeness, i.e. the degree to which elements of the
>> required population are not missing in the data. In our use case, the
>> population we are interested in is all the provinces of Spain, and the
>> population completeness is calculated against them.
>> (iii) Column completeness, i.e. the degree to which the values of
>> the required attributes are not missing in the data. Column completeness is
>> calculated using the schema and the population defined before and the facts
>> in the dataset.
>> With these metrics, we can now measure the completeness of the dataset
>> for our use case. As we can see, those metrics are quite specific to our
>> use case. Later if we have another use case about Spanish movies, we can
>> define a set of different schema, population, and column completeness
>> metrics and the same dataset will have different values for those different
>> metrics.
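
The three metrics can be sketched in Python as simple ratios. This is one possible reading rather than a reference implementation: the function names, the dictionary-based data layout, and the choice of denominator for column completeness (required attributes times required population) are assumptions, and the numbers are toy placeholders, not real statistics.

```python
from typing import Dict, Optional, Set

# Each entity maps attribute name -> value; None means the value is missing.
Record = Dict[str, Optional[float]]


def schema_completeness(required_attrs: Set[str], schema_attrs: Set[str]) -> float:
    """Share of required attributes that exist in the dataset's schema."""
    if not required_attrs:
        return 1.0
    return len(required_attrs & schema_attrs) / len(required_attrs)


def population_completeness(required_pop: Set[str], data: Dict[str, Record]) -> float:
    """Share of required entities (e.g. provinces) present in the data."""
    if not required_pop:
        return 1.0
    return len(required_pop & set(data)) / len(required_pop)


def column_completeness(required_attrs: Set[str], required_pop: Set[str],
                        data: Dict[str, Record]) -> float:
    """Share of (entity, attribute) cells that hold a value.

    Closed-world reading: a cell that is absent from the data counts as missing.
    """
    total = len(required_attrs) * len(required_pop)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for entity in required_pop
        for attr in required_attrs
        if data.get(entity, {}).get(attr) is not None
    )
    return filled / total


# Toy use case: a subset of provinces and placeholder values (not real figures).
required_attrs = {"population", "unemployment", "income"}
required_pop = {"Madrid", "Sevilla", "Valencia", "Zaragoza"}
schema_attrs = {"population", "unemployment"}
data: Dict[str, Record] = {
    "Madrid": {"population": 1.0, "unemployment": 2.0},
    "Sevilla": {"population": 3.0, "unemployment": None},
    "Valencia": {"population": 4.0},
}

schema_score = schema_completeness(required_attrs, schema_attrs)      # 2 of 3
population_score = population_completeness(required_pop, data)        # 3 of 4
column_score = column_completeness(required_attrs, required_pop, data)  # 4 of 12
```

Swapping in a different `required_attrs` and `required_pop` (for instance, for a use case about Spanish movies) yields different scores for the same `data`, which is the context dependence described above.
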
>> If the data providers foresee some specific use cases, they might be able
>> to define some concrete completeness metrics and make them available as
>> quality measures. If not, the data consumers can define more specific
>> completeness metrics for their use cases and measure values for those
>> metrics. These completeness metrics can be used to evaluate the "fitness
>> for use" of different datasets for a given use case. To generate population
>> completeness, the required population should be known. The required
>> attributes and other constraints of the schema might be expressed using SHACL
>> shapes [1].
>> In the case of RDF data, we will adopt a closed-world assumption and
>> only consider the axioms and facts included in the dataset. Also, if the
>> use case involves linksets, other metrics such as interlinking completeness
>> can be used.
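
The thread does not define interlinking completeness, so the following Python sketch assumes one plausible reading: the share of entities of interest that have at least one outgoing link in the linkset. The triple layout, the `ex:` and `other:` identifiers, and the default `owl:sameAs` predicate are illustrative assumptions. In line with the closed-world assumption, only links actually present in the linkset are counted.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def interlinking_completeness(entities: Set[str], linkset: Iterable[Triple],
                              link_predicate: str = "owl:sameAs") -> float:
    """Share of entities with at least one link of the given predicate.

    Closed world: an entity with no link in the linkset counts as unlinked.
    """
    if not entities:
        return 1.0
    linked = {s for s, p, _ in linkset if p == link_predicate and s in entities}
    return len(linked) / len(entities)


# Hypothetical identifiers, for illustration only.
entities = {"ex:Madrid", "ex:Sevilla", "ex:Valencia", "ex:Zaragoza"}
linkset = [
    ("ex:Madrid", "owl:sameAs", "other:Madrid"),
    ("ex:Madrid", "owl:sameAs", "another:Madrid"),
    ("ex:Sevilla", "owl:sameAs", "other:Sevilla"),
    ("ex:Valencia", "rdfs:seeAlso", "other:Valencia"),
]
interlinking_score = interlinking_completeness(entities, linkset)
```
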
>> I hope this helps us discuss completeness metrics more concretely. It
>> would be interesting to hear about other experiences in defining
>> completeness metrics, and about counterexamples where it is easy to define
>> universal completeness metrics.
>> Best Regards,
>> Nandana
>> [1] http://w3c.github.io/data-shapes/shacl/
Received on Wednesday, 30 September 2015 12:26:30 UTC