Re: DQV - metrics related to the completeness dimension

Hi Makx,

I completely agree. Further, if we enrich the completeness measure with a
description/definition of what is meant by "complete" in the context of
that concrete measure, i.e., the X, Y, Z observed attributes for all
measuring stations (the observed population), I think we can make it a
more useful measure.

In that case, as a consumer, I can check my definition of "completeness",
which has "required attributes" and a "required population", against the
"observed attributes" and "observed population" provided by the data
provider or the quality evaluator, and decide whether I can use the given
measure to judge if the dataset is complete for my use.
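
As a rough sketch of that check (all attribute and province names below
are invented for illustration, not taken from any actual dataset):

    # Hypothetical consumer-side check: compare my completeness
    # requirements against what the provider or evaluator reports as
    # "observed" alongside the quality measure.
    required_attributes = {"totalPopulation", "unemploymentLevel",
                           "averagePersonalIncome"}
    required_population = {"Madrid", "Barcelona", "Valencia", "Sevilla"}

    observed_attributes = {"totalPopulation", "unemploymentLevel"}
    observed_population = {"Madrid", "Barcelona", "Valencia"}

    missing_attributes = required_attributes - observed_attributes
    missing_population = required_population - observed_population

    if not missing_attributes and not missing_population:
        print("The published measure covers my requirements.")
    else:
        print("Measure not directly usable for my use case.")
        print("Missing attributes:", missing_attributes)
        print("Missing population:", missing_population)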

Best Regards,
Nandana

On Wed, Sep 30, 2015 at 10:59 AM, Makx Dekkers <mail@makxdekkers.com> wrote:

> All,
>
>
>
> Aren’t we making this too complex?
>
>
>
> It seems to me that in certain cases there can be ‘absolute’ measures of
> quality. For example, if I publish a dataset with air quality observations
> from 50 measuring stations, I can state that the dataset is complete
> because it contains all observations from all measuring stations, or that
> it is not complete because observations from stations X and Y are missing.
> This is not subjective at all.
>
>
>
> Defining quality as “fitness for use” allows for a discussion about the
> completeness of the approach behind the dataset, e.g. someone can argue
> that my measurements of air quality do not include parameters that are
> crucial for their research and therefore the measurements are “incomplete”.
> I would argue that the dataset is still “complete”.
>
>
>
> In my mind, the three completeness metrics (schema completeness,
> population completeness, column completeness) as formulated by Nandana
> point mainly to the quality of the approach, as they talk about “required
> attributes”, not to the quality of the dataset itself. If you replace the
> phrases “required attributes” and “required population” with “observed
> attributes” and “observed population”, you can have an objective measure of
> the completeness of the dataset.
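>
> As a rough sketch of such an objective, dataset-internal measure
> (station names and values are made up for illustration):
>
>     # Hypothetical: completeness measured against the dataset's own
>     # observed attributes and observed population, with no external
>     # "requirements" involved.
>     observations = {
>         ("station-X", "NO2"): 41.0,
>         ("station-X", "PM10"): 18.5,
>         ("station-Y", "NO2"): 37.2,
>         # ("station-Y", "PM10") has no recorded value
>     }
>     stations = {s for (s, _) in observations}
>     attributes = {a for (_, a) in observations}
>
>     expected_cells = len(stations) * len(attributes)
>     completeness = len(observations) / expected_cells  # 3 / 4
>     print(f"Observed completeness: {completeness:.0%}")  # 75%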
>
>
>
> Of course, if a user has particular requirements in terms of the set of
> attributes, and a dataset contains a different set of attributes, that
> dataset may not be “fit for (this user’s) use”, but it could still be 100%
> complete with respect to its own set of attributes and population.
>
>
>
> Makx.
>
>
>
>
>
>
>
> *From:* Debattista, Jeremy [mailto:Jeremy.Debattista@iais.fraunhofer.de]
> *Sent:* 30 September 2015 09:26
> *To:* Steven Adler <adler1@us.ibm.com>
> *Cc:* Nandana Mihindukulasooriya <nmihindu@fi.upm.es>; Data on the Web
> Best Practices Working Group <public-dwbp-wg@w3.org>
> *Subject:* Re: DQV - metrics related to the completeness dimension
>
>
>
> What you said is true, Steven, and (in principle) I would agree on avoiding
> universal completeness in favour of a more sustainable measure. On the
> other hand, your solution is highly subjective and thus very hard to
> calculate. It would be nice to have such an index score, but I’m not quite
> sure that this would work in practice, as there are many factors that have
> to be considered.
>
>
>
> Cheers,
>
> Jer
>
>
>
> On 30 Sep 2015, at 03:42, Steven Adler <adler1@us.ibm.com> wrote:
>
>
>
> You can avoid "universal" completeness by allowing publishers and
> consumers to publish their confidence level in the data. The combination of
> confidence attributes would be calculated as an index of confidence and
> doubt, like a set of product reviews. This method is more organic to how
> the data has been, and is being, used.
>
> Just a thought.
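>
> A minimal sketch of what such an index might look like, assuming every
> publisher and consumer attaches a confidence score in [0, 1] to the
> dataset (the scores below are invented):
>
>     # Hypothetical confidence index: aggregate publisher and consumer
>     # confidence scores like a set of product reviews.
>     def confidence_index(scores):
>         """Mean confidence; doubt is its complement."""
>         if not scores:
>             return None
>         confidence = sum(scores) / len(scores)
>         return {"confidence": confidence, "doubt": 1.0 - confidence}
>
>     # One publisher score followed by three consumer scores.
>     print(confidence_index([0.9, 0.7, 0.8, 0.4]))
>     # -> confidence ~0.7, doubt ~0.3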
>
>
>
> Best Regards,
>
> Steve
>
> Motto: "Do First, Think, Do it Again"
>
>
> From: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
> To: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
> Date: 09/27/2015 08:07 PM
> Subject: DQV - metrics related to the completeness dimension
> ------------------------------
>
>
>
>
> Hi all,
>
> In the F2F (re: action-153), we talked about the difficulties of defining
> metrics for measuring completeness and the need for examples. Here's some
> input from a project we are working on at the moment.
>
> TL;DR version
>
> It's hard to define universal completeness metrics that suit everyone.
> However, completeness metrics can be defined for concrete use cases or
> specific contexts of use. In the case of RDF data, a closed world
> assumption has to be applied to calculate completeness.
>
> Longer version
>
> Quality is generally defined as "fitness for *use*". Further, completeness
> is defined as "The degree to which subject data associated with an entity
> has values for all expected attributes and related entity instances *in a
> specific context of use*" [ISO 25012]. It's important to note that both
> definitions emphasize that the perceived quality depends on the intended
> use. Thus, a dataset fully complete for one task might be quite
> incomplete for another task.
>
> Consequently, it's not easy to define a metric that universally measures
> the completeness of a dataset. However, for a concrete use case such as
> calculating some economic indicators of Spanish provinces, we can define a
> set of completeness metrics.
>
> In this case, we can define three metrics (a rough sketch of the
> calculations follows the list):
> (i) Schema completeness, i.e. the degree to which required attributes are
> not missing in the schema. In our use case, the attributes we are
> interested in are the total population, unemployment level, and average
> personal income of a province, and the schema completeness is calculated
> using those attributes.
> (ii) Population completeness, i.e. the degree to which elements of the
> required population are not missing in the data. In our use case, the
> population we are interested in is all the provinces of Spain, and the
> population completeness is calculated against them.
> (iii) Column completeness, i.e. the degree to which the values of the
> required attributes are not missing in the data. Column completeness is
> calculated using the schema and the population defined above and the
> facts in the dataset.
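>
> As a rough sketch of those calculations (province names, attribute
> names, and values are all invented, and the province list is shortened
> for the example):
>
>     # Hypothetical sketch of the three completeness measures for the
>     # Spanish provinces use case.
>     required_attributes = {"totalPopulation", "unemploymentLevel",
>                            "averagePersonalIncome"}
>     required_population = {"Madrid", "Barcelona", "Valencia"}
>
>     # Facts present in the dataset: (entity, attribute) -> value
>     facts = {
>         ("Madrid", "totalPopulation"): 1000000,
>         ("Madrid", "unemploymentLevel"): 0.12,
>         ("Barcelona", "totalPopulation"): 900000,
>     }
>     schema = {a for (_, a) in facts}
>     population = {s for (s, _) in facts}
>
>     schema_completeness = (len(schema & required_attributes)
>                            / len(required_attributes))        # 2/3
>     population_completeness = (len(population & required_population)
>                                / len(required_population))    # 2/3
>     column_completeness = sum(
>         1 for s in required_population for a in required_attributes
>         if (s, a) in facts
>     ) / (len(required_population) * len(required_attributes))  # 3/9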
>
> With these metrics, we can now measure the completeness of the dataset for
> our use case. As we can see, those metrics are quite specific to it. Later,
> if we have another use case, e.g. about Spanish movies, we can define a
> different set of schema, population, and column completeness metrics, and
> the same dataset will have different values for those metrics.
>
> If the data providers foresee some specific use cases, they might be able
> to define some concrete completeness metrics and make them available as
> quality measures. If not, the data consumers can define more specific
> completeness metrics for their use cases and measure values for those
> metrics. These completeness metrics can be used to evaluate the "fitness
> for use" of different datasets for a given use case. To calculate
> population completeness, the required population has to be known. The
> required attributes and other schema constraints might be expressed using
> SHACL shapes [1], as sketched below.
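>
> A minimal sketch of such a shape, validated here with the pyshacl
> library (an assumption on my part; the ex: names are hypothetical):
>
>     # Express the required attributes as a SHACL shape and validate a
>     # dataset against it (pyshacl assumed; ex: names hypothetical).
>     from rdflib import Graph
>     from pyshacl import validate
>
>     shapes = """
>     @prefix sh: <http://www.w3.org/ns/shacl#> .
>     @prefix ex: <http://example.org/> .
>     ex:ProvinceShape a sh:NodeShape ;
>         sh:targetClass ex:Province ;
>         sh:property [ sh:path ex:totalPopulation ; sh:minCount 1 ] ;
>         sh:property [ sh:path ex:unemploymentLevel ; sh:minCount 1 ] ;
>         sh:property [ sh:path ex:averagePersonalIncome ;
>                       sh:minCount 1 ] .
>     """
>
>     data = """
>     @prefix ex: <http://example.org/> .
>     ex:Madrid a ex:Province ;
>         ex:totalPopulation 1000000 .
>     """
>
>     conforms, _, report = validate(
>         Graph().parse(data=data, format="turtle"),
>         shacl_graph=Graph().parse(data=shapes, format="turtle"))
>     print(conforms)  # False: two required attributes are missing
>     print(report)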
>
> In the case of RDF data, we apply a closed-world assumption and only
> consider the axioms and facts included in the dataset. Also, if the use
> case involves linksets, other metrics such as interlinking completeness can
> be used.
>
> Hope this helps us discuss the completeness metrics more concretely. It
> would be interesting to hear about other experiences in defining
> completeness metrics, and counterexamples where it is easy to define
> universal completeness metrics.
>
> Best Regards,
> Nandana
>
> [1] http://w3c.github.io/data-shapes/shacl/
>
>
>

Received on Wednesday, 30 September 2015 09:20:14 UTC