RE: DQV - metrics related to the completeness dimension

I've worked with many banks, insurers, retailers, academics, NGOs, and governments on data quality. Vendors like IBM make products that validate data quality, and we have thousands of engineers with data quality skills.

In my work, I've met many honest people who make many honest mistakes when publishing data.  I have a file cabinet full of stories from all over the world on data quality mistakes.

I have never met an objective human being. All humans have bias. Even computer systems have the biases we endow them with when we humans create them, which this debate between you and me perfectly illustrates.

There are no single sources of truth, and every dataset can contain the errors, omissions, and biases of its publisher.

But don't take it from me. Take it from journalists, whose job it is to publish stories from sources. When a journalist publishes a story with only one source, it is called gossip. Serious journalists corroborate sources to check facts and expose bias, and so should we.

Best Regards,

Steve


   Makx Dekkers --- RE: DQV - metrics related to the completeness dimension ---

   From: "Makx Dekkers" <mail@makxdekkers.com>
   To: "Steven Adler" <adler1@us.ibm.com>
   Cc: "'Debattista, Jeremy'" <Jeremy.Debattista@iais.fraunhofer.de>, "'Nandana Mihindukulasooriya'" <nmihindu@fi.upm.es>, "'Data on the Web Best Practices Working Group'" <public-dwbp-wg@w3.org>
   Date: Wed, Sep 30, 2015 10:06 AM
   Subject: RE: DQV - metrics related to the completeness dimension
  
     Steve, 
    
   How is this related to “vendors”? My example was about a data provider who wants to and can make objective statements about how complete, accurate, granular etc. their data is. Whether you believe them is out of scope for this discussion.
    
   I understand from your earlier comments that your perspective is on data providers that have an interest in making false or at least biased statements about the quality of their data. I agree that those exist, but it is not the only sector of data providers that we’re addressing. We also need to take into account ‘honest’ data providers whose interest it is to make accurate statements about quality that are measurable, reproducible and verifiable.
    
   I don’t understand why you insist that quality is entirely subjective. 
    
   Makx.
    
   From: Steven Adler [mailto:adler1@us.ibm.com]
   Sent: 30 September 2015 15:47
   To: Makx Dekkers <mail@makxdekkers.com>
   Cc: 'Debattista, Jeremy' <Jeremy.Debattista@iais.fraunhofer.de>; 'Nandana Mihindukulasooriya' <nmihindu@fi.upm.es>; 'Data on the Web Best Practices Working Group' <public-dwbp-wg@w3.org>
   Subject: RE: DQV - metrics related to the completeness dimension
           
   Really? Show me a vendor who does that.
    
   Best Regards,
    
   Steve
    
    
            Makx Dekkers --- RE: DQV - metrics related to the completeness dimension --- 
              
     From:
     "Makx Dekkers" <mail@makxdekkers.com>
     To:
     "Steven Adler" <adler1@us.ibm.com>
     Cc:
     "'Debattista, Jeremy'" <Jeremy.Debattista@iais.fraunhofer.de>, "'Nandana Mihindukulasooriya'" <nmihindu@fi.upm.es>, "'Data on the Web Best Practices Working Group'" <public-dwbp-wg@w3.org>
     Date:
     Wed, Sep 30, 2015 9:44 AM
     Subject:
     RE: DQV - metrics related to the completeness dimension
                 
      Steve,
       
      I don’t agree. There are aspects of quality that can be objectively measured. Let’s not throw those away.
       
      Makx.
       
      From: Steven Adler [mailto:adler1@us.ibm.com]
      Sent: 30 September 2015 13:01
      To: Makx Dekkers <mail@makxdekkers.com>
      Cc: 'Debattista, Jeremy' <Jeremy.Debattista@iais.fraunhofer.de>; 'Nandana Mihindukulasooriya' <nmihindu@fi.upm.es>; 'Data on the Web Best Practices Working Group' <public-dwbp-wg@w3.org>
      Subject: RE: DQV - metrics related to the completeness dimension
                    
       That is not how DATA quality works. You have to define lineage and provenance to illustrate where the data came from and its age. You can compare it to other sources. But your singular assertion of completeness is just one opinion. Data quality relies on many opinions.
       
      Best Regards,
       
      Steve
       
       
                     Makx Dekkers --- RE: DQV - metrics related to the completeness dimension --- 
                       
        From:
        "Makx Dekkers" <mail@makxdekkers.com>
        To:
        "Steven Adler" <adler1@us.ibm.com>, "'Debattista, Jeremy'" <Jeremy.Debattista@iais.fraunhofer.de>
        Cc:
        "'Nandana Mihindukulasooriya'" <nmihindu@fi.upm.es>, "'Data on the Web Best Practices Working Group'" <public-dwbp-wg@w3.org>
        Date:
        Wed, Sep 30, 2015 6:16 AM
        Subject:
        RE: DQV - metrics related to the completeness dimension
                                           
         Data quality is not “subjective by nature”. It depends on your definition. If you see quality as mainly an expression of what you think about something, yes, of course, that’s always subjective. 
          
         If, on the other hand, you define quality in terms of measurable quantities (e.g. precision of a figure in terms of number of decimals), it can be objectively measured and expressed. Other objective measures are for example the one for completeness that I mentioned before, in the sense of containing all observations that I have.
          
         I think we should allow both types of “quality” to be expressed.
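[Editor's note: the two objective measures Makx mentions, decimal precision of a figure and completeness as "containing all observations that I have", can be sketched as below. This is a minimal illustration with invented values, not part of any DQV vocabulary.]

```python
# Two objective, reproducible quality measures (hypothetical data):
# (1) precision of a figure as its number of decimal places,
# (2) completeness as the fraction of expected observations present.
from decimal import Decimal

def decimal_precision(value: str) -> int:
    """Number of decimal places in a numeric string."""
    exponent = Decimal(value).as_tuple().exponent
    return max(0, -exponent)

def completeness(expected_ids, present_ids) -> float:
    """Share of expected observations that actually appear."""
    expected = set(expected_ids)
    return len(expected & set(present_ids)) / len(expected)

print(decimal_precision("3.1416"))                      # 4
print(completeness({"obs1", "obs2", "obs3"}, {"obs1", "obs2"}))
```

Both measures are reproducible by anyone holding the same dataset and the same expectation list, which is what distinguishes them from subjective assessments.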
          
         Makx.
          
          
          
          
         From: Steven Adler [mailto:adler1@us.ibm.com]
         Sent: 30 September 2015 11:34
         To: Debattista, Jeremy <Jeremy.Debattista@iais.fraunhofer.de>
         Cc: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>; Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
         Subject: Re: DQV - metrics related to the completeness dimension
                             
          Aside from data refresh rates and comparisons to corroborative sources, data quality is subjective by nature.
          
         Best Regards,
          
         Steve
          
          
                              Debattista, Jeremy --- Re: DQV - metrics related to the completeness dimension --- 
                                
           From:
           "Debattista, Jeremy" <Jeremy.Debattista@iais.fraunhofer.de>
           To:
           "Steven Adler" <adler1@us.ibm.com>
           Cc:
           "Nandana Mihindukulasooriya" <nmihindu@fi.upm.es>, "Data on the Web Best Practices Working Group" <public-dwbp-wg@w3.org>
           Date:
           Wed, Sep 30, 2015 3:26 AM
           Subject:
           Re: DQV - metrics related to the completeness dimension
                                                                                   
            What you said is true, Steven, and (in principle) I would agree on avoiding universal completeness in favour of a more sustainable measure. On the other hand, your solution is highly subjective and thus very hard to calculate. It would be nice to have such an index score, but I'm not quite sure it will work in practice, as there are many factors that have to be considered.
                                      
                                     Cheers,
                                     Jer
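[Editor's note: the confidence index Steve proposes in the quoted message below could be sketched as follows. The parties, scores, and the choice of a simple mean are all invented for illustration; the proposal itself does not fix an aggregation method.]

```python
# Hypothetical sketch: publishers and consumers each attach a
# confidence score in [0, 1] to a dataset, and the scores are
# combined into a single index, like averaging product reviews.
from statistics import mean

reviews = [
    {"party": "publisher",  "confidence": 0.9},
    {"party": "consumer-1", "confidence": 0.6},
    {"party": "consumer-2", "confidence": 0.4},
]

def confidence_index(reviews) -> float:
    """Mean confidence across all parties; (1 - index) reads as doubt."""
    return mean(r["confidence"] for r in reviews)

index = confidence_index(reviews)
print(round(index, 3))      # confidence index
print(round(1 - index, 3))  # corresponding doubt
```

As Jeremy notes above, the hard part is not the arithmetic but deciding which factors feed the scores and how to weight them.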
                         
                                       On 30 Sep 2015, at 03:42, Steven Adler <adler1@us.ibm.com> wrote:
                           
            You can avoid "universal" completeness by allowing publishers and consumers to publish their confidence level in the data. The combination of confidence attributes would be calculated as an index of confidence and doubt, like a set of product reviews. This method is more organic to how the data has been and is used. Just a thought.

            Best Regards,

            Steve

            Motto: "Do First, Think, Do it Again"

            Nandana Mihindukulasooriya --- DQV - metrics related to the completeness dimension ---

            From: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
            To: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
            Date: 09/27/2015 08:07 PM
            Subject: DQV - metrics related to the completeness dimension
Hi all,

In the F2F (re: action-153), we talked about the difficulties of defining metrics for measuring completeness and the need for examples. Here's some input from a project we are working on at the moment.

TL;DR version

It's hard to define universal completeness metrics that suit everyone. However, completeness metrics can be defined for concrete use cases or specific contexts of use. In the case of RDF data, a closed world assumption has to be applied to calculate completeness.

Longer version

Quality is generally defined as "fitness for *use*". Further, completeness is defined as "The degree to which subject data associated with an entity has values for all expected attributes and related entity instances *in a specific context of use*" [ISO 25012]. It's important to note that both definitions emphasize that the perceived quality depends on the intended use. Thus, a dataset fully complete for one task might be quite incomplete for another task. For example, it's not easy to define a metric that universally measures the completeness of a dataset. However, for a concrete use case such as calculating some economic indicators of Spanish provinces, we can define a set of completeness metrics. In this case, we can define three metrics:

(i) Schema completeness, i.e. the degree to which required attributes are not missing in the schema. In our use case, the attributes we are interested in are the total population, unemployment level, and average personal income of a province, and the schema completeness is calculated using those attributes.

(ii) Population completeness, i.e. the degree to which elements of the required population are not missing in the data. In our use case, the population we are interested in is all the provinces of Spain, and the population completeness is calculated against them.

(iii) Column completeness, i.e. the degree to which the values of the required attributes are not missing in the data. Column completeness is calculated using the schema and the population defined before and the facts in the dataset.

With these metrics, we can now measure the completeness of the dataset for our use case. As we can see, those metrics are quite specific to our use case. Later, if we have another use case about Spanish movies, we can define a different set of schema, population, and column completeness metrics, and the same dataset will have different values for those different metrics.

If the data providers foresee some specific use cases, they might be able to define some concrete completeness metrics and make them available as quality measures. If not, the data consumers can define more specific completeness metrics for their use cases and measure values for those metrics. These completeness metrics can be used to evaluate the "fitness for use" of different datasets for a given use case.

To generate population completeness, the required population should be known. The required attributes and other constraints of the schema might be expressed using SHACL shapes [1]. In the case of RDF data, we will assume a closed world and only consider the axioms and facts included in the dataset. Also, if the use case involves linksets, other metrics such as interlinking completeness can be used.

Hope this helps us discuss the completeness metrics more concretely. It will be interesting to hear other experiences in defining completeness metrics, and counterexamples where it is easy to define universal completeness metrics.

Best Regards,
Nandana

[1] http://w3c.github.io/data-shapes/shacl/
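[Editor's note: the three use-case-specific metrics Nandana describes can be sketched as below. The province names, attribute names, and figures are invented for illustration; the structure follows the schema/population/column definitions in the message above.]

```python
# Hypothetical sketch of use-case-specific completeness metrics for
# "economic indicators of Spanish provinces" (all data invented).

REQUIRED_ATTRS = {"total_population", "unemployment_level", "avg_income"}
REQUIRED_PROVINCES = {"Madrid", "Barcelona", "Valencia", "Sevilla"}

dataset = {
    "Madrid":    {"total_population": 6_750_000, "unemployment_level": 0.10},
    "Barcelona": {"total_population": 5_660_000, "avg_income": 28_000},
    "Valencia":  {"total_population": 2_590_000},
}

def schema_completeness(schema_attrs) -> float:
    """Fraction of required attributes present in the observed schema."""
    return len(REQUIRED_ATTRS & set(schema_attrs)) / len(REQUIRED_ATTRS)

def population_completeness(data) -> float:
    """Fraction of required provinces present in the data."""
    return len(REQUIRED_PROVINCES & set(data)) / len(REQUIRED_PROVINCES)

def column_completeness(data) -> float:
    """Fraction of (province, attribute) cells holding a value, under a
    closed world assumption: anything not stated counts as missing."""
    cells = [attr in data.get(prov, {})
             for prov in REQUIRED_PROVINCES
             for attr in REQUIRED_ATTRS]
    return sum(cells) / len(cells)

observed_schema = set().union(*(row.keys() for row in dataset.values()))
print(schema_completeness(observed_schema))    # all 3 attrs appear somewhere
print(population_completeness(dataset))        # 3 of 4 provinces present
print(round(column_completeness(dataset), 3))  # 5 of 12 cells filled
```

A different use case (e.g. Spanish movies) would simply swap in different `REQUIRED_ATTRS` and a different required population, and the same dataset would score differently, which is exactly the point of the message above.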

Received on Wednesday, 30 September 2015 14:30:20 UTC