Re: DQV - metrics related to the completeness dimension

You can avoid "universal" completeness by allowing both publishers and
consumers to publish their confidence level in the data.  The individual
confidence attributes would then be aggregated into an index of confidence
and doubt, much like a set of product reviews.  This approach is more
organic to how the data has been used and continues to be used.
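
Just to sketch how such an index might work (the rating scale, the simple
averaging, and all names below are illustrative assumptions, not anything
DQV defines):

# Sketch: aggregating publisher/consumer confidence ratings into an
# index of confidence and doubt, like averaging product reviews.
from dataclasses import dataclass
from statistics import mean

@dataclass
class ConfidenceRating:
    rater: str    # who rated the data
    role: str     # "publisher" or "consumer" (hypothetical distinction)
    score: float  # 0.0 = no confidence .. 1.0 = full confidence

def confidence_index(ratings):
    # Aggregate into a confidence/doubt pair; purely a sketch, no
    # particular weighting scheme is implied.
    confidence = mean(r.score for r in ratings)
    return {"confidence": confidence, "doubt": 1.0 - confidence,
            "ratings": len(ratings)}

ratings = [ConfidenceRating("stats-office", "publisher", 0.9),
           ConfidenceRating("journalist", "consumer", 0.6),
           ConfidenceRating("researcher", "consumer", 0.8)]
print(confidence_index(ratings))
# -> {'confidence': 0.766..., 'doubt': 0.233..., 'ratings': 3}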

Just a thought.



Best Regards,

Steve

Motto: "Do First, Think, Do it Again"



From:	Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
To:	Data on the Web Best Practices Working Group
            <public-dwbp-wg@w3.org>
Date:	09/27/2015 08:07 PM
Subject:	DQV - metrics related to the completeness dimension



Hi all,

In the F2F (re: action-153), we talked about the difficulties of defining
metrics for measuring completeness and the need for examples. Here's some
input from a project we are working on at the moment.

TL;DR version

It's hard to define universal completeness metrics that suit everyone.
However, completeness metrics can be defined for concrete use cases or
specific contexts of use. In the case of RDF data, a closed-world
assumption has to be applied to calculate completeness.

Longer version

Quality is generally defined as "fitness for *use*". Further, completeness
is defined as "The degree to which subject data associated with an entity
has values for all expected attributes and related entity instances *in a
specific context of use*" [ISO 25012]. It's important to note that both
definitions emphasize that perceived quality depends on the intended use.
Thus, a dataset that is fully complete for one task might be quite
incomplete for another.

Because of this, it's not easy to define a metric that universally measures
the completeness of a dataset. However, for a concrete use case, such as
calculating some economic indicators for Spanish provinces, we can define a
set of completeness metrics.

In this case, we can define three metrics (a sketch illustrating all three
follows the list):
(i) Schema completeness, i.e. the degree to which required attributes are
not missing from the schema. In our use case, the attributes we are
interested in are the total population, unemployment level, and average
personal income of a province, and schema completeness is calculated
against those attributes.
(ii) Population completeness, i.e. the degree to which elements of the
required population are not missing from the data. In our use case, the
population we are interested in is all the provinces of Spain, and
population completeness is calculated against them.
(iii) Column completeness, i.e. the degree to which the values of the
required attributes are not missing from the data. Column completeness is
calculated using the schema and the population defined above and the facts
in the dataset.
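
As a rough illustration, the three metrics could be computed along the
following lines. The attribute names, the toy records, and the ratio-based
formulas are my own assumptions for this sketch, not definitions from DQV
or ISO 25012.

# Sketch of the three completeness metrics for the Spanish-provinces
# use case, under a closed-world assumption: only what the dataset
# actually asserts counts.
REQUIRED_ATTRIBUTES = {"totalPopulation", "unemploymentLevel",
                       "avgPersonalIncome"}
REQUIRED_POPULATION = {"Madrid", "Barcelona", "Sevilla", "Valencia"}
# (in reality, all 50 provinces of Spain)

dataset = {  # entity -> {attribute: value}
    "Madrid":    {"totalPopulation": 6751000, "unemploymentLevel": 0.10},
    "Barcelona": {"totalPopulation": 5714000, "unemploymentLevel": 0.09,
                  "avgPersonalIncome": 14500},
    "Sevilla":   {"totalPopulation": 1947000},
}

schema_attributes = {a for record in dataset.values() for a in record}

# (i) Schema completeness: required attributes present in the schema.
schema = len(REQUIRED_ATTRIBUTES & schema_attributes) / len(REQUIRED_ATTRIBUTES)

# (ii) Population completeness: required entities present in the data.
population = len(REQUIRED_POPULATION & dataset.keys()) / len(REQUIRED_POPULATION)

# (iii) Column completeness: (entity, attribute) cells that have values.
cells = len(REQUIRED_POPULATION) * len(REQUIRED_ATTRIBUTES)
filled = sum(1 for e in REQUIRED_POPULATION
               for a in REQUIRED_ATTRIBUTES
               if a in dataset.get(e, {}))
column = filled / cells

print(schema, population, column)  # 1.0 0.75 0.5 with the toy data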

With these metrics, we can now measure the completeness of the dataset for
our use case. As we can see, these metrics are quite specific to it. If we
later have another use case, say about Spanish movies, we can define a
different set of schema, population, and column completeness metrics, and
the same dataset will have different values for those metrics.

If the data providers foresee some specific use cases, they might be able
to define concrete completeness metrics and make them available as quality
measures. If not, the data consumers can define more specific completeness
metrics for their use cases and measure values for those metrics. These
completeness metrics can be used to evaluate the "fitness for use" of
different datasets for a given use case. To compute population
completeness, the required population must be known. The required
attributes and other schema constraints might be expressed using SHACL
shapes [1].
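
For instance, a shape along the following lines could declare the three
required attributes, and a validator could check a dataset against it. The
ex: vocabulary is invented for illustration; only the sh: terms come from
the SHACL draft at [1], and rdflib/pySHACL are just one possible toolchain,
not something the working group has settled on.

from rdflib import Graph
from pyshacl import validate

# The shape requires each ex:Province to carry the three attributes
# from our use case (sh:minCount 1 on each property).
shapes = Graph().parse(format="turtle", data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:ProvinceShape a sh:NodeShape ;
    sh:targetClass ex:Province ;
    sh:property [ sh:path ex:totalPopulation ;   sh:minCount 1 ] ;
    sh:property [ sh:path ex:unemploymentLevel ; sh:minCount 1 ] ;
    sh:property [ sh:path ex:avgPersonalIncome ; sh:minCount 1 ] .
""")

data = Graph().parse(format="turtle", data="""
@prefix ex: <http://example.org/> .

ex:Madrid a ex:Province ;
    ex:totalPopulation 6751000 ;
    ex:unemploymentLevel 0.10 .
""")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False: ex:Madrid lacks ex:avgPersonalIncome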

In the case of RDF data, we adopt a closed-world assumption and consider
only the axioms and facts included in the dataset. Also, if the use case
involves linksets, other metrics such as interlinking completeness can be
used.
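
For example, interlinking completeness could be measured, under the same
closed-world assumption, as the fraction of required entities that have at
least one link into an external dataset. The use of owl:sameAs and the
simple ratio below are my assumptions for the sketch:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL

EX = Namespace("http://example.org/")

g = Graph()
g.parse(format="turtle", data="""
@prefix ex: <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:Madrid owl:sameAs <http://dbpedia.org/resource/Province_of_Madrid> .
""")

# Closed world: only links asserted in the dataset count.
required = [EX.Madrid, EX.Barcelona, EX.Sevilla, EX.Valencia]
linked = sum(1 for e in required if (e, OWL.sameAs, None) in g)
print(linked / len(required))  # 0.25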

Hope this helps make the discussion of completeness metrics more concrete.
It will be interesting to hear about other experiences in defining
completeness metrics, and counterexamples where it is easy to define
universal completeness metrics.

Best Regards,
Nandana

[1] http://w3c.github.io/data-shapes/shacl/


Received on Wednesday, 30 September 2015 01:43:41 UTC