- From: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
- Date: Wed, 30 Sep 2015 11:19:23 +0200
- To: Makx Dekkers <mail@makxdekkers.com>
- Cc: "Debattista, Jeremy" <Jeremy.Debattista@iais.fraunhofer.de>, Steven Adler <adler1@us.ibm.com>, Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
- Message-ID: <CAAOEr1mQom_UbBtZ2aEY8Tw33j-ikh2ExyqYAwgGBPWPVHPnog@mail.gmail.com>
Hi Makx,

I completely agree. Further, if we can enrich the completeness measure with
a description or definition of what is meant by being "complete" in the
context of that concrete measure, i.e., X, Y, Z observed attributes for all
measuring stations (the observed population), I think we can make it a more
useful measure. In that case, as a consumer, I can compare my definition of
"completeness", which has "required attributes" and a "required population",
against the "observed attributes" and "observed population" provided by the
data provider or the quality evaluator, and decide whether the given measure
tells me the dataset is complete for my use.
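A minimal sketch of that consumer-side check in plain Python (the function,
attribute names, and province names below are illustrative assumptions, not
part of any proposed vocabulary):

    # Hypothetical check: does the provider's declared coverage satisfy
    # a consumer's definition of "complete"?
    def covers(observed_attributes, observed_population,
               required_attributes, required_population):
        # Complete for this consumer if every required attribute and
        # every required population element is declared as observed.
        return (set(required_attributes) <= set(observed_attributes) and
                set(required_population) <= set(observed_population))

    # Made-up example values:
    observed_attrs = {"total_population", "unemployment_level", "average_income"}
    observed_pop = {"Madrid", "Barcelona", "Valencia"}
    print(covers(observed_attrs, observed_pop,
                 {"total_population", "average_income"},
                 {"Madrid", "Valencia"}))  # True: the measure covers this use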
Best Regards,
Nandana

On Wed, Sep 30, 2015 at 10:59 AM, Makx Dekkers <mail@makxdekkers.com> wrote:

> All,
>
> Aren’t we making this too complex?
>
> It seems to me that in certain cases there can be ‘absolute’ measures of
> quality. For example, if I publish a dataset with air quality observations
> from 50 measuring stations, I can state that the dataset is complete
> because it contains all observations from all measuring stations, or that
> it is not complete because observations from stations X and Y are missing.
> This is not subjective at all.
>
> Defining quality as “fitness for use” allows for a discussion about the
> completeness of the approach behind the dataset, e.g. someone can argue
> that my measurements of air quality do not include parameters that are
> crucial for their research and that therefore the measurements are
> “incomplete”. I would argue that the dataset is still “complete”.
>
> In my mind, the three completeness metrics (schema completeness,
> population completeness, column completeness) as formulated by Nandana
> point mainly to the quality of the approach, as they talk about “required
> attributes”, not to the quality of the dataset itself. If you replace the
> phrases “required attributes” and “required population” with “observed
> attributes” and “observed population”, you get an objective measure of
> the completeness of the dataset.
>
> Of course, if a user has particular requirements in terms of the set of
> attributes, and a dataset contains a different set of attributes, that
> dataset may not be “fit for (this user’s) use”, but it could still be 100%
> complete with respect to its own set of attributes and population.
>
> Makx.
>
> *From:* Debattista, Jeremy [mailto:Jeremy.Debattista@iais.fraunhofer.de]
> *Sent:* 30 September 2015 09:26
> *To:* Steven Adler <adler1@us.ibm.com>
> *Cc:* Nandana Mihindukulasooriya <nmihindu@fi.upm.es>; Data on the Web
> Best Practices Working Group <public-dwbp-wg@w3.org>
> *Subject:* Re: DQV - metrics related to the completeness dimension
>
> What you said is true, Steven, and (in principle) I would agree on
> avoiding universal completeness in favour of a more sustainable measure.
> On the other hand, your solution is highly subjective and thus very hard
> to calculate. It would be nice to have such an index score, but I’m not
> quite sure that this will work in practice, as there are many factors
> that have to be considered.
>
> Cheers,
> Jer
>
> On 30 Sep 2015, at 03:42, Steven Adler <adler1@us.ibm.com> wrote:
>
> You can avoid "universal" completeness by allowing publishers and
> consumers to publish their confidence level in the data. The combination
> of confidence attributes would be calculated as an index of confidence
> and doubt, like a set of product reviews. This method is more organic to
> how the data has been and is used.
>
> Just a thought.
>
> Best Regards,
> Steve
>
> Motto: "Do First, Think, Do it Again"
>
> Nandana Mihindukulasooriya ---09/27/2015 08:07:02 PM---Hi all, In the
> F2F (re: action-153), we talked about the difficulties of defining
>
> From: Nandana Mihindukulasooriya <nmihindu@fi.upm.es>
> To: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
> Date: 09/27/2015 08:07 PM
> Subject: DQV - metrics related to the completeness dimension
> ------------------------------
>
> Hi all,
>
> In the F2F (re: action-153), we talked about the difficulties of defining
> metrics for measuring completeness and the need for examples. Here is
> some input from a project we are working on at the moment.
>
> TL;DR version
>
> It's hard to define universal completeness metrics that suit everyone.
> However, completeness metrics can be defined for concrete use cases or
> specific contexts of use. In the case of RDF data, a closed-world
> assumption has to be applied to calculate completeness.
>
> Longer version
>
> Quality is generally defined as "fitness for *use*". Further,
> completeness is defined as "the degree to which subject data associated
> with an entity has values for all expected attributes and related entity
> instances *in a specific context of use*" [ISO 25012]. It's important to
> note that both definitions emphasize that the perceived quality depends
> on the intended use. Thus, a dataset that is fully complete for one task
> might be quite incomplete for another.
>
> Consequently, it's not easy to define a metric that universally measures
> the completeness of a dataset. However, for a concrete use case, such as
> calculating some economic indicators for Spanish provinces, we can define
> a set of completeness metrics.
>
> In this case, we can define three metrics:
>
> (i) Schema completeness, i.e. the degree to which required attributes
> are not missing in the schema. In our use case, the attributes we are
> interested in are the total population, unemployment level, and average
> personal income of a province, and schema completeness is calculated
> over those attributes.
>
> (ii) Population completeness, i.e. the degree to which elements of the
> required population are not missing in the data. In our use case, the
> population we are interested in is all the provinces of Spain, and
> population completeness is calculated against them.
>
> (iii) Column completeness, i.e. the degree to which the values of the
> required attributes are not missing in the data. Column completeness is
> calculated using the schema and population defined above and the facts
> in the dataset.
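A minimal sketch of these three metrics in plain Python (the attribute
names, province names, and values are illustrative assumptions; a real
dataset would of course list all 50 provinces):

    # Requirements for the economic-indicators use case (made-up names).
    REQUIRED_ATTRIBUTES = {"total_population", "unemployment_level",
                           "average_income"}
    REQUIRED_POPULATION = {"Madrid", "Barcelona", "Sevilla"}

    # Toy dataset: one record per province, with some values missing.
    dataset = {
        "Madrid":    {"total_population": 6437000, "unemployment_level": 0.13},
        "Barcelona": {"total_population": 5540000},
    }

    # (i) Schema completeness: required attributes that appear at all.
    schema = {a for record in dataset.values() for a in record}
    schema_completeness = len(REQUIRED_ATTRIBUTES & schema) / len(REQUIRED_ATTRIBUTES)

    # (ii) Population completeness: required entities present in the data.
    population_completeness = (len(REQUIRED_POPULATION & dataset.keys())
                               / len(REQUIRED_POPULATION))

    # (iii) Column completeness: required (entity, attribute) cells filled.
    cells = [a in dataset.get(e, {}) for e in REQUIRED_POPULATION
             for a in REQUIRED_ATTRIBUTES]
    column_completeness = sum(cells) / len(cells)

    print(schema_completeness, population_completeness, column_completeness)
    # ~0.667, ~0.667, ~0.333 for this toy data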
> With these metrics, we can now measure the completeness of the dataset
> for our use case. As we can see, these metrics are quite specific to our
> use case. If we later have another use case, say about Spanish movies, we
> can define a different set of schema, population, and column completeness
> metrics, and the same dataset will have different values for those
> metrics.
>
> If the data providers foresee some specific use cases, they might be
> able to define some concrete completeness metrics and make them available
> as quality measures. If not, the data consumers can define more specific
> completeness metrics for their use cases and measure values for those
> metrics. These completeness metrics can then be used to evaluate the
> "fitness for use" of different datasets for a given use case.
>
> To measure population completeness, the required population has to be
> known. The required attributes and other constraints on the schema might
> be expressed using SHACL shapes [1].
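A present-day sketch of such a shape-based check, using rdflib and pySHACL
(the ex: namespace, property names, and values are made up, and pySHACL
itself postdates this thread; it stands in here for any SHACL processor):

    from rdflib import Graph
    from pyshacl import validate

    # Shape: a Province must have the three required attributes.
    shapes = Graph().parse(format="turtle", data="""
        @prefix sh: <http://www.w3.org/ns/shacl#> .
        @prefix ex: <http://example.org/> .

        ex:ProvinceShape a sh:NodeShape ;
            sh:targetClass ex:Province ;
            sh:property [ sh:path ex:totalPopulation ; sh:minCount 1 ] ;
            sh:property [ sh:path ex:unemploymentLevel ; sh:minCount 1 ] ;
            sh:property [ sh:path ex:averageIncome ; sh:minCount 1 ] .
    """)

    # Data: a province with only one of the required attributes.
    data = Graph().parse(format="turtle", data="""
        @prefix ex: <http://example.org/> .
        ex:Madrid a ex:Province ; ex:totalPopulation 6437000 .
    """)

    conforms, _, report = validate(data, shacl_graph=shapes)
    print(conforms)  # False: two required attributes are missing
    print(report)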
> In the case of RDF data, we apply a closed-world assumption and only
> consider the axioms and facts included in the dataset. Also, if the use
> case involves linksets, other metrics such as interlinking completeness
> can be used.
>
> Hope this helps us discuss the completeness metrics more concretely. It
> will be interesting to hear about other experiences in defining
> completeness metrics, and counter-examples where it is easy to define
> universal completeness metrics.
>
> Best Regards,
> Nandana
>
> [1] http://w3c.github.io/data-shapes/shacl/

Received on Wednesday, 30 September 2015 09:20:14 UTC