Re: Data Quality Vocabulary - feedback welcome! from Antoine Isaac on 2015-12-06 (public-dwbp-comments@w3.org from December 2015)

From: Antoine Isaac <aisaac@few.vu.nl>
Date: Sun, 6 Dec 2015 17:39:31 +0100
To: "Debattista, Jeremy" <Jeremy.Debattista@iais.fraunhofer.de>, "Bailer, Werner" <werner.bailer@joanneum.at>
CC: "public-dwbp-comments@w3.org" <public-dwbp-comments@w3.org>, "Russegger, Silvia" <silvia.russegger@joanneum.at>, "Orgel, Thomas" <Thomas.Orgel@joanneum.at>, "Höffernig, Martin" <Martin.Hoeffernig@joanneum.at>, "Ehgarter, Stephan" <Stephan.Ehgarter@joanneum.at>
Message-ID: <566464C3.2070503@few.vu.nl>
Dear Werner,

As one of the editors of the DQV spec, I'd like to thank you once again for the relevant feedback. This is priceless for us! Especially as you are actually looking at using DQV...

We had not reacted after Jeremy's answer below, as it seemed a great way to get the discussion going. But we are nearing the publication of a new DQV draft, and we would really like commenters to be happy with the way we tackle their input.

So, taking you at your word when you say you "are happy to continue the discussion and provide feedback on future iterations" :-)

We have raised issues to represent the different parts of your comments:
https://www.w3.org/2013/dwbp/track/issues/222 - Multiple/Derived values of a metric
https://www.w3.org/2013/dwbp/track/issues/223 - Parameters for metrics
https://www.w3.org/2013/dwbp/track/issues/224 - Expected Data type for metrics
https://www.w3.org/2013/dwbp/track/issues/225 - Levels of granularity for dimensions and categories

We hope these capture your concerns, and will mail you about each of them in separate emails.
In the meantime we will also mention these issues in the current editor's draft of DQV [1].

Best,

Antoine

[1] http://w3c.github.io/dwbp/vocab-dqg.html


>> There are however a few observations from the use of DQV that we'd like to share with the group.
>>
>> 1. daq:metric
>> a. multiple values of one metric
>>
>> We found that there are metrics for which a single output value may not be sufficient. In particular, this applies to statistics (which are listed as one dimension in the draft spec). For example, one may want to express the mean, min or max of a metric over a dataset, or provide an absolute and a relative (normalized) value for the same metric. Of course this could be done by defining multiple metrics, but then one would need a mechanism to group/link them or express their dependency.
>>
>> In the EBU, the working group on quality control [2] has defined a data model for the somewhat related problem of describing the quality of audiovisual content (with XML serialisations so far, not RDF). This model supports multiple output values, which can be typed. For DQV, this could for example be achieved by allowing multiple values and defining subproperties of daq:value.
>
> In theory, there is only one value for a metric - others are derivatives. With a daq:Observation, such derivatives should be easily defined by creating an external Data Cube measure property.
>
>> b. parameters
>>
>> For some metrics, input parameters could be required. E.g., there have been recent publications on metadata quality which use weights or target values in the metrics. For descriptions with quality measurements that are self-contained, it would be required to include the values of such parameters in the description of the metric.
>
> A daq:Metric (which is the equivalent class of dqv:Metric) has the property daq:requires. The purpose of that property is exactly for input parameters.
>
>
>> c. daq:expectedDataType
>>
>> This property from DAQ is defined to have range xsd:anySimpleType. While it seems useful to define the expected data type for a metric, a simple type may be too narrow: in many cases a metric will be determined on a data record or a subgraph.
>
> It will be taken into consideration - although I’m not sure it works well with Data Cube. Could you please provide us (or me) with an example where a quality metric returns a data record or subgraph?
>
>>
>> 2. Dimensions and categories
>>
>> The dimensions proposed seem quite high-level, so it is difficult to think of categories that are more general and group dimensions. In contrast, it seems in some cases desirable to have a level between dimensions and metrics. For example, we are dealing with assessing mapping quality. The metrics fall in the dimension of accuracy (i.e., does the output of the mapping process represent the object less accurately), and form a specific group there. To make the distinction between the different levels more confusing, the note in 7.3 Processability currently says "Level on the 5-star scale", which sounds more like a metric than a dimension (there could of course be metrics aggregating results from other metrics; daq:requires could be used to express such a dependency).
>> We are not sure if there is a strong need for categories; we would rather propose to consider nesting multiple levels of dimensions to allow grouping.
>
> I’m not sure if I understood “nesting multiple levels of dimensions” correctly, but a category groups a set of dimensions which have a common type of information as a quality indicator. For example the Accessibility category groups dimensions such as Availability, Security and Performance. Each of these dimensions has a number of different metrics, each assessing a different aspect of the dimension. This is how we define Category-Dimension-Metric in daq:
>
>     /A *Quality Dimension* is a characteristic of a dataset relevant to the consumer (e.g. Availability of a dataset)./
>
>     /A *Quality Metric* is a concrete quality measure for a concrete quality indicator, usually associated with a measuring procedure. This assessment procedure returns a score, which we also call the value of the metric. There are usually multiple metrics per dimension; e.g., availability can be measured by the accessibility of a SPARQL endpoint, or of an RDF dump. The value of a metric can be numeric (e.g., for the metric “human-readable labelling of classes, properties and entities”, the percentage of entities having an rdfs:label or rdfs:comment) or boolean (e.g. whether or not a SPARQL endpoint is accessible)./
>
>     /A *Category* is a group of quality dimensions in which a common type of information is used as quality indicator (e.g. Accessibility, which comprises not only availability but also dimensions such as security or performance). Grouping the dimensions into categories helps to organise the space of all quality aspects, given their large number./
>
>
>
> I hope this helps.
>
> Best Regards,
> Jeremy
>
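
For readers who prefer code to prose, the Category-Dimension-Metric hierarchy defined in the quoted reply above can be sketched roughly as follows. This is only an illustrative Python model, not part of DQV or daQ: the class names mirror the three terms, but the fields, the all_metrics helper and the example metric values are assumptions made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Metric:
    """A concrete measure whose assessment procedure returns a score."""
    name: str
    # Per the quoted definition, a value can be numeric or boolean.
    value: Union[float, bool, None] = None


@dataclass
class Dimension:
    """A characteristic of a dataset, measured by one or more metrics."""
    name: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class Category:
    """A group of dimensions sharing a common type of quality indicator."""
    name: str
    dimensions: List[Dimension] = field(default_factory=list)

    def all_metrics(self) -> List[Metric]:
        # Hypothetical helper: flatten the metrics of every grouped dimension.
        return [m for d in self.dimensions for m in d.metrics]


# Example taken from the definitions; the boolean values are invented.
availability = Dimension("Availability", [
    Metric("SPARQL endpoint accessible", True),
    Metric("RDF dump accessible", False),
])
accessibility = Category("Accessibility", [
    availability,
    Dimension("Security"),
    Dimension("Performance"),
])
```

In this sketch, a level between dimensions and metrics (as Werner suggests) would amount to letting a Dimension also hold sub-dimensions; the model above deliberately keeps to the three fixed levels described in the reply.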
Received on Sunday, 6 December 2015 16:40:02 UTC