Re: [public-dwbp-comments] Comments from Guillaume Duffes on DQV

Dear Guillaume,

We are coming back to this rather old post. Thank you for your comments,
and sorry again that we did not reply as early as we had promised [1].
DQV has changed a lot since your original mail, which partly explains our delay.
The work-in-progress version of DQV is at [2].

You can find our replies inline with your comments. We hope they
clarify the doubts you have raised.

On 7 January 2016 at 11:37, Duffes Guillaume <guillaume.duffes@insee.fr> wrote:
> Dear all,
>
> Firstly thank you very much for your very good work on the Data Quality
> Vocabulary, which is of high interest for us at Insee, the French National
> Statistical Institute.


This is great: we are interested to hear from you if you have scenarios that could use DQV!


> Here are below some comments regarding the draft of the Data Quality
> Vocabulary. Many of the issues raised are preceded by some explanations
> making explicit my (incorrect or correct) understanding of the DQV classes
> and properties.
>
>
> DQV builds on and is considered as an extension of DCAT:
>
> A dcat:Dataset is not a qb:DataSet, the former has a QualityMeasure whereas
> the latter has dqv:MeasureQualityDataset as a sub-class.


Right. Note that the class QualityMeasure has been renamed in the
latest drafts; it is now dqv:QualityMeasurement.


>
>
> Dimensions, metrics, measure, attributes : DQV vs QB (SDMX) terminology
>


DQV and RDF Data Cube have clearly distinct purposes: RDF Data Cube
is not about quality. But we care about the compatibility of DQV with RDF
Data Cube, as this might have some nice side effects. For example,
encoding DQV quality measurements according to RDF Data Cube enables
the use of RDF Data Cube visualizers to browse DQV quality measurements.
However, in principle, DQV can be deployed ignoring RDF Data Cube.
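
To make this a little more concrete, here is a minimal sketch (plain Python, no RDF library) of how a single quality measurement could be written down as triples. The dqv: and qb: namespaces and the terms dqv:QualityMeasurement, dqv:computedOn, dqv:isMeasurementOf, dqv:value and qb:dataSet come from the vocabularies themselves; everything under http://example.org/ (the dataset, the metric, the measurement and the cube) is made up for illustration, as are the helper function names. The qb:dataSet triple is optional, which reflects the point that DQV can be used with or without RDF Data Cube.

```python
# Published namespaces for DQV, RDF Data Cube, RDF and XSD.
DQV = "http://www.w3.org/ns/dqv#"
QB = "http://purl.org/linked-data/cube#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
# Hypothetical example namespace: nothing under it really exists.
EX = "http://example.org/"


def iri(value):
    """Wrap an IRI in angle brackets, N-Triples style."""
    return "<" + value + ">"


def literal(value, datatype):
    """Build a typed literal, N-Triples style."""
    return '"' + value + '"^^' + iri(datatype)


def measurement_triples(measurement, dataset, metric, value, cube=None):
    """Return (subject, predicate, object) triples for one quality measurement.

    If `cube` is given, the measurement is also attached to that qb:DataSet,
    so that RDF Data Cube tooling can pick it up. Without it, the description
    uses DQV terms only.
    """
    subject = iri(measurement)
    triples = [
        (subject, iri(RDF + "type"), iri(DQV + "QualityMeasurement")),
        (subject, iri(DQV + "computedOn"), iri(dataset)),
        (subject, iri(DQV + "isMeasurementOf"), iri(metric)),
        (subject, iri(DQV + "value"), literal(value, XSD + "double")),
    ]
    if cube is not None:
        triples.append((subject, iri(QB + "dataSet"), iri(cube)))
    return triples


def to_ntriples(triples):
    """Serialise triples as one N-Triples statement per line."""
    return "\n".join(" ".join(triple) + " ." for triple in triples)


# DQV only: no reference to RDF Data Cube at all.
plain = measurement_triples(
    EX + "measurement1", EX + "dataset1", EX + "labelCoverageMetric", "0.85"
)

# Same measurement, additionally exposed as part of a (hypothetical) cube.
with_cube = measurement_triples(
    EX + "measurement1",
    EX + "dataset1",
    EX + "labelCoverageMetric",
    "0.85",
    cube=EX + "qualityCube",
)

print(to_ntriples(with_cube))
```

The two variants differ only in the extra qb:dataSet statement. A real deployment would more likely use an RDF library and Turtle, and would also need a data structure definition for the cube, which this sketch leaves out.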


> DQV Quality Dimension is “a characteristic of a dataset relevant to the
> consumer (e.g., the availability of a dataset)”. A QB DimensionProperty
> identifies in a dataset “the phenomenon being measured. Any given data
> structure must have at least one dimension.” The same terminology
> “dimension” is used for things which are different.


The terminology clashes, but the concepts of Quality Dimension,
Quality Metric and Quality Measure(ment) are very well established in
the field of information/data quality, so they had to be reused in
DQV. This might be a little confusing at first, especially
for people who are familiar with RDF Data Cube and less familiar with the
quality field. In the document, we mostly refer to quality
dimensions, and only in a couple of places to RDF Data Cube dimensions (see
appendix E. Compatibility with RDF Data Cube). We tried to make it clear
from the context whether we refer to the former or the latter.

>
> In QB the availability of a dataset could be an attribute attached at the
> qb:DataSet level. Here the same business object (availability of a dataset)
> takes different forms in each vocabulary.
>
> A DQV Quality Metric “gives a procedure for measuring a data quality
> dimension, which is abstract, by observing a concrete quality indicator.
> There are usually multiple metrics per dimension; e.g., availability can be
> indicated by the accessibility of a SPARQL endpoint, or of an RDF dump. The
> value of a metric can be numeric (e.g., for the metric “human-readable
> labeling of classes, properties and entities”, the percentage of entities
> having an rdfs:label or rdfs:comment) or boolean (e.g. whether or not a
> SPARQL endpoint is accessible).”. In the DQV model, a Quality Metric is
> actually a concrete Quality Measure for a concrete quality indicator. There
> is no such thing in QB since I guess the Quality Metric is essentially
> quality-oriented. However the issue of multiple values for a metric is raised
> here. This links back as well to the semantics of a metric and a measure
> discussed below.
>
> The QB vocabulary allows to handle multi-measure observations, i.e multiple
> observed values attached to an individual observation. A qb:measure is a
> property and has a qb:ComponentSet (e.g a qb:DataStructureDefinition) as
> domain. A dqv:QualityMeasure is on the contrary a class that inherits from a
> qb:Observation but is attached to a dcat:Dataset. In DQV providing an
> absolute and a relative value for the same metric over a dataset plays the
> same role as describing in QB a set of shipment data containing unit count
> and total weight for each observation. Nevertheless in the former this is
> expressed as several metrics whereas in the latter as several measures.
>
> A dqv:Category is a set of dimensions which have a common type of
> information as a quality indicator (e.g Accessibility category groups
> dimensions such as Availability, Security and Performance). QB does not
> define any category although it is a well-known object in the SDMX
> information model on which is based QB. A category in SDMX has nothing to do
> with a dqv:Category and is defined as “an item at any level within a
> classification, typically tabulation categories, sections,
> subsections, divisions, subdivisions, groups, subgroups, classes and
> subclasses.” One should keep in mind that other standard such as DDI (which
> will have an RDF representation in its next version) has a broad acception
> of what a category is: “A description of a particular category or response”.
> The interest of grouping dimensions in categories is discussed as well here.
> Indeed hierarchical dimensions or dimension groups instead of categories
> would certainly make things easier, and avoid confusion between different
> meanings of a Category (at least in the statistical world).
>
>
> The Dataset Quality (daQ) vocabulary is “a lightweight, extensible core
> vocabulary for attaching the result of quality benchmarking of a linked open
> dataset (usually an expensive process) to that dataset”. It builds on QB
> vocabulary.
>
> DaQ and DQV duplicate a significant number of classes and properties, a
> practice that remains questionable. DQV follows the best practices for data
> vocabularies identified by the Data on the Web Best Practices Working Group;
> then the rationale for duplicating those classes and properties should be
> made explicit.



This was probably not that evident in the DQV version that you
commented on, but it should be clearer in the latest version of
the draft: it is not a mere duplication. The names and semantics of
the daQ classes and properties have been changed to meet the
group's requirements, so we could not keep the daQ URIs. Besides, we
wanted to have URIs under the W3C umbrella.

>
> QB is the basis of DQV and daQ.


We would rather say that DQV and daQ reuse QB. In fact, not all
DQV elements (e.g. QualityAnnotation and Policy) are encoded in QB,
and generally speaking, a QB dataset is not a quality dataset.

>
> One month before the release of the DQV working draft, another vocabulary on
> Quality was published: the Data Quality Management (DQM) vocabulary
> “provides an ontology for the structured representation of data
> requirements, data quality assessment results, data cleansing rules, and
> data requirement violations connected to their origin. It, therefore,
> supports data quality monitoring, data quality assessment, and data
> cleansing in Semantic Web architectures”. The relationship (if any) between
> both vocabularies is not made explicit in any of each, and I am wondering
> how one could benefit from the other.


There are different vocabularies related to quality: besides DQM,
there are other ontologies such as qmo [3]. In some sense, the fact
that more vocabularies from different parties keep coming up
shows that there was a need for a W3C effort. However, the
primary goal of DQV is not to harmonize all existing ontologies, but
rather to meet the requirements of the DWBP group.
Mappings to other quality vocabularies are interesting, but
unfortunately we could not address them all. Further efforts in this
direction will be made in future working groups if their use cases require it.
Note that, contrary to what you say, the DQM vocabulary was not published one month
before DQV, but over three years before. It has not been active since then,
and it was not mature when it stopped.
There are more active efforts, notably the RDFUnit work,
which we are liaising with (http://aksw.org/Projects/RDFUnit.html).
DQM also focused a lot on data validation aspects, which in the W3C context are
currently being addressed by SHACL, and are thus out of scope for a vocabulary like DQV.


>
> Is DQM a focus on the description of the “procedure for measuring a data
> quality dimension [..]” that defines a dqv:Metric? Or simply a list of
> instances of dqv:Dimension such as accuracy (e.g dqm:Accuracy)?
>
> What about the dqv:Dimension not defined in DQM? Is there also a
> recommendation to reuse DQM classes as dqv:Dimension when available?
>
> You might be aware of the European SIMS vocabulary (based on the SDMX
> standard) which is the European Statistical System Quality Reference
> Metadata Standard. The conceptual relationship between the aforementioned
> quality vocabularies (DQV, DQM and daQ) and the SIMS (formerly ESQRS+ESMS)
> seems to be semantically strong. However this relationship raises many
> questions in my mind:
>
> How would the European SIMS fit into the RDF quality vocabularies in
> particular DQV? Have you already had some thoughts about it?


Not really.
We do not know much about the European Statistical System Quality Reference
Metadata Standard and SIMS. The SIMS vocabulary seems tailored to a
specific directive, whilst DQV does not make any assumption about the
domain.

Some sort of translation between SIMS and DQV can in principle be
provided, assuming we can identify metrics and dimensions in SIMS.
Also, the development of a domain-specific version of DQV can be
considered on the basis of a more precise set of requirements.

>
> Does it make more sense to have a full QB version of SIMS rather than
> expressing directly (if possible) the SIMS metadata as instances of DQV and
> daQ classes?
>


Sorry, what makes sense depends a lot on the use cases you are
considering. We do not know much about the use cases that you are
referring to, so we are not in a position to answer that
question.

Your question makes sense, but it should probably be addressed in a
more specific working group or project.

We do not have the resources in the short term to address this kind of
issue, but we might be able to support you with this in the
future.

Kind regards,

Riccardo and Antoine

[1] https://lists.w3.org/Archives/Public/public-dwbp-comments/2016Mar/0006.html
[2] http://w3c.github.io/dwbp/vocab-dqg.html
[3] http://vocab.linkeddata.es/qmo/index.html#

Received on Wednesday, 10 August 2016 22:53:09 UTC