[public-dwbp-comments] <none> from Duffes Guillaume on 2016-01-07 (public-dwbp-comments@w3.org from January 2016)

From: Duffes Guillaume <guillaume.duffes@insee.fr>
Date: Thu, 7 Jan 2016 10:37:12 +0000
To: "public-dwbp-comments@w3.org" <public-dwbp-comments@w3.org>
Message-ID: <39559E14C93A5B4AB4C314F40D1CD73D2E364668@pdexchbalwst01.ad.insee.intra>

Dear all,

First, thank you very much for your work on the Data Quality Vocabulary, which is of great interest to us at Insee, the French National Statistical Institute.
Below are some comments on the draft of the Data Quality Vocabulary. Many of the issues raised are preceded by explanations that make explicit my (correct or incorrect) understanding of the DQV classes and properties.

* DQV<http://www.w3.org/TR/2015/WD-vocab-dqv-20151217/> builds on and is considered an extension of DCAT<http://www.w3.org/TR/vocab-dcat/>:

* A dcat:Dataset is not a qb:DataSet: the former has a QualityMeasure, whereas the latter has dqv:MeasureQualityDataset as a sub-class.

* Dimensions, metrics, measures, attributes: DQV vs QB<http://www.w3.org/TR/vocab-data-cube/> (SDMX<https://sdmx.org/?page_id=5008>) terminology

* A DQV Quality Dimension is "a characteristic of a dataset relevant to the consumer (e.g., the availability of a dataset)". A QB DimensionProperty identifies in a dataset "the phenomenon being measured. Any given data structure must have at least one dimension." The same term, "dimension", is thus used for two different things.

* In QB, the availability of a dataset could be an attribute attached at the qb:DataSet level. The same business object (availability of a dataset) therefore takes a different form in each vocabulary.

* A DQV Quality Metric "gives a procedure for measuring a data quality dimension, which is abstract, by observing a concrete quality indicator. There are usually multiple metrics per dimension; e.g., availability can be indicated by the accessibility of a SPARQL endpoint, or of an RDF dump. The value of a metric can be numeric (e.g., for the metric "human-readable labeling of classes, properties and entities", the percentage of entities having an rdfs:label or rdfs:comment) or boolean (e.g. whether or not a SPARQL endpoint is accessible)." In the DQV model, a Quality Metric is actually a concrete Quality Measure for a concrete quality indicator. There is no such thing in QB, I guess because the Quality Metric is essentially quality-oriented. However, the issue of multiple values for a metric is raised here<https://lists.w3.org/Archives/Public/public-dwbp-comments/2015Nov/0000.html>. This also links back to the semantics of a metric and a measure, discussed below.

* The QB vocabulary can handle multi-measure observations, i.e. multiple observed values attached to an individual observation. A qb:measure is a property and has a qb:ComponentSet (e.g. a qb:DataStructureDefinition) as its domain. A dqv:QualityMeasure, on the contrary, is a class that inherits from qb:Observation but is attached to a dcat:Dataset. In DQV, providing an absolute and a relative value for the same metric over a dataset plays the same role as describing in QB a set of shipment data containing a unit count and a total weight for each observation. Nevertheless, in the former case this is expressed as several metrics, whereas in the latter it is expressed as several measures.

* A dqv:Category is a set of dimensions which have a common type of information as a quality indicator (e.g. the Accessibility category groups dimensions such as Availability, Security and Performance). QB does not define any category, although it is a well-known object in the SDMX information model on which QB is based. A category in SDMX has nothing to do with a dqv:Category and is defined as "an item at any level within a classification, typically tabulation categories, sections, subsections, divisions, subdivisions, groups, subgroups, classes and subclasses." One should keep in mind that another standard, DDI<http://www.ddialliance.org/explore-documentation> (which will have an RDF representation in its next version), has a broad definition of what a category is: "A description of a particular category or response". The interest of grouping dimensions in categories is also discussed here<https://lists.w3.org/Archives/Public/public-dwbp-comments/2015Nov/0000.html>. Indeed, hierarchical dimensions or dimension groups instead of categories would certainly make things easier, and would avoid confusion between the different meanings of a Category (at least in the statistical world).
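
To make the metric versus measure point above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: triples are modelled as plain tuples rather than with an RDF library, every ex: name (ex:computedOn, ex:metric, ex:value, ex:describes, the two metric/measure names, the node identifiers) is a hypothetical placeholder and not a term of DQV or QB, and the coverage function is just one possible reading of the "human-readable labeling" metric quoted above. It computes an absolute and a relative value for that metric, then shapes the result once as two dqv:QualityMeasure nodes (one per metric) and once as a single qb:Observation carrying two measures.

```python
# Triples are plain (subject, predicate, object) tuples; no RDF library is used.
# All ex: names are hypothetical placeholders, not terms defined by DQV or QB.

LABEL_PROPERTIES = {"rdfs:label", "rdfs:comment"}


def label_coverage(triples):
    """Return (absolute, relative) numbers of subjects having a label or comment."""
    subjects = {s for s, _, _ in triples}
    labelled = {s for s, p, _ in triples if p in LABEL_PROPERTIES}
    absolute = len(labelled)
    relative = absolute / len(subjects) if subjects else 0.0
    return absolute, relative


def as_dqv_measures(dataset, absolute, relative):
    """DQV-like shape: one dqv:QualityMeasure per metric, each with one value."""
    triples = []
    for node, metric, value in [
        ("ex:m1", "ex:labelledEntitiesCount", absolute),
        ("ex:m2", "ex:labelledEntitiesShare", relative),
    ]:
        triples += [
            (node, "rdf:type", "dqv:QualityMeasure"),
            (node, "ex:computedOn", dataset),  # hypothetical linking property
            (node, "ex:metric", metric),       # hypothetical linking property
            (node, "ex:value", value),         # hypothetical linking property
        ]
    return triples


def as_qb_observation(dataset, absolute, relative):
    """QB-like shape: one qb:Observation with both values as separate measures."""
    return [
        ("ex:obs1", "rdf:type", "qb:Observation"),
        ("ex:obs1", "ex:describes", dataset),  # hypothetical linking property
        ("ex:obs1", "ex:labelledEntitiesCount", absolute),
        ("ex:obs1", "ex:labelledEntitiesShare", relative),
    ]


# Tiny made-up dataset: four subjects, two of which carry a label.
sample = [
    ("ex:a", "rdfs:label", "A"),
    ("ex:a", "rdf:type", "ex:Thing"),
    ("ex:b", "rdf:type", "ex:Thing"),
    ("ex:c", "rdfs:label", "C"),
    ("ex:d", "rdf:type", "ex:Thing"),
]

absolute, relative = label_coverage(sample)
dqv_triples = as_dqv_measures("ex:dataset", absolute, relative)
qb_triples = as_qb_observation("ex:dataset", absolute, relative)
```

In this sketch the same two numbers end up on two typed nodes in the DQV-like shape and on one typed node in the QB-like shape, which is exactly the asymmetry described above.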

* The Dataset Quality (daQ)<http://butterbur04.iai.uni-bonn.de/ontologies/daq/daq> vocabulary is "a lightweight, extensible core vocabulary for attaching the result of quality benchmarking of a linked open dataset (usually an expensive process) to that dataset". It builds on the QB vocabulary.

* daQ and DQV duplicate a significant number of classes and properties, a practice that remains questionable. Since DQV follows the best practices for data vocabularies<http://www.w3.org/TR/dwbp/#dataVocabularies> identified by the Data on the Web Best Practices Working Group, the rationale for duplicating those classes and properties should be made explicit.

* QB is the basis of DQV and daQ.

* One month before the release of the DQV working draft, another vocabulary on quality was published: the Data Quality Management (DQM)<http://semwebquality.org/dqm-vocabulary/v1/dqm> vocabulary, which "provides an ontology for the structured representation of data requirements, data quality assessment results, data cleansing rules, and data requirement violations connected to their origin. It, therefore, supports data quality monitoring, data quality assessment, and data cleansing in Semantic Web architectures". The relationship (if any) between the two vocabularies is not made explicit in either of them, and I wonder how one could benefit from the other.

* Does DQM focus on describing the "procedure for measuring a data quality dimension [..]" that defines a dqv:Metric? Or is it simply a list of instances of dqv:Dimension such as accuracy (e.g. dqm:Accuracy)?

* What about the dqv:Dimension instances not defined in DQM? Is there also a recommendation to reuse DQM classes as dqv:Dimension when available?

* You might be aware of the European SIMS<http://ec.europa.eu/eurostat/documents/64157/4373903/SIMS-2-0-Revised-standards-November-2015-ESSC-final..pdf/47c0b80d-0e19-4777-8f9e-28f89f82ce18> vocabulary (based on the SDMX standard), which is the European Statistical System Quality Reference Metadata Standard. The conceptual relationship between the aforementioned quality vocabularies (DQV, DQM and daQ) and the SIMS (formerly ESQRS+ESMS) seems to be semantically strong. However, this relationship raises many questions in my mind:

* How would the European SIMS fit into the RDF quality vocabularies, in particular DQV? Have you already given it some thought?

* Does it make more sense to have a full QB version of SIMS rather than expressing directly (if possible) the SIMS metadata as instances of DQV and daQ classes?

Thank you very much for your feedback.

Regards,

Guillaume Duffes
INSEE

Received on Friday, 8 January 2016 13:30:26 UTC