RE: Data Q&G vocabulary - report and questions for F2F

From: Makx Dekkers <mail@makxdekkers.com>
Date: Tue, 7 Apr 2015 22:31:37 +0200
To: "'Public DWBP WG'" <public-dwbp-wg@w3.org>
Message-ID: <000001d07171$d971c760$8c555620$@makxdekkers.com>
 

I strongly second Riccardo’s suggestion to base the work as much as possible on existing work, and in particular the DaQ vocabulary.

 

Let me also give the group a short report on the session on quality that I moderated at the Share-PSI workshop in Timisoara last month. If you’re interested, the raw session notes are available at http://www.w3.org/2013/share-psi/wiki/Timisoara/Scribe#Tuesday_17th_March_.2811:30_-_12:40_Parallel_Sessions_B.29.

 

A major outcome was that people in the session agreed that there are three main aspects of quality for which some form of metric (either quantitative or qualitative) is considered useful:

 

*         Availability

*         Processability

*         Accuracy/consistency/relevance

 

(These terms are defined and described in http://www.slideshare.net/OpenDataSupport/open-data-quality-29248578)

 

For availability, potential metrics that were mentioned were:

 

*         Yes/no, maybe with explanation why the data is not available (privacy, security, archived, lost, not yet captured etc.)

*         Open/restricted/registration, again possibly with explanation

*         For access/re-use

*         Indication of persistence and longevity
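
As an illustration only, the availability indications listed above could be recorded in a small structure like the following. This is a minimal sketch: the names AvailabilityStatement, AccessLevel and summary are hypothetical and do not come from the session notes or from any existing vocabulary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    """Hypothetical access levels mirroring open/restricted/registration."""
    OPEN = "open"
    RESTRICTED = "restricted"
    REGISTRATION = "registration"


@dataclass
class AvailabilityStatement:
    """Hypothetical record of the availability indications for one dataset."""
    available: bool                          # the yes/no indication
    access: Optional[AccessLevel] = None     # open/restricted/registration
    explanation: Optional[str] = None        # e.g. "privacy", "archived", "lost"
    persistence_note: Optional[str] = None   # free text on persistence and longevity

    def summary(self) -> str:
        """Return a short human-readable summary of the statement."""
        if not self.available:
            reason = self.explanation or "no reason given"
            return f"not available ({reason})"
        level = self.access.value if self.access else "unspecified access"
        return f"available, {level}"


statement = AvailabilityStatement(available=False, explanation="privacy")
print(statement.summary())
```

Keeping the explanation as free text reflects the qualitative side of the metric; a controlled list of reasons would be a separate design decision.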

 

For processability:

 

*         Level on the 5-star scale (although there were opinions that it is dangerous to attach value to the linking because the data might be good but link to ‘bad’ data)

*         Links to metadata standards used and data model/schema to enable automatic processing
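
To make the two processability indications concrete, here is a minimal sketch of how they might be captured. The class ProcessabilityStatement, its fields, and the example.org URL are assumptions made for illustration; they are not part of any agreed vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessabilityStatement:
    """Hypothetical record of the processability indications for one dataset."""
    stars: int                                                    # level on the 5-star scale
    metadata_standards: List[str] = field(default_factory=list)  # links to standards used
    schema_url: Optional[str] = None                              # link to data model/schema

    def __post_init__(self) -> None:
        # The 5-star scale only has levels 1 to 5.
        if not 1 <= self.stars <= 5:
            raise ValueError("stars must be between 1 and 5")

    def has_processing_hints(self) -> bool:
        """True when both a metadata standard and a schema link are given."""
        return bool(self.metadata_standards) and self.schema_url is not None


statement = ProcessabilityStatement(
    stars=3,
    metadata_standards=["http://www.w3.org/ns/dcat#"],
    schema_url="http://example.org/schema/air-quality",  # hypothetical URL
)
print(statement.has_processing_hints())
```

The star level and the links are kept as separate fields so that a dataset can be described by its schema even when attaching value to its star level is contested.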

 

In the discussion related to the cluster accuracy/consistency/relevance, it was also noted that it might be useful to include some information about the context (e.g. why the data was created and what purpose it is supposed to serve).

 

On another level, the comment was made that quality is not a stable characteristic of a resource – some quality aspects deteriorate over time, e.g. what is fresh today will be stale tomorrow if it is not maintained, updated, refreshed.
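
One way to picture this point is a freshness check that gives a different answer depending on the day it is run. This is only a sketch under the assumption that a publisher states an expected update interval; the function name is_stale and its parameters are hypothetical.

```python
from datetime import date, timedelta


def is_stale(last_updated: date, update_interval: timedelta, today: date) -> bool:
    """Return True when the data is older than its expected update interval."""
    return today - last_updated > update_interval


# The same dataset, assessed on two different days.
last_updated = date(2015, 3, 1)
monthly = timedelta(days=30)
print(is_stale(last_updated, monthly, today=date(2015, 3, 17)))
print(is_stale(last_updated, monthly, today=date(2015, 4, 7)))
```

Because the result depends on the assessment date, a recorded quality value is only meaningful together with the date on which it was computed.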

 

At the end, we agreed to look at the ODI certificate approach to see how the elements of the certificate relate to the quality aspects that were discussed.

 

Hope this helps, Makx.


From: Riccardo Albertoni [mailto:albertoni@ge.imati.cnr.it] 
Sent: 7 April 2015 15:31
To: contact@carlosiglesias.es
Cc: Antoine Isaac; Public DWBP WG
Subject: Re: Data Q&G vocabulary - report and questions for F2F

 

Dear all, 

 

Let me share some of my thoughts, hoping they might contribute to the discussion.

 

1) Antoine has mentioned the following two scope issues:

"Quality Vocabulary to express dataset compliance to Best practices" vs "Quality vocabulary to express metrics for data quality"

 

I think both are in scope and should be addressed.

I might change my mind after a proper discussion, but in my opinion:

 

- the latter, "Quality vocabulary to express metrics for data quality", should be addressed by providing an RDF vocabulary, the so-called "Quality Vocabulary". I think the Quality Vocabulary should be produced by revising and extending Jeremy's daQ ontology [1], which has been mentioned by Carlos and others, and by specializing some other W3C ontologies. For example, starting from daQ and other W3C vocabularies, we might

(a) double-check that any kind of quality metric can be easily represented and that the Quality Vocabulary can be adopted as a means to exchange quality results;

(b) extend the vocabulary so that it can cover the competency questions derived from requirement analysis (e.g., my list of CQs from the BP document [2], once the list has been properly revised by the group);

(c) include other quality representations besides metric results. Don't get me wrong: I am a big supporter of metrics. In my own research I am actually trying to define new metrics for linkset quality (e.g., [3]), but I suspect not all providers want to deal with metrics. Might they need to document quality in a less "machine-oriented" way, for example by providing guided descriptions of known issues? Here it would be a great help to get a list of approaches followed in the literature or by people in the group, especially for "non-linked-data" open datasets. Carlos has already sent some; are there any others, beyond those included in [4], that the group considers relevant examples?
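
To illustrate points (a) and (c) together, the sketch below shows one possible shape for a quality report that carries both metric results and free-text descriptions of known issues. It is not daQ and does not use daQ's terms; the classes MetricResult, KnownIssue and QualityReport, and the example.org identifiers, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class MetricResult:
    """Hypothetical result of applying one metric to a dataset."""
    metric: str                      # identifier of the metric, e.g. an IRI
    value: Union[float, bool, str]   # quantitative or qualitative outcome


@dataclass
class KnownIssue:
    """Hypothetical free-text description of a known quality issue."""
    description: str


@dataclass
class QualityReport:
    """Hypothetical container holding both kinds of quality representation."""
    dataset: str
    results: List[MetricResult] = field(default_factory=list)
    issues: List[KnownIssue] = field(default_factory=list)

    def is_documented(self) -> bool:
        """True when at least one metric result or known issue is present."""
        return bool(self.results or self.issues)


report = QualityReport(dataset="http://example.org/dataset/1")  # hypothetical IRI
report.issues.append(KnownIssue("Coordinates before 2010 are rounded to 1 km."))
report.results.append(MetricResult("http://example.org/metric/completeness", 0.93))
print(report.is_documented())
```

Referring to a metric only by its identifier means the report structure does not need to know which dimensions or metrics exist, so any taxonomy of them can evolve independently.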

 

- I think the former, "Quality Vocabulary to express dataset compliance to Best practices", should first be addressed in the best practice document, for example by defining a set of levels/profiles for compliance (see the discussion on 5 stars; I tend to endorse Phil's proposal) and defining a procedure to evaluate compliance (perhaps later we might take advantage of SHACL (Shapes Constraint Language) if it serves the goal).

Of course, later on, statements of compliance with a certain level/profile of best practice might be one of the other "quality representations" to put beside metric results.

 

2) Concerning which quality dimensions to consider: surely it is interesting to know which of the possible quality dimensions are most appealing to the group. At the same time, I suspect plenty of effort is going to be spent defining quality measures in the next years, and the set of dimensions/metrics may change a lot in the near and not-so-near future. So in my opinion, at least for the moment, we should leave the taxonomy of dimensions and metrics out of the core quality vocabulary and provide it as a sort of non-normative example taxonomy, perhaps in a separate namespace.

 

I wonder whether there are objections or radically different views in the group on these points.

 

Regards, 

Riccardo

 

[1] http://butterbur04.iai.uni-bonn.de/ontologies/daq/daq

[2] https://www.w3.org/2013/dwbp/wiki/Requirements_From_FPWD_BP

[3] Riccardo Albertoni, Asunción Gómez-Pérez: Assessing linkset quality for complementing third-party datasets. EDBT/ICDT Workshops 2013: 52-59

[4] https://www.w3.org/2013/dwbp/wiki/Data_quality_notes#Links.2C_related_work

 

On 6 April 2015 at 11:22, Carlos Iglesias <contact@carlosiglesias.es> wrote:

Good. I'm also adding the Dataset Quality Vocabulary (daQ) as a reference: http://butterbur04.iai.uni-bonn.de/ontologies/daq/daq

 

Best,

 CI.

 

On 4 April 2015 at 18:37, Antoine Isaac <aisaac@few.vu.nl> wrote:

Hi Carlos,

Thanks a lot for the links!
I've been collecting a list at
https://www.w3.org/2013/dwbp/wiki/Data_quality_notes#Links.2C_related_work
I've added those of yours that were not there (all but one!)

We should certainly study all this at one point.
For the moment, however, we'd like to try defining quality from our own use cases and best practices, especially for deciding what is in scope or not.
There is indeed a lot of related work, mostly academic, and this could end in trying to tackle many things, some perhaps less important than others.

Cheers,

Antoine

PS: @Carlos sorry I won't have time to answer on the other (BP/vocabulary) thread very soon...

On 4/4/15 3:37 AM, Carlos Iglesias wrote:

Hi Antoine, all,

I think there is extensive literature on the different data quality characteristics that may be useful here as well.
Some examples are:

- Data quality under the computer science perspective
http://www.academia.edu/2746633/Data_quality_under_the_computer_science_perspective

- Data quality at a glance
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.8628&rep=rep1&type=pdf

- A metrics-driven approach for quality assessment of LOD
http://www.scielo.cl/pdf/jtaer/v9n2/art06.pdf

- Socio-technical impediments of Open Data
http://www.ejeg.com/issue/download.html?idArticle=255

- Risk Analysis to Overcome Barriers to Open Data
http://www.ejeg.com/issue/download.html?idArticle=296

- Quality Assessment Methodologies for Linked Open Data
http://www.semantic-web-journal.net/system/files/swj414.pdf

There are also other authoritative resources we may consider, such as:

- The Sebastopol principles
https://public.resource.org/8_principles.html

- ISO 8000 Data quality series.

- ISO 25012 Data quality model.

Hope it helps.
  Best,
  CI.

On 3 April 2015 at 18:42, Antoine Isaac <aisaac@few.vu.nl> wrote:

    Dear all,

    One week has passed since our previous report. The situation is roughly the same. Since there was no reaction to my previous email, I'm trying a different format.

    We analyzed Q&G aspects in the Use Cases and Requirements FPWD:
    - assessing which requirements should be in scope for the Q&G work [1]
    - extracting the relevant Q&G stuff from the descriptions of Use Cases [2]

    The outcome is that use cases have very diverse views on quality. There are two main issues for scoping the voc:

    1. Focusing on expressing metrics for data quality
    VS.
    Also expressing compliance of datasets with the best practices from our BP WD.

    2. Focusing on a general framework to express metrics for data quality and exchange results along specific quality dimensions
    VS.
    Defining specific metrics with such framework.


    Meanwhile, we have started extracting requirements from the best practices [3]

    This includes identifying 'competency questions' guiding us to add classes and properties in the voc.

    In general we feel we don't have much material to continue our work.
    In fact most of the competency questions come from Riccardo, not from the best practices in the WD.

    One option is to ask use case owners more precise questions. We started a questionnaire [4].

    What is the group's reaction on this?
    Can this be discussed at the F2F?

    I am afraid that without further input it will be hard to keep to our schedule [5], which is already very late compared to the charter.

    Antoine, on behalf of Riccardo, Deirdre and Christophe.

    [1] https://www.w3.org/2013/dwbp/wiki/Requirements_In_Scope_For_Quality
    [2] https://www.w3.org/2013/dwbp/wiki/Quality_Aspects_In_Use_Cases
    [3] https://www.w3.org/2013/dwbp/wiki/Requirements_From_FPWD_BP
    [4] https://www.w3.org/2013/dwbp/wiki/QualityQuestionnaire
    [5] https://www.w3.org/2013/dwbp/wiki/Data_quality_schedule




--
---

Carlos Iglesias.
Open Data Consultant.
+34 687 917 759
contact@carlosiglesias.es
@carlosiglesias
http://es.linkedin.com/in/carlosiglesiasmoro/en


-- 

----------------------------------------------------------------------------
Riccardo Albertoni
Istituto per la Matematica Applicata e Tecnologie Informatiche "Enrico Magenes"
Consiglio Nazionale delle Ricerche
via de Marini 6 - 16149 GENOVA - ITALIA
tel. +39-010-6475624 - fax +39-010-6475660
e-mail: Riccardo.Albertoni@ge.imati.cnr.it
Skype: callto://riccardoalbertoni/
LinkedIn: http://www.linkedin.com/in/riccardoalbertoni
www: http://www.ge.imati.cnr.it/Albertoni

http://purl.oclc.org/NET/riccardoAlbertoni
FOAF:http://purl.oclc.org/NET/RiccardoAlbertoni/foaf
Received on Tuesday, 7 April 2015 20:32:14 UTC
