Re: ACTION-153: Completeness as one of the quality dimensions

Dear Antoine, dear all,

I think the definition of completeness provided in [7] can help in this
discussion, as it clearly points out that there are different notions of
completeness.

"Definition 10 (Completeness). Completeness refers to the degree to which
all required information is present in a particular dataset. In terms of
LD, completeness comprises of the following aspects:
 (i) Schema completeness, the degree to which the classes and properties of
an ontology are represented, thus can be called “ontology completeness”,
(ii) Property completeness, measure of the missing values for a specific
property,
(iii) Population completeness is the percentage of all real-world objects
of a particular type that are represented in the datasets and
(iv) Interlinking completeness, which has to be considered especially in
LD, refers to the degree to which instances in the dataset are interlinked.
It should be noted that in this case, users should assume a
closed-world-assumption where a gold standard dataset is available and can
be used to compare against the converted dataset."

Concerning the statistics in Bio2RDF, I think they do not provide a
quality assessment per se, but they are indicators from which a completeness
assessment can be worked out.
For example, provided the number of expected entities in the dataset is
known, "number of entities in the dataset / number of entities in the gold
standard" is a rough metric for checking population completeness.
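
As a minimal sketch of that ratio in Python: the function name, the
identifiers and the toy gold standard below are made up for illustration and
do not come from Bio2RDF or DQV. As an extra assumption, only entities that
also appear in the gold standard are counted, so that unrelated entities in
the dataset cannot push the value above 1.

```python
def population_completeness(dataset_entities, gold_standard_entities):
    """Rough population completeness: share of gold-standard entities
    that are also present in the dataset (a value between 0 and 1)."""
    gold = set(gold_standard_entities)
    if not gold:
        raise ValueError("gold standard must contain at least one entity")
    # Count only dataset entities that the gold standard knows about.
    covered = set(dataset_entities) & gold
    return len(covered) / len(gold)


# Hypothetical identifiers, for illustration only.
gold_standard = {"ex:gene1", "ex:gene2", "ex:gene3", "ex:gene4"}
dataset = {"ex:gene1", "ex:gene2", "ex:gene3", "ex:other"}

score = population_completeness(dataset, gold_standard)
print(f"population completeness: {score:.2f}")
```

Without a gold standard the denominator is unknown, which is exactly the
situation where only the raw statistics remain.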

Basically, I raised issue 164 to figure out whether we have to explicitly
model "statistics" as a subclass of dqv:QualityInfo/dqv:QualityMetadata
in [8].

Issue 164 can be more clearly rephrased as:

<<Assuming actual quality info pertaining to completeness is not available
(e.g., no metric evaluation, no document saying the published dataset is
complete in terms of population), should indicators such as dataset
statistics, which might eventually be exploited for assessing quality, be
modeled in DQV?>>

I don't see an easy answer here.
If I wear my academic hat, I would probably answer no: statistics are
not a proper way to provide info about completeness, as they lack critical
pieces of info.

If I wear my data publisher/consumer hat, I would say: OK, perhaps
statistics are not a proper way to inform about quality, but they are better
than nothing. Moreover, very often a proper gold standard is not
available, and thus statistics become the only concrete info a data
provider can offer. So we can count them as something that helps in figuring
out the quality of a dataset, and they might even be considered a kind of
quality info.

It is not the most urgent issue we have on DQV, but I guess that at some
point we should figure out which of these views is the most
interesting.

Regards,
Riccardo
[7] http://www.semantic-web-journal.net/system/files/swj773.pdf
[8] https://www.w3.org/2013/dwbp/wiki/Data_Quality_Vocabulary_(DQV)


On 8 May 2015 at 09:11, Antoine Isaac <aisaac@few.vu.nl> wrote:

> Dear all,
>
> During the F2F I got an action to look at completeness as one of the
> quality dimensions [1]
>
> At least for me then, it was about trying to gather completeness-related
> material from our use cases and best practices. Of course there is more
> about completeness, e.g. in my own (cultural heritage) domain but I would
> rather focus on our stuff first, as the outside world is wide [2] and going
> through everything is far beyond one action.
>
> So my starting point is the pre-F2F gathering of quality-related aspects
> in the use cases [3]. Completeness (as represented by the req
> R-DataMissingIncomplete and R-QualityCompleteness) is mentioned in many UCs:
> 1 ASO: Airborne Snow Observatory
> 4 BuildingEye: SME use of public data
> 10 The Land Portal
> 12 LusTRE: Linked Thesaurus fRamework for Environment
> 14 Mass Spectrometry Imaging (MSI)
> 15 OKFN Transport WG
> 16 Open City Data Pipeline
> 18 Resource Discovery for Extreme Scale Collaboration (RDESC)
> 19 Recife Open Data Portal
> 20 Retrato da Violência (Violence Map)
> 22 Tabulae - how to get value out of data
> 24 Uruguay Open Data Catalog
>
> The wiki page at [3] has all quality-related extracts in the UC document.
> Most of these cases talk in very general terms (e.g. 'dataset must be
> complete') which strongly hints that completeness is indeed expected to be
> an indicator for quality.
>
> However, I could find only one use case that really defines concretely what
> completeness means in its context: it's UC #12, LusTRE, with Riccardo's
> paper [4]. It is focused on completeness of owl:sameAs linksets, i.e. sets
> of owl:sameAs links between two different sets. Its goal is to reflect how
> datasets can be 'complemented' via a linkset. Based on a small set of
> indicators (number of types, mappable types, etc), it proposes 3
> completeness measures:
> - the extent to which a linkset covers (all) types involved in its subject
> or object datasets.
> - the level of completeness of a linkset with respect to (linkable) types
> involved in its datasets.
> - the percentage of entities of a selected type considered in the linkset.
>
> One can say that linksets are a very specific case, as completeness is
> 'derived' from datasets. Still, this case is the only one I've seen with
> indicators and measures for completeness.
>
>
> Actually, another UC that brings concrete hints about completeness is
> UC #3, Bio2RDF [5].
> That one doesn't mention explicit completeness-related reqs. However, it
> does present a number of indicators that I think could relate to
> completeness:
>    total number of triples
>    number of unique subjects
>    number of unique predicates
>    number of unique objects
>    number of unique types
>    unique predicate-object links and their frequencies
>    unique predicate-literal links and their frequencies
>    unique subject type-predicate-object type links and their frequencies
>    unique subject type-predicate-literal links and their frequencies
>    total number of references to a namespace
>    total number of inter-namespace references
>    total number of inter-namespace-predicate references
>
> But I see there is an issue raised precisely about it [6] questioning
> whether it relates to quality. If we decide that it's not the case, then
> the Bio2RDF UC has not much about completeness!
>
> Best,
>
> Antoine
>
> [1] http://www.w3.org/2013/dwbp/track/actions/153
> [2]
> https://www.w3.org/2013/dwbp/wiki/Data_quality_notes#Links.2C_related_work
> [3] https://www.w3.org/2013/dwbp/wiki/Quality_Aspects_In_Use_Cases
> [4]
> http://www.edbt.org/Proceedings/2013-Genova/papers/workshops/a8-albertoni.pdf
> [5] http://www.w3.org/TR/2015/NOTE-dwbp-ucr-20150224/#UC-Bio2RDF
> [6] http://www.w3.org/2013/dwbp/track/issues/164


-- 
----------------------------------------------------------------------------
Riccardo Albertoni
Istituto per la Matematica Applicata e Tecnologie Informatiche "Enrico
Magenes"
Consiglio Nazionale delle Ricerche
via de Marini 6 - 16149 GENOVA - ITALIA
tel. +39-010-6475624 - fax +39-010-6475660
e-mail: Riccardo.Albertoni@ge.imati.cnr.it
Skype: callto://riccardoalbertoni/
LinkedIn: http://www.linkedin.com/in/riccardoalbertoni
www: http://www.ge.imati.cnr.it/Albertoni
http://purl.oclc.org/NET/riccardoAlbertoni
FOAF:http://purl.oclc.org/NET/RiccardoAlbertoni/foaf

Received on Friday, 8 May 2015 13:15:35 UTC