RE: data quality from Michael Schneider on 2010-04-19 (semantic-web@w3.org from April 2010)

From: Michael Schneider <schneid@fzi.de>
Date: Mon, 19 Apr 2010 13:23:49 +0200
To: "Polleres, Axel" <axel.polleres@deri.org>
Cc: <semantic-web@w3.org>, <paoladimaio10@googlemail.com>
Message-ID: <0EF30CAA69519C4CB91D01481AEA06A001CAD970@judith.fzi.de>
Hi!

The quality-of-data question is not an easy one, and it is far from clear what
"good quality" means for data. What you are doing at pedantic-web.org
[1] seems to be an effort to obtain some sort of "minimum practically
achievable quality" for the data existing on the web. This is very important
IMO, but other people probably won't be satisfied by this, because (amongst
other things) this minimum standard won't match their tools' requirements.

So, from the perspective of trying to make as many SemWeb tools as possible
happy, an alternative quality criterion could be OWL 2 DL compliance. An OWL
DL tool (parser, ontology management framework, editor, or reasoner)
requires, in principle, that the data fed into it meets all syntactic
restrictions defined in the OWL (2) DL specification [2][3]. But this is a
much higher bar than what pedantic-web asks for, and is IMHO unlikely to be
ever met by the majority of "real web data".

Some OWL DL tools relax some of these strict requirements. For example,
Pellet and, I think, the new OWL API as well apply some heuristics in order
to "repair" input data [4]. But, while useful, these approaches will likely
have their limits, in particular when applied to the "chaotic" data
available on the web, and may, in some cases, even lead to unintended
results. In any case, these approaches are strictly tool-specific, so if
someone authors data following these tool-specific relaxations, other
standard-compliant OWL DL tools cannot be expected to be able to cope with
this data. So, after all, if someone defines "data quality" in terms of OWL
DL compliance, then this should really mean /full/ OWL 2 DL compliance.
Which is, as said, tough to achieve on the web.

As an aside: Point 5 about "Reasoning" in the pedantic-web FAQ [5] discusses
problems with bogus values of inverse-functional data properties (IFDPs),
such as foaf:mbox_sha1sum, and how to cope with them. From an "OWL DL data
quality" perspective, this discussion would be largely redundant: The IFDPs
used in FOAF (and some other vocabularies as well) make use of
owl:InverseFunctionalProperty, and this is not even allowed in OWL DL, where
inverse-functional properties must not be data properties, i.e. must not be
used with literals. There is an alternative in OWL 2 DL called "Keys", but I
don't know of any vocabulary used on the web that applies this very new
feature (for the RDF encoding of OWL 2 Keys, look up the entry for the term
"owl:hasKey" in Table 16 of [3]).
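To make the IFDP point concrete, here is a minimal sketch of how one might
flag inverse-functional data properties in a set of triples. It is only an
illustration under stated assumptions: the triples are plain Python tuples
rather than the output of a real RDF parser, the function name is
hypothetical, and the example declarations are meant to mirror how FOAF
types foaf:mbox_sha1sum and foaf:homepage. It checks just this one OWL 2 DL
restriction, not DL compliance in general.

```python
# Hypothetical, self-contained sketch: triples are (subject, predicate, object)
# tuples of IRI strings; no RDF library is assumed.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL = "http://www.w3.org/2002/07/owl#"
OWL_IFP = OWL + "InverseFunctionalProperty"
OWL_DATATYPE_PROPERTY = OWL + "DatatypeProperty"
FOAF = "http://xmlns.com/foaf/0.1/"


def find_inverse_functional_data_properties(triples):
    """Return, sorted, every property typed both as an
    owl:InverseFunctionalProperty and as an owl:DatatypeProperty."""
    inverse_functional = set()
    data_properties = set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE:
            continue  # only type declarations matter for this check
        if obj == OWL_IFP:
            inverse_functional.add(subject)
        elif obj == OWL_DATATYPE_PROPERTY:
            data_properties.add(subject)
    # A property in both sets is an IFDP, which OWL 2 DL does not permit.
    return sorted(inverse_functional & data_properties)


# Illustrative input only (assumed to mirror the FOAF declarations).
EXAMPLE_TRIPLES = [
    (FOAF + "mbox_sha1sum", RDF_TYPE, OWL_IFP),
    (FOAF + "mbox_sha1sum", RDF_TYPE, OWL_DATATYPE_PROPERTY),
    (FOAF + "homepage", RDF_TYPE, OWL_IFP),
    (FOAF + "homepage", RDF_TYPE, OWL + "ObjectProperty"),
]

if __name__ == "__main__":
    for prop in find_inverse_functional_data_properties(EXAMPLE_TRIPLES):
        print("inverse-functional data property:", prop)
```

In this sketch foaf:homepage is not reported, because an inverse-functional
object property is fine in OWL 2 DL; only the combination with
owl:DatatypeProperty is flagged.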

Michael

[1] Pedantic-Web.org <http://pedantic-web.org>
[2] OWL 2 Structural Specification
<http://www.w3.org/TR/2009/REC-owl2-syntax-20091027/>
[3] OWL 2 Mapping to RDF Graphs
<http://www.w3.org/TR/2009/REC-owl2-mapping-to-rdf-20091027/>
[4] Pellet's relaxations <http://clarkparsia.com/pellet/faq/owl-full/>
[5] IFDP discussion: <http://pedantic-web.org/fops.html#ifps>

From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
Behalf Of Polleres, Axel
Sent: Monday, April 19, 2010 12:10 PM
To: paoladimaio10@googlemail.com; adam.saltiel@gmail.com;
uk-government-data-developers@googlegroups.com
Cc: semantic-web@w3.org
Subject: Re: data quality

Paola, 

You may want to check: 
http://www.pedantic-web.org/

on our efforts to improve data quality.

We also have a paper on findings so far at LDOW [1].

Cheers,
Axel

1. Aidan Hogan, Andreas Harth, Alexandre Passant, Stefan Decker, and Axel
Polleres. Weaving the pedantic web. In 3rd International Workshop on Linked
Data on the Web (LDOW2010) at WWW2010, Raleigh, USA, April 2010. 
________________________________________
From: semantic-web-request@w3.org 
To: adasal ; uk-government-data-developers@googlegroups.com 
Cc: Semantic Web 
Sent: Mon Apr 19 10:51:14 2010
Subject: data quality 

Something else I wanted to add but forgot, as it was a late post:


One of the issues that is coming up related to the discussion below is the
quality of data
(which came up in the gov data list a while back, hence in cc)

A question then is: why (in some cases) is the data 'not fit for purpose?'

Again, several possible hypotheses may need to be tested in each case

is the data inconsistent because the real world is inconsistent (the world
seems to hang together even when it does not make sense to us
while data models don't) - in which case maybe there is not much that we can
do, other than to continue to attempt creating plausible
models of the world

is the data any use before it is opened and rdfized? or does something
happen in the rdfization process?


let's not forget that to obtain meaningful outputs from databases, a lot of
work needs to go into it, I am thinking normalisation of schemas
but also, data cleaning, which constitutes a majority of efforts in data
mining

I don't think the fact that data is expressed in RDF would automatically make
it good


Again, a good digging through a significant set of examples of when 'data is
not fit for purpose' could yield some clues as to what kind of work needs to
be done

So I would be inclined, when something doesn't work, not just to throw it
away, but to study it systematically


After all, most of what we know in medicine has come from dissecting corpses


PDM

--
Dipl.-Inform. Michael Schneider
Research Scientist, Information Process Engineering (IPE)
Tel  : +49-721-9654-726
Fax  : +49-721-9654-727
Email: michael.schneider@fzi.de
WWW  : http://www.fzi.de/michael.schneider

=======================================================================
FZI Forschungszentrum Informatik an der Universität Karlsruhe
Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe
Tel.: +49-721-9654-0, Fax: +49-721-9654-959
Stiftung des bürgerlichen Rechts, Az 14-0563.1, RP Karlsruhe
Vorstand: Prof. Dr.-Ing. Rüdiger Dillmann, Dipl. Wi.-Ing. Michael Flor,
Prof. Dr. Dr. h.c. Wolffried Stucky, Prof. Dr. Rudi Studer
Vorsitzender des Kuratoriums: Ministerialdirigent Günther Leßnerkraus
=======================================================================
Received on Monday, 19 April 2010 11:24:25 UTC