RE: data quality from Michael Schneider on 2010-04-20 (semantic-web@w3.org from April 2010)

From: Michael Schneider <schneid@fzi.de>
Date: Wed, 21 Apr 2010 00:18:59 +0200
To: <paoladimaio10@googlemail.com>
Cc: <semantic-web@w3.org>, "Polleres, Axel" <axel.polleres@deri.org>
Message-ID: <0EF30CAA69519C4CB91D01481AEA06A001CADD76@judith.fzi.de>

Hi Paola!

The list of validators [1] that you mention is very heterogeneous, ranging
from basic RDF syntax checkers, through special-purpose validators that look
for certain flaws (such as Eyeball), to a full-fledged OWL DL
syntax-and-semantics validator (btw, the Pellet-based validator happens to
be largely outdated). Each of these validators can be said to represent a
certain notion of "data quality", but all these notions are pretty different
from each other.

So, does this help to answer the question of what "good data quality" is? Or
why data is sometimes "not fit for purpose"? No single one of those
validators can claim to represent /the/ answer. An RDF parser may tell you
that your data is fine, while Eyeball disagrees. Your data may even be OWL
DL compliant, and yet some URIs are not dereferenceable, or there is trouble
with how FOAF is used in practice, or the prefixes used look weird to some
other validator, etc...
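
To make the point concrete, here is a minimal sketch of two independent
checks that disagree about the same data. It is not how Eyeball or any of
the listed validators actually work; the sample triples, the check names and
the rule that foaf:mbox values should be mailto: IRIs (a common FOAF
convention) are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical sample data: triples as plain (subject, predicate, object) strings.
TRIPLES = [
    ("http://example.org/alice", FOAF + "name", "Alice"),
    ("http://example.org/alice", FOAF + "mbox", "alice@example.org"),
]


def is_absolute_iri(term):
    """Rough test: the term has a scheme and something after it."""
    parts = urlparse(term)
    return bool(parts.scheme) and bool(parts.netloc or parts.path)


def check_term_shape(triples):
    """Weak check: every subject and predicate looks like an absolute IRI."""
    return all(is_absolute_iri(s) and is_absolute_iri(p) for s, p, _ in triples)


def check_foaf_mbox(triples):
    """Vocabulary-specific check: foaf:mbox values are mailto: IRIs."""
    return all(o.startswith("mailto:") for _, p, o in triples if p == FOAF + "mbox")


# Each entry stands for one notion of "data quality".
CHECKS = {"term shape": check_term_shape, "foaf:mbox usage": check_foaf_mbox}


def run_checks(triples):
    """Run every check and report one verdict per check."""
    return {name: check(triples) for name, check in CHECKS.items()}


print(run_checks(TRIPLES))
```

The same triples pass the weak check and fail the vocabulary-specific one,
so "is this data good?" has no single answer until one says which check is
meant.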
 
Someone may define "good data quality" to be data that passes *all* these
validators (and possibly additional ones not listed). But this would be an
even higher bar for web data to reach than the one I discussed as an example
in another post, namely using full OWL DL compliance as a criterion for the
quality of data, because OWL DL compliance would then be only one part of
this definition (represented by the Pellet validator). And I already claimed
that hoping "only" for full OWL DL compliance of "real web" data (e.g. the
LOD stuff) is pretty unrealistic for the majority of existing (and upcoming)
web data.

So, what is "good quality data" in the end? Why is data sometimes not fit
for purpose? From this discussion, I can't really tell! As I stated in my
first mail: The "data quality" question is a tricky one. :)

Another point: I just asked you what you mean by "valid RDF", since I use
this term for syntactically correct RDF (according to the RDF spec). Now, if
you really meant this, then this would be the /weakest/ possible criterion
for data quality: if a document that claims to be an RDF document turns
out not to be parseable at all, then it's actually not RDF, and I would say
that it doesn't really count as data (it's just an undefined soup of
characters). So I would not really want to understand "valid RDF" as a
criterion for "good quality data", just as I would not understand knowing
the alphabet as a criterion for good writing. :-)

In fact, it looks to me as if the pedantic web people do not discuss the
topic of invalid RDF documents at all. All the items in the FAQ at the
pedantic-web page [2] already assume syntactic correctness of the
investigated RDF data. The data problems discussed there are on a higher
level, typically of a kind that is covered by one of the validators that
look for specific flaws, such as literals whose datatype does not match
the range of the property used, etc.
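
As a rough illustration of such a higher-level flaw, the following sketch
flags literals whose datatype differs from the range declared for their
property. The ex: vocabulary, the declared ranges and the sample statements
are hypothetical, and the comparison is plain IRI equality, so it ignores
datatype derivation (for example, xsd:int is derived from xsd:integer),
which a real validator would have to take into account.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/vocab#"  # hypothetical vocabulary

# Hypothetical declared ranges: property IRI -> expected literal datatype IRI.
DECLARED_RANGES = {
    EX + "age": XSD + "integer",
    EX + "birthday": XSD + "date",
}

# Literal-valued statements as (subject, property, lexical form, datatype IRI).
STATEMENTS = [
    ("http://example.org/alice", EX + "age", "42", XSD + "integer"),
    ("http://example.org/bob", EX + "age", "forty-two", XSD + "string"),
    ("http://example.org/bob", EX + "birthday", "1970-01-01", XSD + "date"),
]


def range_mismatches(statements, ranges):
    """Return the statements whose literal datatype is not the declared range."""
    problems = []
    for subject, prop, lexical, datatype in statements:
        expected = ranges.get(prop)
        # Properties without a declared range are not judged at all.
        if expected is not None and datatype != expected:
            problems.append((subject, prop, lexical, datatype, expected))
    return problems


for subject, prop, lexical, datatype, expected in range_mismatches(
    STATEMENTS, DECLARED_RANGES
):
    print(f"{subject}: {prop} has {lexical!r} typed {datatype}, expected {expected}")
```

Note that all three statements would be perfectly fine for an RDF parser;
the mismatch only becomes visible once the declared range is consulted.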

Sure, there will certainly be quite a bunch of broken RDF documents on the
web. But it should be obvious to their authors that they need to be fixed,
since otherwise they are simply invisible to tools that want to exploit
existing data on the web. And fixing (only syntactically) broken RDF isn't
that difficult, anyway, provided that one uses an appropriate RDF authoring
tool. Hence, I see no necessity for the pedantic-web folks to put this issue
on their list.

Cheers,
Michael

[1] <http://pedantic-web.org/tools.html>
[2] <http://pedantic-web.org/fops.html>

From: paoladimaio10@googlemail.com [mailto:paoladimaio10@googlemail.com] On
Behalf Of Paola Di Maio
Sent: Monday, April 19, 2010 8:43 PM
To: Michael Schneider
Cc: semantic-web@w3.org; Polleres, Axel
Subject: Re: data quality

Hi Michael

> May I ask what you mean by "valid RDF" here?

any RDF which does not validate

> You refer to "many validators"? Which? There are, indeed, many, for
> different languages. Do you only mean the RDF validators?

sorry, maybe that was incorrect, I took the word validators from the third
tab on this page

 http://pedantic-web.org/

> Maybe you can provide a serious example for what you mean by /invalid/ RDF?
> By "serious" I mean something that could really be found in some document on
> the web, where people believed that it would be valid, but it isn't (no
> typos).

I personally have limited experience with RDF
but I remember once one of the RDF elements (fields? properties?) was
supposed to be a URI
but the RDF generator we used did not specify it had to be a URI, so we
entered a word (literal?)
and validation failed; when a valid URI was entered, the RDF validated
I am sure the pedantic people will have compiled a catalogue of reasons why
validation fails?

hope I address your questions

P

--
Dipl.-Inform. Michael Schneider
Research Scientist, Information Process Engineering (IPE)
Tel  : +49-721-9654-726
Fax  : +49-721-9654-727
Email: michael.schneider@fzi.de
WWW  : http://www.fzi.de/michael.schneider
=======================================================================
FZI Forschungszentrum Informatik an der Universität Karlsruhe
Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe
Tel.: +49-721-9654-0, Fax: +49-721-9654-959
Stiftung des bürgerlichen Rechts, Az 14-0563.1, RP Karlsruhe
Vorstand: Prof. Dr.-Ing. Rüdiger Dillmann, Dipl. Wi.-Ing. Michael Flor,
Prof. Dr. Dr. h.c. Wolffried Stucky, Prof. Dr. Rudi Studer
Vorsitzender des Kuratoriums: Ministerialdirigent Günther Leßnerkraus
=======================================================================
Received on Tuesday, 20 April 2010 22:19:35 UTC