Re: data quality

Data is intentionally made available in a declared format, but will not
necessarily comply with that format, unintentionally. (As Alex points out.)
There are a variety of formats; no one format is best, since purposes vary,
and some formats are mutually exclusive.
There are a variety of interpretations of compliance with any one format,
and these interpretations are mutually exclusive.
Some of the interpretations in use are (unintentionally) incorrect.
Data is intentionally consumed according to the declared format,
constrained as above.

Assume that all these constraints are successfully met in a particular
case, and that the most tightly constrained format available, perhaps OWL
DL, is applied. Could the result still be described as 'not fit for
purpose' in some possible circumstance?
I assume that it could.
I assume that it is in principle impossible to guarantee 'fit for purpose':
The set of logically correct statements is infinite.
The set of hypothetical counterfactual logically correct statements is also
infinite.
There is no sure, general method that can distinguish members of the second
set from members of the first (unless the second-set member is declared as
such).
If this is correct, the only action that can be taken to ascertain 'fitness
for purpose', or to remedy data that is not fit, is to apply purpose- and
circumstance-specific heuristics, with predictably variable results.

I expect this issue is of far greater importance than we allow.
There are points of dissonance and amplification of the impedance mismatch:
Where machine communicates with machine, mistaking a logically correct
statement for a factually correct one.
Where machine communicates with human, the ultimate consumer of the data,
who must then apply their own heuristics, also with variable results.
Where one human group communicates with another human group through a
machine while their intentions are in conflict, which is the common state
of affairs in the world. One group will apply their heuristics and find the
data reasonable and correct; the other will not.
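The first mismatch can be illustrated with a toy sketch in plain Python. It
uses a naive pattern check rather than a real RDF parser, and the triple
and URIs are invented for illustration: a statement can pass every formal
check the machine applies while remaining factually wrong, and nothing in
the check can tell the difference.

```python
import re

# A naive well-formedness check for a single N-Triples-style statement.
# (Illustration only -- a real RDF parser does far more than this.)
TRIPLE = re.compile(r'^<[^>]+>\s+<[^>]+>\s+(<[^>]+>|"[^"]*")\s*\.$')

def looks_well_formed(line: str) -> bool:
    """True if the line has the shape of a triple: subject, predicate, object."""
    return TRIPLE.match(line.strip()) is not None

# Syntactically fine, yet factually false -- the machine cannot tell:
false_but_valid = '<http://example.org/moon> <http://example.org/madeOf> "green cheese" .'
# Syntactically broken -- the only kind of error this check can catch:
broken = '<http://example.org/moon> madeOf "green cheese" .'

print(looks_well_formed(false_but_valid))  # True
print(looks_well_formed(broken))           # False
```

The check accepts the false statement and rejects the malformed one; formal
validity and factual correctness are simply different properties.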

The semantic web is designed to allow data to be enriched with other data.
One thought is that, in so doing, it will help to avoid the above problems.
Another thought is that it will exacerbate those problems: as the set of
properties associated with any one entity grows exponentially, dodgy doings
of one sort or another can hide away in the interstices.

If I am right, we can only concentrate on two areas.

   1. Exemplary usage, education and demonstration.
   2. Circumstance and purpose constrained heuristics.
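A heuristic of the second kind might be sketched as follows. This is a
hypothetical check in plain Python; the "homepage" property, the sample
records, and the rule that a homepage value must be an absolute http(s) URI
are all assumptions chosen for illustration, echoing the URI-versus-literal
validation failure Paola describes later in the thread.

```python
from urllib.parse import urlparse

def homepage_fit_for_purpose(value: str) -> bool:
    """Purpose-specific heuristic: for *this* consumer, a homepage value is
    only fit for purpose if it is an absolute http(s) URI, not a bare word."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Invented sample data: one well-formed URI, one bare word.
records = {
    "alice": "http://example.org/alice",
    "bob": "homepage",  # a word where a URI was expected
}

problems = {k: v for k, v in records.items() if not homepage_fit_for_purpose(v)}
print(problems)  # {'bob': 'homepage'}
```

A different consumer, with a different purpose, would write a different
rule; the results are variable by construction, which is exactly the point.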

I also do not think we should be confused by problems of validation and
compliance, but should build on that which is valid and does comply. It
seems to me that there are enough problems within this working subset.


Adam


On 20 April 2010 23:18, Michael Schneider <schneid@fzi.de> wrote:

> Hi Paola!
>
> The list of validators [1] that you mention is very heterogeneous, reaching
> from basic RDF syntax checkers, over special-purpose validators that look
> for certain flaws (such as Eyeball), up to a full-fledged OWL DL
> syntax-and-semantics validator (btw, the Pellet-based validator happens to
> be largely outdated). Each of these validators can be said to represent a
> certain notion of "data quality", but all these notions are pretty
> different from each other.
>
> So, does this help to answer the question what "good data quality" is? Or
> why data is sometimes "not fit for purpose"? No single one of those
> validators can claim to represent /the/ answer. An RDF parser may tell you
> that your data is fine, but Eyeball doesn't agree. Your data may even be
> OWL DL compliant, but some URIs are not dereferenceable, or there is
> trouble with FOAF pragmatics being used, or the used prefixes look weird
> to some other validator, etc...
>
> Someone may define "good data quality" to be data that passes *all* these
> validators (and possibly additional ones not listed). But this would be an
> even higher bar for web data to reach than the one I discussed as an
> example in another post, namely to use full OWL DL compliance as a
> criterion for the quality of data, because OWL DL compliance would then be
> only part of this definition (represented by the Pellet validator). And I
> already claimed that hoping "only" for full OWL DL compliance of "real web"
> data (e.g. the LOD stuff) is pretty unrealistic for the majority of
> existing (and upcoming) web data.
>
> So, what is "good quality data" in the end? Why is data sometimes not fit
> for purpose? From this discussion, I can't really tell! As I stated in my
> first mail: The "data quality" question is a tricky one. :)
>
> Another point: I just asked you the question what you mean by "valid RDF",
> since I am using this term for syntactically correct RDF (according to the
> RDF spec). Now, if you really meant this, then this would rather be the
> /weakest/ possible criterion for data quality, since if a document that
> claims to be an RDF document turns out not to be parseable at all, then
> it's actually not RDF, and I would say that it doesn't really count as
> data (it's just an undefined soup of characters). So I would not really
> want to understand "valid RDF" as a criterion for "good quality data",
> similarly as I would not understand knowing the alphabet as a criterion
> for good writing.
> :-)
>
> In fact, it looks to me that the pedantic web people do not discuss the
> topic of invalid RDF documents at all. All the items in the FAQ at the
> pedantic-web page [2] already assume syntactic correctness of the
> investigated RDF data. The data problems discussed there are on a higher
> level, typically of a kind that is covered by one of the validators that
> look for specific flaws, such as literals of a datatype that does not match
> the range of the used property, etc.
>
> Sure, there will certainly be quite a bunch of broken RDF documents on the
> web. But it should be obvious to their authors that they need to be fixed,
> since otherwise they are simply invisible to tools that want to exploit
> existing data on the web. And fixing (only syntactically) broken RDF isn't
> that difficult, anyway, provided that one uses an appropriate RDF authoring
> tool. Hence, I see no necessity for the pedantic-web folks to put this
> issue on their list.
>
> Cheers,
> Michael
>
> [1] <http://pedantic-web.org/tools.html>
> [2] <http://pedantic-web.org/fops.html>
>
> From: paoladimaio10@googlemail.com [mailto:paoladimaio10@googlemail.com]
> On Behalf Of Paola Di Maio
> Sent: Monday, April 19, 2010 8:43 PM
> To: Michael Schneider
> Cc: semantic-web@w3.org; Polleres, Axel
> Subject: Re: data quality
>
> Hi Michael
>
>
>
>
> May I ask what you mean by "valid RDF" here?
>
> any RDF which does not validate
>
> You refer to "many validators"? Which? There are, indeed, many, for
> different languages. Do you only mean the RDF validators?
>
>
> sorry, maybe that was incorrect, I took the word validators from the
> third tab on this page
>
>  http://pedantic-web.org/
>
> Maybe you can provide a serious example for what you mean by /invalid/ RDF?
> By "serious" I mean something that could really be found in some document
> on the web, where people believed that it would be valid, but it isn't (no
> typos).
>
>
> I personally have limited experience with RDF,
> but I remember that once one of the RDF elements (fields? properties?) was
> supposed to be a URI,
> but the RDF generator we used did not specify that it had to be a URI, so
> we entered a word (a literal?),
> and validation failed; when a valid URI was entered, the RDF validated.
> I am sure the pedantic people will have compiled a catalogue of reasons
> why validation fails?
>
>
> hope I address your questions
>
> P
>
> --
> Dipl.-Inform. Michael Schneider
> Research Scientist, Information Process Engineering (IPE)
> Tel  : +49-721-9654-726
> Fax  : +49-721-9654-727
> Email: michael.schneider@fzi.de
> WWW  : http://www.fzi.de/michael.schneider
> =======================================================================
> FZI Forschungszentrum Informatik an der Universität Karlsruhe
> Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe
> Tel.: +49-721-9654-0, Fax: +49-721-9654-959
> Stiftung des bürgerlichen Rechts, Az 14-0563.1, RP Karlsruhe
> Vorstand: Prof. Dr.-Ing. Rüdiger Dillmann, Dipl. Wi.-Ing. Michael Flor,
> Prof. Dr. Dr. h.c. Wolffried Stucky, Prof. Dr. Rudi Studer
> Vorsitzender des Kuratoriums: Ministerialdirigent Günther Leßnerkraus
> =======================================================================
>
>

Received on Wednesday, 21 April 2010 09:43:21 UTC