AW: Low Quality Data (was before Re: AW: ANN: LOD Cloud - Statistics and compliance with best practices) from Christian Fuerber on 2010-10-26 (public-lod@w3.org from October 2010)

From: Christian Fuerber <c.fuerber@unibw.de>
Date: Tue, 26 Oct 2010 22:07:50 +0200
To: "'Kingsley Idehen'" <kidehen@openlinksw.com>
Cc: <juanfederico@gmail.com>, <public-lod@w3.org>, <martin.hepp@ebusiness-unibw.org>
Message-ID: <000401cb7549$7828e700$687ab500$@unibw.de>

Hi Kingsley,

thanks for the discussion. My comments are inline:

> Christian,
> 
> No matter how you cut it, this matter is inherently subjective, ditto every
> comment I am going to make about this matter via my comments below:
> 
> We have to understand and accept that heterogeneity is a fact of life that is
> magnified by the Web.

I totally agree with you!

> 
> In the real world we coalesce around "world views" and their subjective
> truths.
> 
> You can never explicitly deem one data space or the data sets it hosts as
> being canonically high or low quality. Of course, said data sets or host data
> spaces may or may not appropriately serve a specific data driven need for: a
> human, humans, agents, or a collection of agents working on behalf of
> humans.
> 
> Nothing wrong with constraints that serve the needs of a specific data driven
> task, we just can't deem any subjective criteria as canonical re.
> data quality, in a general sense.

I agree that data quality criteria and the state of data quality generally
depend on the task the data is used for. But IMO there are a few exceptions,
i.e. data quality rules that we can commonly agree upon. Maybe we can
commonly agree that the property dbpedia-owl:populationTotal cannot take
negative values like these: http://bit.ly/9MCqQ2 . Do you think a rule such
as "the population of a populated place can never be below 0" may be
commonly acceptable, or even an absolute truth?
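
To make the rule concrete, here is a minimal sketch in Python of what
checking it could look like. It is only an illustration: the records are
made-up (resource, populationTotal) pairs, not real DBpedia data, and the
function name find_negative_populations is hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical (resource, populationTotal) pairs; not real DBpedia data.
Record = Tuple[str, int]


def find_negative_populations(records: Iterable[Record]) -> List[Record]:
    """Return the records that violate the rule 'population >= 0'."""
    return [(resource, value) for resource, value in records if value < 0]


sample = [
    ("ex:TownA", 12000),
    ("ex:TownB", -5),   # violates the rule
    ("ex:TownC", 0),    # zero is allowed; only values below 0 are flagged
]

bad = find_negative_populations(sample)
print(bad)
```

Against real data the same check would more naturally be a SPARQL query
with a filter on the property value; the Python version just shows the
logic of the rule.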

Regarding the data quality constraints at http://semwebquality.org/ : They
were not designed to constrain community-driven data creation. Primarily,
they were designed for closed settings and to facilitate data quality checks
before using data for certain tasks, so we can gain insight into quality
problems and heterogeneities that may lie in the data. This will be
especially important if we intend to build applications upon SemWeb data or
use the data to make decisions. We designed them as SPIN query templates, so
everybody may define their own and, therefore, subjective data quality
rules.
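
The template idea can be sketched roughly as follows. This is a loose
analogy in Python, not the actual SPIN templates (those are SPARQL-based);
the names lower_bound_rule and violations and the sample data are
assumptions made up for this example. A generic rule is parameterized once,
and each user instantiates it with their own property and threshold.

```python
from typing import Callable, Dict, List

Resource = Dict[str, float]          # property name -> numeric value
Rule = Callable[[Resource], bool]    # True means the resource passes


def lower_bound_rule(prop: str, minimum: float) -> Rule:
    """Build a rule that passes when `prop` is absent or >= minimum."""
    def rule(resource: Resource) -> bool:
        return prop not in resource or resource[prop] >= minimum
    return rule


def violations(resources: Dict[str, Resource], rule: Rule) -> List[str]:
    """Return the sorted names of resources that fail the rule."""
    return sorted(name for name, res in resources.items() if not rule(res))


# One user's (subjective) instantiation of the generic template.
non_negative_population = lower_bound_rule("populationTotal", 0)

# Hypothetical data for illustration only.
data = {
    "ex:TownA": {"populationTotal": 12000},
    "ex:TownB": {"populationTotal": -5},
    "ex:TownC": {"elevation": -3},
}

print(violations(data, non_negative_population))
```

Someone with a different task could instantiate the same template with
another property or bound, which is the sense in which the rules stay
subjective.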

Cheers,

Christian

> 
> One person's Spam is another person's Ham. Such is the case in the real-
> world and so it shall remain re. Web of Linked Data. Context is king!
> 
> IMHO. The beauty of the Web of Linked Data lies in our ability to "agree to
> disagree" without shedding an ounce of blood. Basically, we arrive at deeper
> insights via true exploitation of gestalt -- which doesn't require imposition
> of absolute truth on anyone. Heterogeneity is the spice of life. We are
> inherently imperfect by design.
> 
> 
> --
> 
> Regards,
> 
> Kingsley Idehen
> President & CEO
> OpenLink Software
> Web: http://www.openlinksw.com
> Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca: kidehen

Received on Tuesday, 26 October 2010 20:13:44 UTC