Re: AW: Low Quality Data (was before Re: AW: ANN: LOD Cloud - Statistics and compliance with best practices) from Kingsley Idehen on 2010-10-26 (public-lod@w3.org from October 2010)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 26 Oct 2010 17:45:39 -0400
To: Christian Fuerber <c.fuerber@unibw.de>
CC: juanfederico@gmail.com, public-lod@w3.org, martin.hepp@ebusiness-unibw.org
Message-ID: <4CC74C03.3000005@openlinksw.com>
On 10/26/10 4:07 PM, Christian Fuerber wrote:
> Hi Kingsley,
>
> thanks for the discussion. My comments are inline:
>
>> Christian,
>>
>> No matter how you cut it, this matter is inherently subjective, ditto every
>> comment I am going to make about this matter via my comments below:
>>
>> We have to understand and accept that heterogeneity is a fact of life that is
>> magnified by the Web.
> I totally agree with you!
>
>> In the real world we coalesce around "world views" and their subjective
>> truths.
>>
>> You can never explicitly deem one data space or the data sets it hosts as
>> being canonically high or low quality. Of course, said data sets or host data
>> spaces may or may not appropriately serve a specific data driven need for: a
>> human, humans, agents, or a collection of agents working on behalf of
>> humans.
>>
>> Nothing wrong with constraints that serve the needs of a specific data driven
>> task, we just can't deem any subjective criteria as canonical re.
>> data quality, in a general sense.
> I agree that data quality criteria and the state of data quality generally
> depend on the task the data is used for. But IMO there are a few exceptions,
> i.e. data quality rules that we can commonly agree upon.

Yes, but when we agree, we are still being subjective. In the real world 
we coalesce around claims that become subjectively accepted norms.

> Maybe we can
> commonly agree that property dbpedia-owl:populationTotal cannot obtain
> negative values like these http://bit.ly/9MCqQ2 . Do you think data quality
> rules such as "the population of a populated place can never be below 0" may
> be a commonly acceptable data quality rule or even an absolute truth?
A preference for a specific use case oriented towards population analysis.
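
To make the rule under discussion concrete, here is a minimal sketch of a
"population cannot be below 0" check over a handful of triples. The sample
subjects and values are invented for illustration (they are not DBpedia
data), and the function name `violations` is hypothetical; only the
dbpedia-owl:populationTotal property comes from the thread.

```python
# dbpedia-owl:populationTotal expanded to its full IRI.
POPULATION_TOTAL = "http://dbpedia.org/ontology/populationTotal"

# Hypothetical (subject, predicate, object) triples; values are made up.
triples = [
    ("http://example.org/place/A", POPULATION_TOTAL, 1200),
    ("http://example.org/place/B", POPULATION_TOTAL, -5),
    ("http://example.org/place/C", "http://example.org/elevation", -1),
]


def violations(triples, prop, minimum=0):
    """Return (subject, value) pairs whose value for `prop` is below `minimum`."""
    return [(s, o) for s, p, o in triples if p == prop and o < minimum]


# The rule is applied only by a consumer who chooses to apply it.
flagged = violations(triples, POPULATION_TOTAL)
```

The lower bound is a parameter rather than a constant, which keeps the rule
a choice made by whoever runs the check.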

Now, here is what I would do in this sort of case (and I do it from time 
to time). I make my own data space, fix the data and cross-reference back 
to DBpedia.

Since this is my own world view, expressed in my own data space, I use 
owl:sameAs (for coreference between DBpedia and my tweaked DBpedia 
entities) in conjunction with the ability to apply this inference 
context conditionally re. SPARQL queries or Faceted exploration. Thus, 
only when you apply my inference rules to your interaction with DBpedia 
will my data come into scope re. construction of your query solution. 
As you can see, my world view is conditional. You only encounter it when 
you explicitly bring it into scope.
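
A rough sketch of that conditional scoping, under stated assumptions: the
two data spaces are modelled as plain lists of triples, the resource IRIs
and population values are invented, and `lookup` is a hypothetical helper,
not how any particular SPARQL engine implements inference contexts. The
overlay's owl:sameAs links are followed only when the caller passes the
overlay in.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
POPULATION_TOTAL = "http://dbpedia.org/ontology/populationTotal"

# Hypothetical source data containing the questionable value.
base = [("http://dbpedia.org/resource/Example", POPULATION_TOTAL, -5)]

# Hypothetical personal data space: a corrected entity linked back via sameAs.
overlay = [
    ("http://example.org/fixed/Example", SAME_AS,
     "http://dbpedia.org/resource/Example"),
    ("http://example.org/fixed/Example", POPULATION_TOTAL, 1200),
]


def lookup(subject, prop, base, overlay=None):
    """Values of `prop` for `subject`; the overlay counts only if supplied."""
    values = [o for s, p, o in base if s == subject and p == prop]
    if overlay is None:
        return values  # the other world view stays out of scope
    # owl:sameAs is symmetric, so collect aliases in both directions.
    aliases = {s for s, p, o in overlay if p == SAME_AS and o == subject}
    aliases |= {o for s, p, o in overlay if p == SAME_AS and s == subject}
    return values + [o for s, p, o in overlay if s in aliases and p == prop]


plain = lookup("http://dbpedia.org/resource/Example", POPULATION_TOTAL, base)
scoped = lookup("http://dbpedia.org/resource/Example", POPULATION_TOTAL,
                base, overlay)
```

With the overlay in scope both values appear in the result, as they would
under sameAs inference; which one to trust remains the consumer's decision.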

Now imagine if this particular data set (within my data space) is 
useful, and others seek to use my data for solutions where population 
accuracy is vital, while still working with DBpedia URIs. In blade 
fashion (sorta similar to ORDBMS data blades of yore), I can do the 
following:

1. Make my dataset available to everyone or just you (I can scope access 
to you or groups by WebID)
2. Give you the option to use or ignore my data.

As you can see, I am espousing choice rather than one data space rules 
them all, under all circumstances, so damned that "owl:sameAs" property 
and its potential for inaccurate mappings potentially send "all or 
nothing" forward-chained reasoners out to the linked data abyss :-)


> Regarding the data quality constraints at http://semwebquality.org/ : They
> were not designed to constrain community driven data creation. Primarily,
> they were designed for closed settings and to alleviate data quality checks
> before using data for certain tasks, so we can gain insight about quality
> problems and heterogeneities that may lie in the data.

This is all good!

We just need to qualify when using the term "quality" in the context of 
data etc.
> This will especially
> be important, if we intend to build applications upon SemWeb data or use the
> data to make decisions.

Yes, of course.
> We designed them as SPIN query templates, so
> everybody may define their own and, therefore, subjective data quality
> rules.

Yes, I like SPIN a lot!
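
The idea of a reusable rule with user-supplied parameters can be sketched
loosely as follows. This is not SPIN syntax and not one of the
semwebquality.org templates: it is a plain SPARQL string filled in with
Python's string.Template, and the template name and placeholder names are
assumptions made for illustration.

```python
from string import Template

# Hypothetical "lower bound" rule template: the property and the bound
# are left open so each user can state their own, subjective rule.
LOWER_BOUND_RULE = Template(
    "SELECT ?s ?value WHERE { ?s <$property> ?value . "
    "FILTER (?value < $minimum) }"
)

# One user's instantiation of the template.
query = LOWER_BOUND_RULE.substitute(
    property="http://dbpedia.org/ontology/populationTotal",
    minimum=0,
)
```

Someone with a different task could instantiate the same template with
another property or bound without touching the shared definition.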

We just have to live with the "subjective fact" that data, information, 
and knowledge are all inherently subjective :-)

Kingsley
> Cheers,
>
> Christian
>
>> One person's Spam is another person's Ham. Such is the case in the real-
>> world and so it shall remain re. Web of Linked Data. Context is king!
>>
>> IMHO. The beauty of the Web of Linked Data lies in our ability to "agree to
>> disagree" without shedding an ounce of blood. Basically, we arrive at deeper
>> insights via true exploitation of gestalt -- which doesn't require imposition of
>> absolute truth on anyone. Heterogeneity is the spice of life. We are
>> inherently imperfect by design.
>>
>>
>> --
>>
>> Regards,
>>
>> Kingsley Idehen
>> President & CEO
>> OpenLink Software
>> Web: http://www.openlinksw.com
>> Weblog: http://www.openlinksw.com/blog/~kidehen
>> Twitter/Identi.ca: kidehen
>>
>>
>>
>>
>
>
>
>


-- 

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Tuesday, 26 October 2010 21:46:09 UTC