Re: data quality

From: Polleres, Axel <axel.polleres@deri.org>
Date: Mon, 19 Apr 2010 11:09:50 +0100
Message-ID: <316ADBDBFE4F4D4AA4FEEF7496ECAEF9035DF69D@EVS1.ac.nuigalway.ie>
To: <paoladimaio10@googlemail.com>, <adam.saltiel@gmail.com>, <uk-government-data-developers@googlegroups.com>
Cc: <semantic-web@w3.org>
Paola, 

You may want to check: 
http://www.pedantic-web.org/

on our efforts to improve data quality.

We also have a paper on findings so far at LDOW [1].

Cheers,
Axel

[1] Aidan Hogan, Andreas Harth, Alexandre Passant, Stefan Decker, and Axel Polleres. Weaving the Pedantic Web. In 3rd International Workshop on Linked Data on the Web (LDOW2010) at WWW2010, Raleigh, USA, April 2010.

________________________________

From: semantic-web-request@w3.org 
To: adasal ; uk-government-data-developers@googlegroups.com 
Cc: Semantic Web 
Sent: Mon Apr 19 10:51:14 2010
Subject: data quality 



Something else I wanted to add but forgot, as it was a late post:


One of the issues coming up in relation to the discussion below is the quality of data
(which came up on the gov data list a while back, hence the cc).

A question then is: why (in some cases) is the data 'not fit for purpose'?

Again, several possible hypotheses may need to be tested in each case.

Is the data inconsistent because the real world is inconsistent? (The world seems to hang together even when it does not make sense to us, while data models don't.) In that case maybe there is not much that we can do, other than to keep attempting to create plausible
models of the world.

Is the data of any use before it is opened and RDFized? Or does something happen in the RDFization process?


Let's not forget that to obtain meaningful output from databases, a lot of work needs to go into them. I am thinking of schema normalisation,
but also data cleaning, which accounts for the majority of the effort in data mining.

I don't think the fact that data is expressed in RDF automatically makes it good.

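That point can be sketched in a few lines of plain Python (no RDF library; the URIs and field names are invented for illustration): a naive 'RDFization' that turns rows into triples carries every inconsistency of the source straight through.

```python
# A naive 'RDFization' sketch: rows become (subject, predicate, object)
# triples, but dirty source values survive the conversion untouched.
# All URIs and field names here are hypothetical, for illustration only.

rows = [
    {"id": "item1", "planted": "2010-04-18"},   # ISO date
    {"id": "item2", "planted": "18/04/2010"},   # same date, other format
    {"id": "item3", "planted": "last spring"},  # free text
]

def rdfize(rows):
    """Emit one triple per non-id cell of each row."""
    triples = []
    for row in rows:
        subject = "http://example.org/%s" % row["id"]
        for key, value in row.items():
            if key != "id":
                triples.append((subject, "http://example.org/%s" % key, value))
    return triples

for triple in rdfize(rows):
    print(triple)
```

The output is perfectly valid triples, with three mutually incompatible representations of a planting date: the cleaning work still has to happen somewhere.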

Again, a good dig through a significant set of examples of when 'data is not fit for purpose' could yield some clues as to what kind of work needs to be done.

So when something doesn't work, I would be inclined not just to throw it away, but to study it systematically.


After all, most of what we know in medicine has come from dissecting corpses.


PDM


On Sun, Apr 18, 2010 at 11:16 PM, Paola Di Maio <paola.dimaio@gmail.com> wrote:


	In this thread, and the parallel ones, I see different problem spaces.
	It's a complex issue that should be broken down:
	

	one is query composition
	another is the availability of data
	and then the ease of use / utility of the tools
	(probably more)

	Then there are some conflicts: on the one hand the W3C produces standards (RDF, OWL); on the other hand,
	the tools and platforms implement them without necessarily making the user tasks intuitive enough.

	Having had random conversations with platform developers, it looks like they want to monetize their work, and are not in a hurry
	to achieve any results until their finances are secured.

	Perhaps the W3C could act more as 'the customer', and promote the adoption
	of usability standards alongside technical ones (an argument that I occasionally try to make).

	On the query composition front, what about a tool that would facilitate the generation of RDF data
	when it's not available?

	Assuming Danny eventually works out the optimal query, for example to include specific data in relation to his side of
	the valley, where humidity, wind, sun exposure, soil composition and other local properties make up a microclimate,
	shouldn't there be a (any) place where this data could be entered so that the query can be performed?

	(I still think for some answers the web may still not be the best place, but let's think hypothetically.)

	The nearest thing I have seen to the SW was a demo of Semantic MediaWiki (Denny Vrandečić):
	one could enter new data on the fly, and the table would update. I thought, that's cool, I could probably work with this.
	(I may have to go through the examples a few times, but it looked doable to me when I saw the demo.)

	Surely query manipulation can be made foolproof; after all, forms were invented for that purpose, if I remember (an interface that would allow adding more fields to the query).
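	The form idea can be sketched quickly. Below is a minimal, hypothetical Python helper (the ex: vocabulary and the field names are invented for illustration) that turns form-style field/value pairs into a SPARQL query; adding a field to the form simply adds one more triple pattern.

```python
# Minimal sketch: build a SPARQL query from form-style field/value pairs.
# The ex: vocabulary and the field names are hypothetical, for illustration.

def build_query(fields):
    """Each form field becomes one triple pattern in the WHERE clause."""
    patterns = ['  ?s ex:%s "%s" .' % (name, value)
                for name, value in fields.items()]
    return ("PREFIX ex: <http://example.org/vocab#>\n"
            "SELECT ?s WHERE {\n" + "\n".join(patterns) + "\n}")

# Adding a field to the form adds a pattern to the query:
print(build_query({"crop": "tomato", "region": "valley-south"}))
```

	A real form-based interface would of course also need to validate values and map field names onto an actual vocabulary, but the core mapping is this simple.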

	Last I heard of Semantic MediaWiki, 'there were issues'.

	I would have thought that's a good place to start. Does anyone know what happened, and whether there are any test implementations or tutorials?
	What can be so wrong with it?

	Once tasks are defined, the data is reliable and good enough, datasets can be added on the fly as needed, the tools are
	straightforward, and the tasks (say, querying and manipulating queries) are made more intuitive, then I am sure it's all about setting up
	good enough pilot studies from different fields of application, each with enough people and community involvement.


	Since everybody is already more or less working on different aspects of the above, I am sure that some magic
	can be done simply with a bit more coordination of the different efforts.


	The cost/benefit issue is also complex,
	depending on what is calculated as cost and what as benefit, as there are different classes of both.

	IMHO, to society at large and to the public pocket, the last ten years of publicly funded research have been a relatively quantifiable cost
	(one could work out some ballpark figure by looking at the SW research expenditure, but I am afraid to do it).
	Among the benefits have been lots of PhDs, salaries, some careers, some new knowledge and innovation,
	but
	some (including myself) argue
	that visible 'public' benefits are not (yet) adequate to the public costs, which remain, IMHO, not fully justified.

	In my analysis this has turned out to be a problem with our research industry (a very generalised statement), where research expenditure
	is often in a grey policy area: it is not clearly enough demarcated what public benefits should derive from research, which is another
	can of worms altogether.

	To an average organisation confronted with the option to invest in SW technologies today, it may just be too early: unquantifiable costs and risks, but also limited business/revenue models, etc.

	(How is giving my website users some SW functionality going to provide my customers with more value?)

	I think I have heard of some benefits being reaped in the non-public domain, but precisely because of that, we don't know for sure
	what happens behind firewalls.

	Assuming some cohesion of purpose can be arranged, and that research can provide a wide enough range of well-defined, real-world
	pilot schemes (where the cost/benefit analysis of each pilot project is clear upfront, and utility metrics are adopted, for example)
	with a sufficiently healthy stakeholder base not too easily alienated, I am sure it would be possible to make at least some sense
	of the work done so far.

	Anything anywhere near the direction above can probably only be achieved by a community, which, it looks like, is trying to pull itself
	together here?

	:-)


	PDM

	On Sun, Apr 18, 2010 at 10:22 PM, adasal <adam.saltiel@gmail.com> wrote:
	

			Agriculture oriented data spaces (ontology and instance data)
			

		How could that ever be automatic?


			Agriculture oriented data spaces (ontology and instance data)
			

		Cannot anticipate every possible query, or even broad area of interest, in DBpedia.
		There must be an impulse to make a query of some sort. The issue is how complex that query must be.
		Isn't the implicit question: why can't some small query be enough to draw out the information I want?
		Here the query terms should be enough to form a coherent query. In this example they should translate into a SPARQL query. But that is not enough, because DBpedia needs a schema and some instance data too. Erm.
		Or perhaps it could be semi-automatic? 
		
		Imagine that there is a repository with sample kinds of data in it. I think this would be easy to use.
		I want to build up a query about tomato seeds, planting, region, time of year. So some general data is classified along those lines. That would be combined into a schema. Maybe some of it would be a subset of other schemas, so as I made my choices, further useful suggestions could be made. I would then be asked to refine the parameters of the query by actual region, etc.
		I am assuming that interested parties would make available basic metadata sets with human-understandable sample data.
		
		Am I making any sort of sensible suggestion here? Is this different from what already exists as available triples? I am unsure. There is something circular here.
		
		Even so we are still left with that data that has not been classified because there is no interested party to do so, or because the type of classification is new, complex or transient.
		
		Adam
		

		On 18 April 2010 21:56, Danny Ayers <danny.ayers@gmail.com> wrote:
		

			Thanks Kingsley
			
			still not automatic though, is it?
			

			On 18 April 2010 22:38, Kingsley Idehen <kidehen@openlinksw.com> wrote:
			> Danny Ayers wrote:
			>>
			>> Kingsley, how do I find out when to plant tomatoes here?
			>>
			>
			> And you find the answer to that in Wikipedia via
			> <http://en.wikipedia.org/wiki/Tomato>? Of course not.
			>
			> Re. DBpedia, if you have a Agriculture oriented data spaces (ontology and
			> instance data) that references DBpedia (via linkbase) then you will have a
			> better chance of an answer since we would have temporal properties and
			> associated values in the Linked Data Space (one that we can mesh with
			> DBpedia even via SPARQL).
			>
			> Kingsley
			>>
			>> On 17 April 2010 19:36, Kingsley Idehen <kidehen@openlinksw.com> wrote:
			>>
			>>>
			>>> Danny Ayers wrote:
			>>>
			>>>>
			>>>> On 16 April 2010 19:29, greg masley <roxymuzick@yahoo.com> wrote:
			>>>>
			>>>>
			>>>>>
			>>>>> What I want to know is does anybody have a method yet to successfully
			>>>>> extract data from Wikipedia using dbpedia? If so please email the
			>>>>> procedure
			>>>>> to greg@masleyassociates.com
			>>>>>
			>>>>>
			>>>>
			>>>> That is an easy one, the URIs are similar - you can get the pointer
			>>>> from db and get into wikipedia. Then you do your stuff.
			>>>>
			>>>> I'll let Kingsley explain.
			>>>>
			>>>>
			>>>>
			>>>
			>>> Greg,
			>>>
			>>> Please add some clarity to your quest.
			>>>
			>>> DBpedia the project is comprised of:
			>>>
			>>> 1. Extractors for converting Wikipedia content into Structured Data
			>>> represented in a variety of RDF based data representation formats
			>>> 2. Live instance with the extracts from #1 loaded into a DBMS that
			>>> exposes a
			>>> SPARQL endpoint (which lets you query over the wire using SPARQL query
			>>> language).
			>>>
			>>> There is a little more, but I need additional clarification from you.
			>>>
			>>>
			>>> --
			>>>
			>>> Regards,
			>>>
			>>> Kingsley Idehen       President & CEO OpenLink Software     Web:
			>>> http://www.openlinksw.com

			>>> Weblog: http://www.openlinksw.com/blog/~kidehen
			>>> Twitter/Identi.ca: kidehen
			>>>
			>>>
			>>>
			>>>
			>>>
			>>>
			>>
			>>
			>>
			>>
			>
			>
			> --
			>
			> Regards,
			>
			> Kingsley Idehen       President & CEO OpenLink Software     Web:
			> http://www.openlinksw.com

			> Weblog: http://www.openlinksw.com/blog/~kidehen
			> Twitter/Identi.ca: kidehen
			>
			>
			>
			>
			>
			
			
			
			
			--
			http://danny.ayers.name

			
			





	-- 
	Paola Di Maio
	**************************************************
	“Logic will get you from A to B. Imagination will take you everywhere.”
	Albert Einstein
	**************************************************
	
	




-- 
Paola Di Maio
**************************************************
“Logic will get you from A to B. Imagination will take you everywhere.”
Albert Einstein
**************************************************


Received on Monday, 19 April 2010 10:10:42 UTC
