Re: DBpedia Data Quality Evaluation Campaign

Interesting approach.  

The classification of errors seems like a reasonably complicated task for crowdsourcing. Have you thought about providing a few gold-standard examples to help train new workers, i.e. ones where you already know the answer and can tell them whether they got it right or wrong?
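
To make that concrete, here is a minimal sketch (hypothetical Python, not based on your actual tool) of mixing a handful of known-answer "gold" triples into a worker's queue and scoring the worker against them:

    # Hypothetical sketch: mix known-answer "gold" triples into each
    # worker's queue and measure their accuracy on those items.
    import random

    # triple id -> verdict we already know to be right
    GOLD = {
        "gold-triple-001": "incorrectly extracted",
        "gold-triple-002": "correctly extracted",
    }

    def build_queue(real_triple_ids, gold_ratio=0.1):
        """Shuffle roughly 10% gold items in with the real tasks."""
        n_gold = min(len(GOLD), max(1, int(len(real_triple_ids) * gold_ratio)))
        queue = list(real_triple_ids) + random.sample(sorted(GOLD), n_gold)
        random.shuffle(queue)
        return queue

    def gold_accuracy(worker_answers):
        """worker_answers: {triple id -> verdict}. Accuracy on gold items only."""
        judged = [t for t in worker_answers if t in GOLD]
        if not judged:
            return None
        return sum(worker_answers[t] == GOLD[t] for t in judged) / len(judged)

Workers whose accuracy on the gold items stays low could then be shown the correct answers as feedback, or have their judgements down-weighted.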

You might also get better results if you present the worker with a side-by-side comparison of the original page and the extraction. I think this would help in two ways: (1) it makes clear that the task is about extraction accuracy, i.e. they are comparing one with the other, and (2) it helps in cases such as truncated text (which appears to be one of your error types), since we only know whether the text was truncated during extraction if we can see the source (it might simply have been that way in the source).

(I believe you can even get provenance information about the location in the page a triple was extracted from, so you could line it up for them! [1])

I might also suggest that you split each task down to a single triple at a time, so that each task is smaller and easier. I'm not sure (though I may be wrong) that there is any benefit in showing the whole page of extracted triples in one go.
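
As a rough illustration (hypothetical Python again, assuming you already have the extracted property/value pairs per resource), the split could be as simple as:

    # Hypothetical sketch: turn one "evaluate this whole resource" task
    # into many independent "evaluate this single triple" micro-tasks.
    def split_into_microtasks(resource_uri, triples):
        """triples: list of (property, value) pairs extracted for the resource."""
        return [
            {
                "resource": resource_uri,
                "property": prop,
                "value": value,
                "question": "Was this value extracted correctly from the Wikipedia page?",
            }
            for prop, value in triples
        ]

    # e.g. split_into_microtasks("dbpedia:La_Chapelle-Saint-Laud",
    #          [("dbp-owl:area", "1.063e+07"), ("dbp-owl:country", "dbpedia:France")])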

Interested to see how this works out!

Best,
Matthew

[1] http://wiki.dbpedia.org/Datasets#h18-18
---
Matthew Gamble
PhD Candidate
School of Computer Science
University of Manchester
gamble@cs.manchester.ac.uk

On 15 Nov 2012, at 18:19, Sören Auer wrote:

> On 15.11.2012 at 19:12, Giovanni Tummarello wrote:
>> Am I really supposed to know if any of the facts below are wrong?
>> Really?
> 
> It's not about factual correctness, but about correct extraction and
> representation. If Wikipedia contains false information, DBpedia will
> too, so we cannot change this (at that point). What we want to improve,
> however, is the quality of the extraction.
> 
> Best,
> 
> Sören
> 
>> dbp-owl:PopulatedPlace/area
>> "10.63" (@type = http://dbpedia.org/datatype/squareKilometre)
>> dbp-owl:abstract
>> "La Chapelle-Saint-Laud is a commune in the Maine-et-Loire department
>> of western France." (@lang = en)
>> dbp-owl:area
>> "1.063e+07" (@type = http://www.w3.org/2001/XMLSchema#double)
>> dbp-owl:canton
>> dbpedia:Canton_of_Seiches-sur-le-Loir
>> dbp-owl:country
>> dbpedia:France
>> dbp-owl:department
>> dbpedia:Maine-et-Loire
>> dbp-owl:elevation
>> "85.0" (@type = http://www.w3.org/2001/XMLSchema#double)
>> dbp-owl:intercommunality
>> dbpedia:Pays_Loire-Angers
>> dbp-owl:intercommunality
>> dbpedia:Communauté_de_communes_du_Loir
>> dbp-owl:maximumElevation
>> "98.0" (@type = http://www.w3.org/2001/XMLSchema#double)
>> dbp-owl:minimumElevation
>> "28.0" (@type = http://www.w3.org/2001/XMLSchema#double)
>> dbp-owl:populationTotal
>> "583" (@type = http://www.w3.org/2001/XMLSchema#integer)
>> dbp-owl:postalCode
>> "49140" (@lang = en)
>> dbp-owl:region
>> dbpedia:Pays_de_la_Loire
>> dbp-prop:areaKm
>> "11" (@type = http://www.w3.org/2001/XMLSchema#integer)
>> dbp-prop:arrondissement
>> "Angers" (@lang = en)
>> dbp-prop:canton
>> dbpedia:Canton_of_Seiches-sur-le-Loir
>> dbp-prop:demonym
>> "Capellaudain, Capellaudaine" (@lang = en)
>> dbp-prop:department
>> dbpedia:Maine-et-Loire
>> dbp-prop:elevationM
>> "85" (@type = http://www.w3.org/2001/XMLSchema#integer)
>> dbp-prop:elevationMaxM
>> "98" (@type = http://www.w3.org/2001/XMLSchema#integer)
>> dbp-prop:elevationMinM
>> "28" (@type = http://www.w3.org/2001/XMLSchema#integer)
>> dbp-prop:insee
>> "49076" (@type = http://www.w3.org/2001/XMLSchema#integer)
>> dbp-prop:intercommunality
>> dbpedia:Pays_Loire-Angers
>> dbp-prop:intercommunality
>> dbpedia:Communauté_de_communes_du_Loir
>> 
>> On Thu, Nov 15, 2012 at 4:58 PM,  <zaveri@informatik.uni-leipzig.de> wrote:
>>> Dear all,
>>> 
>>> As we all know, DBpedia is an important dataset in Linked Data: it is not
>>> only connected to and from numerous other datasets, but it is also relied
>>> upon for useful information. However, since it contains information
>>> extracted from crowd-sourced content, quality problems are inherent in
>>> DBpedia, be it in terms of incorrectly extracted values or datatype
>>> problems.
>>> 
>>> Not all of these data quality problems are automatically detectable,
>>> so we aim to crowd-source the quality assessment of the dataset. To
>>> perform this assessment, we have developed a tool with which a user can
>>> evaluate a random resource by analyzing each triple individually and
>>> store the results. We would therefore like to ask you to help us by
>>> using the tool and evaluating a minimum of 3 resources. Here is the
>>> link to the tool: http://nl.dbpedia.org:8080/TripleCheckMate/, which
>>> also includes details on how to use it.
>>> 
>>> To thank you for your contributions, one lucky winner will receive
>>> either a Samsung Galaxy Tab 2 or an Amazon voucher worth 300 Euro. So
>>> go ahead and start evaluating now! The deadline for submitting your
>>> evaluations is 9 December 2012.
>>> 
>>> If you have any questions or comments, please do not hesitate to contact us
>>> at dbpedia-data-quality@googlegroups.com.
>>> 
>>> Thank you very much for your time.
>>> 
>>> Regards,
>>> DBpedia Data Quality Evaluation Team.
>>> https://groups.google.com/d/forum/dbpedia-data-quality
>>> 
