Re: DBpedia Data Quality Evaluation Campaign

Dear Matthew,

Thanks for your suggestions. 

On 15 Nov 2012, at 19:59, Matthew Gamble wrote:

> Interesting approach.  
> 
> The classification of errors seems like a reasonably complicated task for crowd sourcing. Have you thought about a few gold standard examples to help train new workers?  i.e. ones where you know the answer and can tell them if they got it right or wrong.

To help users understand the errors, we have provided an example and a description for each error type. 
> 
> You might also get better results if you present the worker with a side by side comparison of the original page and the extraction. I think this would help in two ways (1) make clear the point that it is about extraction accuracy - i.e. they are comparing one with another and, (2) help in cases such as truncated text (which appears to be one of your error types) - we only know if it was truncated during extraction if we can see the source (it might just have been that way in the source). 

We have provided a link to the corresponding Wikipedia page, so the user can look at the original data and compare it with the data extracted into DBpedia. 
> 
> (I believe you can even get provenance information about the location in a page a triple was extracted from so you could even line it up for them! [1]). 

Yes, that's a good idea. However, the current version of the tool only displays what a user would see while browsing DBpedia.
> 
> I might also suggest that you split each task down to a single triple at a time so that each task is smaller/easier - I'm not sure(though I may be wrong) that there is any benefit from showing the whole page of extracted triples in one go.

Each triple is in fact separated out, so a user can individually select and flag which triples have a data quality problem. Also, it is important to know which resource a triple belongs to in order to evaluate its quality. 

> 
> Interested to see how this works out!
> 
> Best,
> Matthew
> 
> [1] http://wiki.dbpedia.org/Datasets#h18-18
> ---
> Matthew Gamble
> PhD Candidate
> School of Computer Science
> University of Manchester
> gamble@cs.manchester.ac.uk
> 
Thanks.
Regards,
Ms. Amrapali Zaveri Gokhale

University of Leipzig - Department of Computer Science
Paulinum 618, Augustusplatz 10, 04109 Leipzig, Germany
http://aksw.org/AmrapaliZaveri

Received on Friday, 16 November 2012 09:52:54 UTC