- From: <paul@ontology2.com>
- Date: Thu, 15 Nov 2012 13:59:39 -0500
- To: <zaveri@informatik.uni-leipzig.de>, "dbpedia-discussion" <dbpedia-discussion@lists.sourceforge.net>, <public-lod@w3.org>
- Cc: <dbpedia-data-quality@googlegroups.com>
I'd be pretty skeptical that the error rate for unpaid evaluators would be lower than the error rate in the data itself. Are you making it clear to people what the standard of performance is? Are we supposed to check statements against a human reading of Wikipedia, or actually verify the facts?

When I see data quality problems in Freebase or DBpedia, they often involve global properties that aren't detectable at the level of individual nodes. For instance, there are the two great trees of living things and geographical containment. These often have obscure breakages at high-level nodes that will break any algorithm which assumes they really are trees. Things generally turn out to be sketchy at certain high-level nodes, where some taxonomists introduce levels of classification that others don't; and don't get me started on those anglophone islands on the other side of the English Channel. In cases like that you can't count on getting accurate answers from average people, and your odds aren't much better if you ask an expert.

Certainly there is a lot of noise in the category assignments in Wikipedia. It might be reasonable to expect people to flag incorrect category assignments, but without some global view, finding the ones that are missing (maybe 40% of them in some cases) is too much to ask.
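To make the point concrete, here is a minimal sketch of the kind of global check I have in mind, assuming you have already extracted a containment relation as (child, parent) pairs; the sample edges below are invented for illustration and are not real DBpedia data:

    # Sketch: check whether a "contained in" relation is actually a tree.
    # Two symptoms of breakage: nodes with more than one parent, and cycles.
    from collections import defaultdict

    # Hypothetical (child, parent) pairs standing in for an extracted relation.
    edges = [
        ("Leipzig", "Saxony"),
        ("Saxony", "Germany"),
        ("Germany", "Europe"),
        ("Jersey", "Channel_Islands"),
        ("Jersey", "Normandy"),   # second parent: no longer a tree
    ]

    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    # 1. A tree allows at most one parent per node.
    for node, ps in parents.items():
        if len(ps) > 1:
            print(f"{node} has multiple parents: {sorted(ps)}")

    # 2. A tree has no cycles; standard DFS with white/gray/black coloring.
    def find_cycle():
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)

        def visit(node):
            color[node] = GRAY
            for parent in parents.get(node, ()):
                if color[parent] == GRAY:
                    return True            # back edge: containment loops on itself
                if color[parent] == WHITE and visit(parent):
                    return True
            color[node] = BLACK
            return False

        return any(color[n] == WHITE and visit(n) for n in list(parents))

    print("cycle detected" if find_cycle() else "no cycles")

The point is that a check like this needs the whole graph in view; no amount of eyeballing individual nodes will surface it.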
Received on Thursday, 15 November 2012 18:59:38 UTC