RE: data integration, data sets, flamingo

On Wed, 19 Nov 2003, Butler, Mark wrote:

>> There appear to be duplicate entries in the file, but this is of
>> basically no use because the duplicates aren't labelled.

> The data here may be deliberately dirty as Flamingo was trying to
> investigate data cleaning.

I don't think you understood my point.  In order to do experiments in data
cleaning, you necessarily need a 'dirty' dataset.  So, if you are
interested in detecting duplicates, then you need a dataset that has
duplicates in it.  However, in order to calculate performance measures,
your dataset also has to be labelled --- that is, someone has to go
through by hand and annotate the errors that you would want a
data-cleaning algorithm to find so you can see what percentage of the
errors your algorithm actually finds.
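
To make the point concrete, here is a rough sketch in Python of how the
labels get used.  The record ids, the entity labels and the function names
are all made up for illustration; they do not come from Flamingo or from
any particular dataset.  It assumes the hand annotation assigns each record
an entity label, so that two records sharing a label are true duplicates.

```python
from itertools import combinations


def labelled_pairs(labels):
    """Turn {record_id: entity_label} into the set of true duplicate pairs."""
    by_entity = {}
    for record_id, entity in labels.items():
        by_entity.setdefault(entity, []).append(record_id)
    pairs = set()
    for ids in by_entity.values():
        # Every pair of records with the same label is a true duplicate.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def score(predicted, truth):
    """Return (precision, recall) for a collection of predicted pairs."""
    # Normalise pair order so (2, 1) and (1, 2) compare equal.
    predicted = {tuple(sorted(p)) for p in predicted}
    found = predicted & truth
    precision = len(found) / len(predicted) if predicted else 0.0
    recall = len(found) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical hand annotation: records 1, 2 and 5 are one entity,
# records 4 and 6 are another, record 3 has no duplicate.
labels = {1: "a", 2: "a", 3: "b", 4: "c", 5: "a", 6: "c"}
truth = labelled_pairs(labels)

# Hypothetical output of some duplicate detector.
predicted = [(2, 1), (4, 6), (3, 4)]

precision, recall = score(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.67 recall=0.50"
```

Recall is the "percentage of the errors your algorithm actually finds".
Without the labels there is no `truth` set, and neither number can be
computed.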

Labelling is a difficult business.  Misha Bilenko at U Texas has some
labelled datasets available on the web for download.  I downloaded one and
counted a different number of duplicates than he had reported in a paper.
When I asked him about it he said he had found additional duplicates since
the paper was published.  Because labelling is this hard to get right,
unlabelled datasets are not nearly as attractive as labelled ones.  For
reference, Bilenko's datasets are available at:
http://www.cs.utexas.edu/users/ml/riddle/

Nick

Received on Wednesday, 19 November 2003 12:09:21 UTC