RE: data integration, data sets, flamingo from Butler, Mark on 2003-11-19 (www-rdf-dspace@w3.org from November 2003)

From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
Date: Wed, 19 Nov 2003 11:35:28 -0000
To: www-rdf-dspace@w3.org
Message-ID: <E864E95CB35C1C46B72FEA0626A2E8082062A8@0-mail-br1.hpl.hp.com>

Hi Nick

> I took a look at the Flamingo datasets.  The imdb "dataset" is just a flat
> file of about 54,000 text names, one per line.  There appear to be
> duplicate entries in the file, but this is of basically no use because the
> duplicates aren't labelled.  How can you tell how well you are doing if
> you don't have a gold standard to match against?

The data here may be deliberately dirty as Flamingo was trying to
investigate data cleaning. 

This data was taken from the Internet Movie Database. Another project,
called Niagara, has made that information available in a better form, as XML.
See:

Actors
http://www.cs.wisc.edu/niagara/data/xml-actors/

Movies
http://www.cs.wisc.edu/niagara/data/xml-movies/

These datasets aren't huge, but they are a starting point?

Obviously it's outside the scope of SIMILE, and I'm wary of suggesting more work
as Eric's very busy, but I think if he could get the IMDB folks to release
part of their data, as we've got Artstor to do, that would be an interesting
dataset to build sample RDF applications around. 

regards,

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Wednesday, 19 November 2003 07:37:40 UTC