- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Wed, 19 Nov 2003 11:35:28 -0000
- To: www-rdf-dspace@w3.org
Hi Nick

> I took a look at the Flamingo datasets. The imdb "dataset" is just a flat
> file of about 54,000 text names, one per line. There appear to be
> duplicate entries in the file, but this is of basically no use because the
> duplicates aren't labelled. How can you tell how well you are doing if
> you don't have a gold standard to match against?

The data here may be deliberately dirty, as Flamingo was trying to investigate data cleaning. This data was taken from the Internet Movie Database. Another project, called Niagara, has made that information available in a better form in XML. See:

Actors http://www.cs.wisc.edu/niagara/data/xml-actors/
Movies http://www.cs.wisc.edu/niagara/data/xml-movies/

These datasets aren't huge, but they are a starting point.

Obviously it's outside the scope of SIMILE, and I'm wary of suggesting more work as Eric's very busy, but I think that if he could get the IMDB folks to release part of their data, as we've got Artstor to do, it would be an interesting dataset to build sample RDF applications around.

regards,

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Wednesday, 19 November 2003 07:37:40 UTC