- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Wed, 19 Nov 2003 08:35:45 -0800
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: www-rdf-dspace@w3.org
Butler, Mark wrote:

>Hi Nick
>
>>I took a look at the Flamingo datasets. The imdb "dataset" is just a flat
>>file of about 54,000 text names, one per line. There appear to be
>>duplicate entries in the file, but this is of basically no use because the
>>duplicates aren't labelled. How can you tell how well you are doing if
>>you don't have a gold standard to match against?
>
>The data here may be deliberately dirty as Flamingo was trying to
>investigate data cleaning.
>
>This data was taken from the Internet Movie Database. Another project,
>called Niagara, has made that information available in a better form in XML -
>see
>
>Actors
>http://www.cs.wisc.edu/niagara/data/xml-actors/
>
>Movies
>http://www.cs.wisc.edu/niagara/data/xml-movies/
>
>These datasets aren't huge, but they are a starting point?
>
>Obviously it's outside the scope of SIMILE, and I'm wary of suggesting more work
>as Eric's very busy, but I think if he could get the IMDB folks to release
>part of their data, as we've got Artstor to do, that would be an interesting
>dataset to build sample RDF applications around.

IMDB sells its metadata: http://www.imdb.com/Licensing/

Another large source of metadata is the CDDB repository
(http://www.gracenote.com) of CD track titles and artist names, which is
partially available for free in the forked FreeDB project
(http://www.freedb.org).

Cheers,
-kls

--
========================================================
Kevin Smathers                kevin.smathers@hp.com
Hewlett-Packard               kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd.            650-857-4477 work
M/S 1135                      650-852-8186 fax
Palo Alto, CA 94304           510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Wednesday, 19 November 2003 11:38:25 UTC