- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Wed, 19 Nov 2003 08:35:45 -0800
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: www-rdf-dspace@w3.org
Butler, Mark wrote:
>Hi Nick
>
>
>
>>I took a look at the Flamingo datasets. The imdb "dataset" is just a
>>flat file of about 54,000 text names, one per line. There appear to be
>>duplicate entries in the file, but this is of basically no use because
>>the duplicates aren't labelled. How can you tell how well you are doing
>>if you don't have a gold standard to match against?
>>
>>
>
>The data here may be deliberately dirty as Flamingo was trying to
>investigate data cleaning.
>
>This data was taken from the Internet Movie Database. Another project,
>called Niagara, has made that information available in a better form in XML -
>see
>
>Actors
>http://www.cs.wisc.edu/niagara/data/xml-actors/
>
>Movies
>http://www.cs.wisc.edu/niagara/data/xml-movies/
>
>These datasets aren't huge, but they are a starting point?
>
>Obviously it's outside the scope of SIMILE, and I'm wary of suggesting more work
>as Eric's very busy, but I think if he could get the IMDB folks to release
>part of their data, as we've got Artstor to do, that would be an interesting
>dataset to build sample RDF applications around.
>
>
>
IMDB sells its metadata: http://www.imdb.com/Licensing/
Another large source of metadata is the CDDB repository
(http://www.gracenote.com) of CD track titles and artist names, which is
partially available for free in the forked FreeDB project
(http://www.freedb.org).
Cheers,
-kls
--
========================================================
Kevin Smathers kevin.smathers@hp.com
Hewlett-Packard kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd. 650-857-4477 work
M/S 1135 650-852-8186 fax
Palo Alto, CA 94304 510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Wednesday, 19 November 2003 11:38:25 UTC