Re: data integration, data sets, flamingo

Butler, Mark wrote:

>Hi Nick
>
>>I took a look at the Flamingo datasets.  The imdb "dataset" is just a flat
>>file of about 54,000 text names, one per line.  There appear to be
>>duplicate entries in the file, but this is of basically no use because the
>>duplicates aren't labelled.  How can you tell how well you are doing if
>>you don't have a gold standard to match against?
>
>The data here may be deliberately dirty as Flamingo was trying to
>investigate data cleaning. 
>
>This data was taken from the Internet Movie Database. Another project,
>called Niagara, has made that information available in a better form in XML -
>see
>
>Actors
>http://www.cs.wisc.edu/niagara/data/xml-actors/
>
>Movies
>http://www.cs.wisc.edu/niagara/data/xml-movies/
>
>These datasets aren't huge, but they are a starting point?
>
>Obviously it's outside the scope of SIMILE, and I'm wary of suggesting more work
>as Eric's very busy, but I think if he could get the IMDB folks to release
>part of their data, as we've got Artstor to do, that would be an interesting
>dataset to build sample RDF applications around. 
>
IMDB sells its metadata:  http://www.imdb.com/Licensing/

Another large source of metadata is the CDDB repository 
(http://www.gracenote.com) of CD track titles and artist names, which is 
partially available for free in the forked FreeDB project 
(http://www.freedb.org).

Cheers,
-kls

-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Received on Wednesday, 19 November 2003 11:38:25 UTC