- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Tue, 28 Oct 2003 19:44:30 -0000
- To: SIMILE public list <www-rdf-dspace@w3.org>
Hi team,

Kevin and I have been working on finding overlaps between the corpora - more details on how we did this below. I'm circulating this to the list because these overlaps will help inform our discussion on the demo script:

Advertising
Auctions
Aurora Borealis
Blacksmith Shop
blue skies
Buddhist Priest
Calendar
Economics
electric motors
Engraving
Exercise
Japanese Women
Language
Lightning
Lithograph
Locomotive
Moon
Mother and Child
Musical Instruments
Optics
Painting
Philosophy
Photo
Planets
Prince of Idzu
Projects
Radio telescopes
Rainbows
searching
Shakespeare
Smithsonian Institution
Speech
Stars
Syllabus
Table
Telescopes
Tools
Woman
Women
Woodblock print

Francisco Goya
Millard Fillmore
Francisco Franco
Matthew Calbraith Perry
Keiga Kawahara
Mathew Brady

We did this in two ways - inspection by eye, and some text processing on the corpora, e.g.

1. extracted the elements from the XML using a Perl script
2. passed the data through sort and uniq (turning case sensitivity off here was important)
3. used sed to remove "technical data" e.g. id numbers, file paths etc.
4. repeated step 2
5. concatenated the IMS and Artstor files together, then performed sort and uniq again but with the -d option so uniq only prints duplicates.

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
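
For anyone who wants to reproduce the text-processing steps without the shell tools, here is a minimal Python sketch of steps 2 to 5. It is not the Perl/sed scripts we actually ran: the function names, the `TECHNICAL` pattern and the sample term lists are all assumptions made up for illustration, and it presumes the element values have already been extracted from the XML (step 1).

```python
import re

# Hypothetical stand-in for the sed step: treat purely numeric values
# (id numbers) and anything containing a path separator as "technical data".
TECHNICAL = re.compile(r"^(\d+|.*[/\\].*)$")


def strip_technical(terms):
    """Drop values that look like id numbers or file paths."""
    return [t for t in terms if not TECHNICAL.match(t.strip())]


def normalise(terms):
    """Case-insensitive sort + uniq: keep one spelling per distinct term."""
    seen = {}
    for term in terms:
        cleaned = " ".join(term.split())  # collapse stray whitespace
        if cleaned:
            # First spelling encountered wins; the key ignores case.
            seen.setdefault(cleaned.casefold(), cleaned)
    return seen


def overlaps(corpus_a, corpus_b):
    """Terms present in both corpora, ignoring case (the `uniq -d` step)."""
    a = normalise(strip_technical(corpus_a))
    b = normalise(strip_technical(corpus_b))
    shared = a.keys() & b.keys()
    return sorted((a[key] for key in shared), key=str.casefold)


# Made-up sample values, only to show the shape of the input.
ims = ["Telescopes", "telescopes", "Optics", "12345", "/data/ims/file.xml", "Syllabus"]
artstor = ["optics", "Woodblock print", "TELESCOPES", "98765"]

print(overlaps(ims, artstor))
```

Concatenating the two files and running uniq -d only works because each file was de-duplicated first, so any term that still appears twice must come from both corpora. The set intersection in `overlaps` relies on the same property.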
Received on Tuesday, 28 October 2003 14:45:42 UTC