Overlaps in the Artstor and IMS corpora

Hi team,

Kevin and I have been looking for overlaps between the corpora - more
details on how we did this below. I'm circulating this to the list because
these overlaps should help inform our discussion of the demo script:

Advertising
Auctions
Aurora Borealis
Blacksmith Shop
blue skies
Buddhist Priest
Calendar
Economics
electric motors
Engraving
Exercise
Japanese Women
Language
Lightning
Lithograph
Locomotive
Moon
Mother and Child
Musical Instruments
Optics
Painting
Philosophy
Photo
Planets
Prince of Idzu
Projects
Radio telescopes
Rainbows
searching
Shakespeare
Smithsonian Institution
Speech
Stars
Syllabus
Table
Telescopes
Tools
Woman
Women
Woodblock print

Francisco Goya
Millard Fillmore
Francisco Franco
Matthew Calbraith Perry
Keiga Kawahara
Mathew Brady

We did this in two ways - inspection by eye, and some text processing on the
corpora, as follows:

1. extracted the elements from the XML using a Perl script
2. passed the data through sort and uniq (turning case sensitivity off here
was important)
3. used sed to remove "technical data", e.g. id numbers, file paths etc.
4. repeated step 2
5. concatenated the IMS and Artstor files, then ran sort and uniq again,
this time with the -d option so that uniq prints only duplicates.
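
For anyone who wants to reproduce steps 2 to 5 without the shell tools, here
is a rough Python sketch of the same idea. It is not the script we ran: it
assumes the terms have already been extracted from the XML (step 1) into one
term per line, and the TECHNICAL patterns are hypothetical stand-ins for the
sed expressions, which are not given here. The names unique_terms and
overlaps are made up for illustration.

```python
import re

# Hypothetical patterns for "technical data"; the real sed expressions
# used in step 3 are not reproduced here.
TECHNICAL = [
    re.compile(r"^\d+$"),             # bare id numbers
    re.compile(r"^(/|[A-Za-z]:\\)"),  # strings that look like file paths
]


def unique_terms(lines):
    """Approximate steps 2-4: drop technical data and keep one entry
    per term, ignoring case (first spelling seen wins)."""
    seen = {}
    for line in lines:
        term = line.strip()
        if not term or any(p.search(term) for p in TECHNICAL):
            continue
        seen.setdefault(term.casefold(), term)
    return seen


def overlaps(corpus_a, corpus_b):
    """Approximate step 5: terms present in both corpora, ignoring case.
    Spellings are taken from corpus_a."""
    a = unique_terms(corpus_a)
    b = unique_terms(corpus_b)
    return sorted((a[k] for k in a.keys() & b.keys()), key=str.casefold)


if __name__ == "__main__":
    # Made-up sample data, not taken from either corpus.
    artstor = ["Moon", "moon", "12345", "Woodblock print", "/data/img/001.jpg"]
    ims = ["MOON", "Syllabus", "woodblock print", "12345"]
    print(overlaps(artstor, ims))
```

Note that the id "12345" appears in both sample lists but is not reported,
which is the reason for removing technical data before comparing.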

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Tuesday, 28 October 2003 14:45:42 UTC