Demo subset available

Hi all,

I've checked in a subset database that uses overlaps between OCW and ArtStor to build a set of records that should overlap richly. Overlaps were selected automatically by a full-text search of the ArtStor database, using the OCW list of contributors as the list of words to look for. This selects about 2400 ArtStor records from the full corpus.
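
To make the selection step concrete, here is a minimal Python sketch of the keyword counting. It is not the actual tool (that one is written in Perl); the function names, the tokenizing rules, and the rule that keys containing non-letters are skipped are all assumptions made for illustration.

```python
import re
from collections import Counter

def name_to_keywords(name):
    """Split a contributor name into upper-case keys (hypothetical helper)."""
    return [tok.upper() for tok in re.split(r"[,\s]+", name) if tok]

def count_keywords(lines, keywords):
    """Count how often each searchable key occurs as a word in the lines.

    Assumption: keys that are not purely alphabetic (e.g. '1947-')
    are skipped rather than searched for.
    """
    searchable = [k for k in keywords if k.isalpha()]
    wanted = set(searchable)
    counts = Counter()
    for line in lines:
        # Compare case-insensitively by upper-casing each line.
        for word in re.findall(r"[A-Z]+", line.upper()):
            if word in wanted:
                counts[word] += 1
    return {k: counts[k] for k in searchable}

if __name__ == "__main__":
    keys = name_to_keywords("Anderson, Laurie,1947-")
    sample = [
        "<Subject>Anderson, Laurie,1947-</Subject>",
        "<Creator>Anderson, John</Creator>",
    ]
    for key, n in count_keywords(sample, keys).items():
        print("Keys{%s} = %d" % (key, n))
```

The two sample lines are made up for the demonstration and are not taken from the corpus.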

The subset database can be found in CVS under simile/corpus/demo, together with the three most important open courseware courses for overlaps of this type. The file simile/corpus/demo/found.txt lists the output of my full-text search tool, which shows why specific records were selected for inclusion in the subset.

Consider the following search, for example:

Keys{ANDERSON} = 79
Keys{LAURIE} = 4
Keys{1947-} skipped
153614651:153616684:2401197:      <Subject>Anderson, Laurie,1947-</Subject>
153614651:153616865:2401202:      <Type>Anderson, Laurie,1947-</Type>

In this output, the keyword ANDERSON was found 79 times in the ArtStor corpus and LAURIE was found 4 times; the key 1947- was skipped. Of those occurrences, there were two matches where the keywords were found within several lines of one another. The three numbers at the start of each matching line are the file location of the start of the image record, the file location of the line displayed, and the line number of the line displayed.
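
For anyone who wants to post-process found.txt, the matching lines can be pulled apart roughly as follows. This is only a sketch: the Hit type, its field names, and the parse_hit and group_by_record functions are hypothetical, and it assumes the three numbers are always separated by colons, as in the example.

```python
from collections import defaultdict
from typing import NamedTuple

class Hit(NamedTuple):
    record_start: int  # file location of the start of the image record
    line_start: int    # file location of the line displayed
    line_number: int   # line number of the line displayed
    text: str          # the matching line itself, whitespace trimmed

def parse_hit(line):
    """Parse one 'record:line:lineno:text' line of search output."""
    # Split at most three times so colons inside the text are preserved.
    record_start, line_start, line_number, text = line.split(":", 3)
    return Hit(int(record_start), int(line_start), int(line_number), text.strip())

def group_by_record(lines):
    """Group parsed hits by the record they belong to."""
    groups = defaultdict(list)
    for line in lines:
        hit = parse_hit(line)
        groups[hit.record_start].append(hit)
    return dict(groups)

if __name__ == "__main__":
    output = [
        "153614651:153616684:2401197:      <Subject>Anderson, Laurie,1947-</Subject>",
        "153614651:153616865:2401202:      <Type>Anderson, Laurie,1947-</Type>",
    ]
    for start, hits in group_by_record(output).items():
        print(start, [h.line_number for h in hits])
```

Note that both lines in the example carry the same first number, 153614651, so grouping on it collects the lines that belong to one image record.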

The simile/corpus/demo/artstor.xml file includes each of the full records identified as containing all the keywords. The tool I used to perform the data reduction is written in Perl and can be found in simile/tools/subset.

Cheers,
-kls

Received on Thursday, 15 January 2004 17:34:27 UTC