Crossover between IMS and ArtStor from Kevin Smathers on 2003-10-16 (www-rdf-dspace@w3.org from October 2003)

From: Kevin Smathers <kevin.smathers@hp.com>
Date: Thu, 16 Oct 2003 09:45:53 -0700
Cc: SIMILE public list <www-rdf-dspace@w3.org>
Message-ID: <3F8ECB41.7090601@hp.com>

In addition to linking author records between IMS and ArtStor, I've also been experimenting with joins between the IMS 'Keyword' fields and the ArtStor 'Subject' and 'Type' fields, with some occasionally useful results.

Examples of overlapping keywords include 'Shakespeare', 'Aurora',
'Language', 'Stars', 'telescopes', 'searching', &c.

Since the controlled vocabularies don't match, the overlaps between the two systems are imperfect, but they do sometimes yield results that I could imagine an educator using to fill in missing materials for a course.

Mark suggested I describe the tools I'm using for these joins, which are the standard Unix text tools: grep, cut, sort, wc, and uniq.

To get a list of the keywords used in IMS, for example, I use:

$ grep Keyword *.xml | cut -d: -f2 | sort | uniq | less
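
A variation on this pipeline, offered only as a sketch: grep's -h option suppresses the "filename:" prefix, so the cut step is not needed and a keyword that itself contains a colon is not truncated, while uniq -c followed by sort -rn ranks the keywords by how often they occur. The sample files and the one-element-per-line <Keyword> layout below are assumptions made for illustration; real IMS records may be laid out differently.

```shell
# Hypothetical sample data; the one-keyword-per-line layout is an assumption.
tmp=$(mktemp -d)
cat > "$tmp/course1.xml" <<'EOF'
<Keyword>Stars</Keyword>
<Keyword>Shakespeare</Keyword>
EOF
cat > "$tmp/course2.xml" <<'EOF'
<Keyword>Stars</Keyword>
<Keyword>Language: English</Keyword>
EOF

# -h drops the "filename:" prefix that grep adds when given several files,
# so a colon inside a keyword survives intact.
# uniq -c counts duplicate lines; sort -rn puts the most frequent first.
ranked=$(grep -h Keyword "$tmp"/*.xml | sort | uniq -c | sort -rn)
printf '%s\n' "$ranked"
```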

To find the same keywords in the ArtStor database I use:

$ grep -i stars UCSD_XML.xml | grep -e "<Subject>" -e "<Type>" | wc -l

The advantage of these tools is that they can handle very large text files very quickly.  Searching through the full ArtStor database with a search like this takes on the order of 5 seconds on my three-year-old Linux server.

The word 'stars' appears in the Subject or Type fields a total of 160 times, with the following set of values:

$ grep -i stars UCSD_XML.xml | grep -e "<Subject>" -e "<Type>" | sort | uniq -c
      1     <Subject>Mary,Blessed Virgin, Saint--Madonna of the Twelve Stars</Subject>
      1     <Subject>seven stars
     26     <Subject>Stars
     52     <Subject>Stars</Subject>
      1     <Type>Mary,Blessed Virgin, Saint--Madonna of the Twelve Stars</Type>
      1     <Type>seven stars</Type>
     78     <Type>Stars</Type>
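
As a quick consistency check, the counts in a uniq -c listing should add up to the line count that wc -l reports for the same search. A minimal sketch, with the listing above pasted into a temporary file (the file and variable names are illustrative only):

```shell
# The uniq -c listing for 'stars', copied from the output above.
listing=$(mktemp)
cat > "$listing" <<'EOF'
      1     <Subject>Mary,Blessed Virgin, Saint--Madonna of the Twelve Stars</Subject>
      1     <Subject>seven stars
     26     <Subject>Stars
     52     <Subject>Stars</Subject>
      1     <Type>Mary,Blessed Virgin, Saint--Madonna of the Twelve Stars</Type>
      1     <Type>seven stars</Type>
     78     <Type>Stars</Type>
EOF

# Add up the first column (the per-value counts).
total=$(awk '{ sum += $1 } END { print sum }' "$listing")
echo "$total"   # prints 160
```

In practice the same awk command can simply be appended to the pipeline after uniq -c instead of reading from a file.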

$ grep -i aurora UCSD_XML.xml | grep -e "<Subject>" -e "<Type>" | sort | uniq -c
      1     <Subject>Auroras</Subject>
      4     <Subject>Aurora</Subject>
      1     <Subject>Casino dell&apos;Aurora</Subject>
      1     <Type>Auroras</Type>
      4     <Type>Aurora</Type>
      1     <Type>Casino dell&apos;Aurora</Type>
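
The per-keyword searches could also be batched into a single pass. One possible approach, which is not the method used above, is to extract the values from each collection, fold them to lower case, and intersect the two sorted lists with comm -12. The sample files, the element layout, the values helper, and the sed pattern are all assumptions for illustration; values that span several lines (as some Subject entries above appear to) would need extra handling.

```shell
# Hypothetical miniature stand-ins for the two collections.
tmp=$(mktemp -d)
cat > "$tmp/ims.xml" <<'EOF'
<Keyword>Stars</Keyword>
<Keyword>Shakespeare</Keyword>
<Keyword>telescopes</Keyword>
EOF
cat > "$tmp/artstor.xml" <<'EOF'
<Subject>Stars</Subject>
<Type>Stars</Type>
<Subject>Aurora</Subject>
<Subject>Telescopes</Subject>
EOF

# Print the text of single-line elements whose name matches $1 (an ERE
# alternation) in file $2, lower-cased and de-duplicated.
# LC_ALL=C keeps the ordering used by sort and comm consistent.
values() {
    sed -n -E "s:.*<($1)>([^<]*)</($1)>.*:\2:p" "$2" |
        tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u
}

values 'Keyword' "$tmp/ims.xml" > "$tmp/ims.txt"
values 'Subject|Type' "$tmp/artstor.xml" > "$tmp/artstor.txt"

# comm -12 prints only the lines present in both sorted inputs.
common=$(LC_ALL=C comm -12 "$tmp/ims.txt" "$tmp/artstor.txt")
printf '%s\n' "$common"
```

Exact matching after case folding will still miss near matches such as 'Aurora' versus 'Auroras', so the grep -i searches remain useful for exploring individual terms.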

In summary, I think the Unix text tools are useful for exploring how the two schemas might be cross-referenced. Without any significant up-front work, they give a rough idea of how many records such a join would return.
-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Received on Thursday, 16 October 2003 12:47:13 UTC