- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Thu, 16 Oct 2003 09:45:53 -0700
- Cc: SIMILE public list <www-rdf-dspace@w3.org>
In addition to linking author records between IMS and ArtStor, I've also been experimenting with joins between the IMS 'Keyword' fields and the ArtStor 'Subject' and 'Type' fields, with some occasionally useful results. Examples of overlaps include keywords like 'Shakespeare', 'Aurora', 'Language', 'Stars', 'telescopes', 'searching', &c. Since the controlled vocabularies don't match, overlaps between the two systems are imperfect, but they do sometimes yield results that I could imagine an educator using to fill in missing materials for a course.

Mark suggested I describe the tools I'm using for these joins, which are the standard Unix text tools: grep, cut, sort, wc, and uniq. To get a list of keywords used in IMS, for example, I use:

    $ grep Keyword *.xml | cut -d: -f2 | sort | uniq | less

To find one of those keywords in the ArtStor database ('stars', for example) I use:

    $ grep -i stars UCSD_XML.xml | grep -e "<Subject>" -e "<Type>" | wc -l

The advantage of these tools is that they can handle very large text files very quickly. Searching through the full ArtStor database with a search like this takes on the order of 5 seconds on my three-year-old Linux server.
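The per-keyword counts above could be automated along the following lines. This is only a rough sketch under stated assumptions: the helper names count_term and overlap_report are hypothetical, the keyword list is assumed to be a plain file with one keyword per line (tags already stripped), and each Subject or Type element is assumed to sit on a single line, as in the grep pipelines above.

```shell
#!/bin/sh
# Hypothetical helper: count_term TERM FILE
# Counts the lines of FILE that mention TERM (case-insensitive, literal match)
# and also contain a <Subject> or <Type> tag. "|| true" keeps the exit status
# zero when the count is 0, since grep exits non-zero when nothing matches.
count_term() {
    grep -i -F -- "$1" "$2" | grep -c -e "<Subject>" -e "<Type>" || true
}

# Hypothetical helper: overlap_report KEYWORD_LIST FILE
# Reads one keyword per line from KEYWORD_LIST and prints "count<TAB>keyword"
# for every keyword found at least once in the Subject/Type lines of FILE.
overlap_report() {
    while IFS= read -r kw; do
        [ -n "$kw" ] || continue            # skip blank lines
        n=$(count_term "$kw" "$2")
        if [ "$n" -gt 0 ]; then
            printf '%s\t%s\n' "$n" "$kw"
        fi
    done < "$1"
}
```

With a keyword list saved as keywords.txt (an assumed file name), something like `overlap_report keywords.txt UCSD_XML.xml | sort -rn` would list the overlapping keywords by frequency. It rescans the file once per keyword, so it trades speed for simplicity.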

The word 'stars' appears in the Subject or Type fields a total of 160 times, with the following set of values:

    $ grep -i stars UCSD_XML.xml | grep -e "<Subject>" -e "<Type>" | sort | uniq -c
          1 <Subject>Mary,Blessed Virgin, Saint--Madonna of the Twelve Stars</Subject>
          1 <Subject>seven stars
         26 <Subject>Stars
         52 <Subject>Stars</Subject>
          1 <Type>Mary,Blessed Virgin, Saint--Madonna of the Twelve Stars</Type>
          1 <Type>seven stars</Type>
         78 <Type>Stars</Type>

    $ grep -i aurora UCSD_XML.xml | grep -e "<Subject>" -e "<Type>" | sort | uniq -c
          1 <Subject>Auroras</Subject>
          4 <Subject>Aurora</Subject>
          1 <Subject>Casino dell'Aurora</Subject>
          1 <Type>Auroras</Type>
          4 <Type>Aurora</Type>
          1 <Type>Casino dell'Aurora</Type>

In summary, I think the Unix text tools are a useful way to explore how the schemas might be cross-referenced: with no significant up-front work, they give a rough idea of how many records such a join would return.

--
========================================================
Kevin Smathers                kevin.smathers@hp.com
Hewlett-Packard               kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd.            650-857-4477 work
M/S 1135                      650-852-8186 fax
Palo Alto, CA 94304           510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Thursday, 16 October 2003 12:47:13 UTC