Faceted search tools, next steps on vocabulary mapping

Vineet, David, Haystack team

> I tried out the SIMILE datasets yesterday. My system is designed to be
> general purpose, and as expected the navigation worked as per my
> expectations. Some issues cropped up in other parts of Haystack, as
> Prof. Karger mentioned in his e-mail earlier today.

First, thanks for taking a look at the data. I'm keen to fix the issues as
soon as we have consensus - please see my other email.

> Otherwise, the only 
> effort involved in getting SIMILE data running in Haystack involved 
> converting the schema to rdf and loading the data into Haystack.

I guess you mean RDF/XML - it's in N3 at the moment, which is still RDF, right?


When I have time, I'd like to create a proper automated build process for
SIMILE, so that where files are kept in one canonical format but the team
needs other formats, those files can be built automatically. For a long
time the CVS repository has only had a few users in HP, so we now need to
do some reorganisation to support the whole team better, but at the moment
I'm busy with other things, particularly the demo script.
 
> I am however interested in more detailed expectations of how you would
> expect the system to perform on the SIMILE dataset. If possible, I would
> like one or two, detailed, click-by-click, scenarios of a person
> browsing the system. I want to make sure that I have not made any
> trade-offs while designing a general-purpose system.

I'm currently working towards this in the demo script. In the meantime, I
will explain my current thinking here - feedback is very welcome.

A while back Kevin and I did some work on identifying overlaps between the
IMS and Artstor data sets - see

http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Oct/0108.html

So it would be interesting to load the IMS and the Artstor data into
Haystack, then see if browsing on any of these terms returned records from
both. This requires using RDFS or OWL to map between the two sets of data -
is this easy to do in Haystack? The other possibility is to read some IMS
data and some Artstor data into Jena, then load the schema and run an
inferencer, then serialize the data back out and read it into Haystack. 
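
As a rough illustration of the load / infer / serialize route, the sketch
below applies the RDFS subPropertyOf rule (rdfs7) to a handful of triples
held as plain Python tuples. It is not Jena or Haystack code. The ex:subject
superproperty and the shortened record names are hypothetical, and the sketch
assumes both records already use the same value for "telescopes", which is
not yet true of the real data.

```python
# Triples are (subject, predicate, object) tuples; prefixed names stand in
# for full URIs.
SUBPROPERTY = "rdfs:subPropertyOf"

# Hypothetical mapping schema: both subject properties specialise ex:subject.
schema = {
    ("dc:subject", SUBPROPERTY, "ex:subject"),
    ("vra:subject", SUBPROPERTY, "ex:subject"),
}

# Abbreviated records; the record names are shortened for readability.
data = {
    ("artstor:UCSD_41822000860534", "vra:subject", "telescopes"),
    ("ocw:12-409", "dc:subject", "telescopes"),
}


def infer_subproperties(schema, data):
    """Apply rule rdfs7: (p subPropertyOf q) and (s p o) entail (s q o)."""
    inferred = set(data)
    changed = True
    while changed:  # repeat until no new triples appear, to follow chains
        changed = False
        for (p, rel, q) in schema:
            if rel != SUBPROPERTY:
                continue
            for (s, pred, o) in list(inferred):
                if pred == p and (s, q, o) not in inferred:
                    inferred.add((s, q, o))
                    changed = True
    return inferred


def browse(triples, predicate, value):
    """Return the sorted subjects that have the given predicate and value."""
    return sorted(s for (s, p, o) in triples if p == predicate and o == value)


closure = infer_subproperties(schema, data)
hits = browse(closure, "ex:subject", "telescopes")
print(hits)
```

Browsing on the shared property then returns records from both data sets,
which is the behaviour the test in Haystack would be looking for.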

The other issue here is that we need to decide on the map, and also fix a
few remaining inconsistencies between the two datasets.

For example, in Artstor the term "telescopes" might be mentioned in two
places:

<http://web.mit.edu/simile/metadata/artstor/id#UCSD_41822000860534>
      vra:subject <http://web.mit.edu/simile/metadata/artstor/subject#telescopes> ;
      vra:typeAAT "telescopes" .

whereas in IMS it might be mentioned here:

<ocw:OcwWeb/Earth--Atmospheric--and-Planetary-Sciences/12-409Hands-On-Astronomy--Observing-Stars-and-PlanetsSpring2002/CourseHome/index.htm>
      dc:subject "planets" , "spectroscopy" , "stars" , "moon" , "telescopes" .

In the Artstor data, following a suggestion from Eric, we are turning
"telescopes" into a URI because it is a controlled term, although as I've
noted before, subject isn't always used this way - for more discussion see
http://lists.w3.org/Archives/Public/www-rdf-dspace/2003Oct/0114.html
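
One way to line the IMS literals up with the Artstor URIs would be to mint a
URI from each literal. The sketch below assumes the rule is simply "append
the term to the Artstor subject namespace, with spaces replaced by
underscores", which is inferred from the Artstor records quoted in this
message rather than taken from the actual conversion code. The function names
are hypothetical, and reusing the Artstor namespace for IMS terms is itself
an assumption.

```python
SUBJECT_BASE = "http://web.mit.edu/simile/metadata/artstor/subject#"


def term_to_uri(term, base=SUBJECT_BASE):
    """Mint a URI for a controlled term: trim it, then replace spaces."""
    return base + term.strip().replace(" ", "_")


def literals_to_uris(triples, predicate):
    """Rewrite the literal objects of one predicate as minted URIs."""
    rewritten = set()
    for s, p, o in triples:
        if p == predicate:
            rewritten.add((s, p, term_to_uri(o)))
        else:
            rewritten.add((s, p, o))  # leave other properties untouched
    return rewritten


# Abbreviated IMS record name, for illustration only.
ims = {
    ("ocw:12-409", "dc:subject", "telescopes"),
    ("ocw:12-409", "dc:title", "Hands-On Astronomy"),
}
converted = literals_to_uris(ims, "dc:subject")
print(sorted(converted))
```

This only helps where the literal really is a controlled term. Free-text
subjects would produce URIs that match nothing.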

Another overlap is Matthew Calbraith Perry, e.g. in IMS:

<http://ocw.mit.edu/NR/rdonlyres/6581A505-899A-498F-9754-6EAD461BDA44/0/01_titlepage_s.jpg>
      dc:contributor <ocwc:Perry%2C%20Matthew%20Calbraith> .

<http://ocw.mit.edu/NR/rdonlyres/F524752D-3926-4849-B2E2-4B3C66506440/0/18_XIV_28_093_s.jpg>
      dc:description "Portrait of Perry, photograph by Mathew Brady; Matthew Calbraith Perry, daguerreotype by P. Haas" .

whereas in Artstor:

<http://web.mit.edu/simile/metadata/artstor/id#UCSD_41822003055447>
      vra:subject <http://web.mit.edu/simile/metadata/artstor/subject#Perry,_Matthew_Calbraith,1794-1858> ;
      vra:title "Commodore Perry (left) and Captain Henry A. Adams, as seen by a Japanese artist" ;
      vra:typeAAT "Perry, Matthew Calbraith,1794-1858" .

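This case is clearly more complicated than the first one: in IMS, Perry
appears as a dc:contributor URI and inside a free-text dc:description, while
in Artstor he is a vra:subject URI with his dates appended.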
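
To make the difficulty concrete, the sketch below shows the kind of
normalisation that matching the Perry records would need: percent-decoding
the IMS contributor, undoing the underscores in the Artstor URI, and dropping
the trailing life dates. The function names and the exact rules are
hypothetical, chosen only to fit the records quoted here. Other names in the
data may well need different handling.

```python
import re
from urllib.parse import unquote


def normalise_name(raw):
    """Reduce a name taken from a URI or literal to a comparable key."""
    name = unquote(raw)            # decode %2C and %20
    name = name.replace("_", " ")  # Artstor URIs use underscores for spaces
    name = re.sub(r",\s*\d{4}-(\d{4})?$", "", name)  # drop trailing life dates
    return " ".join(name.split()).lower()


def forename_first(key):
    """Turn 'surname, forenames' into 'forenames surname'."""
    surname, _, forenames = key.partition(", ")
    return f"{forenames} {surname}".strip()


# Local parts of the URIs quoted above, without their namespaces.
ims_contributor = "Perry%2C%20Matthew%20Calbraith"
artstor_subject = "Perry,_Matthew_Calbraith,1794-1858"
ims_description = ("Portrait of Perry, photograph by Mathew Brady; Matthew "
                   "Calbraith Perry, daguerreotype by P. Haas")

# Do the two URIs name the same person once normalised?
same_person = normalise_name(ims_contributor) == normalise_name(artstor_subject)

# Does the free-text description mention the name in forename-first order?
mentioned = forename_first(normalise_name(artstor_subject)) in ims_description.lower()

print(same_person, mentioned)
```

Matching inside the free-text description is plain substring search, so it
would be fragile on real data.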

Perhaps one way to determine the map is simply to go through the list of
overlaps and build a comprehensive list of all the instances of shared
terms between the two vocabularies, so that we have a good understanding
of the problem?
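
A first pass at that list could be generated rather than compiled by hand. A
minimal sketch, assuming the property values have already been extracted
into (dataset, record, property, value) rows, is below. The row layout, the
shortened record names and the lower-casing rule are all hypothetical.

```python
from collections import defaultdict

# (dataset, record, property, value) rows taken from the examples in this
# message; record names are abbreviated.
rows = [
    ("artstor", "UCSD_41822000860534", "vra:typeAAT", "telescopes"),
    ("ims", "12-409", "dc:subject", "planets"),
    ("ims", "12-409", "dc:subject", "spectroscopy"),
    ("ims", "12-409", "dc:subject", "stars"),
    ("ims", "12-409", "dc:subject", "moon"),
    ("ims", "12-409", "dc:subject", "telescopes"),
]


def shared_terms(rows):
    """Index every use of a term, then keep terms seen in more than one dataset."""
    index = defaultdict(list)
    for dataset, record, prop, value in rows:
        index[value.strip().lower()].append((dataset, record, prop))
    return {
        term: uses
        for term, uses in index.items()
        if len({dataset for dataset, _, _ in uses}) > 1
    }


overlaps = shared_terms(rows)
for term, uses in sorted(overlaps.items()):
    print(term, uses)
```

Because each entry records the property as well as the record, the output
also shows which pairs of properties the map would have to relate.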

Comments here are very welcome. Regards,

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Wednesday, 19 November 2003 09:06:29 UTC