notes from meeting between MacKenzie Smith and Mark Butler, 25th July 2003

Hi team,

Here are the notes I made from the meeting between MacKenzie 
Smith and myself just after the plenary on the 25th of July. 
I think other people in the team may be interested. 

MB: Can you give more details of why you think the library community 
may be suspicious of the Semantic Web?

MS: More RDF than the Semantic Web in general. When RDF first came out, 
the library community investigated it because it seemed relevant to a 
number of problems they face. However, the tools at that point weren't 
good, and the standard was not fully documented and later changed, so 
they didn't have a good experience. So in order to encourage them to use 
RDF we need to demonstrate that it is compelling.

MB: What about topic maps? Did the library community look at them?

MS: The people who looked at topic maps were the same people who looked 
at RDF. We looked at them because we were interested in using them to 
model a website. TMs can be used for this, but they seemed more aimed at 
the creation of things like thesauri. The library community tends to 
deal with very big thesauri, which are normally delivered as a service 
and so are hard to describe in TM form.

MB: What about XML? For example, XML Schema itself is limited, and 
there has been a big argument in the community about it, leading to 
the creation of competing standards like Schematron and RELAX or 
RELAX NG. Have people considered these alternatives?

MS: Yes, although the proliferation of schema languages makes things 
harder. For example, it is likely TEI (Text Encoding Initiative) will 
use RELAX.

One of the problems here is the library is always looking for quick 
fixes, but those quick fixes may not be the best solutions long term. 
So perhaps RDF is a better long term solution.

One example of this is METS, which requires the creation of XML Schema. 
The XML Schema for METS started out very lightweight, but it's getting much
bigger and more complex as time goes by and new requirements are thrown
at it. As that happens, the tools to process METS instances get more complex
too. There's a profile specification now to allow METS implementations to
be interoperable because there are so many variants of how it's been used.
For more details see the Library of Congress website that has some examples.


There are various extension schemas for METS, such as a Dublin Core one, 
but sometimes people are creating schemas for things such as technical 
metadata which may be too complex for other users. This is leading to 
lots of different variants.

MB: At the meeting you demonstrated a system called SiteSearch. Can you 
give some more details of how it works, why it was insufficient for your 
needs and how SIMILE can better it?

MS: SiteSearch was written by OCLC and is now available on an open source 
basis. It uses ASN.1 as an encoding syntax, BER (Basic Encoding Rules) and 
Z39.50. It was pretty much pre-web. In some ways, though, the architecture 
is similar to Tamino: you specify the record structure and you write rules 
about record sets. This was fine for one or two schemas, and it had some 
features for dealing with unstructured text.

MB: During the meeting you said that the databases used by library systems 
are typically not relational databases - can you give more details? 
Typically the only discussion I've seen of databases preceding relational 
databases is in brief footnotes.

MS: Actually I was looking for a textbook on these systems, and it's 
pretty hard to find one. Typically the databases used by libraries are 
specialist, are hierarchical in nature and use inverted indexes. They 
are similar in many ways to XML databases, and typically we find that 
relational databases are not sufficient for this application due to the 
number of JOINs required. 
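
As a rough illustration of the idea (not of any particular library 
system), the sketch below builds an inverted index over nested records, 
so a lookup by field and word needs no joins. The record contents, the 
field paths and the helper names walk and build_index are all 
hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchical records: nested dicts rather than relational rows.
records = {
    "rec1": {"title": "Louvre east door", "creator": {"name": "Smith"}},
    "rec2": {"title": "Louvre floor plan", "creator": {"name": "Jones"}},
}


def walk(node, path=""):
    """Yield (field_path, text) pairs for every leaf of a nested record."""
    for key, value in node.items():
        field = f"{path}/{key}" if path else key
        if isinstance(value, dict):
            yield from walk(value, field)
        else:
            yield field, str(value)


def build_index(records):
    """Map (field_path, lowercased word) to the set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, record in records.items():
        for field, text in walk(record):
            for word in text.lower().split():
                index[(field, word)].add(rec_id)
    return index


index = build_index(records)
louvre_hits = sorted(index[("title", "louvre")])
jones_hits = sorted(index[("creator/name", "jones")])
```

Each lookup is a single dictionary access on the (field, word) pair, 
whatever the depth of the field in the record hierarchy.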

One example of a recent system is the TED (templated database) system 
developed at Harvard. 

(MB: more details here
http://hul.harvard.edu/ois/systems/ted/index.html
http://hul.harvard.edu/ois/systems/ted/f-adminMDfaq.html
for an example of usage of TED, and a visual image database,  see
http://ted.hul.harvard.edu:8080/ted/deliver/home?_collection=bil )

Harvard uses Tamino, but they have created programs to ingest, index, 
display and crosswalk metadata. You can control the labels used to 
represent fields, which fields are editable, which are indexed, which 
are displayed to users. There is more stuff on the TED website. 

MB: What about SiteSearch? At the meeting you demonstrated a system 
that uses SiteSearch called VIA
http://via.harvard.edu:748/WebZ/Authorize?sessionid=0:next=/html/VIA.html:style=via
but looking at the SiteSearch website
http://www.sitesearch.oclc.org/
it seems to be unavailable - are there materials available anywhere else?

MS: I think this is just a temporary thing, but you could always try 
contacting Harry Wagner or Ralph Levan if you can't find any information.

MB: One of the things people are experimenting with in RDF at the moment 
is the use of quads rather than triples. Quads let you denote where the 
statement came from in a more efficient manner than reification, and 
provide the possibility of retaining all information rather than having 
to throw away information when records are combined. This might be 
important because sometimes which version of the information you use 
depends on the context it will be used in. For example, if you had two 
fields taken from records using different schemas you might want to 
retain both. How relevant is this to the library community?

MS: The library community has given a lot of thought to integrating data 
sources. It always depends on where the data came from. They have developed 
complex protocols for determining whose data is better. Specifically, OCLC 
has done some work on how to represent these rules for trust models. Eric 
Miller should know where to find more information about this. Typically 
when merging records you do throw information away based on these rules. 
The only time you might arguably want to keep both versions of the 
information is where you have performed a crosswalk. 
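
The two approaches discussed here can be contrasted in a small sketch: 
quads keep every statement together with its source, while a trust-based 
merge keeps only one value per field. Everything in it is an assumption 
made for illustration: the Quad tuple, the source names, the TRUST_RANK 
table and the rule of preferring the lowest rank. It does not represent 
the OCLC rules mentioned above.

```python
from collections import namedtuple

# A quad is a triple plus a fourth element recording where it came from.
Quad = namedtuple("Quad", "subject predicate value source")

# Hypothetical statements about one item, taken from two source records.
quads = [
    Quad("item:42", "title", "East Door", "source:a"),
    Quad("item:42", "title", "Louvre, East Door", "source:b"),
    Quad("item:42", "creator", "Unknown", "source:b"),
]

# Hypothetical trust model: a lower rank means a more trusted source.
TRUST_RANK = {"source:a": 0, "source:b": 1}


def values_with_sources(quads, subject, predicate):
    """Quad-style lookup: return every value together with its source."""
    return [(q.value, q.source) for q in quads
            if q.subject == subject and q.predicate == predicate]


def merge_by_trust(quads, subject):
    """Merge-style lookup: keep one value per field, from the most trusted source."""
    best = {}
    for q in quads:
        if q.subject != subject:
            continue
        # Sources missing from the table rank below all known sources.
        rank = TRUST_RANK.get(q.source, len(TRUST_RANK))
        if q.predicate not in best or rank < best[q.predicate][0]:
            best[q.predicate] = (rank, q.value)
    return {predicate: value for predicate, (rank, value) in best.items()}


both_titles = values_with_sources(quads, "item:42", "title")
merged = merge_by_trust(quads, "item:42")
```

Here both_titles retains the two titles with their sources, whereas 
merged holds a single title and has discarded the other.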

One of the reasons for being interested in RDF in this context is that 
you do not need the schema in order to use the metadata. In XML the 
schema is used to encode the metadata, but RDF has the same schema for 
everything. XML Schema can hide things in the data. For example, in XML 
the tag names have meaning, so the markup itself has semantic 
information, but this is not true in RDF.

MB: Can you describe why you think it might be hard to convert VRA to IMS?

MS: VRA has a hierarchical structure e.g.

VRA
   Groups .... surrogates
    Works ...... surrogates
      Subworks ...surrogates

You have to have the work component.

However in IMS you just have an aggregation of objects. The objects 
themselves can be complex, but the association of objects is flat.

Therefore one VRA object can map onto multiple IMS objects. This causes 
problems: for example, how do we title an object? In VRA, we could have a 
group about the Louvre, which contains works, e.g. image collections, and 
subworks of specific aspects of the Louvre, e.g. a photograph of a 
particular entrance. However, the title of the subwork might just be 
"East Door", which by itself might not be very helpful; you need to 
know it's part of a specific collection about the Louvre. Other examples 
of non-hierarchical schemas include Dublin Core.

This requires some kind of inheritance, as the subworks have the same or 
similar properties to the work. Other examples of this problem would 
be collections of documents by authors, for example collections of 
papers, books and letters by a famous author. 
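
A minimal sketch of the titling problem, assuming a much-simplified 
hierarchy: each node becomes one flat object, and its title is built 
from the titles of its ancestors so that "East Door" keeps its context. 
The Node class, the flatten function and the sample titles are 
hypothetical and are not part of VRA or IMS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    level: str   # "group", "work" or "subwork"
    title: str
    children: List["Node"] = field(default_factory=list)


def flatten(node, ancestors=()):
    """Yield one flat object per node, titled with its full ancestor path."""
    path = ancestors + (node.title,)
    yield {"level": node.level, "title": " / ".join(path)}
    for child in node.children:
        yield from flatten(child, path)


# Hypothetical hierarchy: a group, one work and one subwork.
louvre = Node("group", "The Louvre", [
    Node("work", "Entrance photographs", [
        Node("subwork", "East Door"),
    ]),
])

flat = list(flatten(louvre))
```

One hierarchical object yields three flat ones here, and the subwork's 
flat title carries the group and work titles along with its own.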

MB: So can't we do this in other schemas using relations?

MS: Not all schemas have a way of linking instances. METS has end 
pointers that let you link to other METS instances via the struct map. 
It does this using XLink. 

MB: You mentioned that you thought the library community were more 
familiar with CIDOC - can you give more details?

MS: CIDOC has been developed by the museum community. The library 
community has known about it for some time, due to the overlap between 
the two areas. So I just mentioned CIDOC in comparison to Harmony - it 
seems that Harmony is more a research project, and was done by the computer 
science community rather than emerging from a problem domain such as 
the library community or the museum community. For more information on 
CIDOC see
http://www.willpowerinfo.myby.co.uk/cidoc/#English

Dr Mark H. Butler
Research Scientist                HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/

Received on Monday, 18 August 2003 04:56:11 UTC