- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Mon, 18 Aug 2003 09:55:38 +0100
- To: www-rdf-dspace@w3.org
Hi team,
Here are the notes I made from my meeting with MacKenzie Smith
just after the plenary on the 25th of July.
I think other people in the team may be interested.
MB: Can you give more details of why you think the library community
may be suspicious of the Semantic Web?
MS: It's more RDF than the Semantic Web in general. When RDF first came out,
the library community investigated it because it seemed relevant to a
number of problems they face. However, the tools at that point weren't
good, and the standard was not fully documented and later changed, so they
didn't have a good experience. So in order to encourage them to use RDF
we need to demonstrate that it is compelling.
MB: What about topic maps? Did the library community look at them?
MS: The people who looked at topic maps were the same people who looked
at RDF. We looked at them because we were interested in using them to model a
website. Topic maps can be used for this, but they seemed more aimed at the
creation of things like thesauri. The library community tends to deal
with very big thesauri, which are normally delivered as a service, so they
are hard to describe in topic map form.
MB: What about XML? For example, XML Schema itself is limited, and
there has been a big argument in the community about it, leading to
the creation of competing standards like Schematron, RELAX and
RELAX NG. Have people considered these alternatives?
MS: Yes, although the proliferation of schema languages makes things
harder. For example, it is likely TEI (the Text Encoding Initiative) will
use RELAX.
One of the problems here is that the library community is always looking
for quick fixes, but those quick fixes may not be the best solutions long term.
So perhaps RDF is a better long term solution.
One example of this is METS, which requires the creation of XML Schema.
The XML Schema for METS started out very lightweight, but it's getting much
bigger and more complex as time goes by and new requirements are thrown
at it. As that happens, the tools to process METS instances get more complex
too. There's a profile specification now to allow METS implementations to
be interoperable because there are so many variants of how it's been used.
For more details see the Library of Congress website that has some examples.
There are various extension schemas for METS, such as a Dublin Core one,
but sometimes people are creating schemas for things such as technical
metadata which may be too complex for other users; this is leading to
lots of different variants.
MB: At the meeting you demonstrated a system called SiteSearch. Can you
give some more details of how it works, why it was insufficient for your
needs, and how SIMILE can improve on it?
MS: SiteSearch was written by OCLC and is now available on an open source
basis. It uses ASN.1 as an encoding syntax, with BER (basic encoding rules),
and Z39.50. It was pretty much pre-web. In some ways, though, the
architecture is similar to Tamino: you specify the record structure and
write rules about record sets. This was fine for one or two schemas, and
it had some features for dealing with unstructured text.
MB: During the meeting you said that the databases used by library systems
are typically not relational databases - can you give more details?
Typically the only discussion I've seen of the databases preceding
relational databases is in brief footnotes.
MS: Actually I was looking for a textbook on these systems, and it's
pretty hard to find one. Typically the databases used by libraries are
specialist, hierarchical in nature, and use inverted indexes. They
are similar in many ways to XML databases, and typically we find that
relational databases are not sufficient for this application due to the
number of joins required.
One example of a recent system is the TED (templated database) system
developed at Harvard.
(MB: more details here
http://hul.harvard.edu/ois/systems/ted/index.html
http://hul.harvard.edu/ois/systems/ted/f-adminMDfaq.html
for an example of usage of TED, and a visual image database, see
http://ted.hul.harvard.edu:8080/ted/deliver/home?_collection=bil )
Harvard uses Tamino, but they have created programs to ingest, index,
display and crosswalk metadata. You can control the labels used to
represent fields, which fields are editable, which are indexed, which
are displayed to users. There is more stuff on the TED website.
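(MB: The inverted indexes mentioned above can be sketched very roughly in
Python. This is a hypothetical illustration of the general technique, not
how any particular library system implements it: each term maps to the set
of record IDs containing it, so queries intersect small postings sets
instead of scanning every record.)

```python
from collections import defaultdict

# Build an inverted index: term -> set of record IDs containing that term.
def build_index(records):
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in text.lower().split():
            index[term].add(record_id)
    return index

# Hypothetical catalogue records, purely for illustration.
records = {
    1: "Monet water lilies",
    2: "Louvre east entrance photograph",
    3: "Monet at the Louvre",
}
index = build_index(records)

# Records matching both terms: intersect the postings sets.
print(sorted(index["monet"] & index["louvre"]))  # [3]
```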
MB: What about SiteSearch? At the meeting you demonstrated a system
that uses SiteSearch called VIA
http://via.harvard.edu:748/WebZ/Authorize?sessionid=0:next=/html/VIA.html:style=via
but looking at the SiteSearch website
http://www.sitesearch.oclc.org/
it seems to be unavailable - are there materials available anywhere else?
MS: I think this is just a temporary thing, but you could always try
contacting Harry Wagner or Ralph Levan if you can't find any information.
MB: One of the things people are experimenting with in RDF at the moment is
the use of quads rather than triples. Quads let you denote where a
statement came from more efficiently than reification, and offer the
possibility of retaining all information rather than having to throw
information away when records are combined. This might be important because
sometimes which version of the information you use depends on the context
it will be used in; for example, if you had two fields taken from records
using different schemas you might want to retain both. How relevant is
this to the library community?
MS: The library community has given a lot of thought to integrating data
sources; it always depends on where the data came from. They have developed
complex protocols for determining whose data is better. Specifically, OCLC
has done some work on how to represent these rules for trust models. Eric
Miller should know where to find more information about this. Typically,
when merging records you do throw information away based on these rules;
the only time you might arguably want to keep both versions of the
information is where you have performed a crosswalk.
One of the reasons for being interested in RDF in this context is that you
do not need the schema in order to use the metadata: in XML the schema is
used to encode the metadata, but RDF uses the same model for everything.
XML Schema can hide things in the data. For example, in XML the tag names
have meaning, so the markup itself carries semantic information, but this
is not true in RDF.
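(MB: A rough sketch of the quads idea in Python, with hypothetical record
data: each statement carries a fourth element naming its source, so merging
two records can retain both versions of a field and let the consumer choose
by provenance later, instead of discarding one at merge time.)

```python
# Statements as quads: (subject, predicate, object, source).
# The two sources below disagree on the title of the same object.
record_a = [("doc1", "title", "East Door", "via.harvard.edu")]
record_b = [("doc1", "title", "Louvre, East Door", "oclc.org")]

# Merging as plain triples would force choosing one title; merging as
# quads keeps both statements, distinguished by their source.
merged = record_a + record_b

# A consumer can later pick a title based on which source it trusts.
titles_by_source = {source: obj for (s, p, obj, source) in merged
                    if s == "doc1" and p == "title"}
print(titles_by_source)
# {'via.harvard.edu': 'East Door', 'oclc.org': 'Louvre, East Door'}
```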
MB: Can you describe why you think it might be hard to convert VRA to IMS?
MS: VRA has a hierarchical structure, e.g.
VRA
  Groups ..... surrogates
    Works ..... surrogates
      Subworks .. surrogates
You have to have the work component.
However in IMS you just have an aggregation of objects. The objects
themselves can be complex, but the association of objects is flat.
Therefore one VRA object can map onto multiple IMS objects. This causes
problems: for example, how do we title an object? In VRA, we could have a
group about the Louvre, which contains works, e.g. image collections, and
subworks on specific aspects of the Louvre, e.g. a photograph of a
particular entrance. However, the title of the subwork might just be
"East Door", which by itself might not be very helpful; you need to
know it's part of a specific collection about the Louvre. Other examples
of non-hierarchical schemas include Dublin Core.
This requires some kind of inheritance, as the subworks have the same or
similar properties to the work. Other examples of this problem would be
collections of documents by authors, for example collections of
papers, books and letters by a famous author.
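(MB: The title-inheritance problem can be sketched in Python. The hierarchy
and titles below are hypothetical; the point is that when a VRA-like tree is
flattened into a flat aggregation of objects, a subwork title like "East
Door" only stays meaningful if the ancestor titles are carried along.)

```python
# A hypothetical VRA-like hierarchy: each node has a title and children.
vra = {
    "title": "The Louvre",
    "children": [
        {"title": "Entrances", "children": [
            {"title": "East Door", "children": []},
        ]},
    ],
}

# Flatten the tree into a flat list of objects, prefixing each title
# with its ancestors' titles so context is not lost.
def flatten(node, path=()):
    full = path + (node["title"],)
    yield " - ".join(full)
    for child in node["children"]:
        yield from flatten(child, full)

objects = list(flatten(vra))
print(objects)
# ['The Louvre', 'The Louvre - Entrances', 'The Louvre - Entrances - East Door']
```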
MB: So can't we do this in other schemas using relations?
MS: Not all schemas have a way of linking instances. METS has end
pointers that let you link to other METS instances via the struct map.
It does this using XLink.
MB: You mentioned that you thought the library community were more
familiar with CIDOC - can you give more details?
MS: CIDOC has been developed by the museum community. The library
community has known about it for some time, due to the overlap between
the two areas. So I just mentioned CIDOC in comparison to Harmony - it
seems that Harmony is more of a research project, done by the computer
science community rather than emerging from a problem domain such as
the library community or the museum community. For more information on
CIDOC see
http://www.willpowerinfo.myby.co.uk/cidoc/#English
Dr Mark H. Butler
Research Scientist HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Monday, 18 August 2003 04:56:11 UTC