- From: Butler, Mark <Mark_Butler@hplb.hpl.hp.com>
- Date: Mon, 18 Aug 2003 09:55:38 +0100
- To: www-rdf-dspace@w3.org
Hi team,

Here are the notes I made from the meeting between MacKenzie Smith and myself just after the plenary on the 25th of July. I think other people in the team may be interested.

MB: Can you give more details of why you think the library community may be suspicious of the Semantic Web?

MS: More RDF than the Semantic Web in general. When RDF first came out, the library community investigated it because it seemed relevant to a number of problems they face. However, the tools at that point weren't good, and the standard was not fully documented and later changed, so they didn't have a good experience. So in order to encourage them to use RDF we need to demonstrate that it is compelling.

MB: What about topic maps? Did the library community look at them?

MS: The people who looked at topic maps were the same people who looked at RDF. We looked at them because we were interested in using them to model a website. Topic maps can be used for this, but they seemed more aimed at the creation of things like thesauri. The library community tends to deal with very big thesauri, which are normally delivered as a service, so they are hard to describe in topic map form.

MB: What about XML? XML Schema itself is limited, and there has been a big argument in the community about it, leading to the creation of competing standards like Schematron and RELAX NG. Have people considered these alternatives?

MS: Yes, although the proliferation of schema languages makes things harder. For example, it is likely TEI (the Text Encoding Initiative) will use RELAX NG. One of the problems here is that the library community is always looking for quick fixes, but those quick fixes may not be the best solutions long term. So perhaps RDF is a better long-term solution. One example of this is METS, which requires the creation of an XML Schema. The XML Schema for METS started out very lightweight, but it is getting much bigger and more complex as time goes by and new requirements are thrown at it.
As that happens, the tools to process METS instances get more complex too. There is now a profile specification to allow METS implementations to be interoperable, because there are so many variants of how it has been used. For more details see the Library of Congress website, which has some examples. There are various extension schemas for METS, such as a Dublin Core one, but sometimes people create schemas for things such as technical metadata which may be too complex for other users, and this is leading to lots of different variants.

MB: At the meeting you demonstrated a system called SiteSearch. Can you give some more details of how it works, why it was insufficient for your needs, and how SIMILE can better it?

MS: SiteSearch was written by OCLC and is now available on an open-source basis. It uses ASN.1 as an encoding syntax, BER (Basic Encoding Rules), and Z39.50. It was pretty much pre-web. In some ways, though, the architecture is similar to Tamino: you specify the record structure and write rules about record sets. This was fine for one or two schemas, and it had some features for dealing with unstructured text.

MB: During the meeting you said that the databases used by library systems are typically not relational databases - can you give more details? Typically the only discussion I've seen of the databases preceding relational databases is brief footnotes.

MS: Actually, I was looking for a textbook on these systems, and it's pretty hard to find one. Typically the databases used by libraries are specialist, hierarchical in nature, and use inverted indexes. They are similar in many ways to XML databases, and typically we find that relational databases are not sufficient for this application due to the number of JOINs required. One example of a recent system is the TED (templated database) system developed at Harvard.
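(MB: MS's point about inverted indexes can be sketched minimally in Python. The records and field values below are hypothetical and no particular library system is implied; the point is only that a multi-term search becomes a set intersection on the index, with no JOIN across link tables.)

```python
# Minimal inverted-index sketch: map each term directly to the set of
# record identifiers containing it, so searching on several terms is a
# set intersection rather than a relational JOIN.
from collections import defaultdict

records = {  # hypothetical bibliographic records
    1: "history of the Louvre museum",
    2: "architecture of the Louvre east wing",
    3: "history of museum cataloguing",
}

index = defaultdict(set)
for rec_id, text in records.items():
    for term in text.split():
        index[term].add(rec_id)

def search(*terms):
    """Return the ids of records containing every one of the terms."""
    result_sets = [index.get(t, set()) for t in terms]
    return set.intersection(*result_sets) if result_sets else set()

print(sorted(search("history", "museum")))  # -> [1, 3]
print(sorted(search("Louvre")))             # -> [1, 2]
```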
(MB: for more details see http://hul.harvard.edu/ois/systems/ted/index.html and http://hul.harvard.edu/ois/systems/ted/f-adminMDfaq.html ; for an example of the use of TED in a visual image database, see http://ted.hul.harvard.edu:8080/ted/deliver/home?_collection=bil )

Harvard uses Tamino, but they have created programs to ingest, index, display, and crosswalk metadata. You can control the labels used to represent fields, which fields are editable, which are indexed, and which are displayed to users. There is more on the TED website.

MB: What about SiteSearch? At the meeting you demonstrated a system that uses SiteSearch called VIA ( http://via.harvard.edu:748/WebZ/Authorize?sessionid=0:next=/html/VIA.html:style=via ), but looking at the SiteSearch website ( http://www.sitesearch.oclc.org/ ) it seems to be unavailable - are there materials available anywhere else?

MS: I think this is just a temporary thing, but you could always try contacting Harry Wagner or Ralph Levan if you can't find any information.

MB: One of the things people are experimenting with in RDF at the moment is the use of quads rather than triples. Quads let you denote where a statement came from in a more efficient manner than reification, and provide the possibility of retaining all information rather than having to throw information away when records are combined. This might be important because sometimes which version of the information you use depends on the context it will be used in; for example, if you had two fields taken from records using different schemas you might want to retain both. How relevant is this to the library community?

MS: The library community has given a lot of thought to integrating data sources; it always depends on where the data came from. They have developed complex protocols for determining whose data is better. Specifically, OCLC has done some work on how to represent these rules for trust models. Eric Miller should know where to find more information about this.
Typically when merging records you do throw information away based on these rules; the only time you might arguably want to keep both versions of the information is where you have performed a crosswalk. One of the reasons for being interested in RDF in this context is that you do not need the schema in order to use the metadata. In XML the schema is used to encode the metadata, whereas RDF has the same schema for everything. An XML Schema can hide things in the data: in XML the tag names have meaning, so the markup itself carries semantic information, but this is not true in RDF.

MB: Can you describe why you think it might be hard to convert VRA to IMS?

MS: VRA has a hierarchical structure, e.g.:

  Groups ....... surrogates
    Works ...... surrogates
      Subworks . surrogates

You have to have the work component. However, in IMS you just have an aggregation of objects. The objects themselves can be complex, but the association of objects is flat. Therefore one VRA object can map onto multiple IMS objects. This causes problems - for example, how do we title an object? In VRA, we could have a group about the Louvre, which contains works, e.g. image collections, and subworks covering specific aspects of the Louvre, e.g. a photograph of a particular entrance. However, the title of the subwork might just be "East Door", which by itself might not be very helpful; you need to know it is part of a specific collection about the Louvre. Other examples of non-hierarchical schemas include Dublin Core. This requires some kind of inheritance, as the subworks have the same or similar properties to the work. Other examples of this problem would be collections of documents by authors, for example collections of papers, books, and letters by a famous author.

MB: So can't we do this in other schemas using relations?

MS: Not all schemas have a way of linking instances. METS has pointers that let you link to other METS instances via the struct map. It does this using XLink.
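(MB: the quad idea raised above - keeping the source of each statement so that a merge need not discard either version - can be sketched in plain Python. The identifiers, field values, and source names below are hypothetical, and no particular RDF toolkit is assumed.)

```python
# Sketch of quads: each statement is (subject, predicate, object, source).
# Merging two records about the same resource keeps every value together
# with its provenance, instead of discarding one under a precedence rule.

record_a = [("urn:work:east-door", "title", "East Door", "catalogA")]
record_b = [("urn:work:east-door", "title",
             "Louvre, East Door entrance", "catalogB")]

merged = record_a + record_b  # nothing is thrown away in the merge

def values_for(quads, subject, predicate):
    """All values asserted for (subject, predicate), with their sources."""
    return [(obj, src) for s, p, obj, src in quads
            if s == subject and p == predicate]

# A trust rule can still be applied at read time, depending on context,
# e.g. prefer catalogB's value when it has asserted one:
by_source = {src: obj for obj, src in
             values_for(merged, "urn:work:east-door", "title")}
preferred = by_source.get("catalogB") or by_source.get("catalogA")
print(preferred)  # -> Louvre, East Door entrance
```

With plain triples the merge itself would have to pick a winner; with quads the choice is deferred to the consumer, which matches the point that the "better" value depends on where the data came from and how it will be used.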
MB: You mentioned that you thought the library community was more familiar with CIDOC - can you give more details?

MS: CIDOC has been developed by the museum community. The library community has known about it for some time, due to the overlap between the two areas. So I just mentioned CIDOC in comparison to Harmony - it seems that Harmony is more of a research project, done by the computer science community rather than emerging from a problem domain such as the library or museum communities. For more information on CIDOC see http://www.willpowerinfo.myby.co.uk/cidoc/#English

Dr Mark H. Butler
Research Scientist, HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Monday, 18 August 2003 04:56:11 UTC