Modelling Thesauri for the Semantic Web

Brian Matthews, Alistair Miles, and Michael Wilson

CCLRC, Rutherford Appleton Laboratory, Didcot, OXON OX11 0QX, UK
{b.m.matthews, a.j.miles, m.d.wilson}@rl.ac.uk

Abstract. For the Semantic Web to be effective, the notion of a controlled vocabulary shared by a community is central. The structuring of controlled vocabularies, including taxonomies, thesauri and ontologies, has been studied for many years. In this paper we investigate the use of one such method of defining controlled vocabularies: thesauri. We compare approaches which have been proposed for expressing such controlled vocabularies in the Semantic Web, and propose a common form to allow a migration path from existing thesauri to the Semantic Web. We then discuss how such a common thesaurus format might be used in the Semantic Web.

1. Introduction

The idea of the Semantic Web is founded on the notion of a controlled vocabulary: a language of terms with agreed meanings which can be shared within a particular community. There has been work for many years on the structuring of controlled vocabularies, including taxonomies, thesauri and ontologies, and they are widely used in the fields of digital libraries and information retrieval to control resource cataloguing, querying and filtering.

If Semantic Web technologies based on RDF are to be adopted as widely as HTML and XML currently are, a clear migration path from present technologies to new ones is needed. Thesauri are used throughout the information retrieval world as a method of providing controlled vocabularies for indexing and querying [25]. The World-Wide Web Consortium (W3C) is developing standards for the representation of ontologies to constrain the vocabularies of resource descriptions based on RDF (RDF Schema [8], OWL [26]). Such ontologies will allow distributed authoritative definition of vocabularies that support cross-referencing. Such ontology representations are planned to fulfil the role traditionally undertaken by thesauri in digital libraries. Therefore, if those ontologies are to be adopted and assimilated into existing information retrieval infrastructure, either a migration path from current thesauri to the Semantic Web or support for their co-existence is required.

This offers an opportunity for existing work on thesauri to promote the uptake of the Semantic Web; by delivering established vocabularies to the wider community, the use of semantic markup for web resources can be quickly enabled without additional effort in defining the terminology used. Further, thesauri can then form the basis for developing richer ontological structures.

In this paper we shall investigate the use of one such method of defining controlled vocabularies, thesauri. We shall compare approaches which have been advocated for expressing such controlled vocabularies in the Semantic Web, and propose common forms to allow a migration path from existing thesauri to the Semantic Web. We shall then discuss how such a common thesaurus format might be used in the Semantic Web.

1.1. Knowledge Organisation Systems and Thesauri

Authoritative lists of categorisation terms or controlled vocabularies, generically known as Knowledge Organisation Systems (KOS), have been used in libraries for centuries to catalogue print media. Searching with terms from a limited controlled vocabulary increases the precision of the search, and when the term is both locatable in the controlled vocabulary and actually used to index documents it will also improve the recall [20]. Since the 1970s those word lists have been structured as thesauri to improve the location and selection of terms within and across authorities [21]. A thesaurus is a compilation of words and phrases showing synonymous, hierarchical, and other relationships and dependencies, the function of which is to provide a standardised vocabulary for information storage and retrieval systems [1].

The structure of thesauri is controlled by international standards that are among the most influential ever developed for the library and information field. The three main standards define the relations to be used between terms in monolingual thesauri (ISO 2788:1986 [18]), the additional relations for multilingual thesauri (ISO 5964:1985 [17]), and methods for examining documents, determining their subjects, and selecting index terms (ISO 5963:1985 [16]). ISO 2788 contains separate sections covering indexing terms, compound terms, basic relationships in a thesaurus, display of terms and their relationships, and management aspects of thesaurus construction. The general principles in ISO 2788 are considered language- and culture-independent. As a result, ISO 5964:1985 refers to ISO 2788 and uses it as a point of departure for dealing with the specific requirements that emerge when a single thesaurus attempts to express "conceptual equivalencies" among terms selected from more than one natural language [4].

The ISO standards for thesauri (ISO 2788 and ISO 5964:1985) are developed and maintained by the International Organization for Standardization, Technical Committee 46 whose remit is Information and Documentation. ISO 5964:1985 is currently undergoing review by ISO TC46/SC 9, and it is expected that among changes to it will be the inclusion of a standard interchange format for thesauri. To facilitate the growth of the Semantic Web, it would be sensible to try to ensure that such an interchange format is as compatible with Semantic Web ontology representations as possible.

In order to develop a migration path from current thesauri to Semantic Web representations, it is necessary to understand the semantics of thesauri and how they relate to those of ontologies.

1.3. Thesaurus classes and relations

When searching for information, the user enters query terms to retrieve answers. If a query term is not used to index items, the user needs to know the preferred term and use that instead. If the search returns too few answers, the user will want to broaden it to recall more items, whereas if it returns too many, they will want to narrow it. The hierarchical links in a thesaurus map onto this desired functionality of broader (BT) and narrower (NT) search terms.

There is no well founded epistemological basis for the standard thesaurus relations in terms of the meaning of words. Thesauri are taxonomic hierarchies of terms, but the hierarchical relationship is not simple type subsumption - for example, school can be a narrower term of education. The subsumption relation can be understood as a relation of implication which relates more specific to more general concepts in conceptual taxonomies. The relationships between broader and narrower terms include:

  1. Type/subset subsumption (generalisation relation)
  2. Part of (partitive relation)
  3. Instance
  4. Entailment across situational relations (Barwise and Perry [5]). For example, in an education situation a school would be a location where education takes place (e.g. BT education NT school), or a teacher a person who provides education (e.g. BT education NT teacher). This common usage is excluded from the ISO standard.

This unrestrained use of subsumption ("is-a" relation) to accomplish a variety of representation tasks is also common to semantic networks and ontologies (Brachman [7]). Difficulties in integration and reuse are the major side-effects of such unclear and often inconsistent use of the subsumption relation.

Amann & Fundulaki [3] have suggested a method for developing RDF Schema from thesauri, illustrated on the Getty Art and Architecture Thesaurus (AAT) [6]. However, they only account for the first of these relationships in the AAT, which is considerably more semantically exact than most thesauri. Most thesauri also permit multiple inheritance, which undermines their method (e.g. the term economics could be included both in a hierarchy of social factors and in one of academic subjects). Further methods will need to be sought to relate the hierarchical relations of most existing thesauri to a more precise semantics.

The objects in thesaurus hierarchies are either concepts or terms. Older thesauri use the terms themselves as the nodes, while more recent ones use concepts to which the terms apply (e.g. [6]), usually identified by a numeric ID. Doerr and Fundulaki [11, 12] have proposed the introduction of concepts to the ISO standard for thesauri as part of its revision. Terms are word forms in a language, usually written in upper case. In English, noun terms are usually given in the plural. Terms can be any part of speech, and may be multi-word idioms or, in English, phrasal verbs.

All thesauri               Multilingual thesauri
Top Term         TT        Exact Equivalent
Broader Term     BT        Inexact Equivalent
Narrower Term    NT        Partial Equivalent
Related Term     RT        One to Many Equivalent
Used For         UF        Language of
Use              USE
Scope Note       SN

Table 1: The relations in all thesauri, with their standard two letter abbreviations and those specific to multilingual thesauri.

Table 1 summarises the standard thesaurus relationships. For any hierarchy there is one and only one Top Term, which can be regarded as a second class of object. The third, fourth and fifth classes are scope notes, dates and histories. Scope notes can be sub-typed into different classes in individual thesauri, but the standard does not do so itself.

In monolingual thesauri, the relation TT exists between any term and its top term in a hierarchy, while BT, NT, RT, UF can exist between two terms. The SN relation exists between terms and scope notes.

BT and NT are reciprocal relations on the edges in the hierarchy, BT pointing towards the top term of the hierarchy and NT to the terminal nodes. NT has a relationship akin to child while BT is akin to parent.
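
The reciprocity of BT and NT, and their use for broadening or narrowing a search, can be illustrated with a minimal Python sketch. The toy vocabulary and the function names (invert, broaden, narrow) are hypothetical and are not taken from the ISO standards or from any existing tool; only BT links are stored, and NT is derived as the reciprocal relation.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: each term maps to its broader terms (BT).
BT = {
    "SCHOOLS": {"EDUCATION"},
    "TEACHERS": {"EDUCATION"},
    "PRIMARY SCHOOLS": {"SCHOOLS"},
}


def invert(relation):
    """Return the reciprocal relation: if a BT b, then b NT a."""
    inverse = defaultdict(set)
    for term, targets in relation.items():
        for target in targets:
            inverse[target].add(term)
    return dict(inverse)


# NT is not stored separately; it is derived from BT.
NT = invert(BT)


def broaden(term):
    """Terms one step towards the top term (to recall more items)."""
    return sorted(BT.get(term, set()))


def narrow(term):
    """Terms one step towards the terminal nodes (to recall fewer items)."""
    return sorted(NT.get(term, set()))


print(broaden("SCHOOLS"))   # ['EDUCATION']
print(narrow("EDUCATION"))  # ['SCHOOLS', 'TEACHERS']
```

Storing only one direction and deriving the other is one possible design choice; it avoids the two relations drifting out of step when the thesaurus is edited.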

In contrast to these hierarchical relations, there are the associative relations UF, USE and RT. UF links a preferred term to a synonymous non-preferred term, and USE is its reciprocal; together they show that one and only one of a set of terms with equivalent meaning is preferred by the categorisation system and is used for indexing. The referred-from terms include synonyms in direct and inverted word order, alternative spellings (including singular and plural forms), alternative endings, changed or cancelled headings, and abbreviations and acronyms. For phrase headings entered in the inverted form, USE references are made from the straight form. For phrase headings entered in the straight form, USE references are made from the inverted form in selective cases. For compound headings and for topical headings subdivided by other topics, USE references are made from the reversed form, thus bringing each significant term to the initial position. Occasionally, USE references are made to broader headings from narrower terms not used as valid headings. USE references are not generally made from equivalents in foreign languages.

If non-preferred terms are used in queries for searching, then they should be mapped to the preferred term which has been used for actual labelling. Since terms are words, and terms can occur in multiple hierarchies, it is possible for a word to be the preferred term in one hierarchy (possibly using its major sense) while also being a non-preferred term in another hierarchy (possibly using a minor sense). In this case the simple use of the word in a query is ambiguous as to which sense is intended. This example shows that there is a notion of concept behind different senses of words within thesauri.
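
The ambiguity described above can be made concrete with a small Python sketch. The two hierarchies, their terms and the function name resolve are invented for illustration only; the sketch simply maps a query word to the preferred term in every hierarchy where the word occurs.

```python
# Hypothetical USE links, kept per hierarchy because the same word may be
# preferred in one hierarchy and non-preferred in another.
USE = {
    "transport": {"CARS": "AUTOMOBILES"},
    "railways": {"CARRIAGES": "CARS"},
}
PREFERRED = {
    "transport": {"AUTOMOBILES"},
    "railways": {"CARS"},
}


def resolve(word):
    """Return every (hierarchy, preferred term) reading of a query word."""
    readings = []
    for hierarchy in sorted(PREFERRED):
        if word in PREFERRED[hierarchy]:
            # The word is itself the preferred term in this hierarchy.
            readings.append((hierarchy, word))
        elif word in USE[hierarchy]:
            # The word is non-preferred here; follow its USE link.
            readings.append((hierarchy, USE[hierarchy][word]))
    return readings


# "CARS" is preferred in one hierarchy and non-preferred in the other,
# so a bare query for it has two readings.
print(resolve("CARS"))
```

A query tool built along these lines would have to ask the user which reading is intended whenever more than one is returned.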

The RT relation is used between two terms that hold an associative relation, but which are not related in the broader/narrower relation of the hierarchy, or through the UF synonymy. Such references may be made for the following types of relationships: headings with meanings that overlap to some extent, headings representing a discipline and the object studied, and headings representing persons and their fields of endeavour (examples: Ships RT Boats and boating; Birds RT Ornithology; Medicine RT Physicians).

It is conventional in multilingual thesauri to have a hierarchy of terms for each language labelled with the language, then to establish relations between individual items across those language hierarchies using one of the four relations.

Given that these types and relations are included in ISO-compliant monolingual and multilingual thesauri, any useful thesaurus interchange format must include them too.

1.4. Constraints on the Model

Additionally, we wish to constrain the thesaurus model with extra conditions on the consistency of the thesaurus.

Here, the constraints are given using a formal model. A thesaurus includes a set of concepts Concept, a set of scope notes ScopeNote and a set of terms Term.

These constraints cannot be expressed directly in RDF Schema, which points to an OWL-based approach as the best way of representing thesauri.

2. A Comparison of Different Approaches

Several approaches to modelling thesauri using RDF Schema or Semantic Web ontology languages such as DAML+OIL and OWL have been proposed by different groups. In this section we describe and categorise some of these approaches.

2.1. A Term-Based Approach

The most straightforward approach in either RDFS or DAML+OIL/OWL is to model terms as a class of resources, following closely a simple formalisation of the ISO standard.  A distinction is made between those terms which are the preferred representation of a concept, and those that are not, usually by creating two subclasses of the overarching term class. Broader/Narrower/Related links between terms are modelled as properties.  The domain/range of these properties is then restricted to the preferred-term class, so that these properties can only be used to link members of the preferred term class.  The preferred/non-preferred (use for/use) links between non-preferred terms and their preferred alternative are also modelled as properties. These classes and properties are summarised in table 2.


Classes          SubClassOf   Properties   Range
Term             Resource     value        {literal}
Preferred-Term   Term         BT           [Preferred-Term]
                              NT           [Preferred-Term]
                              RT           [Preferred-Term]
                              UF           [Entry-Term]
Entry-Term       Term         USE          [Preferred-Term]

Table 2: summary of the RDF classes and properties in the Term-Based approach.

This approach is taken by the Gateway to Educational Materials ([GEM]), and a fundamentally similar approach is taken by the Dynamics Research Corporation as part of the DAML programme. This latter approach uses DAML+OIL and can therefore express more of the constraints on the thesaurus model.

The main strength of this approach is its simplicity; the GEM thesaurus format is particularly straightforward, follows the thesaurus standard very closely, and accurately models the structure of a monolingual thesaurus. However, it is not clear how it could be satisfactorily extended to more ambiguous word structures, especially in the multilingual case, as the simple use of the preferred term as the node in the hierarchy does not map well to multiple meanings in different languages. Further, this model does not cope well with conceptual drift, where terms change in meaning, which makes it hard to maintain and extend. Nevertheless, the simplicity of the design does make this approach attractive, especially when using DAML+OIL ontology constructors.

2.2. A Subclass Approach

In this approach, terms are modelled as classes themselves.   The built in sub/super-class properties of whatever ontology language is being used are then used to model the broader/narrower links between terms.  The Related, Use For and Use links are modelled as in the above approach, although the domain/range of these properties are of course Class. This approach has been taken by the AGROVOC thesaurus, built using the KAON tool [2].  

This approach makes maximal reuse of existing properties and classes, and offers a natural way of handling broader and narrower relationships in cases such as a scientific taxonomy, where the hierarchical nature of the vocabulary reflects a subclass relationship between the concepts and the thesaurus has been carefully constructed to reflect that hierarchy. Such thesauri are also very easy to map into a full ontology, as the broader/narrower relation maps directly onto the ontology's built-in hierarchy relationship.

Nevertheless, using the sub-/super-class properties of an ontology language implies a strict semantic association between two objects, that of class subsumption. However, as described above, the broader/narrower relations between terms are never consistently used for just this meaning, and often the broader/narrower property can mean is-a, instance-of, part-of or other relationships between terms. Therefore using a property whose meaning is strictly defined in a situation where the intended meaning could be quite different is perhaps inappropriate, especially if the thesauri modelled using this approach are being distributed and used by people who were not involved in their development. A further objection arises if snippets of thesauri in this format are used together with snippets of other RDFS/ontology documents. It would then be hard to separate the classes that are being used to represent terms from those that are not.

2.3. Term Approach with Categories

Some thesauri are built with categories in addition to a hierarchy of terms.   All described terms are declared to be a member of some category.  Using Semantic Web languages, the approach is the same as the term approach, extended to include categories as a distinct class of resources.   Properties are then used to declare that terms are members of a specific category. The RDF Schema relationships are summarised in table 3.


Classes          Sub-Class Of   Properties   Range
Term             Resource       value        {literal}
                                memberOf     [Category]
Category         Resource       hasMember    [Term]
Preferred-Term   Term           BT           [Preferred-Term]
                                NT           [Preferred-Term]
                                RT           [Preferred-Term]
                                UF           [Entry-Term]
Entry-Term       Term           USE          [Preferred-Term]

Table 3: summary of the RDF classes and properties in the Category-Based approach.

This approach is taken by the CERES RDF Schema [9]. Categories allow a more restricted type of semantic relationship than the overloaded broader/narrower relation between terms, using the instanceOf relationship – a term is an instanceOf a category.  A category can be viewed as a class of terms.  This also provides an additional way of organising and accessing the terms.

Categories could be modelled as the top-level terms in the term hierarchy, and so could be redundant.  However, this would mean using the broader/narrower properties for a more specific relationship, which means the loss of some semantic information.

Thus, within term-based schema, a significant difference is whether or not ‘Categories’ are allowed.  A ‘Category’ is modelled as a class of objects distinct from terms.   Every term belongs to some category.   The question is, are categories necessary, or are they better modelled as terms which sit at the top of the generalisation hierarchy?

[Figure: diagram of the CERES RDF Schema]

Figure 1: summary of the classes and properties in the Category-Based approach.

2.4. Concept-Based Approach

In the modelling of thesauri it is tacitly assumed that a group of terms that is the preferred term and its entry terms is being used to describe some abstract concept; the text of the ISO standard is rather ambiguous about this point, which is a fine one for text-based or locally stored thesauri for largely human usage.  Strictly speaking, the broader/narrower/related links (the statements about generality) are not being made between terms, but between the concepts they stand for.  However, in the term-based approach, each preferred term is taken as a proxy for the concept it stands for, and the broader/narrower/related links are made between these.

In a concept-based approach, the concepts are modelled explicitly as a distinct class of resources. Broader/Narrower/Related links are modelled as properties, but the domain/range are Concepts. Each concept is then linked to the terms that can be used to represent it, one of which is preferred and any number of which are not. The RDF Schema properties and classes for the concept-based approach are given in table 4.


Classes   SubClassOf   Properties            Range
Concept   Resource     classificationCode    {literal}
                       hasBroader            [Concept]
                       hasNarrower           [Concept]
                       IsRelatedTo           [Concept]
                       hasPreferredTerm      [Term]
                       hasNonPreferredTerm   [Term]
Term      Resource     value                 {literal}

Table 4: summary of the classes and properties in the Concept-Based approach.

Whether a term is preferred or not could be modelled explicitly as in the standard approach by using sub-classes of the term class.  It could also be inferred from its relationship with the concept it stands for.

This approach has been realised in RDF by Cross, Brickley and Koch [10], and taken further into the multilingual case by Matthews, Miller and Wilson [22, 23, 24], which was prototyped on the HASSET social science thesaurus [15].

This approach is more complex than the other approaches, with an extra layer of indirection. For example, to find the preferred term of a given term, we first have to find the concept of the term via a reverse traversal of the hasNonPreferredTerm property, and then traverse the hasPreferredTerm property, whilst in a term-based approach this would need a single traversal of the USE property.
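
The difference in the number of traversals can be sketched in Python as follows. The Concept class, the classification code and the terms are hypothetical and serve only to contrast the two lookups; they are not part of any of the schemas discussed here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """A concept with one preferred term and any number of non-preferred terms."""
    code: str
    preferred_term: str
    non_preferred_terms: List[str] = field(default_factory=list)


# Hypothetical data; the classification code and terms are invented.
concepts = [Concept("C1", "AUTOMOBILES", ["CARS", "MOTOR CARS"])]

# Term-based view of the same data: one USE link per non-preferred term.
use: Dict[str, str] = {"CARS": "AUTOMOBILES", "MOTOR CARS": "AUTOMOBILES"}


def preferred_term_based(term: str) -> Optional[str]:
    """Term-based model: a single traversal of the USE property."""
    return use.get(term)


def preferred_concept_based(term: str) -> Optional[str]:
    """Concept-based model: reverse hasNonPreferredTerm, then hasPreferredTerm."""
    for concept in concepts:                   # step 1: term -> concept
        if term in concept.non_preferred_terms:
            return concept.preferred_term      # step 2: concept -> preferred term
    return None


print(preferred_term_based("CARS"))     # AUTOMOBILES
print(preferred_concept_based("CARS"))  # AUTOMOBILES
```

Both lookups give the same answer; the concept-based one simply goes through the concept as an intermediate node.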

Nevertheless, this approach has been argued for by Doerr and Fundulaki [12], who argue that this approach solves confusion caused by overloading of terms, where one term can be used for many concepts; in this model this would be reflected directly, with scope notes explaining the qualification, whilst a term-based approach would potentially have the confusion of which meaning is intended. This confusion is exacerbated when the multi-lingual case is considered, as equivalence between terms when there are potentially many alternatives of translation could cause great ambiguity.

Further, from a practical maintenance point of view, it is more straightforward to maintain systems which evolve over time as the meanings of terms change. For example, this approach allows easy reshuffling of the preferred/non-preferred terms without disturbing the generality hierarchy of the concepts.

The concept-based approach captures the intuition that in practical thesaurus construction, the broader-narrower relationship reflects the extension of the concept, that is, the resources which can be classified under those terms. Thus Doerr and Fundulaki state:

Under this definition, if DescriptorA has a broader meaning than DescriptorB, then the instance set of the latter is a subset of the former.

Thus in this case, the broader/narrower relationship does after all represent a proper subclass inclusion, not of the extensions of the terms, but rather of the extensions of the concepts.
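
This reading of broader/narrower as inclusion of extensions can be illustrated with a short Python sketch. The concepts, the resource identifiers and the function name extension are hypothetical, and the sketch assumes that the extension of a concept is the set of resources classified under it or under any of its narrower concepts.

```python
# Hypothetical resources classified directly under each concept.
classified = {
    "education": {"doc1"},
    "schools": {"doc2", "doc3"},
}

# Hypothetical narrower-concept links.
narrower = {"education": ["schools"], "schools": []}


def extension(concept):
    """All resources classified under a concept or any narrower concept."""
    resources = set(classified.get(concept, set()))
    for child in narrower.get(concept, []):
        resources |= extension(child)
    return resources


# The extension of the narrower concept is a subset of that of the broader one.
print(extension("schools") <= extension("education"))  # True
```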

3. Towards a common format for multilingual Thesauri

We make a distinction between concept-based models and term-based models. In a term-based approach, although it may still be tacitly implied that a set of terms represents an abstract concept, the concept is not reified in the model. Terms which are preferred terms become the nodes in the generalisation hierarchy. In concept-based models, it is made explicit that a set of terms is used to represent some abstract concept. One of these terms is the preferred term; the others are non-preferred terms. Broader/narrower/related relations are made between concepts, not between terms; the concepts are the nodes in the generalisation hierarchy. In a multilingual thesaurus, equivalence relations are made between the concepts.

Traditionally thesauri, especially monolingual ones, have been term-based, and most of the schemas discussed above follow this tradition. In many cases this would be an acceptable format for thesauri. However, the strength of the unambiguity and maintainability of the concept model for thesauri is persuasive, despite the extra complexity the model involves. Thus we propose an update of the RDF Schema developed by Matthews, Miller and Wilson [22]. This schema is reproduced as an appendix to this paper, whilst the major classes and properties are presented in Figure 2.

[Figure: simplified thesaurus model]

Figure 2: Simplified multilingual thesaurus model.

Note that the familiar properties of broaderConcept and narrowerConcept are subproperties of conceptRelation, whilst notions of equivalence between concepts, used to define either multilingual thesauri, or thesauri defined from a different domain, are subproperties of the conceptEquivalence property.

In this approach, we have taken the view that different versions of the same thesaurus in different languages have their own concept hierarchies, with relations between them. This contrasts with other approaches which take the view that there is one hierarchy with alternate preferred terms for different languages.

3.1. RDF Schema vs. OWL

The approach taken here in Appendix A is to use RDF Schema. This allows a simplicity of expression which can be put to use immediately, before the standardisation of OWL has been completed. However, as has been noted, it would be appropriate to express the thesaurus in OWL, thus allowing the expression of constraints, notably the inverse relationship between broaderConcept and narrowerConcept, and the uniqueness of the preferred term. By using OWL, we could express these as follows:

  <owl:Class rdf:ID="Concept">
    <rdfs:subClassOf rdf:resource="#ThesaurusObject"/>
  </owl:Class>
  <owl:ObjectProperty rdf:ID="hasBroaderConcept">
    <rdfs:subPropertyOf rdf:resource="#ConceptRelation"/>
    <owl:inverseOf rdf:resource="#hasNarrowerConcept"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasNarrowerConcept">
    <rdfs:subPropertyOf rdf:resource="#ConceptRelation"/>
    <owl:inverseOf rdf:resource="#hasBroaderConcept"/>
  </owl:ObjectProperty>
  <owl:FunctionalProperty rdf:ID="hasPreferredTerm">
    <rdfs:subPropertyOf rdf:resource="#isIndicatedBy"/>
  </owl:FunctionalProperty>
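
The intent of these two constraints can also be illustrated procedurally. The following Python sketch treats them as integrity checks over a closed set of statements, which is an assumption of the sketch: OWL itself treats owl:inverseOf and owl:FunctionalProperty as axioms that license inferences rather than as validation rules. The function names and the data are hypothetical.

```python
def is_inverse(broader, narrower):
    """True if narrower holds exactly the reversed pairs of broader."""
    return {(b, a) for (a, b) in broader} == set(narrower)


def is_functional(pairs):
    """True if no subject is related to more than one object."""
    seen = {}
    for subject, obj in pairs:
        # setdefault records the first object seen for each subject and
        # returns it on later encounters, exposing any second value.
        if seen.setdefault(subject, obj) != obj:
            return False
    return True


# Hypothetical statements as (subject, object) pairs.
has_broader = {("schools", "education")}
has_narrower = {("education", "schools")}
has_preferred = {("schools", "SCHOOLS"), ("education", "EDUCATION")}

print(is_inverse(has_broader, has_narrower))  # True
print(is_functional(has_preferred))           # True
```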

A further option would be to make the Concept class itself a subclass of the OWL Class:

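  <owl:Class rdf:ID="Concept">
    <rdfs:subClassOf rdf:resource="#ThesaurusObject"/>
    <!-- rdf:resource takes a URI reference, so the full URI of owl:Class is given rather than the qualified name -->
    <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
  </owl:Class>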

This would make the move towards converting thesauri into ontologies more explicit. However, there would be a possibility of confusion here. The instances of such concept classes are not terms, but the resources which are classified under those concepts.

3.2. Adding further relationships

In using the concept-based approach, we have lost the traditional properties associated with thesauri, namely the relations BT, NT, UF, USE and others defined in the ISO standard, which are properties between terms rather than concepts. However, we can reintroduce them into the same OWL thesaurus model, thus allowing:

  <owl:ObjectProperty rdf:ID="BT">
    <rdfs:domain rdf:resource="#Term"/>
    <rdfs:range rdf:resource="#Term"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="NT">
    <rdfs:domain rdf:resource="#Term"/>
    <rdfs:range rdf:resource="#Term"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="UF">
    <rdfs:domain rdf:resource="#Term"/>
    <rdfs:range rdf:resource="#Term"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="USE">
    <rdfs:domain rdf:resource="#Term"/>
    <rdfs:range rdf:resource="#Term"/>
  </owl:ObjectProperty>

However, these are not independent properties in their own right, but rather derived properties, which should satisfy the following relationships:

BT = isIndicatedBy^{-1} ; hasBroaderConcept ; isIndicatedBy
NT = isIndicatedBy^{-1} ; hasNarrowerConcept ; isIndicatedBy
UF = hasPreferredTerm^{-1} ; hasNonPreferredTerm
USE = hasNonPreferredTerm^{-1} ; hasPreferredTerm

Where ; denotes relational composition: (p ; q)(a, c) holds if and only if there exists some b such that p(a, b) and q(b, c). Note also that we do not need to assert that NT and BT are inverses; this follows from the inverse nature of hasBroaderConcept and hasNarrowerConcept. However, OWL at present is not sufficiently expressive to state the above equivalences and to derive such consequences.
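
Although these equivalences cannot be stated in OWL, the derived term-level relations can be computed outside it. The following Python sketch represents each property as a set of pairs and implements relational composition directly. The concept identifiers and terms are hypothetical, and the sketch assumes that isIndicatedBy is the union of hasPreferredTerm and hasNonPreferredTerm.

```python
def compose(p, q):
    """(p ; q)(a, c) holds iff some b exists with p(a, b) and q(b, c)."""
    return {(a, c) for (a, b1) in p for (b2, c) in q if b1 == b2}


def inverse(p):
    """The inverse relation: it relates b to a whenever p relates a to b."""
    return {(b, a) for (a, b) in p}


# Hypothetical concepts c1 (education) and c2 (schools) with their terms.
has_preferred = {("c1", "EDUCATION"), ("c2", "SCHOOLS")}
has_non_preferred = {("c2", "SCHOOLHOUSES")}
# Assumption: isIndicatedBy covers both preferred and non-preferred terms.
is_indicated_by = has_preferred | has_non_preferred
has_broader = {("c2", "c1")}
has_narrower = inverse(has_broader)

# The derived term-level relations, following the four equations above.
BT = compose(compose(inverse(is_indicated_by), has_broader), is_indicated_by)
NT = compose(compose(inverse(is_indicated_by), has_narrower), is_indicated_by)
UF = compose(inverse(has_preferred), has_non_preferred)
USE = compose(inverse(has_non_preferred), has_preferred)

print(NT == inverse(BT))  # True: the inverse nature carries over to terms
print(sorted(UF))         # [('SCHOOLS', 'SCHOOLHOUSES')]
```

Under the stated assumption about isIndicatedBy, BT and NT relate non-preferred terms as well as preferred ones; restricting them to preferred terms would mean composing with hasPreferredTerm instead.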

Other thesaurus formats allow other relationships. For example, the DAML+OIL ontology from DRC has ACK/AF relations between terms for abbreviations and acronyms. The framework we describe is extensible, so such properties would be straightforward to add.

4. A Thesaurus Server

In order to demonstrate the utility of the thesaurus format, we have been developing a set of tools. The first of these is a thesaurus store and browser, SophiaM.

The centre of SophiaM is an RDF store and query system built using the Jena toolkit [27]. This demonstrates that once an RDF representation of the thesaurus is available, it is straightforward to implement tools quickly upon the existing Semantic Web infrastructure. A set of predefined RDQL query constructors has been provided as an interface to the Jena store, giving the system a thesaurus API. This can then be used to build tools which use the thesaurus. The first of these is a thesaurus browser and editor; Figure 3 shows a typical use of the system.


[Figure: screen capture of the SophiaM thesaurus browser]

Figure 3: Screen capture of the SophiaM thesaurus browser

The thesaurus browser allows you to browse and search around a multilingual thesaurus, following broader and narrower links, related terms and language alternatives. Further screens allow an authorised user to edit their thesaurus.

4.1. Thesaurus APIs

As a part of the development of the thesaurus browser, we have developed a Java API and an implementation based on the Jena system described above. This API provides a comprehensive range of functions and methods for manipulating and searching a thesaurus.


[Figure: screen capture of the Java thesaurus API]

Figure 4: Screen capture of the Java thesaurus API

Figure 4 provides a screen image showing part of the API for the class Concept. We shall not go into more detail for lack of space.

5. Conclusions and Future Work

In this paper we have considered the problem of using existing thesauri to support the development of the Semantic Web, and have examined various approaches proposed for effecting this migration, detailing their strengths and weaknesses. From this we have proposed an RDF Schema to support concept-based thesauri, which is a more appropriate method of modelling thesauri, especially multilingual ones. OWL ontologies and their extensions offer ways of expressing the properties of thesauri in a richer fashion; we have indicated how this may be achieved.

We have begun to produce tool support for the thesaurus format. We would also like to produce a range of tools and applications, including a thesaurus server based on web services and a generic thesaurus-based query refinement tool, as well as simple applications built using the RDF-based thesaurus.

The next steps are to consider the migration from thesauri to a richer ontology, with a larger range of constraints. This has been considered by some authors, for example Wielinga [28]. However, in general this is a difficult problem due to the free way in which thesaurus designers have interpreted the standard thesaurus relationships. Ultimately, this may well always require human intervention, but ways of assisting the process should be considered, as should ways of mapping between thesauri. However, we believe that in a large number of applications, simple cataloguing and searching of web resources for example, the simple thesaurus structure is likely to prove sufficient.

Acknowledgements

We would like to acknowledge the advice and encouragement of colleagues, especially Dan Brickley and Ken Miller. This work was supported by the European project Semantic Web Advanced Development in Europe (SWAD-Europe). Further information on the Thesaurus workpackage within SWAD can be found at http://www.w3c.rl.ac.uk/SWAD/thesaurus.html.

References

  1. Aitchison, J., Gilchrist, A., Bawden, D. (1997) Thesaurus construction and use: a practical manual (3rd Edition) Aslib: London
  2. The Food and Agriculture Organization of the United Nations: AGROVOC http://kaon.semanticweb.org/Members/rvo/ontologies/AGROVOC.zip
  3. Amann, B. & Fundulaki, I. (1999). Integrating Ontologies and Thesauri to Build RDF Schemas. In ECDL-99: Research and Advanced Technologies for Digital Libraries, Lecture Notes in Computer Science, pages 234-253, Paris, France. Springer-Verlag.
  4. Austin, D. (1986) Vocabulary Control and Information Technology. Aslib Proceedings 38 (January 1986): 1-15.
  5. Barwise, J. and Perry, J. (1983) Situations and Attitudes, Cambridge, MA: MIT Press.
  6. AAT (1994), Introduction to the Art & Architecture Thesaurus. Published on behalf of The Getty Art History Information Program, Oxford University Press, New York, 1994.
  7. Brachman, R.J. (1983). What IS-A Is and Isn't: An Analysis of Taxonomic Links in Semantic Networks. Computer, 16(10):30-36, October 1983.
  8. Brickley, D. and Guha, R.V. (2000). Resource Description Framework (RDF) Schema Specification 1.0, Candidate W3C Recommendation. http://www.w3.org/TR/2000/CR-rdf-schema-20000327
  9. California Environmental Resource Evaluation System (CERES) thesaurus format. http://ceres.ca.gov/thesaurus/RDF.html
  10. Cross, P., Brickley, D. & Koch, T. (2000). Conceptual relationships for encoding thesauri, classification systems and organised metadata collections and a proposal for encoding a core set of thesaurus relationships using an RDF Schema. http://www.desire.org/results/discovery/rdfthesschema.html
  11. Doerr, M. & Fundulaki, I. (1998). SIS-TMS, A Thesaurus Management System for Distributed Digital Collections, In Proceedings of the Second European Conference on Digital Libraries, Heraklion 1998.
  12. Doerr, M. & Fundulaki, I. (1998). A proposal on extended interthesaurus links semantics. Technical Report TR-215, Institute of Computer Science-FORTH, March 1998.
  13. The Gateway to Educational Materials Thesaurus (GEM) http://www.fao.org/agrovoc/
  14. Hall, M. (2001) CALL Thesaurus Ontology in DAML. http://orlando.drc.com/daml/ontology/CALL-Thesaurus/G3/CALL-Thesaurus-ont-g3r1.daml, Dynamics Research Corporation, 26 September 2001.
  15. HASSET (1999). Humanities and Social Science Electronic Thesaurus. http://biron.essex.ac.uk/searching/zhasset.html
  16. ISO 5963:1985 Documentation -- Methods for examining documents, determining their subjects, and selecting indexing terms 1985 (5 p.)
  17. ISO 5964:1985 Documentation--Guidelines for the establishment and development of multilingual thesauri 1985. (61 p.)
  18. ISO 2788:1986 Documentation--Guidelines for the establishment and development of monolingual thesauri 2nd ed., 1986. (32 p.).
  19. ISO 639:1988 Code for the representation of names of languages, 1988 (17 p.)
  20. Lancaster, W. F. (1987) Vocabulary Control for Information Retrieval, 2nd ed. Washington, DC: Information Resources Press.
  21. Mandel, C. A. (1987) Multiple Thesauri in Online Library Bibliographic Systems. Washington, DC: Library of Congress.
  22. Matthews, B.M., Miller, K., Wilson, M.D. (2001) A proposed RDF Schema Thesaurus from the Limber project, http://www.limber.rl.ac.uk/External/thesaurus-iso.rdf, prototyped using the ELSST social science thesaurus, http://www.limber.rl.ac.uk/External/ELSST_demo_RDF.xml
  23. Matthews, B.M., Miller, K., Ramfos, A., Ryssevik, J., Wilson, M.D. (2001) Internationalising data access through LIMBER, in D.L. Day and L.M. Dunckley (eds) Designing for Global Markets 3: proceedings of IWIPS 2001, pp. 129-142, Open University: Milton Keynes.
  24. Miller, K., Matthews, B.M. (2001) Having the right connections: the LIMBER project, Journal of Digital Information 1(8) (http://jodi.ecs.soton.ac.uk/)
  25. Middleton, M. (2000). Controlled Vocabulary Resource Guide. http://www.fit.qut.edu.au/InfoSys/middle/cont_voc.html
  26. Patel-Schneider, P.F., Horrocks, I., Hayes, P., van Harmelen, F., eds. (2003). Web Ontology Language (OWL) Abstract Syntax and Semantics, W3C Last Call Working Draft 31 March 2003. http://www.w3.org/TR/owl-semantics/
  27. Jena toolkit, HP Laboratories, Bristol (2003) http://www.hpl.hp.com/semweb/
  28. Wielinga, B.J., Schreiber, A.Th., Wielemaker, J., Sandberg, J.A.C. (2001) From Thesaurus to Ontology, K-CAP '01, ACM

Appendix A: An RDF Schema for Thesauri

<!-- This is the Thesaurus Interchange Format (TIF)
     for multilingual thesauri.
     Authors: A J Miles, B M Matthews
     Date:    30/05/2003
     Version: 1.1
-->
<rdf:RDF xml:lang="en" 
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:lc="http://www.limber.rl.ac.uk/External/thesaurus-iso.rdf#" 
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
 <rdfs:Class rdf:ID="Thesaurus">
  <rdfs:comment>
    Properties may be added to an instance of the Thesaurus class to 
    describe for example the name of the thesaurus, the subject, 
    a textual description, the creators, the date of last modification etc.
  </rdfs:comment>
  <rdfs:subClassOf 
     rdf:resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
 </rdfs:Class>
 <rdfs:Class rdf:ID="ThesaurusObject">
  <rdfs:comment>
  The superclass of all classes of object that are part of a thesaurus.
  </rdfs:comment>
  <rdfs:subClassOf 
     rdf:resource="http://www.w3.org/2000/01/rdf-schema#Resource"/>
 </rdfs:Class>
 <rdfs:Class rdf:ID="Concept">
  <rdfs:comment>
   A unique concept defined within a vocabulary scheme, such as a thesaurus 
    or classification scheme. Instances can use the rdfs:isDefinedBy property 
    with a vocabulary namespace as its value, to indicate the vocabulary to 
    which the concept belongs.
  </rdfs:comment>
  <rdfs:subClassOf rdf:resource="#ThesaurusObject"/>
 </rdfs:Class>
 <rdfs:Class rdf:ID="Term">
  <rdfs:comment>
   Instances of this class represent the written forms of concepts, 
    capturing a word or phrase that expresses the concept.
  </rdfs:comment>
  <rdfs:subClassOf rdf:resource="#ThesaurusObject"/>
 </rdfs:Class>
 <rdfs:Class rdf:ID="ScopeNote">
  <rdfs:comment>
   Provides a comment on the concept, for disambiguation, explanation etc.  
    The string is given by the rdf:value of ScopeNote.
  </rdfs:comment>
  <rdfs:subClassOf rdf:resource="#ThesaurusObject"/>
 </rdfs:Class>
 <rdfs:Class rdf:ID="ScopeNoteType">
  <rdfs:subClassOf rdf:resource="#ThesaurusObject"/>
 </rdfs:Class>
 <rdf:Property rdf:ID="classificationCode">
  <rdfs:comment>The unique identifier of a concept.</rdfs:comment>
  <rdfs:domain rdf:resource="#Concept"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
 </rdf:Property>
 <rdf:Property rdf:ID="termValue">
  <rdfs:comment>The string value of a term</rdfs:comment>
  <rdfs:domain rdf:resource="#Term"/>
  <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
 </rdf:Property>
 <rdf:Property rdf:ID="inLanguageOf">
  <rdfs:comment>The language of a concept or scope note.</rdfs:comment>
  <rdfs:domain rdf:resource="#Concept"/>
  <rdfs:domain rdf:resource="#ScopeNote"/>
  <rdfs:range 
    rdf:resource="http://www.limber.rl.ac.uk/External/ISO639.rdf#LanguageCode"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasTypeOf">
  <rdfs:comment>The type of a scope note.</rdfs:comment>
  <rdfs:domain rdf:resource="#ScopeNote"/>
  <rdfs:range rdf:resource="#ScopeNoteType"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasScopeNote">
  <rdfs:comment>A scope note of a concept.</rdfs:comment>
  <rdfs:domain rdf:resource="#Concept"/>
  <rdfs:range rdf:resource="#ScopeNote"/>
 </rdf:Property>
 <rdf:Property rdf:ID="isIndicatedBy">
  <rdfs:comment>A defining term for a concept.</rdfs:comment>
  <rdfs:domain rdf:resource="#Concept"/>
  <rdfs:range rdf:resource="#Term"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasPreferredTerm">
  <rdfs:comment>
   The preferred term for a concept w.r.t. a specific language.
  </rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#isIndicatedBy"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasNonPreferredTerm">
  <rdfs:comment>A non-preferred term for a concept.</rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#isIndicatedBy"/>
 </rdf:Property>
 <rdf:Property rdf:ID="ConceptRelation">
  <rdfs:comment>
   A generalisation of all possible relationships between two concepts 
    in the same language.
  </rdfs:comment>
  <rdfs:domain rdf:resource="#Concept"/>
  <rdfs:range rdf:resource="#Concept"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasBroaderConcept">
  <rdfs:comment>
   The subject has a broader concept specified by the object.
  </rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#ConceptRelation"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasNarrowerConcept">
  <rdfs:comment>
   The subject has a narrower concept specified by the object.
  </rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#ConceptRelation"/>
 </rdf:Property>
 <rdf:Property rdf:ID="isRelatedTo">
  <rdfs:comment>The related concept relation.</rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#ConceptRelation"/>
 </rdf:Property>
 <rdf:Property rdf:ID="ConceptEquivalence">
  <rdfs:comment>
    A generalisation of all possible relationships between two concepts 
    in different languages.
  </rdfs:comment>
  <rdfs:domain rdf:resource="#Concept"/>
  <rdfs:range rdf:resource="#Concept"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasExactEquivalent">
  <rdfs:comment>
   The object of this property is a concept which is identical in 
    meaning and scope to the subject of this property.
  </rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#ConceptEquivalence"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasInexactEquivalent">
  <rdfs:comment>
   The object of this property expresses the same general concept as the 
    subject, although the meanings of these terms are not precisely identical.
  </rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#ConceptEquivalence"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasPartialEquivalent">
  <rdfs:comment>
   The subject cannot be matched to an exactly equivalent concept in 
    the target language, but a near translation is achieved by the object 
    of this property, which is a concept with a slightly narrower or 
    broader meaning.
  </rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#ConceptEquivalence"/>
 </rdf:Property>
 <rdf:Property rdf:ID="hasOneToManyEquivalent">
  <rdfs:comment>
   The subject can only be expressed in the target language as a combination 
    of two or more concepts from the target language.
  </rdfs:comment>
  <rdfs:subPropertyOf rdf:resource="#ConceptEquivalence"/>
 </rdf:Property>
 <lc:ScopeNoteType rdf:ID="General"/>
 <lc:ScopeNoteType rdf:ID="Hierarchy"/>
 <lc:ScopeNoteType rdf:ID="Translation"/>
 <lc:ScopeNoteType rdf:ID="Editor"/>
 <lc:ScopeNoteType rdf:ID="History"/>
</rdf:RDF>
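
As an illustration of how the schema above might be instantiated, the following is a minimal sketch of two concepts linked by broader and narrower relations. The example.org base URI, the concept identifiers, the classification codes, the terms and the scope note text are all hypothetical and are not drawn from any existing thesaurus. The sketch also assumes that the schema is published at the namespace bound to the lc prefix, so that the scope note type General resolves within that namespace. The inLanguageOf property is omitted because the language code identifiers defined in ISO639.rdf are not reproduced in this paper.

```xml
<?xml version="1.0"?>
<!-- Illustrative instance data only: every identifier, code and term
     below is a hypothetical example, not part of the schema. -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:lc="http://www.limber.rl.ac.uk/External/thesaurus-iso.rdf#"
    xml:base="http://example.org/thesaurus">

  <!-- A broader concept with one preferred and one non-preferred term -->
  <lc:Concept rdf:ID="c001">
    <lc:classificationCode>001</lc:classificationCode>
    <lc:hasPreferredTerm>
      <lc:Term>
        <lc:termValue>Employment</lc:termValue>
      </lc:Term>
    </lc:hasPreferredTerm>
    <lc:hasNonPreferredTerm>
      <lc:Term>
        <lc:termValue>Work</lc:termValue>
      </lc:Term>
    </lc:hasNonPreferredTerm>
    <lc:hasScopeNote>
      <lc:ScopeNote>
        <!-- The note text is carried by rdf:value, as the schema specifies -->
        <rdf:value>Covers paid work of any kind.</rdf:value>
        <lc:hasTypeOf
          rdf:resource="http://www.limber.rl.ac.uk/External/thesaurus-iso.rdf#General"/>
      </lc:ScopeNote>
    </lc:hasScopeNote>
    <lc:hasNarrowerConcept rdf:resource="#c002"/>
  </lc:Concept>

  <!-- A narrower concept pointing back to its broader concept -->
  <lc:Concept rdf:ID="c002">
    <lc:classificationCode>002</lc:classificationCode>
    <lc:hasPreferredTerm>
      <lc:Term>
        <lc:termValue>Part-time employment</lc:termValue>
      </lc:Term>
    </lc:hasPreferredTerm>
    <lc:hasBroaderConcept rdf:resource="#c001"/>
  </lc:Concept>
</rdf:RDF>
```

The schema declares hasBroaderConcept and hasNarrowerConcept as independent properties and does not state that they are inverses of each other, so the sketch asserts the relation in both directions explicitly.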