SKOS Use Case: incorporating taxonomies into search

Introduction
In the sections below I describe an existing use case for employing 
taxonomies in a search engine. This use case currently exists in the Sun 
Labs Search Engine, software that will be open sourced as part of 
Sun's wider open sourcing strategy. We anticipate that the search engine 
software will be open sourced in Autumn 2007.

The purpose of the search engine is to provide a means of searching 
multiple corpora of documents within an intranet.

The architecture of the search engine is straightforward. A search 
engine consists (generally) of two parts: an indexing engine that 
produces an inverted index of all the terms in all the documents in 
some corpora; and a query engine that queries the inverted index for 
user-specified terms to find documents in the corpora. (A term may 
consist of multiple tokens.) Like most search engines, ours includes a 
syntax that allows the user to control the interpretation of search 
terms in a query (such as NOT, NEAR, AND, OR, etc.).
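
To make the two parts concrete, here is a minimal sketch of an inverted 
index with an indexing side and a query side. It is a toy written for 
illustration, not the data structure used by the Sun Labs Search Engine: 
the class name InvertedIndex and its methods are hypothetical, terms are 
single whitespace-separated tokens, and only AND and OR are shown.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index, for illustration only.
class InvertedIndex {
    // Maps each term to the ids of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Indexing side: record every whitespace-separated token of a document.
    void add(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Query side: the documents that contain a single term.
    Set<String> lookup(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        if (docs == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(docs);
    }

    // AND: documents that contain both terms.
    Set<String> and(String left, String right) {
        Set<String> result = new TreeSet<>(lookup(left));
        result.retainAll(lookup(right));
        return result;
    }

    // OR: documents that contain at least one of the terms.
    Set<String> or(String left, String right) {
        Set<String> result = new TreeSet<>(lookup(left));
        result.addAll(lookup(right));
        return result;
    }
}
```

A real engine would also store positions so that operators such as NEAR 
can be evaluated; that is left out here to keep the sketch short.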

Functionality
The Sun search engine can be configured to use "external knowledge 
sources" when indexing and querying. These external knowledge sources 
are used to provide "variants" for a term.

Currently we use two kinds of knowledge sources: morphological and 
taxonomic.

The knowledge sources are used as follows:

When the indexing engine encounters a term, it asks a knowledge source 
for variants of that term and includes those variants in the index.

When the query engine searches for a term, it asks a knowledge source 
for variants of that term and expands the query to include those variants.

(The above description omits the complexity of dealing with noun phrases 
and the way in which we use proximity to calculate query weights.)
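
The two uses above can be sketched as a single interface that both 
engines call. This is only an illustration under stated assumptions: 
KnowledgeSource, TableKnowledgeSource and VariantExpander are 
hypothetical names, not the engine's actual API, and joining the 
expanded terms with OR is just one plausible rendering of an expanded 
query.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical interface: the one question both engines ask.
interface KnowledgeSource {
    Set<String> variantsOf(String term);
}

// Hypothetical source backed by a lookup table, standing in for a
// morphological or taxonomic knowledge source.
class TableKnowledgeSource implements KnowledgeSource {
    private final Map<String, List<String>> table;

    TableKnowledgeSource(Map<String, List<String>> table) {
        this.table = table;
    }

    @Override
    public Set<String> variantsOf(String term) {
        List<String> variants = table.get(term);
        if (variants == null) {
            return Collections.emptySet();
        }
        return new LinkedHashSet<>(variants);
    }
}

class VariantExpander {
    private final KnowledgeSource source;

    VariantExpander(KnowledgeSource source) {
        this.source = source;
    }

    // The term itself followed by its variants. At indexing time these
    // are the entries to add to the index; at query time these are the
    // terms to search for.
    Set<String> expand(String term) {
        Set<String> expanded = new LinkedHashSet<>();
        expanded.add(term);
        expanded.addAll(source.variantsOf(term));
        return expanded;
    }

    // Query side: render the expansion as a disjunction (assumed syntax).
    String expandQuery(String term) {
        return String.join(" OR ", expand(term));
    }
}
```

With a table that maps 'serve' to 'server', 'serving' and 'serves', 
expandQuery("serve") would produce a query over all four terms, while a 
term unknown to the knowledge source is passed through unchanged.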

The way in which a knowledge source responds when asked for the variants 
of a term is particular to the knowledge source. For example, a 
knowledge source that incorporates morphological knowledge will respond 
with morphological variants of a term (e.g. variants of 'serve' might 
include 'server', 'serving', 'serves', etc.). In the past we have 
referred to this approach as "Conceptual Indexing". [1]

The way in which a taxonomic knowledge source responds when asked for 
variants is currently dependent on the taxonomy and its representation 
language. Our implementation provides Java wrappers for taxonomies that 
can be represented in RDF or OWL (using Sesame, Jena and the Java OWL 
API). Users of the search engine are currently required to implement 
subclasses of the wrapper class(es) for each taxonomy they wish to use. 
This is a somewhat clumsy solution and one which I hope SKOS can 
improve: we would like to be able to provide users with a simple API 
that can provide the variants for a term.
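
As a rough idea of what such a simple API might look like, the sketch 
below answers variantsOf(term) from an in-memory taxonomy whose concepts 
carry a preferred label, alternative labels and narrower concepts (the 
distinctions SKOS draws with skos:prefLabel, skos:altLabel and 
skos:narrower). Everything here is an assumption made for illustration: 
the Concept and TaxonomyVariants classes are hypothetical, the taxonomy 
is built by hand rather than loaded through Sesame, Jena or the OWL API, 
and treating all labels of a concept and of its narrower concepts as 
variants is only one possible policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a concept read from a taxonomy.
class Concept {
    final String prefLabel;
    final List<String> altLabels = new ArrayList<>();
    final List<Concept> narrower = new ArrayList<>();

    Concept(String prefLabel, String... altLabels) {
        this.prefLabel = prefLabel;
        this.altLabels.addAll(Arrays.asList(altLabels));
    }
}

// Hypothetical API: one call that returns the variants of a term.
class TaxonomyVariants {
    // Every label (preferred or alternative), lower-cased, to its concept.
    private final Map<String, Concept> byLabel = new HashMap<>();

    void register(Concept concept) {
        byLabel.put(key(concept.prefLabel), concept);
        for (String alt : concept.altLabels) {
            byLabel.put(key(alt), concept);
        }
    }

    // Assumed policy: the variants of a term are the other labels of its
    // concept plus all labels of its transitively narrower concepts.
    Set<String> variantsOf(String term) {
        Set<String> variants = new LinkedHashSet<>();
        Concept start = byLabel.get(key(term));
        if (start != null) {
            collect(start, variants, new HashSet<Concept>());
            variants.removeIf(label -> label.equalsIgnoreCase(term));
        }
        return variants;
    }

    // Depth-first walk; the seen set guards against cycles in the data.
    private static void collect(Concept concept, Set<String> out, Set<Concept> seen) {
        if (!seen.add(concept)) {
            return;
        }
        out.add(concept.prefLabel);
        out.addAll(concept.altLabels);
        for (Concept child : concept.narrower) {
            collect(child, out, seen);
        }
    }

    private static String key(String label) {
        return label.toLowerCase(Locale.ROOT);
    }
}
```

The point of the sketch is that the caller never sees the taxonomy's 
representation language: a user would register or load a taxonomy once 
and then ask only for variants, without subclassing a wrapper for each 
taxonomy.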

Vocabularies
We have tested this implementation with several taxonomies, including 
the Gene Ontology [2] and Sun's Unified Product Taxonomy, which is part 
of our swoRDFish programme [3].

[1] http://research.sun.com/knowledge/
[2] http://www.geneontology.org/
[3] http://www.w3.org/2001/sw/meetings/tech-200303/w3_plenary.sxi

Received on Thursday, 16 November 2006 11:08:05 UTC