- From: Bernard Horan <Bernard.Horan@sun.com>
- Date: Thu, 16 Nov 2006 11:07:44 +0000
- To: public-swd-wg@w3.org
Introduction

In the sections below I describe an existing use case for employing taxonomies in a search engine. This use case currently exists in the Sun Labs Search Engine: software that will be open sourced as part of Sun's wider open sourcing strategy. We anticipate that the search engine software will be open sourced in Autumn 2007.

The purpose of the search engine is to provide a means of searching multiple corpora of documents within an intranet. The architecture of the search engine is straightforward. A search engine consists (generally) of two parts: an indexing engine that produces an inverted index of all the terms in all the documents in some corpora; and a query engine that queries the inverted index for user-specified terms to find documents in the corpora. (A term may consist of multiple tokens.) Like most search engines, ours includes a syntax that allows the user to control the interpretation of search terms in a query (such as NOT, NEAR, AND, OR, etc.).

Functionality

The Sun search engine can be configured to use "external knowledge sources" when indexing and querying. These external knowledge sources are used to provide "variants" for a term. Currently we use two kinds of knowledge sources: morphological and taxonomic. The knowledge sources are used as follows:

  - When the indexing engine encounters a term, it asks a knowledge source for variants of that term and includes those variants in the index.
  - When the query engine searches for a term, it asks a knowledge source for variants of that term and expands the query to include those variants.

(The above description omits the complexity of dealing with noun phrases and the way in which we use proximity to calculate query weights.)

The way in which a knowledge source responds when asked for the variants of a term is particular to the knowledge source. For example, a knowledge source that incorporates morphological knowledge will respond with morphological variants of a term (e.g. variants of 'serve' might include 'server', 'serving', 'serves', etc.). In the past we have referred to this as "Conceptual Indexing" [1].

The way in which a taxonomic knowledge source responds when asked for variants is currently dependent on the taxonomy and its representation language. Our implementation provides Java wrappers for taxonomies that can be represented in RDF or OWL (using Sesame, Jena and the Java OWL API). Users of the search engine are currently required to implement subclasses of the wrapper class(es) for each taxonomy they wish to use. This is a somewhat clumsy solution and one which I hope SKOS can improve: we would like to be able to provide users with a simple API that can provide the variants for a term.

Vocabularies

We have tested this implementation with several taxonomies, including the Gene Ontology [2] and Sun's Unified Product Taxonomy, which is part of our swoRDFish programme [3].

[1] http://research.sun.com/knowledge/
[2] http://www.geneontology.org/
[3] http://www.w3.org/2001/sw/meetings/tech-200303/w3_plenary.sxi
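
To make the query-side expansion described under Functionality more concrete, here is a minimal sketch of what a "simple API that can provide the variants for a term" might look like. It is an illustration only, not the Sun Labs Search Engine's actual interface: the names KnowledgeSource, TableKnowledgeSource, QueryExpander, variantsOf and expand are all hypothetical, and the lookup-table source stands in for a real morphological or taxonomic source. It also ignores noun phrases and proximity weighting.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical interface (an assumption for illustration, not the engine's API):
// a knowledge source answers one question, "what are the variants of this term?"
interface KnowledgeSource {
    // Returns the variants of the term, or an empty set if the source knows none.
    Set<String> variantsOf(String term);
}

// Toy knowledge source backed by a fixed table. A real source would consult
// morphological rules or a taxonomy (e.g. one loaded from RDF or OWL).
final class TableKnowledgeSource implements KnowledgeSource {
    private final Map<String, Set<String>> table;

    TableKnowledgeSource(Map<String, Set<String>> table) {
        this.table = table;
    }

    @Override
    public Set<String> variantsOf(String term) {
        return table.getOrDefault(term, Set.of());
    }
}

// Expands a single query term by asking every configured knowledge source.
final class QueryExpander {
    private final List<KnowledgeSource> sources;

    QueryExpander(List<KnowledgeSource> sources) {
        this.sources = sources;
    }

    // Returns the original term first, followed by every variant offered by
    // any source, without duplicates.
    Set<String> expand(String term) {
        Set<String> expanded = new LinkedHashSet<>();
        expanded.add(term);
        for (KnowledgeSource source : sources) {
            expanded.addAll(source.variantsOf(term));
        }
        return expanded;
    }
}
```

With a morphological table that maps 'serve' to 'server', 'serving' and 'serves', expanding the query term 'serve' would yield those four terms; a term unknown to every source would be returned unchanged. The indexing side could call the same variantsOf method for each term it encounters.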
Received on Thursday, 16 November 2006 11:08:05 UTC