The Urchin-Kowari Project

David Wood (dwood@tucanatech.com)
Taowei David Wang (tw7@cs.umd.edu)
Kendall Clark (kendall@monkeyfist.com)

Introduction

The Urchin-Kowari pilot project, undertaken by members of the UMD Mindlab, replaces Urchin's MySQL database with Tucana's RDF triple store, Kowari (http://kowari.org), which is licensed under the Mozilla Public License v1.1.

The Urchin-Kowari pilot project is available for exploratory usage at <http://bombadil.mindlab.umd.edu/cgi-bin/urchin>.

Urchin (http://urchin.sf.net/) is a customizable Rich Site Summary (RSS) aggregator and filter written in Perl. It consists of two parts: a web application used to query its database, and an administration module that allows editors and other content managers to control RSS channels in the database. Urchin is licensed partially under the GNU General Public License, and partially under the GNU Lesser General Public License."

Urchin can aggregate RSS 0.9*, 1.0, and 2.0 channels; the items of these channels are converted into RDF and stored in a MySQL database. Using the web interface, a user can query the database with Urchin-specific keywords, plaintext, or the RCQL query language. Urchin can then display the results in several output formats, including RSS 1.0 and RSS 0.91 feeds. Users can, thus, create virtual or synthetic RSS feeds, composed of items that match some arbitrary search criteria, and subscribe to them. This allows each user to create a RSS feed tailored to the individual's interest from the available channels in the database.

Our motivation was to take advantage of two facts: all data in Urchin is composed of RDF graphs, and Kowari is a specialized RDF database. Because Urchin is a set of Perl scripts, and Kowari is a Java database, we connect them via SOAP, which gives us a looser coupling of aggregation, management, and persistence components.

We have tried in Urchin-Kowari to reimplement or preserve as much core Urchin functionality as possible. The web interface retains its look, but the query syntax has been modified. We replaced the RDF query language Urchin uses -- RCQL, part of the Perl RDF modules -- with Kowari's interactive Tucana Query Language (iTQL). That change alone has been worth all of the development effort as iTQL is more robust and expressive than RCQL. The other part of the Urchin search interface is a series of keywords that stand for search constrains like NEW, ENGLISH, etc. We have reimplemented some but not all of these in iTQL.

In what follows we describe the benefits of moving from MySQL to Kowari for Urchin's persistence implementation; we offer some technical detail about our prototype implementation; and, finally, we describe some possible next steps.

Native RDF Data Storage

Changing Urchin's backing store does not directly impact Urchin users except in minor changes to query capabilities, about which more below. The most noticeable interface difference is the change from RCQL to iTQL. Most Urchin components are similarly unaffected. Most Urchin components exchange Perl object instances. This allowed much of Urchin to be ported without modification.

Urchin benefits most from Kowari in terms of performance and scalability. Kowari is optimized for the storage and retrieval of RDF statements and provides substantially better query response time, as explained in the next section. Kowari provides for the storage of roughly 50 million RDF statements before write performance degrades significantly on 32-bit operating systems. The effective storage limitation is roughly one order of magnitude higher on 64-bit operating systems due to a combination of memory-mapped index files and a larger index address space.

Kowari also brings with it a more full-featured query language, the interactive Tucana Query Language (iTQL). iTQL is more expressive than RCQL, the language of the RDF::Core::Query Perl Module. Features of interest to Urchin include the ability to traverse RDF graphs directly (using iTQL's 'walk' operator) and perform transitive closure inferencing (using the 'trans' operator), as well as queries over virtual models and nested subqueries. iTQL also allows queries to combine traditional RDF queries with full-text searches of RDF literals.

Advantages of Kowari

A comparison of metadata storage with relational databases highlights the performance advantages of Kowari. Storage of metadata in a relational database typically requires a design with a few (one to six) very narrow, deep tables. Each table might consist of two to five columns, most of which are indexed more than once for performance. A basic query would require all tables to be joined with additional table aliases for nested joins. Large numbers of individual queries are required to perform more complex queries, resulting in slower response times.

Updating metadata statements in a relational database is expensive since all columns are indexed. This would require data pages and indexed columns to be locked. Kowari allows inline writes into its indices and does not have the concept of data pages.

Updates in a relational database are slow when multiple data locks attempt to obtain the same page lock (a "hot spot"). These types of locks also affect reading performance and maintenance. Kowari uses phase trees to ensure read transaction occur on an unmodifiable set of data, so updates do not effect read operations already in progress. This allows consistency in long-running read transactions.

Relational database query optimizers do not perform well when faced with narrow, deep tables with varying amounts of data distribution. The optimizer's distribution pages (costs) do not contain enough detailed information per index. This occurs because columns storing a variety of data (as in RDF statements) are diluted in favor of columns with high uniqueness (such as a column of phone numbers). Additionally, some RDBMS optimizers are designed to examine all possible execution plans based on rules and cost. Since RDF queries tend to be self-recursive and have numerous paths, optimizers battle to determine the best execution plan. In some cases determining the execution plan takes longer than executing the actual query. Kowari has an optimizer specifically designed to handle queries against RDF statements.

Caching of information anticipated to be relevant to subsequent queries is also an issue. Most relational databases cache the most recently accessed tables. This does not suit databases holding RDF data since the relevant table is narrow and deep. Therefore, RDBMS caching mechanisms are often prohibited from coming into play and can even hinder performance. Kowari has a caching mechanism designed for metadata queries, which caches statements, not tables.

Kowari has been designed to hold RDF statements. This allows for the indexing scheme to be optimized for queries that constrained subjects, predicates and/or objects of RDF statements. Due primarily to this indexing scheme, and assisted by the above performance enhancements, Kowari is very fast. Kowari can load RDF statements at a rate of 7,500 RDF statements (triples) per second (fully transacted). Reading bulk statements (that is, recovering a list of statements) may be done as fast as 33,000 statements per second.

State of the Software

The software consists of a modified version of Urchin 0.90 and a modified version of Kowari 1.0.5. While the source code has been modified, we have purposely left the original Urchin's capabilities intact. All of Urchin's original commands, interfaces, and functionalities still work. We choose to retain the original code, backend storage, and functionalities so it would be easy to compare the performance of the two systems (Urchin and Urchin-Kowari). Below we describe the functionalities of Urchin-Kowari and the original Urchin capabilities.

urchinadm (administrative tool)

If one types 'urchinadm' from the command line, one would see the following output, describing the available commands.

  -
  urchinadm [command]
  [Urchin-Kowari Commands]
  
  add         Add/Update channels to Kowari. Reads one URL per line from stdin.
  
  remove      Remove channels in Kowari.  Reads one URL per line from stdin.
  
  list        List all channel source URL's in Kowari. Writes one URL and its model
              name in Kowari per line to stdout.
  
  rebuild     Save the current list of source URL's, wipe Kowari database, and
              reimport all channels.  (Interactive!)
  
  wipe        Delete all data in Kowari, including channel metadata.  (Interactive!)

urchin (cgi script for the Web interface)

Urchin-Kowari's plaintext query relies on Kowari's built-in Lucene Models. Given a plaintext query 'x', it will attempt to find all items whose 'description' or 'title' fields contain a word beginning with 'x'. In essence, it searches for 'x*'.

output options:

Urchin:	HTML table, HTML list, RSS 1.0, RSS .91
Urchin-Kowari:	HTML table, RSS 1.0, Atom

metadata stored in Kowari:

  channel URI -> model name
  item        -> channel URI
  item        -> channel name
  item        -> link
  item        -> language

Model Management

The relationship between aggregated RSS feeds and Kowari models is n + 2, where n is the number of aggregated RSS feeds.

For each RSS feed added to Urchin-Kowari, there is one corresponding Kowari model. One normal Kowari model stores all RDF triples from the feed. However, all RDF triples that have literals as objects are also inseted into a Lucene model. This Lucene model is shared by all channels. The necessity of the Lucene model is to allow partial text matches. The Lucene model has fulltext indexing and allows for a wide range of regular expression searches.

In addition to the model per channel and the shared Lucene model, Urchin-Kowari maintains a metadata model. The metadata model is also shared amongst the channels. This model is created the first time urchinadm is run. The metadata model contains data generated from the channels' statements. The separation between generated (inferred) statements and original statements allows for the separation of real data and metadata. When a channel is removed, the normal model it is associated with as well as the statements associated with it in the medatada model and the Lucene model are deleted. When the database is wiped, all models will be dropped, including the metadata model. But the metadata model and the Lucene model will be craeted as an empty model automatically so new channels can be added correctly.

Metadata Structure/ Channel Management (urchinadm)

ADD:

When an RSS feed URI is added to Urchin-Kowari, several things happen after Urchin has converted the data into RDF triples:

The URI is converted to a 128-bit hash code using the MD5 message digest algorithm.

A normal model is created in Kowari using this hash code. For example,


      create model <rmi://foo/server1#1910d5d9c23f1639a200e0dd9af8ae52>;

The set of RDF triples are inserted into this model.
All RDF triples whose object is a literal (not a resource or a bnode) are inserted into the Lucene model. The Lucene model is used for its full text indexing capabilities. This allows users to query for plain text and partial plain text matches. Though the Lucene model provides other other regular expression capabilities, none other than prefix matching is currently supported.

A triple mapping the original RSS feed URI to its model name is created and entered into the metadata model:


      (<http://www.mindswap.org/news/rss/>,
       <http://www.mindswap.org/urchin-kowari/mapping>,
       'rmi://foo/server1#1910d5d9c23f1639a200e0dd9af8ae52')

The name (title), link, and language properties of the feed are gathered.
For each item <A> in the feed, a triple of the form (<A>, <http://purl.org/rss/1.0.channel>, 'channel URI') is inserted into the metadata model.
For each item <A> in the feed, a triple of the form (<A>, <http://www.mindswap.org/urchin-kowari/itemToChannelName>, 'channel name') is inserted into the metadata model. These triples would only be generated if the title property is specified.
For each item <A> in the feed, a triple of the form (<A>,<http://www.mindswap.org/urchin-kowari/itemToChannelLink>, 'channel link') is inserted into the metadata model. These triples would only be generated if the link property is specified.
For each item <A> in the feed, a triple of the form (<A>, <http://www.mindswap.org/urchin-kowari/itemToChannelLanguage>, 'channel language') is inserted into the metadata model. These triples would only be generated if the language property is specified.

So, for each feed, one model is created. The model is an ordinary Kowari model containing all RDF triples from the feed and has a name that is the hash of the feed's URI. All RDF statements that have literals as objects are inserted into the Lucene model. In addition, several metadata RDF triplse are generated and inserted into the metadata model. Each feed generates at most 1 + 4N (and at least 1 + N) RDF statements and stores them into the metadata model. The first single RDF triple is the feed URI-to-model mapping. The first N mandatory RDF statements are item-to-URI mapping. The last 3N potential statements are item-to-channel_name, item-to-channel_link, item-to-channel_language mappings.

(All these steps happen in RDFStore.pm, function "sendToKowari")

REMOVE:

Given the URI of a feed, the remove operation will remove the two models associated with such feed as well as all the entries in the metadata.

Checks to see if the feed is valid in the database (it may not exist). If true, then the channel will be removed (step 2). Else a message is outputted to stdout.
Removes all items in the Lucene model that are copies of the normal model. Remove all items with the correct title, link, and language in the metadata model. Remove URI-to-model statement in the metadata model. Drop the normal model. The running time of the removal of the statements in the metadata model is linear to the number of items the channel has. So for a channel with n news items, kowari needs to perform n specific removals (where each removal removes all metadata (up to 4 triples) associated with an item), though metadata removal is serialized in a single SOAP message containing a complex iTQL query.

LIST:

This command lists all the channels currently stored in the Kowari database. For each channel, the URI where the feed resides in as well as the actual model name associated with such URI are printed to STDOUT.

REBUILD:

Rebuild will first save the list of channels currently stored in Kowari, perform a 'wipe' operation (see below), and finally re-add the channels one-by-one to Kowari.

WIPE:

This command will delete all data and metadata currently stored in Kowari. It will allow the user to confirm and allow a 12 second 'regret time' to abort the operation.

Once user confirms and allows the 12 second period to pass, the operation will obtain a list of all normal models (this includes the metadata model) and Lucene models currently in Kowari.
Construct a series of ITQL commands to drop all the models aforementioned and wrap the commands in one SOAP message.
Send the commands to Kowari as one SOAP call.
Rebuilds an empty metadata model so further adding to the database will be done correctly.

FAQs

Why are we storing these objects as literals when they really are URI's (in building the metadata)?

From the rss feeds, each item has a designation RDF statement such as:

    s: <http://www.livejournal.com/users/bunny57/178893.html>
    p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    o: <http://purl.org/rss/1.0/item>

These statements do not contain any 'useful' information for users to query for. Unfortunately, we don't have a good way of telling these non-informative statements from informative ones. The only trait we discover is that the informative statements have literals as objects.

Hence when user enters a query Q and we get a set S of subjects that satisfies the query as well as the properties (represented as predicate-object pairs) of each s in S, we only consider a statement useful if the object is a literal. And such statement is inserted into the result set.

Next Steps

There are several possible optimizations of Urchin-Kowari. These include:

Urchin-Kowari's use of SOAP messaging between the components is inefficient if Urchin and Kowari are running on the same machine. In this case the need for SOAP messaging may be avoided by using Larry Wall's Perl-Java bridge. The Inline package for Perl and Perl's Inline::Java module allow Java code to be embedded in Perl. This may be used to call Kowari's iTQLInterpreterBean, a Java Bean for querying a Kowari server.

Urchin's original MySQL schema for RDF storage could only store triples associated with either rss:items or rss:channels. Urchin's query's naturally reflected this limitation. Migrating the storage layer to Kowari allows these query restrictions to be easily removed. Nature Publishing is considering expanding Urchin to include features such as:

Importing non-RSS RDF documents. Importing FOAF files, for example, would allow a user to issue Urchin queries such as "get items written by friends of the author of this item".

Storing RDF rules or OWL ontologies for mapping between RSS and other RDF vocabularies, such as Dublin Core metadata elements or an Atom/RDF mapping.