David Wood (firstname.lastname@example.org)
Taowei David Wang (email@example.com)
Kendall Clark (firstname.lastname@example.org)
The Urchin-Kowari pilot project, undertaken by members of the UMD Mindlab, replaces Urchin's MySQL database with Tucana's RDF triple store, Kowari (http://kowari.org), which is licensed under the Mozilla Public License v1.1.
The Urchin-Kowari pilot project is available for exploratory usage at <http://bombadil.mindlab.umd.edu/cgi-bin/urchin>.
Urchin (http://urchin.sf.net/) is a customizable Rich Site Summary (RSS) aggregator and filter written in Perl. It consists of two parts: a web application used to query its database, and an administration module that allows editors and other content managers to control RSS channels in the database. Urchin is licensed partially under the GNU General Public License, and partially under the GNU Lesser General Public License."
Urchin can aggregate RSS 0.9*, 1.0, and 2.0 channels; the items of these channels are converted into RDF and stored in a MySQL database. Using the web interface, a user can query the database with Urchin-specific keywords, plaintext, or the RCQL query language. Urchin can then display the results in several output formats, including RSS 1.0 and RSS 0.91 feeds. Users can, thus, create virtual or synthetic RSS feeds, composed of items that match some arbitrary search criteria, and subscribe to them. This allows each user to create a RSS feed tailored to the individual's interest from the available channels in the database.
Our motivation was to take advantage of two facts: all data in Urchin is composed of RDF graphs, and Kowari is a specialized RDF database. Because Urchin is a set of Perl scripts, and Kowari is a Java database, we connect them via SOAP, which gives us a looser coupling of aggregation, management, and persistence components.
We have tried in Urchin-Kowari to reimplement or preserve as much core Urchin functionality as possible. The web interface retains its look, but the query syntax has been modified. We replaced the RDF query language Urchin uses -- RCQL, part of the Perl RDF modules -- with Kowari's interactive Tucana Query Language (iTQL). That change alone has been worth all of the development effort as iTQL is more robust and expressive than RCQL. The other part of the Urchin search interface is a series of keywords that stand for search constrains like NEW, ENGLISH, etc. We have reimplemented some but not all of these in iTQL.
In what follows we describe the benefits of moving from MySQL to Kowari for Urchin's persistence implementation; we offer some technical detail about our prototype implementation; and, finally, we describe some possible next steps.
Changing Urchin's backing store does not directly impact Urchin users except in minor changes to query capabilities, about which more below. The most noticeable interface difference is the change from RCQL to iTQL. Most Urchin components are similarly unaffected. Most Urchin components exchange Perl object instances. This allowed much of Urchin to be ported without modification.
Urchin benefits most from Kowari in terms of performance and scalability. Kowari is optimized for the storage and retrieval of RDF statements and provides substantially better query response time, as explained in the next section. Kowari provides for the storage of roughly 50 million RDF statements before write performance degrades significantly on 32-bit operating systems. The effective storage limitation is roughly one order of magnitude higher on 64-bit operating systems due to a combination of memory-mapped index files and a larger index address space.
Kowari also brings with it a more full-featured query language, the interactive Tucana Query Language (iTQL). iTQL is more expressive than RCQL, the language of the RDF::Core::Query Perl Module. Features of interest to Urchin include the ability to traverse RDF graphs directly (using iTQL's 'walk' operator) and perform transitive closure inferencing (using the 'trans' operator), as well as queries over virtual models and nested subqueries. iTQL also allows queries to combine traditional RDF queries with full-text searches of RDF literals.
A comparison of metadata storage with relational databases highlights the performance advantages of Kowari. Storage of metadata in a relational database typically requires a design with a few (one to six) very narrow, deep tables. Each table might consist of two to five columns, most of which are indexed more than once for performance. A basic query would require all tables to be joined with additional table aliases for nested joins. Large numbers of individual queries are required to perform more complex queries, resulting in slower response times.
Updating metadata statements in a relational database is expensive since all columns are indexed. This would require data pages and indexed columns to be locked. Kowari allows inline writes into its indices and does not have the concept of data pages.
Updates in a relational database are slow when multiple data locks attempt to obtain the same page lock (a "hot spot"). These types of locks also affect reading performance and maintenance. Kowari uses phase trees to ensure read transaction occur on an unmodifiable set of data, so updates do not effect read operations already in progress. This allows consistency in long-running read transactions.
Relational database query optimizers do not perform well when faced with narrow, deep tables with varying amounts of data distribution. The optimizer's distribution pages (costs) do not contain enough detailed information per index. This occurs because columns storing a variety of data (as in RDF statements) are diluted in favor of columns with high uniqueness (such as a column of phone numbers). Additionally, some RDBMS optimizers are designed to examine all possible execution plans based on rules and cost. Since RDF queries tend to be self-recursive and have numerous paths, optimizers battle to determine the best execution plan. In some cases determining the execution plan takes longer than executing the actual query. Kowari has an optimizer specifically designed to handle queries against RDF statements.
Caching of information anticipated to be relevant to subsequent queries is also an issue. Most relational databases cache the most recently accessed tables. This does not suit databases holding RDF data since the relevant table is narrow and deep. Therefore, RDBMS caching mechanisms are often prohibited from coming into play and can even hinder performance. Kowari has a caching mechanism designed for metadata queries, which caches statements, not tables.
Kowari has been designed to hold RDF statements. This allows for the indexing scheme to be optimized for queries that constrained subjects, predicates and/or objects of RDF statements. Due primarily to this indexing scheme, and assisted by the above performance enhancements, Kowari is very fast. Kowari can load RDF statements at a rate of 7,500 RDF statements (triples) per second (fully transacted). Reading bulk statements (that is, recovering a list of statements) may be done as fast as 33,000 statements per second.
The software consists of a modified version of Urchin 0.90 and a modified version of Kowari 1.0.5. While the source code has been modified, we have purposely left the original Urchin's capabilities intact. All of Urchin's original commands, interfaces, and functionalities still work. We choose to retain the original code, backend storage, and functionalities so it would be easy to compare the performance of the two systems (Urchin and Urchin-Kowari). Below we describe the functionalities of Urchin-Kowari and the original Urchin capabilities.
If one types 'urchinadm' from the command line, one would see the following output, describing the available commands.
- urchinadm [command] [Urchin-Kowari Commands] add Add/Update channels to Kowari. Reads one URL per line from stdin. remove Remove channels in Kowari. Reads one URL per line from stdin. list List all channel source URL's in Kowari. Writes one URL and its model name in Kowari per line to stdout. rebuild Save the current list of source URL's, wipe Kowari database, and reimport all channels. (Interactive!) wipe Delete all data in Kowari, including channel metadata. (Interactive!)
Urchin-Kowari's plaintext query relies on Kowari's built-in Lucene Models. Given a plaintext query 'x', it will attempt to find all items whose 'description' or 'title' fields contain a word beginning with 'x'. In essence, it searches for 'x*'.
|Urchin:||HTML table, HTML list, RSS 1.0, RSS .91|
|Urchin-Kowari:||HTML table, RSS 1.0, Atom|
channel URI -> model name item -> channel URI item -> channel name item -> link item -> language
The relationship between aggregated RSS feeds and Kowari models is n + 2, where n is the number of aggregated RSS feeds.
For each RSS feed added to Urchin-Kowari, there is one corresponding Kowari model. One normal Kowari model stores all RDF triples from the feed. However, all RDF triples that have literals as objects are also inseted into a Lucene model. This Lucene model is shared by all channels. The necessity of the Lucene model is to allow partial text matches. The Lucene model has fulltext indexing and allows for a wide range of regular expression searches.
In addition to the model per channel and the shared Lucene model, Urchin-Kowari maintains a metadata model. The metadata model is also shared amongst the channels. This model is created the first time urchinadm is run. The metadata model contains data generated from the channels' statements. The separation between generated (inferred) statements and original statements allows for the separation of real data and metadata. When a channel is removed, the normal model it is associated with as well as the statements associated with it in the medatada model and the Lucene model are deleted. When the database is wiped, all models will be dropped, including the metadata model. But the metadata model and the Lucene model will be craeted as an empty model automatically so new channels can be added correctly.
When an RSS feed URI is added to Urchin-Kowari, several things happen after Urchin has converted the data into RDF triples:
create model <rmi://foo/server1#1910d5d9c23f1639a200e0dd9af8ae52>;
(<http://www.mindswap.org/news/rss/>, <http://www.mindswap.org/urchin-kowari/mapping>, 'rmi://foo/server1#1910d5d9c23f1639a200e0dd9af8ae52')
So, for each feed, one model is created. The model is an ordinary Kowari model containing all RDF triples from the feed and has a name that is the hash of the feed's URI. All RDF statements that have literals as objects are inserted into the Lucene model. In addition, several metadata RDF triplse are generated and inserted into the metadata model. Each feed generates at most 1 + 4N (and at least 1 + N) RDF statements and stores them into the metadata model. The first single RDF triple is the feed URI-to-model mapping. The first N mandatory RDF statements are item-to-URI mapping. The last 3N potential statements are item-to-channel_name, item-to-channel_link, item-to-channel_language mappings.
(All these steps happen in RDFStore.pm, function "sendToKowari")
Given the URI of a feed, the remove operation will remove the two models associated with such feed as well as all the entries in the metadata.
This command lists all the channels currently stored in the Kowari database. For each channel, the URI where the feed resides in as well as the actual model name associated with such URI are printed to STDOUT.
Rebuild will first save the list of channels currently stored in Kowari, perform a 'wipe' operation (see below), and finally re-add the channels one-by-one to Kowari.
This command will delete all data and metadata currently stored in Kowari. It will allow the user to confirm and allow a 12 second 'regret time' to abort the operation.
There are several possible optimizations of Urchin-Kowari. These include:
Urchin's original MySQL schema for RDF storage could only store triples associated with either rss:items or rss:channels. Urchin's query's naturally reflected this limitation. Migrating the storage layer to Kowari allows these query restrictions to be easily removed. Nature Publishing is considering expanding Urchin to include features such as: