Urchin RSS/RDF case study from Nature from Timo Hannay on 2004-02-27 (public-semweb-lifesci@w3.org from February 2004)

From: Timo Hannay <t.hannay@nature.com>
Date: Fri, 27 Feb 2004 17:55:15 -0000
To: <public-semweb-lifesci@w3.org>
Message-ID: <01d601c3fd5a$dbc54b50$910119ac@PC1580>

Dear All,

Yesterday Nature Publishing Group released a new version (v0.9) of Urchin
(http://urchin.sourceforge.net/). Those of you who were at the meeting in
Boston last October may recall that this is our open-source framework for
generating, aggregating and filtering RSS feeds. For anyone who doesn't
remember, my slides from that meeting are here:
http://nurture.nature.com/timo/urchin/W3C_Meeting_031029.ppt
What I've written below assumes you know what RSS is all about. If not,
please read something like this:
http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html

Last autumn Urchin was (among other things) able to use its triple store to
save arbitrary metadata from RSS 1.0 feeds -- which, of course, use the RDF
data model -- but wasn't able to do much with this information. The main
development in the new version is that it now reconstructs all relevant RDF
metadata in its RSS 1.0 output and, best of all, allows full RDF querying.

Here are some examples from a simple test implementation that we have online
here:
http://nurture.nature.com/cgi-bin/urchin
Please excuse the short and slightly strange list of feeds currently in the
database, which you can see here:
http://nurture.nature.com/cgi-bin/urchin?cmd=feeds
Fortunately this is enough to demonstrate the principles. Urchin
automatically visits each of these feeds every couple of hours and adds any
new items it finds to its database.

(Because we take a RESTful approach, many of the URLs are quite long. In
order that they don't wrap, forcing you to paste them back together, I'll
mostly use alias URLs in what follows.)

1) We'll start by going over some old ground. You can search the feeds in
Urchin for any keyword you like and it will pull up items with this word in
the title or description. For example, I can be updated every time any of
the feeds in the Urchin database mentions "SARS":
http://nurture.nature.com/timo/urchin/test01.html
(Urchin can output in a variety of formats, including RSS, but for now we're
using a plain HTML table for ease of viewing.)

2) We can also look for entire phrases, like "stem cell":
http://nurture.nature.com/timo/urchin/test02.html

3) And define more complicated Boolean queries:
http://nurture.nature.com/timo/urchin/test03.html

4) You can also use regular expressions to allow for differences in spelling
as well as more complex wildcard searches:
http://nurture.nature.com/timo/urchin/test04.html
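
As a rough illustration of the searches in 1) to 4), here is a minimal
Python sketch of matching against an item's title or description. It is not
Urchin's code: the item dictionaries, their field names and the helper
functions are all hypothetical, and Urchin's own query syntax is not shown.

```python
import re

# Hypothetical items; the field names are assumptions, not Urchin's schema.
ITEMS = [
    {"title": "SARS vaccine trial begins",
     "description": "A new trial starts this week."},
    {"title": "Stem cell funding",
     "description": "Debate over stem cell research continues."},
    {"title": "Tumour biology",
     "description": "Changes observed in tumor tissue."},
]


def search(items, pattern):
    """Return items whose title or description matches the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [it for it in items
            if rx.search(it["title"]) or rx.search(it["description"])]


def search_all(items, *patterns):
    """Boolean AND: keep only items that match every pattern."""
    result = items
    for pattern in patterns:
        result = search(result, pattern)
    return result


keyword = search(ITEMS, r"\bSARS\b")             # single keyword
phrase = search(ITEMS, r"\bstem cell\b")         # whole phrase
spelling = search(ITEMS, r"\btumou?r\b")         # spelling variants
both = search_all(ITEMS, r"\bstem cell\b", r"\bfunding\b")  # Boolean AND
```

The optional "u" in the third pattern is the kind of spelling difference
that a regular expression can absorb in a single query.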

5) Instead of searching the whole database each time, you can define named
aggregates of feeds and restrict searches to these. In the following
example, we've created an aggregate called "npg" that contains only Nature
Publishing Group content:
http://nurture.nature.com/timo/urchin/test05.html

6) We can also limit our searches to items that are current (i.e., ones that
were still in the relevant feed last time Urchin visited it):
http://nurture.nature.com/timo/urchin/test06.html
or to new items (i.e., ones that appeared in the relevant feed for the first
time the last time Urchin visited it):
http://nurture.nature.com/timo/urchin/test07.html
There are other options too, but you get the idea. A fuller list is
provided here:
http://nurture.nature.com/cgi-bin/urchin?cmd=help
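
The difference between "current" and "new" items in 6) can be pictured as
simple set arithmetic over item identifiers. The sketch below is only an
illustration under that assumption; the function name and the identifiers
are made up, and Urchin's actual bookkeeping may well differ.

```python
def classify(previously_seen, latest_fetch):
    """Split the item IDs from the latest visit into 'current' and 'new'.

    previously_seen: IDs already in the database before this visit.
    latest_fetch:    IDs found in the feed on this visit.
    """
    current = set(latest_fetch)            # still in the feed right now
    new = current - set(previously_seen)   # in the feed for the first time
    return current, new


seen = {"item-1", "item-2", "item-3"}
fetched = ["item-2", "item-3", "item-4"]
current, new = classify(seen, fetched)
```

Here "item-1" stays in the database but is neither current nor new, because
it has dropped out of the feed.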

7) In case you want to look at something other than the title or
description, there are some built-in metadata field names, such as
"author_name" (which looks in the Dublin Core "dc:creator" metadata field).
Here's an example that tracks everything written by Declan Butler:
http://nurture.nature.com/timo/urchin/test08.html

8) So far so old. Now we get on to the new functionality in v0.9. Instead
of using a hardcoded metadata field name that Urchin already knows about,
you can replace this with the name of any arbitrary RDF metadata field in
the triple store that is directly attached to an item. For example, if we
want to look for anything that cites the article with DOI (digital object
identifier) "10.1021/es034923g" then we look in the "dcterms:references"
metadata field, of which Urchin has no native knowledge:
http://nurture.nature.com/timo/urchin/test09.html
I think it's worth emphasising the potential power of this. Urchin has no
prior knowledge of this metadata field but it is able to import it from any
RSS 1.0 documents where it happens to find it and Urchin users can query
based on the information contained. This provides a great deal of
flexibility for reading and filtering based on any arbitrary metadata
without having to change Urchin's code at all, all thanks to the extreme
interoperability of RDF.
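
To make the idea in 8) concrete, here is a minimal Python sketch of a triple
store held as (subject, predicate, object) tuples, queried by a predicate
the code has never heard of. The item URIs, the way the DOI is encoded and
the helper name are assumptions for illustration only, not Urchin's storage
format.

```python
# Hypothetical triples; real feeds may encode DOIs and creators differently.
TRIPLES = [
    ("http://example.org/item/1", "dc:creator", "Declan Butler"),
    ("http://example.org/item/1", "dcterms:references", "10.1021/es034923g"),
    ("http://example.org/item/2", "dc:creator", "A. N. Other"),
    ("http://example.org/item/2", "dc:subject", "News"),
]


def items_where(triples, predicate, value):
    """Return subjects that have the given predicate/value directly attached.

    The predicate is an ordinary string, so no field name is hardcoded.
    """
    return sorted({s for (s, p, o) in triples if p == predicate and o == value})


citing = items_where(TRIPLES, "dcterms:references", "10.1021/es034923g")
by_author = items_where(TRIPLES, "dc:creator", "Declan Butler")
```

Nothing in the lookup is specific to "dcterms:references", so a metadata
field that turns up in a feed tomorrow could be queried in the same way.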

9) One thing you can't do in the above example is to query based on RDF
metadata that is not _directly_ attached to an RSS item in Urchin's RDF
triple store. In order to enable this, Urchin allows full RDF querying
using RCQL, which is entered here:
http://nurture.nature.com/cgi-bin/urchin?cmd=rcql
For example, this query:
http://nurture.nature.com/timo/urchin/test10.html
shows you job vacancies (from our test feed for NatureJobs) that are located
in Cambridge. And this:
http://nurture.nature.com/timo/urchin/test11.html
gives you (deep breath) all items written by anyone who's written an item
citing the document with DOI "10.1093/hmg/ddh065". It might also give you a
headache. ;-)
You may notice that these RDF queries are quite slow. This is partly
because the RCQL has to be converted into SQL and this takes time, but the
main reason is that the resultant SQL can be hideously inefficient. I guess
this is one of the penalties of using an ordinary RDBMS to store RDF and
maybe we'd get better performance from a custom triple store. Anyway, for
now, we would imagine these queries being done once each time Urchin updates
its feeds, then cached, rather than being done on-the-fly each time a query
comes in. (On that note, Urchin also has simple caching functionality, but
I won't go into this here.)
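
To give a feel for why such queries get expensive in an ordinary RDBMS, here
is a small sketch using Python's sqlite3 module. It assumes a single
three-column table of triples and creators stored as plain literals; both
are simplifications for illustration, not Urchin's schema, and the SQL is
hand-written rather than generated from RCQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical data: item:1 cites the DOI and is by Author A.
        ("item:1", "dc:creator", "Author A"),
        ("item:1", "dcterms:references", "10.1093/hmg/ddh065"),
        ("item:2", "dc:creator", "Author A"),
        ("item:3", "dc:creator", "Author B"),
    ],
)

# "All items written by anyone who has written an item citing this DOI."
# Each extra hop in the RDF graph becomes another self-join on the table.
SQL = """
SELECT DISTINCT written.subject
FROM triples AS cites
JOIN triples AS author  ON author.subject = cites.subject
                       AND author.predicate = 'dc:creator'
JOIN triples AS written ON written.object = author.object
                       AND written.predicate = 'dc:creator'
WHERE cites.predicate = 'dcterms:references'
  AND cites.object = ?
ORDER BY written.subject
"""

rows = [r[0] for r in conn.execute(SQL, ("10.1093/hmg/ddh065",))]
```

With this toy data the query finds item:1 and item:2, both by Author A, at
the cost of joining the triples table to itself twice.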

10) As mentioned above, Urchin can output any of these query results in a
variety of formats. For example, here's an RSS 1.0 feed on SARS:
http://nurture.nature.com/timo/urchin/test12.html
and here's a custom HTML page on the same subject:
http://nurture.nature.com/timo/urchin/test13.html
In fact, the HTML above is created by an XSL transformation of the RSS 1.0
output. You can define your own XSL documents, allowing any arbitrary text
output (HTML, JavaScript, plain text, whatever). Note that the HTML example
above makes some nifty use of Urchin's RSS 1.0 metadata. For example the
"Show results from this source" links reuse the 'channel_id' and the
original search term to rerun the search on that channel alone.
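
For readers who want to picture the RSS-to-HTML step, the sketch below does
something similar in plain Python. Note that this is a stand-in and not
XSLT: it uses the standard library's ElementTree parser instead of an XSL
stylesheet, and the sample feed and function name are made up.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace URIs for RDF and RSS 1.0.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}

# A tiny hypothetical RSS 1.0 document with a single item.
RSS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/a">
    <title>SARS &amp; travel</title>
    <link>http://example.org/a</link>
  </item>
</rdf:RDF>"""


def rss_to_html(rss_text):
    """Render each RSS 1.0 item as one row of a simple HTML table."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.findall("rss:item", NS):
        title = item.findtext("rss:title", default="", namespaces=NS)
        link = item.findtext("rss:link", default="", namespaces=NS)
        rows.append('<tr><td><a href="%s">%s</a></td></tr>'
                    % (escape(link, quote=True), escape(title)))
    return "<table>\n%s\n</table>" % "\n".join(rows)


html_page = rss_to_html(RSS)
```

An XSL stylesheet expresses the same mapping declaratively, which is what
makes it easy to swap in a different output format without touching code.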

11) A final feature: If you want RSS output from an RDF query then you can
also enter RCQL queries in the normal Urchin query box here:
http://nurture.nature.com/cgi-bin/urchin
by typing "RCQL:" before the query. In this case you can leave out the
"Select" bit of the query because Urchin knows what it needs to fetch in
order to create an RSS feed. For example, you can look for Declan Butler's
articles by doing "RCQL: From ?item->dc:creator=>'Declan Butler'", which
gives you this result:
http://nurture.nature.com/timo/urchin/test14.html

We think Urchin is a pretty nice demonstration of the power of RDF as well
as being a useful application in its own right. We expect to be able to
provide Urchin-driven functionality on Nature.com before very long. Watch
this space...

Cheers,

Timo

P.S. If anyone knows of any RSS 1.0 feeds with particularly rich or
interesting metadata, please let us know. Thanks.

-----
Timo Hannay, PhD
Associate Director, New Technology
Nature Publishing Group

Received on Friday, 27 February 2004 12:57:58 UTC