- From: Timo Hannay <t.hannay@nature.com>
- Date: Fri, 27 Feb 2004 17:55:15 -0000
- To: <public-semweb-lifesci@w3.org>
Dear All,

Yesterday Nature Publishing Group released a new version (v0.9) of Urchin (http://urchin.sourceforge.net/). Those of you who were at the meeting in Boston last October may recall that this is our open-source framework for generating, aggregating and filtering RSS feeds. For anyone who doesn't remember, my slides from that meeting are here:

http://nurture.nature.com/timo/urchin/W3C_Meeting_031029.ppt

What I've written below assumes you know what RSS is all about. If not, please read something like this:

http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html

Last autumn Urchin was (among other things) able to use its triple store to save arbitrary metadata from RSS 1.0 feeds -- which, of course, use the RDF data model -- but wasn't able to do much with this information. The main development in the new version is that it now reconstructs all relevant RDF metadata in its RSS 1.0 output and, best of all, allows full RDF querying.

Here are some examples from a simple test implementation that we have online here:

http://nurture.nature.com/cgi-bin/urchin

Please excuse the short and slightly strange list of feeds currently in the database, which you can see here:

http://nurture.nature.com/cgi-bin/urchin?cmd=feeds

Fortunately this is enough to demonstrate the principles. Urchin automatically visits each of these feeds every couple of hours and adds any new items it finds to its database.

(Because we take a RESTful approach, many of the URLs are quite long. In order that they don't wrap, forcing you to paste them back together, I'll mostly use alias URLs in what follows.)

1) We'll start by going over some old ground. You can search the feeds in Urchin for any keyword you like and it will pull up items with this word in the title or description.
For example, I can be updated every time any of the feeds in the Urchin database mentions "SARS":

http://nurture.nature.com/timo/urchin/test01.html

(Urchin can output in a variety of formats, including RSS, but for now we're using a plain HTML table for simplicity and ease of viewing.)

2) We can also look for entire phrases, like "stem cell":

http://nurture.nature.com/timo/urchin/test02.html

3) And define more complicated Boolean queries:

http://nurture.nature.com/timo/urchin/test03.html

4) You can also use regular expressions to allow for differences in spelling as well as more complex wildcard searches:

http://nurture.nature.com/timo/urchin/test04.html

5) Instead of searching the whole database each time, you can define named aggregates of feeds and restrict searches to these. In the following example, we've created an aggregate called "npg" that contains only Nature Publishing Group content:

http://nurture.nature.com/timo/urchin/test05.html

6) We can also limit our searches to items that are current (i.e., ones that were still in the relevant feed last time Urchin visited it):

http://nurture.nature.com/timo/urchin/test06.html

or to new items (i.e., ones that appeared in the relevant feed for the first time the last time Urchin visited it):

http://nurture.nature.com/timo/urchin/test07.html

There are other options too, but you get the idea. A fuller list is provided here:

http://nurture.nature.com/cgi-bin/urchin?cmd=help

7) In case you want to look at something other than the title or description, there are some built-in metadata field names, such as "author_name" (which looks in the Dublin Core "dc:creator" metadata field). Here's an example that tracks everything written by Declan Butler:

http://nurture.nature.com/timo/urchin/test08.html

8) So far so old. Now we get on to the new functionality in v0.9.
Instead of using a hardcoded metadata field name that Urchin already knows about, you can replace this with the name of any arbitrary RDF metadata field in the triple store that is directly attached to an item. For example, if we want to look for anything that cites the article with DOI (digital object identifier) "10.1021/es034923g" then we look in the "dcterms:references" metadata field, of which Urchin has no native knowledge:

http://nurture.nature.com/timo/urchin/test09.html

I think it's worth emphasising the potential power of this. Urchin has no prior knowledge of this metadata field but it is able to import it from any RSS 1.0 documents where it happens to find it and Urchin users can query based on the information contained. This provides a great deal of flexibility for reading and filtering based on any arbitrary metadata without having to change Urchin's code at all, all thanks to the extreme interoperability of RDF.

9) One thing you can't do in the above example is to query based on RDF metadata that is not _directly_ attached to an RSS item in Urchin's RDF triple store. In order to enable this, Urchin allows full RDF querying using RCQL, which is entered here:

http://nurture.nature.com/cgi-bin/urchin?cmd=rcql

For example, this query:

http://nurture.nature.com/timo/urchin/test10.html

shows you job vacancies (from our test feed for NatureJobs) that are located in Cambridge. And this:

http://nurture.nature.com/timo/urchin/test11.html

gives you (deep breath) all items written by anyone who's written an item citing the document with DOI "10.1093/hmg/ddh065". It might also give you a headache. ;-)

You may notice that these RDF queries are quite slow. This is partly because the RCQL has to be converted into SQL and this takes time, but the main reason is that the resultant SQL can be hideously inefficient. I guess this is one of the penalties of using an ordinary RDBMS to store RDF and maybe we'd get better performance from a custom triple store.
Anyway, for now, we would imagine these queries being done once each time Urchin updates its feeds, then cached, rather than being done on-the-fly each time a query comes in. (On that note, Urchin also has simple caching functionality, but I won't go into this here.)

10) As mentioned above, Urchin can output any of these query results in a variety of formats. For example, here's an RSS 1.0 feed on SARS:

http://nurture.nature.com/timo/urchin/test12.html

and here's a custom HTML page on the same subject:

http://nurture.nature.com/timo/urchin/test13.html

In fact, the HTML above is created by an XSL transformation of the RSS 1.0 output. You can define your own XSL documents, allowing any arbitrary text output (HTML, JavaScript, plain text, whatever). Note that the HTML example above makes some nifty use of Urchin's RSS 1.0 metadata. For example the "Show results from this source" links reuse the 'channel_id' and the original search term to rerun the search on that channel alone.

11) A final feature: If you want RSS output from an RDF query then you can also enter RCQL queries in the normal Urchin query box here:

http://nurture.nature.com/cgi-bin/urchin

by typing "RCQL:" before the query. In this case you can leave out the "Select" bit of the query because Urchin knows what it needs to fetch in order to create an RSS feed. For example, you can look for Declan Butler's articles by doing "RCQL: From ?item->dc:creator=>'Declan Butler'", which gives you this result:

http://nurture.nature.com/timo/urchin/test14.html

We think Urchin is a pretty nice demonstration of the power of RDF as well as being a useful application in its own right. We expect to be able to provide Urchin-driven functionality on Nature.com before very long. Watch this space...

Cheers,

Timo

P.S. If anyone knows of any RSS 1.0 feeds with particularly rich or interesting metadata, please let us know. Thanks.
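
Illustration for point 9: a rough sketch of why an indirect RDF query can turn into awkward SQL on an ordinary RDBMS. Everything here is an assumption made for illustration, not Urchin's actual schema or the SQL it generates from RCQL: the single `triples` table, the item identifiers, the author names and the helper names `build_store` and `items_by_citing_authors` are all hypothetical. The point is only that each extra hop in the RDF graph (item -> cited DOI, item -> author, author -> other items) tends to become another self-join on the same table.

```python
import sqlite3


def build_store():
    """Create a toy in-memory triple store (hypothetical schema, made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    rows = [
        ("item:1", "dc:creator", "Author A"),
        ("item:1", "dcterms:references", "doi:10.1093/hmg/ddh065"),
        ("item:2", "dc:creator", "Author A"),
        ("item:3", "dc:creator", "Author B"),
        ("item:3", "dcterms:references", "doi:10.1021/es034923g"),
    ]
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
    return conn


def items_by_citing_authors(conn, doi):
    """Return items written by anyone who has written an item citing `doi`.

    One logical question, three aliases of the same table: one self-join
    per hop through the graph.
    """
    sql = """
        SELECT DISTINCT written.subject
        FROM triples AS cites
        JOIN triples AS citing_author
          ON citing_author.subject = cites.subject
         AND citing_author.predicate = 'dc:creator'
        JOIN triples AS written
          ON written.object = citing_author.object
         AND written.predicate = 'dc:creator'
        WHERE cites.predicate = 'dcterms:references'
          AND cites.object = ?
        ORDER BY written.subject
    """
    return [row[0] for row in conn.execute(sql, (doi,))]


if __name__ == "__main__":
    store = build_store()
    print(items_by_citing_authors(store, "doi:10.1093/hmg/ddh065"))
```

In this toy data, item:2 cites nothing itself but is still returned, because its author also wrote item:1, which cites the DOI. On a large table without suitable indexes, each of these self-joins can be expensive, which is one plausible reading of the slowness described in point 9.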
-----
Timo Hannay, PhD
Associate Director, New Technology
Nature Publishing Group
Received on Friday, 27 February 2004 12:57:58 UTC