experiments with RDF programming in Ruby
This is a simple, experimental RDF system implemented in the Ruby programming language. It serves three purposes: to help me learn Ruby, to support an RDFWeb idea I'm prototyping, and to explore some design options for RDF APIs. This is not production-grade stuff.
Ruby is an object-oriented scripting language that I finally got around to investigating this week. More information is available from the Ruby home page, as well as from sites like Ruby Central, where you can find the online book Programming Ruby and other introductory and reference material.
The code here implements some parts of a basic RDF API and in-memory store for the Ruby programming language. It is a rewrite of a similar system I worked on in Perl. To understand what an RDF API is for, you probably need a little background on RDF and its relationship to XML and the Web, more than can be provided here. The RDF home page and Semantic Web activity provide some useful links. This brief document on RDF "striped" syntax may help, as might Tim Bray's excellent What is RDF? article. Assuming some basic familiarity with RDF, we still need to know what an RDF API might do...
So what is an RDF API? How does it differ from XML's SAX and DOM interfaces? RDF, after all, is written in XML syntax...
An RDF API provides a way for application authors to access data that is structured according to the W3C RDF information model, that is, as a directed labeled graph of "nodes and arcs". Each Node represents something, either a "resource" (some identifiable thing, whether conceptual, physical or digital) or a "literal" chunk of data such as a string or number. Ruby fans, for whom (I'm told) numbers and strings are objects, might wonder why RDF makes such a distinction. But it does, and RDF implementations consequently distinguish between nodes that are "resources" and nodes that are "literals".
The connections between these things, in RDF, are called "properties". They correspond to (binary) relations between nodes in the graph, and to the notion of attributes. In RDF we identify properties using Web identifiers, URIs. A URI (Uniform Resource Identifier) is a simple textual string (such as a URL), and provides a useful decentralised convention for naming things on the Internet. The neat idea in RDF is to use URIs to identify not just the things that we might want to describe using RDF, but also to identify the classes and properties that we use to describe them. So in RDF, each "arc" (or edge, or connection, or link) in the graph is named with a URI, and each "node" has a "type" which is some (URI-named) class.
So the RDF information model amounts to little more than a collection of triples (we call them "statements") consisting of nodes and (URI-named) arc labels. These are usefully thought of as forming a graph data structure, and most RDF APIs present themselves as interfaces to this graph.
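To make the "collection of triples" idea concrete, here is a minimal sketch in plain Ruby. The names ResourceNode, LiteralNode and Triple, and the example.org URI, are made up for illustration; they are not the classes defined in basicrdf.rb.

```ruby
# Hypothetical names throughout; this is not the basicrdf.rb API.
ResourceNode = Struct.new(:uri)    # a node named by a URI
LiteralNode  = Struct.new(:value)  # a node carrying data such as a string
Triple       = Struct.new(:subject, :predicate, :object)

FOAF_NS = 'http://xmlns.com/foaf/0.1/'
dan = ResourceNode.new('http://example.org/people#dan') # hypothetical URI

# The "graph" is nothing more than a collection of triples.
triples = [
  Triple.new(dan, ResourceNode.new(FOAF_NS + 'name'), LiteralNode.new('danbri')),
  Triple.new(dan, ResourceNode.new(FOAF_NS + 'mbox'), ResourceNode.new('mailto:danbri@w3.org'))
]

# Print each arc: resources in angle brackets, literals in quotes.
triples.each do |t|
  object = t.object.is_a?(LiteralNode) ? t.object.value.inspect : "<#{t.object.uri}>"
  puts "<#{t.subject.uri}> <#{t.predicate.uri}> #{object} ."
end
```

Blank (unnamed) nodes are left out of this sketch; a real store needs some way to tell two unnamed nodes apart.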
We call these triples "statements" in acknowledgement that RDF content is supposed to be meaningful, to say something about the world. RDF data corresponds to a set of claims, ie. statements, about the named properties of named objects, where the names are written using URI syntax. An RDF API will offer applications an interface to objects that model the world in these terms, typically allowing data to be added, or questions to be asked of the Graph and Node objects.
So the reason we have RDF APIs, in addition to pure XML interfaces such as SAX and DOM, is that there are many mappings of RDF graphs into XML documents. By coding to an RDF API instead of to an XML API, we can set aside the detail of how our RDF is written, and deal directly with interfaces that care about the content of the RDF. In other words, we can have APIs that load XML/RDF data, and expose an interface couched in terms of "nodes and arcs" (objects and their relationships), rather than in terms of the XML document structures that encode this data.
It should be clear by now that the notion of 'object' is serving at least two purposes. Ruby (and other programming languages) present programmers with objects (that have properties and methods that receive messages). RDF presents programmers with a network of objects (nodes in a graph, each node representing some "resource" or thing), connected by directed, labeled arcs. Somehow we need to represent the latter using the former. This becomes interesting, since both Ruby and RDF have the notion of a class hierarchy, and ways of representing things with properties. The Node-centric RDF API outlined below explores one trick for reflecting RDF's notion of property into Ruby's notion of (missing) methods.
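As a rough illustration of reflecting RDF properties into missing Ruby methods, the sketch below turns a call such as node.foaf_mbox into a lookup on the property URI. SketchNode, its constructor arguments and the example.org URI are assumptions for this example only; the real library may do this quite differently.

```ruby
# Hypothetical sketch, not the basicrdf.rb implementation.
class SketchNode
  attr_reader :uri

  def initialize(uri, triples, namespaces)
    @uri = uri
    @triples = triples        # array of [subject, predicate, object] strings
    @namespaces = namespaces  # prefix => namespace URI
  end

  # node.foaf_mbox is read as "prefix foaf, local name mbox" and answered
  # with every object of a matching triple whose subject is this node.
  def method_missing(name, *args)
    prefix, local = name.to_s.split('_', 2)
    ns = @namespaces[prefix]
    return super if ns.nil? || local.nil? || !args.empty?
    @triples.select { |s, p, _| s == @uri && p == ns + local }.map { |_, _, o| o }
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    prefix, local = name.to_s.split('_', 2)
    (!local.nil? && @namespaces.key?(prefix)) || super
  end
end

data = [
  ['http://example.org/people#dan', 'http://xmlns.com/foaf/0.1/mbox', 'mailto:danbri@w3.org']
]
dan = SketchNode.new('http://example.org/people#dan', data,
                     { 'foaf' => 'http://xmlns.com/foaf/0.1/' })
puts dan.foaf_mbox.join(' ')
```

Calls with an unregistered prefix still raise NoMethodError, so ordinary typos are not silently swallowed.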
There is a third sense of the word 'object' that should be mentioned at this point: in RDF, the three parts of an RDF statement are called the "subject", "predicate" and "object". The subject is the node whose property is being described, the predicate is the type of property, and the object is the value of the property. To complicate things further, there are sometimes nodes in the graph corresponding to the types of property (such as 'worksFor') and sometimes even for so-called "reified" RDF statements, representations of statements and their component parts, ie the subject, predicate and object. Such complexities can largely be ignored for now; but they're worth mentioning as they are reflected in the basic RDF API. For example, our Graph object allows you to list all the subjects, predicates or objects in the graph. More details on this to follow.
In Ruby, everything is an object. In RDF, everything is a "resource", and is described using the RDF information model. This simplifies the world of Ruby programmers, and simplifies the world of RDF programmers too. RDF offers a consistent approach to representing a lot of different kinds of data. Our goal here is to find some practical conventions for exposing that data to Ruby applications.
This section is premature; nobody should be using this code yet. Anyway, here is an overview of the system as implemented and planned.
To summarise. In Ruby-RDF you can (nearly):
Ruby-RDF does not include an RDF parser, although it now has a basic reader for the NTriples dump syntax, as well as semi-native parsing support using the Ruby/Sablotron XSLT library and an XSLT2NTriples stylesheet.
The examples/ directory contains some test data in NTriples format (*.nt) and a couple of test scripts. basicrdf.rb is the main library, defining Ruby classes for Node, Statement, Graph etc. It is work in progress, and may change a lot still. I need to set up a proper testing framework; there are currently some tests in examples/ that attempt to round-trip RDF (via NTriple syntax) through Ruby-RDF.
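For a feel of what round-tripping through NTriples involves, here is a deliberately small sketch. parse_nt_line and to_nt_line are hypothetical helpers, not functions from basicrdf.rb, and they handle only URI references, blank nodes and plain quoted literals (no language tags, datatypes or escapes).

```ruby
# Hypothetical helpers for a small subset of N-Triples.
NT_LINE = /\A\s*(<[^>]*>|_:\w+)\s+(<[^>]*>)\s+(<[^>]*>|_:\w+|"[^"]*")\s*\.\s*\z/

# Parse one line into [subject, predicate, object] strings (still in their
# N-Triples form), or nil for blank lines and comments.
def parse_nt_line(line)
  stripped = line.strip
  return nil if stripped.empty? || stripped.start_with?('#')
  match = NT_LINE.match(line)
  raise ArgumentError, "unsupported N-Triples line: #{line.inspect}" unless match
  match.captures
end

# Write a parsed triple back out as one N-Triples line.
def to_nt_line(triple)
  triple.join(' ') + ' .'
end

line = '_:a <http://xmlns.com/foaf/0.1/mbox> <mailto:danbri@w3.org> .'
triple = parse_nt_line(line)
puts to_nt_line(triple)
```

A round-trip test then amounts to checking that parsing the written line gives back the same triple.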
If you have a working Ruby interpreter, this should just work. Watch out for the usual things; for example, you may need to edit the #!/usr/local/bin/ruby line at the top of any scripts. Also, the require() call may need the directory path to basicrdf.rb.
The API features the usual notions of Graph, Statement and Node based on the corresponding structures in W3C's RDF information model. A Graph is an object that encapsulates some data, conceived of as RDF statements, which we think of as the directed labeled arcs in our graph. A Node is an object representing a 'resource' or 'literal' content in the RDF graph. Resource nodes may be blank or named with a URI; literal nodes have data content (text etc.).
There is also a simple XSLT-based RDF import facility, using Jason Diamond's XSLT RDF parser and the Ruby Sablotron library. It doesn't dereference URIs or understand the notion of a base URI yet, nor does it behave well if the support library is missing (todo: find out about exception handling in Ruby).
The basic idea with this API was to offer several flavours of interface, to see how they compare for application use.
Here's a snippet of code that shows the first version of the API. We load some data (from STDIN), register some namespace with the graph object, and then query the loaded data using the statement-matching and node-centric APIs.
#!/usr/local/bin/ruby
#
# A little test program for Ruby-RDF features
# see http://www.w3.org/2001/12/rubyrdf/intro or mailto:danbri@w3.org
require '../basicrdf' # use the Ruby-RDF library
# get some data into a Graph:
#
db = NTriples.nt2graph() # loads ntriples from STDIN (default)
# Register RDF/XML namespaces with db:
#
FOAF = db.reg_xmlns 'http://xmlns.com/foaf/0.1/', 'foaf'
DC = db.reg_xmlns 'http://purl.org/dc/elements/1.1/', 'dc'
mb_uri = 'mailto:danbri@w3.org' # a sample URI to query data about
# Using Graph.ask: "What resources have a foaf_mbox of mailto:danbri@w3.org?"
#
danbri=db.ask(Statement.new(nil, FOAF+'mbox', mb_uri )).subjects[0]
# ask the data for all statements
# matching some template (with the
# bit we want 'nil'd out). Get the
# subjects of these statements and
# take the first answer we find...
# Using Node API: "What are the foaf_mbox properties of 'node' ?"
#
print "Mailbox(es) for the resource with mailbox #{mb_uri}: \n"
print " #{danbri.foaf_mbox.join(' ') } \n" if danbri
There is a simple in-memory Graph implementation now. You can add data using tell() or query data using ask(). Not all of ask() is implemented yet. I've not started on the Mozilla-like API. There is not much by way of distinction between interface and implementation. The node-centric API works at a proof of concept level, but needs serious attention.
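To show the shape of tell() and ask() without the rest of the library, here is a stripped-down sketch. TinyGraph is a made-up class that stores plain string triples; it borrows the method names from the text but is not the Graph class in basicrdf.rb, and its argument lists are assumptions.

```ruby
# Hypothetical sketch of a tell/ask store; not the basicrdf.rb Graph.
class TinyGraph
  attr_reader :statements

  def initialize(statements = [])
    @statements = statements
  end

  # Add one [subject, predicate, object] triple, ignoring duplicates.
  def tell(subject, predicate, object)
    triple = [subject, predicate, object]
    @statements << triple unless @statements.include?(triple)
    self
  end

  # Return a new graph holding every triple that matches the template;
  # a nil in the template matches anything in that position.
  def ask(subject, predicate, object)
    template = [subject, predicate, object]
    matches = @statements.select do |triple|
      template.zip(triple).all? { |want, have| want.nil? || want == have }
    end
    TinyGraph.new(matches)
  end

  def subjects
    @statements.map { |s, _, _| s }.uniq
  end

  def predicates
    @statements.map { |_, p, _| p }.uniq
  end

  def objects
    @statements.map { |_, _, o| o }.uniq
  end
end

foaf = 'http://xmlns.com/foaf/0.1/'
db = TinyGraph.new
db.tell('_:a', foaf + 'mbox', 'mailto:danbri@w3.org')
db.tell('_:a', foaf + 'name', 'danbri')
who = db.ask(nil, foaf + 'mbox', 'mailto:danbri@w3.org').subjects[0]
puts who
```

Because ask() hands back another graph, the result can itself be queried or have its subjects, predicates and objects listed.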
There are more bugs than I can list here. Known problems:
Some design worries...
I like having nodes and graphs loosely coupled: you can change the graph that a node is attached to. I like the current (intended if not yet verified -- I think there may be a bug here) behaviour of only having one node object exist per URI or literal. But I don't think these two approaches are working well together. The nodes that come back from querying a graph, when new, are attached to that graph. When they're not new, we get counter-intuitive behaviour.
The ask() facility on a Graph (and possibly on remote web services, which may share this interface) currently takes only a simple 'match this triple' argument. I'd like to be able to ask a Graph a more sophisticated query, eg. pass in SquishQL expressions and get back a table of bindings. Should the name ask() be reserved for this more ambitious use? How to offer multiple ways of querying the graph?
It's handy having Node, Statement etc., but sometimes these are overkill. We should be able to use simple string URIs and literals in a number of places that currently expect structured objects, eg. adding triples into a graph.
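One possible convention for accepting plain strings, sketched under assumptions: a string in the subject or predicate position is read as a URI, and a string in the object position is read as a literal unless it is already wrapped. UriRef, PlainLiteral, coerce_node and coerce_triple are hypothetical names, not part of basicrdf.rb.

```ruby
# Hypothetical coercion helpers; the position-based rule is an assumption.
UriRef       = Struct.new(:uri)
PlainLiteral = Struct.new(:value)

# Pass existing nodes through untouched; wrap plain strings.
def coerce_node(thing, as_literal: false)
  case thing
  when UriRef, PlainLiteral then thing
  when String then as_literal ? PlainLiteral.new(thing) : UriRef.new(thing)
  else raise ArgumentError, "cannot make a node from #{thing.inspect}"
  end
end

# Strings for subject and predicate become URIs; a string object becomes a literal.
def coerce_triple(subject, predicate, object)
  [coerce_node(subject), coerce_node(predicate), coerce_node(object, as_literal: true)]
end

s, p, o = coerce_triple('http://example.org/people#dan', # hypothetical URI
                        'http://xmlns.com/foaf/0.1/mbox',
                        UriRef.new('mailto:danbri@w3.org'))
puts [s.uri, p.uri, o.uri].join(' ')
```

The awkward part is the object position, where a bare string could sensibly mean either a URI or a literal; this sketch simply picks literal and makes the caller wrap URIs.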
Indexing: the Graph object currently maintains a couple of indexes (@fp and @bp). Since the ask() method returns a graph (basically a collection of statements) as the result of a query, the concern is that we go to a lot of expense creating a rather transient graph object, indexing its contents etc.
Provenance: I want to keep track of where statements came from, implement aggregation of multiple graphs into a virtual database etc.
Query interface: it should be easy enough to add a SquishQL query interface, though for now the only implementation would be remote SOAP web services.
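On the indexing worry, here is a sketch of one alternative: keep the indexes on the main store only, and answer simple lookups with plain arrays rather than a freshly indexed graph. It assumes the two indexes are a forward one (subject and predicate to objects) and a backward one (predicate and object to subjects); whether @fp and @bp work this way is a guess, and IndexedTriples is a hypothetical class.

```ruby
# Hypothetical sketch; the index layout is an assumption, not basicrdf.rb's.
class IndexedTriples
  def initialize
    @forward  = Hash.new { |h, k| h[k] = [] } # [subject, predicate] => objects
    @backward = Hash.new { |h, k| h[k] = [] } # [predicate, object]  => subjects
  end

  # Record a triple in both indexes, skipping exact duplicates.
  def add(subject, predicate, object)
    objects = @forward[[subject, predicate]]
    return self if objects.include?(object)
    objects << object
    @backward[[predicate, object]] << subject
    self
  end

  # Lookups use fetch so that a miss does not create an empty index entry,
  # and return copies so callers cannot disturb the index.
  def objects_of(subject, predicate)
    @forward.fetch([subject, predicate], []).dup
  end

  def subjects_with(predicate, object)
    @backward.fetch([predicate, object], []).dup
  end
end

foaf = 'http://xmlns.com/foaf/0.1/'
store = IndexedTriples.new
store.add('_:a', foaf + 'mbox', 'mailto:danbri@w3.org')
store.add('_:a', foaf + 'name', 'danbri')
puts store.subjects_with(foaf + 'mbox', 'mailto:danbri@w3.org').inspect
```

The trade-off is that results are no longer graphs, so they cannot be queried further without building one after all.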
I might not do any more work on this. If I do, I'll be fixing up basic facilities (so it can be used) before worrying about efficiency, scalability, beauty or even full compliance with the specs. I might finish reading the Ruby docs first; if the current code looks like Perl, there's a reason for that...
Other node-oriented RDF API experiments...
Some useful links...
author: danbri
version: $Id: intro.html,v 1.11 2001/12/10 23:59:14 danbri Exp $