RDF Hacking: Understanding the Striped RDF/XML Syntax

(or how I learned to stop worrying and read RDF's XML syntax)

Author: Dan Brickley (danbri@w3.org)

Abstract

This document provides a brief introduction to the underlying structure of the RDF/XML 1.0 graph serialization syntax. The intended audience is mainly content and tool developers familiar with XML basics, and with the RDF model, who want a minimalistic understanding of RDF's XML syntax, so they can read and write XML with more confidence.

Introduction

When I first got involved with the W3C RDF project, I latched onto the abstract graph model, and pretty much ignored the detail of the RDF syntax for the first few months. This was just as well, since at the time (1997/1998) the exact details of the syntax were still being designed. Meanwhile I was in the RDF Schema WG, where we conducted our discussions mostly in terms of the graph model. When the time came to get my hands dirty with the detail of RDF's XML syntax, I found three tools became essential: striping, SiRPAC and RDFViz.

By "striping", I'm talking about a conceptual tool: RDF's syntax has been informally described as a "striped" graph encoding syntax. This notion can be quite useful when you first encounter RDF written as XML; I'll describe it below. The second indispensable tool was Janne Saarela's SiRPAC. SiRPAC was the first RDF parser, a tool that takes an XML encoding ("serialization") of an RDF graph and returns a textual or programmatic representation. Playing with SiRPAC made it easy to experiment with RDF/XML files and see the associated node-edge-node triples that constitute the corresponding graph structure. The final tool that made my life easier as an RDF developer was RDFViz (or, to be honest, GraphViz). Sometime in 2000 I stumbled across the GraphViz tools from AT&T. GraphViz is a graph visualisation toolkit. It can take descriptions of (various kinds of) graph, written in its .dot file format, and generate reasonably pretty pictures in various image formats. So I wrote a quick Perl filter that took the output from an RDF/XML parser such as SiRPAC and generated these .dot files for GraphViz. This was incredibly useful. A much more robust GraphViz-based RDF visualizer is now part of W3C's RDF Validator service.

So, armed with parsers, visualisation tools and the RDF syntax spec, how can a content-producer get a quick feel for the structure of RDF/XML? For me, it was the metaphor of striping that gave me a handle on the essential organising principle of RDF's XML syntax. This is, it should be noted, slightly contrary to the way the original RDF spec is organised.

A Striped Syntax

To have a prayer of understanding the XML syntax for RDF, you need to feel comfortable with the graph-based information model at the heart of RDF: objects ('resources') linked together by typed relationships or 'properties'. And you need to be at ease with the way RDF tries to use names in URI syntax wherever possible, to name resources, their types ('classes') and their attributes and interrelationships ('properties'). If you're happy with all that, you'll also need some mental baggage from the XML side of things. RDF graphs are encoded in XML, and this encoding makes use of some features of XML. You need to know about the basic abstract structure of all XML documents: the tree of elements (some decorated with attribute/value pairs), and about the way these are manifested as nested hierarchies of opening and closing angle-bracketed "tags" in XML documents. You'll also perhaps have heard of the notion of a well-formed XML document, of 'namespaces', of DTDs, of XML Schemas and various other features. These are all useful to know about, but the critical concepts to possess here are (i) basic well-formedness and (ii) XML namespaces, backed up by general comfort with XML's elements/attributes/nesting structure. Having gotten this far, it isn't such a big leap to grasp the basic pattern that underlies the RDF/XML serialization syntax: striping.

An XML syntax for RDF specifies a strategy for encoding the node-edge-node structure that RDF cares about in terms of the (attribute-decorated) element hierarchy that XML cares about. There are a number of ways this can be done. RDF 1.0 adopts a style that we term 'striped'; other conventions have been proposed, but the focus here is on RDF 1.0. The XML syntax needs to map from RDF's URI-named resources, properties and classes (nodes, edge types, node types... if you prefer a more visual terminology) into a class of well-formed XML documents. The XML namespace mechanism is used for this. So our main task here is to explain how the node-edge-node structures from RDF become element and attribute structures in XML. To do this, focus on the notion of striping and forget some annoying details for now.

Gory Details

So, this is what we mean by striping.

Consider a graph of nodes, each with a type (ie. category or 'class'), and each having a bunch of named properties (relationships) connecting it to other nodes, which might be simply string-y values, or further nodes that are themselves at the sharp and/or blunt ends of various other edges in the graph. We need to create XML elements (possibly with associated attributes) that stand for these nodes and arcs. RDF's convention for doing this is called striped because, as you look at the XML element nesting structure, elements alternately represent nodes and edges.

Example

1:<Person>
2:  <name> John </name>
2:  <livesWith>
3:    <Person>
4:      <father>
5:        <Person>
6:          <name> Fred </name>
5:        </Person>
4:      </father>
3:    </Person>
2:  </livesWith>
1:</Person>

Here we're saying, loosely, that "there exists a Person with a name, 'John', and that person 'livesWith' a Person that has a father that is a Person with a name 'Fred'". The RDF node-and-edge view of this is (@@RDFViz here). Now look at the nesting structure. The first level of XML elements, our first occurrence of <Person>, stands for a node (some specific instance of the type of thing we're calling 'Person'). And then the striping starts. The next level in, we see two XML elements: one is 'name', the other 'livesWith'. These stand not for nodes in the graph, but for edges. The first is an edge labeled 'name' connecting our person to the string 'John'. The second is an edge labeled 'livesWith' whose blunt end is our first Person node, and whose sharp end points to a second node, also of type Person (if you're counting, this is the 2nd Person node, and the 3rd level of XML elements in the example). So now we're into the 3rd level of XML nesting, and the striping pattern means that we're now describing a node again. The node here is again given a type, Person, and the sub-elements below it in the XML tree are, accordingly, representations of that Person's properties. At the 4th level there is just one of these, 'father', an edge whose sharp end points to the node described at the 5th level. So, at that level, we have our last node element, a Person element (standing for a node of type Person), and it has just one sub-element, 'name', which provides the label for an edge connecting the 3rd person to the string 'Fred'.

So to recap, we've gone: node (of type Person), edge ('name': John); edge ('livesWith'), node (of type Person), edge ('father'), node (of type Person), edge ('name': Fred). The XML elements at the 1st, 3rd, and 5th levels of nesting all stand for individual nodes; in our scenario they happen to all be of the same type, Person. The XML elements at the 2nd, 4th, and 6th levels of nesting represent labeled edges in the graph, ie. RDF properties.

This is RDF striping. Understanding this basic representational convention is all you need to understand most RDF/XML examples you'll encounter.
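If it helps to see the alternation spelled out mechanically, here is a rough sketch in Python of reading the example by stripes. It is not how SiRPAC or any real RDF parser works: the `read_stripes` function, the invented `_:n1`-style node labels and the plain `"type"` edge label are all assumptions made for illustration, and namespaces, attributes and everything else in the spec are ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count

# The example document from above, without the nesting-level markers.
EXAMPLE = """
<Person>
  <name> John </name>
  <livesWith>
    <Person>
      <father>
        <Person>
          <name> Fred </name>
        </Person>
      </father>
    </Person>
  </livesWith>
</Person>
"""


def read_stripes(xml_text):
    """Return (subject, edge label, object) triples by alternating node and edge levels."""
    triples = []
    labels = count(1)  # invented labels for nodes that have no URI name

    def read_node(node_elem):
        # Odd-numbered level: the element stands for a node; its name gives the node's type.
        node = f"_:n{next(labels)}"
        triples.append((node, "type", node_elem.tag))
        for edge_elem in node_elem:
            # Even-numbered level: the element stands for an edge leaving this node.
            children = list(edge_elem)
            if children:
                # The sharp end of the edge is the node described one level further in.
                triples.append((node, edge_elem.tag, read_node(children[0])))
            else:
                # No nested element: the edge points at a plain string value.
                triples.append((node, edge_elem.tag, (edge_elem.text or "").strip()))
        return node

    read_node(ET.fromstring(xml_text))
    return triples


triples = read_stripes(EXAMPLE)
for triple in triples:
    print(triple)
```

Each call to `read_node` handles one node stripe, and the loop inside it handles the edge stripe directly below, so the recursion follows the same node, edge, node, edge rhythm as the recap.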

Some observations

You can't tell, without starting at the top and counting on your fingers, whether an XML element in the RDF serialisation represents an edge or a node. But often you can cheat! Look again at the example, and notice that every element in an even-numbered layer of XML, the 'edge label' stripes, has a name beginning with a lower case letter. Many RDF vocabularies (including the core RDF specs themselves) adopt this convention: we name properties with a lower case initial letter, and classes of thing with an upper case one (eg. 'Person').
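The cheat is simple enough to write down. This is only a sketch of the naming convention, not a rule of the syntax; the `guess_role` function and the `ex:` prefix are made up for the example, and a vocabulary that ignores the convention will fool it.

```python
def guess_role(element_name):
    """Guess 'edge' or 'node' from the first letter of an element's local name.

    This relies purely on the naming convention, so it can guess wrong.
    """
    local_name = element_name.split(":")[-1]  # drop a namespace prefix such as "ex:"
    return "edge" if local_name[:1].islower() else "node"


for name in ["Person", "name", "livesWith", "father", "ex:Person"]:
    print(name, "->", guess_role(name))
```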

I haven't mentioned the rdf:Description element. The RDF 1.0 Model and Syntax spec gives this a lot of attention when presenting the RDF syntax. Basically it can be used as any of the node-describing XML elements (ie. the odd-numbered levels) in the striped syntax. It is redundant, and a bit confusing: apart from the option of using rdf:Description for a node-describing element, we can always map from the name of such an element to an RDF type, that is, a class for the thing the node describes. In our example, that class is 'Person'. So the existence of rdf:Description in the syntax complicates things. Whenever you see it, pretend you saw a node element called 'Resource' instead; that way, you can read it as 'there exists a Resource...'.
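A small sketch of that reading trick, under some assumptions: the `http://example.org/terms#` namespace and the `node_type` helper are invented for illustration, and the document is a cut-down variant of the Person example. The rdf: namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Same shape as the Person example, but the outer node uses rdf:Description.
# The "http://example.org/terms#" namespace is made up for this sketch.
DOC = """
<rdf:Description xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns="http://example.org/terms#">
  <name>John</name>
  <livesWith>
    <Person>
      <name>Fred</name>
    </Person>
  </livesWith>
</rdf:Description>
"""


def node_type(node_elem):
    """Read a node-level element's name as a type, treating rdf:Description as 'Resource'."""
    if node_elem.tag == f"{{{RDF_NS}}}Description":
        return "Resource"
    return node_elem.tag.split("}")[-1]  # local name without the namespace part


root = ET.fromstring(DOC)
inner = root.find("{http://example.org/terms#}livesWith")[0]
print("there exists a", node_type(root))
print("there exists a", node_type(inner))
```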

We've said nothing about namespaces here yet. RDF uses the XML namespace mechanism to associate all these classes and properties with Web identifiers (URIs). We've also said nothing about the use of XML attributes. Here's a short version. When you see an attribute on a node-level element, eg. on the 'Person' elements in the example above, it stands for an RDF property whose value is written as a simple literal string. Except for some special cases, of course, otherwise things would be too simple. One special case is important: the rdf:about attribute. When you see rdf:about, this is RDF's way of telling you that we know a URI name for the thing concerned. Such attributes are not treated as properties, but are in a sense 'built in' to RDF at a deep level. The same goes for rdf:ID, xmlns:*, xml:lang, xml:base and some others. See the syntax spec for details. But the basic idea is: when you see any other attribute on a node-level XML element (the ones whose names often begin with capital letters), the attribute represents an edge.
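To make the attribute story concrete, here is a hedged sketch of reading the attributes on one node-level element. The `ex:` vocabulary namespace, the person URI and the helper names `to_uri` and `attribute_edges` are all hypothetical, and the list of built-in attributes is deliberately incomplete (the syntax spec has the real one). It also shows the namespace mechanism at work: the property's URI is the namespace URI with the local name appended. The parser used here resolves xmlns declarations itself, so they never show up as attributes.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Attributes that are built in rather than ordinary properties.
# Deliberately incomplete; see the syntax spec for the full set.
NOT_PROPERTIES = {
    f"{{{RDF_NS}}}about",
    f"{{{RDF_NS}}}ID",
    f"{{{XML_NS}}}lang",
    f"{{{XML_NS}}}base",
}

# Hypothetical vocabulary namespace and resource URI, invented for this sketch.
DOC = """
<ex:Person xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ex="http://example.org/terms#"
           rdf:about="http://example.org/people/john"
           ex:name="John"/>
"""


def to_uri(expanded_name):
    """Turn ElementTree's '{namespace}local' form into a single URI."""
    if not expanded_name.startswith("{"):
        return expanded_name  # no namespace: left as-is in this sketch
    namespace, local_name = expanded_name[1:].split("}", 1)
    return namespace + local_name


def attribute_edges(node_elem):
    """Return (subject, property URI, literal) for each property attribute on a node element."""
    # rdf:about, when present, gives the URI name of the node itself.
    subject = node_elem.get(f"{{{RDF_NS}}}about", "_:unnamed")
    return [
        (subject, to_uri(attr), value)
        for attr, value in node_elem.attrib.items()
        if attr not in NOT_PROPERTIES
    ]


edges = attribute_edges(ET.fromstring(DOC))
print(edges)
```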

Another important case: representing edges that point to nodes that are described elsewhere (within the same document but not within this part of the element tree; or elsewhere in the Web). For this, RDF has the rdf:resource attribute. This always appears at the edge level of the XML document, ie. on elements that stand for edges rather than for nodes. Apart from that, it functions similarly to rdf:about, in that it uses a URI to point off to a node instead of describing it inline.
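A sketch of how rdf:resource changes the reading of an edge-level element, again under assumptions: the example.org URIs, the default namespace and the `edges_of` helper are invented, and nodes nested inline under an edge are not handled here at all.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_ABOUT = f"{{{RDF_NS}}}about"
RDF_RESOURCE = f"{{{RDF_NS}}}resource"

# Hypothetical URIs; the second person is described somewhere else.
DOC = """
<Person xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns="http://example.org/terms#"
        rdf:about="http://example.org/people/john">
  <name>John</name>
  <livesWith rdf:resource="http://example.org/people/mary"/>
</Person>
"""


def edges_of(node_elem):
    """Return (subject, edge label, object) for the edge elements under one node element."""
    subject = node_elem.get(RDF_ABOUT, "_:unnamed")
    edges = []
    for edge_elem in node_elem:
        label = edge_elem.tag.split("}")[-1]  # local name of the property
        # rdf:resource on the edge element names the node at the sharp end.
        target = edge_elem.get(RDF_RESOURCE)
        if target is None:
            # Otherwise fall back to the string value (inline nodes are ignored here).
            target = (edge_elem.text or "").strip()
        edges.append((subject, label, target))
    return edges


edges = edges_of(ET.fromstring(DOC))
for edge in edges:
    print(edge)
```

Note the symmetry: rdf:about sits on the node stripe and names the blunt end of the edges below it, while rdf:resource sits on the edge stripe and names the sharp end.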

There are many other corner cases in the spec. RDF's rdf:parseType attribute, for example, complicates the simplistic striping model described here. But for many common cases, the notion of 'striped syntax' will provide some useful mental scaffolding that'll help you read the XML not just "as XML", but as an XML description of the abstract RDF graph. If in doubt, experiment with the free online parsing and visualisation service at W3C.

TODO

Add images, hyperlinks, another example.

Further reading. N-Triples dump for the examples.

$Id: stripes.html,v 1.13 2001/10/19 21:04:42 wwwrun Exp $