An Overview of the RDF Data Model

Frank Manola, 8 November 2001

1.  Introduction

The World Wide Web provides people with an unprecedented capability to share information across the globe. However, the sheer amount and diversity of Web information now available makes finding and using the right information more difficult. As the Web continues its spectacular growth, the value of software that can find, filter, and combine information in response to specified user requirements greatly increases.

The basic difficulty in providing this software support is that the Web was originally aimed at providing its resources to people, not to other software, and so Web resources do not have descriptions of their meanings or capabilities that software can understand.  For example, the meaning of a Web page is determined by human understanding of the screen content when the page is displayed in a browser. This meaning is inaccessible to a piece of software.  As a result, software such as search engines must rely on such techniques as simple text matching, rather than being able to process Web resources based on an understanding of their true relationships to a user's intentions and needs.

The Semantic Web is going to change all that. The Semantic Web will enhance the Web's inter-linked information and service resources with software-interpretable descriptions of the resources' meanings, capabilities, and inter-relationships.  These descriptions will allow tools such as agents, search engines, or service brokers to more automatically and reliably find and use appropriate resources in response to user requirements.  At the same time, the Semantic Web creates the infrastructure for entirely new classes of agent-oriented capabilities.  @@need some discussion of these other capabilities;   also want to refer to apps involving non-Web resources

The W3C’s Resource Description Framework (RDF) is, as its name suggests, a framework (or approach) for describing Web resources.  The approach is based on some very simple ideas, but when those ideas are taken together, and suitably generalized (as they are in RDF), they provide a means for describing practically anything, in a form that can be processed by software.  RDF (and richer approaches based on it) supplies the basic technology for the software-interpretable descriptions required to support the Semantic Web.

At the same time, the use of RDF does not necessarily involve the use of inference, logic programming, or related technologies, as the term “semantic” might suggest.  RDF also provides essential support for applications such as providing simple information about Web content (provenance, content ratings), defining privacy policies, specifying site maps, or supporting description-based Web service brokering.

@@Editorial notes:  this version addresses various comments (mostly by Pat) on the original version, and preserves the original order of topic presentation.   There have been some suggestions to move the description of URIs to the beginning, and then talk about Ntriples and graphs ("pure RDF").  A problem with that approach is that I think a primer on RDF ought to start off talking about RDF's general approach to describing things in terms of subjects, properties, and their values, rather than a discussion of how to identify things (in particular, the use of URIs to name properties doesn't have much significance until you talk about what a property is).  I've got a version in the works that presents part of "The General Idea", then does the URI stuff, and then gets into triples, but the URI section is a major interruption to what I believe is the natural flow of the text (at least to someone with a database or other non-Web background). Anyway, we'll see.
 

2.  The general idea

The initial motivation for RDF was to provide a simple way to state properties of (facts about) Web resources, e.g., Web pages.  For example, imagine that we want to record the fact that someone named John Smith created a particular Web page. A straightforward way to state this fact in English would be in the form of a simple statement, e.g.:

“the creator of [the particular Web page we’re talking about] is John Smith”

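The parts of this statement illustrate that, in order to describe the properties of something, we need ways to identify a number of things: the thing being described (the Web page), the property we want to state (its creator), and the value of that property (John Smith).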

The Web page can be identified by its URL (Uniform Resource Locator), say http://www.foobar.org/index.html , so we could rewrite the statement as:

“the creator of http://www.foobar.org/index.html is John Smith”

In this statement, in addition to using a URL to identify the Web page, we’ve used the word “creator” to identify the property we want to talk about, and the two words “John Smith” to identify the thing (a person) we want to say is the value of  this property.

We can state other properties of this Web page by writing additional English statements of the same general form, using the URL to identify the page, and words (or other strings) to identify the properties and their values.  For example, to specify the date the page was created, and the language in which the page is written, we could write the additional statements:

“the creation-date of http://www.foobar.org/index.html is August 16, 1999”
“the language of http://www.foobar.org/index.html is English”

(note the use of the string "August 16, 1999" to identify a date).

RDF assumes as its basic model that things have properties which have values, and that resources can be described by making statements, similar to those above, that specify those properties and values.  RDF uses a particular terminology for talking about the various parts of statements.  Specifically, the part that identifies the thing the statement is about (the Web page in this example) is called the subject .  The part that identifies the property or characteristic of the subject that the statement specifies (creator, creation-date, or language in this case) is called the predicate, and the part that identifies the value of that property is called the object.  So, taking the statement

“the creator of http://www.foobar.org/index.html is John Smith”

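in RDF terminology, the subject is http://www.foobar.org/index.html, the predicate is “creator”, and the object is “John Smith”.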

(For the moment, the examples will continue to use words or strings as predicates and objects, as we have in the examples above.  In the next section, we will see that RDF actually uses a more Web-oriented approach for specifying these things.)
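As a rough illustration, the three parts of a statement can be modeled in a few lines of Python.  This is only a sketch: the Statement type and its field names are our own choices for illustration, not something RDF defines, and plain strings stand in for all three parts, as in the examples so far.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF-style statement (an illustrative type, not defined by RDF)."""
    subject: str    # the thing the statement is about
    predicate: str  # the property of the subject being specified
    object: str     # the value of that property


stmt = Statement(
    subject="http://www.foobar.org/index.html",
    predicate="creator",
    object="John Smith",
)

# Reassemble the English form of the statement from its three parts.
print(f"the {stmt.predicate} of {stmt.subject} is {stmt.object}")
```

Keeping the three parts in a fixed order is what later lets the same information be written down as a simple triple.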

RDF statements are similar to a number of other formats for recording information, such as:

and information in these formats can be treated as RDF statements, allowing RDF to be used as a unifying model for integrating data from many sources.

RDF represents statements as nodes and arcs in a graph.  In this notation, a statement is represented by a node for the subject, a node for the object, and a labeled arc between them for the predicate, as in:

[Figure 1: single RDF statement]

Collections of statements are represented by corresponding collections of nodes and arcs.  So the three statements we’ve given so far would be represented by the following graph:

[Figure 2: three RDF statements]

The graph is technically a labeled directed graph, since the arcs have labels, and are “directed” (point in a specific direction, from subject to object).

Sometimes it is not convenient to draw graphs, so an alternative way of writing down the statements, called triples, can be used.  In the triples notation, each statement in the graph is written as a simple triple of subject, predicate, and object, in that order.  The triples representing the above three statements would be written:

<http://www.foobar.org/index.html>   creator              "John Smith"
<http://www.foobar.org/index.html>   creation-date   "August 16, 1999"
<http://www.foobar.org/index.html>   language          "English"

Each triple corresponds to a single arc in the graph, complete with the arc’s beginning and ending nodes (the subject and object of the statement).  Unlike the drawn graph, the triple notation requires that a node be separately identified for each statement it appears in.  So, for example, http://www.foobar.org/index.html appears three times (once in each triple) in the triple representation of the graph, but only once in the drawn graph.
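The difference between the two representations can be sketched in Python.  The sketch below assumes triples are held as plain tuples of strings and, as a simplification of our own, treats equal strings as the same node; the variable names are illustrative only.

```python
triples = [
    ("http://www.foobar.org/index.html", "creator", "John Smith"),
    ("http://www.foobar.org/index.html", "creation-date", "August 16, 1999"),
    ("http://www.foobar.org/index.html", "language", "English"),
]

# In the drawn graph, each distinct subject or object is a single node.
nodes = set()
for subject, _predicate, obj in triples:
    nodes.add(subject)
    nodes.add(obj)

# Each triple is one labeled arc from its subject node to its object node.
arcs = [(s, o, p) for s, p, o in triples]

# In the triples notation, the page is written out once per statement.
page = "http://www.foobar.org/index.html"
mentions = sum(1 for s, _p, _o in triples if s == page)

print(len(nodes), "nodes,", len(arcs), "arcs; page written", mentions, "times")
```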

In each of the statements we’ve considered so far, the object has been a simple string (e.g., we’ve used “John Smith” to identify a particular person, and “August 16, 1999” to identify a particular date).  In RDF, the objects in statements may be any kind of string.  More importantly, the objects in RDF statements may also be the URLs of other Web resources.  This allows us to represent not only the properties of individual resources, but also relationships between those resources and others.  So, for example, we could represent the fact that the resource at http://www.barbaz.org/myprojects.html has the same creator as the resource at http://www.foobar.org/index.html by the statement

<http://www.foobar.org/index.html>  sameCreatorAs  <http://www.barbaz.org/myprojects.html>

And we could represent the fact that the resource at  http://www.foobar.org/index.html links to the resource at  http://www.w3.org/ by the statement

<http://www.foobar.org/index.html>    linksTo   <http://www.w3.org/>

Adding these two additional statements to the original ones would give us the graph shown below:

[Figure 3: amplified example]

This graph illustrates another aspect of the way we represent RDF graphs in drawings: nodes that represent URIs are shown as ellipses, while nodes that represent strings are shown as boxes.
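The distinction between the two kinds of node can be made explicit in code.  The following Python sketch uses two small classes, Resource and Literal, which are names we have made up for illustration; predicates are still plain strings at this point in the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A node identified by a URL (drawn as an ellipse)."""
    url: str


@dataclass(frozen=True)
class Literal:
    """A node holding a plain string value (drawn as a box)."""
    text: str


page = Resource("http://www.foobar.org/index.html")

statements = [
    (page, "creator", Literal("John Smith")),
    (page, "linksTo", Resource("http://www.w3.org/")),
    (page, "sameCreatorAs", Resource("http://www.barbaz.org/myprojects.html")),
]

# Statements whose object is another resource express relationships
# between resources, rather than simple property values.
related = [obj.url for _s, _p, obj in statements if isinstance(obj, Resource)]
print(related)
```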
 

3.  RDF identifiers

So far, we’ve described some of the basic ideas behind RDF, showing how simple statements composed of subjects, predicates, and objects provide a way of describing Web resources.   However, to simplify the initial presentation, we’ve oversimplified some of these ideas.  It’s now time to develop these ideas more fully, since they provide much of the potential power of RDF.

We first need to provide further detail about how RDF actually specifies the subjects, predicates, and objects of statements.  So far, the identifiers we’ve used are URLs, to identify Web resources, and character strings, to identify properties and their values.

In principle we would like to be able to record information about many things in addition to Web pages.  In particular, we’d like to record information about lots of things that don’t have URLs.  For example, I don’t have a URL, and yet my employer needs to record all sorts of things about me in order to pay my salary, keep track of the work that I’ve been doing, and so on.  My doctor needs to record other sorts of things about me in order to keep track of my medical history:  tests that have been performed (and the results, who performed them, and when), shots I’ve received, etc.

We’ve recorded information about lots of things that don’t have URLs using files (both manual and automated) for many years, and the way we identify those things is by assigning them identifiers: values that we uniquely associate with the individual things.  The identifiers we use to identify various kinds of things go by names like “Social Security Number”, “Part Number”, “license number”, “employee number”, “user-id”, etc.  In some cases, these identifiers (such as Social Security Numbers) are assigned by an official authority of some kind.  In other cases, these identifiers are generated by a private organization or individual.  In some cases, these identifiers have a national or international scope within which they are unique (a Social Security Number has national scope), while in other cases they may only be unique within a very limited scope (my employee number is only unique among the numbers assigned by my specific employer).  Nevertheless, these identifiers serve, if used properly, to identify the things we want to talk about.

As we’ve seen, the Web already provides one form of identifier, the Uniform Resource Locator (URL). A URL is a string that identifies a Web resource by representing its primary access mechanism (essentially, its network “location”).  However, URLs are a subset of a more general and powerful concept, the Uniform Resource Identifier (URI).  URIs (defined in [RFC2396]) are similar to URLs in that different persons or organizations can independently create them, and use them to identify things.  However, unlike URLs, URIs are not limited to identifying things that have network locations, or use other computer access mechanisms.  In fact, we can create a URI to refer to anything we want to talk about, including things that have no network location at all, such as people or abstract concepts.

URIs essentially constitute an infinite stock of names that can be used to identify things.  As with any other kind of name, you don’t need special authority or permission to create a URI for something, and you can create URIs for things you don’t own (just as you can use whatever name you like for things you don’t own in ordinary language).  Moreover, the extensibility of URIs allows the creation of identifiers for any entity imaginable, including abstract concepts that don't physically exist.

Since the URI is such a general identification mechanism, capable of identifying anything, it should not be surprising that RDF uses URIs as its basic mechanism for identifying things.  Specifically, RDF uses URIs to identify both subjects and objects in RDF statements (the objects in some statements, such as age values or names, will still be identified by strings).  In fact, RDF defines a resource as anything that is identifiable by a URI, and hence using URIs allows RDF to describe practically anything, and to state relationships between such things as well.

Now that we have URIs to identify resources, we can be more complete and precise about recording information.  For example, instead of identifying the creator of the Web page in our original example by the string “John Smith”, we can assign him a URI, say (using a URI based on his employee number) http://www.foobar.org/staffid/85740 .  The RDF statement stating this fact would then have the graph:

[Figure 4: a relationship between URIs]

or, in the triples notation:

<http://www.foobar.org/index.html>   creator    <http://www.foobar.org/staffid/85740>

One advantage of using a URI to identify the creator of the page in this example is that we can be more precise in our identification.  That is, the creator of the page isn’t the string “John Smith”, or any one of the thousands of people having “John Smith” as their name, but the particular John Smith associated with that URI (whoever created the URI defines the association).  Moreover, since we have a URI for the creator of the page, it is a full-fledged resource, and we can record additional information about him, such as his name, and age, as in the graph

[Figure 5: more information about John Smith]

or the triples

<http://www.foobar.org/index.html>        creator     <http://www.foobar.org/staffid/85740>
<http://www.foobar.org/staffid/85740>    name       "John Smith"
<http://www.foobar.org/staffid/85740>    age          "27"
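Because the creator is now a resource in its own right, a program can follow the arc from the page to the creator and then read off further properties.  The Python sketch below assumes the three triples are stored as tuples of plain strings (so URIs and string values are not distinguished), and the helper name values is our own.

```python
triples = [
    ("http://www.foobar.org/index.html", "creator",
     "http://www.foobar.org/staffid/85740"),
    ("http://www.foobar.org/staffid/85740", "name", "John Smith"),
    ("http://www.foobar.org/staffid/85740", "age", "27"),
]


def values(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Step 1: find who created the page.
# Step 2: look up the name recorded for that resource.
creator_names = [
    name
    for creator in values("http://www.foobar.org/index.html", "creator")
    for name in values(creator, "name")
]
print(creator_names)
```

Had the creator been recorded only as the string “John Smith”, there would be no node to which the second step could attach.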

We've just shown how RDF uses URIs as subjects and objects in its statements.  However, in these latest examples, we’ve still oversimplified something.  RDF also uses URIs as predicates in RDF statements.  That is, rather than using strings such as “creator” or “name” to identify properties, RDF uses URIs.

Using URIs to identify properties  is important for a number of reasons.  First, it allows us to distinguish the properties we use from properties someone else may use that would otherwise be identified by the same text string.  For instance, in our example, foobar.org uses “name” to mean someone's full name written out as a string (e.g., “John Smith”), but someone else may intend "name" to mean something different (e.g., the name of a variable in a piece of program text).  A program encountering “name” as a property identifier on the Web wouldn’t necessarily be able to distinguish these uses.  However, if foobar.org writes http://www.foobar.org/terms/name for its “name” property, and the other person writes http://geneology.org/terms/name for hers, we can keep straight the fact that there are distinct properties involved (even if a program can't automatically determine the distinct meanings).  Another reason why it is important to use URIs to identify properties is that it allows us to treat RDF properties as resources themselves.  Since properties are resources,  we can record descriptive information about them (e.g., the English description of what foobar.org means by “name”), simply by adding additional RDF statements with the property's URI as the subject.
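Both points can be seen in a small Python sketch.  The two “name” URIs are the ones used in the text; the description property URI and its wording are hypothetical, introduced here only to show a property appearing as the subject of a statement.

```python
FOOBAR_NAME = "http://www.foobar.org/terms/name"
GENEOLOGY_NAME = "http://geneology.org/terms/name"

triples = [
    ("http://www.foobar.org/staffid/85740", FOOBAR_NAME, "John Smith"),
    # A property is itself a resource, so it can be the subject of a
    # statement.  The "description" URI and its text are hypothetical.
    (FOOBAR_NAME, "http://www.foobar.org/terms/description",
     "a person's full name, written out as a string"),
]

# Both URIs end in the word "name", yet they are different identifiers.
same_local_word = (
    FOOBAR_NAME.rsplit("/", 1)[-1] == GENEOLOGY_NAME.rsplit("/", 1)[-1]
)
same_property = FOOBAR_NAME == GENEOLOGY_NAME
print(same_local_word, same_property)

# Statements made about the property itself.
about_property = [o for s, _p, o in triples if s == FOOBAR_NAME]
```

Comparing whole URIs tells a program only that the two properties are distinct, not what either one means.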

Using URIs as subjects, objects, and predicates in RDF statements allows us to begin to develop and use a shared vocabulary on the Web, reflecting (and creating) a shared understanding of the concepts we talk about.  For example, now that we know to use URIs (where we can) to identify all the parts of an RDF statement, we can write the statement “the creator of http://www.foobar.org/index.html is John Smith” as the triples

<http://www.foobar.org/index.html>        <http://purl.org/dc/elements/1.1/creator>    <http://www.foobar.org/staffid/85740> .
<http://www.foobar.org/staffid/85740>   <http://www.foobar.org/terms/name>          "John Smith" .

The URI http://purl.org/dc/elements/1.1/creator for the “creator” property in the first triple is an unambiguous reference to the “creator” attribute in the Dublin Core metadata attribute set, a widely-used collection of attributes (properties) for describing information of all kinds.  The writer of this triple is effectively saying that the relationship between the Web page (identified by http://www.foobar.org/index.html ) and the creator of the page (a distinct person, identified by http://www.foobar.org/staffid/85740 ) is exactly the concept defined by http://purl.org/dc/elements/1.1/creator .  Moreover, anyone else, or any program, which understands http://purl.org/dc/elements/1.1/creator will know exactly what is meant by this relationship. 

Incidentally, the triples above, using URIs in the subject, predicate, and (where appropriate) object positions (and with periods at the ends of the lines), are now in a formal RDF notation called Ntriples, which is defined for linearizing RDF graphs.     
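To make the notation concrete, here is a minimal Python sketch that writes triples out in the style shown above: URIs in angle brackets, strings in double quotes, and a period at the end of each line.  It is not a complete implementation of the notation; it escapes only backslashes and double quotes, and the function names and the fourth tuple element (a flag saying whether the object is a URI) are our own conventions.

```python
def term(value, is_uri):
    """Format one term: <...> for a URI, "..." for a string."""
    if is_uri:
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Write (subject, predicate, object, object_is_uri) tuples, one per line."""
    lines = []
    for s, p, o, object_is_uri in triples:
        lines.append(f"{term(s, True)} {term(p, True)} {term(o, object_is_uri)} .")
    return "\n".join(lines)


triples = [
    ("http://www.foobar.org/index.html",
     "http://purl.org/dc/elements/1.1/creator",
     "http://www.foobar.org/staffid/85740", True),
    ("http://www.foobar.org/staffid/85740",
     "http://www.foobar.org/terms/name",
     "John Smith", False),
]
print(to_ntriples(triples))
```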

4.  Complex Data  

Things would be very simple if the only types of information we had to record about things were obviously in the form of the simple RDF statements we’ve illustrated so far.  However, most real-world data involves structures that are more complicated than that, at least on the surface.  For instance, in our original example, we recorded the date the Web page was created as a simple string value.  However, suppose we wanted to record the month, day, and year as separate pieces of information?  Or, in the case of John Smith’s personal information, suppose we wanted to record his address.  We might write the whole address out as a string, as in the Ntriple

<http://www.foobar.org/staffid/85740>   <http://www.foobar.org/terms/address>   "1501 Grant Avenue, Bedford, Massachusetts 01730" .

However, suppose we wanted to record the various pieces of information about his address as separate street, city, state, and Zip code values?  How do we do this using RDF?

In RDF, we can represent such structured information by considering the aggregate thing we want to talk about (like John Smith's address) as a separate resource, and then making separate statements about that new resource.  So, in the RDF graph, in order to break up John Smith’s address into its component parts, we create a new node to represent the concept of John Smith’s address, and assign that concept a new URI to identify it, say http://www.foobar.org/addressid/85740 .  We then write RDF statements (create additional arcs and nodes) with that node as the subject, to represent the additional information, producing the graph below:

@@Note:  this figure needs to be redone, with <http://www.foobar.org/addressid/85740> in the current blank address node in the graph

[Figure 6: complex address data]

or the Ntriples:

<http://www.foobar.org/staffid/85740>       <http://www.foobar.org/terms/address>   <http://www.foobar.org/addressid/85740> .
<http://www.foobar.org/addressid/85740> <http://www.foobar.org/terms/street>      "1501 Grant Avenue" .
<http://www.foobar.org/addressid/85740> <http://www.foobar.org/terms/city>         "Bedford" .
<http://www.foobar.org/addressid/85740> <http://www.foobar.org/terms/state>       "Massachusetts" .
<http://www.foobar.org/addressid/85740> <http://www.foobar.org/terms/Zip>          "01730" .
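The same decomposition can be expressed as a small Python function.  This is a sketch under our own assumptions: the function name describe_address is hypothetical, the address parts are passed in as a dictionary, and every property is assumed to live under http://www.foobar.org/terms/ as in the triples above.

```python
TERMS = "http://www.foobar.org/terms/"


def describe_address(person, address, parts):
    """Return one triple linking person to address, plus one per part."""
    triples = [(person, TERMS + "address", address)]
    for field, value in parts.items():
        # Each component becomes a statement about the address resource.
        triples.append((address, TERMS + field, value))
    return triples


triples = describe_address(
    "http://www.foobar.org/staffid/85740",
    "http://www.foobar.org/addressid/85740",
    {"street": "1501 Grant Avenue", "city": "Bedford",
     "state": "Massachusetts", "Zip": "01730"},
)
for t in triples:
    print(t)
```

Only the first triple has John Smith as its subject; the remaining four describe the address resource itself.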

In the drawing of the graph above, the new URI we assigned to identify "John Smith's address" really serves no purpose, since we could just as easily have drawn the graph 

[Figure 7: complex address data]

In this drawing, which is perfectly good RDF, we've used a node without a label to stand for the concept of "John Smith's address".  This unlabeled node, or bNode (for blank node), functions perfectly well in the drawing without needing a URI.  However, we do need some form of explicit identifier for that node in order to represent this graph in Ntriples.  To see this, we can try to write the Ntriples corresponding to what is shown in Figure 7.  What we would get would be something like:

<http://www.foobar.org/staffid/85740>       <http://www.foobar.org/terms/address>   ??? .
???                                                             <http://www.foobar.org/terms/street>      "1501 Grant Avenue" .
???                                                             <http://www.foobar.org/terms/city>         "Bedford" .
???                                                             <http://www.foobar.org/terms/state>       "Massachusetts" .
???                                                             <http://www.foobar.org/terms/Zip>          "01730" .

where ??? stands for something that  indicates the presence of the bNode.   Since in a complex graph there might be more than one such bNode, we also need a way to differentiate between the various bNodes in the corresponding triples representation.  To do this, the triples notation uses a concept of node identifiers (or nodeIDs) to identify bNodes.  These are temporary identifiers distinct from URIs (and having their own syntax in Ntriples) that are used to indicate the presence of bNodes in the Ntriples representation.  In this example, we might generate the node identifier _:johnaddress to refer to the bNode, in which case the resulting triples might be:

<http://www.foobar.org/staffid/85740>       <http://www.foobar.org/terms/address>   _:johnaddress .
_:johnaddress                                             <http://www.foobar.org/terms/street>      "1501 Grant Avenue" .
_:johnaddress                                             <http://www.foobar.org/terms/city>         "Bedford" .
_:johnaddress                                             <http://www.foobar.org/terms/state>       "Massachusetts" .
_:johnaddress                                             <http://www.foobar.org/terms/Zip>          "01730" .
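One way a program might generate such node identifiers is sketched below in Python.  The BNode class, the label_bnodes function, and the _:b0, _:b1 naming scheme are all our own assumptions for illustration (the text uses the hand-picked identifier _:johnaddress), and the second bNode with its "Boston" value is a hypothetical addition that shows two bNodes being kept apart.

```python
import itertools


class BNode:
    """A blank node: it has identity within one graph, but no URI."""


def label_bnodes(triples):
    """Replace each distinct BNode with a generated nodeID such as _:b0."""
    counter = itertools.count()
    ids = {}

    def name(term):
        if isinstance(term, BNode):
            if term not in ids:
                ids[term] = f"_:b{next(counter)}"
            return ids[term]
        return term

    return [(name(s), p, name(o)) for s, p, o in triples]


address = BNode()
other = BNode()  # hypothetical second bNode, not part of the example graph
graph = [
    ("http://www.foobar.org/staffid/85740",
     "http://www.foobar.org/terms/address", address),
    (address, "http://www.foobar.org/terms/city", "Bedford"),
    (other, "http://www.foobar.org/terms/city", "Boston"),
]
labeled = label_bnodes(graph)
for t in labeled:
    print(t)
```

The same bNode always receives the same identifier within one call, so the statements about it stay connected, while different bNodes receive different identifiers.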

@@more about bNodes (?)
@@other subjects?
