An Overview of the RDF Data Model

Frank Manola, 14 October 2001

1. Introduction

The World Wide Web provides people with an unprecedented capability to share information across the globe. However, the sheer amount and diversity of Web information now available make finding and using the right information more difficult. As the Web continues its spectacular growth, the value of software that can find, filter, and combine information in response to specified user requirements greatly increases.

The basic difficulty in providing this software support is that the Web was originally aimed at providing its resources to people, not to other software, and so Web resources do not have descriptions of their meanings or capabilities that software can understand. For example, the meaning of a Web page is determined by human understanding of the screen content when the page is displayed in a browser. This meaning is inaccessible to a piece of software. As a result, software such as search engines must rely on such techniques as simple text matching, rather than being able to process Web resources based on an understanding of their true relationships to a user's intentions and needs.

The Semantic Web is going to change all that. A vision of Tim Berners-Lee, the creator of the World Wide Web, the Semantic Web will enhance the Web's inter-linked information and service resources with software-interpretable descriptions of the resources' meanings, capabilities, and inter-relationships. These descriptions will allow tools such as agents, search engines, or service brokers to more automatically and reliably find, filter, and combine appropriate resources in response to user requirements.

The W3C’s Resource Description Framework (RDF) is, as its name suggests, a framework (or approach) for describing Web resources. The approach is based on some very simple ideas, but when those ideas are taken together, and suitably generalized (as they are in RDF), they provide a means for describing practically anything, in a form that can be processed by software. RDF (and richer approaches based on it) supplies the basic technology for the software-interpretable descriptions required to support the Semantic Web.

At the same time, the use of RDF does not necessarily involve the use of esoteric AI technologies, as the term “semantic” might suggest. RDF also provides essential support for more straightforward applications, such as providing simple information about Web content (provenance, content ratings), defining privacy policies, specifying site maps, or supporting description-based Web service brokering.

2. The general idea

The initial impetus for RDF was to provide a simple way to describe the characteristics of (facts about) Web resources, e.g., Web pages. For example, imagine that we want to record the fact that someone named John Smith created a particular Web page. A straightforward way to state this fact in English would be in the form of a simple statement, e.g.:

the creator of [the particular Web page we’re talking about] is “John Smith”

This statement has several distinct parts, which illustrate that, in order to describe the characteristics of something, we need ways to identify a number of things:

We need a way to identify the thing we want to describe (the Web page, in this case)
We need a way to identify a specific characteristic (the creator) of the thing that we want to describe
We need a way to identify a specific value (the creator’s name) we want to assign to this characteristic, for the thing we want to describe

The Web page can be identified by its URL (Uniform Resource Locator), say http://www.foobar.org/index.html, so we can now write the statement as:

the creator of http://www.foobar.org/index.html is “John Smith”

In this statement, in addition to using a URL to identify the Web page, we’ve used an English string, “creator”, to identify the characteristic we want to talk about, and another string, “John Smith”, to identify the value we want to assign to this characteristic (the creator’s name).

We can describe other characteristics of this Web page by writing additional statements, using the same techniques to identify the page, the characteristics, and their values. For example, to specify the date the page was created, and the overall subject of the page, we could invent additional strings for the characteristics, and their values, and write the additional statements:

the date-created of http://www.foobar.org/index.html is “August 16, 1999”
the subject of http://www.foobar.org/index.html is “my home page”

RDF basically defines a model for making such statements that describe resources, using specific kinds of identifiers for the resources we want to describe, the characteristics we want to use to describe them, and the values we want to assign to those characteristics for each resource. So far, we’ve seen the use of URLs to identify the resources we want to talk about, and strings for the characteristics and values. Later, we’ll see that more general usage of RDF involves using more general types of identifiers.

The RDF model uses a particular terminology for talking about the various parts of statements. Specifically, the thing the statement is about (the resource, the Web page in this example) is called the subject. The attribute or characteristic of the subject that the statement specifies (creator or date-created in this case) is called the predicate (borrowing a term from mathematical logic), and the value of the predicate (or characteristic) is called the object. So, taking the statement

the creator of http://www.foobar.org/index.html is “John Smith”

in RDF terminology:

the subject is the resource identified by http://www.foobar.org/index.html
the predicate is the characteristic identified by “creator”
the object is the person identified by “John Smith”
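
The terminology above can be illustrated with a minimal Python sketch. The `Statement` class, its field names, and the `describe` helper are hypothetical names chosen for this illustration; they are not part of RDF itself.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One statement in the subject / predicate / object form (illustrative only)."""
    subject: str    # the thing the statement is about
    predicate: str  # the characteristic being specified
    object: str     # the value of that characteristic


stmt = Statement(
    subject="http://www.foobar.org/index.html",
    predicate="creator",
    object="John Smith",
)


def describe(s: Statement) -> str:
    """Render a statement back into the English form used in the text."""
    return f'the {s.predicate} of {s.subject} is "{s.object}"'


sentence = describe(stmt)
print(sentence)
```

Keeping the three parts as separate named fields makes it explicit which role each identifier plays in the statement.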

There are a number of different ways of thinking about individual RDF statements, and the collections of statements that might be used to describe a given resource. For example, we can think of the statements describing a given resource as:

entries in a simple record or catalog listing describing the resource in a data processing system.
rows in a simple relational database.
simple assertions in mathematical logic (this will prove to be a particularly important way of thinking about these statements, which we’ll discuss in more detail later).

RDF “officially” models statements as nodes and arcs in a graph or diagram. In this model, a statement is represented by a node for the subject, a node for the object, and a labeled arc between them for the predicate, as in:

[Figure: single RDF statement]

Collections of statements are represented by collections of nodes and arcs, the subjects and objects forming the nodes of the graph, and the predicates (or characteristics) forming the arcs or links between these nodes. So the three statements we’ve given so far would be represented by the following graph:

[Figure: three RDF statements]

The graph is technically a labeled directed graph, since the arcs have labels, and are “directed” (point in a specific direction, from subject to object).

Sometimes it is not convenient to draw graphs (and, in particular, it is hard for machines to directly process them), so an alternative way of thinking about the statements, called triples, can be used. In the triples notation, each statement is written as a simple triple of subject, predicate, and object, in that order. The triples representing the above three statements would be written:

http://www.foobar.org/index.html   “creator”        “John Smith”
http://www.foobar.org/index.html   “date-created”   “August 16, 1999”
http://www.foobar.org/index.html   “subject”        “my home page”

Each triple corresponds to a single arc in the graph, complete with the arc’s beginning and ending nodes (the subject and object of the statement). Unlike the graph notation, the triple notation requires that a node be separately identified for each statement it appears in. So, for example, http://www.foobar.org/index.html appears three times (once in each triple) in the triple representation of the graph, but only once in the actual graph.
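
The correspondence between triples and the graph can be sketched in Python as follows. The `build_graph` function and its return shape are assumptions made for this illustration. As a further simplification, equal strings are treated as the same node.

```python
from collections import defaultdict

# The three example statements, written as (subject, predicate, object) triples.
triples = [
    ("http://www.foobar.org/index.html", "creator", "John Smith"),
    ("http://www.foobar.org/index.html", "date-created", "August 16, 1999"),
    ("http://www.foobar.org/index.html", "subject", "my home page"),
]


def build_graph(triples):
    """Return (nodes, arcs): a set of nodes and a map subject -> [(label, object)]."""
    nodes = set()
    arcs = defaultdict(list)
    for subject, predicate, obj in triples:
        nodes.add(subject)  # adding the same subject again has no effect
        nodes.add(obj)
        arcs[subject].append((predicate, obj))  # one labeled, directed arc per triple
    return nodes, dict(arcs)


nodes, arcs = build_graph(triples)
page = "http://www.foobar.org/index.html"
mentions = sum(1 for s, _, _ in triples if s == page)
print(len(triples), "triples;", len(nodes), "nodes;", mentions, "mentions of the page")
```

The page is written out three times in the triple list but becomes a single node in the graph, with three arcs leaving it.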

In each of the statements we’ve considered so far, the object has been something identified by a simple string value, like “John Smith”. In RDF, the objects in statements may be identified by any type of literal value. More importantly, the objects in RDF statements may also be other Web resources, identified by their URLs. This allows us to represent not only the characteristics (or properties) of individual resources, but also relationships between those resources and others. So, for example, we could represent the fact that the resource http://www.barbaz.org/myprojects.html has the same creator as the resource http://www.foobar.org/index.html, by the statement

http://www.foobar.org/index.html “sameCreatorAs” http://www.barbaz.org/myprojects.html

And we could represent the fact that http://www.foobar.org/index.html links to http://www.w3.org/ by the statement

http://www.foobar.org/index.html    “linksTo”   http://www.w3.org/

Adding these two additional statements to the original ones would give us the graph shown below:

[Figure: amplified example]

This graph illustrates another aspect of the RDF graph notation: nodes that represent resources are shown as ellipses, while nodes that represent literal values are shown as boxes.
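
The distinction between resource nodes and literal nodes might be modeled as in the following Python sketch. The `Resource` and `Literal` classes and the `shape` helper are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A node identified by a URL (drawn as an ellipse in the graph)."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A node holding a literal value (drawn as a box in the graph)."""
    value: str


page = Resource("http://www.foobar.org/index.html")
statements = [
    (page, "creator", Literal("John Smith")),
    (page, "date-created", Literal("August 16, 1999")),
    (page, "subject", Literal("my home page")),
    (page, "sameCreatorAs", Resource("http://www.barbaz.org/myprojects.html")),
    (page, "linksTo", Resource("http://www.w3.org/")),
]


def shape(node):
    """Return the drawing convention for a node."""
    return "ellipse" if isinstance(node, Resource) else "box"


# Statements whose objects are resources express relationships between resources.
related = [obj.uri for _, _, obj in statements if isinstance(obj, Resource)]
shapes = [shape(obj) for _, _, obj in statements]
print(related)
```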

3. RDF identifiers

So far, we’ve described some of the basic ideas behind RDF, showing how simple statements composed of subjects, predicates, and objects provide a way of describing Web resources. However, to simplify the initial presentation, we’ve oversimplified some of these ideas. It’s now time to develop these ideas more fully, since they provide much of the potential power of RDF.

We first need to provide further detail about how RDF actually identifies the subjects, predicates, and objects in statements. So far, the identifiers we’ve used are:

URLs to identify Web pages (the only subjects we’ve talked about so far)
either simple literal values or other URLs to identify objects.
simple strings (like “creator”) to identify predicates

In principle we would like to be able to record information about many things in addition to Web pages. In particular, we’d like to record information about lots of things that don’t have URLs. For example, I don’t have a URL, and yet my company needs to record all sorts of things about me in order to pay my salary, keep track of the work that I’ve been doing, and so on. My doctor needs to record other sorts of things about me in order to keep track of my medical history: tests that have been performed (and the results, who performed them, and when), shots I’ve received, etc.

For many years, we’ve used files (both manual and automated) to record information about lots of things that don’t have URLs, and the way we identify those things is by assigning them identifiers: values that we uniquely associate with individual things. The identifiers we use to identify various kinds of things go by names like “Social Security Number”, “Part Number”, “license number”, “employee number”, “user-id”, etc. In some cases, these identifiers (such as Social Security Numbers) are assigned by an official authority of some kind. In other cases, identifiers are generated by a private organization or individual. In some cases, these identifiers have a national or international scope within which they are unique (a Social Security Number has national scope), while in other cases they may only be unique within a very limited scope (my employee number is only unique among the numbers assigned by my specific employer). Nevertheless, these identifiers serve, if used properly, to identify the things we want to talk about.

As we’ve seen, the Web already provides one form of identifier, the Uniform Resource Locator (URL). A URL is a string that identifies a Web resource by representing its primary access mechanism (e.g., its network “location”). However, URLs are a subset of a more general and powerful concept, the Uniform Resource Identifier (URI). A Uniform Resource Identifier (URI) is defined [by RFC2396] as “a compact string of characters for identifying an abstract or physical resource”. Unlike the URL, the URI is not limited to identifying things that have network locations, or use other computer access mechanisms. In fact, a URI can be assigned to identify anything, including

network-accessible things, such as an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), or a collection of other resources.
things that are not network-accessible, such as human beings, corporations, and bound books in a library.

Using URIs, we can generalize the resources we can talk about in RDF. Moreover, creation of URIs is decentralized; no one person or organization controls who creates them or how they are used. You don’t need any authority or permission to make a URI for something. You can even create URIs for things you don’t own (just as you can use whatever name you like for things you don’t own), or for abstract concepts that don’t physically exist. The extensibility of URIs allows the introduction of identifiers for any entity imaginable.

Since the URI is such a general identification mechanism, capable of identifying anything, it should not be surprising that RDF uses URIs to identify all the resources it talks about. Specifically, RDF uses URIs to identify both the subjects and objects in RDF statements (the objects in some statements, such as age values or names, must still be identified by literal values). As we’ve just noted, using URIs allows us to talk about things that aren’t “network accessible”, and to define relationships between such things as well.

Now that we have URIs to identify resources, we can be more complete and precise about recording information. For example, instead of identifying the creator of the Web page in our original example by the string “John Smith”, we can assign him a URI, say (using a URI based on his employee number) http://www.foobar.org/staffid/85740. The RDF statement stating this fact would then have the graph:

[Figure: a relationship between URIs]

or, in the triples notation:

http://www.foobar.org/index.html “creator” http://www.foobar.org/staffid/85740

One advantage of using a URI to identify the creator of the page in this example is that we can be precise in our identification. That is, the creator of the page isn’t the string “John Smith”, or any one of the thousands of people having “John Smith” as their name, but the particular John Smith associated with that URI. Moreover, since we have a URI for the creator of the page, it is a full-fledged resource, and we can record additional information about him, such as his name, and age, as in the graph

[Figure: more information about John Smith]

or the triples

http://www.foobar.org/index.html      “creator”   http://www.foobar.org/staffid/85740
http://www.foobar.org/staffid/85740   “name”      “John Smith”
http://www.foobar.org/staffid/85740   “age”       “27”
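
Because the creator is now a resource with its own URI, a program can follow the “creator” arc and then look up further statements about him. The following Python sketch shows the idea. The `objects` helper is a hypothetical name, and the triples are held in a plain list for simplicity.

```python
triples = [
    ("http://www.foobar.org/index.html", "creator", "http://www.foobar.org/staffid/85740"),
    ("http://www.foobar.org/staffid/85740", "name", "John Smith"),
    ("http://www.foobar.org/staffid/85740", "age", "27"),
]


def objects(triples, subject, predicate):
    """Return the objects of all statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Step 1: find the creator of the page (a URI, not a name string).
creator = objects(triples, "http://www.foobar.org/index.html", "creator")[0]

# Step 2: use that URI as the subject of further lookups.
creator_name = objects(triples, creator, "name")[0]
creator_age = objects(triples, creator, "age")[0]
print(creator, creator_name, creator_age)
```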

However, in these latest examples, we’ve still oversimplified something. RDF doesn’t just use URIs to identify subjects and objects in RDF statements; RDF also uses URIs to identify the predicates in RDF statements. That is, rather than actually using simple strings such as “creator” or “name” to identify predicates, RDF uses URIs to identify them.

Using URIs to identify predicates is important for a number of reasons. First, it allows us to distinguish the predicates we use from the predicates someone else may use that would otherwise be identified by the same text string. For instance, in our example, foobar.org uses “name” to mean the full name written out as a string (e.g., “John Smith”), but someone else may intend it to mean only the surname (e.g., “Smith”). A program encountering “name” on the Web wouldn’t necessarily know which was meant. However, if foobar.org writes http://www.foobar.org/terms/name for its “name” predicate, and the other person writes http://geneology.org/terms/name for hers, we can keep them straight. Another reason why it is important to use URIs to identify predicates is that it allows us to consider RDF predicates as resources themselves. As a result, we can record descriptive information about the predicates (e.g., the English description of what foobar.org means by “name”), simply by making additional RDF statements about them.
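
A small Python sketch can illustrate both points. The second triple (a surname recorded with the other party's predicate) and the `http://www.foobar.org/terms/description` predicate are assumptions invented for this illustration; they do not come from any published vocabulary.

```python
FOOBAR_NAME = "http://www.foobar.org/terms/name"      # full name, e.g. "John Smith"
GENEOLOGY_NAME = "http://geneology.org/terms/name"    # surname only, e.g. "Smith"
DESCRIPTION = "http://www.foobar.org/terms/description"  # hypothetical predicate

person = "http://www.foobar.org/staffid/85740"
triples = [
    (person, FOOBAR_NAME, "John Smith"),
    (person, GENEOLOGY_NAME, "Smith"),  # hypothetical statement by the other party
    # The predicate is itself a resource, so it can be the subject of a statement.
    (FOOBAR_NAME, DESCRIPTION, "the full name written out as a string"),
]


def values(triples, subject, predicate):
    """Return the objects recorded for one subject under one predicate URI."""
    return [o for s, p, o in triples if s == subject and p == predicate]


full_names = values(triples, person, FOOBAR_NAME)
surnames = values(triples, person, GENEOLOGY_NAME)
meaning = values(triples, FOOBAR_NAME, DESCRIPTION)
print(full_names, surnames, meaning)
```

Had both parties used the bare string “name”, the first two statements could not be told apart.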

Finally, using URIs to identify predicates, as well as subjects and objects, allows us to begin to develop and use a shared vocabulary on the Web, reflecting (and creating) a shared understanding of the concepts we talk about. For example, now that we know to use URIs (where we can) to identify all the parts of an RDF statement, we can write the statement “the creator of http://www.foobar.org/index.html is ‘John Smith’” as the triples

http://www.foobar.org/index.html   http://purl.org/dc/elements/1.1/creator http://www.foobar.org/staffid/85740
http://www.foobar.org/staffid/85740 http://www.foobar.org/terms/name “John Smith”

The URI http://purl.org/dc/elements/1.1/creator for the “creator” predicate in the first triple is an unambiguous reference to the “creator” attribute in the Dublin Core metadata attribute set, a widely-used collection of attributes (predicates) for describing information of all kinds. The writer of this triple is effectively saying that the relationship between the Web page (identified by http://www.foobar.org/index.html ) and the creator of the page (a distinct person, identified by http://www.foobar.org/staffid/85740 ) is exactly the concept defined by http://purl.org/dc/elements/1.1/creator. Moreover, anyone else, or any program, which understands http://purl.org/dc/elements/1.1/creator will know exactly what is meant by this relationship.
@@ this could be developed a bit better; we also might want to mention, now that we’re using URIs for all the parts of the statements, that there is an “official” syntax for these triples, called Ntriples.
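
As a rough sketch of what a line-based syntax for such triples can look like, the Python below writes each statement on one line, following the N-Triples convention of URIs in angle brackets, literals in double quotes, and a terminating period. The `term` and `to_line` helpers are hypothetical names. The escaping covers only backslashes and double quotes, so this is not a complete serializer.

```python
def term(value, is_literal=False):
    """Format one term: literals in double quotes, URIs in angle brackets."""
    if is_literal:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def to_line(subject, predicate, obj, literal_object=False):
    """Write one statement as 'subject predicate object .' on a single line."""
    return f"{term(subject)} {term(predicate)} {term(obj, literal_object)} ."


lines = [
    to_line(
        "http://www.foobar.org/index.html",
        "http://purl.org/dc/elements/1.1/creator",
        "http://www.foobar.org/staffid/85740",
    ),
    to_line(
        "http://www.foobar.org/staffid/85740",
        "http://www.foobar.org/terms/name",
        "John Smith",
        literal_object=True,
    ),
]
print("\n".join(lines))
```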

4. Complex Data

Things would be very simple if the only types of information we had to record about things were obviously in the form of the simple RDF statements we’ve illustrated so far. However, most real-world data involves structures that are more complicated than that, at least on the surface. For instance, in our original example, we recorded the date the Web page was created as a simple string value. However, suppose we wanted to record the month, day, and year as separate pieces of information? Or, in the case of John Smith’s personal information, suppose we wanted to record his address. We might write the whole address out as a string, as in

http://www.foobar.org/staffid/85740 http://www.foobar.org/terms/address “1501 Grant Avenue, Bedford, Massachusetts 01730”

However, suppose we wanted to record the various pieces of information about his address as separate street, city, state, and Zip code values? How do we do this using RDF?

In RDF, we represent such structured information by considering the aggregate thing we want to talk about (like “address”) as a separate resource, and then make separate statements about that new resource. So, in the RDF graph, in order to break up John Smith’s address into its component parts, we create a new node to represent the concept “John Smith’s address”. We then write RDF statements (create additional arcs and nodes) with that node as the subject, to represent the additional information, as in:

[Figure: complex address data]

We can represent the same information using the triples notation, using a technique that relational databases have been using for many years. The idea is to assign a separate identifier to the aggregate thing we want to talk about, and then record separate statements about that aggregate thing, identifying the aggregate thing using its newly-assigned identifier. So, in order to break up John Smith’s address into its component parts, we assign a new URI to identify “John Smith’s address”, and proceed to write RDF statements with that URI as the subject (in effect, the new identifier represents the node we added to the graph above to represent “John Smith’s address”), as in:

http://www.foobar.org/staffid/85740 http://www.foobar.org/terms/address http://www.foobar.org/addressid/85740
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/street “1501 Grant Avenue”
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/city “Bedford”
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/state “Massachusetts”
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/Zip “01730”

Notice that, in the graph representation of this example, we did not give the new node representing the concept of John Smith’s address a URI (although we could have), while in the triples representation we had to assign an identifier to represent that concept. The reason is that in the graph representation the node can stand for itself, without any further identification, while in the triples representation the only way to refer to the thing being described (the address in this case) is by giving it an explicit identifier.
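
The technique of giving the aggregate its own identifier can be sketched in Python as follows. The `address_triples` and `lookup` helpers are hypothetical names introduced for this illustration; the URIs are the ones used in the example above.

```python
PERSON = "http://www.foobar.org/staffid/85740"
ADDRESS = "http://www.foobar.org/addressid/85740"
TERMS = "http://www.foobar.org/terms/"


def address_triples(person_uri, address_uri, parts):
    """Link the person to the address resource, then describe that resource."""
    triples = [(person_uri, TERMS + "address", address_uri)]
    for key, value in parts.items():
        # Each component becomes its own statement about the address resource.
        triples.append((address_uri, TERMS + key, value))
    return triples


triples = address_triples(
    PERSON,
    ADDRESS,
    {
        "street": "1501 Grant Avenue",
        "city": "Bedford",
        "state": "Massachusetts",
        "Zip": "01730",
    },
)


def lookup(triples, subject, predicate):
    """Return the first object recorded for the subject and predicate, or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


# Two hops: from the person to the address resource, then to one component.
address = lookup(triples, PERSON, TERMS + "address")
city = lookup(triples, address, TERMS + "city")
print(len(triples), address, city)
```

The explicit `ADDRESS` identifier plays the role that the unlabeled node plays in the graph: it is what lets the five statements refer to the same thing.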

@@this could lead naturally into a discussion of “anonymous” resources and bNodes; but do we want to discuss that in the Primer? What else do we want to consider in the “data model”?

@@we also might want to say something to the effect that RDF statements should be considered assertions or facts (by whoever wrote the RDF), and lead gently into the model theory that way.
