An Overview of the RDF Data Model
Frank Manola, 14 October 2001
1. Introduction
The World Wide Web provides people with an unprecedented capability to
share information across the globe. However, the sheer amount and diversity
of Web information now available makes finding and using the right information
more difficult. As the Web continues its spectacular growth, the value of
software that can find, filter, and combine information in response to specified
user requirements greatly increases.
The basic difficulty in providing this software support is that the Web
was originally aimed at providing its resources to people, not to other software,
and so Web resources do not have descriptions of their meanings or capabilities
that software can understand. For example, the meaning of a Web page
is determined by human understanding of the screen content when the page
is displayed in a browser. This meaning is inaccessible to a piece of software.
As a result, software such as search engines must rely on such techniques
as simple text matching, rather than being able to process Web resources
based on an understanding of their true relationships to a user's intentions
and needs.
The Semantic Web is intended to change that. A vision of Tim Berners-Lee,
the creator of the World Wide Web, the Semantic Web will enhance the Web's
inter-linked information and service resources with software-interpretable
descriptions of the resources' meanings, capabilities, and inter-relationships.
These descriptions will allow tools such as agents, search engines, or service
brokers to more automatically and reliably find, filter, and combine appropriate
resources in response to user requirements.
The W3C’s Resource Description Framework (RDF) is, as its name suggests,
a framework (or approach) for describing Web resources. The approach
is based on some very simple ideas, but when those ideas are taken together,
and suitably generalized (as they are in RDF), they provide a means for
describing practically anything, in a form that can be processed by software.
RDF (and richer approaches based on it) supplies the basic technology
for the software-interpretable descriptions required to support
the Semantic Web.
At the same time, the use of RDF does not necessarily involve the use of
esoteric AI technologies, as the term “semantic” might suggest. RDF
also provides essential support for more straightforward applications, such
as providing simple information about Web content (provenance, content ratings),
defining privacy policies, specifying site maps, or supporting description-based
Web service brokering.
2. The general idea
The initial impetus for RDF was to provide a simple way to describe the
characteristics of (facts about) Web resources, e.g., Web pages. For
example, imagine that we want to record the fact that someone named John Smith
created a particular Web page. A straightforward way to state this fact in
English would be in the form of a simple statement, e.g.:
“the creator of [the particular Web page we’re talking about] is ‘John Smith’”
This statement has several distinct parts, which illustrate that, in order to
describe the characteristics of something, we need ways to identify a number
of things:
- We need a way to identify the thing we want to describe (the Web page,
in this case)
- We need a way to identify a specific characteristic (the creator) of
the thing that we want to describe
- We need a way to identify a specific value (the creator’s name) we
want to assign to this characteristic, for the thing we want to describe
The Web page can be identified by its URL (Uniform Resource Locator), say
http://www.foobar.org/index.html, so we can now write the statement as:
“the creator of http://www.foobar.org/index.html is ‘John Smith’”
In this statement, in addition to using a URL to identify the Web page,
we’ve used an English string, “creator”, to identify the characteristic we
want to talk about, and another string, “John Smith”, to identify the value
we want to assign to this characteristic (the creator’s name).
We can describe other characteristics of this Web page by writing additional
statements, using the same techniques to identify the page, the characteristics,
and their values. For example, to specify the date the page was created,
and the overall subject of the page, we could invent additional strings
for the characteristics, and their values, and write the additional statements:
“the date-created of http://www.foobar.org/index.html is ‘August 16, 1999’”
“the subject of http://www.foobar.org/index.html is ‘my home page’”
RDF basically defines a model for making such statements that describe
resources, using specific kinds of identifiers for the resources we want
to describe, the characteristics we want to use to describe them, and the
values we want to assign to those characteristics for each resource.
So far, we’ve seen the use of URLs to identify the resources we want to
talk about, and strings for the characteristics and values. Later,
we’ll see that more general usage of RDF involves using more general types
of identifiers.
The RDF model uses a particular terminology for talking about the various
parts of statements. Specifically, the thing the statement is about
(the resource, the Web page in this example) is called the subject.
The attribute or characteristic of the subject that the statement specifies
(creator or date-created in this case) is called the predicate (borrowing
a term from mathematical logic), and the value of the predicate (or characteristic)
is called the object. So, taking the statement
“the creator of http://www.foobar.org/index.html is ‘John Smith’”
in RDF terminology:
- the subject is the resource identified by http://www.foobar.org/index.html
- the predicate is the characteristic identified by “creator”
- the object is the person identified by “John Smith”
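As a minimal sketch of this terminology, the statement can be modeled in Python as a three-field record. The Statement class and its field names are illustrative assumptions, not something RDF itself defines.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF-style statement (hypothetical helper, not an RDF API)."""
    subject: str    # the thing the statement is about
    predicate: str  # the characteristic being described
    object: str     # the value given to that characteristic


stmt = Statement(
    subject="http://www.foobar.org/index.html",
    predicate="creator",
    object="John Smith",
)

# Read the record back in the English form used in the text.
print(f"the {stmt.predicate} of {stmt.subject} is {stmt.object!r}")
```

Because the record is an ordinary tuple, it can also be unpacked positionally as subject, predicate, and object.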
There are a number of different ways of thinking about individual RDF statements,
and the collections of statements that might be used to describe a given
resource. For example, we can think of the statements describing a
given resource as:
- entries in a simple record or catalog listing describing the resource
in a data processing system.
- rows in a simple relational database.
- simple assertions in mathematical logic (this will prove to be a particularly
important way of thinking about these statements, which we’ll discuss in
more detail later).
RDF “officially” models statements as nodes and arcs in a graph or diagram.
In this model, a statement is represented by a node for the subject, a node
for the object, and a labeled arc between them for the predicate, as in:
Collections of statements are represented by collections of nodes and arcs,
the subjects and objects forming the nodes of the graph, and the predicates
(or characteristics) forming the arcs or links between these nodes.
So the three statements we’ve given so far would be represented by the following
graph:
The graph is technically a labeled directed graph, since the arcs have
labels, and are “directed” (point in a specific direction, from subject
to object).
Sometimes it is not convenient to draw graphs (and, in particular, it is
hard for machines to directly process them), so an alternative way of thinking
about the statements, called triples, can be used. In the triples
notation, each statement is written as a simple triple of subject, predicate,
and object, in that order. The triples representing the above three
statements would be written:
http://www.foobar.org/index.html “creator” “John Smith”
http://www.foobar.org/index.html “date-created” “August 16, 1999”
http://www.foobar.org/index.html “subject” “my home page”
Each triple corresponds to a single arc in the graph, complete with the
arc’s beginning and ending nodes (the subject and object of the statement).
Unlike the graph notation, the triple notation requires that a node be separately
identified for each statement it appears in. So, for example, http://www.foobar.org/index.html
appears three times (once in each triple) in the triple representation of
the graph, but only once in the actual graph.
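The difference between the two notations can be sketched in Python by turning the three triples into a graph. The adjacency-map layout below is one possible representation chosen for illustration; it also simplifies by treating each literal value as a node keyed by its text.

```python
from collections import defaultdict

triples = [
    ("http://www.foobar.org/index.html", "creator", "John Smith"),
    ("http://www.foobar.org/index.html", "date-created", "August 16, 1999"),
    ("http://www.foobar.org/index.html", "subject", "my home page"),
]

# Adjacency map: subject node -> list of (arc label, object node).
graph = defaultdict(list)
nodes = set()
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))
    nodes.update((subject, obj))

page = "http://www.foobar.org/index.html"

# How many times the page is written out as a subject in the triples.
mentions = sum(1 for s, _, _ in triples if s == page)

print(mentions)          # 3: once per triple
print(len(nodes))        # 4: the page node plus three literal nodes
print(len(graph[page]))  # 3: arcs leaving the single page node
```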
In each of the statements we’ve considered so far, the object has been
something identified by a simple string value, like “John Smith”.
In RDF, the objects in statements may be identified by any type of literal
value. More importantly, the objects in RDF statements may also be
other Web resources, identified by their URLs. This allows us to represent
not only the characteristics (or properties) of individual resources, but
also relationships between those resources and others. So, for example,
we could represent the fact that the resource http://www.barbaz.org/myprojects.html
has the same creator as the resource http://www.foobar.org/index.html, by
the statement
http://www.foobar.org/index.html “sameCreatorAs” http://www.barbaz.org/myprojects.html
And we could represent the fact that http://www.foobar.org/index.html links
to http://www.w3.org/ by the statement
http://www.foobar.org/index.html “linksTo” http://www.w3.org/
Adding these two additional statements to the original ones would give
us the graph shown below:
This graph illustrates another aspect of the RDF graph notation: nodes
that represent resources are shown as ellipses, while nodes that represent
literal values are shown as boxes.
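One way to keep the two kinds of object apart in software is to give each its own type. The sketch below is an assumption made for illustration: the Resource and Literal classes and the shape function are hypothetical names, not part of RDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A node identified by a URL (drawn as an ellipse)."""
    url: str


@dataclass(frozen=True)
class Literal:
    """A plain value such as a string (drawn as a box)."""
    value: str


page = Resource("http://www.foobar.org/index.html")

statements = [
    (page, "creator", Literal("John Smith")),
    (page, "sameCreatorAs", Resource("http://www.barbaz.org/myprojects.html")),
    (page, "linksTo", Resource("http://www.w3.org/")),
]


def shape(node):
    """Return the drawing convention used for a node in the graph."""
    return "ellipse" if isinstance(node, Resource) else "box"


for subject, predicate, obj in statements:
    print(f"{predicate}: {shape(subject)} -> {shape(obj)}")
```

Only statements whose object is a Resource express a relationship between two resources; the others attach a plain value to the subject.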
3. RDF identifiers
So far, we’ve described some of the basic ideas behind RDF, showing how
simple statements composed of subjects, predicates, and objects provide a
way of describing Web resources. However, to simplify the initial
presentation, we’ve oversimplified some of these ideas. It’s now time
to develop these ideas more fully, since they provide much of the potential
power of RDF.
We first need to provide further detail about how RDF actually identifies
the subjects, predicates, and objects in statements. So far, the identifiers
we’ve used are:
- URLs to identify Web pages (the only subjects we’ve talked about so
far)
- either simple literal values or other URLs to identify objects.
- simple strings (like “creator”) to identify predicates
In principle we would like to be able to record information about many
things in addition to Web pages. In particular, we’d like to record
information about lots of things that don’t have URLs. For example,
I don’t have a URL, and yet my company needs to record all sorts of things
about me in order to pay my salary, keep track of the work that I’ve been
doing, and so on. My doctor needs to record other sorts of things about
me in order to keep track of my medical history: tests that have been
performed (and the results, who performed them, and when), shots I’ve received,
etc.
For many years we have used files (both manual and automated) to record
information about lots of things that don’t have URLs, and the way we identify
those things is by assigning them identifiers: values that we uniquely associate
with individual things. The identifiers we use to identify various
kinds of things go by names like “Social Security Number”, “Part Number”,
“license number”, “employee number”, “user-id”, etc. In some cases,
these identifiers (such as Social Security Numbers) are assigned by an official
authority of some kind. In other cases, identifiers are generated by
a private organization or individual. In some cases, these identifiers
have a national or international scope within which they are unique (a Social
Security Number has national scope), while in other cases they may only be
unique within a very limited scope (my employee number is only unique among
the numbers assigned by my specific employer). Nevertheless, these
identifiers serve, if used properly, to identify the things we want to talk
about.
As we’ve seen, the Web already provides one form of identifier, the Uniform
Resource Locator (URL). A URL is a string that identifies a Web resource
by representing its primary access mechanism (e.g., its network “location”).
However, URLs are a subset of a more general and powerful concept, the Uniform
Resource Identifier (URI). A Uniform Resource Identifier (URI) is defined
[by RFC2396] as “a compact string of characters for identifying an abstract
or physical resource”. Unlike the URL, the URI is not limited to identifying
things that have network locations, or use other computer access mechanisms.
In fact, a URI can be assigned to identify anything, including
- network-accessible things, such as an electronic document, an image,
a service (e.g., "today's weather report for Los Angeles"), or a collection
of other resources.
- things that are not network-accessible, such as human beings, corporations,
and bound books in a library.
Using URIs, we can generalize the resources we can talk about in RDF.
Moreover, creation of URIs is decentralized; no one person or organization
controls who creates them or how they are used. You don’t need any
authority or permission to make a URI for something. You can even create
URIs for things you don’t own (just as you can use whatever name you like
for things you don’t own), or for abstract concepts that don’t physically
exist. The extensibility of URIs allows the introduction of identifiers
for any entity imaginable.
Since the URI is such a general identification mechanism, capable of identifying
anything, it should not be surprising that RDF uses URIs to identify all
the resources it talks about. Specifically, RDF uses URIs to identify
both the subjects and objects in RDF statements (although the objects in some
statements, such as age values or names, must still be given as literal values).
As we’ve just noted, using URIs allows us to talk about things that aren’t
“network accessible”, and to define relationships between such things as
well.
Now that we have URIs to identify resources, we can be more complete and
precise about recording information. For example, instead of identifying
the creator of the Web page in our original example by the string “John
Smith”, we can assign him a URI, say (using a URI based on his employee
number) http://www.foobar.org/staffid/85740. The RDF statement stating
this fact would then have the graph:
or, in the triples notation:
http://www.foobar.org/index.html “creator” http://www.foobar.org/staffid/85740
One advantage of using a URI to identify the creator of the page in this
example is that we can be precise in our identification. That is,
the creator of the page isn’t the string “John Smith”, or any one of the
thousands of people having “John Smith” as their name, but the particular
John Smith associated with that URI. Moreover, since we have a URI
for the creator of the page, it is a full-fledged resource, and we can record
additional information about him, such as his name, and age, as in the graph
or the triples
http://www.foobar.org/index.html “creator” http://www.foobar.org/staffid/85740
http://www.foobar.org/staffid/85740 “name” “John Smith”
http://www.foobar.org/staffid/85740 “age” “27”
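Because the creator is now a resource, a program can follow the arc from the page to the creator and then ask further questions about him. A minimal Python sketch, in which the objects helper is a hypothetical function written for this example:

```python
triples = [
    ("http://www.foobar.org/index.html", "creator",
     "http://www.foobar.org/staffid/85740"),
    ("http://www.foobar.org/staffid/85740", "name", "John Smith"),
    ("http://www.foobar.org/staffid/85740", "age", "27"),
]


def objects(triples, subject, predicate):
    """Return every object of the statements matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Step 1: find who created the page.
creator = objects(triples, "http://www.foobar.org/index.html", "creator")[0]

# Step 2: the creator is itself a subject, so look up statements about it.
name = objects(triples, creator, "name")[0]
age = objects(triples, creator, "age")[0]

print(creator, name, age)
```

Had the creator been recorded only as the string “John Smith”, the second step would have had nothing unambiguous to look up.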
However, in these latest examples, we’ve still oversimplified something.
RDF doesn’t just use URIs to identify subjects and objects in RDF
statements; RDF also uses URIs to identify the predicates in RDF statements.
That is, rather than actually using simple strings such as “creator” or
“name” to identify predicates, RDF uses URIs to identify them.
Using URIs to identify predicates is important for a number of reasons.
First, it allows us to distinguish the predicates we use from the predicates
someone else may use that would otherwise be identified by the same text
string. For instance, in our example, foobar.org uses “name” to mean
the full name written out as a string (e.g., “John Smith”), but someone else
may intend it to mean only the surname (e.g., “Smith”). A program encountering
“name” on the Web wouldn’t necessarily know which was meant. However,
if foobar.org writes http://www.foobar.org/terms/name for its “name” predicate,
and the other person writes http://geneology.org/terms/name for hers, we
can keep them straight. Another reason why it is important to use URIs
to identify predicates is that it allows us to consider RDF predicates as
resources themselves. As a result, we can record descriptive information
about the predicates (e.g., the English description of what foobar.org means
by “name”), simply by making additional RDF statements about them.
Finally, using URIs to identify predicates, as well as subjects and objects,
allows us to begin to develop and use a shared vocabulary on the Web, reflecting
(and creating) a shared understanding of the concepts we talk about.
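The first point can be sketched in Python. The two predicate URIs are the ones used in the text; attaching both to the same staff resource, and the values helper, are assumptions made only for illustration.

```python
FOOBAR_NAME = "http://www.foobar.org/terms/name"  # full name, per foobar.org
OTHER_NAME = "http://geneology.org/terms/name"    # surname only, per the other site

JOHN = "http://www.foobar.org/staffid/85740"

# Hypothetical data: both vocabularies describe the same resource.
triples = [
    (JOHN, FOOBAR_NAME, "John Smith"),
    (JOHN, OTHER_NAME, "Smith"),
]


def values(triples, predicate):
    """Return the objects of every statement that uses the given predicate."""
    return [o for _, p, o in triples if p == predicate]


full_names = values(triples, FOOBAR_NAME)
surnames = values(triples, OTHER_NAME)

# Cutting each URI down to its last path segment brings the ambiguity back:
# both predicates collapse into the single label "name".
labels = {p.rsplit("/", 1)[-1] for _, p, _ in triples}

print(full_names, surnames, labels)
```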
For example, now that we know to use URIs (where we can) to identify all
the parts of an RDF statement, we can write the statement “the creator of
http://www.foobar.org/index.html is ‘John Smith’” as the triples
http://www.foobar.org/index.html http://purl.org/dc/elements/1.1/creator http://www.foobar.org/staffid/85740
http://www.foobar.org/staffid/85740 http://www.foobar.org/terms/name “John Smith”
The URI http://purl.org/dc/elements/1.1/creator for the “creator” predicate
in the first triple is an unambiguous reference to the “creator” attribute
in the Dublin Core metadata attribute set, a widely-used collection of attributes
(predicates) for describing information of all kinds. The writer of
this triple is effectively saying that the relationship between the Web
page (identified by http://www.foobar.org/index.html ) and the creator of
the page (a distinct person, identified by http://www.foobar.org/staffid/85740
) is exactly the concept defined by http://purl.org/dc/elements/1.1/creator.
Moreover, anyone else, or any program, which understands http://purl.org/dc/elements/1.1/creator
will know exactly what is meant by this relationship.
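Once every part of a statement is either a URI or a literal, the triples can be written in a line-based form that software can parse. The sketch below emits lines in the style of the N-Triples syntax: URIs in angle brackets, literals in double quotes, and a period closing each statement. The is_uri test is a crude assumption for this example, and the escaping covers only the most common cases.

```python
def is_uri(term):
    """Crude check (an assumption of this sketch): URIs start with http://."""
    return term.startswith("http://")


def escape(text):
    """Escape backslashes, double quotes, and line breaks in a literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_term(term):
    """Write a URI in angle brackets and anything else as a quoted literal."""
    return f"<{term}>" if is_uri(term) else f'"{escape(term)}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(
        " ".join(to_term(t) for t in triple) + " ." for triple in triples
    )


triples = [
    ("http://www.foobar.org/index.html",
     "http://purl.org/dc/elements/1.1/creator",
     "http://www.foobar.org/staffid/85740"),
    ("http://www.foobar.org/staffid/85740",
     "http://www.foobar.org/terms/name",
     "John Smith"),
]

print(to_ntriples(triples))
```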
@@ this could be developed a bit better; we also might want to mention,
now that we’re using URIs for all the parts of the statements, that there
is an “official” syntax for these triples, called N-Triples.
4. Complex Data
Things would be very simple if the only types of information we had to
record about things were obviously in the form of the simple RDF statements
we’ve illustrated so far. However, most real-world data involves structures
that are more complicated than that, at least on the surface. For
instance, in our original example, we recorded the date the Web page was
created as a simple string value. However, suppose we wanted to record
the month, day, and year as separate pieces of information? Or, in
the case of John Smith’s personal information, suppose we wanted to record
his address. We might write the whole address out as a string, as
in
http://www.foobar.org/staffid/85740 http://www.foobar.org/terms/address “1501 Grant Avenue, Bedford, Massachusetts 01730”
However, suppose we wanted to record the various pieces of information
about his address as separate street, city, state, and Zip code values?
How do we do this using RDF?
In RDF, we represent such structured information by considering the aggregate
thing we want to talk about (like “address”) as a separate resource, and
then make separate statements about that new resource. So, in the RDF
graph, in order to break up John Smith’s address into its component parts,
we create a new node to represent the concept “John Smith’s address”.
We then write RDF statements (create additional arcs and nodes) with that
node as the subject, to represent the additional information, as in:
We can represent the same information using the triples notation, using a
technique that relational databases have used for many years.
The idea is to assign a separate identifier to the aggregate thing we want
to talk about, and then record separate statements about that aggregate
thing, identifying the aggregate thing using its newly-assigned identifier.
So, in order to break up John Smith’s address into its component parts,
we assign a new URI to identify “John Smith’s address”, and proceed to write
RDF statements with that URI as the subject (in effect, the new identifier
represents the node we added to the graph above to represent “John Smith’s
address”), as in:
http://www.foobar.org/staffid/85740 http://www.foobar.org/terms/address http://www.foobar.org/addressid/85740
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/street “1501 Grant Avenue”
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/city “Bedford”
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/state “Massachusetts”
http://www.foobar.org/addressid/85740 http://www.foobar.org/terms/Zip “01730”
Notice that, in the graph representation of this example, we did not give
the new node representing the concept of John Smith’s address a URI (although
we could have), while in the triples representation we had to assign an
identifier to represent that concept. The reason is that in the graph
representation the node can stand for itself, without any further identification,
while in the triples representation the only way to refer to the thing being
described (the address in this case) is by giving it an explicit identifier.
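The structured address can be sketched in Python using the same triples. The describe helper is a hypothetical function written for this example, and it assumes each predicate has at most one value per subject.

```python
TERMS = "http://www.foobar.org/terms/"
JOHN = "http://www.foobar.org/staffid/85740"
ADDRESS = "http://www.foobar.org/addressid/85740"

triples = [
    (JOHN, TERMS + "address", ADDRESS),
    (ADDRESS, TERMS + "street", "1501 Grant Avenue"),
    (ADDRESS, TERMS + "city", "Bedford"),
    (ADDRESS, TERMS + "state", "Massachusetts"),
    (ADDRESS, TERMS + "Zip", "01730"),
]


def describe(triples, subject):
    """Map predicate -> object for one subject (assumes one value each)."""
    return {p: o for s, p, o in triples if s == subject}


# Follow the "address" arc from John Smith to the address resource.
address_uri = describe(triples, JOHN)[TERMS + "address"]

# The address resource carries the component parts as separate statements.
parts = describe(triples, address_uri)

one_line = "{}, {}, {} {}".format(
    parts[TERMS + "street"],
    parts[TERMS + "city"],
    parts[TERMS + "state"],
    parts[TERMS + "Zip"],
)
print(one_line)
```

Reassembling the parts yields the single-string form of the address, while each component stays individually addressable.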
@@this could lead naturally into a discussion of “anonymous” resources
and bNodes; but do we want to discuss that in the Primer? What
else do we want to consider in the “data model”?
@@we also might want to say something to the effect that RDF statements
should be considered assertions or facts (by whoever wrote the RDF), and lead
gently into the model theory that way.