Data on the Web
Abstract
The ability of an organization to run smoothly and react to change
with agility depends on many factors. One of the most important
factors is rapid access to critical data, the information necessary to
quickly make informed decisions. In order to assure that critical data
can be accessed efficiently, it must be available in a standard form
and it must be consistent.
Making data available in a standard form involves exposing existing
data from proprietary formats, application-specific standards
(including HTML and XML), and databases in an open,
application-neutral standard such as RDF. Assuring that it is
consistent requires the ability to identify equivalent terms used by
different data sources so that the data can be viewed as a seamless
whole.
Introduction
The Architecture of the World Wide Web [AWWW] describes the
operation of the information space which is the World Wide Web. It
deals mostly with the use of the web for hypertext documents, which
form the majority of the web in 2006.
This document describes the use of the web for data. Like the AWWW,
it motivates the selection of principles, constraints, and good
practices with simplified scenarios that it describes in some detail.
Unlike the AWWW, which describes long-existing practice, this document
describes technology which at the time of this writing is deployed by
a growing but relatively small number of sites. The standards
described herein are around five years old rather than 15 years old.
This document therefore makes a far more detailed analysis of its
motivating examples.
Acme Sales
The subject of our primary scenario is the Acme Widget Company, its
employees, and the events that befall them.
For many years, the success of the Acme Widget Company depended on
the contents of an old brown notebook which Joe carried everywhere.
Joe started it many years ago when his memory and a pocketful of
business cards proved insufficient. In the notebook, Joe
maintained the master list of all their customers, prospective and
actual. Each entry included both the mundane details of their names
and addresses and also notes on their foibles and particular
interests.
Recently, to everyone's great relief, Joe has been persuaded to
transfer all of this information to his PDA which he is required to
back up regularly.
Last year, when Anthea started seeing more customers herself, she
wished she could have the contents of Joe's notebook at her
fingertips. And, of course, now that it was backed up on the company
network, she realized there was no reason that she couldn't.
She asked Leslie, a summer intern, to arrange for the address book
file to be on the internal website, accessible only to management
in case some of the notes were personal.
Leslie knew that it would be easy to put the data on the web server,
but she also knew that Anthea was going to want to access the data
with her laptop computer whereas Joe preferred his PDA. Furthermore,
she knew it was only a matter of time before the other sales managers
would want to access the data as well.
Leslie put the data up in VCards so that it would be easy to import
into the various applications that the sales managers were using. She
also generated web pages so that any of the managers could review the
contacts, even if they didn't need to have their own copy.
When Joe saw the web pages, he noticed that there was more
information on them than he had in his PDA. In particular, he noticed
that each customer page showed the latitude and longitude off in one
corner. Anthea didn't know how it got there, so they asked Leslie.
“Oh,” said Leslie, “that's for something I was working on. It's
not finished yet, but let me show you.”
Leslie opened up a data browser and showed them all of the contact
information along with other information she'd been able to add to
each contact.
“You see,” she explained, “what I did was I converted all of the
data into RDF, that's just a generic data format, you don't really need
to care about that, then I used a public web service to convert all
the postal addresses into latitudes and longitudes.”
“Why?” asked Joe.
“Because then my data browser can put them on a map.”
Leslie clicked on a tab in her browser and revealed a local map
dotted with colored markers. The green ones, she explained, were for
customers and the blue ones were for leads.
Joe and Anthea spent a few minutes looking at the map and
discussing the various customers and leads it revealed. One thing it
revealed was that Joe appeared to do a very good job following up on
all the leads that fell along the routes he took between customers,
but he was neglecting potential leads that were off his normal
routes.
Joe mentally plotted the routes he would take over the next few
days to follow up on a bunch of leads while Anthea asked if she could
see her contacts on that map as well. Leslie explained that it would
be easy: as soon as her data was converted to RDF, it would just show
up automatically.
“How can that be?” Anthea asked.
“One of the cool things about RDF,” Leslie explained, “is that it's
very easy to aggregate data stored in RDF. If you have two sets of RDF
data, you can always just mix them together and you get the union without
doing any extra work.”
Leslie said she would convert Anthea's contacts into RDF next and
they left her to carry on.
This is a story of the progression of information through forms in
which it is progressively more useful.
In the format Joe's PDA used natively, it could be backed up and
reused, at least by others with a similar brand of PDA. Leslie, coming
to the company with a knowledge of data standards, made sure that it
was available in standard formats that many more people would be able
to use.
The VCard format [VCARD] is a non-proprietary application-specific
data format. This means that the file will be usable by any
VCard-compliant address book software in the future. This preserves
the data from loss due to a change of PDA manufacturer or address book
software.
The RDF format [RDF] is a non-proprietary cross-application format.
This means that the file is usable by generic data processing, storage
and visualization tools. It also means that it can be merged with data
from other applications to form a web of linked data.
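The merging behaviour Leslie describes can be sketched in a few lines. The sketch below models an RDF graph as a plain Python set of (subject, predicate, object) triples. The example.org namespaces, the property names, and the sample contacts are all hypothetical and are not taken from Acme's data.

```python
# Minimal sketch: an RDF graph modeled as a set of (subject, predicate, object)
# triples. All URIs and values below are hypothetical illustrations.

EX = "http://example.org/contacts/"   # assumed namespace for contacts
V = "http://example.org/vocab#"       # assumed vocabulary namespace

joes_contacts = {
    (EX + "c1", V + "name", "Widget World"),
    (EX + "c1", V + "status", "customer"),
    (EX + "c2", V + "name", "Gadget Depot"),
    (EX + "c2", V + "status", "lead"),
}

antheas_contacts = {
    (EX + "c2", V + "name", "Gadget Depot"),   # same statement as in Joe's data
    (EX + "c2", V + "phone", "555-0100"),
    (EX + "c3", V + "name", "Sprocket Supply"),
}

# Merging two graphs is a set union: duplicate statements collapse, and
# statements about the same URI end up describing the same node.
merged = joes_contacts | antheas_contacts


def describe(graph, subject):
    """Return all (predicate, object) pairs known about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


for predicate, value in describe(merged, EX + "c2"):
    print(predicate, value)
```

The union only lines up because both sources use the same URI for the same contact. Graphs containing blank nodes need extra care when merged, which this sketch does not attempt.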
Leslie shows the data in a generic data browser.
The conversion of the addresses into latitude and longitude
was a stunt to show off the data, but it demonstrates that geospatial
position is a powerful property against which to display and compare
data.
Acme Manufacturing
Acme Widgets totally depends on Heather. She keeps the machines
running, the machines which make the widgets without which there would
be no Acme. They have 56 machines in all, spread over 8 sites in the
same small town. Heather keeps a spreadsheet, using a row for each
machine, with dates of last and next service, building number,
capacity of the machine, anticipated replacement parts, and other information
related to maintenance.
Heather has to keep in constant touch with Chris, the production
manager. Chris is a crucial person too because he plans the production
runs. It's his planning that gives Acme the ability to deliver to each
customer on time. For each run, he knows which machine will be used
for each order, when they will start and finish, which operator is
assigned to the job, what the expected overruns are, and other information
related to production. He plans several weeks in advance using a
relational database with a PHP-based web interface that allows him
to view the data in various ways and make changes to adapt the schedules
when necessary.
Heather and Chris are always reviewing printed copies of the data
from their respective systems. They have to make sure that Heather has
time to get in and maintain the machines between runs and that Chris
is never starting a production run on a machine which hasn't been
recently oiled and cleaned. Leslie hears them talking over lunch and
offers to help.
Soon Leslie has a script on her laptop which puts the
spreadsheet data in RDF on the internal website. She also sets up an
open source program to provide an RDF view of Chris's database. They
fire up Leslie's data browser and look at the combined information.
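A script of the kind Leslie wrote for the spreadsheet might look roughly like the following. This is only a sketch: the column names, the intranet.example.org URI space, and the sample rows are assumptions, and the output is written as N-Triples for simplicity.

```python
import csv
import io

# Hypothetical export of Heather's spreadsheet; column names are assumptions.
SHEET = """building,machine,last_service,next_service
3,7,2006-05-02,2006-06-02
5,2,2006-05-10,2006-06-10
"""

BASE = "http://intranet.example.org/machines/"   # assumed URI space
VOCAB = "http://intranet.example.org/vocab#"     # assumed vocabulary


def machine_uri(building, machine):
    """Build one URI per machine from its building and machine numbers."""
    return f"{BASE}b{building}-m{machine}"


def rows_to_ntriples(text):
    """Turn each spreadsheet row into N-Triples lines, one per date column."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = machine_uri(row["building"], row["machine"])
        for column in ("last_service", "next_service"):
            lines.append(f'<{subject}> <{VOCAB}{column}> "{row[column]}" .')
    return lines


for line in rows_to_ntriples(SHEET):
    print(line)
```

Cell values containing quotes or backslashes would need escaping before being written as literals; the sample data avoids that case.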
Their common interests, Leslie notes, are the machines, the
maintenance dates, and the production dates. She shows a list, in one
window, of all the maintenance dates on each machine, and in another
window, of all the runs for that machine. Then she merges the two
lists and displays them interleaved on a time-line. Immediately they
notice a problem. One machine has two consecutive production runs with
no routine cleaning between them. Heather and Chris are both relieved
that Leslie has helped them identify a potentially costly
oversight.
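The check that exposed the problem, interleaving maintenance and production dates and looking for back-to-back runs, can be sketched as follows. The dates and the two event kinds are hypothetical; the sketch assumes every maintenance entry counts as a cleaning.

```python
from datetime import date

# Hypothetical events for one machine, each as (date, kind).
maintenance = [(date(2006, 6, 1), "maintenance"), (date(2006, 6, 20), "maintenance")]
runs = [(date(2006, 6, 5), "run"), (date(2006, 6, 12), "run"), (date(2006, 6, 25), "run")]


def unmaintained_runs(maintenance, runs):
    """Return dates of runs that directly follow another run on the merged timeline."""
    timeline = sorted(maintenance + runs)   # interleave both lists by date
    flagged = []
    previous_kind = None
    for when, kind in timeline:
        if kind == "run" and previous_kind == "run":
            flagged.append(when)
        previous_kind = kind
    return flagged


print(unmaintained_runs(maintenance, runs))
```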
Chris is happy with the data viewer, and learns to use it himself.
Heather doesn't want to use the data viewer, but she would like the
same data on her calendar. Leslie fixes up a calendar feed using the
iCalendar [ICAL] application-specific format, which can easily be
generated from RDF. Heather gets a calendar for each machine, which
contains both her data and the production data from Chris.
Here, Leslie has merged data from two traditional forms: a
relational database and a spreadsheet. In each of these cases, the data
is well organized: it has a clearly defined structure, but its
meaning, while clear to Heather and Chris, is undocumented. The
spreadsheet has rows and columns and the database indicates how
its tables fit together, but nothing in either case indicates what
they mean.
Leslie's task involved not only the syntactic conversion from the
original formats, but also the assignment of common terms to
equivalent abstractions. For example, where Heather identified each
machine by building number and machine number, Chris used building
number and operator. Leslie was able to combine the information that
Heather provided with information from the duty roster to infer which
machine was identified by building and operator. Then she could assign
a common term to them, using a URI.
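The inference through the duty roster amounts to a join on building and operator. A minimal sketch, assuming that within one building each operator is assigned to a single machine; the roster entries, operator names, order numbers, and URI pattern are hypothetical:

```python
# Hypothetical data. Heather keys machines by (building, machine number);
# Chris keys them by (building, operator); the duty roster links the two.
BASE = "http://intranet.example.org/machines/"   # assumed URI space

heathers_machines = [(3, 7), (3, 8), (5, 2)]                  # (building, machine)
duty_roster = {(3, 7): "pat", (3, 8): "sam", (5, 2): "pat"}   # machine -> operator
chris_runs = [
    {"building": 3, "operator": "sam", "order": "A-100"},
    {"building": 5, "operator": "pat", "order": "A-101"},
]


def machine_uri(building, machine):
    """One shared URI per machine, usable by both systems."""
    return f"{BASE}b{building}-m{machine}"


# Invert the roster: (building, operator) -> machine URI.
by_operator = {
    (building, duty_roster[(building, machine)]): machine_uri(building, machine)
    for building, machine in heathers_machines
}


def uri_for_run(run):
    """Look up the shared machine URI for one of Chris's runs."""
    return by_operator[(run["building"], run["operator"])]


for run in chris_runs:
    print(run["order"], uri_for_run(run))
```

If an operator worked on two machines in the same building, the inverted roster would silently keep only one of them, so a real conversion would need to detect that case.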
Best Practice
Use URIs for things, not just for documents.
Leslie used ordinary http: URIs for each machine so
that she could store both HTML and RDF information about the machine
at that location. She knows that this will make it easier for the
managers and operators to look up information from the schedules.
Best Practice
Ensure that looking up a URI for anything
returns relevant useful information in RDF.
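One way to offer both HTML and RDF at the same URI is to choose the representation from the request's Accept header. The following is a simplified sketch rather than a complete implementation of HTTP content negotiation: it ignores wildcards such as */*, and the list of available media types is an assumption.

```python
# Media types this hypothetical server can produce for a machine URI.
AVAILABLE = ["text/html", "application/rdf+xml"]


def parse_accept(header):
    """Parse an Accept header into (media type, q) pairs, highest q first."""
    choices = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        quality = 1.0                      # q defaults to 1 when omitted
        for field in fields[1:]:
            if field.startswith("q="):
                quality = float(field[2:])
        choices.append((fields[0], quality))
    # sorted() is stable, so equally preferred types keep the client's order.
    return sorted(choices, key=lambda choice: -choice[1])


def choose(header, available=AVAILABLE):
    """Pick the most preferred available type; fall back to the first one."""
    for media_type, quality in parse_accept(header):
        if quality > 0 and media_type in available:
            return media_type
    return available[0]


print(choose("application/rdf+xml, text/html;q=0.5"))
print(choose("text/html, application/xhtml+xml"))
```

A data browser asking for application/rdf+xml would receive the RDF, while an ordinary web browser would receive the HTML page. Falling back to HTML when nothing matches is a design choice made for this sketch.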
Leslie also established common terms for the starting and ending
times used in each schedule, drawing on the iCalendar concepts of
“dtStart” and “dtEnd” to unify them.
Use of common terms made it easy for her to generate a calendar
feed for Heather.
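Once every event carries the same start and end properties, producing a calendar feed is mostly a matter of serialization. A minimal sketch, assuming the events have already been pulled out of the RDF into dictionaries; the UIDs, summaries, PRODID value, and times are hypothetical:

```python
from datetime import datetime

# Hypothetical events that already share common start and end properties.
events = [
    {"uid": "run-1@example.org", "summary": "Production run A-100",
     "dtstart": datetime(2006, 6, 5, 8, 0), "dtend": datetime(2006, 6, 5, 16, 0)},
    {"uid": "svc-1@example.org", "summary": "Routine cleaning",
     "dtstart": datetime(2006, 6, 6, 9, 0), "dtend": datetime(2006, 6, 6, 10, 0)},
]


def ical_time(moment):
    """Format a datetime as an iCalendar local date-time value."""
    return moment.strftime("%Y%m%dT%H%M%S")


def to_icalendar(events, stamp):
    """Serialize events as a minimal iCalendar document with CRLF line endings."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0",
             "PRODID:-//example.org//machine feed//EN"]
    for event in events:
        lines += [
            "BEGIN:VEVENT",
            f"UID:{event['uid']}",
            f"DTSTAMP:{ical_time(stamp)}Z",   # stamp is assumed to be in UTC
            f"DTSTART:{ical_time(event['dtstart'])}",
            f"DTEND:{ical_time(event['dtend'])}",
            f"SUMMARY:{event['summary']}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


print(to_icalendar(events, datetime(2006, 6, 1, 12, 0)))
```

The sketch leaves out time zones, text escaping, and line folding, all of which a production feed would have to handle.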
Best Practice
Use widely shared URIs for common properties such as time,
location, email address, etc.
Acme Order Management
The order management system at Acme is maintained by Clare in the
Billing Department. She uses a custom application which, while not
specific to the business of widget manufacture, does a fine job of
tracking orders and customers.
Joe suggests Leslie take a look at the order system, as she seems
to have shed light on other parts of the company. Leslie is concerned
that this data, which is so crucial to the company, is stored in a
proprietary format. She calls the software company and discovers that,
in fact, since 2001 there has been an option to export the data as XML.
Leslie checks to make sure that the exported XML is complete and also
that the application can import its own XML data. Once she's confident
of that, Leslie makes sure that the XML data is part of the daily
backup to a different machine.
The XML data files seem to make sense and the software company has
provided annotated schemas that indicate how the various elements are used.
A customer file has customer elements which use a combination of
attributes and sub-elements to encode customer numbers, email
addresses, contact information, and so on. A separate order file has a
long list of orders which contain a customer number and many items
consisting of part numbers and quantities.
However, Leslie can't simply import this data into her generic data
browser because there is nothing in the file to tell a machine which
bits of it represent things, which are relationships between things,
and what sort of things these are anyway. Customer numbers, part
numbers, and quantities all just look like numbers.
By talking with Clare about what the system actually does, and
using information from the documents and their schemas, Leslie writes
an XSLT script to extract RDF from each file.
The differences between the original XML and RDF files are
basically that in the RDF file:
- The distinction between different classes of items (customers,
parts, quantities) and their relationships to each other are
manifestly exposed.
- Well-known things such as email address, and so on, are identified
with common terms. So addresses in the billing system use the same terms
as addresses in Joe's contact list, for example.
- Each important artifact, such as an order, a customer, or an Acme product
is given a URI. The URIs are carefully constructed so that each one uniquely
identifies a single artifact, no matter which file it originally occurred in.
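The extraction step can be illustrated with a short sketch. Leslie used XSLT; Python is used here only for illustration. The element and attribute names in the sample XML are assumptions, not the vendor's actual schema, and the URI space is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of the exported order file; element and attribute
# names are assumptions, not the vendor's actual schema.
ORDERS_XML = """
<orders>
  <order number="1001" customer="42">
    <item part="7" quantity="15000"/>
    <item part="9" quantity="200"/>
  </order>
</orders>
"""

BASE = "http://intranet.example.org/"          # assumed URI space
VOCAB = "http://intranet.example.org/vocab#"   # assumed vocabulary


def extract_triples(xml_text):
    """Make explicit which numbers name things and which are plain values."""
    triples = []
    for order in ET.fromstring(xml_text).iter("order"):
        order_uri = f"{BASE}orders/{order.get('number')}"
        # A customer number identifies a thing, so it becomes a URI.
        triples.append((order_uri, VOCAB + "customer",
                        f"{BASE}customers/{order.get('customer')}"))
        for position, item in enumerate(order.iter("item"), start=1):
            item_uri = f"{order_uri}#item{position}"
            triples.append((order_uri, VOCAB + "item", item_uri))
            # A part number also identifies a thing; a quantity stays a number.
            triples.append((item_uri, VOCAB + "part",
                            f"{BASE}parts/{item.get('part')}"))
            triples.append((item_uri, VOCAB + "quantity",
                            int(item.get("quantity"))))
    return triples


for triple in extract_triples(ORDERS_XML):
    print(triple)
```

Because the customer URI is built only from the customer number, the same customer receives the same URI whether it is first seen in the order file or in the customer file.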
Leslie checks out her work by firing up her generic data browser.
Starting with a customer, she can navigate through their recent orders
and the products they ordered. Starting with a product, she can
navigate through orders for that product, the companies that placed
those orders, etc.
Next she writes a small ontology to give her system more to go on.
The ontology describes semantic constraints such as two persons with
the same email address are, in fact, the same person and two companies
with the same homepage URI are, in fact, the same company, etc. It
also provides additional information such as user-friendly labels for
some of the Acme-specific terms that she's generated.
Finally, she pulls all of the data she's collected so far into the
browser. Now she can navigate from sales persons to customers to orders
to parts to machines to maintenance information to operators and back.
In many places, use of the same email address or home page allows
the system to merge data about different aspects of the same customer
(or company, machine, etc.) from the various sources.
In this format, all the distinct customers are easily identified.
Leslie runs off a list and asks Joe and the folks in billing to check
their records and add identifying information where it is missing. In the
process, a number of spelling errors are discovered, and out-of-date
email addresses are cleaned up.
Using an ontology allows the system to make inferences and automatically
manage redundant data.
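The rule that two persons with the same email address are the same person can be applied mechanically. The sketch below treats a hypothetical email property as uniquely identifying and rewrites every URI that shares a value for it to a single URI. It is a simplification of what an OWL-aware system would do, and all URIs and values are invented for illustration.

```python
# Hypothetical records from two sources, each with its own URI for a person.
MBOX = "http://example.org/vocab#email"   # assumed uniquely identifying property

triples = {
    ("http://example.org/contacts/c2", MBOX, "mailto:kim@gadget.example"),
    ("http://example.org/contacts/c2", "http://example.org/vocab#name", "Kim Lee"),
    ("http://example.org/billing/42", MBOX, "mailto:kim@gadget.example"),
    ("http://example.org/billing/42", "http://example.org/vocab#balance", 1200),
    ("http://example.org/billing/43", MBOX, "mailto:lou@sprocket.example"),
}


def merge_on(triples, identifying_property):
    """Rewrite nodes that share a value for the given property to one URI."""
    canonical = {}   # property value -> first subject in sorted order
    rename = {}      # subject -> canonical subject
    for s, p, o in sorted(triples, key=str):
        if p == identifying_property:
            rename[s] = canonical.setdefault(o, s)
    # Apply the renaming to subjects and to objects that name a merged node.
    return {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in triples}


merged = merge_on(triples, MBOX)
for triple in sorted(merged, key=str):
    print(triple)
```

After merging, the name from the contact list and the balance from billing hang off the same node. A node with several email addresses would require a more general equivalence-class approach than this single pass.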
Acme in Crisis
Crisis engulfs Acme on Saturday when building 3, in the old
rail-yard, burns to the ground.
Once the immediate crisis has passed, the fire has been
extinguished, and all of the people in building 3 have been accounted
for, the affected parties gather around a table in one of the
conference rooms.
They all know that the continued success of their business depends
on meeting as many of their commitments for widget deliveries as
possible.
Everyone wants to know: how bad is it? Which customers are going to
be affected and how important are they?
Chris brings up the data viewer as Leslie showed him weeks ago.
He starts at his schedule, which he knows. Looked at another way, his
schedule contains all the machines. The machines are identified by URI and
the URI for each machine returns information about the building that
it is in.
Now that he knows which machines are down, he can find out which
jobs were scheduled for those machines. It's a long list.
Joe points at the third one down, a run of 15,000 widgets. “Who's
that?” he asks.
“I'm not sure,” replies Chris, “I'm usually just concerned with
the production schedule.”
“Click on the order. Leslie says all the data is linked together
automatically.”
Chris clicks on the order which reveals the order number and other
details of the production run.
“Now click the order number,” suggests Clare.
From the order number, they find the customer number from the Billing
Department and that leads back to the customer name and address from
Joe's address book.
Everyone is relieved to see that only about fifty customers are
going to be impacted. But Joe knows that he's going to have to make
sure they all get a personal visit if he's going to maintain their
trust.
Recalling the mapping trick that Leslie showed them when she first
started, Joe and Anthea are able to plot all the affected customers
on a map. Now Joe can divide up the sales folks by region and make sure
everyone gets a visit tomorrow.
Here we see the Acme Widget Company adapting quickly despite the
adverse circumstances. The data they can dive into, and connect across
the company, is available because the individual parts of the company
have made all the data available in an application-neutral RDF format.
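The path the team clicked through, from building to machines to runs to orders to customers, corresponds to a simple traversal of the merged data. A minimal sketch over a set of triples; the abbreviated identifiers stand in for full URIs, and the vocabulary and sample data are hypothetical:

```python
V = "http://intranet.example.org/vocab#"   # assumed vocabulary

# Hypothetical merged data drawn from the maintenance, production, billing,
# and sales systems. Short identifiers stand in for full URIs.
graph = {
    ("machine:7", V + "building", "building:3"),
    ("machine:2", V + "building", "building:5"),
    ("run:1", V + "machine", "machine:7"),
    ("run:1", V + "order", "order:1001"),
    ("run:2", V + "machine", "machine:2"),
    ("run:2", V + "order", "order:1002"),
    ("order:1001", V + "customer", "customer:42"),
    ("order:1002", V + "customer", "customer:43"),
    ("customer:42", V + "name", "Gadget Depot"),
    ("customer:43", V + "name", "Sprocket Supply"),
}


def subjects(graph, predicate, obj):
    """All subjects that have the given predicate and object."""
    return {s for s, p, o in graph if p == predicate and o == obj}


def objects(graph, subject, predicate):
    """All objects for the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def affected_customers(graph, building):
    """Follow building -> machines -> runs -> orders -> customers -> names."""
    names = set()
    for machine in subjects(graph, V + "building", building):
        for run in subjects(graph, V + "machine", machine):
            for order in objects(graph, run, V + "order"):
                for customer in objects(graph, order, V + "customer"):
                    names |= objects(graph, customer, V + "name")
    return sorted(names)


print(affected_customers(graph, "building:3"))
```

No step in the traversal needs to know which system a statement came from; the shared URIs do the joining.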
Suppose this had not been done, and Joe had asked for a list of
customers whose orders involved jobs which were scheduled for machines
in building 3? It wouldn't have been an impossible task, but a
programmer would have had to sit down with each of the systems and
files and write code to extract and collate the relevant information.
It's unlikely that the task could have been achieved in the time scale
Joe needed to keep his customers' trust.
Instead, the staff would have spent all night poring over hard copy
output from the various systems, hoping that they didn't miss anyone.
Epilogue
@@Well, it was Leslie in a way. She put us in a
position where we could look at the state of Acme in depth, in any
way we needed to, at a moment's notice. That's what made
this possible.
It was leadership too - decisive action taken promptly.
We work as a team. Everyone does their part. We talk to
each other. And we share our data. And we connect our data.
Appendix Best Practices
A summary of the best practices used in
serving RDF data.
Serving Data
When defining a format for data which one would expect to be
machine-processed, and reused in combination with other web data,
then the data should be exported in RDF, possibly using content negotiation,
and possibly using GRDDL with an application-specific XML
schema, and possibly using alternative serializations such as N3
[and RIF?]
Any thing [abstract or concrete] about which information is given
should be identified with a URI.
A second-best alternative is to use a blank node, but include
values of Inverse Functional Property arcs to uniquely indirectly
identify each node.
That URI should be an HTTP URI.
Dereferencing the URI should yield generally useful information
about the thing: information the enquirer would expect to find
useful.
Dereferencing the URI should not yield an unmanageably large
amount of information about the thing.
When the thing is a Property or a Class, then the RDFS and OWL
ontologies are appropriate and conventional.
HASH/SLASH
When a URI for an arbitrary thing, other than an Information
Resource, is looked up using HTTP,
then there are two possibilities. The URI contains a hash,
@@@ etc see finding
MAKE LINKS OUT
One thing will typically be described by information in many
systems. It will therefore be identified by many URIs. This is to
be expected when different agents support the different
URIs. In this case systems should provide owl:sameAs links
to other URIs where they are known and it would be useful to the
enquirer.
On the choice of Ontology
1. Systems exporting data may use local HTTP terms [URIs for
properties and classes]. These systems will not however
benefit from a great level of interoperability. Their data
will be miscible with other RDF, but nodes will not merge.
2. Data exporters should make an effort to find terms in
well-known ontologies where they can, and use those instead of local
ones.
2a. It is not necessary at all to use complete ontologies; in
fact a wise design mixes and matches ontologies so as to use the
most appropriate and well-known individual term. This tends to
maximize data reuse.
3. Failing that, agreement may be sought within a community of
interest, application area, country, profession, etc. on common
terms in a new shared ontology.
3a. When building a new shared ontology, it is wise to consider
the implicit model used in existing specifications, such as XML
schemas, non-XML languages, database schemas, etc.
3b. Those making new ontologies should ensure the namespace URI is
persistent. They may apply to W3C for such a URI if the work is
W3C-related.
4. Whenever a local or shared ontology is used, an ontology
document should include where possible a relationship (e.g.
subclass, subproperty) to more widely known ontologies where they
exist. (cf Heflin)
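The benefit of point 4 can be illustrated with a small sketch. If a local property is declared a subproperty of a widely known one, a consumer that understands only the shared term can still use the data after a simple inference step. The local and shared namespaces and the sample triple are hypothetical; rdfs:subPropertyOf is the standard RDF Schema term.

```python
SUBPROP = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
LOCAL = "http://intranet.example.org/vocab#"   # assumed local ontology
SHARED = "http://example.org/shared#"          # assumed widely known ontology

graph = {
    # The ontology document links the local term to the shared one.
    (LOCAL + "salesEmail", SUBPROP, SHARED + "email"),
    # The data itself uses only the local term.
    ("http://intranet.example.org/customers/42", LOCAL + "salesEmail",
     "mailto:kim@gadget.example"),
}


def with_superproperties(graph):
    """Add the triples implied by rdfs:subPropertyOf until nothing new appears."""
    result = set(graph)
    while True:
        supers = {(s, o) for s, p, o in result if p == SUBPROP}
        implied = {(s, sup, o) for s, p, o in result
                   for sub, sup in supers if p == sub}
        if implied <= result:
            return result
        result |= implied


inferred = with_superproperties(graph)
for triple in sorted(inferred):
    print(triple)
```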
References
VCARD @@