Data on the Web

Introduction

The Architecture of the World Wide Web [AWWW] describes the operation of the information space which is the World Wide Web. It deals mostly with the use of the web for hypertext documents, which form the majority of the web in 2006.

This document describes the use of the web for data. Like the AWWW, it motivates the selection of principles, constraints, and good practices with simplified scenarios that it describes in some detail. Unlike the AWWW, which describes long-existing practice, this document describes technology which at the time of this writing is deployed by a growing but relatively small number of sites. The standards described herein are around five years old rather than 15 years old. This document therefore makes a far more detailed analysis of its motivating examples.

Acme Sales

The subject of our primary scenario is the Acme Widget Company, its employees, and the events that befall them.

For many years, the success of the Acme Widget Company depended on the contents of an old brown notebook which Joe carried everywhere. Joe started it many years ago when his memory, and a pocketful of business cards, proved insufficient. In the notebook, Joe maintained the master list of all their customers, prospective and actual. Each entry included both the mundane details of their names and addresses and also notes on their foibles and particular interests.

Recently, to everyone's great relief, Joe has been persuaded to transfer all of this information to his PDA, which he is required to back up regularly.

Last year, when Anthea started seeing more customers herself, she wished she could have the contents of Joe's notebook at her fingertips. And, of course, now that it was backed up on the company network, she realized there was no reason that she couldn't.

She asked Leslie, a summer intern, to arrange for the address book file to be on the internal website, accessible only to management in case some of the notes were personal.

Leslie knew that it would be easy to put the data on the web server, but she also knew that Anthea was going to want to access the data with her laptop computer whereas Joe preferred his PDA. Furthermore, she knew it was only a matter of time before the other sales managers would want to access the data as well.

Leslie put the data up in VCards so that it would be easy to import into the various applications that the sales managers were using. She also generated web pages so that any of the managers could review the contacts, even if they didn't need to have their own copy.

When Joe saw the web pages, he noticed that there was more information on them than he had in his PDA. In particular, he noticed that each customer page showed the latitude and longitude off in one corner. Anthea didn't know how it got there, so they asked Leslie.

“Oh,” said Leslie, “that's for something I was working on. It's not finished yet, but let me show you.”

Leslie opened up a data browser and showed them all of the contact information along with other information she'd been able to add to each contact.

“You see,” she explained, “what I did was I converted all of the data into RDF, that's just a generic data format, you don't really need to care about that, then I used a public web service to convert all the postal addresses into latitudes and longitudes.”

“Why?” asked Joe.

“Because then my data browser can put them on a map.”

Leslie clicked on a tab in her browser and revealed a local map dotted with colored markers. The green ones, she explained, were for customers and the blue ones were for leads.

Joe and Anthea spent a few minutes looking at the map and discussing the various customers and leads it revealed. One thing it revealed was that Joe appeared to do a very good job following up on all the leads that fell along the routes he took between customers, but he was neglecting potential leads that were off his normal routes.

Joe mentally plotted the routes he would take over the next few days to follow up on a bunch of leads while Anthea asked if she could see her contacts on that map as well. Leslie explained that it would be easy: as soon as her data was converted to RDF, it would just show up automatically.

“How can that be?” Anthea asked.

“One of the cool things about RDF,” Leslie explained, “is that it's very easy to aggregate data stored in RDF. If you have two sets of RDF data, you can always just mix them together and you get the union without doing any extra work.”

Leslie said she would convert Anthea's contacts into RDF next and they left her to carry on.

This is a story of the progression of information through forms in which it is progressively more useful.

In the format Joe's PDA used natively, it could be backed up and reused, at least by others with a similar brand of PDA. Leslie, coming to the company with a knowledge of data standards, made sure that it was available in standard formats that many more people would be able to use.

The VCard format [VCARD] is a non-proprietary application-specific data format. This means that the file will be usable by any VCard-compliant address book software in the future. This preserves the data from loss due to a change of PDA manufacturer or address book software.
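As a rough illustration of what exporting a contact in this format involves, the following sketch renders one entry as a vCard 3.0 string. The contact details and the `vcard` and `escape_text` helpers are assumptions made up for this example; they are not taken from Leslie's actual export, and only a handful of vCard properties are covered.

```python
def escape_text(value):
    """Escape characters that are special inside a vCard text value."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def vcard(given, family, street, city, note=""):
    """Render one contact as a vCard 3.0 string with CRLF line endings."""
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        # N components: family; given; additional; prefixes; suffixes
        f"N:{escape_text(family)};{escape_text(given)};;;",
        f"FN:{escape_text(given + ' ' + family)}",
        # ADR components: PO box; extended; street; locality; region; code; country
        f"ADR;TYPE=WORK:;;{escape_text(street)};{escape_text(city)};;;",
    ]
    if note:
        lines.append(f"NOTE:{escape_text(note)}")
    lines.append("END:VCARD")
    return "\r\n".join(lines) + "\r\n"


# Hypothetical contact, including the kind of free-text note Joe kept.
card = vcard("Pat", "Smith", "12 Mill Lane", "Springfield",
             note="Prefers morning visits; likes blue widgets, not green")
print(card)
```

The sketch does not fold long lines, which a production exporter would need to do.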

The RDF format [RDF] is a non-proprietary cross-application format. This means that the file is usable by generic data processing, storage and visualization tools. It also means that it can be merged with data from other applications to form a web of linked data.
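The merging property can be sketched in a few lines if an RDF graph is modelled as a plain set of (subject, predicate, object) triples. This is a simplification: it ignores blank nodes, which a real RDF merge must keep apart, and every URI and value below is hypothetical.

```python
# A graph is modelled as a set of (subject, predicate, object) triples.
# All URIs and values are invented for illustration.
EX = "http://acme.example/"

joe_contacts = {
    (EX + "customer/17", EX + "terms#name", "Desert Supply Co."),
    (EX + "customer/17", EX + "terms#status", "customer"),
}

anthea_contacts = {
    (EX + "customer/17", EX + "terms#name", "Desert Supply Co."),
    (EX + "customer/17", EX + "terms#latitude", "35.08"),
    (EX + "customer/42", EX + "terms#status", "lead"),
}


def merge(*graphs):
    """Merge graphs by taking the union of their triples."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Collect every (predicate, object) pair known about one subject."""
    return {(p, o) for (s, p, o) in graph if s == subject}


combined = merge(joe_contacts, anthea_contacts)
for predicate, value in sorted(describe(combined, EX + "customer/17")):
    print(predicate, value)
```

Because both sources use the same URI for the same customer, the duplicated name triple collapses and the latitude from the second source attaches to the existing entry with no extra code.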

Leslie shows the data in a generic data browser. The conversion of the addresses into latitude and longitude was a stunt to show off the data, but it demonstrates that geo-spatial position is a powerful property against which to display and compare data.

Acme Manufacturing

Acme Widgets totally depends on Heather. She keeps the machines running, the machines which make the widgets without which there would be no Acme. They have 56 machines in all, spread over 8 sites in the same small town. Heather keeps a spreadsheet, using a row for each machine, with dates of last and next service, building number, capacity of the machine, anticipated replacement parts, and other information related to maintenance.

Heather has to keep in constant touch with Chris, the production manager. Chris is a crucial person too because he plans the production runs. It's his planning that gives Acme the ability to deliver to each customer on time. For each run, he knows which machine will be used for each order, when they will start and finish, which operator is assigned to the job, what the expected overruns are, and other information related to production. He plans several weeks in advance using a relational database with a PHP-based web interface that allows him to view the data in various ways and make changes to adapt the schedules when necessary.

Heather and Chris are always reviewing printed copies of the data from their respective systems. They have to make sure that Heather has time to get in and maintain the machines between runs and that Chris is never starting a production run on a machine which hasn't been recently oiled and cleaned. Leslie hears them talking over lunch and offers to help.

Soon Leslie has a script on her laptop which puts the spreadsheet data in RDF on the internal website. She also sets up an open source program to provide an RDF view of Chris's database. They fire up Leslie's data browser and look at the combined information.

Their common interests, Leslie notes, are the machines, the maintenance dates, and the production dates. She shows a list, in one window, of all the maintenance dates on each machine, and in another window, of all the runs for that machine. Then she merges the two lists and displays them interleaved on a time-line. Immediately they notice a problem. One machine has two consecutive production runs with no routine cleaning between them. Heather and Chris are both relieved that Leslie has helped them identify a potentially costly oversight.

Chris is happy with the data viewer, and learns to use it himself.

Heather doesn't want to use the data viewer, but she would like the same data on her calendar. Leslie fixes up a calendar feed using the iCalendar [ICAL] application-specific format, which can easily be generated from RDF. Heather gets a calendar for each machine, which contains both her data and the production data from Chris.

Here, Leslie has merged data from two traditional forms: a relational database and a spreadsheet. In each of these cases, the data is well organized: it has a clearly defined structure, but its meaning, while clear to Heather and Chris, is undocumented. The spreadsheet has rows and columns and the database indicates how its tables fit together, but nothing in either case indicates what they mean.

Leslie's task involved not only the syntactic conversion from the original formats, but also the assignment of common terms to equivalent abstractions. For example, where Heather identified each machine by building number and machine number, Chris used building number and operator. Leslie was able to combine the information that Heather provided with information from the duty roster to infer which machine was identified by building and operator. Then she could assign a common term to them, using a URI.
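A minimal sketch of this step, under stated assumptions: the row layouts, the `roster` mapping, and the URI pattern are all hypothetical stand-ins for Heather's spreadsheet, Chris's database, and the duty roster. Once both sources refer to machines by the same URI, interleaving the events and spotting two runs with no maintenance between them is a short loop.

```python
from datetime import date

BASE = "http://acme.example/machine/"  # hypothetical URI space


def machine_uri(building, machine_no):
    """Mint one URI per machine from Heather's (building, machine) key."""
    return f"{BASE}b{building}-m{machine_no}"


# Heather's spreadsheet keys machines by (building, machine number).
maintenance = [
    {"building": 3, "machine": 2, "date": date(2006, 5, 1)},
    {"building": 3, "machine": 2, "date": date(2006, 5, 20)},
]

# Chris's database keys production runs by (building, operator).
runs = [
    {"building": 3, "operator": "pat", "date": date(2006, 5, 3)},
    {"building": 3, "operator": "pat", "date": date(2006, 5, 10)},
    {"building": 3, "operator": "pat", "date": date(2006, 5, 22)},
]

# Duty roster: which machine an operator works on in a given building.
roster = {(3, "pat"): 2}


def unify(maintenance, runs, roster):
    """Express both sources as (machine URI, date, kind), sorted per machine."""
    events = []
    for row in maintenance:
        uri = machine_uri(row["building"], row["machine"])
        events.append((uri, row["date"], "maintenance"))
    for row in runs:
        machine_no = roster[(row["building"], row["operator"])]
        uri = machine_uri(row["building"], machine_no)
        events.append((uri, row["date"], "run"))
    return sorted(events)


def back_to_back_runs(events):
    """Find consecutive runs on one machine with no maintenance between."""
    problems = []
    previous = {}  # machine URI -> (date, kind) of its latest event
    for uri, day, kind in events:
        last = previous.get(uri)
        if last is not None and last[1] == "run" and kind == "run":
            problems.append((uri, last[0], day))
        previous[uri] = (day, kind)
    return problems


timeline = unify(maintenance, runs, roster)
print(back_to_back_runs(timeline))
```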

Best Practice

Use URIs for things, not just for documents.

Leslie used ordinary http: URIs for each machine so that she could store both HTML and RDF information about the machine at that location. She knows that this will make it easier for the managers and operators to look up information from the schedules.

Best Practice

Ensure that looking up a URI for anything returns relevant useful information in RDF.
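One way a server might decide whether to answer a lookup with HTML or RDF is to inspect the HTTP Accept header. The sketch below is an illustration only: the file names and the `choose` and `parse_accept` helpers are hypothetical, and the parser handles just the media type and its `q` weight, not the full header grammar.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, best first."""
    choices = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        choices.append((media_type, q))
    # sorted() is stable, so entries with equal q keep header order.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


# Hypothetical representations stored for one machine URI.
AVAILABLE = {
    "text/html": "machine-7.html",
    "application/rdf+xml": "machine-7.rdf",
}


def choose(header, available=AVAILABLE, default="text/html"):
    """Pick the representation that best matches the client's preferences."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type, available[media_type]
        if media_type == "*/*":
            return default, available[default]
    return default, available[default]


print(choose("application/rdf+xml"))
print(choose("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

A data browser asking for `application/rdf+xml` and a person's web browser asking for `text/html` can thus use the same URI and each receive something they can use.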

Leslie also established common terms for the starting and ending times used in each schedule, drawing on the iCalendar concepts of “dtStart” and “dtEnd” to unify them. Use of common terms made it easy for her to generate a calendar feed for Heather.

Best Practice

Use widely shared URIs for common properties such as time, location, email address, etc.
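Once every schedule entry carries a start and an end in the same terms, producing a calendar feed is mostly a matter of serialization. The following sketch assumes each event has already been reduced to a dictionary with `uid`, `summary`, `start`, and `end` keys; those names, the `PRODID` value, and the sample events are hypothetical. It emits UTC times and does not escape or fold text, so summaries are kept simple.

```python
from datetime import datetime, timezone


def ical_time(moment):
    """Format a timezone-aware datetime as an iCalendar UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def vevent(uid, summary, start, end, stamp):
    """Render one event as the lines of a VEVENT component."""
    return [
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{ical_time(stamp)}",
        f"DTSTART:{ical_time(start)}",
        f"DTEND:{ical_time(end)}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
    ]


def calendar(events, stamp):
    """Render a list of event dictionaries as one iCalendar document."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//Acme Example//Machine Feed//EN",
    ]
    for event in events:
        lines += vevent(stamp=stamp, **event)
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


utc = timezone.utc
events = [
    {"uid": "run-1@acme.example", "summary": "Production run",
     "start": datetime(2006, 5, 3, 8, 0, tzinfo=utc),
     "end": datetime(2006, 5, 3, 16, 0, tzinfo=utc)},
    {"uid": "service-1@acme.example", "summary": "Routine cleaning",
     "start": datetime(2006, 5, 4, 8, 0, tzinfo=utc),
     "end": datetime(2006, 5, 4, 10, 0, tzinfo=utc)},
]

feed = calendar(events, stamp=datetime(2006, 5, 1, 12, 0, tzinfo=utc))
print(feed)
```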

Acme Order Management

The order management system at Acme is maintained by Clare in the Billing Department. She uses a custom application which, while not specific to the business of widget manufacture, does a fine job of tracking orders and customers.

Joe suggests Leslie take a look at the order system, as she seems to have shed light on other parts of the company. Leslie is concerned that this data, which is so crucial to the company, is stored in a proprietary format. She calls the software company and discovers that, in fact, since 2001 there has been an option to export the data as XML. Leslie checks to make sure that the exported XML is complete and also that the application can import its own XML data. Once she's confident of that, Leslie makes sure that the XML data is part of the daily backup to a different machine.

The XML data files seem to make sense and the software company has provided annotated schemas that indicate how the various elements are used.

A customer file has customer elements which use a combination of attributes and sub-elements to encode customer numbers, email addresses, contact information, and so on. A separate order file has a long list of orders which contain a customer number and many items consisting of part numbers and quantities.

However, Leslie can't simply import this data into her generic data browser because there is nothing in the file to tell a machine which bits of it represent things, which are relationships between things, and what sort of things these are anyway. Customer numbers, part numbers, and quantities all just look like numbers.

By talking with Clare about what the system actually does, and using information from the documents and their schemas, Leslie writes an XSLT script to extract RDF from each file.

The differences between the original XML and RDF files are basically that in the RDF file:

The distinctions between different classes of items (customers, parts, quantities) and their relationships to each other are made explicit.
Well-known things such as email address, and so on, are identified with common terms. So addresses in the billing system use the same terms as addresses in Joe's contact list, for example.
Each important artifact, such as an order, a customer, or an Acme product is given a URI. The URIs are carefully constructed so that each one uniquely identifies a single artifact, no matter which file it originally occurred in.
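The three differences listed above can be sketched as follows. Note the substitution: Leslie's script in the scenario is XSLT, whereas this illustration uses Python's standard `xml.etree.ElementTree` to the same end. The XML layout, the `acme.example` URIs, and the term names are all assumptions, since the vendor's actual schema is not shown here.

```python
import xml.etree.ElementTree as ET

BASE = "http://acme.example/"  # hypothetical URI space
TERMS = BASE + "terms#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical export: every value is "just a number" until interpreted.
ORDERS_XML = """<orders>
  <order id="1001" customer="17">
    <item part="204" quantity="15000"/>
    <item part="310" quantity="250"/>
  </order>
</orders>"""


def extract(xml_text):
    """Turn the order file into (subject, predicate, object) triples."""
    triples = []
    root = ET.fromstring(xml_text)
    for order in root.findall("order"):
        # Each order gets a URI that does not depend on the file it came from.
        order_uri = f"{BASE}order/{order.get('id')}"
        customer_uri = f"{BASE}customer/{order.get('customer')}"
        triples.append((order_uri, RDF_TYPE, TERMS + "Order"))
        triples.append((order_uri, TERMS + "customer", customer_uri))
        for n, item in enumerate(order.findall("item"), start=1):
            item_uri = f"{order_uri}/item/{n}"
            part_uri = f"{BASE}part/{item.get('part')}"
            triples.append((order_uri, TERMS + "item", item_uri))
            # A part number names a thing; a quantity is a plain value.
            triples.append((item_uri, TERMS + "part", part_uri))
            triples.append((item_uri, TERMS + "quantity",
                            int(item.get("quantity"))))
    return triples


triples = extract(ORDERS_XML)
for triple in triples:
    print(triple)
```

The customer URI built here is the same one a matching script for the customer file would build, which is what allows the two files to join when their triples are merged.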

Leslie checks out her work by firing up her generic data browser. Starting with a customer, she can navigate through their recent orders and the products they ordered. Starting with a product, she can navigate through orders for that product, the companies that placed those orders, etc.

Next she writes a small ontology to give her system more to go on. The ontology describes semantic constraints, such as that two persons with the same email address are, in fact, the same person, and that two companies with the same homepage URI are, in fact, the same company. It also provides additional information such as user-friendly labels for some of the Acme-specific terms that she's generated.

Finally, she pulls all of the data she's collected so far into the browser. Now she can navigate from sales persons to customers to orders to parts to machines to maintenance information to operators and back.

In many places, use of the same email address or home page allows the system to merge data about different aspects of the same customer (or company, machine, etc.) from the various sources.

In this format, all the distinct customers are easily identified. Leslie runs off a list and asks Joe and the folks in billing to check their records and add identifying information where it is missing. In the process, a number of spelling errors are discovered, and out-of-date email addresses are cleaned up.

Using an ontology allows the system to make inferences and automatically manage redundant data.
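The email-address rule is an instance of what OWL calls an inverse functional property: two nodes sharing a value for such a property denote the same thing. The sketch below merges such nodes with a small union-find. It assumes the FOAF `mbox` property as the shared term for email addresses, which is a choice made for this example rather than something the scenario specifies; the other URIs are hypothetical.

```python
MBOX = "http://xmlns.com/foaf/0.1/mbox"  # assumed shared term for email
EX = "http://acme.example/"              # hypothetical URI space


def smush(triples, inverse_functional):
    """Merge nodes that share a value for an inverse functional property."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving; unknown nodes map to themselves.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    owner = {}  # (property, value) -> first subject seen with that value
    for s, p, o in triples:
        if p not in inverse_functional:
            continue
        key = (p, o)
        if key in owner:
            a, b = find(owner[key]), find(s)
            if a != b:
                # Keep the lexically smaller URI as the canonical name.
                parent[max(a, b)] = min(a, b)
        else:
            owner[key] = s
    # Rewrite subjects and objects to their canonical node.
    return {(find(s), p, find(o)) for s, p, o in triples}


data = [
    (EX + "sales/c17", MBOX, "mailto:pat@customer.example"),
    (EX + "sales/c17", EX + "terms#name", "Pat Smith"),
    (EX + "billing/0042", MBOX, "mailto:pat@customer.example"),
    (EX + "billing/0042", EX + "terms#balance", "0.00"),
]

merged = smush(data, {MBOX})
for triple in sorted(merged):
    print(triple)
```

After merging, the name from the sales contacts and the balance from billing hang off a single node, and the duplicated email triple appears only once.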

Acme in Crisis

Crisis engulfs Acme on Saturday when building 3, in the old rail-yard, burns to the ground.

Once the immediate crisis has passed, the fire has been extinguished, and all of the people in building 3 have been accounted for, the affected parties gather around a table in one of the conference rooms.

They all know that the continued success of their business depends on meeting as many of their commitments for widget deliveries as possible.

Everyone wants to know: how bad is it? Which customers are going to be affected and how important are they?

Chris brings up the data viewer as Leslie showed him weeks ago. He starts at his schedule, which he knows. Looked at another way, his schedule contains all the machines. The machines are identified by URI and the URI for each machine returns information about the building that it is in.

Now that he knows which machines are down, he can find out which jobs were scheduled for those machines. It's a long list.

Joe points at the third one down, a run of 15,000 widgets. “Who's that?” he asks.

“I'm not sure,” replies Chris, “I'm usually just concerned with the production schedule.”

“Click on the order. Leslie says all the data is linked together automatically.”

Chris clicks on the order which reveals the order number and other details of the production run.

“Now click the order number,” suggests Clare.

From the order number, they find the customer number from the Billing Department and that leads back to the customer name and address from Joe's address book.

Everyone is relieved to see that only about fifty customers are going to be impacted. But Joe knows that he's going to have to make sure they all get a personal visit if he's going to maintain their trust.

Recalling the mapping trick that Leslie showed them when she first started, Joe and Anthea are able to plot all the affected customers on a map. Now Joe can divide up the sales folks by region and make sure everyone gets a visit tomorrow.

Here we see the Acme Widget Company adapting quickly despite the adverse circumstances. The data they can dive into, and connect across the company, is available because the individual parts of the company have made all the data available in an application-neutral RDF format.

Suppose this had not been done, and Joe had asked for a list of customers whose orders involved jobs which were scheduled for machines in building 3? It wouldn't have been an impossible task, but a programmer would have had to sit down with each of the systems and files and write code to extract and collate the relevant information. It's unlikely that the task could have been achieved in the time scale Joe needed to keep his customers' trust.

Instead, the staff would have spent all night poring over hard copy output from the various systems, hoping that they didn't miss anyone.
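With the data linked, Joe's question reduces to following a chain of relationships: from the building to its machines, from the machines to their scheduled runs, from the runs to orders, and from the orders to customers. The sketch below walks that chain over a set of triples; the property names and the sample data are hypothetical.

```python
EX = "http://acme.example/"  # hypothetical URI space
T = EX + "terms#"


def subjects(graph, predicate, obj):
    """All subjects that have the given predicate pointing at obj."""
    return {s for s, p, o in graph if p == predicate and o == obj}


def objects(graph, subject, predicate):
    """All objects reached from subject through the given predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def affected_customers(graph, building):
    """Follow building -> machines -> runs -> orders -> customers."""
    customers = set()
    for machine in subjects(graph, T + "locatedIn", building):
        for run in subjects(graph, T + "usesMachine", machine):
            for order in objects(graph, run, T + "fulfils"):
                customers |= objects(graph, order, T + "customer")
    return customers


graph = {
    # From the maintenance data
    (EX + "machine/7", T + "locatedIn", EX + "building/3"),
    (EX + "machine/9", T + "locatedIn", EX + "building/5"),
    # From the production schedule
    (EX + "run/301", T + "usesMachine", EX + "machine/7"),
    (EX + "run/301", T + "fulfils", EX + "order/1001"),
    (EX + "run/302", T + "usesMachine", EX + "machine/9"),
    (EX + "run/302", T + "fulfils", EX + "order/1002"),
    # From the order system
    (EX + "order/1001", T + "customer", EX + "customer/17"),
    (EX + "order/1002", T + "customer", EX + "customer/42"),
}

print(affected_customers(graph, EX + "building/3"))
```

Each of the three groups of triples comes from a different system, yet the query needs no knowledge of where any triple originated.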

Epilogue

@@Well, it was Leslie in a way. She put us in a position where we could look at the state of Acme in depth, in any way we needed to, at a moment's notice. That's what made this possible.

It was leadership too - decisive action taken promptly. We work as a team. Everyone does their part. We talk to each other. And we share our data. And we connect our data.

Appendix Best Practices

A summary of the best practices used in serving RDF data.

Serving Data

When defining a format for data that is expected to be machine-processed and reused in combination with other web data, the data should be exported in RDF, possibly using content negotiation, possibly using GRDDL with an application-specific XML schema, and possibly using alternative serializations such as N3 [and RIF?].

Any thing [abstract or concrete] about which information is given should be identified with a URI.

A second-best alternative is to use a blank node, but include values of Inverse Functional Property arcs to uniquely indirectly identify each node.

That URI should be an HTTP URI.

Dereferencing the URI should yield generally useful information about the thing: information the enquirer would expect to find useful.

Dereferencing the URI should not yield an unmanageably large amount of information about the thing.

When the thing is a Property or a Class, then the RDFS and OWL ontologies are appropriate and conventional.

HASH/SLASH

When a URI for an arbitrary thing, other than an Information Resource, is looked up using HTTP, then there are two possibilities. The URI contains a hash, @@@ etc see finding

MAKE LINKS OUT

One thing will typically be described by information in many systems. It will therefore be identified by many URIs. This is to be expected when different agents support the different URIs. In this case systems should provide owl:sameAs links to other URIs where they are known and it would be useful to the enquirer.

On the choice of Ontology

1. Systems exporting data may use local HTTP terms [URIs for properties and classes]. These systems will not however benefit from a great level of interoperability. Their data will be miscible with other RDF, but nodes will not merge.

2. Data exporters should make an effort to find terms in well-known ontologies where they can, and use those instead of local ones.

2a. It is not necessary at all to use complete ontologies; in fact a wise design mixes and matches ontologies so as to use the most appropriate and well-known individual term. This tends to maximize data reuse.

3. Failing that, agreement may be sought within a community of interest, application area, country, profession, etc. on common terms in a new shared ontology.

3a. When building a new shared ontology, it is wise to consider the implicit model used in existing specifications, such as XML schemas, non-XML languages, database schemas, etc.

3b. Those making new ontologies should ensure that the namespace URI is persistent. They may apply to W3C for such a URI if the work is W3C-related.

4. Whenever a local or shared ontology is used, an ontology document should include where possible a relationship (e.g. subclass, subproperty) to more widely known ontologies where they exist. (cf Heflin)
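The benefit of point 4 can be sketched as a small inference step: if an ontology document declares a local property to be a subproperty of a widely known one, a consumer can derive triples in the well-known term and so merge the data with other sources. The local `workEmail` term is hypothetical, and FOAF `mbox` is used here only as an example of a widely known property.

```python
SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
MBOX = "http://xmlns.com/foaf/0.1/mbox"   # example well-known term
LOCAL = "http://acme.example/terms#"      # hypothetical local namespace


def infer_superproperties(data, ontology):
    """Add a triple for every superproperty of each property used in data."""
    supers = {}
    for s, p, o in ontology:
        if p == SUBPROPERTY_OF:
            supers.setdefault(s, set()).add(o)

    inferred = set(data)
    frontier = list(data)
    while frontier:
        s, p, o = frontier.pop()
        for parent in supers.get(p, ()):
            triple = (s, parent, o)
            if triple not in inferred:
                # Newly derived triples may have superproperties of their own.
                inferred.add(triple)
                frontier.append(triple)
    return inferred


ontology = {
    (LOCAL + "workEmail", SUBPROPERTY_OF, MBOX),
}

data = {
    ("http://acme.example/customer/17", LOCAL + "workEmail",
     "mailto:pat@customer.example"),
}

result = infer_superproperties(data, ontology)
for triple in sorted(result):
    print(triple)
```

Without the subproperty link, the local term would remain miscible with other RDF but, as point 1 notes, nothing would connect it to data that uses the shared term.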

References

VCARD @@
