Semantic Web

Invited Submission to the Transactions of the Institute of Electrical Engineers of Japan (for translation to Japanese) - to appear Summer, 2002

Integrating Applications on the Semantic Web

James Hendler
Computer Science Dept, Univ. of Maryland

Tim Berners-Lee
Director, World Wide Web Consortium

Eric Miller
Activity Lead, W3C Semantic Web Activity

The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. It is based on the idea of having data on the Web defined and linked such that it can be used for more effective discovery, automation, integration, and reuse across various applications. For the Web to reach its full potential, it must evolve into this Semantic Web, providing a universally accessible platform that allows data to be shared and processed by automated tools as well as by people.

The Semantic Web will provide an infrastructure that enables not just web pages, but databases, services, programs, sensors, personal devices, and even household appliances to both consume and produce data on the web. Software agents can use this information to search, filter and prepare information in new and exciting ways to assist the web user. New languages, making significantly more of the information on the web machine-readable, power this vision and will enable the development of a new generation of technologies and toolkits.

Currently, many parts of the Semantic Web are already in existence, and the business cases for this technology are becoming clearer all the time. In this article, we will focus on the near term objectives of the Semantic Web and discuss the potential benefits to web users that these objectives offer.

Consider the state of documentation systems as they were in 1989 when the Internet was starting to become internationally established. At this time, retrieving and referencing information across remote systems was still an expert's game. The Internet existed, so it was quite possible that you could get access to a remote system. However, this system more often than not used completely different protocols for accessing information. You might, for example, use telnet (or some other remote login protocol) to reach one system and learn some particular access protocol before you could search its database. Having found the relevant information, you would have to copy this into your clipboard (or onto the back of an envelope) and then paste (or retype) the information into the relevant document. The type of information obtained was more often a factor of least resistance than the most relevant, timely or accurate.

That was before the Web. Now, given Web technologies, we can link information easily and seamlessly. The majority of systems now have Web servers, and despite still in fact running on different machines and still likely using new versions of the same programs they were running in 1989, the Web interfaces make them seem part of the same smooth, consistent, world of information. For the documents in our lives, linking content is much easier than ever before.

Despite this seemingly miraculous world of information, transferring content between web applications is still surprisingly difficult. There are lots of ways in which our machines can use our web content when they can understand it. When my personal digital assistant's calendar program understands dates, it can alert me when an appointment is coming up. When my email program's address book understands that something is a phone number or an email address, it can set up communication with that person with a click. When my digital phone is given a location in Japan, it can access a program to compute which trains to take and how much it will cost.

Today we are limited in our ability to use these things on the web. Suppose you are browsing the web and you come across a page describing a meeting. It has the time and place and links to other documents including the home pages of other people involved in organizing and attending the meeting. You decide to attend, and click the "register" button. At this point, you would like your calendar to have an entry at the right date and time, with hypertext links to the details. You would like your digital phone to download the event's address and compute the best train route to arrive at that date and time. You would like your Rolodex to seem to contain, until the meeting is over, the contact info for the people involved. You'd like to do all this with one click.

Unfortunately, currently you can't. What you in fact have to do is laboriously cut and paste details into your calendar, finding the date and time yourself. You have to copy the contact details by hand from the various attendees' web pages into your address book, manually sorting out the address lines and phone numbers. And you must use the small keys on your digital phone to enter the location and, yet again, the date and time. This situation is just like the problems we had with the documentation system before the web. In short, for the data in our lives we are still pre-Web!

The Semantic Web already provides technologies that can address these issues and improve things. We will describe three of these - the linking of databases, sharing content between applications using differing XML DTDs or schemas, and the critical emerging application of the discovery and combination of web services.

The web of data - beyond HTML

Given the difficulty described above in networking our personal data, consider the impact for a company. The same pre-web status exists when you look at trying to connect the various data-handling applications on which your company depends or look at users trying to combine information from multiple databases. There is a certain overlap between the stock control system and the accounting system which, if the connection were made, would save a lot of re-keying and associated errors. Today, you hire a programmer to write the "glue code" to extract the data from the stock control system, reformat it, and load it into the accounting system. The same thing happens when you realize that your customer relationship management system could be set up with data from the order control system - in fact your business suffers because it isn't. Again and again, special purpose interfaces are written to bring data from one system into another. If you have many applications which run in your company, there are a huge number of ways they can be linked together. That linking requires a lot of custom code, and that adds up to a lot of programming hours by a lot of expensive hired help.

Use of the Extensible Markup Language (XML) (http://www.w3.org/XML/) can help. If all the applications are changed to use XML, the programmer only has to learn to handle XML data, not the full range of weird internal formats in which data could otherwise be stored and transferred. This means that some of the application glue can be constructed using XML tools such as XSLT, the transformation language (http://www.w3.org/TR/xslt). The bad news is that the problem of effectively exchanging data doesn't go away. For every pair of applications, in fact for each way in which they need to be linked, someone has to create an "XML to XML bridge." That is, if you take XML files from two different applications, you can't just merge them. To make an (XML) query on an XML document, but add in some constraints from another document, you can't just merge the two queries. It's not as though everything is in relational databases where common elements can be used so that data is joined together.

The problem is that different databases are built using different database schemas, but these schemas are not made explicit. Thus, an XML tag called something like <CX213> is not readily associated with a field in another database called <Income>. One proposed solution to this is to make the schemas more explicit, and to map them to common terms. The XML-Schema language (http://www.w3.org/XML/Schema) allows groups with common interests to express a schema on line. A company, or even a particular business sector, can develop a standard set of XML mappings (i.e. a particular XML schema), and therefore be able to represent their content using a common structure. Unfortunately, agreeing on such a schema often proves difficult, and attempts to develop large-scale vocabularies for diverse users are harder still.
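The mismatch can be sketched in a few lines of Python using the standard library's ElementTree. This is only an illustration under stated assumptions: the two record layouts, the values, and the idea that <CX213> comes from a stock system are hypothetical; only the two tag names come from the text above.

```python
import xml.etree.ElementTree as ET

# Two hypothetical exports stating the same fact under different tag names.
stock_xml = "<record><CX213>52000</CX213></record>"
accounts_xml = "<record><Income>52000</Income></record>"

stock = ET.fromstring(stock_xml)
accounts = ET.fromstring(accounts_xml)

# The tag names carry no shared meaning, so a lookup written for one
# document finds nothing in the other.
print(stock.findtext("Income"))     # None
print(accounts.findtext("Income"))  # 52000

# A hand-written bridge for this one pair of applications (illustrative only).
STOCK_TO_ACCOUNTS = {"CX213": "Income"}

def bridge(record, tag_map):
    """Return a copy of `record` with its child tags renamed via `tag_map`."""
    out = ET.Element(record.tag)
    for child in record:
        renamed = ET.SubElement(out, tag_map.get(child.tag, child.tag))
        renamed.text = child.text
    return out

bridged = bridge(stock, STOCK_TO_ACCOUNTS)
print(bridged.findtext("Income"))   # 52000
```

The mapping table is the "glue": it works, but a separate one has to be written and maintained for each pair of applications that need to exchange data.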

The problem is that, as we mentioned earlier, "there is a certain overlap between the stock control system and the accounting system." A certain overlap, but probably not a complete one. The stock system might, for example, expect every part to have a unique number -- your salespeople will insist on it. On the other hand, the accounting system is designed to allow for the fact that you may buy the same end-customer item from different vendors at different times. The stock number does not uniquely map onto the accounting system. In general, attempts to define single schemas across very large companies (or within competitive industry sectors) have been complicated, and often fruitless -- different users insist on different ways to represent the same data due to their inherently different job and data needs. Where it does prove possible within a small company or cooperative sector, the problem is almost always compounded because somewhere in the supply chain there is some user who is not using the same schema or who is not able to agree completely about schema needs.

A capability beyond that offered by XML-schema is needed if we are to provide mapping capabilities between divergent schemas, or between users who need to use different business vocabularies. It is a truism of computing that to map between dissimilar data structures, a more powerful data representation is needed. For example, the relational calculus, used by relational databases, is more powerful than most of the representations used in older (flat file) databases, and thus it became the standard for mapping between older approaches. More powerful representations, such as entity/relations or object models, are needed if one wants to perform a complex mapping of databases to each other or to query across heterogeneous databases. In short, a more expressive language can allow you to move up a layer of interoperability. Just as older database systems suddenly became compatible by adopting a consistent relational model, so unstructured web data, or XML-schema definitions, can essentially also adopt a relational model, allowing significantly more power to be brought to bear on solving these data-modeling problems.

For this reason, a fundamental component of the Semantic Web is the Resource Description Framework (RDF; http://www.w3.org/RDF/). When information from two sources in RDF needs to be merged, you can basically concatenate the files into one big file -- joining on those terms which are defined to correspond to the same Uniform Resource Identifiers (URIs). When you want to extend a query on an RDF file to include constraints from another, you just add in the constraints as part of the merging. Thus, where XML is made up of elements and attributes - which tell you only about how things are written into the file - RDF data is made up of statements where each statement expresses the value of one property of something -- the exact equivalent of one cell in a database table. All the relational database ideas work - joins and views, for example, are written easily using common tools.
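The idea that merging is little more than concatenation can be sketched with plain Python sets. This is a simplified model, not an RDF library: each statement is a (subject, property, value) tuple, the http://example.org/ namespace and all property names are hypothetical, and details of real RDF such as blank nodes and multi-valued properties are ignored.

```python
# A statement is (subject, property, value); subjects and properties are URIs.
EX = "http://example.org/"  # hypothetical namespace, for illustration only

stock_data = {
    (EX + "part/42", EX + "name", "chocolate heart"),
    (EX + "part/42", EX + "unitsInStock", 1200),
}
accounting_data = {
    (EX + "part/42", EX + "unitCost", 0.15),
    (EX + "part/42", EX + "vendor", EX + "vendor/7"),
}

# Merging is set union: statements about the same URI line up automatically.
merged = stock_data | accounting_data

def describe(graph, subject):
    """Collect every property/value pair stated about one subject.

    Assumes at most one value per property, which real RDF does not require.
    """
    return {prop: value for (s, prop, value) in graph if s == subject}

# One "row" assembled from two sources, joined on the shared subject URI.
row = describe(merged, EX + "part/42")
print(row[EX + "unitsInStock"], row[EX + "unitCost"])  # 1200 0.15
```

No bridge code was written for this pair of sources; the join falls out of both using the same URI for the same part.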

What happens now to your enterprise application integration problem? The information from each application is output in, or converted into, RDF. Any query can run over any selection of this data. Filters can be written very simply, and converters can be used to extract and calculate the data you need. This data in turn is input to those other applications which need it. Basically, the problem is linear in the size of your system. Just as new web servers can be fitted into the web without disturbing the rest, so new RDF applications supply and use information without upsetting the rest of the system. The huge number of custom data interfaces has, seemingly miraculously, disappeared. Like document linking, data can join the web!
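The difference in scale can be made concrete with a little arithmetic. The sketch below assumes the worst case for custom glue, one bridge for every ordered pair of applications, against one converter per application for the shared model; the function names are illustrative.

```python
def pairwise_bridges(n_apps):
    """Worst case for custom glue: one bridge per ordered pair of applications."""
    return n_apps * (n_apps - 1)

def shared_model_converters(n_apps):
    """With a shared data model: one converter per application."""
    return n_apps

for n in (3, 10, 50):
    print(n, pairwise_bridges(n), shared_model_converters(n))
# 3 6 3
# 10 90 10
# 50 2450 50
```

In practice not every pair needs linking, so the first figure is an upper bound, but it grows with the square of the number of applications while the second grows in step with it.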

Semantic Web Services - bringing programs and data together

Just as databases cannot be easily integrated on the current web without RDF, neither can programs. It may seem that programs are easily integrated into the web; after all, you often click on a link and a Java or Flash program downloads and runs. Unfortunately, this approach doesn't work for many e-business applications, particularly in business-to-business (b2b) applications. Loading someone else's program to run locally is a very different task from bringing information from someone else's applications into your own.

Consider the case of a company which wishes to purchase parts from a vendor, arrange shipping from a large freight company, and have those parts delivered to one of several manufacturing locations based on which plant has the most spare capacity at the time of the delivery. Further, they would like this deal to be brokered on the web with the minimum amount of human interaction -- a salesman types in an order, and the supply chain goes to work! This is similar to the problem seen in databases, but made even more complex by the fact that each of these organizations is likely using a set of different programs, not just databases, to manage their businesses. Worse, these programs may be running on special purpose machines or behind security and firewall protections. The first problem to be solved is to figure out how all these programs can interoperate on the web -- that is, to provide protocols and descriptions of the "services" that these various programs can offer.

A large number of major companies have been working very hard on doing exactly this, and the result is a rapidly expanding "web services" market. This is one of the fastest growing sectors of modern web business -- for example, the Gartner Group says that in the near future "using Web services will help reduce costs and improve the efficiency of IT projects by 30 percent." Estimates for the size of the web services market start in the billions and go up from there.

As a result, new protocols and languages are being developed rapidly to try to standardize the ways in which the systems describe what they do. An XML-based protocol, called SOAP (http://www.w3.org/TR/SOAP/), has been developed to provide a standard means for allowing programs to invoke other programs on the web. In addition, new web service description and web service architecture languages are emerging. This has become a major focus area for the World Wide Web Consortium's Web Services Activity (http://www.w3.org/2002/ws/).

Semantic Web technologies will enhance the utility of Web Services when they are widely deployed. On the web, the many service providers will need to be able to advertise their services to an extremely wide and varied audience of service users. Providing brokering capabilities, the ability for service users to be automatically matched with service providers, is difficult, and revolves around the same sort of vocabulary mapping issues that databases exhibit. In the current implementations, the services describe inputs, outputs, ports and other aspects of calling each other; however, the description of what the services do is left for a "content" field, to be filled in by an arbitrary (XML-parsable) description. Thus, the problem is similar to that of databases -- without agreed upon content, many different mappings between diverse user communities need to be expressed. Within a pre-arranged group of players there can be the ability to reach some agreements, but this makes the discovery of outside providers, who use different content schemas, difficult, and can require a significant number of prior agreements across supply chains, a potential source of inflexibility.

As in the case of databases, the expressive power of the Semantic Web languages can help us here. An extension to RDF called RDF-schema (http://www.w3.org/TR/rdf-schema/), and a newly emerging web ontology language (http://www.w3.org/2001/sw/WebOnt/), are able to let us build hierarchies and thesauri that can be used for expressing how terms relate to one another. For example, we can create a schema on the web that expresses the information that there are shipping events, that mailing is a kind of shipping, that overnight-mailing is a kind of mailing, etc. A new service can be linked into this simply by merging its own description with that in the thesaurus, since the merged documents remain legal RDF (as discussed in the database section).
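The shipping hierarchy can be sketched as a small set of "is a kind of" statements plus a function that follows them upward. This is a minimal illustration in plain Python rather than an RDF-schema processor: the pairs stand in for rdfs:subClassOf statements, and "AcmeOvernight" is a hypothetical new service invented for the example.

```python
# (narrower term, broader term) pairs, standing in for rdfs:subClassOf statements.
SUBCLASS_OF = {
    ("Mailing", "Shipping"),
    ("OvernightMailing", "Mailing"),
}

def broader_terms(term, subclass_of):
    """Return every term reachable by following subclass links upward."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for narrower, broader in subclass_of:
            if narrower == current and broader not in found:
                found.add(broader)
                frontier.append(broader)
    return found

# A new (hypothetical) service links in by merging its own statement
# with the shared thesaurus; the merge is again a set union.
merged = SUBCLASS_OF | {("AcmeOvernight", "OvernightMailing")}

print(sorted(broader_terms("AcmeOvernight", merged)))
# ['Mailing', 'OvernightMailing', 'Shipping']
```

A search for any kind of shipping can now find the new service, although its own description only mentioned overnight mailing.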

Moreover, the information that links one service description to another doesn't need to rely on the luck of having a common term in natural language to merge on. An outside source, whether it be a different user or developer, a separate thesaurus, or even a random fact found on someone else's web page, can express mapping information. Thus, I could come along and say that what my co-author calls a "lorry" is equivalent to what you call a "truck," and from then on, when we merge the graphs, a connection between lorry and truck can be found. Moreover, the newer languages allow more complex mappings to be expressed and merged - thus I could express that a Nissan-Maxima is an automobile, its type is luxury, and its place of manufacture is Japan, and now when an advertisement for a Nissan dealer's service is linked in, it can be found by any of those properties.
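The lorry and truck case can be sketched the same way. Everything named here is an assumption made for illustration: the two company names, the "uk:" and "us:" prefixes, and the "operates" and "equivalentTo" properties are hypothetical stand-ins, not terms from any published vocabulary.

```python
# Statements from two independently written graphs (all names illustrative).
coauthor_graph = {("uk:FreightCo", "operates", "uk:lorry")}
your_graph = {("us:HaulCo", "operates", "us:truck")}

# A third party states that the two terms mean the same thing.
mapping_graph = {("uk:lorry", "equivalentTo", "us:truck")}

merged = coauthor_graph | your_graph | mapping_graph

def equivalents(term, graph):
    """Return `term` plus every term linked to it by equivalence statements."""
    group = {term}
    changed = True
    while changed:
        changed = False
        for s, p, o in graph:
            if p != "equivalentTo":
                continue
            if s in group and o not in group:
                group.add(o)
                changed = True
            elif o in group and s not in group:
                group.add(s)
                changed = True
    return group

def who_operates(term, graph):
    """Find subjects that operate `term` or anything stated to be equivalent."""
    names = equivalents(term, graph)
    return {s for s, p, o in graph if p == "operates" and o in names}

print(sorted(who_operates("us:truck", merged)))
# ['uk:FreightCo', 'us:HaulCo']
```

Neither original graph had to change; the single mapping statement, supplied from outside, is enough to connect them once the graphs are merged.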

Another enhancement to Web Services provided by the Semantic Web occurs when the specific service one needs is not immediately available. For example, suppose a company which supplies little gift boxes full of candies wants to buy a hundred gross of chocolate hearts and a hundred gross of candy canes, and ship them both to a plant in Nagoya for packaging. We can find vendors of hearts, vendors of candy canes, and lots of shippers - but no one service that can do all three. Essentially, we are composing one new service (the candycane-chocolate-shipping service) out of three others. The composition of Semantic Web Service descriptions allows us to pull this information together even if we don't agree upon the same terms for our descriptions a priori. Further, Semantic Web applications can then be used to analyze ways to reach the goal we need by putting services together in an efficient and effective means (e.g. shipping chocolate requires refrigeration, so merging additional information about chocolate helps ensure effective delivery). Although complex service composition is still very much a research issue, basic composition, done by matching inputs and outputs of the various services, is already doable using existing Semantic Web tools.
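Basic composition by matching inputs and outputs can be sketched as a simple forward-chaining loop. This is an illustrative toy, not one of the existing tools mentioned above: the Service type, the service names, and the item labels are all hypothetical, and the greedy strategy does not try to pick the cheapest or shortest plan.

```python
from typing import NamedTuple

class Service(NamedTuple):
    name: str
    inputs: frozenset   # what the service needs before it can run
    outputs: frozenset  # what it makes available afterwards

def compose(services, available, goal):
    """Greedy forward chaining over service descriptions.

    Repeatedly runs the first service whose inputs are all available,
    until every goal item has been produced. Returns the service names
    in execution order, or None if the goal cannot be reached.
    """
    have = set(available)
    plan = []
    remaining = list(services)
    while not goal <= have:
        runnable = [s for s in remaining if s.inputs <= have]
        if not runnable:
            return None
        step = runnable[0]
        remaining.remove(step)
        plan.append(step.name)
        have |= step.outputs
    return plan

services = [
    Service("Shipper", frozenset({"chocolate_hearts", "candy_canes"}),
            frozenset({"delivered_to_nagoya"})),
    Service("HeartVendor", frozenset({"order"}), frozenset({"chocolate_hearts"})),
    Service("CaneVendor", frozenset({"order"}), frozenset({"candy_canes"})),
]

plan = compose(services, {"order"}, {"delivered_to_nagoya"})
print(plan)  # ['HeartVendor', 'CaneVendor', 'Shipper']
```

The shipper is listed first but cannot run until both vendors have produced their outputs, so the ordering of the plan comes from the descriptions themselves rather than from the order in which the services were found.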

So is building the Semantic Web really a futuristic research vision for a bunch of rocket scientists? As you see, the answer is no. The Semantic Web, like the World Wide Web, can grow from taking well established ideas, and making them work interoperably over the Internet. This is done with standards, which is what the World Wide Web Consortium is all about. We are not inventing relational models for data or composition methodologies for services. Rather, we are bringing to the web a number of things we already largely know how to do -- that is, we are allowing them to work together in a decentralized system - without a human having to custom handcraft every connection.

In short, the Semantic Web already provides the kinds of languages and tools needed to attack the problem discussed earlier in this article. We're not that far from the time when you can click on the web page for the meeting, and your computer, knowing that it is indeed a form of appointment, will pick up all the right information, and understand it enough to send it to all the right applications. Further, it will invoke those applications directly (using web services), needing little or no human intervention. The business market for this integration of data and programs is huge, and we believe the companies who choose to start exploiting Semantic Web technologies will be the first to reap the rewards.