Web Choreography Use Case Overview
by
Bruce R. Barkstrom, Paula L. Sidell |
Atmospheric Sciences Data Center |
NASA Langley Research Center |
Hampton, VA, USA 23681-2199
and |
Donald M. Sawyer |
National Space Science Data Center
NASA Goddard Space Flight Center |
Greenbelt, MD, USA |
June 4, 2003 5:00 am EDT |
Web choreography lies at the intersection of many disciplines that must work together to achieve a smooth and uninterrupted flow of data and services among many cooperating partners. These disciplines include:
This document is intended to provide a collection of use cases that can be used for several purposes:
As we describe below, the use cases that we provide are based on practical experience with a collection of large, distributed data centers and on an international standard describing the operation of such data centers. This basis is useful for several reasons:
The next sections of this document briefly describe the OAIS Reference Model [2002] and the NASA EOSDIS data centers whose operational experience we draw on in developing test cases and shaping some of the use cases.
The use cases we consider below are based on the ISO Standard for a Reference Model for an Open Archival Information System (OAIS) [CCSDS, 2001], which has become ISO 14721. This standard was developed by the Consultative Committee for Space Data Systems, an international body concerned with interoperability of governmental space assets. By using this standard, we ensure that there will be a very low probability of patent complications. Furthermore, this standard has caught the attention of the digital library community and is widely regarded as important there. An important aspect of the OAIS is that it seems likely to provide significant guidance to the architecture suggested by the U.S. Library of Congress "National Digital Information Infrastructure and Preservation Program" [NDIIPP, 2003] and to related international projects.
It is useful to quote from the Standard itself in understanding the nature and function of an OAIS: ``An OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. It meets a set of such responsibilities as defined in the standard, and this allows an OAIS archive to be distinguished from other uses of the term `archive'. The term `Open' in OAIS is used to imply that the standard, as well as future related Recommendations and standards, are developed in open forums. It does not imply that access to the archive is unrestricted.
The information being maintained in an OAIS has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. In this reference model there is a particular focus on digital information, both as the primary forms of information held and as supporting information for both digitally and physically archived materials. Therefore, the model accommodates information that is inherently non-digital (e.g., a physical sample), but the modeling and preservation of such information is not addressed in detail.''
This reference model:
The Model is expressed in UML and breaks the functions of an archive down into six large classes, to which we add one additional class:
The OAIS Reference Model can be made much more useful by ensuring that the use cases for web choreography are based on concrete experience. Specifically, we propose to describe the operation of NASA's EOSDIS Distributed Active Archive Centers (DAACs) in terms of the OAIS Reference Model and to use that description for use case instances or scenarios. The experience of these operational data centers can help ensure a realistic recommendation - and not just one based on academic interests. These centers have to deal with security, data production, data distribution, user access, reliability, and cost of operation on a daily basis. For example, the NASA Langley Atmospheric Sciences Data Center (ASDC) currently has about 500 TB of data in its store. It is adding about 20 TB per month to that store, and had more than 3,000 data-ordering users in the last year.
The EOSDIS DAACs hold data from forty instrument teams on fifteen satellites that have been launched during the period from 1997 to 2003, as well as similar data going back into the 1970s. The collection will grow even further in future years, as more instrument teams and satellites are added to the mix. Reber and Todirita [2003] provide a useful on-line summary <http://eosdatainfo.gsfc.nasa.gov/eosdata/>. Table 1 identifies these data centers and the specific kinds of data they hold.
TABLE 1. EOSDIS Data Centers.
Briefly, data from Low-Earth Orbiting satellites is sent to a data collection site at White Sands, New Mexico, either by telemetry relays through the Tracking and Data Relay Satellite System (TDRSS) or by shipments from ground stations. Typically, a ground station receives the telemetry data from an entire orbit within the ten minutes or so that the satellite is above its horizon, so a ninety-minute orbit's worth of data must be transmitted in a window of opportunity that is only about ten percent of the full orbit. Data rates for this window are therefore rather high. After the White Sands site ensures that the telemetry packets from the satellite have been sorted into time-sequenced order, the collection of packets is shipped to a site at Goddard Space Flight Center in Greenbelt, MD, and then redistributed to the EOSDIS Data Centers or to the instrument science teams in other locations.
Various data centers process the data through software provided by the science teams that develop the instruments and the algorithms for processing the data. In some cases, the processing is quite simple - multiply the data value from a satellite measurement by a constant for calibration and append information allowing the location of the measurement on the Earth to be derived by the user. In others, the science teams do very extensive processing that involves heavy computational loads. At ASDC, for example, one team has provided about one million lines of source code and another about two-thirds of a million lines. We will comment on production paradigms that teams use later. In most cases, the data production at the EOSDIS DAACs falls into moderate-rate, discrete batch production. This means that the DAACs or science team sites engaged in data production are running about 1,000 jobs per day, on average. Some sites run 100 per day or fewer; one runs 10,000 or more, and might be more accurately described as using an assembly line approach to production. The production facilities eventually send the data to the DAACs.
The EOSDIS data centers contain a fair amount of data. ASDC has about 500 TB; other data centers may have as much as 4 PB right now. At the end of mission life, they will probably have between ten and twenty PB in total. In addition to binary data, the DAACs also contain metadata and documentation. In this, they resemble most other enterprises on the WWW. Users are expected to search through the metadata, using database queries, and to use the documentation to find and understand what they can get out of the system. Each of the DAACs has its own interface and ordering tools. They also support a federated search system that can query all of the data centers to produce candidate files for users to order.
After they have ingested the data products, these data centers make the data available to almost anyone who wants it - usually at no cost or just recovering the marginal cost of distribution. Because the EOSDIS data centers are open to the Internet, they are exposed to the same environment as other open web sites. On a practical level, the lives of data center staff are consumed with data management and user services. The statistics collected as part of the EOSDIS system operations suggest that this system is serving data orders to more than 100,000 users per year, with some reports suggesting that about 2,000,000 distinct web users visit the data centers in a year.
The EOSDIS system design started in about 1990 - and was not strongly impacted by the sudden emergence of the web. There is so much data that this system has not been able to use databases for the bulk of it. The current system's design is based on a paradigm of file-based search and order, with many of the files as large as 500 MB. By search and order, we mean that users can search for and order one or more files. These data centers have experimented with subsetting and supersetting, which would allow them to customize their data to particular user communities. Four of the larger data centers have a system developed by a large aerospace contractor. This system has about 1.7 million lines of code and uses about twenty to twenty-five Commercial Off-The-Shelf (COTS) products. All of the data centers have other systems they've developed to handle the specific needs of their data sets and user communities. Based on cost considerations, most of the files in these systems are stored in robotic tape silos, although the U.S. National Research Council recently released a report [NRC, 2003] recommending that the centers move to much more disk-based storage.
Figure 1 identifies the major elements in the OAIS Context, including the relationship with the producer and supplier. The Production Facility identified in this figure, as well as the Supplier and Production Management, are not from the OAIS Reference Model, but are added for the sake of completeness. In the material that follows, we quote extensively from Chapter 2 of the OAIS Reference Model.
Figure 1. OAIS-Producer Context Diagram
Outside the OAIS are Producers, Consumers, and Management.
In addition to the OAIS, we add a Production Facility, within which (data) products are created. The entities for this part of the environment include:
Management provides the OAIS with its charter and scope. The charter may be developed by the archive, but it is important that Management formally endorse archive activities. The scope determines the breadth of both the Producer and Consumer groups served by the archive. Some examples of typical interactions between the OAIS and Management include:
The first contact between the OAIS and the Producer is a request that the OAIS preserve the data products created by the Producer. This contact may be initiated by the OAIS, the Producer or Management. The Producer establishes a Submission Agreement with the OAIS, which identifies the SIPs to be submitted and may span any length of time for this submission. Some Submission Agreements will reflect a mandatory requirement to provide information to the OAIS, while others will reflect a voluntary offering of information. Even in the case where no formal Submission Agreement exists, such as a World Wide Web (WWW) site, a virtual Submission Agreement may exist specifying the file formats and the general subject matter the site will accept.
Within the Submission Agreement, one or more Data Submission Sessions are specified. There may be significant time gaps between the Data Submission Sessions. A Data Submission Session will contain one or more SIPs and may be a delivered set of media or a single telecommunications session. The Data Submission Session content is based on a data model negotiated between the OAIS and the Producer in the Submission Agreement. This data model identifies the logical components of the SIP (e.g., the Content Information, PDI, Packaging Information, and Descriptive Information) that are to be provided and how (and whether) they are represented in each Data Submission Session. All data deliveries within a Submission Agreement are recognized as belonging to that Submission Agreement and will generally have a consistent data model, which is specified in the Submission Agreement. For example, a Data Submission Session may consist of a set of Content Information corresponding to a set of observations, which are carried by a set of files on a CD-ROM. The Preservation Description Information is split between two other files. All of these files need Representation Information which must be provided in some way. The CD-ROM and its directory/file structure are the Packaging Information, which provides encapsulation and identification of the Content Information and PDI in the Data Submission Session. The Submission Agreement indicates how the Representation Information for each file is to be provided, how the CD-ROM is to be recognized, how the Packaging Information will be used to identify and encapsulate the SIP Content Information and PDI, and how frequently Data Submission Sessions (e.g., one per month for two years) will occur. It also gives other needed information such as access restrictions to the data.
Each SIP in a Data Submission Session is expected to meet minimum OAIS requirements for completeness. However, in some cases multiple SIPs may need to be received before an acceptable AIP can be formed and fully ingested within the OAIS. In other cases, a single SIP may contain data to be included in many AIPs. A Submission Agreement also includes, or references, the procedures and protocols by which an OAIS will either verify the arrival and completeness of a Data Submission Session with the Producer or question the Producer on the contents of the Data Submission Session.
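To make this data model more concrete, the sketch below (a minimal Python fragment with purely illustrative class and field names that are not drawn from the standard) shows one way the logical components of a SIP and a Data Submission Session from the CD-ROM example could be represented:

    from dataclasses import dataclass, field
    from typing import List

    # Minimal, hypothetical representation of the SIP logical components named
    # above: Content Information, Preservation Description Information (PDI),
    # Packaging Information, and Descriptive Information.

    @dataclass
    class ContentInformation:
        files: List[str]              # e.g., the observation files on the CD-ROM
        representation_info: str      # how to interpret the files (format, semantics)

    @dataclass
    class PreservationDescriptionInformation:
        provenance_file: str          # in the example, the PDI is split
        fixity_file: str              # between two other files

    @dataclass
    class SubmissionInformationPackage:
        content: ContentInformation
        pdi: PreservationDescriptionInformation
        packaging_info: str           # e.g., the CD-ROM directory/file structure
        descriptive_info: str         # metadata used by Finding Aids

    @dataclass
    class DataSubmissionSession:
        submission_agreement_id: str  # every delivery belongs to an agreement
        sips: List[SubmissionInformationPackage] = field(default_factory=list)

In an actual Submission Agreement the Representation Information and Packaging Information would be specified in detail; the sketch records them only as opaque strings.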
Figure 2a shows the most common Producer-OAIS interaction, in which the Producer provides a Submission Information Package to the OAIS, which converts the SIP to an Archival Information Package.
Figure 2. OAIS-Producer Context Diagram
There are many types of interactions between the Consumer and the OAIS. These interactions include questions to a help desk, requests for literature, catalog searches, orders and order status requests. Figure 2b illustrates the generic data access process, in which a consumer is interested in information, not in ordering a file. The consumer queries the archive, which responds with a Results Set.
The ordering process is of special interest to the OAIS Reference Model, since it deals with the flow of archive holdings between the OAIS and the Consumer. The Consumer establishes an Order Agreement with the OAIS for information. This information may currently exist in the archive or be expected to be ingested in the future. The Order Agreement may span any length of time, and under it one or more Data Dissemination Sessions may take place. A Data Dissemination Session may involve the transfer of a set of media or a single telecommunications session. The Order Agreement identifies one or more AIPs of interest, how those AIPs are to be transformed and mapped into Dissemination Information Packages (DIPs) and how those DIPs will be packaged in a Data Dissemination Session. The Order Agreement will also specify other needed information such as delivery information (e.g., name or mailing address), and any pricing agreements as applicable.
Ordering is a more formal process than querying. Figure 2c illustrates the generic ordering process. In this case, the consumer submits an Order, which allows the archive to convert an AIP into a DIP. The DIP is what is sent to the user.
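A minimal sketch of these two interaction patterns, using hypothetical function names rather than anything defined in the Reference Model, might look like this:

    # Hypothetical sketch of the generic access (query -> Result Set) and
    # ordering (Order -> AIP converted to DIP) interactions described above.

    def query_archive(archive, criteria):
        """Figure 2b: a Consumer query returns a Result Set of descriptive records."""
        return [desc for desc in archive["descriptive_info"] if criteria(desc)]

    def fill_order(archive, order):
        """Figure 2c: an Order identifies AIPs; the archive converts each to a DIP."""
        dips = []
        for aip_id in order["aip_ids"]:
            aip = archive["aips"][aip_id]
            # How an AIP is transformed into a DIP is governed by the Order
            # Agreement; here the "transformation" is just a repackaging.
            dips.append({"dip_of": aip_id, "content": aip["content"]})
        return dips

    archive = {"descriptive_info": [{"id": "AIP-1", "subject": "radiation budget"}],
               "aips": {"AIP-1": {"content": "granule bytes ..."}}}
    hits = query_archive(archive, lambda d: "radiation" in d["subject"])
    dips = fill_order(archive, {"aip_ids": [d["id"] for d in hits]})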
There are two common order types initiated by Consumers: the Event Based Order and the Adhoc Order.
In the case of an Adhoc Order, the Consumer establishes an Order Agreement with the OAIS for information available from the archive. If the Consumer does not know a priori what specific holdings of the OAIS are of interest, the Consumer will establish a Search Session with the OAIS. During this Search Session the Consumer will use the OAIS Finding Aids that operate on Descriptive Information, or in some cases on the AIPs themselves, to identify and investigate potential holdings of interest. This may be accomplished by the submission of queries and the return of result sets to the Consumer. This searching process tends to be iterative, with a Consumer first identifying broad criteria and then refining these criteria based on previous search results. Once the Consumer identifies the OAIS AIPs of interest, the Consumer may provide an Order Agreement that documents the identifiers of the AIPs the Consumer wishes to acquire, and how the DIPs will be acquired from the OAIS. If the AIPs are available, an Adhoc Order will be placed. However if the AIPs desired are not yet available, an Event Based Order may be placed.
In the case of an Event Based Order, the Consumer establishes an Order Agreement with the OAIS for information expected to be received on the basis of some triggering event. This event may be periodic, such as a monthly distribution of any AIPs ingested by the OAIS from a specific Producer, or it may be a unique event such as the ingestion of a specific AIP. The Order Agreement will also specify other needed information such as the trigger event for new Data Dissemination Sessions and the criteria for selecting the OAIS holdings to be included in each new Data Dissemination Session.
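The triggering behaviour of an Event Based Order can be sketched as a standing subscription whose criteria are checked whenever the triggering event occurs - here, the ingestion of a new AIP. The names below are illustrative only:

    # Hypothetical sketch: an Event Based Order as a standing subscription
    # whose criteria are evaluated each time an AIP is ingested; matching
    # holdings start a new Data Dissemination Session.

    event_based_orders = [
        {"consumer": "consumer-42",
         "criteria": lambda aip: aip["producer"] == "instrument-team-A"},
    ]

    def start_dissemination_session(consumer, aips):
        # A real OAIS would package DIPs and deliver them per the Order
        # Agreement; here we only record the intent.
        print(f"Dissemination session for {consumer}: {[a['id'] for a in aips]}")

    def on_aip_ingested(aip):
        for order in event_based_orders:
            if order["criteria"](aip):
                start_dissemination_session(order["consumer"], [aip])

    on_aip_ingested({"id": "AIP-2003-0042", "producer": "instrument-team-A"})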
The Order Agreement does not have to be a formal document. In general an OAIS will have a general pricing policy and maintain an information base of the electronic and physical mailing addresses of its users. In this case, the process of developing an Order Agreement may be no more than the completion of a World Wide Web form to specify the AIPs of interest.
The interaction between the Production Facility and Production Management mirrors the interaction between the OAIS and OAIS Management. To put it another way, OAIS Management and Producer Management are separate instances of a more abstract entity, which we can simply label ``Management''.
In many ways, supplier interactions with the Production Facility mirror those between the OAIS and the Consumer, with the Production Facility now serving in the role of the Consumer and the Suppliers serving in the role of the OAIS. However, the Supplier also resembles the Producer, in that the relationship is usually formalized by contract.
Thus, the first contact between the Production Facility and a Supplier is a request that the Supplier provide certain data products to the Production Facility. This contact may be initiated by the Production Facility, the Supplier, or by Production Management. The Supplier establishes a Submission Agreement with the Production Facility, which identifies the SIPs to be submitted and may span any length of time for this submission. Some Submission Agreements will reflect a mandatory requirement to provide information to the Production Facility, while others will reflect a voluntary offering of information. Even in the case where no formal Submission Agreement exists, such as a World Wide Web (WWW) site, a virtual Submission Agreement may exist specifying the file formats and the general subject matter the site will accept.
The Submission Agreement between the Supplier and the Production Facility is essentially identical in character to the one between a Producer and an OAIS. This is helpful in considering use cases for the model because we do not have to reinvent a separate kind of agreement.
The preceding sections provide a broad overview of the concepts and activities of an OAIS and a Production Facility. The core of this work is a collection of thirty to forty use cases. These are generic descriptions of activities and interactions that occupy the resources of the two kinds of facilities we have been describing. Because the number of use cases is fairly large, we believe it is useful to organize them roughly according to how rapidly the activities need to operate.
In doing so, we have been strongly guided by the approach taken by S. B. Gershwin [1999, particularly Chapter 10 of this work]. As he suggests in the introductory material for this chapter [op. cit., p. 359], ``Most manufacturing systems are large and complex. It is natural, therefore, to divide the control or management into a hierarchy consisting of a number of different levels. Each level is characterized by the length of the planning horizon and the kind of data required for the decision-making process. Higher levels of the hierarchy typically have long horizons and use highly aggregated data, while lower levels have shorter horizons and use more detailed information. The nature of uncertainties at each level of control also varies.''
We can put this into a more concrete instantiation. For purposes of setting up the use cases, we may think of the top level planning horizon (for Preservation Planning, for example) as ten years. At this level we expect the management of either the OAIS or the Production Facility to try to make sensible plans that have annual subdivisions, with reviews on an annual basis. At the lowest practical level we consider, where CPUs are running individual jobs, the planning horizon may be as short as a few hours, with events occurring every few seconds. Clearly, the information needed to make decisions at this level is much more detailed than it is at the highest level. At the same time, it does not make sense to try to schedule ten years into the future with a precision of seconds. The range from 1 millisecond to ten years is about 3 × 10^11. It would require an extraordinary amount of storage - not to mention computation time - to try to keep track of all events over this dynamic range. Breaking down activities into a temporal hierarchy seems to be a sensible design philosophy. Accordingly, we arrange the use cases in more or less an inverse frequency ordering (lowest frequency first).
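The arithmetic behind this dynamic range is easy to check, as the short calculation below shows:

    # Ten-year planning horizon expressed in milliseconds, compared with an
    # event resolution of about 1 millisecond at the lowest level.
    ms_per_year = 365.25 * 24 * 3600 * 1000
    dynamic_range = 10 * ms_per_year / 1.0
    print(f"{dynamic_range:.1e}")     # about 3.2e+11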
There are several key concepts we will use in our use case modeling: processes, roles, and resources. Gershwin [op. cit., p. 363] suggests that ``A resource is any part of the production system that is not consumed or transformed during the production process. Machines - both material transformation and inspection machines, workers, pallets, and sometimes tools - if we ignore wear or breakage - can be modeled as resources. Workpieces and processing chemicals cannot.
For the purposes of this ... [writing], we define event as a change in the discrete part of the state or a discontinuous change in a rate or parameter. ...
An activity [or process] is a pair of events associated with a resource. The first event corresponds to the start of the activity, and the second is the end of the activity. Only one activity can appear at a resource at any time.''
We cover the relationship between processes (or activities) in more detail elsewhere in this material. The critical point is that the duration of a process is usually not deterministic; each process is better described by a probability distribution for its duration. Because we want to be able to use our use case instances (or scenarios) to provide realistic cost and schedule estimates, we need to make sure that our business process description can accommodate this kind of probabilistic description.
In the subsections that follow immediately, we lay out a schedule-based structure for the use cases. We begin with a high-level breakdown of activity phases that are appropriate for a generic description of the Production Facility. Then, we provide a breakdown of use cases associated with the consumer and rogue user populations. The production use cases and the consumer ones form the driving forces on the activities of both the OAIS and the Production Facility. At the end, we consider the activities that arise out of the other functions these organizations must undertake.
We expect a producer to have six phases (one more than identified in the Producer-Archive Interface Methodology Abstract Standard [2002]):
Figure 3. Large Scale Notional Production Activity Schedule
Figure 3 gives a notional schedule for production activity with this breakdown. In the subsections that follow, we expand the activities in each of the phases identified in this figure.
Figure 4. Notional Proposal Opportunity Seeking Schedule
Figure 5. Notional Preliminary Phase Schedule
Figure 6. Notional Formal Definition Phase Schedule
Figure 7. Notional Transfer Phase Schedule
Figure 8. Notional Product Validation Phase Schedule
Figure 9. Notional Production Phase Schedule
If we have a stochastic duration, we can also deal quantitatively with reliability and Quality of Service (QOS) calculations. As a practical note, we expect each process to have a cumulative distribution function (CDF) for its duration that can be determined empirically. Given this information, it is possible to compute the duration of a group of services that have been assembled into a service composition. Technically, this means that we view the core of web choreography as assembling Directed Acyclic Graphs (DAGs) that describe the precedence relations between the services. Such a graph also provides a quantitative basis for assembling a schedule - with a distribution of uncertainty in the duration of the total service. In other words, if we attach a CDF to the duration of each process, we can, in principle, calculate the probability of providing the service within a specified time interval. This is exactly the meaning we would attach to Quality of Service - ``we agree to provide this service within 2 hours 95% of the time''.
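By way of illustration, the sketch below (a hypothetical Python fragment, not part of any standard) represents a small service composition as a DAG, draws each process duration from an empirical sample that stands in for its CDF, and estimates by Monte Carlo simulation the probability of completing the composed service within a deadline:

    import random

    # Precedence relations: each process lists the processes that must finish first.
    dag = {
        "ingest":   [],
        "validate": ["ingest"],
        "process":  ["validate"],
        "package":  ["process"],
        "deliver":  ["package"],
    }

    # Empirical duration samples (minutes) standing in for each process CDF.
    duration_samples = {
        "ingest":   [5, 7, 6, 9, 8],
        "validate": [2, 2, 3, 4, 2],
        "process":  [40, 55, 60, 45, 90],
        "package":  [3, 4, 3, 5, 4],
        "deliver":  [10, 12, 15, 11, 30],
    }

    def simulate_once():
        """One realization: finish time of the whole composition (minutes)."""
        finish = {}
        remaining = dict(dag)
        while remaining:
            for proc, preds in list(remaining.items()):
                if all(p in finish for p in preds):
                    start = max((finish[p] for p in preds), default=0.0)
                    finish[proc] = start + random.choice(duration_samples[proc])
                    del remaining[proc]
        return max(finish.values())

    def probability_within(deadline_minutes, trials=10000):
        hits = sum(simulate_once() <= deadline_minutes for _ in range(trials))
        return hits / trials

    print(f"P(complete within 2 hours) ~= {probability_within(120):.2f}")

A production implementation would replace the raw samples with fitted or empirical distribution functions, but the structure of the calculation - propagate durations through the precedence graph, then read off the probability of meeting the deadline - is the same.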
To the extent possible, we encourage the use cases to show transaction protocols that encourage reliability. In other words, the use case instances we present - together with the attached synchronization diagrams - are intended to suggest patterns of interaction, transactions, that can substantially improve reliability. This means that the use cases should encourage the use of ``BEGIN - COMMIT OR ROLLBACK'' protocols, the keeping of transaction journals that are periodically used for automated auditing of activity, and systematic auditing of all activities in the system. This approach also provides a way of substantially reducing the time required to recover from an exception or security breach. Implicitly, we seek to encourage rapid diagnosis and fix, rather than a priori monitoring, particularly if personnel are involved.
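As one possible illustration of such a pattern, the sketch below (hypothetical names and journal format) brackets every action between BEGIN and COMMIT-or-ROLLBACK journal entries and provides a trivial audit that flags transactions left open:

    import json, time, uuid

    JOURNAL = "transaction_journal.log"

    def journal(entry):
        # Append a time-stamped journal record; the journal is the audit trail.
        entry["timestamp"] = time.time()
        with open(JOURNAL, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def run_transaction(description, action):
        txn_id = str(uuid.uuid4())
        journal({"txn": txn_id, "event": "BEGIN", "what": description})
        try:
            action()
            journal({"txn": txn_id, "event": "COMMIT"})
            return True
        except Exception as exc:
            journal({"txn": txn_id, "event": "ROLLBACK", "reason": str(exc)})
            return False

    def audit():
        """Return transaction ids that began but never committed or rolled back."""
        begun, closed = set(), set()
        with open(JOURNAL) as f:
            for line in f:
                entry = json.loads(line)
                (begun if entry["event"] == "BEGIN" else closed).add(entry["txn"])
        return sorted(begun - closed)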
We can also see that a CDF can include the probability of an exception occurring. From the standpoint of a DAG, an exception prunes the original graph of completed processes and those that cannot be completed. The exception also adds additional processes that must be grafted onto the revised graph. We then expect that the revised graph will provide the information to calculate a new duration with the exception handled. The probability of raising an exception gives us a convenient way of quantifying the reliability of the system. The revised graph gives us a way of systematically thinking about exception handling policies. Basically, the graph that describes the way the exception is handled gives us a systematic description of how the system will handle problems.
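Continuing in the same illustrative vein, the sketch below shows one way an exception at a single process could prune the precedence graph and graft recovery processes onto the revised graph before the completion estimate is recomputed:

    # Hypothetical sketch: on an exception at one process, drop that process and
    # everything downstream of it, then graft recovery processes onto the graph.

    dag = {
        "ingest":   [],
        "validate": ["ingest"],
        "process":  ["validate"],
        "package":  ["process"],
        "deliver":  ["package"],
    }

    def revise_dag_on_exception(dag, failed, recovery):
        dropped = {failed}
        changed = True
        while changed:                      # prune processes that can no longer complete
            changed = False
            for proc, preds in dag.items():
                if proc not in dropped and any(p in dropped for p in preds):
                    dropped.add(proc)
                    changed = True
        revised = {p: preds for p, preds in dag.items() if p not in dropped}
        revised.update(recovery)            # graft the recovery processes
        return revised

    revised = revise_dag_on_exception(
        dag, failed="process",
        recovery={"reprocess": ["validate"],
                  "repackage": ["reprocess"],
                  "redeliver": ["repackage"]})
    print(revised)

The same Monte Carlo calculation sketched earlier can then be rerun on the revised graph to obtain the new duration distribution with the exception handled.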
From this perspective, we can use the use cases to assist in security analysis. Several of the use cases deal with attacks by various categories of rogue users. We can use a quantitative model of user activities to estimate the frequency of attacks. Furthermore, the approach we are taking to system design encourages us to lay out potential vulnerabilities and to develop means of reducing the risk of successful attack.
The material we are describing is very complex. It involves both the interaction of computer services and organizations of people. Hopefully, we can make the use cases clear enough that they can help avoid major difficulties in more detailed design stages of system development. At the same time, it is difficult to gain experience with the impact of design choices. While modern approaches to system development do allow some flexibility to redesign the system, it would be helpful to be able to simulate the system - and to adjust the design on the basis of the simulation results.
Accordingly, we will try to provide a way of converting the use case instances to a form where the operation of the system can be simulated. We will also find that this approach (particularly when applied with Gershwin's suggestion of using a hierarchical control philosophy) lends itself to ensuring that the statistics collected by a management information system will be useful in controlling the system. This kind of test capability is also important for quantitative evaluation of system reliability.