Andrew Farmer National Center for Genome Resources, Santa Fe NM firstname.lastname@example.org
ISYS (Integrated SYStem)
is a project begun in 1999 at NCGR to explore the area of biological data and
service integration. Most integration approaches in the field at that time
seemed to focus on enabling complex declarative query capabilities across
potentially distributed data sources; although we recognized the importance of
these problems, our experience with our user communities led us to approach the
problem from the end-user perspective, focusing heavily upon the rich
capabilities of client-side software components for allowing researchers to
build integrated datasets suited to their particular interests. In
addition, we recognized that the components of the system were going to be
produced in a bottom-up fashion, driven by different funding sources and controlled
by different interests (including other organizations). And while these
separate projects might have many overlaps in terms of the pieces of the
species and biological data type "matrix" that they handled, the
organization of the data model and interfaces of the resources produced by such
projects could not be fit into a Procrustean "one-data
model-fits-all" solution. This conviction was further bolstered by the
apparent failure of heavy-weight committee-designed standards to have as much
impact on the field as the de facto standards provided by community-driven
resources, and the suitability of loosely specified but rapidly adaptable
technologies such as Perl or the WWW to the fast-evolving field.
We therefore tried to construct an architecture focused on client-side integration of data and analytical resources that would give the end user an environment in which they could intuitively explore the data, while preserving the autonomy and independent evolvability of the different components that composed a given instance of the system. Although ignorant at that time of the ideas and technologies of the semantic web community, we nevertheless converged upon many of the same design principles that serve as its foundation, and addressed on a small scale many of the same problems that are still being worked out on the world-wide scale of the web. The strengths and shortcomings of our approach to the problems of "open-world" integration may prove instructive to future attempts to explore these areas using the emergent semantic web technologies.
ISYS was conceived as a platform that serves as a
lightweight broker of services and data between independently pluggable
components. These components might be written from scratch to take advantage of
the platform facilities, or might be developed as simple adapters to existing
resources (including server-side data and analysis resources). The behavior of
any given instance of the system depends on the set of components installed at
the user's discretion. Components generally fall into one or the other of two
basic classes: ServiceProviders and Clients; these, in turn, define the two
major brokering functions of the platform, ServiceBroker and EventChannel.
The ServiceBroker aspect of the system comes in two flavors: "static service" and "dynamic service" brokering. Static service brokering is similar to the approach taken in systems like CORBA/RMI/UDDI registries, where strongly typed interface signatures are defined, and clients that are specifically designed to make use of these interfaces can use the broker to find one or more implementations of the service class that they have been designed to use; this is not conceptually all that different from dynamically linked libraries on an internet scale. It should be noted that ISYS allowed an open-world approach to interface design here: any developer could create a new service interface or extend an existing one, and the broker could use Java's built-in, statically determined inheritance hierarchy to broker service instances according to their implementation of the requested (or a specialized) interface.
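The static brokering idea can be illustrated with a minimal sketch. It is not the ISYS API: the names `StaticServiceBroker`, `SequenceAnalysisService`, `GenePredictionService` and `SimpleGenePredictor` are hypothetical, and the only mechanism assumed is Java's own interface hierarchy, so that a request for an interface is also satisfied by implementations of any specialization of it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical service interface; any developer may define or extend one.
interface SequenceAnalysisService {
    String analyze(String sequence);
}

// A specialization added independently, with no new methods.
interface GenePredictionService extends SequenceAnalysisService {
}

// Hypothetical implementation contributed by some component.
class SimpleGenePredictor implements GenePredictionService {
    @Override
    public String analyze(String sequence) {
        return "predicted genes for " + sequence.length() + " bases";
    }
}

// Minimal broker: looks up registered services by requested interface.
class StaticServiceBroker {
    private final List<Object> services = new ArrayList<>();

    void register(Object service) {
        services.add(service);
    }

    // Returns every registered service that implements the requested
    // interface, including implementations of its sub-interfaces.
    <T> List<T> find(Class<T> requested) {
        List<T> matches = new ArrayList<>();
        for (Object service : services) {
            if (requested.isInstance(service)) {
                matches.add(requested.cast(service));
            }
        }
        return matches;
    }
}

class StaticBrokerDemo {
    public static void main(String[] args) {
        StaticServiceBroker broker = new StaticServiceBroker();
        broker.register(new SimpleGenePredictor());
        // A client written against the general interface still finds
        // the more specialized implementation.
        for (SequenceAnalysisService s : broker.find(SequenceAnalysisService.class)) {
            System.out.println(s.analyze("ACGTACGT"));
        }
    }
}
```

A client designed for `SequenceAnalysisService` needs no knowledge of `GenePredictionService`; the type hierarchy does the matching.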
Dynamic service brokering is a more flexible concept, and makes heavy use of the ISYS semi-structured approach to data modeling which we shall discuss below. The basic idea here was that the user would select an arbitrary set of (possibly heterogeneous) data objects through one of the client interfaces, then ask the system "what could be done with this data?" The system would then ask each registered ServiceProvider in turn to inspect the given data set and determine whether any of its services could operate on some aspect of the data; in practice, this inspection of the data is typically implemented as a lightweight "query" of the dataset for data types (i.e. attributes or properties) on which the service can operate. The list of services would then be presented to the user, who could inspect it for services that might further their exploration. The services invoked might produce new data, or an alternative visualization of the data at hand. In some sense, this might be likened to dynamic, data-driven hyperlink generation between unrelated components; however, unlike the typical web-browser approach to traversing hyperlinks between documents, in which the target document replaces the source, executing the services would typically augment the data at hand with either new data to be aggregated with the starting set or a new visualization perspective on the data selected. It should be noted that all components (even server-side resource adapters) have some object that represents them in the single process space of the client-side Java Virtual Machine, so passing datasets around for inspection by each component's internal logic for recognizing the suitability of the data to its services was not a particular problem in terms of performance.
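The discovery loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the ISYS implementation: data objects are reduced to maps from attribute-type names to values, and `DynamicServiceProvider`, `AttributeTriggeredProvider` and `DynamicServiceBroker` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract: a provider inspects the selection and reports
// which of its services could operate on it.
interface DynamicServiceProvider {
    List<String> servicesFor(List<Map<String, Object>> selection);
}

// Offers one service whenever any selected object carries a given attribute.
class AttributeTriggeredProvider implements DynamicServiceProvider {
    private final String attributeType;
    private final String serviceName;

    AttributeTriggeredProvider(String attributeType, String serviceName) {
        this.attributeType = attributeType;
        this.serviceName = serviceName;
    }

    @Override
    public List<String> servicesFor(List<Map<String, Object>> selection) {
        List<String> offered = new ArrayList<>();
        for (Map<String, Object> dataObject : selection) {
            // The "lightweight query": is the needed attribute present?
            if (dataObject.containsKey(attributeType)) {
                offered.add(serviceName);
                break;
            }
        }
        return offered;
    }
}

// Asks each registered provider in turn what it can do with the selection.
class DynamicServiceBroker {
    private final List<DynamicServiceProvider> providers = new ArrayList<>();

    void register(DynamicServiceProvider provider) {
        providers.add(provider);
    }

    List<String> discover(List<Map<String, Object>> selection) {
        List<String> available = new ArrayList<>();
        for (DynamicServiceProvider provider : providers) {
            available.addAll(provider.servicesFor(selection));
        }
        return available;
    }
}

class DynamicBrokerDemo {
    public static void main(String[] args) {
        DynamicServiceBroker broker = new DynamicServiceBroker();
        broker.register(new AttributeTriggeredProvider("SequenceText", "Run similarity search"));
        broker.register(new AttributeTriggeredProvider("GeneSymbol", "Show ontology terms"));

        // A heterogeneous selection: one object has a symbol, one a sequence.
        Map<String, Object> gene = new HashMap<>();
        gene.put("GeneSymbol", "geneA");
        Map<String, Object> clone = new HashMap<>();
        clone.put("SequenceText", "ACGT");
        List<Map<String, Object>> selection = new ArrayList<>();
        selection.add(gene);
        selection.add(clone);

        System.out.println(broker.discover(selection));
    }
}
```

The broker itself knows nothing about the data; each provider encapsulates its own test of suitability.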
The second main aspect of the platform, the EventChannel, is used for event-based communication between components. The primary use we have made of this to date is the visual synchronization of data in independent graphical interfaces. Here, the user may request (or the system can automatically decide based on the context) that any arbitrary pair of Client components be put into "synchronization"; from that time on, when the user causes some aspect of the representation of the data in one interface to change, the change is packaged as a semantically tagged event that is communicated to all synchronization partners for them to interpret as they see fit, typically by trying to alter their own representation to correspond to the source. For example, a change in the "selection" or "visibility" of genes in a genomic browser interface can cause the ontology terms to which those genes have been assigned to be similarly selected or brought into focus in an ontology viewer, or the "addition" of new features by a gene prediction engine can cause the genome browser to display new location-based representations of these data. This is similar in appearance to the behavior of well-designed user interfaces, but with the difference that these components are not designed with knowledge of one another, and may not even share common object references; the identification of the corresponding "widgets" whose properties should be altered in response to an event is designed to be a function of a common understanding of the data (possibly augmented by user preferences). Again, the interpretation of the data exchanged between components in these events is facilitated by the use of a semi-structured, property-centric approach to data modeling.
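A minimal sketch of such synchronization follows. All names (`SketchEventChannel`, `SyncClient`, `SyncEvent`, `BrowserClient`, `OntologyViewerClient`) and the gene-to-term mapping are hypothetical illustrations, not ISYS classes or real annotations; the point is only that the two clients share an understanding of the data (gene symbols) rather than references to each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A semantically tagged event carrying data, not widget references.
class SyncEvent {
    final String semanticType;      // e.g. "selection" or "visibility"
    final List<String> geneSymbols; // the data the event is about

    SyncEvent(String semanticType, List<String> geneSymbols) {
        this.semanticType = semanticType;
        this.geneSymbols = geneSymbols;
    }
}

interface SyncClient {
    void handle(SyncEvent event);
}

// Delivers events only to the synchronization partners of the source.
class SketchEventChannel {
    private final Map<SyncClient, List<SyncClient>> partners = new HashMap<>();

    void synchronize(SyncClient a, SyncClient b) {
        partners.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        partners.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    void publish(SyncClient source, SyncEvent event) {
        for (SyncClient partner : partners.getOrDefault(source, new ArrayList<>())) {
            partner.handle(event);
        }
    }
}

// Mirrors selections of genes directly.
class BrowserClient implements SyncClient {
    final List<String> selectedGenes = new ArrayList<>();

    @Override
    public void handle(SyncEvent event) {
        if ("selection".equals(event.semanticType)) {
            selectedGenes.clear();
            selectedGenes.addAll(event.geneSymbols);
        }
    }
}

// Interprets a gene selection in its own terms: selects assigned terms.
class OntologyViewerClient implements SyncClient {
    private final Map<String, String> termByGene = new HashMap<>();
    final List<String> selectedTerms = new ArrayList<>();

    OntologyViewerClient() {
        termByGene.put("geneA", "term:transport"); // made-up assignment
    }

    @Override
    public void handle(SyncEvent event) {
        if (!"selection".equals(event.semanticType)) {
            return; // this viewer ignores event types it does not understand
        }
        selectedTerms.clear();
        for (String gene : event.geneSymbols) {
            String term = termByGene.get(gene);
            if (term != null) {
                selectedTerms.add(term);
            }
        }
    }
}

class EventChannelDemo {
    public static void main(String[] args) {
        SketchEventChannel channel = new SketchEventChannel();
        BrowserClient browser = new BrowserClient();
        OntologyViewerClient viewer = new OntologyViewerClient();
        channel.synchronize(browser, viewer);

        // The user selects two genes in the browser; the viewer reacts.
        channel.publish(browser, new SyncEvent("selection", Arrays.asList("geneA", "geneX")));
        System.out.println(viewer.selectedTerms);
    }
}
```

Each partner decides for itself how, or whether, to respond, which is what allows components written without knowledge of one another to stay in step.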
The approach to data modeling within ISYS operates at three levels. At the lowest level are IsysAttributes; these are conceptually similar to RDF properties, but are implemented as Java interfaces that typically specify only a single accessor method returning a simple data type (String, Integer, etc.). Starting with properties rather than with classes seems to be an extremely sensible approach for data sharing amongst software components, since more often than not, interoperation between two components takes place on the basis of only a few common attribute types (e.g. SequenceText, GeneSymbol). Using interfaces to provide access to data facilitates wrapping an existing resource's previously defined object types in the common semantics defined by these attributes, and works well when interprocess communication is not necessary, although it is a bit cumbersome to require that an interface be declared when one would really like merely to "semantically tag" a "merely datatyped" member or return type of an existing method. On the positive side, interfaces support the notion of inheritance, allowing one to specialize the semantics (as with RDF subproperties) by extending the interface (even if no new methods are introduced by the subtype). By making the interfaces relatively atomic (as few methods as could describe aspects of a datum that could not be understood in isolation, e.g. a value and a unit), the attributes could be independently asserted of objects in a mix-and-match style that avoided a combinatorial explosion of types as well as reliance on common agreement on high-level notions of the "proper" attributes of classes of objects.
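As an illustration of this style, the sketch below declares a few atomic attribute interfaces and wraps a pre-existing class in them. `SequenceText` and `GeneSymbol` are the attribute types named above, but the method signatures, the marker interface `Attribute`, and the names `ProteinSequenceText`, `SequenceLength`, `LegacyProteinRecord` and `AttributeAdapters` are assumptions made for the example, not the actual ISYS declarations.

```java
// Marker for all attribute interfaces in this sketch.
interface Attribute {
}

// One accessor returning a simple data type.
interface SequenceText extends Attribute {
    String getSequenceText();
}

// Specialization with no new methods, analogous to an RDF subproperty.
interface ProteinSequenceText extends SequenceText {
}

interface GeneSymbol extends Attribute {
    String getGeneSymbol();
}

// A value and its unit cannot be understood in isolation,
// so they share a single attribute interface.
interface SequenceLength extends Attribute {
    int getLengthValue();
    String getLengthUnit();
}

// A pre-existing resource class that knows nothing about attributes.
class LegacyProteinRecord {
    final String name;
    final String residues;

    LegacyProteinRecord(String name, String residues) {
        this.name = name;
        this.residues = residues;
    }
}

// Adapters expose the legacy object through the shared attribute types.
class AttributeAdapters {
    static ProteinSequenceText sequenceOf(final LegacyProteinRecord record) {
        return new ProteinSequenceText() {
            @Override
            public String getSequenceText() {
                return record.residues;
            }
        };
    }

    static SequenceLength lengthOf(final LegacyProteinRecord record) {
        return new SequenceLength() {
            @Override
            public int getLengthValue() {
                return record.residues.length();
            }

            @Override
            public String getLengthUnit() {
                return "residues";
            }
        };
    }
}

class AttributeDemo {
    public static void main(String[] args) {
        LegacyProteinRecord record = new LegacyProteinRecord("example", "MKVL");
        // A consumer that only understands SequenceText can still use
        // the more specific ProteinSequenceText.
        SequenceText text = AttributeAdapters.sequenceOf(record);
        SequenceLength length = AttributeAdapters.lengthOf(record);
        System.out.println(text.getSequenceText() + " "
                + length.getLengthValue() + " " + length.getLengthUnit());
    }
}
```

Note that the legacy class is never modified; the attributes are asserted of it from outside, one at a time.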
Conceptually, one could use Java's multiple interface inheritance to map an existing class to the set of these attributes that it supported, but in practice this turned out to be clumsy (because of the need for static declaration of the interfaces by the implementing class, and the inability to dynamically change the type of an object) as well as inadequate to the heterogeneous nature of the domain (the frequent need to bundle multiple attributes of the same type in a given object without resorting to weakly typed Collections; the ubiquitous need to return null values for attributes conceptually associated with a class but not known for a given instance; the difficulty of finding rules without exception in the domain). As with the service type declarations, the attribute interface declarations are viewed as an open hierarchy that can be augmented by third-party component developers at will, although the ideal for the sake of interoperation is obviously to promote as much common semantics as is reasonable among these independent components; putting the onus on the developer to map their data to the common attribute "ontology" has some advantages in terms of distributing the semantic reconciliation task, but is less flexible than allowing a configurable mapping of a similar nature.
The higher levels of the ISYS data model approach essentially allow one to aggregate multiple attributes into single IsysObjects (interpreted as assertions of the properties of the object), and multiple IsysObjects into IsysObjectCollections, which serve as the unit of communication in events and dynamic service discovery; each of these presents relatively simple methods for querying and extracting attribute content according to attribute type (taking into account the attribute hierarchy in resolving the queries). The flexible, dynamic aggregation of attribute assertions into objects (whose "type" is essentially defined by the set of attributes that have been aggregated at any point in time) has proven well suited to a variety of situations, from the representation of data in relational databases, to data parsed from flatfile documents, to the data objects represented by the graphical representations of dedicated clients (e.g. the nodes of pathways, or the cells of gene expression matrices). It also underlies the ISYS web integration subsystem, which allows data on web pages to be dynamically marked up into semantically rich objects, and effectively allows any web browser to serve as an ISYS client.
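The aggregation and query behavior can be sketched as below. The classes `AttributeBag` and `AttributeBagCollection` stand in for IsysObject and IsysObjectCollection but are hypothetical, as are the attribute interfaces and all method names; the sketch only assumes that a query by attribute type should also return attributes of any more specialized type, and that an object may hold several attributes of the same type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical attribute interfaces, one accessor each.
interface Attr {
}

interface SeqTextAttr extends Attr {
    String getSequenceText();
}

interface ProteinSeqTextAttr extends SeqTextAttr {
}

interface GeneSymbolAttr extends Attr {
    String getGeneSymbol();
}

// An object is just the set of attributes currently asserted of it.
class AttributeBag {
    private final List<Attr> attributes = new ArrayList<>();

    AttributeBag assertAttribute(Attr attribute) {
        attributes.add(attribute);
        return this;
    }

    // Returns all attributes of the requested type or of any subtype.
    <T extends Attr> List<T> get(Class<T> type) {
        List<T> found = new ArrayList<>();
        for (Attr attribute : attributes) {
            if (type.isInstance(attribute)) {
                found.add(type.cast(attribute));
            }
        }
        return found;
    }
}

// The unit passed around in events and service discovery.
class AttributeBagCollection {
    private final List<AttributeBag> bags = new ArrayList<>();

    void add(AttributeBag bag) {
        bags.add(bag);
    }

    // Objects carrying at least one attribute of the given type.
    List<AttributeBag> having(Class<? extends Attr> type) {
        List<AttributeBag> matches = new ArrayList<>();
        for (AttributeBag bag : bags) {
            if (!bag.get(type).isEmpty()) {
                matches.add(bag);
            }
        }
        return matches;
    }
}

class AttributeBagDemo {
    public static void main(String[] args) {
        // Two symbols for the same object, plus a protein sequence.
        AttributeBag gene = new AttributeBag()
                .assertAttribute((GeneSymbolAttr) () -> "geneA")
                .assertAttribute((GeneSymbolAttr) () -> "geneA-alias")
                .assertAttribute((ProteinSeqTextAttr) () -> "MKVL");
        AttributeBag marker = new AttributeBag()
                .assertAttribute((GeneSymbolAttr) () -> "markerB");

        AttributeBagCollection collection = new AttributeBagCollection();
        collection.add(gene);
        collection.add(marker);

        // Querying for the general type finds the specialized attribute.
        for (AttributeBag bag : collection.having(SeqTextAttr.class)) {
            System.out.println(bag.get(SeqTextAttr.class).get(0).getSequenceText());
        }
    }
}
```

An absent attribute simply yields an empty list, so no null-returning accessors are needed for properties unknown for a given instance.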
Most of the work done with ISYS since the initial release of
the platform has focused on development of new components; however, our forays
into the semantic web during development of the Semantic
MOBY project have revealed an apparently close connection between the design
principles of the semantic web and those we adopted when developing the ISYS platform.
The most obvious parallel is between RDF (as well as RDF-S and OWL) and the ISYS data model approach, and it seems reasonable to consider simply replacing the ISYS constructs with the capabilities provided by standard semantic web toolkits such as Jena. Since RDF already defines (several) serialization models, this would provide a simple language-neutral mechanism for extending the basic approach to interprocess communication for events (possibly utilizing some of the RSS infrastructure or other emergent XML-based messaging protocols for subscription-based or peer-to-peer communication). Further, the use of description-logic-based reasoners to maintain "computed" hierarchies on the basis of declared constraints on classes is potentially useful for the problem of maintaining the service and attribute hierarchies, especially in the client-centric case where the "ontology" is defined by the set of components chosen by a given user. The problem of matching data descriptions to service descriptions also seems a natural fit to the strengths of semantic reasoners, especially inasmuch as the problem, as implemented by client-side matching of service descriptions to datasets, is very much like the description-logic notion that classes and their subsumption relations, as well as instances and their inclusion in classes, should as much as possible be intensionally defined rather than declared by extensional enumeration. If extended into distributed service brokering, however, it seems likely that some mechanism along the lines of content negotiation, but formulated in terms of attribute-content-specifying queries rather than MIME types, might be necessary.
We also discovered that in many cases there seems to be a natural duality between the compiler/validator-oriented, "design/compile-time" syntactic/data-structure approaches to typing (represented by XML DTD/XSD validation, WSDL and other interface definition approaches) and the inference-engine-oriented, run-time computed semantic approaches to typing taken by "knowledge representation" formalisms (represented by RDF/RDF-S/OWL and OWL-S); for example, we were often able to take a static service signature and provide a "dynamically-discoverable" version of it, or take a class definition and present it via the attribute interfaces so as to make it "queryable" via the IsysObject interface. This seems to be roughly along the lines of efforts to provide WSDL services via OWL-S, or to present XSD-typed XML fragments as literals in RDF documents.
The emphasis of ISYS component development on intuitive graphical visualization components is something that seems to have received little attention in the semantic web development community (with notable exceptions such as Haystack). The idea that domain-specific applications would interoperate using the semantic web as their lingua franca seems naturally to translate to the use of event-based communication between graphical components or applications; furthermore, many of these components do not have the document-centric mentality of the ubiquitous web browser and provide a natural context for aggregating data. In addition, this type of interface often allows one to sidestep query definition languages (which few biologists have the time to master) in favor of "gesture-based/WYSIWYG" intensional specification of classes (e.g. "the set of all objects on the map whose location puts them in the vicinity of this mouse drag"; "the set of all genes whose expression is 'obviously' in the second range of a bimodal distribution"). Some recent work with a comparative mapping component developed for the ISYS framework has introduced some interesting cases of user-specifiable, rule-based transformations from data attributes into graphical properties, suggesting that higher layers of the semantic stack may also be important in this dimension; it also suggests that the very crude set of semantic types associated with events so far in ISYS could be easily extended into a toolkit-independent language for graphical properties (along the lines of style sheets, but semantically rich).