Andrew Farmer, National Center for Genome Resources, Santa Fe, NM. adf@ncgr.org
ISYS (Integrated SYStem)
is a project begun in 1999 at NCGR to explore the area of biological data and
service integration. Most integration approaches in the field at that time
seemed to focus on enabling complex declarative query capabilities across
potentially distributed data sources; although we recognized the importance of
these problems, our experience with our user communities led us to approach the
problem from the end-user perspective, focusing heavily on the rich
capabilities of client-side software components that allow researchers to
build integrated datasets tailored to their particular interests. In
addition, we recognized that the components of the system were going to be
produced in a bottom-up fashion, driven by different funding sources and controlled
by different interests (including other organizations). And while these
separate projects might have many overlaps in terms of the pieces of the
species and biological data type "matrix" that they handled, the
organization of the data model and interfaces of the resources produced by such
projects could not be fit into a Procrustean "one-data
model-fits-all" solution. This conviction was further bolstered by the
apparent failure of heavyweight, committee-designed standards to have as much
impact on the field as the de facto standards provided by community-driven
resources, and by the suitability of loosely specified but rapidly adaptable
technologies such as Perl and the WWW to a fast-evolving field.
We therefore tried to construct an architecture focused on client-side
integration of data and analytical resources that would give the end user an
environment in which they could intuitively explore the data, while preserving
the autonomy and independent evolvability of the different components that
composed a given instance of the system. Although ignorant at that time of the
ideas and technologies of the semantic web community, we nevertheless converged
upon many of the same design principles that serve as its foundation, and
addressed on a small scale many of the same problems that are still being
worked out on the world-wide scale of the web. The strengths and shortcomings
of our approach to the problems of "open-world" integration may prove
instructive to future attempts to explore these areas using the emergent
semantic web technologies.
ISYS was conceived as a platform that serves as a
lightweight broker of services and data between independently pluggable
components. These components might be written from scratch to take advantage of
the platform facilities, or might be developed as simple adapters to existing
resources (including server-side data and analysis resources). The behavior of
any given instance of the system depends on the set of components installed at
the user's discretion. Components generally fall into one of two
basic classes, ServiceProviders and Clients; these, in turn, define the two
major brokering functions of the platform, ServiceBroker and EventChannel.
The ServiceBroker aspect of the system comes in two flavors: "static
service" and "dynamic service" brokering. Static service
brokering is similar to the approach taken in systems like CORBA/RMI/UDDI
registries, where strongly typed interface signatures are defined, and clients
that are specifically designed to make use of these interfaces can use the
broker to find one or more implementations of the service class that they have
been designed to use; this is not conceptually all that different from
dynamically linked libraries on an internet scale. It should be noted that ISYS
allowed an open-world approach to interface design here; any developer could
create a new service interface or extend an existing one, and the broker could
use Java's built-in, statically determined inheritance hierarchy to broker
service instances according to their implementation of the requested (or
specialized) interface.
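
To illustrate, static brokering amounts to something like the following minimal sketch; the service names and the broker's methods here are hypothetical illustrations, not the actual ISYS API:

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of static service brokering; all names are illustrative,
    // not the actual ISYS API. Providers register instances; clients written
    // against an interface ask the broker for any implementation of it.
    public class StaticBrokerSketch {
        public interface SequenceAlignmentService {
            String align(String a, String b);
        }

        // Extending an interface specializes the service type; instances of the
        // subtype satisfy requests for the supertype via the inheritance hierarchy.
        public interface ProteinAlignmentService extends SequenceAlignmentService {}

        public static class ServiceBroker {
            private final List<Object> registry = new ArrayList<Object>();
            public void register(Object service) { registry.add(service); }
            public <T> T find(Class<T> iface) {
                for (Object s : registry)
                    if (iface.isInstance(s)) return iface.cast(s);
                return null;
            }
        }

        public static void main(String[] args) {
            ServiceBroker broker = new ServiceBroker();
            broker.register((ProteinAlignmentService) (a, b) -> a + " ~ " + b);
            // A client coded against the general interface finds the specialized
            // implementation without knowing its concrete type.
            SequenceAlignmentService svc = broker.find(SequenceAlignmentService.class);
            System.out.println(svc.align("MKTAY", "MKVAY"));
        }
    }

The essential point is that the client is compiled against the interface alone; which implementation it receives depends entirely on what has been installed.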
Dynamic service brokering is a more flexible concept, and makes heavy use of
the ISYS semi-structured approach to data modeling which we shall discuss
below. The basic idea here was that the user would select an arbitrary set of
(possibly heterogeneous) data objects through one of the client interfaces,
then ask the system "what could be done with this data?" The system
would then ask each registered ServiceProvider in turn to inspect the given
data set and determine whether any of its services could operate on some aspect
of the data; in practice, this inspection of the data is typically implemented
as a lightweight "query" of the dataset for data types (i.e.
attributes or properties) on which the service can operate. The list of
services would then be presented to the user, who could inspect it for
entries that might allow them to further their exploration. The services
invoked might produce new data, or an alternative visualization of the data at
hand. In some sense, this might be likened to dynamic, data-driven hyperlink
generation between unrelated components; however, unlike the typical
web-browser approach to traversing hyperlinks between documents, replacing the
source document with the target, executing the services would typically augment
the data at hand with either new data to be aggregated with the starting set or
with a new visualization perspective on the data selected. It should be noted
that all components (even server-side resource adapters) are represented by
some object in the single process space of the client-side Java Virtual
Machine, so passing datasets around for inspection by each component's internal
logic for recognizing the suitability of the data to its services posed no
particular performance problem in practice.
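
A minimal sketch of this polling loop follows; all names are again hypothetical, and a real provider would advertise invocable service handles rather than strings:

    import java.util.Collections;
    import java.util.List;

    // Sketch of dynamic service discovery (hypothetical names, not the actual
    // ISYS API). The platform hands the user's selection to each registered
    // ServiceProvider, which performs a lightweight inspection for attribute
    // types it can operate on and advertises any applicable services.
    public class DynamicDiscoverySketch {
        interface IsysAttribute {}
        interface SequenceText extends IsysAttribute { String getSequenceText(); }

        interface ServiceProvider {
            List<String> servicesFor(List<Object> selection);
        }

        static class BlastProvider implements ServiceProvider {
            public List<String> servicesFor(List<Object> selection) {
                for (Object o : selection)
                    if (o instanceof SequenceText)       // the lightweight "query"
                        return Collections.singletonList("BLAST selected sequence(s)");
                return Collections.emptyList();
            }
        }

        public static void main(String[] args) {
            List<ServiceProvider> providers =
                Collections.<ServiceProvider>singletonList(new BlastProvider());
            SequenceText datum = () -> "ATGGCGTACGT";    // a datum carrying SequenceText
            List<Object> selection = Collections.<Object>singletonList(datum);
            // "What could be done with this data?": ask every provider in turn.
            for (ServiceProvider p : providers)
                for (String service : p.servicesFor(selection))
                    System.out.println(service);
        }
    }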
The second main aspect of the platform, the EventChannel, is used for
event-based communication between components. The primary use we have made of
this to date is with respect to the visual synchronization of the data in
independent graphical interfaces. Here, the user may request (or the system can
automatically decide based on the context) that arbitrary pairs of Client
components should be put into "synchronization"; from that time on,
when the user causes some aspect of the representation of the data in one
interface to change, the change will be packaged as a semantically-tagged event
that is communicated to all synchronization partners, for them to interpret as
they see fit, typically trying to alter their representation to correspond to
the source. For example, a change in the "selection" or
"visibility" of genes in a genomic browser interface can cause the
ontology terms to which those genes have been assigned to be similarly selected
or brought into focus in an ontology viewer, or the "addition" of new
features by a gene prediction engine can cause the genome browser to display
new location-based representations of these data. This is similar in appearance
to the behavior of well-designed user interfaces, but with the difference that
these components are not designed with knowledge of one another, and may not
even share common object references; the identification of the corresponding
"widgets" whose properties should be altered in response to an event
is designed to be a function of a common understanding of the data (possibly
augmented by user preferences). Again, the interpretation of the data exchanged
between components in these events is facilitated by the use of a
semi-structured, property-centric approach to data modeling.
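
In outline (hypothetical names; the actual ISYS event vocabulary is richer), the synchronization machinery looks something like this:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Sketch of EventChannel synchronization; names are hypothetical. A change
    // in one view is packaged as a semantically tagged event; each partner
    // interprets it against its own representation of the shared data, with no
    // shared widget references between the components.
    public class EventChannelSketch {
        static class SelectionEvent {                      // the semantic "tag" is the type
            final List<String> geneIds;                    // data identities, not widgets
            SelectionEvent(List<String> geneIds) { this.geneIds = geneIds; }
        }

        interface Client { void handle(SelectionEvent e); }

        static class EventChannel {
            private final List<Client> partners = new ArrayList<Client>();
            void synchronize(Client c) { partners.add(c); }
            void publish(Client source, SelectionEvent e) {
                for (Client c : partners)
                    if (c != source) c.handle(e);          // partners interpret as they see fit
            }
        }

        public static void main(String[] args) {
            EventChannel channel = new EventChannel();
            Client genomeBrowser = e -> { /* source view; no-op here */ };
            Client ontologyViewer = e ->                   // maps genes to its own terms
                System.out.println("select terms annotating " + e.geneIds);
            channel.synchronize(genomeBrowser);
            channel.synchronize(ontologyViewer);
            // A selection gesture in the browser propagates to all partners:
            channel.publish(genomeBrowser, new SelectionEvent(Arrays.asList("BRCA2")));
        }
    }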
The approach to data modeling within ISYS operates at three levels. At the
lowest level are IsysAttributes; these are conceptually similar to RDF
properties, but are implemented as Java interfaces that typically only specify
a single accessor method returning a simple data type (String, Integer, etc.).
Starting with properties, rather than with classes, seems an extremely
sensible approach to data sharing amongst software components, since more
often than not, interoperation between two components takes place on the basis
of only a few common attribute types (e.g. SequenceText, GeneSymbol). Using
interfaces to provide access to data facilitates wrapping an existing
resource's previously defined object types in the common semantics defined by
these attributes, and works well when interprocess communication is not
necessary, although it is a bit cumbersome to require that an interface be
declared when one would really like simply to "semantically tag" a
plainly datatyped member or return type of an existing method. On
the positive side, interfaces support the notion of inheritance, allowing one
to specialize the semantics (as with RDF subproperties) by extending the interface
(even if no new methods are introduced by the subtype). By making the
interfaces relatively atomic (containing only as many methods as are needed to
describe aspects of a datum that cannot be understood in isolation, e.g. a
value and its unit), the
attributes could be independently asserted of objects in a mix-and-match style
that avoided a combinatorial explosion of types as well as reliance on common
agreement on high-level notions of the "proper" attributes of classes
of objects. Conceptually, one could use Java's multiple interface inheritance to map an
existing class to the set of these attributes that were supported, but in
practice, this turned out to be clumsy (because of the need for static
declaration of the implementation of the interfaces by the implementation
class, and the inability to dynamically change the type of an object) as well
as inadequate to the heterogeneous nature of the domain (the frequent need to
bundle multiple attributes of the same type in a given object, without
resorting to weakly typed Collections; the ubiquitous need to return null
values for attributes conceptually associated with a class, but not known for a
given instance; the difficulty of finding rules without exception in the
domain). As with the service type declarations, the attribute interface
declarations are viewed as an open hierarchy that can be augmented by
third-party component developers at will, although the ideal for the sake of
interoperation is obviously to promote as much common semantics as is reasonable
among these independent components; putting the onus on the developer to map
their data to the common attribute "ontology" has some advantages in
terms of distributing the semantic reconciliation task, but is less flexible
than allowing a configurable mapping of a similar nature.
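
A few illustrative declarations convey the flavor; SequenceText and GeneSymbol are mentioned above, while the marker interface and MapPosition are our own illustrative assumptions rather than the exact ISYS declarations:

    // Illustrative attribute interface declarations in the ISYS style (the
    // actual ISYS attribute library may differ in detail). Each interface is
    // atomic, usually a single accessor, and can be specialized like an RDF
    // subproperty simply by extension, even when no new methods are added.
    interface IsysAttribute {}

    interface SequenceText extends IsysAttribute {
        String getSequenceText();
    }

    interface GeneSymbol extends IsysAttribute {
        String getGeneSymbol();
    }

    // A narrower semantics with the same accessor, analogous to an RDF subproperty:
    interface ApprovedGeneSymbol extends GeneSymbol {}

    // Aspects of one datum that cannot be understood in isolation (a value and
    // its unit) are bundled in a single attribute:
    interface MapPosition extends IsysAttribute {
        double getPosition();
        String getUnits();   // e.g. "cM" or "bp"
    }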
The higher levels of the ISYS data model approach essentially allow one to
aggregate multiple attributes into single IsysObjects (interpreted as
assertions of the properties of the object), and multiple IsysObjects into
IsysObjectCollections, which serve as the unit of communication in events and
dynamic service discovery; each of these presents relatively simple methods for
querying and extracting attribute content according to attribute type (and
taking into account the attribute hierarchy in resolving the queries). The
flexible, dynamic aggregation of attribute assertions into objects (whose
"type" is essentially defined by the set of attributes that have been
aggregated at any point in time) has proven quite versatile in handling a
variety of situations: representations of data from relational databases,
data parsed from flatfile documents, and the data objects behind the
graphical representations of dedicated clients (e.g. the nodes of pathways, or
the cells of gene expression matrices), as well as the ISYS web integration
subsystem, which allows data on web pages to be dynamically marked up into
semantically rich objects, and effectively allows any web browser to serve as
an ISYS client.
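
The following sketch, a hypothetical simplification of the IsysObject interface rather than its actual signature, shows attribute assertion and hierarchy-aware querying:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of attribute aggregation and typed querying. An object's "type"
    // is just the set of attribute assertions it carries at a point in time,
    // and queries resolve against the attribute interface hierarchy.
    public class IsysObjectSketch {
        interface IsysAttribute {}
        interface GeneSymbol extends IsysAttribute { String getGeneSymbol(); }
        interface ApprovedGeneSymbol extends GeneSymbol {}

        static class IsysObject {
            private final List<IsysAttribute> attrs = new ArrayList<IsysAttribute>();
            void assertAttribute(IsysAttribute a) { attrs.add(a); }   // mix-and-match
            // Query by attribute type; subtypes satisfy supertype queries, and
            // multiple attributes of the same type may be returned.
            <T extends IsysAttribute> List<T> get(Class<T> type) {
                List<T> hits = new ArrayList<T>();
                for (IsysAttribute a : attrs)
                    if (type.isInstance(a)) hits.add(type.cast(a));
                return hits;
            }
        }

        public static void main(String[] args) {
            IsysObject gene = new IsysObject();
            gene.assertAttribute((ApprovedGeneSymbol) () -> "BRCA2");
            gene.assertAttribute((GeneSymbol) () -> "FANCD1");        // a synonym
            // A query for the supertype sees both assertions:
            for (GeneSymbol s : gene.get(GeneSymbol.class))
                System.out.println(s.getGeneSymbol());
        }
    }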
Most of the work done with ISYS since the initial release of
the platform has focused on development of new components; however, our forays
into the semantic web during development of the Semantic
MOBY project have revealed an apparently close connection between the design
principles of the semantic web and those we adopted when developing the ISYS
platform.
The most obvious parallel is between RDF (along with RDF-S and OWL) and the ISYS
data model approach, and it seems reasonable to consider simply replacing these
constructs with the capabilities provided by standard semantic web toolkits
such as Jena. Since RDF already defines several serialization syntaxes, this
would provide a simple, language-neutral mechanism for extending the basic
approach to interprocess communication for events (possibly utilizing some of
the RSS infrastructure or other emergent XML-based messaging protocols for
subscription-based or peer-to-peer communication). Further, the use of
description-logic-based reasoners to maintain "computed" hierarchies
on the basis of declared constraints on classes is potentially useful to the
problem of maintaining the service and attribute hierarchies, especially in the
client-centric case where the "ontology" is defined by the set of
components chosen by a given user. The problem of matching data descriptions to
service descriptions also seems a natural fit to the strength of semantic
reasoners, especially inasmuch as the problem, as implemented by client-side
matching of service descriptions to datasets, is very much like the
description-logic notion that classes and their subsumption relations as well
as instances and their inclusion in classes should, as much as possible, be
intensionally defined rather than declared by extensional enumeration. If
extended into distributed service brokering, however, it seems likely that some
mechanism along the lines of content negotiation, but formulated in terms of
attribute-content specifying queries, rather than MIME-types, might be necessary.
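
As a minimal sketch of what such a replacement might look like, the following uses the Apache Jena Model API to render attribute assertions as RDF statements; the namespace and property URIs are of course illustrative assumptions:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    // Sketch of rendering ISYS-style attribute assertions as RDF with Jena;
    // the namespace and property URIs are illustrative.
    public class IsysToRdfSketch {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            String ns = "http://example.org/isys#";            // hypothetical namespace
            Property geneSymbol = model.createProperty(ns, "geneSymbol");
            Property sequenceText = model.createProperty(ns, "sequenceText");

            // An "IsysObject" becomes a resource; each attribute assertion
            // becomes a statement, and repeated assertions of the same
            // property type are unproblematic in RDF.
            Resource gene = model.createResource(ns + "gene/example")
                .addProperty(geneSymbol, "BRCA2")
                .addProperty(geneSymbol, "FANCD1")             // a synonym, same property
                .addProperty(sequenceText, "ATGGCGT");

            // Any standard RDF serialization yields a language-neutral
            // payload for event-based interprocess communication.
            model.write(System.out, "TURTLE");
        }
    }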
We also discovered that in many cases there seems to be a natural duality
between the compiler/validator-oriented, "design/compile-time"
syntactic/data-structure approaches to typing (represented by XML DTD/XSD
validation, WSDL and other interface definition approaches) and the
inference-engine-oriented, run-time computed semantic approaches to typing
taken by "knowledge representation" formalisms (represented by
RDF/RDF-S/OWL and OWL-S); for example, we were often able to take a static service
signature and provide a "dynamically-discoverable" version of it, or
take a class definition and present it via the attribute interfaces so as to
make it "queryable" via the IsysObject interface. This seems to be
roughly along the lines of efforts to provide WSDL services via OWL-S, or to
present XSD-typed XML fragments as literals in RDF documents.
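
A small sketch of this duality, under the same hypothetical names used above: a single aligner carries a strongly typed signature for statically coded clients, and the same capability can be advertised dynamically by inspecting a selection for the attribute types it consumes:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Optional;

    // Sketch of the static/dynamic duality; all names are hypothetical.
    public class DualitySketch {
        interface IsysAttribute {}
        interface SequenceText extends IsysAttribute { String getSequenceText(); }

        // The design-time face: a static signature, brokered by interface type.
        interface PairwiseAligner { String align(String a, String b); }

        // The run-time face: the same capability, advertised by data inspection.
        static Optional<String> advertise(PairwiseAligner svc, List<Object> selection) {
            long n = selection.stream().filter(o -> o instanceof SequenceText).count();
            return n >= 2 ? Optional.of("Align the two selected sequences")
                          : Optional.<String>empty();
        }

        public static void main(String[] args) {
            PairwiseAligner aligner = (a, b) -> a + " | " + b;
            List<Object> selection = Arrays.<Object>asList(
                (SequenceText) () -> "ATG", (SequenceText) () -> "ATA");
            advertise(aligner, selection).ifPresent(System.out::println);
        }
    }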
The emphasis of ISYS component development on intuitive graphical visualization
components is an area that seems to have received little attention in the
semantic web development community (with notable exceptions such as Haystack). The idea that
domain-specific applications would interoperate using the semantic web as their
lingua franca seems naturally to translate to the use of event-based
communication between graphical components or applications; furthermore, many
of these components do not have the document-centric mentality of the
ubiquitous web browser and provide a natural context for aggregating data. In
addition, this type of interface often allows one to sidestep query definition
languages (which few biologists have the time to master) in favor of
"gesture-based/WYSIWYG" intensional specification of classes (e.g.
"the set of all objects on the map whose location puts them in the
vicinity of this mouse drag"; "the set of all genes whose expression
is 'obviously' in the second range of a bimodal distribution"). Some
recent work with a comparative mapping component framework developed on the
ISYS platform has introduced some interesting cases of user-specifiable,
rule-based transformations from data attributes into graphical properties,
suggesting that higher layers of the semantic stack may also be important in
this dimension; it also suggests that the very crude set of semantic types
associated with events so far in ISYS could be easily extended into a
toolkit-independent language for graphical properties (along the lines of style
sheets, but semantically rich).
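
The following sketch suggests what such user-specifiable rules might look like (hypothetical names; the actual framework's rule language differs): each rule pairs a predicate over attribute content with toolkit-independent graphical property assignments:

    import java.awt.Color;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;

    // Sketch of rule-based mapping from data attributes to graphical
    // properties, in the spirit of a style sheet over semantics rather
    // than markup. Names are illustrative.
    public class StyleRuleSketch {
        interface IsysAttribute {}
        interface ExpressionLevel extends IsysAttribute { double getLevel(); }

        static class StyleRule {
            final Predicate<ExpressionLevel> when;
            final Map<String, Object> style;      // e.g. "fill" -> a Color
            StyleRule(Predicate<ExpressionLevel> when, Map<String, Object> style) {
                this.when = when;
                this.style = style;
            }
        }

        public static void main(String[] args) {
            List<StyleRule> rules = Arrays.asList(
                new StyleRule(e -> e.getLevel() > 2.0,
                    Collections.<String, Object>singletonMap("fill", Color.RED)),
                new StyleRule(e -> e.getLevel() <= 2.0,
                    Collections.<String, Object>singletonMap("fill", Color.GRAY)));
            ExpressionLevel datum = () -> 3.5;     // a cell of an expression matrix
            for (StyleRule rule : rules)
                if (rule.when.test(datum))
                    System.out.println(rule.style);   // handed to the rendering toolkit
        }
    }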
In conclusion, there is nothing about Web architecture that precludes
including rich client-side components and applications (which happen to
reside on the user's desktop) as resources in the Web. It seems high time to move
away from "classical" server-side approaches to generating user
interfaces and images for presentation in a hypertext document browser with
imagemaps and perhaps a smattering of JavaScript; even the idea of applets is
still embedded in the context of the document, with the applet running in the
space of the browser, restricted (for security reasons) to communication with
the server from which it was downloaded. The vision of the semantic web seems to have a perfect niche
for domain-specific user-driven user agents, presenting customized perspectives
on the rich world of data that forms the basis of modern biological research.
Our work on ISYS seems to demonstrate both that such an approach is of interest
to the user community, and that there is much room for leveraging the
semantic aspects of the web within a rich client framework.