The Bioinformatics Assay Environment:
towards an end-user interface to the SW-LS

A Position Paper for the W3C Workshop on Semantic Web for Life Sciences

Dolev Dotan and Ron Y. Pinter

{dolevd,pinter}@cs.technion.ac.il

Dept. of Computer Science

Technion – Israel Institute of Technology

September 15, 2004

1 Introduction

One of the major goals of modern bioinformatics is to enable nontrivial in silico experiments. This requires information integration across various bioinformatics data repositories and services. To achieve its full potential, such integration should be accessible to users at all levels of expertise, including biologists and bioinformaticians, not only to professional programmers. Unfortunately, the current technologies comprising the Semantic Web for Life Sciences (SW-LS) are accessible only to the latter.

To solve this problem, we introduce the Bioinformatics Assay Environment (BAE), an upcoming rich-client application which will serve as an end-user entry-point to the SW-LS. This system will allow end-users to leverage the SW-LS in a way that is transparent, integrative, traceable and repeatable. Using BAE, users will be able to easily define and enact intricate bioinformatics workflows, which include execution of services as well as posing complex queries. The system will allow users to work either off-line, by composing workflows and subsequently enacting them, or on-line, in an interactive information browsing mode. The latter mode of operation mirrors the way that many Life Science researchers search and use information. Using ontological information, BAE will guide users through the applicable and available SW-LS services at each exploration or design point, thus freeing them from the impossible task of becoming familiar with every resource.

In this paper we present the major features planned for the BAE, and how SW-LS technologies can be used to implement them. We then discuss some of the issues and challenges that we have encountered, which are due to deficiencies in features and technologies related to the languages of the Semantic Web. In particular, we found that the available languages for querying RDF and OWL are not expressive enough, and often would not be sufficient to specify users' needs. Thus, we outline some of the additional requirements for query languages over the RDF and OWL data models. Finally, we discuss quality of service (QoS) related issues pertaining to the Semantic Web and the application of Service Oriented Architecture to the world of open-access bioinformatics resources. Note that – due to lack of space – we do not discuss many other important features that are needed, such as standards for service description and global ontologies.

2 Main Features of the Bioinformatics Assay Environment

BAE provides users with a unified interface in which they can perform either off-line workflow design or on-line interactive exploration. In this interface, the workflows are edited and presented using OverFlow – a novel graphical language for workflow and query representation, which is described below. The resultant workflows can then serve as blueprints for future enactments (on the same input or on different inputs) or can be reused as sub-workflows in subsequent sessions (repeatability). In addition, since all intermediate results are kept and can be browsed at a later point, the resultant workflows can also be used for keeping an audit trail, just like a lab book in a "wet" biological assay (traceability).

2.1 The OverFlow language

The OverFlow language seamlessly combines the power of a functional visual data flow language with that of a declarative visual query language. It does so in a way that is intended to make it easier for users to retrieve, process, filter, and manipulate data by designing both queries and data flows in a single diagram and a unified paradigm.

In OverFlow, queries are visually depicted by a mixture of declarative and functional constructs, which are made to reflect the user's intuitive understanding of the query. Generally speaking, the FROM and WHERE parts of a query are drawn in a declarative form, while the other parts of the query – SELECT, GROUP-BY, functions, etc. – are written in a data-flow-like form. This division mirrors the notion that the first part resembles the selection of an interesting part of the schema, which in the user's mind is described as the (declarative) sentence: 'get just those objects of type X which are Y'. This is represented visually by selecting the relevant portion of the schema graph, and then adding filter predicate nodes. The representation of the rest of the query follows the (functional) notion that some or all of the attributes of the data items selected in the first part are chosen by the user and are then 'made to flow' to the query's output port, either directly or via intermediate stations (modules) in which they are transformed – grouped, counted, summarized, sorted, etc. OverFlow supports many advanced SQL and OQL features, such as nested queries, collection classes, server-side functions and methods, advanced joins, recursion, and CASE statements.

As a dataflow language, OverFlow introduces several key innovations. As in any other dataflow language, OverFlow diagrams contain boxes representing computational modules, which can represent external services and nested workflows; in OverFlow, modules can also represent queries, which may be shown as 'white-boxes', i.e. expanded on the same diagram. These queries can be easily connected to previous results using input ports, and thus facilitate both follow-up queries (which bring new information) and the filtering of previously retrieved results. In addition, OverFlow allows many of the modules used in query definition, such as functions for grouping, ordering and calculating aggregate properties, to be reused in dataflow.

Another innovation introduced in OverFlow is the inclusion of data nodes, which represent the data that flows through the diagram. These nodes can represent – using different visual symbols – objects, collections, data-type attributes, and tuples. Users can select which of the fields of an object or a tuple to send to a module, simply by connecting the relevant fields to the module's input ports. Connecting items in collections to modules which take individual items implicitly results in iteration over the collection. In addition, some advanced data manipulation constructs exist, such as one that allows users to correlate, using tuples, between inputs and outputs of a module. Finally, OverFlow also supports several control-flow constructs, such as conditionals (if), switches (case), loops, and module execution coordination (which can also be used to treat exceptions).
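The implicit iteration and input/output correlation described above can be illustrated with a minimal Python sketch. It is not part of OverFlow, which is a graphical language; the names apply_module, correlate and sequence_length are hypothetical, and a collection data node is modelled simply as a Python list.

```python
from typing import Any, Callable, List, Tuple


def apply_module(module: Callable[[Any], Any], data: Any) -> Any:
    """Run a single-item module on a data node.

    A collection (modelled here as a Python list) is iterated implicitly;
    any other value is passed to the module as-is.
    """
    if isinstance(data, list):
        return [module(item) for item in data]
    return module(data)


def correlate(module: Callable[[Any], Any], items: List[Any]) -> List[Tuple[Any, Any]]:
    """Pair every input item with the output the module produced for it."""
    return [(item, module(item)) for item in items]


def sequence_length(sequence: str) -> int:
    """Hypothetical module: the length of a sequence string."""
    return len(sequence)


sequences = ["ATGC", "GGA"]
lengths = apply_module(sequence_length, sequences)  # implicit iteration over the list
single = apply_module(sequence_length, "ATGC")      # plain call on a single item
pairs = correlate(sequence_length, sequences)       # (input, output) tuples
```

Here the user-visible behaviour is the same whether a single item or a collection is connected to the module; only the shape of the result changes.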

2.2 Using ontology information as a guide for workflow design

Through the use of a central ontology, as well as information about the different resources, the system can guide the user in constructing meaningful and legal queries and workflows. A similar approach was taken in projects such as TAMBIS [1]. In BAE, query construction will often start by browsing the ontology in the Ontology Browser view, which presents the ontology in a graph of concepts, where the user can navigate by expanding – from any shown concept – the concepts that are related through hierarchy or property relations. This is done simply by selecting from a context-sensitive menu that contains the applicable relations.

Once a concept is selected, it can be dragged to the OverFlow editor. There, the user formulates the query by adding restrictions on the concept's attributes and properties. In order to do so, the restricted attributes need to be shown and this is done – again – through the use of contextualized menus. Thus, only correct queries can be specified.

The system also helps users in either composing the workflow or interactively browsing the information. By clicking on a result set and bringing up the context menu, the user can get suggestions for services that are applicable to the results. The system knows what to suggest by performing matchmaking between the schema of the result set (which can be either a collection of objects or a view) and the available resource descriptions. By using OWL as the language both for the schema and for the resource descriptions, sophisticated matchmaking can be carried out by subsumption checking.
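As a rough illustration of matchmaking by subsumption, the following Python sketch suggests services whose declared input class subsumes the class of a result set. It is a deliberate simplification: the class hierarchy, the service names and the functions are all hypothetical, and a plain single-parent hierarchy stands in for the reasoning that a real OWL reasoner would perform.

```python
from typing import Dict, List, Optional

# Hypothetical fragment of a central ontology: child class -> parent class.
SUBCLASS_OF: Dict[str, str] = {
    "DNASequence": "Sequence",
    "ProteinSequence": "Sequence",
    "Sequence": "Thing",
}

# Hypothetical resource descriptions: service name -> class of accepted input.
SERVICES: Dict[str, str] = {
    "GenericAlign": "Sequence",
    "Translate": "DNASequence",
    "FoldPredict": "ProteinSequence",
}


def is_subsumed_by(cls: str, ancestor: str) -> bool:
    """Return True if `cls` equals `ancestor` or is one of its descendants."""
    current: Optional[str] = cls
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)
    return False


def applicable_services(result_class: str) -> List[str]:
    """List services whose input class subsumes the class of the result set."""
    return sorted(
        name
        for name, input_class in SERVICES.items()
        if is_subsumed_by(result_class, input_class)
    )


suggestions = applicable_services("DNASequence")
```

In this toy setting a result set of DNA sequences is offered both the service that accepts DNA sequences and the one that accepts any sequence, but not the protein-only service.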

3 Requirements for RDF and OWL query languages

Currently, the number of query languages available for the RDF data model is small (see review and comparison in [2]) and there is only one query language for OWL, OWL-QL [3]. Unfortunately, these languages fall short of providing the expressive power needed for asking many of the questions that biologists would like to ask. Here we list some of the language features that are missing. Admittedly, these are not new features for query languages, as they exist in languages like SQL and OQL; however, our intent in this section is to highlight the need to include these features in query languages over the Semantic Web data models. We first discuss some basic language requirements that are important in the context of the BAE, and then list the actual language features needed.

3.1 Basic requirements

The general, theoretical requirements for a comprehensive, expressive query language, such as closure, orthogonality, and adequacy, as well as their fulfillment (or lack thereof) in the leading RDF query languages of today, are discussed in [2]. Here we focus on two additional features that are needed: usage of schemas and (more importantly) creation of views.

3.1.1 Usage of schemas

Unlike query languages for structured or object-oriented data, which require the presence of a schema for querying the database, most of today's RDF query languages do not make any use of such information; instead, they allow retrieval of resources based on their properties (including the rdf:type property). While this method is useful for many queries on general RDF graphs, we believe there is a need to add intrinsic language primitives for specifying the RDFS or OWL classes of query variables. This will eliminate the need for repeated rdf:type statements in the query (in the common case where all information is typed), and will thus increase the readability of the language.
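One way to picture such a primitive is as syntactic sugar that a query processor expands into ordinary rdf:type patterns. The Python sketch below assumes a query is represented as a list of (subject, predicate, object) patterns; the function name, the variable names and the ex: classes are hypothetical and do not correspond to any existing query language.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, str, str]


def expand_typed_variables(
    typed_vars: Dict[str, str], patterns: List[Pattern]
) -> List[Pattern]:
    """Prepend one rdf:type pattern per typed variable to the query patterns."""
    type_patterns = [(var, "rdf:type", cls) for var, cls in typed_vars.items()]
    return type_patterns + list(patterns)


# The user declares the classes once instead of writing rdf:type statements.
declared = {"?g": "ex:Gene", "?p": "ex:Protein"}
body = [("?g", "ex:encodes", "?p")]
expanded = expand_typed_variables(declared, body)
```

The user writes only the declaration and the body; the repeated rdf:type statements appear only in the expanded form that is sent to the store.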

Today, RDF query languages use a schema as they would use any RDF document: importing it and its namespace in a USING clause. Separating this process from the query (as in SQL connections) will result in simpler queries, as it would free the user from the need to specify namespace information not only for schemas but also for plain RDF documents.

3.1.2 View creation

Today, none of the query languages for RDF or OWL allow users to create the equivalent of a relational view. This feature is of major importance to BAE (and any system of its kind) since we want to store all intermediate results and then use them in subsequent queries. This is closely related to the closure property, which requires that the results of an operation are – in turn – elements of the same data model. Only a language that adheres to this rule, namely that the output of its queries is RDF triples (for RDF) or OWL instances (for OWL), could then store these results back in the triple store or ABox, respectively.

Views can be materialized or virtual (aka computed), just as in the relational model. However, there can be several flavors for materialized views in RDF and OWL. In the simplest flavor, the results of the query are stored explicitly with no links to the original resources, possibly even under a different namespace. While this flavor most accurately corresponds to the relational model, it is somewhat inconsistent with using the inherent "global graph" feature of RDF, which is one of the flagship features of the Semantic Web. For example, in BAE we would like to allow users to select an instance in the view that is the result of some query and then traverse the graph of locally-available information about it; we might even want to fetch new information to add to that graph. Using this simple type of view, the user will be able to traverse the graph only from those properties which contain URIs.

To solve this problem and make better use of the "global graph" approach, a second type of view is needed. In this type, view members (which can be subgraphs and are the RDF/OWL equivalent of SQL tuples) do not store copies of the original information; instead, they store URI references to it. This allows navigation from all properties of the view. In this solution we encounter a problem with the SW models: not every node in the graph can be referenced, since not every node has a URI. This problem was already discovered when designing DQL, the DAML+OIL query language, whose port to OWL is OWL-QL. During the design of this language it was decided "after considerable discussion" that "DQL essentially does nothing, and we assume that servers will be able to create new URI references to designate such entities, which can then be provided as answer bindings and used in subsequent queries" [4].
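A minimal Python sketch of this second flavor of view follows. It assumes that query answers arrive as variable bindings whose values are either URI strings or nodes without a URI, and that the server may mint a fresh URI for the latter, as the quotation above suggests. The classes BlankNode and ReferenceView and the example.org URIs are hypothetical.

```python
from itertools import count
from typing import Dict, List, Union


class BlankNode:
    """A graph node that has no URI of its own."""

    def __init__(self, label: str) -> None:
        self.label = label


Node = Union[str, BlankNode]


class ReferenceView:
    """A view whose rows hold URI references rather than copies of the data."""

    def __init__(self, base: str = "http://example.org/view/node/") -> None:
        self._base = base
        self._ids = count(1)
        self._minted: Dict[str, str] = {}
        self.rows: List[Dict[str, str]] = []

    def _reference(self, node: Node) -> str:
        """Return the node's URI, minting a stable one for a blank node."""
        if isinstance(node, BlankNode):
            if node.label not in self._minted:
                self._minted[node.label] = f"{self._base}{next(self._ids)}"
            return self._minted[node.label]
        return node

    def add_row(self, bindings: Dict[str, Node]) -> None:
        """Store one answer as a mapping from variable name to URI reference."""
        self.rows.append(
            {var: self._reference(node) for var, node in bindings.items()}
        )


view = ReferenceView()
view.add_row({"gene": "http://example.org/gene/A", "site": BlankNode("b0")})
view.add_row({"gene": "http://example.org/gene/B", "site": BlankNode("b0")})
```

Because the same blank node always maps to the same minted URI, every member of the view can be used as a starting point for further navigation or as input to a subsequent query.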

In BAE, we use typing information in order to guide the user in designing legal queries and finding appropriate services, as described above. In order to keep doing this for views as well we need to create – on the fly – the schema of the view. Furthermore, we need to link this schema with the global ontology, in order to supply users with the correct suggestions as defined by the semantics of the view. This requirement is not unique to BAE and should be defined by the language, just as relational views have a schema.

3.2 Language features

Here is a partial list of features we would like to see added to RDF and OWL query languages:

o Group by and having clauses, which could be used for:

· Aggregation of results in collections; this is already possible in some RDF query languages using a CONSTRUCT clause.

· Application of aggregate operators (COUNT, AVG, MIN, MAX).

· Application of statistical operators. Such operators could be defined through the use of functions (see below); however, a standard subset should be considered (as in SQL).

o Collections and corresponding operators (union, difference, distinct, listtoset, flatten…).

o Ordering of results.

o Filtering predicates:

· Datatype predicates (e.g. <, <=, between, LIKE (including regular expressions), IN).

· Type predicates: stated vs. inferred types (equivalent to SQL:1999 IS OF ONLY).

· Property predicates – as in OWL restrictions on properties (e.g. restrictions on the number of appearances of a certain property).

o Quantification (e.g. ALL and EXISTS).

o Conditionals (e.g. CASE).

o Nested queries.

o Optional matching of some variables (equivalent to SQL outer joins). This is already available in some RDF QLs and OWL-QL.

o User-defined functions and methods, which can also include Web Services that produce RDF/OWL, directly or via wrappers.
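To make the first item in the list above concrete, the following Python sketch shows the semantics we have in mind for grouping with a COUNT aggregate and a HAVING-style filter over RDF triples. It operates on an in-memory list of triples rather than on any real triple store, and the function name, the ex: identifiers and the data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: which gene encodes which protein.
TRIPLES: List[Triple] = [
    ("ex:geneA", "ex:encodes", "ex:prot1"),
    ("ex:geneA", "ex:encodes", "ex:prot2"),
    ("ex:geneB", "ex:encodes", "ex:prot3"),
    ("ex:geneA", "rdf:type", "ex:Gene"),
]


def count_objects_by_subject(
    triples: List[Triple], predicate: str, min_count: int = 1
) -> Dict[str, int]:
    """Group triples by subject, count distinct objects, keep large groups.

    This mirrors GROUP BY subject with COUNT(DISTINCT object) and
    HAVING COUNT(...) >= min_count in SQL.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            groups[subject].add(obj)
    return {
        subject: len(objects)
        for subject, objects in groups.items()
        if len(objects) >= min_count
    }


multi_product_genes = count_objects_by_subject(TRIPLES, "ex:encodes", min_count=2)
```

A question such as "which genes encode at least two proteins" is routine for biologists, yet without grouping and aggregation it has to be answered by client-side code like the above instead of in the query itself.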

A common critique of the above wish list would be that including all of these features in the first revision of the query language will cause the language to be ignored, since no vendor could build a server that complies with the standard. However, we believe that since all these constructs are known to be useful for the other, less expressive data models, they would also be useful in the SW data models. To address this concern, we suggest that the W3C create a layered language, much as OWL is layered in the Lite, DL and Full versions, or SQL is layered (chronologically) in SQL-92, SQL:1999, and SQL:2003. This way, vendors could create servers that support the basic language as well as subsets of the advanced layers. The subsets may change from vendor to vendor, but the syntax would at least remain standard.

4 QoS issues arising from open-access SOA

BAE intends to empower end-users to easily use multiple open-access bioinformatics resources. However, such a capability can be seriously abused: the possibility given by BAE to transparently run services on a large number of previous results can lead to an unsustainable number of server requests. Even today, when only highly-skilled users who write scripts can make such demands, they manage to block or bring down many servers. To avoid this effect, many providers limit access to their services in one fashion or another. The problem can be seen even when using many web-based bioinformatics tools (e.g. BLAST, SRS), where results can take hours to arrive.

On the other hand, the whole idea of the Semantic Web in particular and of Service Oriented Architecture in general is to support such distributed operation, freeing users from the need to locally install software and data. There seems to be a conflict here that would be very problematic to resolve; future technologies, such as perhaps the grid, would be needed to alleviate it. However, in the meantime platforms such as our BAE should take this problem into consideration.

Our approach to dealing with this problem is to allow a user to specify his or her needs as QoS constraints such as "I want this answered within a day". We are considering the following features for enforcing these constraints. First, the system should be informed about service policies and statistics, so that it can generate alerts before trying to perform operations that are too costly, or even prevent them. Another helpful feature would be for the system to suggest – in relevant cases – automatically downloading and installing the needed data or application, provided the necessary infrastructure (software, permissions) exists on the local computer. Alternatively, users should be able to direct the system to use an alternative location for any particular service.
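The first of these features could work roughly as in the Python sketch below, which checks a planned batch of service calls against a user deadline and a provider limit before anything is sent. The ServiceStats fields, the check_plan function and all the numbers are hypothetical, and the estimate assumes that requests are issued one after another.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceStats:
    """Hypothetical policy and statistics published for one service."""

    mean_seconds_per_request: float
    max_requests_per_day: int


def check_plan(
    n_requests: int, stats: ServiceStats, deadline_seconds: float
) -> List[str]:
    """Return alerts for a planned batch of calls; an empty list means go."""
    alerts: List[str] = []
    # Assumes sequential requests, so the total time is a simple product.
    estimated = n_requests * stats.mean_seconds_per_request
    if estimated > deadline_seconds:
        alerts.append(
            f"estimated {estimated:.0f} s exceeds the deadline of "
            f"{deadline_seconds:.0f} s"
        )
    if n_requests > stats.max_requests_per_day:
        alerts.append(
            f"{n_requests} requests exceed the provider limit of "
            f"{stats.max_requests_per_day} per day"
        )
    return alerts


ONE_DAY = 24 * 60 * 60  # "I want this answered within a day", in seconds
stats = ServiceStats(mean_seconds_per_request=30.0, max_requests_per_day=1000)
small_plan_alerts = check_plan(100, stats, ONE_DAY)
large_plan_alerts = check_plan(5000, stats, ONE_DAY)
```

When alerts are returned, the system could block the enactment, ask the user to confirm, or offer one of the alternatives mentioned above, such as a local installation or a different service location.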

5 Summary

In this paper we have outlined our position on how end-users should be empowered to access the SW-LS, and identified some of the SW features that should be added to meet this challenge. Due to the current state of affairs, our proof-of-concept is being developed over the relational data model instead of the SW data model, which is much more suitable for it. We hope that advances in the development of the Semantic Web languages and related technologies will allow this to change in the near future.

6 References

1. C.A. Goble, R. Stevens, G. Ng, S. Bechhofer, N.W. Paton, P.G. Baker, M. Peim, and A. Brass. Transparent Access to Multiple Bioinformatics Information Sources. IBM Systems Journal Special issue on deep computing for the life sciences, 40(2): 532-552, 2001.

2. Peter Haase, Jeen Broekstra, Andreas Eberhart, Raphael Volz. A comparison of RDF query languages. To appear in the Proceedings of the Third International Semantic Web Conference, Hiroshima, Japan, November 2004.

3. Richard Fikes, Patrick Hayes, and Ian Horrocks. OWL-QL – a language for deductive query answering on the Semantic Web. Journal of Web Semantics, 2004.

4. Richard Fikes, Patrick Hayes, and Ian Horrocks. Designing a Query Language for the Semantic Web. http://www.ihmc.us/users/phayes/FikesHayesHorrocks.pdf.
