A Position Paper for the W3C Workshop on Semantic
Web for Life Sciences
Dolev Dotan and Ron Y. Pinter
{dolevd,pinter}@cs.technion.ac.il
Dept. of Computer Science
Technion – Israel Institute of Technology
September 15, 2004
One of the
major goals of modern bioinformatics is to enable nontrivial in silico
experiments. This requires information integration across various
bioinformatics data repositories and services. To achieve its full potential,
such integration should be accessible to users at all levels of expertise,
including biologists and bioinformaticians, not only to professional
programmers. Unfortunately, the current technologies comprising the Semantic
Web for Life Sciences (SW-LS) are accessible only to the latter.
To solve this
problem, we introduce the Bioinformatics Assay Environment (BAE), an
upcoming rich-client application which will serve as an end-user entry-point to
the SW-LS. This system will allow end-users to leverage the SW-LS in a way that
is transparent, integrative, traceable and repeatable. Using BAE,
users will be able to easily define and enact intricate bioinformatics
workflows, which include execution of services as well as posing complex
queries. The system will allow users to work either off-line, by composing
workflows and subsequently enacting them, or on-line, in an interactive
information browsing mode. The latter mode of operation mirrors the way that
many Life Science researchers search and use information. Using ontological
information, BAE will guide users through the applicable and available SW-LS
services at each exploration or design point, thus freeing them from the
impossible task of becoming familiar with every resource.
In this paper
we present the major features planned for the BAE, and how SW-LS technologies
can be used to implement them. We then discuss some of the issues and
challenges that we have encountered, which are due to deficiencies in features
and technologies related to the languages of the Semantic Web. In particular, we found that the
available languages for querying RDF and OWL are not expressive enough, and
often would not be sufficient to specify users' needs. Thus, we outline some of
the additional requirements for query languages over the RDF and OWL data
models. Finally, we discuss quality of service (QoS) related issues pertaining
to the Semantic Web and the application of Service Oriented Architecture to the
world of open-access bioinformatics resources. Note that – due to lack of space – we do not discuss many
other important features that are needed, such as standards for service
description and global ontologies.
BAE provides
users with a unified interface in which they can perform either off-line
workflow design or on-line interactive exploration. In this interface, the workflows are edited and presented
using OverFlow – a novel graphical language for workflow and query
representation, which is described below.
The resultant workflows can then serve as blueprints for future
enactments (on the same input or on different inputs) or can be reused as
sub-workflows in subsequent sessions (repeatability). In addition, since all intermediate
results are kept and can be browsed at a later point, the resultant workflows
can also be used for keeping an audit trail, just like a lab book in a
"wet" biological assay (traceability).
The OverFlow language seamlessly
combines the power of a functional visual data flow language with that of a
declarative visual query language.
It does so in a way that is intended to make it easier for users to
retrieve, process, filter, and manipulate data by designing both queries and
data flows in a single diagram and a unified paradigm.
In OverFlow,
queries are visually depicted by a mixture of declarative and functional
constructs, which are made to reflect the user's intuitive understanding of the
query. Generally speaking, the FROM and WHERE parts of a query are drawn in a
declarative form, while the other parts of the query – SELECT, GROUP-BY,
functions, etc. – are written in a data-flow-like form. This division mirrors
the notion that the first part resembles the selection of an interesting part
of the schema, which in the user's mind is described as the (declarative)
sentence: 'get just those objects of type X which are Y'. This is represented visually by
selecting the relevant portion of the schema graph, and then adding filter
predicate nodes. The
representation of the rest of the query follows the (functional) notion that
some or all of the attributes of the data items selected in the first part are
chosen by the user and are then 'made to flow' to the query's output port,
either directly or via intermediate stations (modules) in which they are
transformed – grouped, counted, summarized, sorted, etc. OverFlow supports many advanced SQL and
OQL features, such as nested queries, collection classes, server-side functions
and methods, advanced joins, recursion, and CASE statements.
As a dataflow
language, OverFlow introduces several key innovations. As in any other
dataflow language, OverFlow diagrams contain boxes representing computational
modules, which can represent external services and nested workflows; in
OverFlow, modules can also represent queries, which may be shown as
'white-boxes' – i.e. on the same diagram.
These queries can be easily connected to previous results using input
ports, and thus facilitate both follow-up queries (which bring new information)
as well as the filtering of previously retrieved results. In addition, OverFlow allows the reuse
in dataflow of many of the modules used in query definition, such as functions
for grouping, ordering and calculating aggregate properties.
Another
innovation introduced in OverFlow is the inclusion of data nodes, which
represent the data that flows through the diagram. These nodes can represent – using different visual symbols –
objects, collections, data-type attributes, and tuples. Users can select which of the fields of
an object or a tuple to send to a module, simply by connecting the relevant
fields to the module's input ports.
Connecting items in collections to modules which take individual items
implicitly results in iteration over the collection. In addition, some advanced data manipulation constructs
exist, such as one that allows users to correlate, using tuples, between inputs
and outputs of a module. Finally,
OverFlow also supports several control-flow constructs, such as conditionals (if), switches (case), loops,
and module execution coordination (which can also be used to treat exceptions).
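The implicit iteration rule can be illustrated with a minimal Python sketch. The names `connect` and `sequence_length` are hypothetical and are not part of OverFlow; the sketch only assumes that a module accepts one item at a time and that the data node connected to it may hold either a single item or a collection.

```python
from typing import Any, Callable, List, Union


def connect(module: Callable[[Any], Any], data: Any) -> Union[Any, List[Any]]:
    """Feed `data` to a module that accepts one item at a time.

    If `data` is a collection, the module is applied to each member and the
    results are gathered into a list, mimicking implicit iteration.
    A single item is simply passed to the module once.
    """
    if isinstance(data, (list, tuple, set, frozenset)):
        return [module(item) for item in data]
    return module(data)


def sequence_length(sequence: str) -> int:
    """Stand-in for an external service that takes a single sequence."""
    return len(sequence)


single = connect(sequence_length, "ACGT")
many = connect(sequence_length, ["ACGT", "AC", "ACGTTT"])
```

Here the same module is reused unchanged for one sequence and for a collection of sequences; the user never draws an explicit loop.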
Through the
use of a central ontology, as well as information about the different
resources, the system can guide the user in constructing meaningful and legal
queries and workflows. A similar
approach was taken in projects such as TAMBIS [1]. In BAE, query construction will often start by browsing the
ontology in the Ontology Browser view, which presents the ontology in a graph
of concepts, where the user can navigate by expanding – from any shown concept
– the concepts that are related through hierarchy or property relations. This is done simply by selecting from a
context-sensitive menu that contains the applicable relations.
Once a concept
is selected, it can be dragged to the OverFlow editor. There, the user formulates the query by
adding restrictions on the concept's attributes and properties. In order to do so, the restricted
attributes need to be shown and this is done – again – through the use of
contextualized menus. Thus, only
correct queries can be specified.
The system
also helps the users in either composing the workflow or interactively browsing
the information. By clicking on a
result set and bringing up the context menu, the user can get suggestions for
services that are applicable to the results. The system knows what to suggest by matching
the schema of the result set (which can be either a collection of
objects or a view) against the available resource descriptions. By using OWL as the language both for
the schema and for the resource description, sophisticated matchmaking can be
carried out by subsumption checking.
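As a rough illustration of matchmaking by subsumption, the following Python sketch suggests every service whose required input class subsumes the class of the result set. It is a simplification under stated assumptions: the class hierarchy is a plain tree rather than an OWL ontology processed by a reasoner, and all class and service names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical toy ontology: each class maps to its direct superclass.
SUPERCLASS: Dict[str, Optional[str]] = {
    "Sequence": None,
    "ProteinSequence": "Sequence",
    "NucleotideSequence": "Sequence",
    "Structure": None,
}

# Hypothetical service descriptions: service name -> required input class.
SERVICES: Dict[str, str] = {
    "AnySequenceStats": "Sequence",
    "ProteinAlignment": "ProteinSequence",
    "StructureViewer": "Structure",
}


def is_subsumed_by(cls: str, ancestor: str) -> bool:
    """Return True if `cls` equals `ancestor` or is one of its descendants."""
    current: Optional[str] = cls
    while current is not None:
        if current == ancestor:
            return True
        current = SUPERCLASS.get(current)
    return False


def suggest_services(result_class: str) -> List[str]:
    """List the services whose input class subsumes the result class."""
    return sorted(
        name
        for name, required in SERVICES.items()
        if is_subsumed_by(result_class, required)
    )
```

For a result set of `ProteinSequence` objects, both the protein-specific service and the generic sequence service are suggested, while a `NucleotideSequence` result only matches the generic one.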
Currently, the
number of query languages available for the RDF data model is small (see review
and comparison in [2]) and there is only one query language for OWL, OWL-QL
[3]. Unfortunately, these
languages fall short of providing the expressive power needed for asking many
of the questions that biologists would like to ask. Here we list some of the language features that are
missing. Admittedly, these are not
new features for query languages, as they exist in languages like SQL and OQL;
however, our intent in this section is to highlight the need to include these
features in query languages over the Semantic Web data models. We first discuss some basic language
requirements that are important in the context of the BAE, and then list the
actual language features needed.
The general,
theoretical requirements for a comprehensive, expressive query language, such
as closure, orthogonality, and adequacy, as well as their fulfillment (or lack
thereof) in the leading RDF query
languages of today, are discussed in [2].
Here we focus on two additional features that are needed: usage of
schemas and (more importantly) creation of views.
Unlike query
languages for structured or object-oriented data, which require the presence of
a schema for querying the database, most of today's RDF query languages do not
make any use of such information; instead, they allow retrieval of resources
based on their properties (including the rdf:type
property). While this method is
useful for many queries on general RDF graphs, we believe there is a need to
add intrinsic language primitives for specifying the RDFS or OWL classes of
query variables. This will
eliminate the need for repeated rdf:type
statements in the query (in the common case where all information is typed),
and will thus increase the readability of the language.
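One way to picture such a primitive is as syntactic sugar that a query processor expands into ordinary rdf:type patterns. The Python sketch below assumes a hypothetical representation of a query as a list of triple patterns plus a separate mapping from variables to classes; the `bio:` names are invented for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def expand_typed_variables(
    var_types: Dict[str, str], patterns: List[Triple]
) -> List[Triple]:
    """Prepend one rdf:type pattern per typed variable to the query body."""
    type_patterns = [(var, "rdf:type", cls) for var, cls in var_types.items()]
    return type_patterns + list(patterns)


# The user declares the variable types once and writes a single pattern.
query = expand_typed_variables(
    {"?p": "bio:Protein", "?g": "bio:Gene"},
    [("?g", "bio:encodes", "?p")],
)
```

The user-visible query contains one pattern; the two rdf:type patterns are generated from the declarations rather than written by hand.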
Today, RDF
query languages use a schema as they would use any RDF document: importing it
and its namespace in a USING clause.
Separating this process from the query (as in SQL connections) would
result in simpler queries, as it would free the user from the need to specify
namespace information not only for schemas but also for plain RDF documents.
Today, none of
the query languages for RDF or OWL allow users to create the equivalent of a
relational view. This feature is
of major importance to BAE (and any system of its kind) since we want to store
all intermediate results and then use them in subsequent queries. This is closely related to the closure
property, which requires that the results of an operation are, in turn,
elements of the same data model.
Only a language that adheres to this rule, namely that the output of its
queries is a set of RDF triples (for RDF) or OWL instances (for OWL), could then store
these results back in the triple store or ABox, respectively.
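The closure property can be sketched in a few lines of Python. The triple store below is just an in-memory set of tuples, and the `ex:` names, the `match` helper and the `ex:HumanProteinView` class are all hypothetical; the point is only that a query answer expressed as triples can be stored back and queried again.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]


def match(
    store: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Set[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {
        t
        for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


store: Set[Triple] = {
    ("ex:p53", "rdf:type", "ex:Protein"),
    ("ex:p53", "ex:organism", "ex:Human"),
    ("ex:lacZ", "rdf:type", "ex:Protein"),
    ("ex:lacZ", "ex:organism", "ex:Ecoli"),
}

# First query: which resources come from human? The answer is re-expressed
# as triples, so it stays inside the same data model (closure).
human = {s for (s, _, _) in match(store, p="ex:organism", o="ex:Human")}
view = {(s, "rdf:type", "ex:HumanProteinView") for s in human}

# Because the view is made of triples, it can be stored back and reused.
store |= view
followup = match(store, p="rdf:type", o="ex:HumanProteinView")
```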
Views can be
materialized or virtual (aka computed), just as in the relational model. However, there can be several flavors
of materialized views in RDF and OWL.
In the simplest flavor, the results of the query are stored explicitly
with no links to the original resources, possibly even under a different
namespace. While this flavor most
accurately corresponds to the relational model, it is somewhat inconsistent
with using the inherent "global graph" feature of RDF, which is one
of the flagship features of the Semantic Web. For example, in BAE we would like to allow users to select
an instance in the view that is the result of some query and then traverse the
graph of locally-available information about it; we might even want to fetch
new information to add to that graph.
Using this simple type of view, the user will be able to traverse the
graph only from those properties which contain URIs.
To solve this
problem and make better use of the "global graph"
approach, a second type of view is needed. In this type, view members (which can be subgraphs and are
the RDF/OWL equivalent of SQL tuples) do not store copies of the original
information; instead, they store URI references to it. This allows navigation from all
properties of the view. In this
solution we encounter a problem with the SW models: not every node in the graph
can be referenced since not every node has a URI. This problem was already discovered when designing DQL, the
DAML+OIL query language, whose port to OWL is OWL-QL. During the design of this language it was decided
"after considerable discussion" that "DQL essentially does
nothing, and we assume that servers will be able to create new URI references
to designate such entities, which can then be provided as answer bindings and
used in subsequent queries" [4].
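The idea of a server creating new URI references for nodes that lack them can be sketched as follows. This is an illustration only and is not how DQL or OWL-QL servers are specified to behave: blank nodes are assumed to be written with a `_:` prefix, and the `urn:example:view:` base is a hypothetical choice.

```python
import itertools
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def mint_uris(triples: List[Triple], base: str = "urn:example:view:") -> List[Triple]:
    """Replace each blank node with a freshly minted URI.

    The same blank node always maps to the same URI, so the structure of
    the graph is preserved while every node becomes referenceable.
    """
    counter = itertools.count(1)
    minted: Dict[str, str] = {}

    def name(node: str) -> str:
        if node.startswith("_:"):
            if node not in minted:
                minted[node] = f"{base}{next(counter)}"
            return minted[node]
        return node

    return [(name(s), p, name(o)) for s, p, o in triples]


answer = mint_uris(
    [
        ("ex:p53", "ex:hasFeature", "_:b0"),
        ("_:b0", "ex:start", "12"),
    ]
)
```

After minting, a view member can store a reference to the feature node, and a subsequent query can start from that reference.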
In BAE, we use
typing information in order to guide the user in designing legal queries and
finding appropriate services, as described above. In order to keep doing this for views as well we need to
create – on the fly – the schema of the view. Furthermore, we need to link this schema with the global
ontology, in order to supply users with the correct suggestions as defined by
the semantics of the view. This
requirement is not unique to BAE and should be defined by the language, just as
relational views have a schema.
Here is a
partial list of features we would like to see added to RDF and OWL query
languages:
o Group by and having clauses which could be used for:
o Collections and corresponding operators (union, difference, distinct, listtoset, flatten…).
o Ordering of results.
o Filtering predicates:
   · Type predicates: stated vs. inferred types (equivalent to SQL:1999 IS OF ONLY)
   · Property predicates – as in OWL restrictions on properties (e.g. restrictions on the number of appearances of a certain property).
o Quantification (e.g. ALL and EXISTS).
o Conditionals (e.g. CASE).
o Nested queries.
o Optional matching of some variables (equivalent to SQL outer joins). This is already available in some RDF QLs and OWL-QL.
o User-defined functions and methods, which can also include Web Services that produce RDF/OWL, directly or via wrappers.
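To make the first item concrete, the sketch below emulates a group-by with a having condition over a list of variable bindings, which is roughly what a client must do by hand today when the query language lacks these clauses. The binding format and the `ex:` names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical query answer: one dictionary of variable bindings per row.
bindings: List[Dict[str, str]] = [
    {"gene": "ex:g1", "protein": "ex:p1"},
    {"gene": "ex:g1", "protein": "ex:p2"},
    {"gene": "ex:g2", "protein": "ex:p3"},
]


def group_by_having(
    rows: List[Dict[str, str]], key: str, min_count: int
) -> Dict[str, int]:
    """Count rows per value of `key`, keeping groups with at least `min_count` rows."""
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: len(v) for k, v in groups.items() if len(v) >= min_count}


# Genes that encode at least two proteins.
multi_product_genes = group_by_having(bindings, "gene", 2)
```

With native grouping in the query language, this post-processing would be performed by the server and the result could itself be treated as a view.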
A common
critique of the above wish list of features would be that including all of them
in the first revision of the query language will cause the language to be
ignored, since no vendor could generate a server that is compliant with the
standard. However, we believe that
since all these constructs are known to be useful for the other, less
expressive data models, they would also be useful in the SW data models. To address this critique, we suggest that the W3C
create a layered language, much as OWL is layered in the Lite, DL
and Full versions, or SQL is layered (chronologically) in SQL92,
SQL:1999, and SQL:2003. This way,
vendors could create servers that support the basic language as well as subsets
of the advanced languages. The
subsets may change from vendor to vendor, but the syntax would at least remain
standard.
BAE intends to
empower end-users to easily use multiple open-access bioinformatics
resources. However, such a
capability can be seriously abused: the possibility given by BAE to
transparently run services on a large number of previous results can lead to an
unsustainable number of server requests.
Even today, when only highly-skilled users who write scripts can make
such demands, they manage to block or bring down many servers. To avoid this effect, many providers
limit access to their services in one fashion or another. The problem can be seen even when using
many web-based bioinformatics tools (e.g. BLAST, SRS), where results can
take hours to arrive.
On the other
hand, the whole idea of the Semantic Web in particular and of Service Oriented
Architecture in general is to support such distributed operation, freeing users
from the need to locally install software and data. There seems to be a conflict here that would be very
problematic to resolve; future technologies, such as perhaps the grid, would be
needed to alleviate it. However,
in the meantime platforms such as our BAE should take this problem into
consideration.
Our approach
to dealing with this problem is to allow a user to specify his or her needs as
QoS constraints such as "I want this answered within a day". We consider the following features to
enforce these constraints. First,
the system should be informed about service policies and statistics, so that it can generate
alerts before performing operations that are too costly (or even prevent
them). Another helpful feature would
be for the system to suggest – in relevant cases – to automatically download
and install the needed data or application, provided the necessary
infrastructure (software, permissions) exists on the local computer. Alternatively, users should be able to direct
the system to use an alternative location for any particular service.
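The first of these features can be sketched as a simple pre-enactment check. The policy fields, the three outcomes and the numbers are hypothetical; a real system would obtain policies and statistics from the service descriptions, for which no standard is assumed here.

```python
from dataclasses import dataclass


@dataclass
class ServicePolicy:
    seconds_per_request: float  # average response time (hypothetical statistic)
    max_requests_per_day: int   # provider-imposed limit (hypothetical policy)


def check_plan(policy: ServicePolicy, n_requests: int, deadline_seconds: float) -> str:
    """Compare a planned batch of requests with the policy and the user's deadline.

    Returns "prevent" if the batch exceeds the provider's limit, "alert" if
    the estimated time exceeds the user's QoS constraint, and "ok" otherwise.
    """
    if n_requests > policy.max_requests_per_day:
        return "prevent"
    if n_requests * policy.seconds_per_request > deadline_seconds:
        return "alert"
    return "ok"


policy = ServicePolicy(seconds_per_request=30.0, max_requests_per_day=1000)
one_day = 24 * 60 * 60

small_batch = check_plan(policy, 100, one_day)   # 3000 s, within a day
tight_deadline = check_plan(policy, 900, 3600)   # 27000 s, over one hour
too_many = check_plan(policy, 5000, one_day)     # exceeds the daily limit
```

A constraint such as "I want this answered within a day" thus becomes a deadline in seconds that is checked before any request is sent.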
In this paper we have outlined our position on how end-users should be empowered to access the SW-LS, and have identified some of the SW features that should be added to meet this challenge. Due to the current state of affairs, our proof-of-concept is being developed over the relational data model rather than the SW data model, which would be much more suitable for it. We hope that advances in the development of the Semantic Web languages and related technologies will allow this to change in the near future.
1. C.A. Goble, R. Stevens, G. Ng, S. Bechhofer, N.W.
Paton, P.G. Baker, M. Peim, and A. Brass. Transparent Access to Multiple
Bioinformatics Information Sources. IBM Systems Journal Special issue on
deep computing for the life sciences, 40(2): 532-552, 2001.
2. Peter Haase, Jeen Broekstra, Andreas Eberhart, Raphael
Volz. A comparison of RDF query languages. To appear in the Proceedings of the Third International
Semantic Web Conference, Hiroshima, Japan, November 2004.
3. Richard Fikes, Patrick Hayes, and Ian Horrocks. OWL-QL
– a language for deductive query answering on the Semantic Web. Journal of Web Semantics, 2004.
4. Richard Fikes, Patrick Hayes, and Ian Horrocks. Designing a Query Language for the
Semantic Web. http://www.ihmc.us/users/phayes/FikesHayesHorrocks.pdf.