GeneQuiz is an integrated system for large-scale biological sequence
analysis that goes from a protein sequence to a biochemical function,
using a variety of search and analysis methods and up-to-date
protein and DNA databases. Applying an "expert system"
module to the results of the different methods, GeneQuiz creates
a compact summary of findings. It focuses on deriving a predicted
protein function, based on the available evidence, including the
evaluation of the similarity to the closest homologue in the database
(identical, clear, tentative, or marginal). The analysis yields
everything that can possibly be extracted from the current databases,
including three-dimensional models by homology, when the structure
can be reliably calculated.
GeneQuiz consists of four modules: (1) the database update, GQupdate;
(2) the search system, GQsearch; (3) the interpretation module,
GQreason; and (4) the visualization and browsing system, GQbrowse.
The modules are driven by Perl programs and RDB, a simple relational
database engine also written in Perl. The front-end programs for
visualization are WWW browsers (such as Netscape or Mosaic).
The principal design requirement is the complete automation of
all repetitive actions: database updates, efficient sequence similarity
searches, sampling of results in a uniform fashion and the automated
evaluation and interpretation of the results using expert knowledge
coded in rules.
As genome sequence data is being produced at an accelerating pace,
there is a need for faster and more reliable methods of large-scale
sequence analysis. There is a tremendous choice of algorithms,
a large number of sequence and bibliographic databases, and various
individual methods that can be useful in predicting protein
function. From this large collection of tools, an optimal constellation
must be chosen that satisfies the requirements for accurate and
sensitive function prediction by homology. Speed is also an important
factor for the analysis, but sensitivity should not be sacrificed.
The technical challenges are two-fold: first, how to identify
sequence similarities in molecular databases efficiently without
losing in sensitivity and second, how to integrate existing software
and databases and document the findings of experts in a multi-user
interactive environment.
Large-scale sequence analysis differs from traditional practices
in two basic respects: first, computational efficiency using fast
algorithms and certain heuristics is essential; second, knowledge
support for expert users is becoming crucial, as the emerging
gene and protein families from genome projects extend beyond the
areas of expertise of a single individual. Therefore, a system
is required that performs the necessary analytical steps for a
large number of sequences and also provides access to molecular
and bibliographic databases.
With a few hundred -- or even a few thousand -- protein sequences,
represented by coding DNA sequences called open reading frames (ORFs),
to be analyzed as efficiently as possible, large-scale sequence analysis
imposes two fundamental requirements: first, rapid database
searches and a preliminary abstraction of the output; second,
further annotation and evaluation of the results.
The questions in computational genome analysis differ significantly
from those of a traditional sequence analysis project. Here, the most
compelling question is the identification of homologies in the search
for a function.
However, the issue of function prediction for proteins is partly
a problem of definition. We can define function prediction as
any evidence towards the identification of various protein sequence
characteristics indicative of substrate recognition and catalysis,
interactions, localization and evolutionary relationships. Therefore,
the characterization of a protein sequence (or an ORF) usually
takes place at various levels of accuracy, for example, from a
simple calculation of coiled-coil forming potential to the derivation
of a three-dimensional model (using WHATIF),
on the basis of homology to a well-characterized protein.
Parallel to the explosion of data production from genome projects,
various databases have been created to accommodate the needs of
specialized scientific communities. The generation of these databases
is done locally, and computer networks with appropriate information
retrieval systems may provide access. The exponential growth of
these databases mandates frequent local updates, sometimes even
during the analysis process. It has been repeatedly shown that
the most recent database releases often contain newly deposited
sequences that clarify evolutionary relationships and facilitate
function prediction by homology. These changes complicate the
analysis, since incremental searches must be performed often.
The GQupdate module is responsible for the following automated
tasks: (1) reliable data transfer over the network, (2) reformatting
of the databases where necessary, (3) update of local specialized
databases which are dependent on one of the primary databases
and (4) indexing of the databases for fast retrieval, using
the SRS system.
GQupdate accesses a variety of databases, including:
After transfer of a newly updated database, the GQupdate module
automatically performs all the necessary reformatting procedures
to make the databases searchable by the various search programs,
such as BLAST.
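To illustrate the reformatting step, the sketch below splits a FASTA-style flat file into (header, sequence) records that downstream search tools can consume. This is a minimal stand-in for GQupdate's Perl reformatters, assuming a plain FASTA input; it is not the system's actual code.

```python
def fasta_records(text):
    """Split a FASTA-style flat file into (header, sequence) records.

    A simplified sketch of the kind of reformatting GQupdate
    automates after each database transfer.
    """
    records = []
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new record begins; flush the previous one, if any.
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records
```

A record's sequence lines are concatenated, so wrapped and unwrapped database files yield the same result.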
To minimize the computing time for database searches we use the
NRDB program from
NCBI to produce a non-redundant
database of protein as well as DNA sequences. In addition, derived
databases such as DSSP,
FSSP and
HSSP,
are updated accordingly.
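The idea behind the non-redundancy step can be sketched as follows: keep one copy of each identical sequence and merge the identifiers of its duplicates into one field. This is a simplified illustration of the principle, not the actual NRDB program from NCBI.

```python
def make_nonredundant(records):
    """Collapse entries with identical sequences into one record.

    records: list of (identifier, sequence) pairs.
    Returns a list of (merged_identifiers, sequence) pairs in
    first-seen order, with duplicate identifiers joined by '|'.
    """
    by_seq = {}
    order = []
    for ident, seq in records:
        if seq not in by_seq:
            by_seq[seq] = [ident]
            order.append(seq)
        else:
            # Same sequence already stored: merge the identifier.
            by_seq[seq].append(ident)
    return [("|".join(by_seq[s]), s) for s in order]
```

Searching the collapsed database avoids re-scoring the same sequence several times, which is where the computing time is saved.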
Finally, the module uses SRS
to index and cross-reference the databases to provide fast retrieval
of sequence database entries for two purposes: first, browsing
of particular entries and instant access to database records,
allowing the user to further examine the literature and the available
documentation online; and second, extraction of sequence data
for further analyses (e.g., multiple sequence alignment and iterative
profile searches).
The GQupdate
module has reached a mature phase and has been made available
to EMBL and EBI
sequence analysis support groups.
To accelerate a first scan of all databases in the most efficient
way, a hierarchical model for database searches was implemented.
For each query sequence, a new directory is created, and all search
and analysis program output files are stored there. First, searching
with the fastest available tool, the BLAST
suite of programs, allows the verification and exclusion of those
cases where a clear homology and a possible function can be readily
documented. Next, searches with FASTA,
a slower but still widely used, and at times more sensitive, program,
are performed. The search is by default performed against the
non-redundant database, which also includes the proteins translated
from genes in the DNA databases.
The GQsearch control program can also distribute jobs across
a cluster of workstations or a parallel computer to speed up
the search process.
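The hierarchical strategy, fast tools first and slower, more sensitive tools only when the fast ones find nothing conclusive, can be sketched like this. The tool names and the (hit, significant) return shape are illustrative assumptions, not GQsearch's real interface.

```python
def hierarchical_search(query, search_tools):
    """Run search tools in order of speed, stopping early.

    search_tools: list of (name, run) pairs, fastest first
    (e.g. a BLAST wrapper before a FASTA wrapper).  Each run(query)
    returns a list of (hit_id, significant) pairs.  As soon as one
    tool reports a significant homology, slower tools are skipped.
    """
    results = {}
    for name, run in search_tools:
        hits = run(query)
        results[name] = hits
        if any(significant for _hit, significant in hits):
            break  # clear homology documented; no need for slower tools
    return results
```

With this scheme the slow, sensitive programs are only paid for on the hard cases, which is exactly the point of the hierarchy.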
Additional characteristics of newly sequenced ORFs are of interest,
especially when function by homology cannot be predicted. For
example, coiled-coil regions, transmembrane segments, or previously
described sequence patterns can be of extreme importance for a
further understanding of protein function. Such analyses are always
performed, irrespective of whether a homology is found; especially
when no relatives are found, hints about function may come from
these features. The computing time for these analyses is negligible.
In addition to the standard analyses, we use filters for shorter
and more meaningful output lists, multiple alignments, cluster
analysis, secondary structure prediction and views of multidimensional
sequence space.
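The text does not name the transmembrane prediction program GeneQuiz uses; as a stand-in, the classic sliding-window hydropathy scan with the published Kyte-Doolittle scale illustrates this kind of cheap, always-run analysis.

```python
# Kyte-Doolittle hydropathy values (published scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydropathy_windows(seq, window=19, threshold=1.6):
    """Flag candidate transmembrane segments by average hydropathy.

    Slides a window of ~19 residues (a common choice for membrane
    helices) over the sequence and reports (start, average) for
    windows whose mean hydropathy reaches the threshold.
    """
    hits = []
    for i in range(len(seq) - window + 1):
        avg = sum(KD[a] for a in seq[i:i + window]) / window
        if avg >= threshold:
            hits.append((i, round(avg, 2)))
    return hits
```

Such a scan over a whole chromosome's worth of ORFs costs fractions of a second, which matches the observation that these analyses are computationally negligible.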
Various programs produce a wide range of output formats, usually
a compromise between machine and human readability. The lack of
a standard syntax has necessitated the implementation of a variety
of dedicated parsers for the database search output. In that
respect, the system is independent of the search software.
Parsers for all the database search programs and for most of the
analysis tools have been implemented.
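A dedicated parser's job is to turn one program's idiosyncratic output into uniform rows for the result tables. The sketch below parses a hypothetical tab-separated hit report; the column layout is an assumption for illustration, not the format of any particular search program.

```python
def parse_search_hits(text):
    """Parse a tab-separated hit report into uniform row dicts.

    Assumed columns (hypothetical): query, subject, percent
    identity, e-value.  Comment lines start with '#'.  The rows
    are ready to be loaded into a relational result table.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and comments
        query, subject, identity, evalue = line.split("\t")
        rows.append({"query": query,
                     "subject": subject,
                     "identity": float(identity),
                     "evalue": float(evalue)})
    return rows
```

Because every parser emits the same row shape, swapping one search program for another only requires writing a new parser, which is what makes the system independent of the search software.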
Organizing principles for the storage and the manipulation of
results are necessary elements in this effort, since sequence
database searches and other analyses provide us with a large amount
of essentially unprocessed data. We decided to use a well-developed
formalism in database design, the relational database model. The
result parsers produce entity tables that are directly readable
by a simple relational database (RDB) (developed by Walt Hobbs,
RAND Corporation, Santa Monica, CA, USA). RDB
is a simple yet powerful database system written in Perl; it is
highly portable and can work on intermediate data for later transfer
into a commercial RDBMS. The visualization component reads the relational
database tables and provides various views. Merging, sorting,
counting and other familiar operations can also be performed.
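One of those familiar relational operations, merging two result tables on a shared key, can be sketched on row dictionaries. This is a generic natural join for illustration, not RDB's Perl implementation.

```python
def join_tables(left, right, key):
    """Natural join of two tables (lists of row dicts) on `key`.

    Builds an index on the right table, then emits one merged row
    for every left/right pair sharing the same key value.
    """
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            merged = dict(row)   # copy so inputs stay untouched
            merged.update(match)
            joined.append(merged)
    return joined
```

Sorting, counting and filtering are one-liners on such row lists, which is why a lightweight relational model is sufficient for intermediate result data.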
GQsearch applies a variety of methods, including:
Each method (or database) is defined in a definition file by its
calling procedure, options, and input and output files.
In this way it is easy to extend the analysis by incorporating
additional methods or exchanging existing ones.
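A definition file of that kind might look like INI-style sections, one per method, with key/value lines for the command and I/O files. The syntax below is a hypothetical illustration; the actual GQsearch format is not specified in the text.

```python
def load_method_definitions(text):
    """Parse an INI-style method definition file (hypothetical syntax).

    Each [section] names a method; 'key = value' lines describe
    its command, options and input/output files.  Returns a dict
    of dicts keyed by method name.
    """
    methods, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            methods[current] = {}
        else:
            key, _, value = line.partition("=")
            methods[current][key.strip()] = value.strip()
    return methods
```

Adding a new analysis method then means adding a section to the file rather than touching the control program, which is the extensibility the text describes.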
Other methods for the characterization of proteins and their classification
into major functional classes as well as their species distribution
have been developed. In addition, genome comparisons using the
derived data can be performed. A number of relevant publications
on this issue can be found in the reference list at the end of
this document.
The module GQreason, for automated reasoning and function assignment,
is the third, and in some ways most crucial, component. Instead
of relying on experts for the interpretation of the searches and
analyses performed, this suite of programs controls the evaluation
of findings with very high reliability and reproducibility.
The use of the database search and analysis programs can in
principle also be regarded as the application of rules to the
sequence and of queries to sequence databases. To obtain the required
performance speed, these conceptual "rules" are coded
algorithmically. Many of these methods do in fact contain certain
heuristics or are based on statistics on known proteins (transmembrane
and secondary structure prediction). Rules, however, are best
suited to the collection, management, and analysis of
the result database. Due to the non-algorithmic nature of the
result evaluation, rules are an ideal way of processing and analyzing
the given information. The rules have been derived from long experience
with sequence analysis, and are usually remarkably simple. Their
performance has been tested and iteratively optimized on a large
number of yeast chromosome sequences (about 500) and the complete
genome of Mycoplasma genitalium (another 500 approx.).
From the result files that were converted to RDB tables by
GQsearch during the first step, the results are extracted, pre-processed
and merged into a "feature" table. This table holds
information for a user-specified number of homologies, gene name,
database identifiers (which, when redundant, are merged into one
field), diagnostic features such as the predictions for secondary
structure, coiled-coils, etc., and finally a reliability score
for each item. The table thus provides a compact summary of results
for a particular query sequence.
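The construction of one feature-table row, keeping the top-scoring homologies up to a user-specified limit and merging redundant database identifiers into one field, can be sketched as below. The field names are illustrative, not GeneQuiz's actual schema.

```python
def feature_row(gene, homologues, max_hits=3):
    """Build one compact feature-table row for a query sequence.

    homologues: list of dicts with 'ids' (list of redundant
    database identifiers for the same hit) and 'score'.  Keeps the
    max_hits best-scoring homologies and merges each hit's
    identifiers into a single '|'-joined field.
    """
    best = sorted(homologues, key=lambda h: h["score"], reverse=True)[:max_hits]
    merged = [{"ids": "|".join(sorted(set(h["ids"]))),
               "score": h["score"]}
              for h in best]
    return {"gene": gene, "homologies": merged}
```

Diagnostic features (secondary structure, coiled-coils, and so on) would be added as further fields of the same row, each carrying its own reliability score.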
In the next step, the function assignment is made, on the basis
of the documentation of the database homologues. This is the most
crucial step, and is explained in some detail below.
For each method or topic, we have an independent component containing
the coded set of rules to treat this task. These modules first
check if the required information is available. If so, their rules
code the expert knowledge about the interpretation of the specific
results. For example, how to assess reliability based on scores
from different methods, which results are worth reporting, and
how to process the information to derive new facts (e.g. the total
number of proteins found with significant homology).
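One such rule is the mapping from similarity to the confidence classes used in the summary (identical, clear, tentative, marginal). The sketch below uses percent sequence identity with thresholds that are assumptions for illustration; the text does not give GeneQuiz's actual cut-offs, and the real rules also weigh scores from several methods.

```python
def classify_homology(percent_identity):
    """Map sequence identity to a summary confidence class.

    Thresholds are illustrative assumptions, not GeneQuiz's
    published cut-offs.
    """
    if percent_identity >= 98:
        return "identical"   # essentially the same protein
    if percent_identity >= 40:
        return "clear"       # function transfer is safe
    if percent_identity >= 25:
        return "tentative"   # twilight zone; needs corroboration
    return "marginal"        # similarity alone is not evidence
```

Rules of this simplicity are easy to test and to optimize iteratively, which matches how the rule set was tuned on the yeast and Mycoplasma genitalium sequences.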
Finally, in the last step, the processed feature tables are summarized
at a higher level into a comprehensive table of the results.
At this level, rules are very strict, and report only clear results.
In this way, the user can trust the derived facts and is relieved
from time-consuming interactive checking. Ambiguous assignments
are marked as such, and help the user to directly focus on those
difficult cases which are not automatically resolved at present.
The fourth module, GQbrowse, gives access to the result databases
and allows for interactive evaluation and browsing of related
sequences and other databases, such as bibliographic entries.
The current solution
is based on World Wide Web technology and dynamically provides
html documents that can be displayed with any available Web browser
(like Netscape or Mosaic).
With this technology, it is straightforward to make the results
- and the whole browsing capacity - available to any user connected
to the Internet.
The pages provided by the viewer are dynamically generated from
the result database. This way several user-specified views can
be generated from the same set of data. The translation script
is realised in the Perl
scripting language. All WWW addresses are in fact calls to these
'CGI' scripts, which generate the HTML documents and even insert
functional links to related information (such as database entries
of sequences) on the fly. For this purpose, we generate links
to the SRS
database retrieval system, that keeps many biologically relevant
databases indexed and supplies rapid access to this information.
Furthermore, SRS
links connected entries in different databases and makes it easy
to move to associated pieces of information. Besides this 'browsing'
functionality we provide some 'zooming in' functionality. The
result database keeps the sources of any information stored, and
GQbrowse automatically generates links to these sources. Thus,
more detailed information and inspection of the original sources
is just a mouse-click away.
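The on-the-fly generation of pages and source links can be sketched as two small functions: one that builds a hyperlink to a database entry on a retrieval server, and one that renders a result table as an HTML document. The server URL and link syntax are hypothetical placeholders, not a real SRS address.

```python
def entry_link(db, accession):
    """Build an on-the-fly hyperlink to a database entry.

    The base URL and query syntax are hypothetical stand-ins for
    a retrieval server such as SRS.
    """
    base = "http://srs.example.org/retrieve"  # placeholder server
    return ('<a href="%s?db=%s&id=%s">%s:%s</a>'
            % (base, db, accession, db, accession))

def html_page(title, rows):
    """Render a result table as a complete HTML document.

    rows: list of lists of cell strings (which may themselves be
    links produced by entry_link), as a CGI script would emit.
    """
    body = "".join(
        "<tr>" + "".join("<td>%s</td>" % cell for cell in row) + "</tr>"
        for row in rows)
    return ("<html><head><title>%s</title></head><body>"
            "<table>%s</table></body></html>" % (title, body))
```

Because the pages are generated from the result database at request time, any update to the tables is immediately visible to every browser, with no static files to regenerate.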
Previous versions of the system have been used to analyze the
complete sequence of chromosomes XI, VIII and II and regions of
chromosomes IX and XV from baker's yeast Saccharomyces cerevisiae.
The complete genome will be analyzed as the sequence becomes available.
In addition, we have analyzed large collections of sequences from
the genome of Mycoplasma capricolum, a collection of ORFs from
Archaea, and various sets of sequences in collaboration with
experimental biologists.
Finally, the two complete genomes of Haemophilus influenzae
and Mycoplasma genitalium have been analyzed. The accuracy
of functional assignments was measured by comparing manually annotated
datasets to the results produced by GeneQuiz. Based on a detailed
analysis of approx. 1,000 query sequences, we estimate the overall
accuracy of the system to be 95% or better. Compared to similar
analyses reported in the literature, the accuracy is about ten
percentage points higher. This significant difference comes from
updated databases, a better definition of composition bias detection,
family analysis and intelligent parsing of functional annotations
for database homologues.