GeneQuiz

A system for large scale sequence analysis

The people behind GeneQuiz


Abstract

GeneQuiz is an integrated system for large-scale biological sequence analysis that takes a protein sequence to a predicted biochemical function, using a variety of search and analysis methods and up-to-date protein and DNA databases. By applying an "expert system" module to the results of the different methods, GeneQuiz creates a compact summary of findings. It focuses on deriving a predicted protein function based on the available evidence, including an evaluation of the similarity to the closest homologue in the database (identical, clear, tentative, or marginal). The analysis extracts everything that can currently be derived from the databases, including three-dimensional models by homology, when the structure can be reliably calculated.

GeneQuiz consists of four modules: (1) the database update module, GQupdate; (2) the search system, GQsearch; (3) the interpretation module, GQreason; and (4) the visualization and browsing system, GQbrowse. The modules are driven by perl programs and RDB, a simple relational database engine also written in perl. The front-end for visualization is a WWW browser (such as Netscape or Mosaic).

Figure 1: schematic flow chart of the GeneQuiz system.

The principal design requirement is the complete automation of all repetitive actions: database updates, efficient sequence similarity searches, sampling of results in a uniform fashion and the automated evaluation and interpretation of the results using expert knowledge coded in rules.
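The fixed order of these automated steps can be sketched as a trivial driver loop; the module names follow the text, but the run helper and its return values are purely illustrative, not the actual GeneQuiz driver:

```python
def run(stage, query_dir):
    """Run one pipeline stage on a query directory (placeholder).

    In the real system each stage is a suite of perl programs; here
    we only record that the stage completed.
    """
    return f"{stage}:done"

def genequiz_pipeline(query_dir):
    """Apply the four modules strictly in order: fresh databases,
    then searches, then rule-based interpretation, then browsing."""
    return [run(stage, query_dir)
            for stage in ("GQupdate", "GQsearch", "GQreason", "GQbrowse")]
```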



Introduction

Large-scale sequence analysis

As genome sequence data are produced at an accelerating pace, there is a need for faster and more reliable methods of large-scale sequence analysis. There is a tremendous choice of algorithms, a large number of sequence and bibliographic databases, and various individual methods that can be useful in the prediction of protein function. From this large collection of tools, an optimal combination must be chosen that satisfies the requirements of accurate and sensitive function prediction by homology. Speed is also an important factor in the analysis, but sensitivity should not be sacrificed for it.

The technical challenges are two-fold: first, how to identify sequence similarities in molecular databases efficiently without losing sensitivity; and second, how to integrate existing software and databases and document the findings of experts in a multi-user interactive environment.

Need for automatic tools

Large-scale sequence analysis differs from traditional practices in two basic respects: first, computational efficiency using fast algorithms and certain heuristics is essential; second, knowledge support for expert users is becoming crucial, as the gene and protein families emerging from genome projects extend beyond the areas of expertise of any single individual. Therefore, a system is required that performs the necessary analytical steps for a large number of sequences and provides access to molecular and bibliographic databases.

Prime tasks: speed and reliability

With a few hundred -- or even a few thousand -- protein sequences, represented by coding DNA regions called open reading frames (ORFs), to be analyzed as efficiently as possible, large-scale sequence analysis imposes two fundamental requirements: first, rapid database searches with a preliminary abstraction of the output; and second, further annotation and evaluation of the results.

Prediction of protein function by sequence homology

The questions in computational genome analysis differ significantly from those of a traditional sequence analysis project. Here, the most compelling question is the identification of homologies in search of a function. However, the issue of function prediction for proteins is partly a problem of definition. We can define function prediction as any evidence towards the identification of protein sequence characteristics indicative of substrate recognition and catalysis, interactions, localization and evolutionary relationships. The characterization of a protein sequence (or an ORF) therefore takes place at various levels of accuracy, ranging, for example, from a simple calculation of coiled-coil forming potential to the derivation of a three-dimensional model (using WHATIF) on the basis of homology to a well-characterized protein.



GQupdate

Automated database updates and indexing

Parallel to the explosion of data production from genome projects, various databases have been created to accommodate the needs of specialized scientific communities. These databases are generated locally, and computer networks with appropriate information retrieval systems may provide access to them. Their exponential growth mandates frequent local updates, sometimes even during the analysis process. It has been shown repeatedly that the most recent database releases often contain newly deposited sequences that clarify evolutionary relationships and facilitate function prediction by homology. These changes complicate the analysis, since incremental searches must be performed often.

The GQupdate module is responsible for the following automated tasks: (1) reliable data transfer over the network, (2) reformatting of the databases where necessary, (3) update of local specialized databases which are dependent on one of the primary databases and (4) indexing of the databases for fast retrieval, using the SRS system.
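As a sketch, the four tasks amount to one fixed sequence applied per database; the function and step names below are hypothetical, not GQupdate's real interface:

```python
def update_database(name):
    """Return the ordered update steps for one primary database."""
    return [
        f"fetch {name}",     # (1) reliable transfer over the network
        f"reformat {name}",  # (2) reformat for the search programs
        f"derive {name}",    # (3) refresh dependent local databases
        f"index {name}",     # (4) SRS indexing for fast retrieval
    ]
```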

GQupdate accesses a variety of databases, including:

After transfer of a newly updated database, the GQupdate module automatically performs all the reformatting necessary to make the database searchable by the various search programs, such as BLAST. To minimize the computing time for database searches, we use the NRDB program from NCBI to produce a non-redundant database of protein as well as DNA sequences. In addition, derived databases such as DSSP, FSSP and HSSP are updated accordingly.
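The effect of redundancy removal can be illustrated with a minimal sketch that merges entries carrying exactly identical sequences; the real NRDB program does more (for example, header normalization), so this is a deliberate simplification:

```python
def make_nonredundant(entries):
    """Collapse entries with identical sequences, in the spirit of NRDB.

    entries: list of (identifier, sequence) pairs. Identifiers of exact
    duplicates are merged onto the first occurrence of each sequence.
    """
    nr = {}  # sequence -> list of identifiers, in order of appearance
    for ident, seq in entries:
        nr.setdefault(seq, []).append(ident)
    return [("|".join(ids), seq) for seq, ids in nr.items()]
```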

Finally, the module uses SRS to index and cross-reference the databases to provide fast retrieval of sequence database entries for two purposes: first, browsing of particular entries and instant access to database records, allowing the user to further examine the literature and the available documentation online; and second, extraction of sequence data for further analyses (e.g., multiple sequence alignment and iterative profile searches).

The GQupdate module has reached a mature phase and has been made available to EMBL and EBI sequence analysis support groups.



GQsearch

Automated database searches and sequence analysis

To accelerate a first scan of all databases in the most efficient way, a hierarchical model for database searches was implemented. For each query sequence, a new directory is created, and all search and analysis program output files are stored there. First, searching with the fastest available tool, the BLAST suite of programs, allows the verification and exclusion of those cases where a clear homology and a possible function can be readily documented. Next, searches are performed with FASTA, a slower but still widely used and at times more sensitive program. By default, the search is performed against the non-redundant database, which also includes the proteins translated from genes in the DNA databases.
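The hierarchy reduces to a simple fallback rule: run the fast program first and escalate only when no clear homologue appears. The cutoff value and the pluggable search functions below are illustrative assumptions, not GQsearch's actual parameters:

```python
def hierarchical_search(query, blast, fasta, clear_cutoff=1e-10):
    """Run the fast method first; fall back to the slower, more
    sensitive one only when no clear homologue is found.

    blast and fasta are callables returning (identifier, e-value) hits.
    """
    hits = blast(query)
    if hits and min(e for _, e in hits) <= clear_cutoff:
        return "blast", hits          # clear homology: stop here
    return "fasta", fasta(query)      # escalate to the slower search
```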

The GQsearch control program also allows the distribution of jobs in a cluster of workstations or a parallel computer for a speed-up of the searching process.

Sequence analysis programs

Additional characteristics of newly sequenced ORFs are of interest, especially when function cannot be predicted by homology. For example, coiled-coil regions, transmembrane segments, or previously described sequence patterns can be of great importance for a further understanding of protein function. Such analyses are always performed, irrespective of whether a homology is found; for cases where no relatives are found, they may provide the only hints about function. The computing time for these analyses is negligible. In addition to the standard analyses, we use filters for shorter and more meaningful output lists, multiple alignments, cluster analysis, secondary structure prediction and views of multidimensional sequence space.
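As a toy example of such an analysis, a sliding-window hydrophobicity scan can hint at membrane-spanning stretches. The (partial) Kyte-Doolittle-style scale, window length and cutoff below are chosen for illustration only:

```python
# Partial hydrophobicity scale (illustrative values for a few residues).
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "A": 1.8, "M": 1.9,
      "G": -0.4, "S": -0.8, "K": -3.9, "R": -4.5, "D": -3.5, "E": -3.5}

def hydrophobic_windows(seq, window=7, cutoff=1.5):
    """Return start positions of windows whose mean hydrophobicity
    exceeds the cutoff (possible membrane-spanning stretches)."""
    starts = []
    for i in range(len(seq) - window + 1):
        mean = sum(KD.get(a, 0.0) for a in seq[i:i + window]) / window
        if mean > cutoff:
            starts.append(i)
    return starts
```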

Parsing of search results

Various programs produce a wide range of output formats, usually a compromise between machine and human readability. The lack of a standard syntax has necessitated the implementation of a variety of dedicated parsers for the database search output. In that respect, the system is independent of the search software. Parsers for all the database search programs and for most of the analysis tools have been implemented.
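Each parser reduces to turning one program's report lines into uniform rows. The three-column format below is invented for the sketch; the real parsers must cope with each program's own layout:

```python
def parse_hits(text):
    """Turn 'identifier score e-value' report lines into uniform
    tuples, independent of which search program produced them."""
    rows = []
    for line in text.strip().splitlines():
        ident, score, evalue = line.split()
        rows.append((ident, int(score), float(evalue)))
    return rows
```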

A relational database schema

Organizing principles for the storage and the manipulation of results are necessary elements in this effort, since sequence database searches and other analyses provide us with a large amount of essentially unprocessed data. We decided to use a well-developed formalism in database design, the relational database model. The result parsers produce entity tables that are directly readable by a simple relational database engine, RDB (developed by Walt Hobbs, RAND Corporation, Santa Monica, CA, USA). RDB is a simple yet powerful database system written in perl; it is highly portable and can work on intermediate data for later transfer into a commercial RDBMS. The visualization component reads the relational database tables and provides various views. Merging, sorting, counting and other familiar operations can also be performed.
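The familiar operations mentioned above amount to ordinary relational manipulations of row tables. This sketch, with invented column names, shows sorting and counting in that spirit:

```python
def sort_table(rows, column):
    """Sort a list of row-dicts by one column (ascending)."""
    return sorted(rows, key=lambda r: r[column])

def count_by(rows, column):
    """Count rows per distinct value of a column."""
    counts = {}
    for r in rows:
        counts[r[column]] = counts.get(r[column], 0) + 1
    return counts
```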

List of methods

GQsearch applies a variety of methods, including:

Each method (or database) is defined in a definition file by its calling procedure, options, and input and output files. In this way it is easy to extend the analysis by incorporating additional methods or exchanging existing ones.

New methods for automatic annotation

Other methods for the characterization of proteins and their classification into major functional classes as well as their species distribution have been developed. In addition, genome comparisons using the derived data can be performed. A number of relevant publications on this issue can be found in the reference list at the end of this document.



GQreason

Automated reasoning and function assignment

The module GQreason, for automated reasoning and function assignment, is the third and in some ways most crucial component. Instead of relying on experts to interpret the output of the various searches and analysis tools, this suite of programs controls the evaluation of findings with very high reliability and reproducibility.

The use of the database search programs and the analysis programs can, in principle, also be regarded as the application of rules to the sequence and as queries in sequence databases. To obtain the required speed, these conceptual "rules" are coded algorithmically. Many of these methods in fact contain certain heuristics or are based on statistics over known proteins (e.g., transmembrane and secondary structure prediction). Rules, however, are best suited to the collection, management, and analysis of the result database. Due to the non-algorithmic nature of result evaluation, rules are an ideal way of processing and analyzing the given information. The rules have been derived from long experience with sequence analysis, and are usually remarkably simple. Their performance has been tested and iteratively optimized on a large number of yeast chromosome sequences (about 500) and the complete genome of Mycoplasma genitalium (another 500, approximately).

From the result files, which are converted to RDB tables by GQsearch during the first step, results are extracted, pre-processed and merged into a "feature" table. This table holds information for a user-specified number of homologies: gene name, database identifiers (which, when redundant, are merged into one field), diagnostic features such as the predictions for secondary structure and coiled-coils, and finally a reliability score for each item. The table thus provides a compact summary of results for a particular query sequence.
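The merging step can be sketched as building one summary row per query from per-method results; the field names and the e-value ranking are assumptions for illustration:

```python
def build_feature_row(query, homologies, features, max_hits=3):
    """Merge the top homologies and per-method features for one query
    into a single feature-table row."""
    top = sorted(homologies, key=lambda h: h["evalue"])[:max_hits]
    row = {"query": query,
           "hits": [h["id"] for h in top],
           "best_evalue": top[0]["evalue"] if top else None}
    row.update(features)  # secondary structure, coiled-coils, ...
    return row
```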

In the next step, the function assignment is made, on the basis of the documentation of the database homologues. This is the most crucial step, and is explained in some detail below.

For each method or topic, an independent component contains the coded set of rules for that task. These modules first check whether the required information is available. If so, their rules encode the expert knowledge needed to interpret the specific results: for example, how to assess reliability based on scores from different methods, which results are worth reporting, and how to process the information to derive new facts (e.g., the total number of proteins found with significant homology).
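One such rule component might map search results onto the confidence classes mentioned in the abstract (identical, clear, tentative, marginal). The thresholds here are invented for the sketch; the real rules weigh evidence from several methods:

```python
def classify_homology(identity, evalue):
    """Assign a reliability label to the closest database homologue.

    identity: fraction of identical residues (0..1); evalue: the
    search e-value, or None when no hit was found.
    """
    if evalue is None:
        return "no hit"        # required information unavailable
    if identity >= 0.98:
        return "identical"
    if evalue <= 1e-10:
        return "clear"
    if evalue <= 1e-3:
        return "tentative"
    return "marginal"
```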

Finally, in the last step, the processed feature tables are summarized at a higher level into a comprehensive table of the results. At this level the rules are very strict and report only clear results. In this way, the user can trust the derived facts and is relieved of time-consuming interactive checking. Ambiguous assignments are marked as such, helping the user focus directly on those difficult cases that cannot yet be resolved automatically.



GQbrowse

Viewing and browsing the results database

The fourth module gives access to the result databases and allows interactive evaluation and browsing of related sequences and of other databases, such as bibliographic entries. The current solution is based on World Wide Web technology and dynamically provides HTML documents that can be displayed with any available Web browser (such as Netscape or Mosaic). With this technology, it is straightforward to make the results - and the whole browsing capacity - available to any user connected to the Internet.

The pages provided by the viewer are generated dynamically from the result database. In this way, several user-specified views can be generated from the same set of data. The translation script is realised in the perl scripting language; all Web addresses are in fact calls to these 'CGI' scripts, which generate the HTML documents and even insert functional links to related information (such as database entries of sequences) on the fly. For this purpose, we generate links to the SRS database retrieval system, which keeps many biologically relevant databases indexed and supplies rapid access to this information. Furthermore, SRS links connected entries in different databases, making it easy to move between associated pieces of information. Besides this 'browsing' functionality, we provide some 'zooming in' functionality: the result database records the source of every piece of information stored, and GQbrowse automatically generates links to these sources. Thus, more detailed information and inspection of the original sources is just a mouse-click away.
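The on-the-fly page generation can be sketched as rendering one result row into an HTML fragment with a link back to the source entry; the URL pattern and field names are placeholders, not the actual SRS interface:

```python
def result_row_to_html(row, srs_url="http://srs.example/wgetz"):
    """Render a result-table row as an HTML table row with a link
    back to the source database entry (hypothetical URL scheme)."""
    link = f"{srs_url}?{row['db']}:{row['id']}"
    return (f"<tr><td><a href=\"{link}\">{row['id']}</a></td>"
            f"<td>{row['function']}</td></tr>")
```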



Applications

Previous versions of the system have been used to analyze the complete sequences of chromosomes XI, VIII and II and regions of chromosomes IX and XV from baker's yeast Saccharomyces cerevisiae. The complete genome will be analyzed as the sequence becomes available. In addition, large collections from the genome of Mycoplasma capricolum, a collection of ORFs from Archaea, and various sets of sequences provided by collaborating experimental biologists have been analyzed. Finally, the two complete genomes of Haemophilus influenzae and Mycoplasma genitalium have been analyzed. The accuracy of functional assignments was measured by comparing manually annotated datasets to the results produced by GeneQuiz. Based on a detailed analysis of approximately 1,000 query sequences, we estimate the overall accuracy of the system to be 95% or better. Compared to similar analyses reported in the literature, this accuracy is about ten percentage points higher. The difference comes from updated databases, better detection of composition bias, family analysis and intelligent parsing of the functional annotations of database homologues.



References