W3C home > Mailing lists > Public > public-widgets-pag@w3.org > April to June 2009

Call for prior art on Patent 5,764,992

From: Schneider Reinhard <reinhard.schneider@embl-heidelberg.de>
Date: Sat, 20 Jun 2009 11:52:23 +0200
Message-Id: <3A02881D-BEB7-44D4-A131-5825E3BB50A7@embl-heidelberg.de>
To: public-widgets-pag@w3.org
please find below a short description of an automated sequence analysis system for biological DNA and protein sequences.
The system had an automatic update procedure (db_update) for the underlying databases. It performed the updates automatically and triggered a range of actions such as reformatting, indexing and updating dependent system tools.

The first publication of the system appeared in 1994 (see reference below).

I also attach a kind of help file for the update script itself (update.rtf). This software later became part of a commercial package, and its further development is still in use.

I hope this is of some help; feel free to request more information.

GeneQuiz: a workbench for sequence analysis.

Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A.,  
Ouzounis, C., Sander, C.

Protein Design Group, European Molecular Biology Laboratory,  
Heidelberg, Germany.

Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB)

Volume 2, 1994, Pages 348-353


We present the prototype of a software system, called GeneQuiz, for  
large-scale biological sequence analysis. The system was designed to  
meet the needs that arise in computational sequence analysis and our  
past experience with the analysis of 171 protein sequences of yeast  
chromosome III. We explain the cognitive challenges associated with  
this particular research activity and present our model of the  
sequence analysis process. The prototype system consists of two parts:  
(i) the database update and search system (driven by perl programs and  
rdb, a simple relational database engine also written in perl) and  
(ii) the visualization and browsing system (developed under C++/ET++).  
The principal design requirement for the first part was the complete  
automation of all repetitive actions: database updates, efficient  
sequence similarity searches and sampling of results in a uniform  
fashion. The user is then presented with "hit-lists" that summarize  
the results from heterogeneous database searches. The expert's primary  
task now simply becomes the further analysis of the candidate entries,  
where the problem is to extract adequate information about functional  
characteristics of the query protein rapidly. This second task is  
tremendously accelerated by a simple combination of the heterogeneous  
output into uniform relational tables and the provision of browsing  
mechanisms that give access to database records, sequence entries and  
alignment views. Indexing of molecular sequence databases provides  
fast retrieval of individual entries with the use of unique  
identifiers as well as browsing through databases using pre-existing  
cross-references. The presentation here covers an overview of the architecture of the system prototype and our experiences on its applicability in sequence analysis.


db_update (GQupdate)
Version 1.1 (December 1994)

This document describes the use of the db_update script. This script, written in Perl, maintains updated versions of files copied via ftp from remote sites. To use the script, Perl has to be installed on your system; if this is not the case, please ask your local system manager to install it. In addition we use the Perl library 'ftplib' developed by David Sundstrom.
The basic input for the script is a file (called, for example, "db_update.list") that is passed as a command line argument to the script.
We first describe the format of this list, then the logic of the script and how to run it.

The update list


The "update-list" file contains a description of the packages you want  
to be updated. A package is a set of files in a given directory on a  
remote site. Each package and its associated parameters are indicated  
by a special label starting with ":" (like ":DATABASE1").
The parameters for each package are listed after the label line in a  
"keyword=value" fashion.

#	database update parameters

# parameters for PACKAGE1
:PACKAGE1
keyword1=no

# parameters for PACKAGE2
:PACKAGE2
keyword2=[value_a,value_b]

	 Lines starting with "#" are treated as comment lines.
	 No spaces around the "=" are allowed (wrong: keyword1 = no ).
	 Don't quote values (wrong: keyword1="wrong" ).
	 In case a parameter has more than one value, the values have to be enclosed in brackets "[...]" with a comma as separator, as in the above example for keyword2 of PACKAGE2.
Global parameters

Some parameters apply for all packages. These global parameters are  
grouped in a special package called "GLOBAL" which is a reserved label  
in the update list. The following global parameters are defined:
	 Work_dir= : the directory where miscellaneous files like log-files, error-files etc. will be stored. Default: current directory.
	 Log_file= : the name of the file which will contain the information of the last execution of the update procedure. Default: "db_update.out".
	 History_file= : the name of the file which will contain the concatenated information of the "Log_files". Default: no history file.
	 Error_file= : the name of the file where the standard error is stored. This file also contains the dialog of the ftp-sessions in case the "ftp_debug" flag is set. Default: standard error.
	 Ftp_timeout= : the number of seconds after which an ftp-session will be terminated when no data transfer is detected between the remote and local machine. Values from 1 to 9 are set to 10. Default: 0 (no timeout).
	 Ftp_anonymous_pwd= : the password to be used for anonymous ftp sessions.
	 Ftp_debug= : flag to trace the dialog of the ftp-sessions. If this flag is set to "1" the dialog is sent to the "Error_file". Default: 0 (no trace).
	 job1 : a process running after updating all packages (see description below)


job1=/home/my_account/bin/my_program options
job2=rm /home/my_account/*.trace
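Taken together, a ":GLOBAL" section using these parameters might look like the following sketch (all paths and values are illustrative, not defaults):

```
:GLOBAL
Work_dir=/data/update
Log_file=db_update.out
History_file=db_update.history
Error_file=db_update.err
Ftp_timeout=120
Ftp_anonymous_pwd=my_name@my_domain
Ftp_debug=0
```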

Package parameters

Each package has several parameters to define the following tasks:
	 how to connect to a remote site
	 definition of files to be transferred
	 definition of local files to be compared with the remote files
	 definition of a renaming procedure for the local files
	 detection of local files which are no longer available on the remote site
	 definition of jobs to be executed after the update procedure.
Connection parameters

The following parameters are defined (some of these parameters are mandatory):
	 host= : Internet name or number of the remote host (Mandatory).
	 user= : user name for the ftp-session. In case the user name is set  
to anonymous the global "Ftp_anonymous_pwd" variable is used as the  
password. (Mandatory).
	 password= : the password for the remote site. If "user" is set to  
"anonymous" the global "Ftp_anonymous_pwd" variable is used. If no  
password is defined and the user name is not "anonymous" the program  
tries to get the password from the ".netrc" file. NOTE: THIS BEHAVIOR  
CAN CAUSE SECURITY PROBLEMS. If the password is set to "?" the script  
will prompt the user for the password.
	 trans_mode= : defines the transfer mode. It can be either "A" (for ASCII mode) or "B" (for binary mode). "B" should be used when transferred files are compressed or executables, and is preferred for all UNIX to UNIX transfers. (Mandatory).

Local environment parameters

	 local_dir= : the name of the directory where the transferred files will be stored.
	 files_local= : a pattern (or list of patterns) which defines the list of local files that will be compared with remote files (it can be different from "files_wanted", see below). The pattern has to be given in the UNIX filename convention. (Mandatory).

Remote environment parameters

	 remote_dir= : the name of the directory where the wanted files are located.
	 content_file= : the name of the remote file which contains the list of remote files (like: "ls-R" or "content.list"). (Optional).
	 files_wanted= : a pattern (or list of patterns) which defines the candidates for the update procedure. The pattern has to be given in the UNIX filename convention. If a "content_file" is defined this file will be used. Only files which match the given pattern(s) will be transferred. (Mandatory).
Optional parameters

File name conversion parameters [optional]

The following parameters are useful in case one wants to rename certain files on the local site. This name conversion is taken into account when the remote and local files are compared.
	 translate_case= : character conversion between upper and lower case characters. "L" to convert from upper to lower case characters. "U" to convert from lower to upper case characters. (Optional).
	 translate_old_pattern= : pattern (or list of patterns) replaced by the "translate_new_pattern"
	 translate_new_pattern= : pattern (or list of patterns) replacing the "translate_old_pattern"

The patterns are defined as Perl regular expressions.
The case conversion takes place before the pattern conversion.
Files with a ".Z" (compressed file) or ".gz" (gnu gzip file) extension  
will be automatically uncompressed after transfer unless the  
"no_uncompress" option is used (see below).


As an example, a parameter set could perform the following conversion: first, all upper case characters in the file name are converted to lower case characters; then the "pre_" string at the beginning of a file name is removed and the "_post" string at the end of a file name is replaced with the string "_data".

	file name on the remote site		file name on the local site
	Pre_name1				name1
	Pre_name2_Post				name2_data
	pre_NAME3_post				name3_data
	Name4					name4
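The conversion in this table can be sketched in Python. This is a model of the rule order (case conversion first, then pattern replacements), not the actual Perl code; the pattern values ^pre_ and _post$ and the replacement "_data" are inferred from the table above:

```python
import re

def translate(name, case=None, old_patterns=(), new_patterns=()):
    """Model of the db_update name conversion: case conversion is
    applied first, then each old pattern is replaced by its new pattern."""
    if case == "L":
        name = name.lower()
    elif case == "U":
        name = name.upper()
    for old, new in zip(old_patterns, new_patterns):
        name = re.sub(old, new, name)
    return name

# Illustrative rules matching the table above
rules = dict(case="L",
             old_patterns=(r"^pre_", r"_post$"),
             new_patterns=("", "_data"))

for remote in ("Pre_name1", "Pre_name2_Post", "pre_NAME3_post", "Name4"):
    print(remote, "->", translate(remote, **rules))
```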

Renaming local files [optional]

In some special cases one wants to rename files on the local site as a  
result of an additional job running after the update procedure (see  
"jobs"). In this case it's necessary to give the procedure a way to  
reconstruct the file name on the remote site (actually the file name  
after applying the "translate_*" rules).
The following parameters are used for this reconstruction procedure.
	 translate_case_opt= : character case conversion (see  
"translate_case" above)
	 translate_old_pattern_opt= : pattern or list of patterns replaced  
by "translate_new_pattern_opt"
	 translate_new_pattern_opt= : pattern or list of patterns replacing  
the "translate_old_pattern_opt"

The patterns are defined as Perl regular expressions.


Let's assume a local procedure that checks the actual data in transferred files detects a problem in the file named "very_new.data" from the remote site and renames it to "very_new.data_uuppps" on the local site. To reconstruct the original name on the remote site we would set the parameters:
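For instance (the parameter values here are illustrative; patterns are Perl regular expressions), stripping the local "_uuppps" suffix recovers the remote name:

```
translate_old_pattern_opt=_uuppps$
translate_new_pattern_opt=
```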

Compressed files [optional]

Compressed files (or gzip-files) will be uncompressed (or gunzipped)  
by default. If the file should be kept as a compressed file the  
parameter "no_uncompress" has to be set to "Y".
	 no_uncompress= : N, uncompress files after transfer (Default); Y,  
to keep the files compressed.

Recursive update [optional]

By default, only files from the "remote_dir" are compared to the files in "local_dir". When the recursive option is used, the update rules also apply to all the sub-directories found in "remote_dir". When a sub-directory is found in the remote dir, it is created locally if it does not exist.
	 recursive= : N, only scan "remote_dir"; Y, also scan all the sub-directories found in "remote_dir".

Keep the file date [optional]

By default, when a file is transferred, the modification date of the file is the transfer date. When "keep_remote_date" is used, the modification date of the remote file is used to set the modification date of the file after the transfer.
	 keep_remote_date= : N, file modification date is the transfer date; Y, remote and local file have the same modification date.

Final jobs [optional]

In case the transferred data needs post-processing (like reformatting one database format into another, or creating additional data files dependent on the transferred data) one can define additional jobs to be executed after transfer with the following syntax:

	 job1=, job2=, ... : list of jobs to be executed after the update of the specified package. The job numbering has to be in consecutive order; the script stops execution of jobs when the order is broken.


The following global variables can be used in the specification of the jobs:
	 $old_files : the name of the file which contains the names of files which were detected to be no longer on the remote site.
	 $new_files : the name of the file which contains the file names of all new files.
	 $transfered_files : the name of the file which contains the file names of all transferred files (either "new" or "updated").
	 $local_dir : name of the local directory.

job1=my_program $transfered_files > /tmp/trace_file
job2=rm $local_dir/*.old

"job1" would execute a program or script with the name "my_program". The name of the file listing the transferred files ("new" or "updated") is passed as the argument "$transfered_files", and the standard output of the program is redirected to the file "/tmp/trace_file".
"job2" would remove all files with the extension ".old" in the local directory of the current package.

Complete example

# entry for a database called sigma
# ftp parameter
# local environment
# remote environment
# file name conversion parameter after transfer
#	- remove ppp prefix
#	- replace eee by bbb
#	- convert file names to lower case
# optional local name conversion
#	- compare a remote "name" with "name_old"
#	  (if "name" doesn't exist on local site)
# compressed files should be kept compressed
# do not apply to sub directories
# keep the modification date of files
# post-processing of the transferred data
#	- run the program "rename_old_sigma_files" using the list $old_files
job1=/my_account/bin/rename_old_sigma_files $old_files
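Filled in along the lines of the comments above, a complete entry might look like this sketch (host name, directories and pattern values are purely illustrative):

```
# entry for a database called sigma
:SIGMA
# ftp parameter
host=ftp.remote-site.org
user=anonymous
trans_mode=B
# local environment
local_dir=/data/sigma
files_local=sigma*
# remote environment
remote_dir=/pub/db/sigma
files_wanted=sigma*
# file name conversion parameter after transfer
# (remove ppp prefix, replace eee by bbb, convert to lower case)
translate_case=L
translate_old_pattern=[^ppp,eee]
translate_new_pattern=[,bbb]
# optional local name conversion
# (compare a remote "name" with "name_old")
translate_old_pattern_opt=_old$
translate_new_pattern_opt=
# compressed files should be kept compressed
no_uncompress=Y
# do not apply to sub directories
recursive=N
# keep the modification date of files
keep_remote_date=Y
# post-processing of the transferred data
job1=/my_account/bin/rename_old_sigma_files $old_files
```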

Execute the script

The syntax for the script is:

db_update UPDATE_LIST [-f=(database|ALL)] [-t=trace]
	 UPDATE_LIST : file which contains the description for the packages  
(see above)
	 "-f" : used to force an update of a specific package or to force  
the update of ALL packages specified in the UPDATE_LIST.
	 "-t" : defines the level of tracing of the information generated by the script during execution. The parameter "trace" is an integer that should be in the range between 0 and 3. The default is "0", which means that only problems and errors are reported. The value "3" is useful during debugging sessions.
Global initialisation and consistency checking

The script first performs a syntax check of the UPDATE_LIST. Any detected error will be reported and the script exits.
If no syntax error is detected in the UPDATE_LIST the script prompts for all the necessary passwords for which the interactive password setting was chosen.
After initialisation of the global parameters from the ":GLOBAL"  
section (name for log-files, error-files, ftp-parameters ...) the  
script loops over all packages listed in the UPDATE_LIST.
Actions taken for each package

Using the specified parameters for each package the following tasks  
are done:
 build a list of local files which should be checked for a newer  
version on the remote site.
 connect to the remote site using ftp and get a list of remote files.
 compare the remote and local files, taking into account that:
- remote files and local files can have different names due to the "translate_*" parameters in the UPDATE_LIST.
- remote files which are compressed are uncompressed on the local site, unless the "no_uncompress" option is set.
- local files can have different names due to some post-processing of the data (see "translate_*_opt" parameter in UPDATE_LIST).
- if different files from the remote site (due to the translation rules) would get the same name on the local site, only the most recent file is chosen.
If a specified translation rule says "rename data_file.dat* to data_file.dat" but the remote site contains the following files:
data_file.dat.V7 01.Jan 1999
data_file.dat.V8 01.Jan 2000,
the script copies only the file "data_file.dat.V8".
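The collision rule amounts to keeping the newest remote file per local name. As a Python sketch (a model of the behaviour described above, not the script's actual implementation):

```python
def pick_most_recent(remote_files, local_name_of):
    """Of all remote files that map to the same local name under the
    translation rules, keep only the most recent one.
    remote_files: dict of remote name -> modification time (comparable)."""
    chosen = {}  # local name -> (remote name, mtime)
    for name, mtime in remote_files.items():
        local = local_name_of(name)
        if local not in chosen or mtime > chosen[local][1]:
            chosen[local] = (name, mtime)
    return {local: name for local, (name, _) in chosen.items()}

# The versions from the example both translate to "data_file.dat";
# only the more recent V8 file would be copied.
remote = {"data_file.dat.V7": 1999, "data_file.dat.V8": 2000}
print(pick_most_recent(remote, lambda n: "data_file.dat"))
# -> {'data_file.dat': 'data_file.dat.V8'}
```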

The result of this comparison is two lists containing:
1.) files that should be transferred because:
a) the local files are older than the ones on the remote site or
b) they don't exist on the local site (new files on the remote site)
2.) files that don't exist on the remote site but do exist on the local site (old files)
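The classification into these two lists can be sketched as follows (again in Python, not the original Perl; file names here are assumed to already be in local form after the translation rules):

```python
def compare(remote, local):
    """Classify files the way db_update's comparison step does.
    remote/local: dict of local-style file name -> modification time."""
    to_transfer = sorted(n for n, t in remote.items()
                         if n not in local or t > local[n])  # new or updated
    old_files = sorted(n for n in local if n not in remote)  # gone on remote
    return to_transfer, old_files

remote = {"a.dat": 2, "b.dat": 1, "c.dat": 1}
local  = {"a.dat": 1, "b.dat": 1, "d.dat": 1}
print(compare(remote, local))  # -> (['a.dat', 'c.dat'], ['d.dat'])
```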

 create the missing local sub-directories if the "recursive" option is set to "Y".
 transfer files using ftp.
 rename files according to the "translate_*" parameters.
 uncompress compressed or gzipped files, unless the "no_uncompress" option is set to "Y" in the UPDATE_LIST.
 change the date of the local file if the "keep_remote_date" option is set to "Y".
 in case of an update, execute the "jobs" specified for the current package.
 note the current date and time of the update in the log-file.
Final steps

After processing all the packages the script executes the jobs defined in the GLOBAL section of the UPDATE_LIST. These jobs are executed only if at least one of the packages was updated.
Automatic execution of the script

To execute the script on a regular basis one can use the "crontab" program. See the corresponding "man" pages on your system or contact your local system manager for a detailed description.
To execute the script every night you could install a crontab entry like the following:
	# this file is called "update.crontab"
	30 5 * * * /data/update/crontab.csh

The above would start the update procedure every night at 5:30 in the morning. To start the automatic update procedure you type:
	crontab update.crontab

the file "crontab.csh" could look like:
	#! /bin/csh
	# change directory to working directory
	cd /data/update
	# run update script
	/usr/pub/bin/perl db_update UPDATE_LIST >& UPDATE.trace
	# produce simple update information with list of transferred files
	date > UPDATE.done
	echo "" >> UPDATE.done
	echo "The following file(s) / packages are updated:" >> UPDATE.done
	echo "" >> UPDATE.done
	cat /data/update/trace_db/*.transfered >> UPDATE.done
	# if something got updated mail it to some users
	if ( -e UPDATE.done ) then
		Mail -s db_update account@your_domain < UPDATE.done
	endif

Log-files, errors, update-information

The script produces two kinds of output files: global information about the update procedure and information concerning each of the specified packages.
1) Global information:

Trace information is sent by default to the standard output. In addition three files are created by the script using the names specified in the GLOBAL section of the UPDATE_LIST:
	 error file : all detected errors and the (optional) dialog of the ftp-sessions.
	 log file : some basic information about the last execution of the update procedure.
	 history file : concatenated log files (optional).
2) Package information:

For each package a set of files is produced in the working directory ("$Work_dir"), where "X" is the name of the package:

"X".remote : list of files found at the remote site.
"X".transfered : list of files transferred to the local site.
"X".new : list of files transferred to the local site which did not exist before.
"X".old : list of files on the local site which are not present on the remote site.

Received on Monday, 22 June 2009 08:53:25 UTC
