- From: Schneider Reinhard <reinhard.schneider@embl-heidelberg.de>
- Date: Sat, 20 Jun 2009 11:52:23 +0200
- To: public-widgets-pag@w3.org
- Message-Id: <3A02881D-BEB7-44D4-A131-5825E3BB50A7@embl-heidelberg.de>
Please find below a short description of an automated sequence analysis system for biological DNA and protein sequences. The system had an automatic update procedure (db_update) for the underlying databases. It performed the updates automatically and triggered a range of actions like reformatting, indexing and updating dependent system tools. The first publication of the system appeared in 1994 (see reference below). I also attach a kind of help file for the update script itself (update.rtf). This software later became part of a commercial package, and its further development is still in use. I hope this is of some help; feel free to request more information.

GeneQuiz: a workbench for sequence analysis.
Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A., Ouzounis, C., Sander, C.
Protein Design Group, European Molecular Biology Laboratory, Heidelberg, Germany.
Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB), Volume 2, 1994, Pages 348-353.

Abstract

We present the prototype of a software system, called GeneQuiz, for large-scale biological sequence analysis. The system was designed to meet the needs that arise in computational sequence analysis and our past experience with the analysis of 171 protein sequences of yeast chromosome III. We explain the cognitive challenges associated with this particular research activity and present our model of the sequence analysis process. The prototype system consists of two parts: (i) the database update and search system (driven by perl programs and rdb, a simple relational database engine also written in perl) and (ii) the visualization and browsing system (developed under C++/ET++). The principal design requirement for the first part was the complete automation of all repetitive actions: database updates, efficient sequence similarity searches and sampling of results in a uniform fashion. The user is then presented with "hit-lists" that summarize the results from heterogeneous database searches. The expert's primary task now simply becomes the further analysis of the candidate entries, where the problem is to extract adequate information about functional characteristics of the query protein rapidly. This second task is tremendously accelerated by a simple combination of the heterogeneous output into uniform relational tables and the provision of browsing mechanisms that give access to database records, sequence entries and alignment views. Indexing of molecular sequence databases provides fast retrieval of individual entries with the use of unique identifiers as well as browsing through databases using pre-existing cross-references. The presentation here covers an overview of the architecture of the system prototype and our experiences on its applicability in sequence analysis. (ABSTRACT TRUNCATED AT 250 WORDS)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

db_update (GQupdate) Version 1.1 (December 1994)

Introduction

This document describes the use of the db_update script. This script, written in Perl, maintains updated versions of files copied via ftp from remote sites. To use the script, Perl has to be installed on your system. If this is not the case, please ask your local system manager to get it. In addition we use the Perl library 'ftplib' developed by David Sundstrom.

The basic input for the script is a file, called for example "db_update.list", which is passed as a command line argument to the script. We first describe the format of this list, and then the logic of the script and how to run it.
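To give an idea of the core operation that db_update automates, here is a minimal, illustrative Perl sketch (not part of the package): it checks whether one remote file is newer than the local copy and fetches it if so. It uses the standard Net::FTP module instead of 'ftplib', and the host, directory and file names are made up.

#!/usr/bin/perl
# Minimal sketch, NOT the original db_update: fetch a remote file
# if it is newer than the local copy. Host, directory and file
# names are invented for illustration.
use strict;
use warnings;
use Net::FTP;

my ($host, $dir, $file) = ('ftp.example.org', '/pub/data', 'db.dat');

my $ftp = Net::FTP->new($host, Timeout => 50)
    or die "cannot connect to $host: $@";
$ftp->login('anonymous', 'account@your_domain') or die "login failed";
$ftp->cwd($dir)                                 or die "cwd failed";
$ftp->binary;                                   # transfer mode "B"

# modification time of the remote file (seconds since the epoch)
my $remote_mtime = $ftp->mdtm($file) || 0;
# modification time of the local copy, 0 if it does not exist yet
my $local_mtime  = (stat($file))[9] || 0;

$ftp->get($file) if $remote_mtime > $local_mtime;
$ftp->quit;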
The update list

Format

The "update-list" file contains a description of the packages you want to be updated. A package is a set of files in a given directory on a remote site. Each package and its associated parameters are indicated by a special label starting with ":" (like ":DATABASE1"). The parameters for each package are listed after the label line in a "keyword=value" fashion.

EXAMPLE:

#
# database update parameters
#
:GLOBAL
keyword1=value1
keyword2=value2
job1=job-name
#
# parameters for PACKAGE1
#
:PACKAGE1
#
keyword1=value1
keyword2=value2
job1=value
#
# parameters for PACKAGE2
#
:PACKAGE2
#
keyword1=value1
keyword2=[value2a,value2b]
keyword3=value3
#

NOTE:
• Lines starting with "#" are treated as comment lines.
• No spaces around the "=" are allowed (wrong: keyword1 = value1).
• Don't quote values (wrong: keyword1="wrong").
• In case a parameter has more than one value, the values have to be enclosed within brackets "[" and "]" with a comma as separator, as in the above example for keyword2 of PACKAGE2.

Global parameters

Some parameters apply to all packages. These global parameters are grouped in a special package called "GLOBAL", which is a reserved label in the update list. The following global parameters are defined:

• Work_dir= : the directory where miscellaneous files like log-files, error-files etc. will be stored. Default: current directory.
• Log_file= : the name of the file which will contain the information of the last execution of the update procedure. Default: "db_update.out".
• History_file= : the name of the file which will contain the concatenated information of the "Log_files". Default: no history file.
• Error_file= : the name of the file where the standard error is stored. This file also contains the dialog of the ftp-sessions in case the "Ftp_debug" flag is set. Default: standard error.
• Ftp_timeout= : the number of seconds after which an ftp-session will be terminated when no data transfer is detected between the remote and the local machine. Values from 1 to 9 are set to 10. Default: 0 (no time out).
• Ftp_anonymous_pwd= : the password to be used for anonymous ftp connections.
• Ftp_debug= : flag to trace the dialog of the ftp-sessions. If this flag is set to "1" the dialog is sent to the "Error_file". Default: 0 (no trace).
• job1= : a process running after updating all packages (see description below).

EXAMPLE:

:GLOBAL
#
Work_dir=/home/my_account/update
Log_file=/home/my_account/update/update.log
History_file=/home/my_account/update/update.history
Error_file=/home/my_account/update/update.error
#
Ftp_timeout=50
Ftp_anonymous_pwd=account@your_domain
Ftp_debug=0
#
job1=/home/my_account/bin/my_program options
job2=rm /home/my_account/*.trace
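The following is a minimal, illustrative Perl sketch (not the original parser) of how an update list in this format could be read into a data structure; the variable names are made up.

#!/usr/bin/perl
# Sketch of a parser for the update-list format described above:
# ":LABEL" starts a package, "#" starts a comment, "keyword=value"
# sets a parameter and "[a,b]" is a multi-valued parameter.
use strict;
use warnings;

my (%packages, $label);
while (my $line = <>) {
    chomp $line;
    next if $line =~ /^\s*#/ or $line =~ /^\s*$/;  # skip comments, blanks
    if ($line =~ /^:(\w+)/) {                      # new package label
        $label = $1;
        next;
    }
    my ($key, $value) = split /=/, $line, 2;       # no spaces around "="
    if (defined $value and $value =~ /^\[(.*)\]$/) {
        # multi-valued keyword; keep empty fields (e.g. "[,_data]")
        $packages{$label}{$key} = [ split /,/, $1, -1 ];
    } else {
        $packages{$label}{$key} = $value;
    }
}

# e.g. $packages{GLOBAL}{Ftp_timeout}
#  or  @{ $packages{PACKAGE2}{keyword2} }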
Package parameters

Each package has several parameters to define the following tasks:
• how to connect to a remote site
• definition of files to be transferred
• definition of local files to be compared with the remote files
• definition of a renaming procedure for the local files
• detection of local files which are no longer available on the remote site
• definition of jobs to be executed after the update procedure.

Connection parameters

The following parameters are defined (some of these parameters are mandatory):
• host= : Internet name or number of the remote host (Mandatory).
• user= : user name for the ftp-session. In case the user name is set to "anonymous", the global "Ftp_anonymous_pwd" variable is used as the password (Mandatory).
• password= : the password for the remote site. If "user" is set to "anonymous", the global "Ftp_anonymous_pwd" variable is used. If no password is defined and the user name is not "anonymous", the program tries to get the password from the ".netrc" file. NOTE: THIS BEHAVIOR CAN CAUSE SECURITY PROBLEMS. If the password is set to "?" the script will prompt the user for the password.
• trans_mode= : defines the transfer mode. It can be either "A" (for ASCII mode) or "B" (for binary mode). "B" should be used when transferred files are compressed or executables, and is preferred for all UNIX to UNIX transfers (Mandatory).

Local environment parameters

• local_dir= : the name of the directory where the transferred files will be stored.
• files_local= : a pattern (or list of patterns) which defines the list of local files that will be compared with the remote files (it can be different from "files_wanted", see below). The pattern has to be given in the UNIX filename convention (Mandatory).

Remote environment parameters

• remote_dir= : the name of the directory where the wanted files are located.
• content_file= : the name of the remote file which contains the list of remote files (like "ls-R" or "content.list") (Optional).
• files_wanted= : a pattern (or list of patterns) which defines the candidates for the update procedure. The pattern has to be given in the UNIX filename convention. If a "content_file" is defined, this file will be used. Only files which match the given pattern(s) will be transferred (Mandatory).

Optional parameters

File name conversion parameters [optional]

The following parameters are useful in case one wants to rename certain files on the local site. This name conversion is taken into account when the remote and local files are compared.
• translate_case= : character conversion between upper and lower case characters. "L" to convert from upper to lower case characters, "U" to convert from lower to upper case characters (Optional).
• translate_old_pattern= : pattern (or list of patterns) replaced by "translate_new_pattern".
• translate_new_pattern= : pattern (or list of patterns) replacing the "translate_old_pattern".

Note: The patterns are defined as Perl regular expressions. The case conversion takes place before the pattern conversion. Files with a ".Z" (compressed file) or ".gz" (gnu gzip file) extension will be automatically uncompressed after transfer unless the "no_uncompress" option is used (see below).

EXAMPLE:

The following parameters:

translate_case=L
translate_old_pattern=[^pre_,_post$]
translate_new_pattern=[,_data]

would do this: First, all upper case characters in the file name are converted to lower case characters. After that, the "pre_" string at the beginning of a file name is removed and the "_post" string at the end of a file name is replaced with the string "_data".

file name on the remote site    file name on the local site
-------------------------------------------------------------
Pre_name1                       name1
Pre_name2_Post                  name2_data
pre_NAME3_post                  name3_data
Name4                           name4
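As an illustration of these translation rules, a minimal Perl sketch (not the original code) that reproduces the table above could look like this:

#!/usr/bin/perl
# Sketch of the file name translation described above: case
# conversion first, then each old pattern is replaced by the
# corresponding new pattern (Perl regular expressions).
use strict;
use warnings;

my $translate_case = 'L';
my @old_patterns   = ('^pre_', '_post$');
my @new_patterns   = ('',      '_data');

for my $name (qw(Pre_name1 Pre_name2_Post pre_NAME3_post Name4)) {
    my $local = $name;
    $local = lc $local if $translate_case eq 'L';
    $local = uc $local if $translate_case eq 'U';
    for my $i (0 .. $#old_patterns) {
        my ($old, $new) = ($old_patterns[$i], $new_patterns[$i]);
        $local =~ s/$old/$new/;
    }
    printf "%-20s -> %s\n", $name, $local;
}
# prints: Pre_name1 -> name1, Pre_name2_Post -> name2_data,
#         pre_NAME3_post -> name3_data, Name4 -> name4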
Renaming local files [optional]

In some special cases one wants to rename files on the local site as the result of an additional job running after the update procedure (see "jobs"). In this case it is necessary to give the procedure a way to reconstruct the file name on the remote site (actually the file name after applying the "translate_*" rules). The following parameters are used for this reconstruction procedure:
• translate_case_opt= : character case conversion (see "translate_case" above).
• translate_old_pattern_opt= : pattern (or list of patterns) replaced by "translate_new_pattern_opt".
• translate_new_pattern_opt= : pattern (or list of patterns) replacing the "translate_old_pattern_opt".

The patterns are defined as Perl regular expressions.

EXAMPLE:

Let's assume a local procedure that checks the actual data in transferred files detects a problem in the file named "very_new.data" on the remote site and renames it to "very_new.data_uuppps" on the local site. To reconstruct the original name on the remote site we would set the parameters:

translate_case_opt=
translate_old_pattern_opt=_uuppps$
translate_new_pattern_opt=[]

Compressed files [optional]

Compressed files (or gzip-files) will be uncompressed (or gunzipped) by default. If a file should be kept compressed, the parameter "no_uncompress" has to be set to "Y".
• no_uncompress= : N, uncompress files after transfer (Default); Y, keep the files compressed.

Recursive update [optional]

By default, only files from "remote_dir" are compared to the files in "local_dir". When the recursive option is used, the update rules also apply to all the sub-directories found in "remote_dir". When a sub-directory is found in the remote directory, it is created locally if it does not exist.
• recursive= : N, only scan "remote_dir"; Y, also scan all the sub-directories.

Keep the file date [optional]

By default, when a file is transferred, the modification date of the local file is the transfer date. When "keep_file_date" is used, the modification date of the remote file is used to set the modification date of the file after the transfer.
• keep_file_date= : N, file modification date is the transfer date; Y, remote and local file have the same modification date.

Final jobs [optional]

In case the transferred data needs post-processing (like reformatting one database format into another, or creating additional data files dependent on the transferred data), one can define additional jobs to be executed after transfer with the following syntax:

job1= : list of jobs to be executed after the update of the specified package.
job2= : ...

The job numbering has to be in consecutive order. The script stops execution of jobs when the order is broken.

WRONG:
job1=
job3=

The following global variables can be used in the specification of a job:
• $old_files : the name of the file which contains the names of files which are no longer available on the remote site.
• $new_files : the name of the file which contains the file names of all new files.
• $transferred_files : the name of the file which contains the file names of all transferred files (either "new" or "updated").
• $local_dir : name of the local directory.

EXAMPLE:

job1=my_program $transferred_files > /tmp/trace_file
job2=rm $local_dir/*.old

"job1" would execute a program or script with the name "my_program". The name of the file listing the files which were transferred ("new" or "updated") is passed as the argument "$transferred_files", and the standard output of the program is redirected to the file "/tmp/trace_file". "job2" would remove all files with the extension ".old" in the local directory of the current package.
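As an illustration of this job logic, here is a minimal Perl sketch (not the original code) that runs jobs in consecutive order, stops at the first missing number and substitutes the special variables; all file names are made up.

#!/usr/bin/perl
# Sketch of the final-job logic described above: jobs run as
# job1, job2, ... and execution stops when the numbering breaks.
use strict;
use warnings;

# parameters of one package, as they might come out of the update list
my %package = (
    job1 => 'my_program $transferred_files > /tmp/trace_file',
    job2 => 'rm $local_dir/*.old',
);
# values of the special variables for this package (invented names)
my %vars = (
    transferred_files => '/data/update/trace_db/SIGMA.transfered',
    new_files         => '/data/update/trace_db/SIGMA.new',
    old_files         => '/data/update/trace_db/SIGMA.old',
    local_dir         => '/home/sigma',
);

for (my $n = 1; exists $package{"job$n"}; $n++) {  # stop when order breaks
    my $cmd = $package{"job$n"};
    $cmd =~ s/\$(\w+)/$vars{$1}/g;                 # substitute $variables
    system($cmd) == 0 or warn "job$n failed: $cmd\n";
}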
Complete example

# entry for a database called sigma
:SIGMA
#
# ftp parameters
#
host=ftp.xyz-entenhausen.edu
user=anonymous
password=itsme@local-host.edu
#
# local environment
#
local_dir=/home/sigma
files_local=*.dat
#
# remote environment
#
remote_dir=/pub/sigma/compressed_files
content_file=
files_wanted=[*]
#
# file name conversion parameters after transfer
# - remove the "ppp" prefix
# - replace "eee" by "bbb"
# - convert file names to lower case
#
translate_old_pattern=[^ppp,eee]
translate_new_pattern=[,bbb]
translate_case=L
#
# optional local name conversion
# - compare a remote "name" with "name_old"
#   (if "name" doesn't exist on the local site)
#
translate_old_pattern_opt=[_old]
translate_new_pattern_opt=[]
translate_case_opt=
#
# compressed files should be kept compressed
#
no_uncompress=Y
#
# do not apply to sub-directories
#
recursive=N
#
# keep the modification date of the files
#
keep_file_date=Y
#
# post-processing of the transferred data
# - run the program "rename_old_sigma_files" using the list $old_files
#
job1=/my_account/bin/rename_old_sigma_files $old_files

Execute the script

The syntax for the script is:

db_update UPDATE_LIST [-f=(database|ALL)] [-t=trace]

• UPDATE_LIST : the file which contains the description of the packages (see above).
• "-f" : used to force the update of a specific package or to force the update of ALL packages specified in the UPDATE_LIST.
• "-t" : defines the level of tracing of the information generated by the script during execution. The parameter "trace" is an integer in the range between 0 and 3. The default is "0", which means that only problems and errors are reported. The value "3" is useful during debugging sessions.

Global initialisation and consistency checking

The script first performs a syntax check of the UPDATE_LIST. Any detected error will be reported and the script exits. If no syntax error is detected in the UPDATE_LIST, the script prompts for all the necessary passwords for which the interactive password setting was chosen. After initialisation of the global parameters from the ":GLOBAL" section (names for log-files, error-files, ftp-parameters, ...), the script loops over all packages listed in the UPDATE_LIST.

Actions taken for each package

Using the specified parameters, the following tasks are done for each package:
· build a list of local files which should be checked for a newer version on the remote site.
· connect to the remote site using ftp and get a list of remote files.
· compare the remote and local files (see the sketch after this list), taking into account that:
  - remote files and local files can have different names due to the "translate_*" parameters in the UPDATE_LIST.
  - remote files which are compressed are uncompressed on the local site, unless the "no_uncompress" option is set.
  - local files can have different names due to some post-processing of the data (see the "translate_*_opt" parameters in the UPDATE_LIST).
  - if different files from the remote site would get the same name on the local site (due to the translation rules), only the most recent file is chosen. Example: if a specified translation rule says "rename data_file.dat* to data_file.dat", but the remote site contains the files

      data_file.dat.V7   01.Jan 1999
      data_file.dat.V8   01.Jan 2000

    the script copies only the file "data_file.dat.V8".

  The result of this comparison are two lists containing:
  1.) files that should be transferred because
      a) the local files are older than the ones on the remote site, or
      b) they don't exist on the local site (new files on the remote site);
  2.) files that don't exist on the remote site but do exist on the local site (old files).
· create the missing local sub-directories if the "recursive" option is set to "Y".
· transfer the files using ftp.
· rename the files according to the "translate_*" parameters.
· uncompress compressed or gzip-files, unless the "no_uncompress" option is set to "Y" in the UPDATE_LIST.
· change the date of the local file if the "keep_file_date" option is set to "Y".
· in case of an update, execute the "jobs" specified for the current package.
· note the current date and time of the update in the log-file.
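For illustration, a much simplified Perl sketch of the comparison step (not the original code) could look like the following. It ignores the name-collision rule, uses an invented translate_name() helper as a stand-in for the "translate_*" rules, and obtains remote modification times with Net::FTP's mdtm().

#!/usr/bin/perl
# Much simplified sketch of the comparison step. $ftp is an
# already connected Net::FTP handle, as in the earlier sketch.
use strict;
use warnings;
use Net::FTP;

# stand-in for the "translate_*" rules (here: translate_case=L only)
sub translate_name { my ($name) = @_; return lc $name }

sub compare_package {
    my ($ftp, $local_dir, @remote_files) = @_;
    my (@to_transfer, %seen);

    for my $remote (@remote_files) {
        my $local = translate_name($remote);
        $seen{$local} = 1;
        my $local_mtime  = (stat("$local_dir/$local"))[9] || 0;
        my $remote_mtime = $ftp->mdtm($remote) || 0;
        # transfer when the file is new locally or the remote one is newer
        push @to_transfer, $remote if $remote_mtime > $local_mtime;
    }

    # local files with no counterpart on the remote site ("old" files)
    opendir my $dh, $local_dir or die "cannot read $local_dir: $!";
    my @old = grep { -f "$local_dir/$_" && !$seen{$_} } readdir $dh;
    closedir $dh;

    return (\@to_transfer, \@old);   # the two result lists
}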
Final steps

After processing all the packages, the script executes the jobs defined in the GLOBAL section of the UPDATE_LIST. This is done only if an update was done for at least one of the packages.

Automatic execution of the script

To execute the script on a regular basis one can use the "crontab" program. See the corresponding "man" pages on your system or contact your local system manager for a detailed description. To execute the script every night you could write a crontab file like the following:

# this file is called "update.crontab"
30 5 * * * /data/update/crontab.csh

The above would start the update procedure every night at 5:30 in the morning. To start the automatic update procedure you type:

crontab update.crontab

The file "crontab.csh" could look like:

#! /bin/csh
# change directory to the working directory
cd /data/update
# run the update script
/usr/pub/bin/perl db_update UPDATE_LIST >& UPDATE.trace
# collect the list of transferred files
cat /data/update/trace_db/*.transfered > UPDATE.list
# produce a simple update report with the list of transferred files
date > UPDATE.done
echo "" >> UPDATE.done
echo "The following file(s) / packages are updated:" >> UPDATE.done
echo "" >> UPDATE.done
cat UPDATE.list >> UPDATE.done
# if something got updated, mail the report to some users
if ( ! -z UPDATE.list ) then
    Mail -s db_update account@your_domain < UPDATE.done
endif

Log-files, errors, update-information

The script produces two kinds of output files: global information about the update procedure, and information concerning each of the specified packages.

1) Global information: Trace information is sent by default to the standard output. In addition, three files are created by the script using the names specified in the GLOBAL section of the UPDATE_LIST:
• error file : all detected errors and the (optional) dialog of the ftp-sessions.
• log file : some basic information about the last execution of the script.
• history file : concatenated log files (optional).

2) Package information: For each package a set of files is produced in the working directory ("Work_dir"), where "X" is the name of the package:

"X".remote     : list of files found on the remote site.
"X".transfered : list of files transferred to the local site.
"X".new        : list of files transferred to the local site which did not exist before.
"X".old        : list of files on the local site which are no longer present on the remote site.
Received on Monday, 22 June 2009 08:53:25 UTC