W3C home > Mailing lists > Public > public-widgets-pag@w3.org > April to June 2009

Call for prior art on Patent 5,764,992

From: Schneider Reinhard <reinhard.schneider@embl-heidelberg.de>
Date: Sat, 20 Jun 2009 11:52:23 +0200
Message-Id: <3A02881D-BEB7-44D4-A131-5825E3BB50A7@embl-heidelberg.de>
To: public-widgets-pag@w3.org
please find below a short description of an automated sequence analysis system for biological DNA and protein sequences.
The system had an automatic update procedure (db_update) for the underlying databases. It performed the updates automatically and triggered a range of actions such as reformatting, indexing and updating dependent system tools.

The first publication of the system appeared in 1994 (see reference below).

I also attach a kind of help file for the update script itself (update.rtf). This software later became part of a commercial package, and its further development is still in use.

I hope this is of some help; feel free to request more information.

GeneQuiz: a workbench for sequence analysis.

Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A.,  
Ouzounis, C., Sander, C.

Protein Design Group, European Molecular Biology Laboratory,  
Heidelberg, Germany.

Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB)

Volume 2, 1994, Pages 348-353


We present the prototype of a software system, called GeneQuiz, for  
large-scale biological sequence analysis. The system was designed to  
meet the needs that arise in computational sequence analysis and our  
past experience with the analysis of 171 protein sequences of yeast  
chromosome III. We explain the cognitive challenges associated with  
this particular research activity and present our model of the  
sequence analysis process. The prototype system consists of two parts:  
(i) the database update and search system (driven by perl programs and  
rdb, a simple relational database engine also written in perl) and  
(ii) the visualization and browsing system (developed under C++/ET++).  
The principal design requirement for the first part was the complete  
automation of all repetitive actions: database updates, efficient  
sequence similarity searches and sampling of results in a uniform  
fashion. The user is then presented with "hit-lists" that summarize  
the results from heterogeneous database searches. The expert's primary  
task now simply becomes the further analysis of the candidate entries,  
where the problem is to extract adequate information about functional  
characteristics of the query protein rapidly. This second task is  
tremendously accelerated by a simple combination of the heterogeneous  
output into uniform relational tables and the provision of browsing  
mechanisms that give access to database records, sequence entries and  
alignment views. Indexing of molecular sequence databases provides  
fast retrieval of individual entries with the use of unique  
identifiers as well as browsing through databases using pre-existing  
cross-references. The presentation here covers an overview of the architecture of the system prototype and our experiences on its applicability in sequence analysis.


db_update (GQupdate)
Version 1.1 (December 1994)

This document describes the use of the db_update script. This script, written in Perl, maintains updated versions of files copied via ftp from remote sites. To use the script, Perl has to be installed on your system; if this is not the case, please ask your local system manager to install it. In addition we use the Perl library 'ftplib' developed by David Sundstrom.
The basic input for the script is a file (called, for example, "db_update.list") that is passed as a command line argument to the script.
We first describe the format of this list, then the logic of the script and how to run it.

The update list


The "update-list" file contains a description of the packages you want  
to be updated. A package is a set of files in a given directory on a  
remote site. Each package and its associated parameters are indicated  
by a special label starting with ":" (like ":DATABASE1").
The parameters for each package are listed after the label line in a  
"keyword=value" fashion.

#	database update parameters

# parameters for PACKAGE1
:PACKAGE1
keyword1=no

# parameters for PACKAGE2
:PACKAGE2
keyword2=[value_a,value_b]

	 Lines starting with "#" are treated as comment lines.
	 No spaces around the "=" are allowed (wrong: keyword1 = no ).
	 Don't quote values (wrong: keyword1="wrong" ).
	 In case a parameter has more than one value, the values have to be enclosed in brackets "[...]" with a comma as separator, as in the above example for keyword2 of PACKAGE2.
Global parameters

Some parameters apply for all packages. These global parameters are  
grouped in a special package called "GLOBAL" which is a reserved label  
in the update list. The following global parameters are defined:
	 Work_dir= : the directory where miscellaneous files like log-files, error-files etc. will be stored. Default: current directory.
	 Log_file= : the name of the file which will contain the information of the last execution of the update procedure. Default: "db_update.out".
	 History_file= : the name of the file which will contain the concatenated information of the "Log_files". Default: no history file.
	 Error_file= : the name of the file where the standard error is stored. This file also contains the dialog of the ftp-sessions in case the "ftp_debug" flag is set. Default: standard error.
	 Ftp_timeout= : the number of seconds after which an ftp-session will be terminated when no data transfer is detected between the remote and local machine. Values from 1 to 9 are set to 10. Default: 0 (no timeout).
	 Ftp_anonymous_pwd= : the password to be used for anonymous ftp sessions.
	 Ftp_debug= : flag to trace the dialog of the ftp-sessions. If this flag is set to "1" the dialog is sent to the "Error_file". Default: 0 (no trace).
	 job1 : a process running after updating all packages (see description below)


job1=/home/my_account/bin/my_program options
job2=rm /home/my_account/*.trace
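Taken together, a ":GLOBAL" section using these parameters might look like the following sketch (all paths and values are illustrative, not defaults):

```
:GLOBAL
Work_dir=/data/update
Log_file=db_update.out
History_file=db_update.history
Error_file=db_update.err
Ftp_timeout=120
Ftp_anonymous_pwd=my_name@my_domain
Ftp_debug=0
```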

Package parameters

Each package has several parameters to define the following tasks:
	 how to connect to a remote site
	 definition of files to be transferred
	 definition of local files to be compared with the remote files
	 definition of a renaming procedure for the local files
	 detection of local files which are no longer available on the remote site
	 definition of jobs to be executed after the update procedure.
Connection parameters

The following parameters are defined (some of these parameters are mandatory):
	 host= : Internet name or number of the remote host (Mandatory).
	 user= : user name for the ftp-session. In case the user name is set  
to anonymous the global "Ftp_anonymous_pwd" variable is used as the  
password. (Mandatory).
	 password= : the password for the remote site. If "user" is set to  
"anonymous" the global "Ftp_anonymous_pwd" variable is used. If no  
password is defined and the user name is not "anonymous" the program  
tries to get the password from the ".netrc" file. NOTE: THIS BEHAVIOR  
CAN CAUSE SECURITY PROBLEMS. If the password is set to "?" the script  
will prompt the user for the password.
	 trans_mode= : defines the transfer mode. It can be either "A" (for ASCII mode) or "B" (for binary mode). "B" should be used when transferred files are compressed or executables, and is preferred for all UNIX to UNIX transfers. (Mandatory).

Local environment parameters

	 local_dir= : the name of the directory where the transferred files will be stored.
	 files_local= : a pattern (or list of patterns) which defines the list of local files that will be compared with remote files (it can be different from "files_wanted", see below). The pattern has to be given in the UNIX filename convention. (Mandatory).

Remote environment parameters

	 remote_dir= : the name of the directory where the wanted files are located.
	 content_file= : the name of the remote file which contains the list of remote files (like: "ls-R" or "content.list"). (Optional).
	 files_wanted= : a pattern (or list of patterns) which defines the candidates for the update procedure. The pattern has to be given in the UNIX filename convention. If a "content_file" is defined this file will be used. Only files which match the given pattern(s) will be transferred. (Mandatory).
Optional parameters

File name conversion parameters [optional]

The following parameters are useful in case one wants to rename certain files on the local site. This name conversion is taken into account when the remote and local files are compared.
	 translate_case= : character conversion between upper and lower case characters. "L" to convert from upper to lower case characters. "U" to convert from lower to upper case characters. (Optional).
	 translate_old_pattern= : pattern (or list of patterns) replaced by the "translate_new_pattern"
	 translate_new_pattern= : pattern (or list of patterns) replacing the "translate_old_pattern"

The patterns are defined as Perl regular expressions.
The case conversion takes place before the pattern conversion.
Files with a ".Z" (compressed file) or ".gz" (gnu gzip file) extension  
will be automatically uncompressed after transfer unless the  
"no_uncompress" option is used (see below).


As an example, a parameter set could perform the following conversion: first, all upper case characters in the file name are converted to lower case characters; then the "pre_" string at the beginning of a file name is removed and the "_post" string at the end of a file name is replaced with the string "_data".

	file name on the remote site		file name on the local site
	Pre_name1				name1
	Pre_name2_Post				name2_data
	pre_NAME3_post				name3_data
	Name4					name4
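The conversion in this table can be sketched in Python. This is a model of the rule order (case conversion first, then pattern replacements), not the actual Perl code; the pattern values ^pre_ and _post$ and the replacement "_data" are inferred from the table above:

```python
import re

def translate(name, case=None, old_patterns=(), new_patterns=()):
    """Model of the db_update name conversion: case conversion is
    applied first, then each old pattern is replaced by its new pattern."""
    if case == "L":
        name = name.lower()
    elif case == "U":
        name = name.upper()
    for old, new in zip(old_patterns, new_patterns):
        name = re.sub(old, new, name)
    return name

# Illustrative rules matching the table above
rules = dict(case="L",
             old_patterns=(r"^pre_", r"_post$"),
             new_patterns=("", "_data"))

for remote in ("Pre_name1", "Pre_name2_Post", "pre_NAME3_post", "Name4"):
    print(remote, "->", translate(remote, **rules))
```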

Renaming local files [optional]

In some special cases one wants to rename files on the local site as a  
result of an additional job running after the update procedure (see  
"jobs"). In this case it's necessary to give the procedure a way to  
reconstruct the file name on the remote site (actually the file name  
after applying the "translate_*" rules).
The following parameters are used for this reconstruction procedure.
	 translate_case_opt= : character case conversion (see  
"translate_case" above)
	 translate_old_pattern_opt= : pattern or list of patterns replaced  
by "translate_new_pattern_opt"
	 translate_new_pattern_opt= : pattern or list of patterns replacing  
the "translate_old_pattern_opt"

The patterns are defined as Perl regular expressions.


Let's assume a local procedure that checks the actual data in transferred files detects a problem in the file named "very_new.data" from the remote site and renames it to "very_new.data_uuppps" on the local site. To reconstruct the original name on the remote site we would set the parameters:
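For instance (the parameter values here are illustrative; patterns are Perl regular expressions), stripping the local "_uuppps" suffix recovers the remote name:

```
translate_old_pattern_opt=_uuppps$
translate_new_pattern_opt=
```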

Compressed files [optional]

Compressed files (or gzip-files) will be uncompressed (or gunzipped)  
by default. If the file should be kept as a compressed file the  
parameter "no_uncompress" has to be set to "Y".
	 no_uncompress= : N, uncompress files after transfer (Default); Y,  
to keep the files compressed.

Recursive update [optional]

By default, only files from the "remote_dir" are compared to the files in "local_dir". When the recursive option is used, the update rules also apply to all the sub-directories found in "remote_dir". When a sub-directory is found in the remote dir, it is created locally if it does not exist.
	 recursive= : N, only scan "remote_dir"; Y, also scan all the sub-directories found in "remote_dir".

Keep the file date [optional]

By default, when a file is transferred, the modification date of the file is the transfer date. When "keep_remote_date" is used, the modification date of the remote file is used to set the modification date of the file after the transfer.
	 keep_remote_date= : N, file modification date is the transfer date; Y, remote and local file have the same modification date.

Final jobs [optional]

In case the transferred data needs post-processing (like reformatting one database format into another, or creating additional data files dependent on the transferred data) one can define additional jobs to be executed after transfer with the following syntax:

	 job1=, job2=, ... : list of jobs to be executed after the update of the specified package. The job numbering has to be in consecutive order; the script stops execution of jobs when the order is broken.


The following global variables can be used in the specification of the jobs:
	 $old_files : the name of the file which contains the names of files which were detected to be no longer on the remote site.
	 $new_files : the name of the file which contains the file names of all new files.
	 $transfered_files : the name of the file which contains the file names of all transferred files (either "new" or "updated").
	 $local_dir : name of the local directory.

job1=my_program $transfered_files > /tmp/trace_file
job2=rm $local_dir/*.old

"job1" would execute a program or script with the name "my_program". The name of the file listing the transferred files ("new" or "updated") is passed as the argument "$transfered_files", and the standard output of the program is redirected to the file "/tmp/trace_file".
"job2" would remove all files with the extension ".old" in the local directory of the current package.

Complete example

# entry for a database called sigma
# ftp parameter
# local environment
# remote environment
# file name conversion parameter after transfer
#	- remove ppp prefix
#	- replace eee by bbb
#	- convert file names to lower case
# optional local name conversion
#	- compare a remote "name" with "name_old"
#	  (if "name" doesn't exist on local site)
# compressed files should be kept compressed
# do not apply to sub directories
# keep the modification date of files
# post-processing of the transferred data
#	- run the program "rename_old_sigma_files" using the list $old_files
job1=/my_account/bin/rename_old_sigma_files $old_files
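Filled in along the lines of the comments above, a complete entry might look like this sketch (host name, directories and pattern values are purely illustrative):

```
# entry for a database called sigma
:SIGMA
# ftp parameter
host=ftp.remote-site.org
user=anonymous
trans_mode=B
# local environment
local_dir=/data/sigma
files_local=sigma*
# remote environment
remote_dir=/pub/db/sigma
files_wanted=sigma*
# file name conversion parameter after transfer
# (remove ppp prefix, replace eee by bbb, convert to lower case)
translate_case=L
translate_old_pattern=[^ppp,eee]
translate_new_pattern=[,bbb]
# optional local name conversion
# (compare a remote "name" with "name_old")
translate_old_pattern_opt=_old$
translate_new_pattern_opt=
# compressed files should be kept compressed
no_uncompress=Y
# do not apply to sub directories
recursive=N
# keep the modification date of files
keep_remote_date=Y
# post-processing of the transferred data
job1=/my_account/bin/rename_old_sigma_files $old_files
```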

Execute the script

The syntax for the script is:

db_update UPDATE_LIST [-f=(database|ALL)] [-t=trace]
	 UPDATE_LIST : file which contains the description for the packages  
(see above)
	 "-f" : used to force an update of a specific package or to force  
the update of ALL packages specified in the UPDATE_LIST.
	 "-t" : defines the level of tracing of the information generated by the script during execution. The parameter "trace" is an integer that should be in the range between 0 and 3. The default is "0", which means that only problems and errors are reported. The value "3" is useful during debugging sessions.
Global initialisation and consistency checking

The script first performs a syntax check of the UPDATE_LIST. Any detected error will be reported and the script exits.
If no syntax error is detected in the UPDATE_LIST the script prompts for all the necessary passwords for which the interactive password setting was chosen.
After initialisation of the global parameters from the ":GLOBAL"  
section (name for log-files, error-files, ftp-parameters ...) the  
script loops over all packages listed in the UPDATE_LIST.
Actions taken for each package

Using the specified parameters for each package the following tasks  
are done:
 build a list of local files which should be checked for a newer  
version on the remote site.
 connect to the remote site using ftp and get a list of remote files.
 compare the remote and local files, taking into account that:
- remote files and local files can have different names due to the "translate_*" parameters in the UPDATE_LIST.
- remote files which are compressed are uncompressed on the local site, unless the "no_uncompress" option is set.
- local files can have different names due to some post-processing of the data (see "translate_*_opt" parameter in UPDATE_LIST).
- if different files from the remote site (due to the translation rules) would get the same name on the local site, only the most recent file is chosen.
If a specified translation rule says "rename data_file.dat* to data_file.dat" but the remote site contains the following files:
data_file.dat.V7 01.Jan 1999
data_file.dat.V8 01.Jan 2000,
the script copies only the file "data_file.dat.V8".
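The collision rule amounts to keeping the newest remote file per local name. As a Python sketch (a model of the behaviour described above, not the script's actual implementation):

```python
def pick_most_recent(remote_files, local_name_of):
    """Of all remote files that map to the same local name under the
    translation rules, keep only the most recent one.
    remote_files: dict of remote name -> modification time (comparable)."""
    chosen = {}  # local name -> (remote name, mtime)
    for name, mtime in remote_files.items():
        local = local_name_of(name)
        if local not in chosen or mtime > chosen[local][1]:
            chosen[local] = (name, mtime)
    return {local: name for local, (name, _) in chosen.items()}

# The versions from the example both translate to "data_file.dat";
# only the more recent V8 file would be copied.
remote = {"data_file.dat.V7": 1999, "data_file.dat.V8": 2000}
print(pick_most_recent(remote, lambda n: "data_file.dat"))
# -> {'data_file.dat': 'data_file.dat.V8'}
```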

The result of this comparison is two lists containing:
1.) files that should be transferred because:
a) the local files are older than the ones on the remote site or
b) they don't exist on the local site (new files on the remote site)
2.) files that don't exist on the remote site but do exist on the local site (old files)
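The classification into these two lists can be sketched as follows (again in Python, not the original Perl; file names here are assumed to already be in local form after the translation rules):

```python
def compare(remote, local):
    """Classify files the way db_update's comparison step does.
    remote/local: dict of local-style file name -> modification time."""
    to_transfer = sorted(n for n, t in remote.items()
                         if n not in local or t > local[n])  # new or updated
    old_files = sorted(n for n in local if n not in remote)  # gone on remote
    return to_transfer, old_files

remote = {"a.dat": 2, "b.dat": 1, "c.dat": 1}
local  = {"a.dat": 1, "b.dat": 1, "d.dat": 1}
print(compare(remote, local))  # -> (['a.dat', 'c.dat'], ['d.dat'])
```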

 create the missing local sub-directories if the "recursive" option is set to "Y".
 transfer files using ftp.
 rename files according to the "translate_*" parameters.
 uncompress compressed or gzipped files, unless the "no_uncompress" option is set to "Y" in the UPDATE_LIST.
 change the date of the local file if the "keep_remote_date" option is set to "Y".
 in case of an update, execute the "jobs" specified for the current package.
 note the current date and time of the update in the log-file.
Final steps

After processing all the packages the script executes the jobs defined in the GLOBAL section of the UPDATE_LIST. These jobs are executed only if at least one of the packages was updated.
Automatic execution of the script

To execute the script on a regular basis one can use the "crontab" program. See the corresponding "man" pages on your system or contact your local system manager for a detailed description.
To execute the script every night you could install a crontab entry like the following:
	# this file is called "update.crontab"
	30 5 * * * /data/update/crontab.csh

The above would start the update procedure every night at 5:30 in the morning. To start the automatic update procedure you type:
	crontab update.crontab

the file "crontab.csh" could look like:
	#! /bin/csh
	# change directory to working directory
	cd /data/update
	# run update script
	/usr/pub/bin/perl db_update UPDATE_LIST >& UPDATE.trace
	# produce simple update information with list of transferred files
	date > UPDATE.done
	echo "" >> UPDATE.done
	echo "The following file(s) / packages are updated:" >> UPDATE.done
	echo "" >> UPDATE.done
	cat /data/update/trace_db/*.transfered >> UPDATE.done
	# if something got updated mail it to some users
	if ( -e UPDATE.done ) then
		Mail -s db_update account@your_domain < UPDATE.done
	endif

Log-files, errors, update-information

The script produces two kinds of output files: global information about the update procedure and information concerning each of the specified packages.
1) Global information:

Trace information is sent by default to the standard output. In addition three files are created by the script using the names specified in the GLOBAL section of the UPDATE_LIST:
	 error file : all detected errors and the (optional) dialog of the ftp-sessions.
	 log file : some basic information about the last execution of the update procedure.
	 history file : concatenated log files (optional).
2) Package information:

For each package a set of files is produced in the working directory ("$Work_dir"), where "X" is the name of the package:

"X".remote : list of files found at the remote site.
"X".transfered : list of files transferred to the local site.
"X".new : list of files transferred to the local site which did not exist before.
"X".old : list of files on the local site which are not present on the remote site.

Received on Monday, 22 June 2009 08:53:25 UTC
