Computing the Smith-Waterman Algorithm on the Illinois Bio-Grid



Dave S. Angulo¹, Nigel M. Parsad², Tom Goodale³, Gabrielle Allen³, Ed Seidel³

¹ The School of Computer Science, Telecommunications and Information Systems, DePaul University

² Kurt Rossman Labs, The University of Chicago

³ Albert Einstein Institute, Golm (AEI/MPG)

Motivation: To exploit the prodigious computational resources of the Illinois Bio-Grid (IBG) by simultaneously querying multiple protein sequences against multiple protein sequence databases for homology. The Smith-Waterman algorithm will be utilized as it guarantees the optimal local pairwise alignment between homologous sequences. The efficiencies gained by the parallel distribution of both the database query and the dynamic programming load should be substantially greater than the single sequence/single protein database search that is the current computational biology standard.

Task Farming Basics on the Grid

Smith-Waterman Task Farming (SWTask) on the IBG

Machines Involved:

A. An N-processor Grid that dynamically allocates resources for client processes.

B. One processor designated as the Master Task Farm Manager, TFM(0).

C. M processors designated as the Worker Task Farm Managers, TFM(1).

Data Involved:

A. P source data files (an estimated 140) from the sequence database.

B. Each data file has perhaps 100,000 sequence strings, with potentially 4,000 characters per string.

C. P can be broken into subsets P′, P″, etc.

Tasks Involved:

A. Download P source data files. The total number of characters to compare is approximately 56,000,000,000 (140 source files x 100,000 sequence strings x 4,000 characters per string).

B. Complete a W x W character expression. For two source files P1 and P2, consider P1 x P2.

C. Since P1 x P2 == P2 x P1, only the upper triangle of the comparison matrix will be computed.
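The arithmetic behind these tasks can be checked with a short sketch. The file, sequence, and character counts are the estimates given above; the function name is hypothetical, and including each file's comparison against itself (the diagonal) is an assumption, since the poster does not say whether self-comparisons are performed.

```python
from itertools import combinations_with_replacement

P = 140                  # estimated number of source files (from the text)
SEQS_PER_FILE = 100_000  # estimated sequence strings per file
CHARS_PER_SEQ = 4_000    # upper estimate of characters per string


def upper_triangle_pairs(num_files):
    """Return (i, j) file-index pairs with i <= j.

    Assumption: the diagonal (a file compared with itself) is included.
    """
    return list(combinations_with_replacement(range(num_files), 2))


total_chars = P * SEQS_PER_FILE * CHARS_PER_SEQ
pairs = upper_triangle_pairs(P)

print(total_chars)  # 56000000000
print(len(pairs))   # 9870 file-pair tasks instead of 140 * 140 = 19600
```

Exploiting the symmetry roughly halves the number of file-against-file tasks that have to be farmed out.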

Task Management Scenario:

A. The TFM(0) gives P TFM(1) processors individual directives to download, process, and “own” one source data file:

i). Each TFM(1) processor downloads a source data file, strips off non-essential annotations, and stores the annotations on local disk.

ii). Each TFM(1) processor saves the resulting stripped source file in memory for sequence alignment analysis using Smith-Waterman.

iii). Each TFM(1) processor remains prepared to send and receive stripped source files to and from other TFM(1) peers on the Grid.

iv). The TFM(0) keeps track of which TFM(1) processors own what source file.

B. The TFM(0) gives T TFM(1) processors directives to obtain a second source data file (all or partial) from a TFM(1) peer:

i). Each TFM(1) processor asks a TFM(1) peer for the second stripped source file.

ii). Each TFM(1) processor then does a pairwise sequence comparison of the two files in memory.

iii). Each TFM(1) processor then requests more work from the TFM(0). The TFM(0) may then direct the TFM(1) to ask a peer for a third file in a second thread.

C. The TFM(0) tracks and dynamically manages:

i). TFM(1) progress.

ii). TFM(1) task distribution based upon workload sharing and processor speed (via completion requests).
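The scenario above can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: the class `MasterTFM` and its methods are hypothetical names, each worker owns exactly one file, self-comparisons are left out, and no real Grid or Cactus calls are made.

```python
from collections import deque
from itertools import combinations


class MasterTFM:
    """Toy stand-in for TFM(0): tracks file ownership and hands out pair tasks."""

    def __init__(self, num_files):
        self.owner = {}  # file index -> id of the TFM(1) that owns it
        self.pending = deque(combinations(range(num_files), 2))
        self.completed = []

    def assign_ownership(self, worker_id, file_index):
        """Step A: record which TFM(1) downloaded and owns a file."""
        self.owner[file_index] = worker_id

    def request_work(self, worker_id):
        """Steps B and C: return (own_file, peer_file, peer_worker) or None."""
        for _ in range(len(self.pending)):
            a, b = self.pending.popleft()
            if self.owner[a] == worker_id:
                return a, b, self.owner[b]
            if self.owner[b] == worker_id:
                return b, a, self.owner[a]
            self.pending.append((a, b))  # not this worker's file; keep looking
        return None

    def report_done(self, worker_id, own_file, peer_file):
        """Record a completion so that progress can be tracked."""
        self.completed.append((worker_id, own_file, peer_file))


master = MasterTFM(num_files=4)
for w in range(4):
    master.assign_ownership(w, w)  # assumption: worker w owns file w

active = True
while active:
    active = False
    for w in range(4):
        task = master.request_work(w)
        if task is not None:
            own_file, peer_file, _peer_worker = task
            # A real TFM(1) would fetch peer_file from _peer_worker and
            # run the pairwise comparison here.
            master.report_done(w, own_file, peer_file)
            active = True

print(len(master.completed))  # 6
```

Because workers pull tasks only when they finish the previous one, a faster worker naturally asks for, and receives, more work, which is one plausible reading of distribution "via completion requests".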

[Figure: Task Farming Basics on the Grid. A TFM(0), implemented in Cactus, is connected to several TFM(1)s. TM modules are used for starting the remote TFM(1)s. The design targets the Grid, and tasks can be anything; in this case, the computation of a bioinformatics application.]

Task Manager Hierarchy:

• In the traditional Master/Slave task manager architecture, there are problems with slave startup and with communication between master and slave. Specific issues include authentication/authorization to start remote jobs, queues on remote resources, and firewalls between resources.

• A three-level hierarchy provides solutions to these issues:

Level 1: The Task Farm Manager (0), a.k.a. TFM(0), farms out tasks to remote resources on the Grid. It corresponds to the Master in the traditional Master/Slave architecture.

Level 2: A Task Farm Manager (1), a.k.a. TFM(1), is started on a queue for each remote resource assigned a task.

Level 3: The specific computational task. This level corresponds to the Slave in the traditional Master/Slave architecture.

Task Manager Module Structure:

• The Task Farm Manager (TFM) utilizes the ASCA generic task farm module as well as the Task Farm Logic Manager module (TFLM). For TFM(0), ASCA(0) requests information from TFLM(0) regarding the minimum number of tasks that can be run (MinTasks), how many tasks are desired (DesiredTasks), and how many processors and how much memory are required per task (TaskRequirements).

• When a TFM(1) requests a task, the TFM(0) calls GetMoreTasks, which manages a list of task IDs for uncompleted tasks. Then, for each task, TFM(1) calls GetInputFile, which provides the required parameters for the specific source files to be processed.

• The SWLM module is the logic manager specific to Smith-Waterman applications. SWLM provides information on which tasks to start and which parameters to use for each input file.

• The SWTask module (not shown) will communicate with the SWLM to get and process files on the task end.
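The split between the generic task farm and the application-specific logic manager can be sketched as an abstract interface. The operation names mirror MinTasks, DesiredTasks, TaskRequirements, GetMoreTasks, and GetInputFile from the text, but every signature, return type, and class name below is an assumption made for illustration; the actual modules are Cactus modules and are not reproduced here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class TaskRequirements:
    processors: int  # processors needed per task
    memory_mb: int   # memory needed per task (unit is an assumption)


class TaskFarmLogicManager(ABC):
    """Hypothetical generic interface that a task farm module would query."""

    @abstractmethod
    def min_tasks(self) -> int: ...

    @abstractmethod
    def desired_tasks(self) -> int: ...

    @abstractmethod
    def task_requirements(self) -> TaskRequirements: ...

    @abstractmethod
    def get_more_tasks(self, count: int) -> List[int]: ...

    @abstractmethod
    def get_input_file(self, task_id: int) -> Dict[str, str]: ...


class SmithWatermanLogicManager(TaskFarmLogicManager):
    """Hypothetical application-specific part: one task per pair of files."""

    def __init__(self, file_pairs: Sequence[Tuple[str, str]]):
        self._pairs = dict(enumerate(file_pairs))
        self._uncompleted = list(self._pairs)  # task IDs not yet handed out

    def min_tasks(self) -> int:
        return 1

    def desired_tasks(self) -> int:
        return len(self._uncompleted)

    def task_requirements(self) -> TaskRequirements:
        return TaskRequirements(processors=1, memory_mb=1024)  # placeholder values

    def get_more_tasks(self, count: int) -> List[int]:
        handed_out = self._uncompleted[:count]
        self._uncompleted = self._uncompleted[count:]
        return handed_out

    def get_input_file(self, task_id: int) -> Dict[str, str]:
        first, second = self._pairs[task_id]
        return {"first_file": first, "second_file": second}


swlm = SmithWatermanLogicManager([("a.fasta", "b.fasta"), ("a.fasta", "c.fasta")])
task_ids = swlm.get_more_tasks(1)
print(task_ids, swlm.get_input_file(task_ids[0]))
```

Keeping the task-selection logic behind an interface of this kind is what would let the same task farm serve FASTA or BLAST tasks with a different logic manager.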

[Figure: Task Manager module structure, divided into a generic part and an application-specific part.]

Strategy: To develop and implement a Smith-Waterman software toolkit (SWTask) to run in the distributed environment of the IBG. This toolkit will be part of a larger IBG Bioinformatic Workbench whose modules will also allow for the Grid-enabled computation of the FASTA and BLAST algorithms. The SWTask will include task farming, data acquisition, and Smith-Waterman software modules.
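For reference, the core of a Smith-Waterman module is a dynamic programming recurrence over two sequences. The sketch below computes only the optimal local alignment score and is not taken from the SWTask toolkit: the function name, the simple match/mismatch scores, and the linear gap penalty are assumptions, whereas protein searches normally use a substitution matrix and often affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Assumptions: fixed match/mismatch scores and a linear gap penalty.
    Only two rows of the scoring matrix are kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))  # 8
```

Each call is independent of every other sequence pair, which is why the workload divides cleanly into tasks for the farm.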

[Figure: An N-processor Grid and M protein sequence databases, giving N x M pairwise protein alignments using Smith-Waterman.]