Data Parallelism in Bioinformatics Workflows Using...

Data Parallelism in Bioinformatics Workflows Using Hydra

Fábio Coutinho1, Eduardo Ogasawara1, Daniel de Oliveira1 ,Vanessa Braganholo1 , Alexandre A. B. Lima1 ,

Alberto M. R. Dávila2 and Marta Mattoso1

1Federal University of Rio de Janeiro, Brazil2FIOCRUZ, Brazil

Agenda

• Scientific experiments

• Blast workflow

• Data parallelization

• Hydra middleware

• Case study

• Measurements

• Provenance support

• Conclusion

2

Scientific Experiment Scenario

4. ...which need

to be processed

by a cluster or a

supercomputer…

1. Data is generated

and collected

2. Data is first

analyzed by some

programs

3. Large Volume of

Data Produced ...

5. Final Results

are analyzed

3

Blast sequence analysis match

4. ...which need

to be processed

In a HPC

environment…

1. Genomics Data

is collected.

2. Fasta file is

generated.

5. Final Results

are analyzed

SequenceDatabase

A C B C D B

C A D B D

A C - - B C D B

- C A D B - D -

>ACBCD

Fasta file

3. Sequence alignment …Ex. Genbank,

DDBJ, EMBL, …

4

Current Solutions

• mpiBlast, G-Blast and CloudBlast represent solutions for executing Blast in parallel on cluster, grid and cloudenvironments, respectively.

• BlastReduce: Using map-reduce approach to obtain data parallelization

• These solutions are disconnected from the concept of scientific experiment and they are not concerned about capturing and analyzing provenance.

• Scientific Workflow Management Systems (SWfMS) is analternative to represent a scientific experiment (TavernaWorkflows, for example)

• This solution brings some difficulties in obtaining parallelization

5

Experiments modeled as scientific workflows

• SWfMS is an alternative to represent a scientific experiment

• Obtaining parallelization and provenance gathering in distributed environments brings some challenges

6

Data Parallelization Difficulties

• Data fragmentation

o Dependent on the data format

• Activity distribution in HPC from SWfMS

• Data aggregation

o Dependent on the data format

7

FASTA Fragmentation

8

Types of FASTA data parallelization

9

Sequence Fragmentation Database Fragmentation

No data parallelization No No

Data parallelization

Yes No

No Yes*

Yes Yes*

> P. 0200AGTAGTT…

> P. 0400AGTAGTT…

> P. 0300AGTAGTT…

> P. 0100AGTAGTT…

Data Parallelism

10

UniprotKB

DataFragmenter

I = {i1}Input Data

Acitivity node 1

Acitivity node k

…

Di1-f1 Di1-fk

FO1 FOK

Aggregator

O = {o1, … , on)Output Data

…

…

Acitivity node 2

Di1-f2

FO2

Acitivity node (k-1)

Di1-f(k-1)

FO(k-1)

Parameters Parameters

> Plasmodium 0100AGTAGTTATTTAGGGCTATATAAAAGTTTTGAAAA…

Hydra Middleware

• Middleware solution that bridges the SWfMS to HPCEnvironment supporting data parallelism

• Goal: reduce the complexity involved in designing and managing bioinformatics programs while collecting provenance data

SWfMS MTC Environment

Hydra Middleware

11

Hydra Architecture

12

Hydra MTC LayerClient Layer

Uploader

Dispatcher

Gatherer

Downloader

Hydra Setup

SWfMS

Configuration

MUX

Parameter Sweeper

Data Fragmenter

Template

Fragment

Dispatcher/ Monitor

Data Analyzer

Provenance

Scientist

SWfMS

ExternalSchedulers

File System

File SystemFile

System

File System

12

3.a 3.b

4

5

6 7

8

9

1011

12

Compression Techniques in Hydra

Uploader

I’

I

I’

I’

ScientistDistributed Execution

I

O

O’

Uploader

O’

O

O’

Client Layer H

ydra

MTC L

ayer

Pack

Pack

Unpack

Unpack

I : Input DataI’: Compressed Input DataO : Output DataO’: Compressed Output Data

13

Steps to use Hydra

• Adapt the workflow for distribution using Hydra

• Setup the data parallelization

o Configure the workspace template and invoked program

• Setup a data fragmentation cartridge in Hydra

o Develop a program that fragments a bioinformatics data file

• Setup a data aggregation cartridge in Hydra

o A program that aggregate or merge data produced by individual workspaces

14

Blast workflow in VisTrails using Hydra

15

Setup the data parallelization

Case study

• BLAST tool

o Identification of similar sequences between Plasmodium falciparum e o UniProtKB/SWISS-Prot database (june/2008)

o Data fragmentation of input query

17

Experiment

• Workflow developed in VisTrails

• Hydra was setup with:

o Torque cartridge for job submission

o Data fragmentation cartridge using FASTASplitter

o Workspace configuration for blast invocation

o Data aggregation cartridge using BlastMerger

18

Experiment results

• Execution mode in 64 node Altix ICE 8200 with Xeon 8 cores per node installed at Nacad High Performance Computing Center at Federal University of Rio de Janeiro

• Speed-up measured

19

Workflow execution time

20

Activity execution time distribution

21

Provenance Support

o Where are the merged results for some particular experiment?

o What was the alignment method that lead to the best result?

o What were the parameters that lead the best result?

22

Provenance summary

23

Conclusions

• Hydra can be a bridge the SWfMS and the HPC environment

o Data parallelism using workflows

o Evaluated in a real case bioinformatics experiment using BLAST with little overhead

o Supports distributed provenance gathering

o Good speed-up and execution time

24

Future Work

• Balance workload between nodes

• New cartridges for data fragmentation and aggregation

• Automatic setup for fragment size and used nodes

25

Data Parallelism in Bioinformatics Workflows Using Hydra

Fábio Coutinho, Eduardo Ogasawara, Daniel de Oliveira,Vanessa Braganholo, Alexandre A. B. Lima,

Alberto M. R. Dávila and Marta Mattoso

Thank you!

Data Parallelism in Bioinformatics Workflows Using...

Documents

Transcript of Data Parallelism in Bioinformatics Workflows Using...