Data Parallelism in Bioinformatics Workflows Using...
Transcript of Data Parallelism in Bioinformatics Workflows Using...
Data Parallelism in Bioinformatics Workflows Using Hydra
Fábio Coutinho1, Eduardo Ogasawara1, Daniel de Oliveira1 ,Vanessa Braganholo1 , Alexandre A. B. Lima1 ,
Alberto M. R. Dávila2 and Marta Mattoso1
1Federal University of Rio de Janeiro, Brazil2FIOCRUZ, Brazil
Agenda
• Scientific experiments
• Blast workflow
• Data parallelization
• Hydra middleware
• Case study
• Measurements
• Provenance support
• Conclusion
2
Scientific Experiment Scenario
4. ...which need
to be processed
by a cluster or a
supercomputer…
1. Data is generated
and collected
2. Data is first
analyzed by some
programs
3. Large Volume of
Data Produced ...
5. Final Results
are analyzed
3
Blast sequence analysis match
4. ...which need
to be processed
In a HPC
environment…
1. Genomics Data
is collected.
2. Fasta file is
generated.
5. Final Results
are analyzed
SequenceDatabase
A C B C D B
C A D B D
A C - - B C D B
- C A D B - D -
>ACBCD
Fasta file
3. Sequence alignment …Ex. Genbank,
DDBJ, EMBL, …
4
Current Solutions
• mpiBlast, G-Blast and CloudBlast represent solutions for executing Blast in parallel on cluster, grid and cloudenvironments, respectively.
• BlastReduce: Using map-reduce approach to obtain data parallelization
• These solutions are disconnected from the concept of scientific experiment and they are not concerned about capturing and analyzing provenance.
• Scientific Workflow Management Systems (SWfMS) is analternative to represent a scientific experiment (TavernaWorkflows, for example)
• This solution brings some difficulties in obtaining parallelization
5
Experiments modeled as scientific workflows
• SWfMS is an alternative to represent a scientific experiment
• Obtaining parallelization and provenance gathering in distributed environments brings some challenges
6
Data Parallelization Difficulties
• Data fragmentation
o Dependent on the data format
• Activity distribution in HPC from SWfMS
• Data aggregation
o Dependent on the data format
7
FASTA Fragmentation
8
Types of FASTA data parallelization
9
Sequence Fragmentation Database Fragmentation
No data parallelization No No
Data parallelization
Yes No
No Yes*
Yes Yes*
> P. 0200AGTAGTT…
> P. 0400AGTAGTT…
> P. 0300AGTAGTT…
> P. 0100AGTAGTT…
Data Parallelism
10
UniprotKB
DataFragmenter
I = {i1}Input Data
Acitivity node 1
Acitivity node k
…
Di1-f1 Di1-fk
FO1 FOK
Aggregator
O = {o1, … , on)Output Data
…
…
Acitivity node 2
Di1-f2
FO2
Acitivity node (k-1)
Di1-f(k-1)
FO(k-1)
Parameters Parameters
> Plasmodium 0100AGTAGTTATTTAGGGCTATATAAAAGTTTTGAAAA…
Hydra Middleware
• Middleware solution that bridges the SWfMS to HPCEnvironment supporting data parallelism
• Goal: reduce the complexity involved in designing and managing bioinformatics programs while collecting provenance data
SWfMS MTC Environment
Hydra Middleware
11
Hydra Architecture
12
Hydra MTC LayerClient Layer
Uploader
Dispatcher
Gatherer
Downloader
Hydra Setup
SWfMS
Configuration
MUX
Parameter Sweeper
Data Fragmenter
Template
Fragment
Dispatcher/ Monitor
Data Analyzer
Provenance
Scientist
SWfMS
ExternalSchedulers
File System
File SystemFile
System
File System
12
3.a 3.b
4
5
6 7
8
9
1011
12
Compression Techniques in Hydra
Uploader
I’
I
I’
I’
ScientistDistributed Execution
I
O
O’
Uploader
O’
O
O’
Client Layer H
ydra
MTC L
ayer
Pack
Pack
Unpack
Unpack
I : Input DataI’: Compressed Input DataO : Output DataO’: Compressed Output Data
13
Steps to use Hydra
• Adapt the workflow for distribution using Hydra
• Setup the data parallelization
o Configure the workspace template and invoked program
• Setup a data fragmentation cartridge in Hydra
o Develop a program that fragments a bioinformatics data file
• Setup a data aggregation cartridge in Hydra
o A program that aggregate or merge data produced by individual workspaces
14
Blast workflow in VisTrails using Hydra
15
Setup the data parallelization
Case study
• BLAST tool
o Identification of similar sequences between Plasmodium falciparum e o UniProtKB/SWISS-Prot database (june/2008)
o Data fragmentation of input query
17
Experiment
• Workflow developed in VisTrails
• Hydra was setup with:
o Torque cartridge for job submission
o Data fragmentation cartridge using FASTASplitter
o Workspace configuration for blast invocation
o Data aggregation cartridge using BlastMerger
18
Experiment results
• Execution mode in 64 node Altix ICE 8200 with Xeon 8 cores per node installed at Nacad High Performance Computing Center at Federal University of Rio de Janeiro
• Speed-up measured
19
Workflow execution time
20
Activity execution time distribution
21
Provenance Support
o Where are the merged results for some particular experiment?
o What was the alignment method that lead to the best result?
o What were the parameters that lead the best result?
22
Provenance summary
23
Conclusions
• Hydra can be a bridge the SWfMS and the HPC environment
o Data parallelism using workflows
o Evaluated in a real case bioinformatics experiment using BLAST with little overhead
o Supports distributed provenance gathering
o Good speed-up and execution time
24
Future Work
• Balance workload between nodes
• New cartridges for data fragmentation and aggregation
• Automatic setup for fragment size and used nodes
25
Data Parallelism in Bioinformatics Workflows Using Hydra
Fábio Coutinho, Eduardo Ogasawara, Daniel de Oliveira,Vanessa Braganholo, Alexandre A. B. Lima,
Alberto M. R. Dávila and Marta Mattoso
Thank you!