Parallel DNA Sequence Alignment
SEQUENCE ALIGNMENT SPEED-UP
A PARALLEL APPROACH
University of Salerno
Parallel and Concurrent Computing Course
19 February 2013
Giuliana Carullo, Luca Pepe, Daniele Valenza
• Introduction
• Problem definition
• Simple Search
• Approximate Search
• Parallelization
• Cross-Chunk Matching
• Bigger chunk
• On Demand
• Side to Side Sliding Query
• Approximate Search
• Test plan
Sequence alignment is a process for comparing two or more DNA or RNA sequences.
It is performed to find similar or identical regions in the given sequences, or to check whether a sequence is already known, that is, stored in a database.
DNA STRUCTURE
DNA bases: A C G T
Bonds (base pairs): (A, T), (C, G)
DNA ALIGNMENT
Affinity measures:
• MATCH
• MISMATCH
• GAP
Matching type:
• SIMPLE
• REVERSE AND COMPLEMENT
Q: ATGATTACC (DNA string)
R(Q): CCATTAGTA (reverse)
C(R(Q)): GGTAATCAT (complement)
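The reverse-and-complement example above can be reproduced with a few lines of code. This is a minimal sketch in Python, chosen for brevity (the presented implementation is in C); the names `COMPLEMENT` and `reverse_complement` are hypothetical and not taken from the original work.

```python
# Base pairing from the DNA STRUCTURE slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return C(R(seq)): reverse the string, then complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverse_complement("ATGATTACC"))  # GGTAATCAT, as in the example above
```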
• Global Alignment
• Local Alignment
DNA ALIGNMENT TYPES
Find all exact matches (perfect matchings) of a short query string in a larger DNA string.
INPUT: DNA string, query string
OUTPUT: number of occurrences, starting positions of the occurrences
SIMPLE SEARCH
Variables      Notation
# Workers      $n$
Query length   $l_q$
DNA length     $l_d$
Relative pos.  $off_i$
Absolute pos.  $s_i$
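As a sequential baseline for the simple search problem, the sketch below scans every starting position and stops comparing at the first mismatch. It is an illustrative Python version, not the authors' code; `simple_search` is a hypothetical name.

```python
def simple_search(dna: str, query: str) -> list[int]:
    """Return the starting positions of every exact occurrence of query in dna."""
    lq, ld = len(query), len(dna)
    positions = []
    for start in range(ld - lq + 1):
        # all() over a generator stops at the first mismatching character.
        if all(dna[start + k] == query[k] for k in range(lq)):
            positions.append(start)
    return positions


hits = simple_search("ATATAT", "ATA")
print(len(hits), hits)  # 2 [0, 2]
```

Overlapping occurrences are reported separately, which is why both positions 0 and 2 appear.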
Find the «best» n alignments of a short query string in a larger DNA string.
INPUT: DNA String, Query String
OUTPUT: Best alignments starting positions
APPROXIMATE SEARCH
APPROXIMATE SEARCH – SIMILARITY EVALUATION
Character similarity function
$$s_i = \begin{cases} x & \text{Match} \\ y & \text{Mismatch} \\ z & \text{Gap} \end{cases} \qquad x > 0;\; y, z \le 0$$
(In this work gaps are not considered)
Objective function to maximize:
$$S = \sum_{i=1}^{l_q} s_i$$
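The objective function can be sketched as follows for the ungapped case considered here. The score values `match=1` and `mismatch=-1` are assumptions chosen only to satisfy x > 0 and y ≤ 0; the slides do not state the values actually used. Function names are hypothetical.

```python
def similarity(window: str, query: str, match: int = 1, mismatch: int = -1) -> int:
    """Sum the per-character scores s_i of an ungapped alignment."""
    if len(window) != len(query):
        raise ValueError("an ungapped alignment needs strings of equal length")
    return sum(match if a == b else mismatch for a, b in zip(window, query))


def best_alignments(dna: str, query: str, n_best: int) -> list[tuple[int, int]]:
    """Return the n_best (score, position) pairs, highest score first."""
    lq = len(query)
    scored = [(similarity(dna[p:p + lq], query), p)
              for p in range(len(dna) - lq + 1)]
    # Descending score; ties are broken by ascending position.
    return sorted(scored, key=lambda sp: (-sp[0], sp[1]))[:n_best]


print(best_alignments("ACGTACGA", "ACGA", 2))  # [(4, 4), (2, 0)]
```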
The approach common to all solutions is based on the MapReduce model:
• The master node splits the DNA string into chunks and scatters them to the worker nodes.
• The workers perform the computation and send their results back to the master.
• The master combines the partial solutions and returns the output.
GENERAL IDEA
Attention must be paid to occurrences that cross a chunk boundary (cross-chunk matchings).
GENERIC SPLIT AND COMPUTATION
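[Figure: the DNA string is split into chunks i − 1, i and i + 1, each of size $l_d/n$. The query string slides along the DNA string over steps $T_0$ to $T_7$; alignments that fall entirely inside a chunk are complete matchings, while those that straddle a chunk boundary are partial matchings.]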
GENERIC REDUCE PHASE
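WORKERS OUTPUT (worker ID, offset): $(i, off_i)$, $(j, off_j)$
FINAL OUTPUT (positions):
$$s_i = i \cdot (l_d / n) + off_i \qquad s_j = j \cdot (l_d / n) + off_j$$
[Figure: two occurrences of the query string, of size $l_q$, located in the DNA string at absolute positions $s_i$ and $s_j$.]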
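The generic split, computation and reduce phases can be simulated sequentially as below. This is a sketch under simplifying assumptions: workers are plain function calls rather than MPI processes, worker IDs start at 0, and $n$ divides $l_d$. All function names are hypothetical.

```python
def split(dna: str, n: int) -> list[str]:
    """Master: cut dna into n chunks of size l_d / n (assumes n divides l_d)."""
    size = len(dna) // n
    return [dna[i * size:(i + 1) * size] for i in range(n)]


def worker(chunk: str, query: str) -> list[int]:
    """Worker: offsets, relative to the chunk, of its complete matchings."""
    lq = len(query)
    return [off for off in range(len(chunk) - lq + 1)
            if chunk[off:off + lq] == query]


def reduce_positions(offsets_per_worker: list[list[int]], chunk_size: int) -> list[int]:
    """Master: translate (worker i, off_i) into s_i = i * (l_d / n) + off_i."""
    return [i * chunk_size + off
            for i, offsets in enumerate(offsets_per_worker)
            for off in offsets]


dna, n = "ACGTACGTACGT", 3
chunks = split(dna, n)
print(reduce_positions([worker(c, "ACGT") for c in chunks], len(dna) // n))  # [0, 4, 8]
print(reduce_positions([worker(c, "GTAC") for c in chunks], len(dna) // n))  # []
```

The second query occurs at positions 2 and 6 of the full string, but both occurrences straddle a chunk boundary, so the generic scheme alone misses them. This is the cross-chunk problem that the three approaches address.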
Bigger chunk:
The master sends every worker a chunk of size $s = l_d/n + l_q - 1$, so that cross-chunk matchings can be found.
On Demand:
The master sends chunks of size $s = l_d/n$. If a worker finds a partial matching at the end of its chunk, it asks for the remaining part, $r \le l_q - k$ characters, so that cross-chunk matchings can be found.
Two possible heuristics: big request and small request.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $s = l_d/n$ and computes its complete matchings and all its partial matchings. The master then combines the partial matchings to find the cross-chunk matchings.
SOLUTION APPROACHES
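BIGGER CHUNK APPROACH
[Figure: chunks i − 1, i and i + 1, each of size $l_d/n$ plus $l_q - 1$ extra characters; the extra characters are the same characters that open the next chunk. The query string slides along the DNA string over steps $T_0$ to $T_7$; only complete matchings appear.]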
ADVANTAGES:
• it does not require communication between workers;
• it does not produce duplicated occurrences;
• the master has very little sequential work to perform.
DISADVANTAGES:
• each worker (except the last one) receives $l_q - 1$ extra characters. Thus, an extra bandwidth usage $b_e$ is produced:
$$b_e = (l_q - 1) \cdot (n - 1)$$
BIGGER CHUNK APPROACH
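A sequential sketch of the Bigger Chunk idea follows: each chunk also carries the first $l_q - 1$ characters of its right neighbour, so every occurrence lies entirely inside exactly one chunk. It assumes that $n$ divides $l_d$ and that worker IDs start at 0; the function names are hypothetical.

```python
def split_bigger(dna: str, n: int, lq: int) -> list[str]:
    """Chunks of size l_d / n + l_q - 1; the last chunk has no extra characters."""
    size = len(dna) // n  # assumes n divides l_d
    return [dna[i * size:(i + 1) * size + lq - 1] for i in range(n)]


def search_bigger_chunk(dna: str, query: str, n: int) -> list[int]:
    """Simulate all workers in turn and translate offsets to absolute positions."""
    size, lq = len(dna) // n, len(query)
    positions = []
    for i, chunk in enumerate(split_bigger(dna, n, lq)):
        for off in range(len(chunk) - lq + 1):
            if chunk[off:off + lq] == query:
                positions.append(i * size + off)  # s_i = i * (l_d / n) + off_i
    return positions


def extra_bandwidth(lq: int, n: int) -> int:
    """b_e = (l_q - 1) * (n - 1) extra characters sent overall."""
    return (lq - 1) * (n - 1)


print(search_bigger_chunk("ACGTACGTACGT", "GTAC", 3))  # [2, 6]
print(extra_bandwidth(4, 3))                            # 6
```

Because a full chunk has $l_d/n + l_q - 1$ characters, its valid starting offsets are exactly $0 \dots l_d/n - 1$, which is why no occurrence is reported twice.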
Bigger Chunk:
analogous to the generic approach
On Demand:
analogous to the generic approach
Side to Side Sliding Query (3SQ):
additional work is performed by the master node to combine the partial matchings.
REDUCE PHASE
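ON DEMAND – BIG REQUEST APPROACH
[Figure: chunks i and i + 1, each of size $l_d/n$. The query string slides along the DNA string over steps $T_0$, …, $T_j$ to $T_{j+3}$; check marks and a cross mark the compared characters, distinguishing complete matchings from partial matchings at the end of chunk i.]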
ADVANTAGES:
• extra data is requested only when needed
• it does not produce duplicated occurrences
• a single request is performed for each worker
DISADVANTAGES:
• extra overhead for the big request
• potentially useless extra characters are transferred
ON DEMAND – BIG REQUEST APPROACH
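ON DEMAND – SMALL REQUEST APPROACH
[Figure: chunks i and i + 1, each of size $l_d/n$. The query string slides along the DNA string over steps $T_0$, …, $T_j$ to $T_{j+3}$; check marks and a cross mark the compared characters, distinguishing complete matchings from partial matchings at the end of chunk i.]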
ADVANTAGES:
• extra data are requested only when needed
• it does not produce duplicated occurrences
• better bandwidth usage than big request
DISADVANTAGES:
• Number of requests grows proportionally to the length of the query
ON DEMAND – SMALL REQUEST APPROACH
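The sketch below simulates one On Demand worker. It rests on assumptions that the slides do not spell out: $k$ is taken to be the number of query characters already matched at the end of the chunk, a "big request" asks for all $l_q - k$ missing characters at once, and a "small request" asks for `step` characters at a time and stops at the first mismatch. The `request` callback is hypothetical and stands for either the master (centralized) or the right neighbour (distributed).

```python
from typing import Callable, Optional


def on_demand_worker(chunk: str, query: str, chunk_start: int,
                     request: Callable[[int, int], str],
                     step: Optional[int] = None) -> list[int]:
    """Return absolute positions of the matchings that start inside this chunk.

    request(pos, r) returns up to r DNA characters starting at absolute
    position pos. step=None models the big request, step=1 the small request.
    """
    lq, lc = len(query), len(chunk)
    positions = [chunk_start + off for off in range(lc - lq + 1)
                 if chunk[off:off + lq] == query]
    for off in range(max(lc - lq + 1, 0), lc):
        k = lc - off                      # characters available inside the chunk
        if chunk[off:] != query[:k]:
            continue                      # no partial matching at this offset
        matched = k
        while matched < lq:
            r = (lq - matched) if step is None else min(step, lq - matched)
            extra = request(chunk_start + off + matched, r)
            if extra != query[matched:matched + r]:
                break                     # mismatch or end of the DNA
            matched += r
        else:
            positions.append(chunk_start + off)
    return positions


dna = "ACGTACGTACGT"
calls = []

def request(pos: int, r: int) -> str:
    calls.append((pos, r))
    return dna[pos:pos + r]

print(on_demand_worker(dna[0:4], "GTAC", 0, request), calls)  # [2] [(4, 2)]
```

With `step=1` the same occurrence is found with two requests of one character each, which illustrates the trade-off between the two heuristics.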
Two kinds of communication can be adopted:
ON DEMAND – CENTRALIZED VS DISTRIBUTED
Centralized: request is made to master node
Distributed: request is made to adjacent right node
ON DEMAND – CENTRALIZED VS DISTRIBUTED
Centralized
• Advantages: master idle time is reduced.
• Disadvantages: a linearization point is added; access to the DNA must be performed.
Distributed
• Advantages: no extra accesses to the DNA are needed; no linearization point.
• Disadvantages: extra data requests may be slowed down.
SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH
[Figure: chunks i − 1, i and i + 1, each of size $l_d/n$. The query string slides along the DNA string over steps $T_0$ to $T_3$, …, $T_j$ to $T_{j+3}$; each worker records complete matchings, left-side partial matchings and right-side partial matchings.]
ADVANTAGES:
• no extra data is required
• it does not produce duplicated occurrences
• no extra communication is needed
• the master does not need to store the DNA string
• it reduces the bandwidth needed to check cross-chunk strings: workers return bits instead of integers.
DISADVANTAGES:
• extra work is required of the master (combining the partial matchings)
SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH
3SQ REDUCE PHASE
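WORKER i, right-side array: 1 1 0 1
WORKER i+1, left-side array: 1 0 0 1
Results array (bitwise AND): 1 0 0 1
FINAL OUTPUT (positions): for the indices $j$ and $k$ set to 1 in the results array,
$$s_j = i \cdot (l_d / n) - j \qquad s_k = i \cdot (l_d / n) - k$$
[Figure: the two query matches, of size $l_q$, located in the DNA string at positions $s_j$ and $s_k$.]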
The model is the same as for simple search:
• Splitting phase: same as in simple search
• Computation phase:
  • The similarity function is evaluated for every alignment of the query string
  • As in simple search, cross-chunk strings must be considered
  • Every worker returns its $n$ best similarity values, with their relative positions
• Reduce phase: all similarity values are merged in order and the best $n$ alignments are returned
PARALLELIZATION MODEL
REDUCE PHASE
WORKERS OUTPUT (offset, similarity):
Worker 1: (X, 10), (Y, 8), (Z, 3)
Worker 2: (U, 7), (V, 2), (W, -1)
Worker 3: (A, 5), (B, -3), (C, -6)
ORDERED MERGE (worker ID, offset, similarity):
(1, X, 10), (1, Y, 8), (2, U, 7), (3, A, 5), (1, Z, 3), (2, V, 2), (2, W, -1), (3, B, -3), (3, C, -6)
POSITION TRANSLATION, FINAL OUTPUT (position, similarity):
(X', 10), (Y', 8), (U', 7)
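The ordered merge and position translation can be sketched with the standard library's `heapq.merge`. The similarity values below are those of the example above; the numeric offsets and the chunk size of 10 are made up for illustration, and worker IDs start at 0 here rather than 1. `ordered_merge` is a hypothetical name.

```python
import heapq
from itertools import islice


def ordered_merge(per_worker, chunk_size, n_best):
    """per_worker[w] is worker w's list of (offset, similarity) pairs, already
    sorted by decreasing similarity. Returns the n_best (position, similarity)
    pairs overall, with position = w * chunk_size + offset."""
    streams = (
        [(sim, w * chunk_size + off) for off, sim in results]
        for w, results in enumerate(per_worker)
    )
    # key=-similarity makes heapq.merge yield the highest similarity first.
    merged = heapq.merge(*streams, key=lambda t: -t[0])
    return [(pos, sim) for sim, pos in islice(merged, n_best)]


workers = [
    [(3, 10), (7, 8), (1, 3)],     # X, Y, Z
    [(0, 7), (5, 2), (9, -1)],     # U, V, W
    [(2, 5), (4, -3), (6, -6)],    # A, B, C
]
print(ordered_merge(workers, chunk_size=10, n_best=3))  # [(3, 10), (7, 8), (10, 7)]
```

The merge is lazy, so only the first `n_best` entries are ever pulled from the sorted worker lists.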
Bigger chunk:
The master sends every worker a chunk of size $s \le l_d/n + l_q - 1$, so that cross-chunk similarities can be evaluated.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $s = l_d/n$ and computes its similarity values and all its partial similarities (left side and right side). The master sums the partial similarities to obtain the similarity values of the cross-chunk strings.
CROSS-CHUNK MATCHING
3SQ PARTIAL SIMILARITY COMBINE PHASE
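WORKER i, right-side array: 4 2 0 1
WORKER i+1, left-side array: 1 0 3 -4
Results array (element-wise sum): 5 2 3 -3
OUTPUT (worker ID, offset, similarity): $(i, s_j, 5)$, $(i, s_k, 3)$, …
$$s_j = l_c - (l_q - 1) + j$$
[Figure: query matches of size $l_q$ straddling the boundary between chunk i and chunk i + 1 of the DNA string.]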
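The partial similarity combine phase can be sketched as follows. Entry $j$ of each array refers to the alignment starting at local offset $l_c - (l_q - 1) + j$ of the left-hand chunk, matching the formula above. The scores `match=1` and `mismatch=-1` are assumed values, chunks are assumed to hold at least $l_q - 1$ characters, and all function names are hypothetical.

```python
def score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Ungapped similarity over the common length of a and b."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def right_partials(chunk: str, query: str) -> list[int]:
    """Entry j: tail of the chunk from offset l_c - (l_q - 1) + j, scored
    against the query prefix of the same length."""
    lc, lq = len(chunk), len(query)
    return [score(chunk[lc - (lq - 1) + j:], query) for j in range(lq - 1)]


def left_partials(chunk: str, query: str) -> list[int]:
    """Entry j: first j + 1 characters of the chunk, scored against the
    last j + 1 characters of the query."""
    lq = len(query)
    return [score(chunk[:j + 1], query[lq - 1 - j:]) for j in range(lq - 1)]


def combine_partials(right: list[int], left: list[int]) -> list[int]:
    """Master: the element-wise sum gives the cross-chunk similarity values."""
    return [r + l for r, l in zip(right, left)]


r = right_partials("ACGT", "GTAC")   # worker i
l = left_partials("ACGT", "GTAC")    # worker i+1
print(r, l, combine_partials(r, l))  # [-3, 2, -1] [-1, 2, -3] [-4, 4, -4]
```

The sums equal the scores obtained by aligning the query directly at positions 1, 2 and 3 of the concatenated string `ACGTACGT`, with the perfect match at position 2.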
Varying parameters:
• Number of Workers
• Query Length
We plan to measure the running time of every presented algorithm. The analysis of these results will validate our proposal and highlight the algorithm that performs best.
OVERVIEW
DEVELOPMENT AND BENCHMARKING
• Implementation
• Introduction
• DNA Splitting
• Bandwidth usage
• Communication
• Benchmarking
• Testing environment
• Test plan
• Results
• Conclusions
Every proposed algorithm has been implemented in C using the Open MPI library.
Advantages:
• High performance
• Scalability
• Portability
INTRODUCTION
A natural approach: load the DNA entirely from file, compute its size ($l_d$), split it into $n$ chunks and send them to the workers.
Problems:
A genome may be very large ($3.0 \times 10^9$ bp, base pairs).
The available memory may not be enough.
Design choice:
The whole DNA is never actually needed.
The DNA is never loaded entirely in memory: first the DNA size and the chunk size are computed, then $l_c$ characters at a time are read from file and sent to a worker.
DESIGN CHOICES: DNA SPLITTING
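The splitting strategy described above can be sketched with a generator that reads one chunk at a time. This is an illustration in Python, not the C and Open MPI code of the project; it assumes a file that stores one byte per base with no headers or line breaks, and `read_chunks` is a hypothetical name. Where the sketch yields a chunk, the real program would send it to a worker.

```python
import io
import os
from typing import BinaryIO, Iterator


def read_chunks(stream: BinaryIO, n: int) -> Iterator[tuple[int, bytes]]:
    """Yield (worker id, chunk) pairs without holding the whole DNA in memory."""
    ld = stream.seek(0, os.SEEK_END)   # DNA size, found by seeking, not reading
    stream.seek(0)
    lc = -(-ld // n)                   # chunk size: ceil(l_d / n)
    for worker_id in range(n):
        chunk = stream.read(lc)        # at most l_c bytes are in memory at once
        if not chunk:
            break
        yield worker_id, chunk


for worker_id, chunk in read_chunks(io.BytesIO(b"ACGTACGTAC"), 3):
    print(worker_id, chunk)            # 0 b'ACGT', then 1 b'ACGT', then 2 b'AC'
```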
• The messages exchanged during the simple search computation would normally consist of:
  • Characters (splitting phase)
  • Integers (reduce phase)
Bandwidth usage:
• 1 byte (char size) × $l_c$ × $n$ in the splitting phase
• 4 bytes (integer size) × $l_c$ × $n$ in the reduce phase (upper bound, reached when every position is a matching)
Can we do better? … YES!
DESIGN CHOICES: BANDWIDTH USAGE
In the Simple Search algorithm, compression can be applied to drastically reduce bandwidth consumption.
Simple Search reduce phase compression: instead of sending the actual positions, a bit array of size $l_c$ is used.
Bit array construction:
for each position, the bit is set to 1 if a matching starts there, and to 0 otherwise.
Compression ratio:
1:32 (e.g. the space of 4 integers covers 128 positions instead of 4)
DESIGN CHOICES: BANDWIDTH USAGE
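The bit array construction can be sketched as follows. The bit order inside each byte (least significant bit first) is an arbitrary choice made for this example, and the function names are hypothetical.

```python
def positions_to_bits(positions: list[int], lc: int) -> bytes:
    """Worker: pack matching offsets into a bit array of l_c bits."""
    bits = bytearray((lc + 7) // 8)
    for p in positions:
        bits[p // 8] |= 1 << (p % 8)   # bit p set means a matching starts at p
    return bytes(bits)


def bits_to_positions(bits: bytes, lc: int) -> list[int]:
    """Master: recover the offsets from the bit array."""
    return [p for p in range(lc) if bits[p // 8] >> (p % 8) & 1]


packed = positions_to_bits([0, 9, 127], lc=128)
print(len(packed), bits_to_positions(packed, 128))  # 16 [0, 9, 127]
```

The 16 bytes used here are the size of four 4-byte integers, yet they describe all 128 positions of the chunk, which is the 1:32 ratio quoted above.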
COMMUNICATION
For each direction: messages, data type, type.
Bigger Chunk
• Master to workers: N(ld/n + lq − 1), Char, Async/Sync
• Extra communication: none
• Workers to master: N(ld/n), Int or Bit, Sync/Sync
On Demand
• Master to workers: N(ld/n), Char, Async/Sync
• Extra communication: (N − 1)(lq − 1), Char, Sync/Sync
• Workers to master: N(ld/n), Int or Bit, Sync/Sync
3SQ
• Master to workers: N(ld/n), Char, Async/Sync
• Extra communication: none
• Workers to master: N(ld/n) + 2(lq − 1), Int or Bit, Sync/Sync
Cluster
8 nodes, 100 Mbps Ethernet connection
Node
CPU: Intel Xeon Dual Core 2.8 GHz
RAM: 4 GB
Hard drive: 2 × 30 GB SCSI
Software
OS: Debian 6.0.4
OpenMPI 1.6.1
TESTING ENVIRONMENT
Benchmarking consisted of evaluating and comparing the running times of each algorithm as a function of the following parameters:
• Number of processors (# workers + 1): 2, 4, 8, 16
• DNA length: Small (5 MB), Medium (149 MB), Large (292 MB)
• Query length: Small (8 bytes), Medium (32 bytes), Large (64 bytes)
• Number of best alignments (approximate search only): 10, 50, 100
In grey, the fixed value used for each parameter when it is not the one being evaluated
TEST PLAN
SIMPLE SEARCH: NUMBER OF WORKERS (1/2)
Results:
• Good scalability for every algorithm
• 3SQ is worse than the others because additional sequential work must be performed.
SIMPLE SEARCH: NUMBER OF WORKERS (2/2)
Results:
• Bigger Chunk Bit performs better than the int solution.
• As the number of processors increases, Bigger Chunk performs better than the others because more cross-chunk matchings occur.
• No relevant improvement occurred between 8 and 16 processors.
SIMPLE SEARCH: SPEED UP
[Figure: Speed Up Simple Search. DNA size: Big; query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 4). Series: BC-bit, OD-cent, BC-int, OD-dist, 3SQ.]
Results:
• Speedup increases for every algorithm (except BC-int)
• The speedup grows proportionally to $n + 1$
• BC-int suffers from a network bottleneck due to the size of its messages.
SIMPLE SEARCH: DNA LENGTH
Results:
• Good scalability for every algorithm
• 3SQ is worse: it performs more sequential work than the others
• Bigger Chunk Bit performs better than the int solution
• Execution time grows linearly with the DNA size
SIMPLE SEARCH: QUERY LENGTH
Results:
• 3SQ is highly sensitive to query length variations because of the partial matching combine phase.
• No significant variation for the other algorithms, since a single query matching is interrupted at the first mismatch found.
APPROXIMATE SEARCH: NUMBER OF WORKERS
Results:
• Running times decrease linearly with the number of processors.
• 3SQ is only slightly worse than Bigger Chunk because the sequential work is almost the same (Ordered Merge).
APPROXIMATE SEARCH: SPEED UP
[Figure: Speed Up Approximate Search. DNA size: Medium; query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 16). Series: 3SQ, BC-int.]
Results: Speedup is globally better than for simple search and close to the ideal value.
APPROXIMATE SEARCH: DNA SIZE
Results: Running time grows linearly with the DNA size.
Motivation: The main sequential computation is the Ordered Merge, which has linear complexity.
APPROXIMATE SEARCH: QUERY SIZE
Results: Running time is influenced by the query size.
Motivation: The computation of the similarity function is affected by the query length.
APPROXIMATE SEARCH: NUMBER OF BEST ALIGNMENTS
Results: Running time grows almost linearly.
Motivation: Each worker returns to the master as many results as the number of best alignments, which affects the ordered merge process.
[Figure: Approximate Search. DNA size: Big; processors: 16; query size: Small. X axis: number of best alignments (10, 50, 100); Y axis: running time in seconds (0 to 35). Series: BC-int, 3SQ.]
The winner is….
Bigger Chunk
On Demand
3SQ
Further improvements can be applied to the presented algorithms
Splitting phase: the DNA alphabet consists of only 4 characters, so 2 bits are enough to represent each character, instead of 8 bits.
Bit mapping:
e.g. A=00, T=01, C=10, G=11
Compression ratio:
1:4 (e.g. one character's worth of space holds 4 bases instead of 1)
IMPROVEMENTS
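The proposed 2-bit encoding can be sketched as below, using the bit mapping given above. Placing the first base in the two most significant bits of each byte is an arbitrary choice for this example, and the names `CODE`, `BASE`, `pack` and `unpack` are hypothetical. The sequence length has to be sent alongside the packed bytes, because the last byte may be only partly used.

```python
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}   # bit mapping from the text
BASE = {bits: base for base, bits in CODE.items()}


def pack(seq: str) -> bytes:
    """Store 4 bases per byte, first base in the two most significant bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from the packed bytes."""
    return "".join(BASE[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
                   for i in range(length))


packed = pack("ATGATTACC")
print(len(packed), unpack(packed, 9))  # 3 ATGATTACC
```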
3SQ algorithm:
The partial matchings combine phase can be performed in a distributed manner.
Each node sends its left or right partial matchings to its left or right sibling, which combines them with its own results and sends them to the master.
In this way the sequential work can be reduced.
IMPROVEMENTS
Thanks !