Parallel DNA Sequence Alignment

SEQUENCE ALIGNMENT SPEED-UP

A PARALLEL APPROACH

University of Salerno

Parallel and Concurrent Computing Course

19 February 2013

Giuliana Carullo Luca Pepe

Daniele Valenza

• Introduction

• Problem definition

• Simple Search

• Approximate Search

• Parallelization

• Cross-Chunk Matching

• Bigger chunk

• On Demand

• Side to Side Sliding Query


• Test plan

Sequence alignment is a process for comparing two or more DNA or RNA sequences.

Sequence alignment is performed to find similar or identical regions in the given sequences, or to check whether a sequence matches a known one stored in a database.

DNA STRUCTURE

DNA bases: A C G T

Bonds (base pairs): (A, T) (C, G)

DNA ALIGNMENT

Affinity measures:

• MATCH

• MISMATCH

• GAP

MATCHING TYPE:

• SIMPLE

• REVERSE AND COMPLEMENT

Q: ATGATTACC DNA String

R(Q): CCATTAGTA Reverse

C (R(Q)): GGTAATCAT Complement
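The reverse and complement steps shown above can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the pairing table simply encodes the (A, T) and (C, G) pairs listed earlier.

```python
# Pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse(seq: str) -> str:
    """Return the sequence read backwards, R(Q)."""
    return seq[::-1]

def reverse_complement(seq: str) -> str:
    """Return C(R(Q)): reverse the sequence, then swap each base with its pair."""
    return reverse(seq).translate(COMPLEMENT)

q = "ATGATTACC"
print(reverse(q))             # CCATTAGTA
print(reverse_complement(q))  # GGTAATCAT
```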

• Global Alignment:

• Local Alignment:

DNA ALIGNMENT TYPES

Find all exact matches of a short query string in a larger DNA string.

INPUT: DNA string, query string

OUTPUT: number of occurrences, starting position of each occurrence

SIMPLE SEARCH
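As a sequential baseline, the simple search problem can be sketched as follows. This is a minimal Python illustration, not the presented C implementation; the function name is ours and positions are 0-based by assumption.

```python
def simple_search(dna: str, query: str) -> list[int]:
    """Return the starting positions of every exact occurrence of query in dna."""
    positions = []
    start = dna.find(query)
    while start != -1:
        positions.append(start)
        # Restart one character later so overlapping occurrences are kept.
        start = dna.find(query, start + 1)
    return positions

hits = simple_search("ACGTACGTACGT", "GTAC")
print(len(hits), hits)  # 2 [2, 6]
```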

Notation

• Number of workers: $n$

• Query length: $l_q$

• DNA length: $l_d$

• Relative position (offset): $\mathit{off}_i$

• Absolute position: $s_i$

Find the "best" n alignments of a short query string in a larger DNA string.

INPUT: DNA string, query string

OUTPUT: starting positions of the best alignments

APPROXIMATE SEARCH

APPROXIMATE SEARCH – SIMILARITY EVALUATION

Character similarity function:

$$s_i = \begin{cases} x & \text{match} \\ y & \text{mismatch} \\ z & \text{gap} \end{cases} \qquad x > 0;\; y, z \le 0$$

(In this work gaps are not considered.)

Objective function to maximize:

$$S = \sum_{i=1}^{l_q} s_i$$
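The similarity evaluation can be sketched in Python as below. The scores +1 and -1 are assumed values (the slides only require a positive match score and a non-positive mismatch score), gaps are ignored as in this work, and the names `similarity`, `best_alignments` and `n_best` are ours.

```python
import heapq

MATCH, MISMATCH = 1, -1  # assumed values: the slides only require x > 0 and y <= 0

def similarity(window: str, query: str) -> int:
    """Sum the per-character scores of one gap-free alignment."""
    return sum(MATCH if a == b else MISMATCH for a, b in zip(window, query))

def best_alignments(dna: str, query: str, n_best: int) -> list[tuple[int, int]]:
    """Return the n_best (score, position) pairs, highest score first."""
    lq = len(query)
    scored = ((similarity(dna[p:p + lq], query), p)
              for p in range(len(dna) - lq + 1))
    # Ties on the score are broken in favour of the leftmost position.
    return heapq.nlargest(n_best, scored, key=lambda sp: (sp[0], -sp[1]))

print(best_alignments("ACGTACGA", "ACGT", 2))  # [(4, 0), (2, 4)]
```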

• Introduction


• Simple Search


• Parallelization


• Bigger chunk

• On Demand



• Test plan

All the solutions share a common approach based on the MapReduce model:

• The master node splits the DNA string into chunks and scatters them to the worker nodes.

• The workers perform the computation and send their results back to the master.

• The master combines the partial results and returns the output.

GENERAL IDEA

Attention must be paid to matching strings that cross a chunk border (cross-chunk matchings).

GENERIC SPLIT AND COMPUTATION

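[Figure: generic split and computation. The DNA string is split into chunks $i-1$, $i$, $i+1$ of size $l_d/n$ each. The query string slides over the DNA string in steps $T_0 \dots T_7$; the legend distinguishes complete matchings from partial matchings.]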

GENERIC REDUCE PHASE

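[Figure: generic reduce phase. Workers output: each worker returns its worker ID and the offsets of its matches, e.g. $(i, \mathit{off}_i)$ and $(j, \mathit{off}_j)$, for a query string of length $l_q$. Final output: the master translates them into the positions $s_i$, $s_j$ in the DNA string.]

$$s_i = i \cdot \frac{l_d}{n} + \mathit{off}_i \qquad s_j = j \cdot \frac{l_d}{n} + \mathit{off}_j$$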
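The generic split, computation and reduce steps can be sketched sequentially in Python. This is an illustration under assumptions of ours: worker IDs start at 0, $n$ divides $l_d$, and the workers are simulated by a plain loop rather than by MPI processes. It also shows why a plain split is not enough: matches that cross a chunk border are lost.

```python
def split(dna: str, n: int) -> list[str]:
    """Split dna into n chunks of size l_d / n (assumes n divides l_d)."""
    lc = len(dna) // n
    return [dna[i * lc:(i + 1) * lc] for i in range(n)]

def worker(chunk: str, query: str) -> list[int]:
    """Return the offsets, relative to the chunk, of the complete matches."""
    offsets, start = [], chunk.find(query)
    while start != -1:
        offsets.append(start)
        start = chunk.find(query, start + 1)
    return offsets

def reduce_positions(worker_offsets: list[list[int]], lc: int) -> list[int]:
    """Translate (worker ID, offset) pairs into positions: i * (l_d / n) + off."""
    return [i * lc + off for i, offs in enumerate(worker_offsets) for off in offs]

dna, n = "ACGTACGTACGT", 3
chunks = split(dna, n)
print(reduce_positions([worker(c, "CG") for c in chunks], len(dna) // n))  # [1, 5, 9]
# "TA" only occurs across chunk borders (positions 3 and 7), so it is missed.
print(reduce_positions([worker(c, "TA") for c in chunks], len(dna) // n))  # []
```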


Bigger chunk:

The master sends every worker a chunk of size $s = \frac{l_d}{n} + l_q - 1$, so that cross-chunk matching strings can be found.

On Demand:

The master sends chunks of size $s = \frac{l_d}{n}$. When a worker finds a partial matching at the end of its chunk, it asks for the remaining part, of size $r \le l_q - k$, so that cross-chunk matching strings can be found.

Two possible heuristics: big request and small request.

Side to Side Sliding Query (3SQ):

Every worker receives a chunk of size $s = \frac{l_d}{n}$ and computes its complete matchings and all its partial matchings. The master then combines the partial matchings to find the cross-chunk matchings.

SOLUTION APPROACHES

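BIGGER CHUNK APPROACH

[Figure: bigger chunk approach. Chunks $i-1$, $i$, $i+1$ each hold $l_d/n$ characters plus $l_q - 1$ extra characters; the overlapping characters of consecutive chunks are the same. The query string slides over the DNA string in steps $T_0 \dots T_7$ and complete matchings are highlighted.]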

ADVANTAGES:

• it does not require communication between workers;

• it does not produce duplicated occurrences;

• the master has very little sequential work to perform.

DISADVANTAGES:

• each worker (except the last one) receives $l_q - 1$ extra characters. This produces an extra bandwidth usage $b_e$ of:

$$b_e = (l_q - 1) \cdot (n - 1)$$

BIGGER CHUNK APPROACH
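A minimal Python sketch of the bigger chunk idea follows, under assumptions of ours: 0-based worker IDs, $n$ dividing $l_d$, and workers simulated by a loop. Each chunk carries $l_q - 1$ extra characters, so every match starts inside exactly one worker's own $l_d/n$ characters and no duplicates arise.

```python
def bigger_chunks(dna: str, n: int, lq: int) -> list[str]:
    """Chunks of l_d / n characters plus l_q - 1 extra ones (fewer at the end)."""
    lc = len(dna) // n  # assumes n divides l_d
    return [dna[i * lc:(i + 1) * lc + lq - 1] for i in range(n)]

def find_all(text: str, query: str) -> list[int]:
    """Offsets of every exact occurrence of query in text."""
    offsets, start = [], text.find(query)
    while start != -1:
        offsets.append(start)
        start = text.find(query, start + 1)
    return offsets

def bigger_chunk_search(dna: str, query: str, n: int) -> list[int]:
    """Each simulated worker searches its enlarged chunk; the master translates offsets."""
    lc = len(dna) // n
    positions = []
    for i, chunk in enumerate(bigger_chunks(dna, n, len(query))):
        positions.extend(i * lc + off for off in find_all(chunk, query))
    return positions

def extra_bandwidth(n: int, lq: int) -> int:
    """Extra characters sent overall: (l_q - 1) * (n - 1)."""
    return (lq - 1) * (n - 1)

print(bigger_chunk_search("ACGTACGTACGT", "GTAC", 3))  # [2, 6]
print(extra_bandwidth(3, 4))                           # 6
```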

Bigger Chunk

analogous to generic approach

On Demand:



additional work is performed by the master node to combine the partial matchings.

REDUCE PHASE


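ON DEMAND – BIG REQUEST APPROACH

[Figure: on demand, big request. Chunks $i$ and $i+1$ of size $l_d/n$; the query string slides over the DNA string in steps $T_0, \dots, T_j, T_{j+1}, T_{j+2}, T_{j+3}$. Check marks and a cross show the outcome of the character comparisons; the legend distinguishes complete matchings from partial matchings.]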

ADVANTAGES:

• extra data is requested only when needed

• it does not produce duplicated occurrences

• a single request is performed for each worker

DISADVANTAGES:

• extra overhead for the big request

• the extra characters received may turn out to be useless

ON DEMAND – BIG REQUEST APPROACH

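ON DEMAND – SMALL REQUEST APPROACH

[Figure: on demand, small request. Chunks $i$ and $i+1$ of size $l_d/n$; the query string slides over the DNA string in steps $T_0, \dots, T_j, T_{j+1}, T_{j+2}, T_{j+3}$. Check marks and a cross show the outcome of the character comparisons; the legend distinguishes complete matchings from partial matchings.]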

ADVANTAGES:

• extra data are requested only when needed


• better bandwidth usage than big request

DISADVANTAGES:

• Number of requests grows proportionally to the length of the query

ON DEMAND – SMALL REQUEST APPROACH

Two kinds of communication can be adopted:

ON DEMAND – CENTRALIZED VS DISTRIBUTED

Centralized: the request is made to the master node

Distributed: the request is made to the adjacent right node

ON DEMAND – CENTRALIZED VS DISTRIBUTED

Centralized

• ADVANTAGES: Master idle time is reduced.

• DISADVANTAGES: A linearization point is added. Access to the DNA must be performed.

Distributed

• ADVANTAGES: No extra accesses to the DNA are needed. No linearization point.

• DISADVANTAGES: Extra data requests may be slowed down.


SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH

[Figure: 3SQ approach. Chunks $i-1$, $i$, $i+1$ of size $l_d/n$ each; the query string slides over the DNA string in steps $T_0 \dots T_3$ and $T_j \dots T_{j+3}$. The legend distinguishes complete matchings, right-side partial matchings and left-side partial matchings.]

ADVANTAGES:

• no extra data is required


• no extra communication is needed

• the master does not need to store the DNA string

• it reduces the bandwidth needed to check cross-chunk strings: workers return bits instead of integers.

DISADVANTAGES:

• Extra work is required of the master (combining the partial matchings)

SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH


3SQ REDUCE PHASE

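[Figure: 3SQ reduce phase. Worker $i$ returns a right-side bit array and worker $i+1$ a left-side bit array for the query string of length $l_q$ placed across the chunk border. The master ANDs the two arrays; each 1 in the results array marks a cross-chunk match.]

Right-side array (worker $i$): 1 1 0 1

Left-side array (worker $i+1$): 1 0 0 1

Results array (AND): 1 0 0 1

FINAL OUTPUT

The 1s at indices $j$ and $k$ of the results array are translated into positions:

$$s_j = i \cdot \frac{l_d}{n} - j \qquad s_k = i \cdot \frac{l_d}{n} - k$$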
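The 3SQ combine step can be sketched in Python as follows. The indexing is an assumption of ours and may differ from the slides: worker IDs start at 0, entry $t - 1$ of each array refers to a match with $t$ characters in the left chunk, and chunks are assumed to hold at least $l_q - 1$ characters so that a match spans at most two chunks.

```python
def find_all(text: str, query: str) -> list[int]:
    """Offsets of every exact occurrence of query in text."""
    offsets, start = [], text.find(query)
    while start != -1:
        offsets.append(start)
        start = text.find(query, start + 1)
    return offsets

def right_partials(chunk: str, query: str) -> list[int]:
    """Bit t-1 is 1 when the last t characters of the chunk equal the first t of the query."""
    return [int(chunk[-t:] == query[:t]) for t in range(1, len(query))]

def left_partials(chunk: str, query: str) -> list[int]:
    """Bit t-1 is 1 when the chunk starts with the query's last l_q - t characters."""
    lq = len(query)
    return [int(chunk[:lq - t] == query[t:]) for t in range(1, lq)]

def three_sq(dna: str, query: str, n: int) -> list[int]:
    lc = len(dna) // n  # assumes n divides l_d and lc >= l_q - 1
    chunks = [dna[i * lc:(i + 1) * lc] for i in range(n)]
    # Complete matchings found by each simulated worker.
    positions = [i * lc + off for i, c in enumerate(chunks) for off in find_all(c, query)]
    # Master side: AND the right array of worker i with the left array of worker i + 1.
    for i in range(n - 1):
        right = right_partials(chunks[i], query)
        left = left_partials(chunks[i + 1], query)
        for t, (r, l) in enumerate(zip(right, left), start=1):
            if r & l:
                positions.append((i + 1) * lc - t)  # match starts t characters before the border
    return sorted(positions)

print(three_sq("ACGTACGTACGT", "GTAC", 3))  # [2, 6]
```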


Same as simple search:

• Splitting phase: same as in simple search

• Computation phase:

  • The similarity function is evaluated for every alignment of the query string

  • As in simple search, cross-chunk strings must be considered

  • Every worker returns its $n$ best similarity values, with their relative positions

• Reduce phase: all similarity values are merged in order and the best $n$ alignments are returned

PARALLELIZATION MODEL

REDUCE PHASE

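Worker 1 (offset, similarity): (X, 10), (Y, 8), (Z, 3)

Worker 2 (offset, similarity): (U, 7), (V, 2), (W, -1)

Worker 3 (offset, similarity): (A, 5), (B, -3), (C, -6)

ORDERED MERGE (worker ID, offset, similarity)

(1, X, 10), (1, Y, 8), (2, U, 7), (3, A, 5), (1, Z, 3), (2, V, 2), (2, W, -1), (3, B, -3), (3, C, -6)

POSITION TRANSLATION

FINAL OUTPUT (position, similarity)

(X’, 10), (Y’, 8), (U’, 7)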
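The ordered merge of the example above can be sketched with Python's standard `heapq.merge`, which merges already sorted inputs lazily. The data are the values from the slide; the function name and the `n_best` parameter are ours, and the position translation step is left out because the offsets here are symbolic.

```python
import heapq
from itertools import islice

# Per-worker results from the slide, each already sorted by decreasing similarity.
worker_results = {
    1: [("X", 10), ("Y", 8), ("Z", 3)],
    2: [("U", 7), ("V", 2), ("W", -1)],
    3: [("A", 5), ("B", -3), ("C", -6)],
}

def ordered_merge(results: dict, n_best: int) -> list[tuple]:
    """Merge the sorted per-worker lists and keep the n_best (worker, offset, similarity) rows."""
    streams = ([(wid, off, sim) for off, sim in rows] for wid, rows in results.items())
    merged = heapq.merge(*streams, key=lambda row: row[2], reverse=True)
    return list(islice(merged, n_best))

print(ordered_merge(worker_results, 3))
# [(1, 'X', 10), (1, 'Y', 8), (2, 'U', 7)]
```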

Bigger chunk:

The master sends every worker a chunk of size $s \le \frac{l_d}{n} + l_q - 1$, so that cross-chunk matching similarities can be evaluated.

Side to Side Sliding Query (3SQ):

Every worker receives a chunk of size $s = \frac{l_d}{n}$ and computes its similarity values and all its partial similarities (left side and right side). The master then sums the partial similarities to compute the similarity values of the cross-chunk strings.

CROSS-CHUNK MATCHING

3SQ PARTIAL SIMILARITY COMBINE PHASE

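[Figure: 3SQ partial similarity combine phase. Across the border between chunk $i$ and chunk $i+1$, worker $i$ returns a right-side array and worker $i+1$ a left-side array of partial similarities for the query string of length $l_q$. The master adds them entry by entry.]

Right-side array (worker $i$): 4 2 0 1

Left-side array (worker $i+1$): 1 0 3 -4

Results array (sum): 5 2 3 -3

OUTPUT (worker ID, offset, similarity)

($i$, $s_j$, 5), ($i$, $s_k$, 3), ($i$, …)

The offset of entry $j$ is $s_j = l_c - (l_q - 1) + j$.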
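The partial similarity combine can be sketched in Python as below. The scores +1 and -1, the function names and the array indexing (entry $t - 1$ covers an alignment with $t$ characters in the left chunk) are assumptions of ours; both chunks are assumed to hold at least $l_q - 1$ characters.

```python
MATCH, MISMATCH = 1, -1  # assumed scores; gaps are not considered

def score(a: str, b: str) -> int:
    """Gap-free similarity of two equally long strings."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))

def right_partial_scores(chunk: str, query: str) -> list[int]:
    """Entry t-1: similarity of the chunk's last t characters with the query's first t."""
    return [score(chunk[-t:], query[:t]) for t in range(1, len(query))]

def left_partial_scores(chunk: str, query: str) -> list[int]:
    """Entry t-1: similarity of the chunk's first l_q - t characters with the rest of the query."""
    lq = len(query)
    return [score(chunk[:lq - t], query[t:]) for t in range(1, lq)]

def cross_chunk_scores(chunk_i: str, chunk_next: str, query: str) -> list[int]:
    """Master side: add the two partial arrays entry by entry."""
    return [r + l for r, l in zip(right_partial_scores(chunk_i, query),
                                  left_partial_scores(chunk_next, query))]

print(cross_chunk_scores("ACGTAC", "GGACGT", "ACGG"))  # [-2, 4, -2]
```

Each summed entry equals the similarity that would be obtained by scoring the same alignment on the unsplit string, which is what makes the split transparent to the final ranking.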


Varying parameters:

• Number of Workers

• Query Length

We plan to evaluate the running times of every presented algorithm. The analysis of these results will validate our proposal, highlighting the algorithm that performs best.

OVERVIEW


DEVELOPMENT AND BENCHMARKING

• Implementation

• Introduction

• DNA Splitting

• Bandwidth usage

• Communication

• Benchmarking

• Testing environment

• Test plan

• Results

• Conclusions

Every proposed algorithm has been implemented in C using the Open MPI library.

Advantages:

• High performance

• Scalability

• Portability

INTRODUCTION

A natural approach: load the DNA entirely from file, calculate its size ($l_d$), split it into $n$ chunks and send them to the workers.

Problems:

A DNA genome may be very large ($3.0 \times 10^9$ bp (base pairs)).

The available memory may not be enough.

Design choice:

The whole DNA is actually never needed at once.

The DNA is never loaded entirely in memory: first the DNA size and the chunk size are calculated, then, step by step, $l_c$ characters are read from file and sent to a worker.

DESIGN CHOICES: DNA SPLITTING
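The streaming split can be sketched in Python as follows. This is an illustration, not the C implementation: the function name is ours, the file is assumed to contain only bases (no header or newlines), and the `yield` stands in for the point where an MPI program would send the chunk to a worker.

```python
import io
import os

def stream_chunks(f, n: int):
    """Yield (worker ID, chunk) pairs while holding only one chunk in memory."""
    f.seek(0, os.SEEK_END)
    ld = f.tell()        # DNA size, obtained without reading the file
    f.seek(0)
    lc = -(-ld // n)     # chunk size, rounded up so that n chunks cover the file
    for worker_id in range(n):
        chunk = f.read(lc)
        yield worker_id, chunk  # here the chunk would be sent to the worker

# Usage with an in-memory stand-in for the DNA file.
dna_file = io.BytesIO(b"ACGTACGTAC")
for worker_id, chunk in stream_chunks(dna_file, 3):
    print(worker_id, chunk)
```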

• The messages exchanged during the simple search computation would normally consist of:

  • Characters (splitting phase)

  • Integers (reduce phase)

Bandwidth usage:

• 1 byte (char size) × $l_c$ × $n$ in the splitting phase

• 4 bytes (integer size) × $l_c$ × $n$ in the reduce phase (upper bound, reached when every position is a match)

Can we do better? … YES!

DESIGN CHOICES: BANDWIDTH USAGE

In the Simple Search algorithm, compression can drastically reduce bandwidth consumption.

Simple Search reduce phase compression: instead of sending the actual positions, a bit array of size $l_c$ is used.

Bit array construction:

for each position, the bit is set to 1 if a matching starts from it, 0 otherwise.

Compression ratio:

1:32 (e.g., the space of 4 integers holds 128 positions instead of 4)

DESIGN CHOICES: BANDWIDTH USAGE
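The bit array construction can be sketched in Python as below. The function names and the bit order inside each byte (least significant bit first) are our assumptions; the slides do not specify the layout.

```python
def positions_to_bits(positions: list[int], lc: int) -> bytes:
    """Build a bit array of lc bits with bit p set to 1 when a match starts at offset p."""
    bits = bytearray((lc + 7) // 8)
    for p in positions:
        bits[p // 8] |= 1 << (p % 8)
    return bytes(bits)

def bits_to_positions(bits: bytes, lc: int) -> list[int]:
    """Master side: recover the offsets from the bit array."""
    return [p for p in range(lc) if (bits[p // 8] >> (p % 8)) & 1]

packed = positions_to_bits([0, 9, 127], 128)
print(len(packed))                     # 16
print(bits_to_positions(packed, 128))  # [0, 9, 127]
```

The 16 bytes produced here are the size of 4 four-byte integers, yet they describe all 128 positions of the chunk.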

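COMMUNICATION

For each direction: messages, data type, type

Bigger Chunk

• Master to workers: N(ld/n + lq − 1), Char, Async / Sync

• Extra communication: none

• Workers to master: N(ld/n), Int or Bit, Sync / Sync

On Demand

• Master to workers: N(ld/n), Char, Async / Sync

• Extra communication: (N − 1)(lq − 1), Char, Sync / Sync

• Workers to master: N(ld/n), Int or Bit, Sync / Sync

3SQ

• Master to workers: N(ld/n), Char, Async / Sync

• Extra communication: none

• Workers to master: N(ld/n) + 2(lq − 1), Int or Bit, Sync / Sync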


Cluster

8 nodes, 100 Mbps Ethernet connection

Node

CPU: Intel Xeon Dual Core 2.8 GHz

RAM: 4GB

Hard Drive: 2x 30GB SCSI

Software

OS: Debian 6.0.4

OpenMPI 1.6.1

TESTING ENVIRONMENT


Benchmarking consisted of evaluating and comparing the running times of each algorithm as a function of the following parameters:

• Number of processors (number of workers + 1): 2, 4, 8, 16

• DNA length: Small (5 MB), Medium (149 MB), Large (292 MB)

• Query length: Small (8 bytes), Medium (32 bytes), Large (64 bytes)

• Number of best alignments (approximate search only): 10, 50, 100

In grey: the fixed value of each parameter when it is not the one being evaluated

TEST PLAN

SIMPLE SEARCH: NUMBER OF WORKERS (1/2)

Results:

• Good scalability for every algorithm

• 3SQ performs worse than the others because additional sequential work must be performed.

SIMPLE SEARCH: NUMBER OF WORKERS (2/2)

Results:

• Bigger Chunk Bit performs better than the int solution.

• As the number of processors increases, Bigger Chunk performs better than the others because more cross-chunk matchings occur.

• No relevant improvement occurred between 8 and 16 processors.

SIMPLE SEARCH: SPEED UP

[Chart: Speed Up Simple Search. DNA size: Big; query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 4). Series: BC-bit, OD-cent, BC-int, OD-dist, 3SQ.]

Results:

• Speedup increases for every algorithm (except BC-int)

• The speedup grows proportionally to $n + 1$

• BC-int suffers from a network bottleneck due to the size of its messages.

SIMPLE SEARCH: DNA LENGTH

Results:

• Good scalability for every algorithm

• 3SQ performs worse: it requires more sequential work than the others.

• Bigger Chunk Bit performs better than the int solution

• Execution time grows linearly with the DNA size

SIMPLE SEARCH: QUERY LENGTH

Results:

• 3SQ is highly sensitive to query length variations, due to its partial matching combine phase.

• No significant variation for the other algorithms, since each single query matching is interrupted at the first mismatch found.

APPROXIMATE SEARCH: NUMBER OF WORKERS

Results:

• Running times decrease linearly with the number of processors.

• 3SQ is only slightly worse than Bigger Chunk because the sequential work is almost the same (Ordered Merge)

APPROXIMATE SEARCH: SPEED UP

[Chart: Speed Up Approximate Search. DNA size: Medium; query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 16). Series: 3SQ, BC-int.]

Results: Speedup is globally better than in simple search and close to the ideal value.

APPROXIMATE SEARCH: DNA SIZE

Results: Running times grow linearly with the DNA size.

Motivation: The main sequential computation is the Ordered Merge, which has linear complexity.

APPROXIMATE SEARCH: QUERY SIZE

Results: Running times are influenced by the query size.

Motivation: The computation of the similarity function is affected by the query length.

APPROXIMATE SEARCH: NUMBER OF BEST ALIGNMENTS

Results: Running times grow almost linearly.

Motivation: Each worker returns to the master as many results as the number of best alignments, and this affects the ordered merge process.

[Chart: Approximate Search. DNA size: Big; processors: 16; query size: Small. X axis: number of best alignments (10, 50, 100); Y axis: running time in seconds (0 to 35). Series: BC-int, 3SQ.]


The winner is….

Bigger Chunk

On Demand

3SQ

Further improvements can be applied to the presented algorithms.

Splitting phase: the DNA alphabet consists of just 4 characters, so 2 bits are enough to represent a character, instead of 8 bits.

Bit mapping:

e.g. A=00, T=01, C=10, G=11

Compression ratio:

1:4 (e.g., the space of one character holds 4 bases instead of 1)

IMPROVEMENTS
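The 2-bit encoding can be sketched in Python as follows, using the example mapping above. The function names and the placement of the first base in the lowest two bits of each byte are our assumptions; the original length must be sent along because the last byte may be only partly used.

```python
# Example mapping from the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {bits: base for base, bits in CODE.items()}

def pack(dna: str) -> bytes:
    """Store 4 bases per byte, the first base in the two lowest bits."""
    out = bytearray((len(dna) + 3) // 4)
    for i, base in enumerate(dna):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from the packed bytes."""
    return "".join(BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length))

packed = pack("ATCGGA")
print(len(packed))        # 2
print(unpack(packed, 6))  # ATCGGA
```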

3SQ algorithm:

The partial matchings combine phase can be performed in a distributed manner.

Each node sends its left or right partial matchings to its left or right sibling, which combines them with its own results and sends them to the master.

In this way the sequential work can be reduced.

IMPROVEMENTS

Thanks !
