InCoB2007 - August 30, 2007 - HKUST

Page 1: InCoB2007 - August 30, 2007 - HKUST

“Speedup Bioinformatics Applications on Multicore-based Processor using Vectorizing & Multithreading Strategies”

Kridsadakorn Chaichoompu [email protected]

Dr. Sissades Tongsima

Dr. Surin Kittitornkun

King Mongkut’s Institute of Technology, Ladkrabang, Thailand

National Center for Genetic Engineering and Biotechnology, Thailand

Page 2: InCoB2007 - August 30, 2007 - HKUST

Outline

Introduction
Case Study
Existing works
Speedup of our approach
Comparison
Discussion
Our strategies
Limitation
Conclusion

Page 3: InCoB2007 - August 30, 2007 - HKUST

Motivation

New processors keep being launched. How can we make use of the new technologies?

Dual-core CPU

Quad-core CPU

Page 4: InCoB2007 - August 30, 2007 - HKUST

Motivation [2]

What is the difference between old and new CPUs?

Dual-core: maximum speedup ~2x

Quad-core: maximum speedup ~4x

Page 5: InCoB2007 - August 30, 2007 - HKUST

Problems

Is old sequential software still used? Yes, especially scientific and bioinformatics tools.

Why do scientists still use it? Mostly they care about novel algorithms and knowledge. They don't care about speed.

Why don't we use a PC cluster? It is very expensive and consumes much more electric power. You don't need a PC cluster if you only want to run a small program for searching, matching, or grouping data.

Page 6: InCoB2007 - August 30, 2007 - HKUST

Our Contribution

The hardware has changed, so old sequential software should change too. To harness the power of the new multicore architecture, certain compiler techniques must be considered.

Using the popular ClustalW application as our case study, optimization and multithreading techniques were applied to speed up ClustalW.

Page 7: InCoB2007 - August 30, 2007 - HKUST

Case Study: ClustalW

ClustalW is a general-purpose multiple alignment program for DNA or proteins.

Page 8: InCoB2007 - August 30, 2007 - HKUST

ClustalW example

Input sequences:

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD

Stage 1, all pairwise alignments, gives the distance matrix:

     S1  S2  S3  S4
S1    0   9   4   7
S2        0   8   3
S3            0   7
S4                0

Stage 2, neighbor joining, builds the tree from the distances: S1 groups with S3, and S2 groups with S4.

Stage 3, multiple alignment, follows these steps:

1. Align S1 with S3

-ALSK
NA-SK

2. Align S2 with S4

-TNSD
NT-SD

3. Align (S1, S3) with (S2, S4)

-ALSK
-TNSD
NA-SK
NT-SD

Page 9: InCoB2007 - August 30, 2007 - HKUST

Existing works

ClustalW-MPI: ClustalW analysis using distributed and parallel computing
K.B. Li, Bioinformatics 19, 2003

Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling
J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC’05

SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL
D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio

Page 10: InCoB2007 - August 30, 2007 - HKUST

Speedup of our approach

Test data: 800 sequences, 1000 amino acids

Elapsed times (ms)

Running mode*   Distance matrix   Neighbor joining   Progressive alignment   Overall speedup
I               11,918,672        932,718            333,110                 -
II              10,387,046        881,125            338,016                 1.14
III              9,656,750        880,969            327,985                 1.21
IV               7,009,875        511,047            252,984                 1.70
V                5,900,891        473,359            253,188                 1.98
VI               5,472,407        474,109            244,672                 2.12

*Note: Running modes are defined as follows: (I) ClustalW without optimization, (II) ClustalW with optimization, (III) ClustalW with optimization and our assist, (IV) MT-ClustalW without optimization, (V) MT-ClustalW with optimization, (VI) MT-ClustalW with optimization and our assist.

Data set: protein sequences from NCBI. Run time: from 3 h 40 min down to 1 h 43 min.

Page 11: InCoB2007 - August 30, 2007 - HKUST

ClustalW

Figure: Speedup of the optimized versions of ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids. X axis: number of sequences (200 to 800). Y axis: speedup (%), from 10% to 26%. Series: len800 and len1000, each with only compiler optimization and with optimization plus our assist.

Page 12: InCoB2007 - August 30, 2007 - HKUST

Multithreaded ClustalW

Figure: Speedup of the optimized versions of MT-ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids. X axis: number of sequences (200 to 800). Y axis: speedup (%), from 95% to 115%. Series: len800 and len1000, each with only compiler optimization and with optimization plus our assist.

Page 13: InCoB2007 - August 30, 2007 - HKUST

Comparison

                      ClustalW-MPI   Parallel MSA   SGI                         ClustalW-MTV
Number of sequences   500            80             600                         600
Sequence length       1100           289-399        390                         400
Machine               PC cluster     PC cluster     Single PC (shared memory)   Single PC (shared memory)
Processors            2              2              2                           2
Speedup               1.75x          1.8x           1.8x                        2.25x

Why is the speedup over 2x? Because of the special unit in the new CPU.

Does the special unit normally work with common software? No, we have to activate it.

Page 14: InCoB2007 - August 30, 2007 - HKUST

Speedup > 2x for dual-CPU? [1]

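Amdahl’s Law

$$S = \frac{1}{(1 - f) + \frac{f}{k}}$$

Here S is the overall speedup, f is the fraction of the original program's run time that is improved, and k is the speedup of that fraction.

Original program: a fraction 1-f that is left unchanged, followed by a fraction f that is improved.

Modified program: the fraction 1-f is unchanged, and the fraction f runs k times faster.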

Page 15: InCoB2007 - August 30, 2007 - HKUST

Speedup > 2x for dual-CPU? [2]

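$$\mathrm{Speedup}_{total} = \mathrm{Speedup}_{opt} \times \mathrm{Speedup}_{mt}$$

$$\mathrm{Speedup}_{total} = 1.21 \times 1.70 \approx 2.06$$

Optimization alone: speedup 1.21

Multithreading alone: speedup 1.70

Data set: 800 sequences, 1000 amino acids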

Page 16: InCoB2007 - August 30, 2007 - HKUST

Our strategies

Step 1: Analyzing and profiling. Find the software structure and where the bottleneck is.

Step 2: Applying the methodologies. Multithreading and vectorizing (one of the optimization methods).

Step 3: Validating. Compare the result with the original one to make sure the result has not changed.

Page 17: InCoB2007 - August 30, 2007 - HKUST

Strategy: Multithreading

The proposed multithreading strategy:

Improve the bottleneck of the software, which is the non-threaded part.

Raise the throughput of the program by applying the multithreading strategy.

Reduce the overhead of thread creation.

Page 18: InCoB2007 - August 30, 2007 - HKUST

Profile the software

Figure: profile produced by Intel Thread Profiler, with regions labeled distance matrix, neighbor joining, and progressive alignment.

Page 19: InCoB2007 - August 30, 2007 - HKUST

Implementation

Apply the thread library to this loop.

Page 20: InCoB2007 - August 30, 2007 - HKUST

Trick

Reduce thread creation overhead: 4 threads are created, and the parameters are distributed among them.

T1   T2   T3   T4
P1   P2   P3   P4
P5   P6   P7   P8
P9   P10  P11  P12

Page 21: InCoB2007 - August 30, 2007 - HKUST

Strategy: Vectorizing

Proposed optimizing and vectorizing methodology:

Find the frequently used functions in the program.

Apply the loop optimization methodologies.

Use the Intel C++ Compiler to optimize the code, with the vectorizing option enabled.

Page 22: InCoB2007 - August 30, 2007 - HKUST

Frequently used functions

Profiled by Intel VTune

Function       Clockticks (%)   Methodology*
diff           33.36            A, B
prfscore       15.93            C
forward_pass   14.91            -
calc_score     12.93            D
reverse_pass   11.45            A
pdiff           5.85            -

*Note: A is loop reversal, B is loop fission, C is type casting, and D is procedure call reduction.

Page 23: InCoB2007 - August 30, 2007 - HKUST

Loop Reversal

Loop reversal runs a loop backward. Reversing a for loop is legal when its iterations do not depend on one another, as in this loop, where each iteration only writes its own HH[i] and DD[i]. The two loops below leave HH and DD in the same state.

for (i=se2;i>0;i--){
    HH[i] = -1;
    DD[i] = -1;
}

for (i=1;i<=se2;i++){
    HH[i] = -1;
    DD[i] = -1;
}

Page 24: InCoB2007 - August 30, 2007 - HKUST

Loop Fission

A single loop can be broken into two or more smaller loops. Loop fission can separate the block of conditionally executed statements from the rest of the loop body.

Single loop:

for (j=0;j<=N;j++){
    hh = HH[j] + RR[j];
    if (hh>=midh)
        if (HH[j]!=DD[j]&&RR[j]==SS[j]) {
            midh=hh;
            midj=j;
        }
}

After fission:

for (j=0;j<=N;j++){
    temp[j] = HH[j] + RR[j];
}

for (j=0;j<=N;j++){
    if (temp[j]>=midh)
        if (HH[j]!=DD[j]&&RR[j]==SS[j]) {
            midh=temp[j];
            midj=j;
        }
}

Page 25: InCoB2007 - August 30, 2007 - HKUST

Limitation

Available compilers and programming languages:

C/C++: Intel C++ compiler (Windows, Linux, Mac)

Fortran: Intel Fortran compiler (Windows, Linux, Mac)

Available processors:

CPU with Hyper-Threading technology or above (Intel, AMD)

Page 26: InCoB2007 - August 30, 2007 - HKUST

Conclusion

Generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++

Proposed framework: multithreading and vectorizing strategies

Higher speedup by taking advantage of multicore architecture technology

The proposed optimization could be more appropriate than parallelization on a small cluster computer

Page 27: InCoB2007 - August 30, 2007 - HKUST

Questions?

Thank you
