Ankit Agrawal , Sanchit Misra, Daniel Honbo, Alok...

Ankit Agrawal, Sanchit Misra, Daniel Honbo, Alok Choudhary Dept. of Electrical Engg. and Computer Science Northwestern University June 21, 2010


Page 1: Ankit Agrawal , Sanchit Misra, Daniel Honbo, Alok …salsahpc.indiana.edu/ECMLS2010/presentation/Agrawal...Ankit Agrawal , Sanchit Misra, Daniel Honbo, Alok Choudhary Dept. of Electrical

Ankit Agrawal, Sanchit Misra, Daniel Honbo, Alok Choudhary

Dept. of Electrical Engg. and Computer Science

Northwestern University

June 21, 2010


• Motivation
• Pairwise Statistical Significance
• MPIPairwiseStatSig
• Experiments and Results
• Future Work


Motivation


Sequence-Comparison Applications

• Multiple Sequence Alignment
• Database Search
• Protein Structure Prediction
• Phylogenetic Tree Construction
• Genome Assembly


Pairwise Local Sequence Alignment

DNA: A, G, C, T

Protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y

KQTGKG
|  |||
KSAGKG

CTGTCG-CTGC
 ||     ||
-TGC-CG-TG-


C   T   G   T   C   G   -   C   T   G   C
-   T   G   C   -   C   G   -   T   G   -
-5  10  10  -2  -5  -2  -5  -5  10  10  -5

Match score: 10
Mismatch score: -2
Gap score: -5

Alignment Score = 11
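The column-by-column scoring on this slide can be reproduced with a short sketch. The function name `score_alignment` is hypothetical and not from the presentation; the match, mismatch, and gap values are the ones given on the slide, and gaps are written as ASCII hyphens.

```python
def score_alignment(top, bottom, match=10, mismatch=-2, gap=-5):
    """Score an already-aligned pair of rows, one column at a time."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    scores = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            scores.append(gap)       # one residue is aligned to a gap
        elif a == b:
            scores.append(match)     # identical residues
        else:
            scores.append(mismatch)  # different residues
    return scores, sum(scores)


column_scores, total = score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-")
print(column_scores)  # [-5, 10, 10, -2, -5, -2, -5, -5, 10, 10, -5]
print(total)          # 11
```

Four matches, two mismatches, and five gap columns give 40 - 4 - 25 = 11, which agrees with the alignment score shown above.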


BLOSUM62 substitution matrix (source: http://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt)


Central application of pairwise sequence alignment

• Identifying related sequence pairs (evolved from a common ancestor, also known as homologs)
• More related sequence pairs should have higher alignment scores


Alignment Score Distribution

Alignment score distribution depends on:

• Alignment program
• Scoring scheme
• Sequence lengths
• Sequence compositions

[Figure: P-value] x < y, but x is more statistically significant than y.

Compared to alignment score, statistical significance is a better indicator of biological significance.


[Karlin and Altschul, 1990] described a rigorous statistical theory for ungapped alignment scores, which follow an extreme value distribution (EVD). In the limit of large sequence lengths m and n, the statistics of HSP (high-scoring segment pair, corresponding to local sequence alignment) scores are characterized by two parameters, K and λ:

$$E = K m n \, e^{-\lambda x}$$

$$\Pr(S > x) \approx 1 - e^{-E}$$

The statistical parameters K and λ characterize the EVD curve.
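As a minimal sketch, the two formulas can be evaluated directly. The function names and the numeric values of K and λ below are illustrative assumptions, not values from the presentation.

```python
import math


def karlin_altschul_evalue(x, m, n, K, lam):
    """E = K * m * n * exp(-lambda * x)."""
    return K * m * n * math.exp(-lam * x)


def karlin_altschul_pvalue(x, m, n, K, lam):
    """Pr(S > x) ~ 1 - exp(-E), computed with expm1 for accuracy at small E."""
    e = karlin_altschul_evalue(x, m, n, K, lam)
    return -math.expm1(-e)


# Hypothetical parameters, chosen only to show the shape of the curve.
K_ASSUMED, LAMBDA_ASSUMED = 0.1, 0.3
for score in (20, 40, 60):
    p = karlin_altschul_pvalue(score, m=200, n=200, K=K_ASSUMED, lam=LAMBDA_ASSUMED)
    print(score, p)
```

For small E the P-value is approximately equal to E, so it decays exponentially with the score at rate λ.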


Statistical Significance of Local Alignment Scores

[Figure labels: Statistical/Biological Significance; P-value]

P-value of an alignment score: the probability that an alignment with this score or higher occurs by chance alone.
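Read literally, this definition suggests a simple empirical estimate: the fraction of chance scores that reach the observed score. The sketch below assumes a sample of chance scores is already available; the function name and the numbers are made up for illustration.

```python
def empirical_pvalue(observed, chance_scores):
    """Fraction of chance scores that are at least as high as the observed score."""
    if not chance_scores:
        raise ValueError("need at least one chance score")
    hits = sum(1 for s in chance_scores if s >= observed)
    return hits / len(chance_scores)


# Made-up chance scores, for illustration only.
chance = [3, 5, 11, 12, 7, 2, 9, 4, 6, 8]
print(empirical_pvalue(11, chance))  # 0.2
```

With N chance scores this estimate cannot resolve P-values below 1/N, which is one reason to fit a parametric distribution such as the EVD to the scores instead.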


Characteristics of a good pairwise-alignment-based sequence-comparison strategy

Sequence-Specificity
• Report statistical significance taking into account the properties and features of the specific sequence pair being aligned.

Statistical Significance Accuracy
• Accurate estimation of P-values for high scores in the right tail region.

Retrieval Accuracy
• Ability to identify related sequences.
• Should assign lower P-values to pairs of related sequences than to pairs of unrelated sequences.

Speed
• Fast enough to be usable in practice.


• Dependent on the database.
• Not sequence specific.
• BLAST2.0 [Altschul et al., 1997]: Likelihood that a similarity as good or better would be obtained by two random sequences with average amino-acid composition and lengths similar to the sequences that produced the score.
• FASTA [Pearson, 2000]: Expectation that a sequence would obtain a similarity score against an unrelated sequence drawn at random from the sequence database that was searched.


• Database-independent
• Sequence specific
• Useful to evaluate alignments generated by an alignment program, independent of any database.
• Better retrieval accuracy than database statistical significance.
• Good for a few sequence pairs, but would be extremely slow for a large number of sequence pairs.


Pairwise Statistical Significance vs. Database Statistical Significance

Agrawal and Huang (2009) Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices, IEEE/ACM Transactions on Computational Biology and Bioinformatics


Acceleration of PSS estimation using HPC methods

• All-pervasive application of sequence alignment methods in bioinformatics
• Ever-increasing sequence data
• PSS (pairwise statistical significance) gives biologically better estimates of statistical significance than DSS (database statistical significance)
• PSS is too slow when applied to many sequence pairs


Pairwise Statistical Significance


Pairwise Statistical Significance

The PairwiseStatSig function takes Seq1, Seq2, a scoring scheme (SC), and the number of shuffles (N). It:

• Generates a score distribution by aligning Seq1 with N shuffled versions of Seq2 using scoring scheme SC
• Fits an EVD to the empirical score distribution using censored maximum likelihood fitting
• Reports the pairwise statistical significance estimate of the pairwise alignment score between Seq1 and Seq2 using the EVD formula for P-value with estimated K and λ
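The three steps can be sketched end to end under simplifying assumptions. This is not the authors' implementation: the function names are hypothetical, the scoring scheme is a plain match/mismatch score with a linear gap cost, and the EVD is fitted by the method of moments as a stand-in for the censored maximum likelihood fitting named on the slide.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def local_alignment_score(a, b, match=10, mismatch=-2, gap=-5):
    """Best Smith-Waterman local score (assumed scoring, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def pairwise_stat_sig(seq1, seq2, n_shuffles=200, seed=0):
    """Return (observed score, lambda, K, P-value) for one sequence pair."""
    rng = random.Random(seed)
    observed = local_alignment_score(seq1, seq2)

    # Step 1: score distribution from Seq1 vs. shuffled versions of Seq2.
    letters = list(seq2)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(local_alignment_score(seq1, "".join(letters)))

    # Step 2: fit a Gumbel-type EVD by the method of moments
    # (assumption: simpler substitute for censored maximum likelihood).
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        raise ValueError("shuffled scores have no spread; cannot fit an EVD")
    lam = math.pi / (sd * math.sqrt(6))
    mu = statistics.mean(null_scores) - EULER_GAMMA / lam
    K = math.exp(lam * mu) / (len(seq1) * len(seq2))

    # Step 3: P-value from E = K*m*n*exp(-lambda*x) and P = 1 - exp(-E).
    e_value = math.exp(-lam * (observed - mu))
    return observed, lam, K, -math.expm1(-e_value)


seq = "ACGTACGTTGCAAGCT" * 2
score, lam, K, p_value = pairwise_stat_sig(seq, seq)
print(score, p_value)
```

Each call performs N + 1 alignments, so the alignment loop is where most of the work in this sketch goes. The slide uses N = 1000 in its experiments; the default of 200 here only keeps the example quick.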


Pairwise Statistical Significance

Execution time break-up for different stages of pairwise statistical significance estimation


MPIPairwiseStatSig


Experiments and Results


• Sequence pairs of length 100, 200, 400, 800, 1600.
• Number of processors: 2, 4, 8, 16, 32, 64.
• Processors: 2.8 GHz dual Intel Xeon nodes.
• Substitution matrix: BLOSUM50.
• Gap penalty: 10 + 2k for a gap of length k.
• Number of shuffles: N = 1000.
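To make the setup concrete, the sketch below evaluates the gap penalty and gives a rough count of dynamic-programming cells per sequence pair. The cell count assumes a full m x n alignment table for each of the N shuffles, which the slides do not state explicitly; the function names are hypothetical.

```python
def gap_penalty(k, gap_open=10, gap_extend=2):
    """Affine gap penalty from the setup: 10 + 2k for a gap of length k."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * k


def dp_cells(m, n, n_shuffles=1000):
    """Assumed workload: one full m x n table per shuffled alignment."""
    return m * n * n_shuffles


print(gap_penalty(1), gap_penalty(3))  # 12 16
for length in (100, 200, 400, 800, 1600):
    print(length, dp_cells(length, length))
```

Under this assumption, doubling the sequence length quadruples the work per pair, from 10^7 cells at length 100 to about 2.56 x 10^9 cells at length 1600.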


Future Work

Improving MPIPairwiseStatSig
• Reducing running time
  • Explore combining inter- and intra-task parallelism using other HPC techniques such as FPGAs, GPUs, etc.
  • Use heuristics to reduce the time for constructing the alignment score distribution.

MPIPairwiseStatSig Applications
• Any application that requires judging the relatedness of two sequences based on sequence data alone
  • Database search
  • Progressive multiple sequence alignment
  • Phylogenetic tree construction


Thank You!
