Page 1

Pair-wise Sequence Alignment

•What happened to the sequences of similar genes?
  - random mutation
  - deletion, insertion

Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG
            EVI   ++ P++ ++DV+SY
Seq. 2: 451 EVI---EHKPYNHKADVFSYA

•Homology vs. similarity

•What is pair-wise sequence alignment?

•Why pair-wise alignment?

Page 2

Some concepts

•Optimal alignment

•Global alignment

•Gaps

•Local alignment

•Gap penalty

•Substitution matrix

Page 3

Dotplot

•What a dotplot shows

•What a dotplot does not show

•A simplified representation
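
As a rough illustration of the simplified representation, a dotplot can be built by marking every cell (i, j) where symbol i of one sequence equals symbol j of the other. This is a minimal sketch that assumes exact identity with no window or threshold; the function name `dotplot` and the marker characters are illustrative choices, not taken from the slides.

```python
def dotplot(x, y):
    """Return one string per symbol of x; '*' marks positions where
    that symbol equals the corresponding symbol of y, '.' otherwise."""
    return ["".join("*" if a == b else "." for b in y) for a in x]


if __name__ == "__main__":
    rows = dotplot("ACTCG", "ACAGTAG")
    print("  ACAGTAG")
    for symbol, row in zip("ACTCG", rows):
        print(symbol, row)
```

Runs of '*' along a diagonal suggest a shared region; the plot by itself gives no score and no explicit alignment.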

Page 4

Sequence Alignment

•Dynamic programming
  - a method for some optimization problems
  - determine a scoring scheme
  - best solution based on a scoring scheme

•Total number of possible alignments for length n ~ 2^(2n) / sqrt(2n)

•Needleman-Wunsch - global
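
To get a feel for why exhaustive enumeration is hopeless, the sketch below counts alignments under one common convention (an assumption here, since the slide does not say how it counts): the central binomial coefficient C(2n, n), which Stirling's formula approximates as 2^(2n) / sqrt(pi n). The constant under the square root differs from the figure on the slide, but the exponential growth in n is the same. The function names are illustrative.

```python
import math


def count_alignments(n):
    """Central binomial coefficient C(2n, n), used here as the count
    of global alignments of two length-n sequences (an assumption)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Stirling approximation of C(2n, n): 4^n / sqrt(pi * n)."""
    return 4 ** n / math.sqrt(math.pi * n)


if __name__ == "__main__":
    for n in (5, 10, 50, 100):
        print(n, count_alignments(n), f"{stirling_estimate(n):.3e}")
```

Dynamic programming sidesteps this growth by filling a table with (n+1) x (n+1) cells instead.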

Page 5

Questions

•How does it work?

•How to come up with a DP approach to an exponential problem?

•How to implement a DP approach?

Page 6

Dynamic Programming Algorithm

•Break a problem into subproblems

•Solve each subproblem separately

F(i,j) = max { F(i-1,j-1) + s(xi, yj),
               F(i,j-1) + g,
               F(i-1,j) + g }

s(xi, yj) : substitution score for aligning xi with yj

g : gap penalty

F(i,j) : The max score for aligning 1st i symbols of sequence 1 with 1st j symbols of sequence 2
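
The recurrence for F(i,j) can be sketched as a table fill. This is a minimal illustration rather than a reference implementation: the function name `nw_matrix` is made up here, s(xi, yj) is reduced to a simple match/mismatch score, and the default values are the Match/Mismatch/Gap scores of the example deck's ACTCG vs. ACAGTAG exercise.

```python
def nw_matrix(x, y, match=1, mismatch=0, gap=-1):
    """Fill the global-alignment table F, where F[i][j] is the best
    score for aligning the first i symbols of x with the first j of y."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned against nothing is all gaps.
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Matrix filling: each cell depends on its three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # xi aligned with yj
                          F[i][j - 1] + gap,     # yj aligned with a gap
                          F[i - 1][j] + gap)     # xi aligned with a gap
    return F


if __name__ == "__main__":
    for row in nw_matrix("ACTCG", "ACAGTAG"):
        print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right cell holds the optimal global score; the fill takes time proportional to the number of cells.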

Page 7

Example

•Initialization

•Matrix filling (scoring)

•Trace back

ACTCG
ACAGTAG

Match: 1
Mismatch: 0
Gap: -1

Page 8

          A   C   A   G   T   A   G
      0  -1  -2  -3  -4  -5  -6  -7   i=0
A    -1   1   0  -1  -2  -3  -4  -5   i=1
C    -2   0   2   1   0  -1  -2  -3   i=2
T    -3  -1   1   2   1   1   0  -1   i=3
C    -4  -2   0   1   2   1   1   0   i=4
G    -5  -3  -1   0   2   2   1   2   i=5

j =   0   1   2   3   4   5   6   7
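
The three steps (initialization, matrix filling, trace back) can be put together as follows. This is a sketch under the example's scoring (match 1, mismatch 0, gap -1); the function name and the tie-breaking order in the trace back (diagonal first, then up, then left) are arbitrary choices made here, so it returns one of possibly several optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # Initialization and matrix filling.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)

    # Trace back from the bottom-right cell to (0, 0).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
    print(score)
    print(top)
    print(bottom)
```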

Page 9

Local Alignment: Smith-Waterman

•Biological significance

F(i,j) = max { F(i-1,j-1) + s(xi, yj),
               F(i,j-1) + g,
               F(i-1,j) + g,
               0 }

•O(n^2) time

Page 10

Local Alignment

        A  A  C  C  T  A  T  A  G  C  T
     0  0  0  0  0  0  0  0  0  0  0  0
G    0  0  0  0  0  0  0  0  0  1  0  0
C    0  0  0  1  1  0  0  0  0  0  2  1
G    0  0  0  0  0  0  0  0  0  1  0  1
A    0  1  1  0  0  0  1  0  1  0  0  0
T    0  0  0  0  0  1  0  2  1  0  0  1
A    0  1  1  0  0  0  2  0  3  2  1  0
T    0  0  0  0  0  1  1  3  2  2  1  2
A    0  1  1  0  0  0  2  2  4  3  2  1

AACCTATAGCT
    ||||
GCGATATA
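
A local alignment of the two sequences in this example can be sketched as below. The slide does not state its scores, so match +1, mismatch -1 and gap -1 are assumptions made here; under them the best local score is 4 and the trace back recovers the TATA / TATA pairing shown in the example. The function name is illustrative.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned part of x, aligned part of y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # first row/column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Trace back from the largest cell until a 0 is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("GCGATATA", "AACCTATAGCT"))
```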

Page 11

Issues in alignment

•Different ways to fill the table

•Multiple optimal alignments

•s(xi, yj) – from substitution matrix

•Gap penalty:
  - Linear: w(k) = gk
  - Affine: w(k) = h + gk for k >= 1; w(k) = 0 for k = 0

Page 12

Gap models

•New gap vs. gap extension

•A gap of length k vs. k gaps of length 1

•1 insertion / deletion event vs. k events

•Gap penalty:
  - Linear: w(k) = gk
  - Affine: w(k) = h + gk for k >= 1; w(k) = 0 for k = 0
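
The difference between the two gap models can be checked numerically. In this small sketch the penalty values h = -3 and g = -1 are arbitrary placeholders chosen for illustration, not values from the slides.

```python
def linear_gap(k, g=-1):
    """Linear model: w(k) = g * k."""
    return g * k


def affine_gap(k, h=-3, g=-1):
    """Affine model: w(k) = h + g * k for k >= 1, and 0 for k = 0."""
    return 0 if k == 0 else h + g * k


if __name__ == "__main__":
    # One gap of length 3 versus three separate gaps of length 1.
    print(linear_gap(3), 3 * linear_gap(1))    # same cost under the linear model
    print(affine_gap(3), 3 * affine_gap(1))    # the single long gap is cheaper
```

Under the linear model the two cases cost the same; under the affine model the opening cost h is paid once per gap, so a single event of length k is favoured over k separate events.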

Page 13

Affine Gap Penalty

•Aligning 1st i symbols of x with 1st j symbols of y

•What is wrong with the F(i,j) formula if an affine gap penalty (AGP) is used?

•Three matrices
  - M(i, j) : best score when xi is aligned with yj
  - Ix(i, j) : best score when xi is aligned with a gap
  - Iy(i, j) : best score when yj is aligned with a gap

Page 14

DP for global alignment for AGP

M(i, j) = max { M(i-1, j-1) + s(xi, yj),
                Ix(i-1, j-1) + s(xi, yj),
                Iy(i-1, j-1) + s(xi, yj) }

Ix(i, j) = max { M(i-1, j) + h + g,
                 Iy(i-1, j) + h + g,
                 Ix(i-1, j) + g }

Iy(i, j) = max { M(i, j-1) + h + g,
                 Ix(i, j-1) + h + g,
                 Iy(i, j-1) + g }

Page 15

DP for global alignment using AGP

•Initialization
  M(0, 0) = 0
  Ix(i, 0) = h + gi
  Iy(0, j) = h + gj
  all other cases: -∞

•Start at the largest of the three elements M(m, n), Ix(m, n), Iy(m, n)

•Traceback to (0,0)
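
The three-matrix scheme for global alignment with an affine gap penalty can be sketched as a score-only computation (no trace back). The function name, the simple match/mismatch stand-in for s(xi, yj), and the default penalties h = -2, g = -1 are assumptions made for illustration; h and g are negative, matching the sign convention of F(i,j-1) + g.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, h=-2, g=-1, match=1, mismatch=0):
    """Best global score with gap cost w(k) = h + g*k, using M, Ix, Iy."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    # Initialization: only all-gap prefixes are reachable on the borders.
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = h + g * i
    for j in range(1, n + 1):
        Iy[0][j] = h + g * j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # xi aligned with yj
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # xi aligned with a gap: open a new gap or extend one
            Ix[i][j] = max(M[i - 1][j] + h + g, Iy[i - 1][j] + h + g, Ix[i - 1][j] + g)
            # yj aligned with a gap: open a new gap or extend one
            Iy[i][j] = max(M[i][j - 1] + h + g, Ix[i][j - 1] + h + g, Iy[i][j - 1] + g)

    return max(M[m][n], Ix[m][n], Iy[m][n])


if __name__ == "__main__":
    print(affine_global_score("AC", "AGGC"))
    print(affine_global_score("ACTCG", "ACAGTAG", h=0))
```

With h = 0 the affine penalty reduces to the linear one, so the score should agree with the single-matrix F(i,j) recurrence.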

Page 16

DP for local alignment for AGP

M(i, j) = max { M(i-1, j-1) + s(xi, yj),
                Ix(i-1, j-1) + s(xi, yj),
                Iy(i-1, j-1) + s(xi, yj),
                0 }

Ix(i, j) = max { M(i-1, j) + h + g,
                 Iy(i-1, j) + h + g,   // ignored
                 Ix(i-1, j) + g }

Iy(i, j) = max { M(i, j-1) + h + g,
                 Ix(i, j-1) + h + g,   // ignored
                 Iy(i, j-1) + g }

Page 17

DP for Local Alignment for AGP

•Initialization
  M(0, 0) = 0
  Ix(i, 0) = 0
  Iy(0, j) = 0
  all other cases: -∞

•Start at the largest M(i, j), Ix(i, j), Iy(i, j)

•Traceback till M(i, j) = 0

Page 18

Database searching methods

•Need more efficient methods

•Dynamic programming - O(n^2 L), L: size of database

•Why is DP slow?

•Ideas: Regions that are similar are likely to share short identical subsequences

•Quick search for the regions, then check carefully locally

Page 19

FASTA related methods

•Heuristic approximation

1. Quick initial “guess” – common subsequences

•Word, word size (2,6), sensitivity vs. speed

•Which words in the query are also in the target?

•Pre-computed table that stores locations of words – “hashing”

•An example
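
The word-lookup step can be illustrated with a small table that maps each word of the target to its positions. This is a sketch of the general idea only, not of FASTA's actual data structures; the function names and the toy sequences are made up for the illustration.

```python
from collections import defaultdict


def build_word_index(target, k):
    """Pre-computed table: word of length k -> list of start positions in target."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    return index


def word_hits(query, target, k=2):
    """All (query position, target position) pairs sharing an identical word."""
    index = build_word_index(target, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(word_hits("TATA", "CTATAG", k=2))
```

A larger word size k gives fewer hits and a faster search, at the cost of missing weaker similarities.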

Page 20

FASTA related methods

2. Find the region with high population of common words

•Process diagonals, rescore, join regions, using gaps

3. Local alignment (DP) in the region identified

•Use Smith-Waterman method in a band, 32 aa wide around the best score
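
Step 2 can be pictured by grouping word hits by diagonal: a hit at query position i and target position j lies on diagonal j - i, and a diagonal with many hits marks a candidate region. The sketch below only counts hits per diagonal; rescoring, joining regions and the banded DP are left out, and the function names are illustrative.

```python
from collections import Counter


def diagonal_hit_counts(query, target, k=2):
    """Count identical-word hits on each diagonal (target pos - query pos)."""
    positions = {}
    for j in range(len(target) - k + 1):
        positions.setdefault(target[j:j + k], []).append(j)
    counts = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], ()):
            counts[j - i] += 1
    return counts


def best_diagonal(query, target, k=2):
    """Diagonal with the most hits, or None if there are no hits."""
    counts = diagonal_hit_counts(query, target, k)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    print(diagonal_hit_counts("TATA", "CTATAG"))
    print(best_diagonal("TATA", "CTATAG"))
```

A banded local alignment would then be restricted to cells near the chosen diagonal instead of the full table.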

Page 21

Limitation of FASTA

•Speed vs. sensitivity

•Identical words

•Can miss biologically significant similarity
  - some proteins do not share identical a.a.
  - initial step
  - different codons encode the same protein

Page 22

BLAST

•Previous 2 kinds of approaches
  - Theoretically sound
  - Search for common subsequences

1. Word list

•Incorporate similarity measurement for words
  - PAM120
  - e.g. ACDE

•Scan for word occurrences
  - hash table
  - finite state machine

(Stephen F. Altschul et al., Nucleic Acids Research 1997, 25(17) 3389-3402)

Page 23

BLAST

2. Extend words to HSPs (high-scoring segment pairs; locally optimal pairs)

•Find additional words within threshold

•Merge within distance A

3. Select significant HSPs, use DP in banded region
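
Extending a word hit into an HSP can be sketched as an ungapped walk in both directions that stops once the running score falls too far below the best seen so far. This is a simplified illustration, not BLAST's implementation: the identity scoring (+1 / -1), the drop-off parameter `x_drop`, and the function names are all assumptions made here.

```python
def _extend(query, target, i, j, step, match, mismatch, x_drop):
    """Walk from (i, j) in direction step (+1 or -1) without gaps.
    Returns (best score gain, number of positions kept)."""
    best = run = length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        run += match if query[i] == target[j] else mismatch
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break                      # score dropped too far; give up
        i += step
        j += step
    return best, best_len


def extend_seed(query, target, qi, tj, k, match=1, mismatch=-1, x_drop=2):
    """Extend the exact word hit query[qi:qi+k] == target[tj:tj+k].
    Returns (score, (query start, query end), (target start, target end))."""
    right, r_len = _extend(query, target, qi + k, tj + k, +1, match, mismatch, x_drop)
    left, l_len = _extend(query, target, qi - 1, tj - 1, -1, match, mismatch, x_drop)
    score = k * match + left + right
    return score, (qi - l_len, qi + k + r_len), (tj - l_len, tj + k + r_len)


if __name__ == "__main__":
    q, t = "GCGATATA", "AACCTATAGCT"
    score, (qs, qe), (ts, te) = extend_seed(q, t, 4, 4, k=2)
    print(score, q[qs:qe], t[ts:te])
```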

Page 24

Mini Presentations

1. Previous BLAST
2. Major concepts in BLAST
3. Statistical issue
4. Gapped local alignment – Gapped
5. Position-specific scoring matrix (PSSM) – overall idea, architecture, multiple-alignment construction
6. PSSM – target frequency estimation, application to BLAST

(Stephen F. Altschul et al., Nucleic Acids Research 1997, 25(17) 3389-3402)

Page 25

Multiple Sequence Alignment

•Motivation

•What is MSA?

•How do we extend knowledge of pair-wise alignment?

•An example: AGAC, AC, AG

Pair-wise alignments:
AGAC    AGAC    AC
--AC    AG--    AG

Some possibilities:
AG--
--AC
AGAC

•Fix pair-wise alignment and then add?

•Evaluate all the possible alignments of N sequences?

Page 26

Scoring MSA

•Sum of pairs (SP) scoring methods
Given an alignment of N sequences, each of which has length L, in the LxN alignment:
Pair-wise sum for each column, then sum all columns

•Example (c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap,gap) = 0)

AQPILLLV
ALR-LL--
AK-ILLL-
CPPVLILV

SP4 = SP(I,-,I,V) = -2+1-1-2-2-1 = -7
SP = SP1 + SP2 + … + SP8

•SP tends to overweight a single mutation
SP(A,A,A,C) = 0, SP(A,A,A,A) = 6
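
The sum-of-pairs score can be computed directly from its definition. The sketch below uses the costs given in the example (match 1, mismatch -1, gap -2, gap against gap 0); the function names are illustrative.

```python
from itertools import combinations


def pair_cost(a, b, match=1, mismatch=-1, gap=-2):
    """Cost of one pair of symbols within a column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum of the costs of all unordered pairs in one column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """SP score of an alignment given as equal-length rows."""
    return sum(sp_column(col) for col in zip(*alignment))


if __name__ == "__main__":
    rows = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
    print([sp_column(col) for col in zip(*rows)])
    print(sp_score(rows))
```

Changing a single symbol in an otherwise identical column of four moves the column score from 6 to 0, which is the overweighting effect noted above: one mutation is counted once per pair it takes part in.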

Page 27

Extension of DP for N sequences

•Extend F(i,j) for N dimensions

•DP of N dimensions using SP
Time: in the order of (L^N)(2^N - 1)(N^2) ~ O((2L)^N N^2)

Page 28

STAR method

•DP provides the optimal solution but is costly

•Heuristic methods – STAR, CLUSTALW, …

•Progressive alignment

•STAR
  - pair-wise
  - build similarity matrix
  - find a “star” sequence
  - use “star” to align other sequences
  - once a gap, always a gap
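
The first STAR steps (pair-wise scores, similarity matrix, choice of the star) might look like the sketch below. It assumes that the star is the sequence with the largest total pair-wise similarity to all the others, which is one common reading of "find a star sequence"; the scoring values, function names, and toy sequences are illustrative. Merging the pair-wise alignments under the "once a gap, always a gap" rule is not shown.

```python
def nw_score(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment score, keeping only two rows of the DP table."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def pick_star(seqs):
    """Return (index of the star sequence, similarity matrix)."""
    n = len(seqs)
    sim = [[nw_score(seqs[a], seqs[b]) if a != b else 0 for b in range(n)]
           for a in range(n)]
    totals = [sum(row) for row in sim]
    star = max(range(n), key=totals.__getitem__)   # first index wins ties
    return star, sim


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "TTTT"]
    star, sim = pick_star(seqs)
    print(seqs[star])
    for row in sim:
        print(row)
```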

Page 29

STAR method

•Example

Page 30

CLUSTAL family

•What are the disadvantages of STAR method?

•Build similarity tree – “clustering”

•Alignment starts at most similar sequences

1. Pair-wise alignment --> distance matrix

•Fast approximate approach or DP

Page 31

CLUSTALW

2. Construct similarity tree, “the guide tree”

•UPGMA (un-weighted pair-group method using arithmetic average)

3. Progressive alignment

•Start with most similar sequences

•Align group with group using pair-wise alignment

•e.g.
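
UPGMA itself can be sketched in a few lines: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. This shows only the clustering step named above, on a made-up distance matrix; the function name and the nested-tuple tree format are choices made for this illustration.

```python
def upgma(names, dist):
    """names: list of labels; dist: full symmetric distance matrix.
    Returns (tree as nested tuples, list of (left, right, distance) merges)."""
    clusters = [(name, 1) for name in names]        # (subtree, size)
    D = [row[:] for row in dist]
    merges = []
    while len(clusters) > 1:
        n = len(clusters)
        # Closest pair of clusters (first pair wins ties).
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda p: D[p[0]][p[1]])
        (ti, si), (tj, sj) = clusters[i], clusters[j]
        merges.append((ti, tj, D[i][j]))
        keep = [k for k in range(n) if k not in (i, j)]
        # Arithmetic average over all members of the two merged clusters.
        new_row = [(si * D[i][k] + sj * D[j][k]) / (si + sj) for k in keep]
        D = [[D[a][b] for b in keep] + [new_row[x]] for x, a in enumerate(keep)]
        D.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((ti, tj), si + sj)]
    return clusters[0][0], merges


if __name__ == "__main__":
    names = ["A", "B", "C", "D"]
    dist = [[0, 2, 6, 10],
            [2, 0, 6, 10],
            [6, 6, 0, 10],
            [10, 10, 10, 0]]
    tree, merges = upgma(names, dist)
    print(tree)
    for step in merges:
        print(step)
```

Reading the tree from the innermost pair outward gives an order in which a progressive aligner could combine sequences and groups.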