Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation...

Post on 25-Dec-2015


Pair-wise Sequence Alignment

•What happened to the sequences of similar genes? Random mutation, deletion, insertion

Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG
            EVI   ++ P++ ++DV+SY
Seq. 2: 451 EVI---EHKPYNHKADVFSYA

•Homology vs. similarity

•What is pair-wise sequence alignment?

•Why pair-wise alignment?

Some concepts

•Optimal alignment

•Global alignment

•Gaps

•Local alignment

•Gap penalty

•Substitution matrix

Dotplot

•What dotplot shows

•What dotplot does not show

•A simplified representation

Sequence Alignment

•Dynamic programming
 - a method for some optimization problems
 - determine a scoring scheme
 - best solution based on a scoring scheme

•Total number of possible alignments of two sequences of length n: $\binom{2n}{n} \approx 2^{2n} / \sqrt{\pi n}$
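
To get a feel for how fast this count grows, here is a small sketch. It assumes the usual textbook count, the central binomial coefficient C(2n, n), and compares it with the Stirling estimate 2^(2n) / sqrt(pi n). The function names are illustrative only.

```python
import math


def exact_alignment_count(n: int) -> int:
    """Central binomial coefficient C(2n, n), the assumed alignment count."""
    return math.comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Stirling approximation of C(2n, n): 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 50, 100):
    exact = exact_alignment_count(n)
    approx = stirling_estimate(n)
    print(f"n={n:3d}  exact={exact:.3e}  estimate={approx:.3e}  ratio={approx / exact:.4f}")
```

Even for n = 100 the count is far too large to enumerate, which is the motivation for dynamic programming.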

•Needleman-Wunsch - global

•Questions

•How does it work?

•How to come up with a DP approach to an exponential problem?

•How to implement a DP approach?

Dynamic Programming Algorithm

•Break a problem into subproblems

•Solve each subproblem separately

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$

$s(x_i, y_j)$ : substitution score for aligning $x_i$ with $y_j$

$g$ : gap penalty

$F(i,j)$ : the max score for aligning the 1st i symbols of sequence 1 with the 1st j symbols of sequence 2

Example

•Initialization

•Matrix filling (scoring)

•Trace back

Sequence 1 (rows, i): ACTCG
Sequence 2 (columns, j): ACAGTAG

Match: 1, Mismatch: 0, Gap: -1

                A    C    A    G    T    A    G
         j=0  j=1  j=2  j=3  j=4  j=5  j=6  j=7
i=0        0   -1   -2   -3   -4   -5   -6   -7
i=1  A    -1    1    0   -1   -2   -3   -4   -5
i=2  C    -2    0    2    1    0   -1   -2   -3
i=3  T    -3   -1    1    2    1    1    0   -1
i=4  C    -4   -2    0    1    2    1    1    0
i=5  G    -5   -3   -1    0    2    2    1    2
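
The three steps of the example (initialization, matrix filling, trace back) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name and the tie-breaking order in the trace back (diagonal first, then up, then left) are assumptions, so other optimal alignments may exist.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (F, aligned_x, aligned_y), where F is the filled score matrix.
    """
    m, n = len(x), len(y)

    # Initialization: first row and column hold cumulative gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap

    # Matrix filling: F(i,j) is the max over the three predecessor cells.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)

    # Trace back from (m, n) to (0, 0), preferring the diagonal move.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = None
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if s is not None and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, a, b = needleman_wunsch("ACTCG", "ACAGTAG")
for row in F:
    print(" ".join(f"{v:3d}" for v in row))
print("score:", F[-1][-1])
print(a)
print(b)
```

With the scoring from the example (match 1, mismatch 0, gap -1) the bottom-right cell is 2, the optimal global score.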

Local Alignment: Smith-Waterman

•Biological significance

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$

•$O(n^2)$ time

Local Alignment

       A  A  C  C  T  A  T  A  G  C  T
    0  0  0  0  0  0  0  0  0  0  0  0
G   0  0  0  0  0  0  0  0  0  1  0  0
C   0  0  0  1  1  0  0  0  0  0  2  1
G   0  0  0  0  0  0  0  0  0  1  0  1
A   0  1  1  0  0  0  1  0  1  0  0  0
T   0  0  0  0  0  1  0  2  1  0  0  1
A   0  1  1  0  0  0  2  0  3  2  1  0
T   0  0  0  0  0  1  1  3  2  2  1  2
A   0  1  1  0  0  0  2  2  4  3  2  1

AACCTATAGCT
    ||||
GCGATATA
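
A minimal sketch of the local-alignment recurrence follows. The slides do not state the scoring used for this table, so match +1, mismatch -1 and gap -1 are assumptions here, as is the function name. Under these assumptions the best cell scores 4 and the trace back recovers the TATA / TATA match shown above.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty (assumed scoring).

    Returns (best_score, aligned_x, aligned_y) for one best local alignment.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("GCGATATA", "AACCTATAGCT"))
```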

Issues in alignment

•Different ways to fill the table

•Multiple optimal alignments

•s(xi, yj) – from substitution matrix

•Gap penalty:
 - linear: $w(k) = gk$
 - affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$

Gap models

•New gap vs. gap extension

•A gap of length k vs. k gaps of length 1

•1 insertion / deletion event vs. k events

•Gap penalty:
 - linear: $w(k) = gk$
 - affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$
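
The difference between the two gap models shows up when one gap of length k is compared with k separate gaps of length 1. The sketch below uses illustrative values h = -3 and g = -1 (assumptions, written as negative scores so they can be added directly to an alignment score).

```python
def linear_gap(k, g=-1):
    """Linear model: w(k) = g * k."""
    return g * k


def affine_gap(k, h=-3, g=-1):
    """Affine model: w(k) = h + g * k for k >= 1, and 0 for k = 0."""
    return 0 if k == 0 else h + g * k


k = 3
print("linear: one gap of length 3 =", linear_gap(k), "| three gaps of length 1 =", k * linear_gap(1))
print("affine: one gap of length 3 =", affine_gap(k), "| three gaps of length 1 =", k * affine_gap(1))
```

Under the linear model both cases cost the same, while the affine model charges the opening cost h once per gap, so a single insertion/deletion event of length k is cheaper than k separate events.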

Affine Gap Penalty

M(i, j) : best score when xi is aligned with yj
Ix(i, j) : best score when xi is aligned with a gap
Iy(i, j) : best score when yj is aligned with a gap

•Aligning the 1st i symbols of x with the 1st j symbols of y

•What is wrong with the F(i,j) formula if an affine gap penalty (AGP) is used?

•Three matrices

DP for global alignment for AGP

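$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \\ I_x(i-1,j) + g \end{cases}$$

$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \\ I_y(i,j-1) + g \end{cases}$$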

DP for global alignment using AGP

•Initialization

M(0, 0) = 0
Ix(i, 0) = h + gi
Iy(0, j) = h + gj
all other cases: -∞

•Start at the largest of M(m, n), Ix(m, n), Iy(m, n)

•Trace back to (0,0)
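
The three-matrix recurrences and the initialization above can be sketched as follows. This version computes only the optimal score and omits the trace back; the function name and the default scores (match 1, mismatch 0, h = -2, g = -1) are assumptions for illustration.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=0, h=-2, g=-1):
    """Score of the best global alignment under w(k) = h + g*k (score only)."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # x_i aligned with y_j
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # x_i aligned with a gap
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # y_j aligned with a gap

    # Initialization: only leading gaps are possible on the borders.
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = h + g * i
    for j in range(1, n + 1):
        Iy[0][j] = h + g * j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Opening a gap costs h + g; extending an existing one costs g.
            Ix[i][j] = max(M[i - 1][j] + h + g, Iy[i - 1][j] + h + g, Ix[i - 1][j] + g)
            Iy[i][j] = max(M[i][j - 1] + h + g, Ix[i][j - 1] + h + g, Iy[i][j - 1] + g)

    # The optimal alignment may end in any of the three states.
    return max(M[m][n], Ix[m][n], Iy[m][n])


print(affine_global_score("ACGT", "ACGT"))
print(affine_global_score("ACGT", "AT"))
```

In the second call the best alignment uses one gap of length 2 (cost h + 2g = -4) between two matches, rather than two separate gaps.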

DP for local alignment for AGP

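$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \\ 0 \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g & \text{(ignored)} \\ I_x(i-1,j) + g \end{cases}$$

$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g & \text{(ignored)} \\ I_y(i,j-1) + g \end{cases}$$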

DP for Local Alignment for AGP

•Initialization

M(0, 0) = 0
Ix(i, 0) = 0
Iy(0, j) = 0
all other cases: -∞

•Start at the largest M(i, j), Ix(i, j), Iy(i, j)

•Trace back until M(i, j) = 0

Database searching methods

•Need more efficient methods

•Dynamic programming - $O(n^2 L)$, L: size of database

•Why is DP slow?

•Idea: regions that are similar are likely to share short identical subsequences

•Quick search for the regions, then check carefully locally

FASTA related methods

•Heuristic approximation

1. Quick initial “guess” – common subsequences

•Word, word size (typically 2 for proteins, 6 for DNA), sensitivity vs. speed

•Which words in the query also occur in the target?

•Pre-computed table that stores locations of words – “hashing”

•An example
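
The word-lookup idea can be sketched with a plain dictionary as the pre-computed table. This is a simplified illustration, not the FASTA program itself: the function names are hypothetical, and the sequences are the two DNA strings from the local-alignment example with word size k = 2. Each shared word votes for the diagonal (target position minus query position) it lies on, and diagonals with many votes mark candidate regions.

```python
from collections import Counter, defaultdict


def build_word_index(seq, k):
    """Map every word of length k in seq to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, target, k=2):
    """Count shared words per diagonal (target position - query position)."""
    index = build_word_index(target, k)
    hits = Counter()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            hits[t - q] += 1
    return hits


hits = diagonal_hits("GCGATATA", "AACCTATAGCT", k=2)
for diagonal, count in hits.most_common():
    print(f"diagonal {diagonal:+d}: {count} shared word(s)")
```

Diagonal 0 collects the most hits here; it is the diagonal that contains the TATA / TATA match.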

FASTA related methods

2. Find the region with a high population of common words

•Process diagonals, rescore, join regions, using gaps

3. Local alignment (DP) in the region identified

•Use the Smith-Waterman method in a band, 32 aa wide, around the best score

Limitation of FASTA

•Speed vs. sensitivity

•Can miss biologically significant similarity
 - Some proteins do not share identical a.a. in the initial step
 - Different codons encode the same protein

•Identical words

BLAST

•Previous 2 kinds of approaches
 - Theoretically sound
 - Search for common subsequences

1. Word list

•Incorporate similarity measurement for words
 – PAM120, e.g. ACDE

•Scan for word occurrences
 - hash table
 - finite state machine

(Stephen F. Altschul et al, Nucleic Acids Research 1997, 25(17) 3389-3402)
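
One way to picture a word list that incorporates a similarity measurement is to keep every word whose score against the query word reaches a threshold. The sketch below is a toy illustration only: the four-letter alphabet, the `toy_score` values (a stand-in for a real matrix such as PAM120) and the threshold are all hypothetical, and brute-force enumeration is used for clarity rather than speed.

```python
from itertools import product

ALPHABET = "ACDE"  # toy alphabet, not the full amino-acid set


def toy_score(a, b):
    """Hypothetical substitution scores standing in for a real matrix."""
    if a == b:
        return 4
    if {a, b} == {"D", "E"}:  # treat D and E as similar residues
        return 2
    return -2


def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold against `word`."""
    result = []
    for cand in product(ALPHABET, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, cand))
        if score >= threshold:
            result.append("".join(cand))
    return result


print(neighborhood("ADE", 8))
print(neighborhood("ADE", 12))
```

Lowering the threshold enlarges the word list, which raises sensitivity at the cost of more hits to examine.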

BLAST

2. Extend words to high-scoring segment pairs (HSPs, locally optimal pairs)

•Find additional words within threshold

•Merge within distance A

3. Select significant HSPs, use DP in banded region

Mini Presentations

1. Previous BLAST
2. Major concepts in BLAST
3. Statistical issue
4. Gapped local alignment – Gapped
5. Position-specific scoring matrix (PSSM) – overall idea, architecture, multiple-alignment construction
6. PSSM – target frequency estimation, application to BLAST

(Stephen F. Altschul et al, Nucleic Acids Research 1997, 25(17) 3389-3402)

Multiple Sequence Alignment

•Motivation

•What is MSA?

•How do we extend knowledge of pair-wise alignment?

•An example: AGAC, AC, AG

Pair-wise alignments:

AGAC    AGAC    AC
--AC    AG--    AG

Some possibilities:

AG--
--AC
AGAC

•Fix pair-wise alignment and then add?

•Evaluate all the possible alignments of N sequences?

Scoring MSA

•Sum of pairs (SP) scoring method
Given an alignment of N sequences, each of which has length L, in the L x N alignment: pair-wise sum for each column, then sum all columns

•Example (c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap, gap) = 0)

AQPILLLV
ALR-LL--
AK-ILLL-
CPPVLILV

SP4 = SP(I, -, I, V) = -2 + 1 - 1 - 2 - 2 - 1 = -7
SP = SP1 + SP2 + … + SP8

•SP tends to overweight a single mutation
SP(A,A,A,C) = 0, SP(A,A,A,A) = 6

Extension of DP for N sequences

•Extend F(i,j) for N dimensions

•DP of N dimensions using SP
Time: on the order of $L^N (2^N - 1) N^2 \sim O((2L)^N N^2)$
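
The sum-of-pairs score can be checked with a few lines of code. The sketch below uses the costs from the example (match 1, mismatch -1, gap -2, gap against gap 0); the function names are illustrative.

```python
from itertools import combinations


def pair_cost(a, b, match=1, mismatch=-1, gap=-2):
    """Cost of one pair of symbols in a column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum of pair costs over all pairs of symbols in one column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """Sum of the column scores over all columns of the alignment."""
    return sum(sp_column(col) for col in zip(*alignment))


alignment = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
print("SP4 =", sp_column("I-IV"))
print("SP(A,A,A,C) =", sp_column("AAAC"), " SP(A,A,A,A) =", sp_column("AAAA"))
print("SP  =", sp_score(alignment))
```

Comparing SP(A,A,A,A) with SP(A,A,A,C) shows the overweighting: a single changed residue moves the column score from 6 to 0.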

STAR method

•DP provides an optimal solution but is costly

•Heuristic methods – STAR, CLUSTALW, …

•Progressive alignment

•STAR
 - pair-wise alignment
 - build similarity matrix
 - find a “star” sequence
 - use the “star” to align the other sequences
 - once a gap, always a gap

•Example
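
The first STAR steps (pair-wise scores, similarity matrix, choice of the “star”) can be sketched as below. The sequences are made up for illustration, the pair-wise score reuses the linear-gap global scoring (match 1, mismatch 0, gap -1), and taking the sequence with the highest total similarity as the star is one common choice, stated here as an assumption. The final merging step (“once a gap, always a gap”) is not shown.

```python
def nw_score(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment score with a linear gap penalty (score only)."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def star_center(seqs):
    """Return (index of the star sequence, pair-wise similarity matrix)."""
    n = len(seqs)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = nw_score(seqs[i], seqs[j])
    totals = [sum(row) for row in sim]  # total similarity to all others
    return max(range(n), key=totals.__getitem__), sim


seqs = ["ACGT", "ACGA", "TCGA"]  # hypothetical input sequences
center, sim = star_center(seqs)
print("similarity matrix:", sim)
print("star sequence:", seqs[center])
```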

CLUSTAL family

•What are the disadvantages of the STAR method?

•Build similarity tree – “clustering”

•Alignment starts at the most similar sequences

CLUSTALW

1. Pair-wise alignment --> distance matrix
Fast approximate approach or DP

2. Construct similarity tree, “the guide tree”
UPGMA (un-weighted pair-group method using arithmetic average)

3. Progressive alignment

•Start with the most similar sequences

•Align group with group using pair-wise alignment

•e.g.
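
A minimal sketch of UPGMA clustering is given below: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. The distance matrix and names are hypothetical, branch lengths are not computed, and the tree is returned as nested tuples.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    # Pair-wise distances, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n

    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Arithmetic average over all members of the merged cluster.
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1

    return next(iter(clusters.values()))[0]


names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, dist))
```

Reading the nested tuples from the inside out gives the merge order, which is the order a progressive alignment would follow: A with B first, then C, then D.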