Biological Sequence Comparison: Dynamic Programming...


Biological Sequence Comparison: Dynamic Programming Algorithms

Similarity Score Matrices

William R. Pearson

Algorithms for Biological Sequence Comparison

algorithm         value calculated          scoring matrix  gap penalty            time required                 reference
Needleman-Wunsch  global similarity         arbitrary       penalty/gap q          O(n^2)                        Needleman and Wunsch, 1970
Sellers           (global) distance         unity           penalty/residue r k    O(n^2)                        Sellers, 1974
Smith-Waterman    local similarity          Sij < 0.0       affine q + r k         O(n^2), optimal               Smith and Waterman, 1981; Gotoh, 1982
SRCHN             approx. local similarity  Sij < 0.0       penalty/gap            O(n)-O(n^2), lookup-diagonal  Wilbur and Lipman, 1983
FASTA             approx. local similarity  Sij < 0.0       limited size q + r k   O(n^2)/K, lookup-rescan       Lipman and Pearson, 1985; Pearson and Lipman, 1988
BLASTP            maximum segment score     Sij < 0.0       multiple segments      O(n^2)/K, DFA-extend          Altschul et al., 1990
BLAST2.0          approx. local similarity  Sij < 0.0       q + r k                O(n^2)/K, lookup-extend       Altschul et al., 1997


The sequence alignment problem:

Three ways to align PMILGYWNVRGL with PPYTIVYFPVRG (":" marks identities; a second panel repeats the same alignments with "." added for similar residues):

PMILGYWNVRGL      PMILGYWNVRGL      PM-ILGYWNVRGL
:                 :    :  :::       :  :  :  :::
PPYTIVYFPVRG     PPYTIVYFPVRG      PPYTIV-YFPVRG

[Figure: dot matrix of PMILGYWNVRGL (across) against PPYTIVYFPVRG (down); X marks identities, x marks similar residues.]

Global:
-PMILGYWNVRGL
 :.  .:. :::
PPYTIVYFPVRG-

Local:
AAAAAAAPMILGYWNVRGLBBBBB
       :.  .:. :::
XXXXXXPPYTIVYFPVRGYYYYYY

Algorithms for Biological Sequence Comparison

                                                        Global   Local   Distance
HBHU vs    HBHU    Hemoglobin beta-chain - human           725     725          0
           HAHU    Hemoglobin alpha-chain - human          314     322        152
           MYHU    Myoglobin - Human                       121     166        212
           GPYL    Leghemoglobin - Yellow lupin              8      43        239
           LZCH    Lysozyme precursor - Chicken           -107      32        220
           NRBO    Pancreatic ribonuclease - Bovine       -124      31        280
           CCHU    Cytochrome c - Human                   -160      26        321

MCHU vs    MCHU    Calmodulin - Human                      671     671          0
           TPHUCS  Troponin C, skeletal muscle             395     438        161
           PVPK2   Parvalbumin beta - Pike                 -57     115        313
           CIHUH   Calpain heavy chain - Human           -2085     100       2463
           AQJFNV  Aequorin precursor - Jelly fish         -65      76        391
           KLSWM   Calcium binding protein - Scallop       -89      52        323

QRHULD vs  EGMSMG  EGF precursor                          -591     655       2549


Genomic Alignments

• nw/sw/lalign - dynamic programming - O(n^2)
• searchn - lookup on diagonals - O(n)-O(n^2)
• fasta - lookup on diagonals/rescan - O(n^2)/K
• blast - DFA, extend - O(n^2)/K
• blastz - lookup/extend
• ssaha, blat, waba - lookup
• mummer, avid - Suffix tree alignment
• dialign, glass, lagan

Algorithms for Global and Local Similarity Scores

Global:

Local:


+1 : match
-1 : mismatch
-2 : gap

Global and Local Alignment Paths

Global score matrix (top sequence across, side sequence down):

      A    B    D    D    E    F    G    H    I
A     1   -1   -1   -1   -1   -1   -1   -1   -1
B    -1    2    0   -2   -2   -2   -2   -2   -2
D    -1    0    3    1   -1   -3   -3   -3   -3
E    -1   -2    1    2    2    0   -2   -4   -4
G    -1   -2   -1    0    1    1    1   -1   -3
K    -1   -2   -3   -2   -1    0    0    0   -2
H    -1   -2   -3   -4   -3   -2   -1    1   -1
I    -1   -2   -3   -4   -5   -4   -3   -1    2

Optimum global alignment (score: 2)
A B D D E F G H I (top)
A B D - E G K H I (side)
or
A B - D E G K H I

Local score matrix (top sequence across, side sequence down):

      A    B    D    D    E    F    G    H    I
A     1    0    0    0    0    0    0    0    0
B     0    2    0    0    0    0    0    0    0
D     0    0    3    1    0    0    0    0    0
E     0    0    1    2    2    0    0    0    0
G     0    0    0    0    1    1    1    0    0
K     0    0    0    0    0    0    0    0    0
H     0    0    0    0    0    0    0    1    0
I     0    0    0    0    0    0    0    0    2

Optimal local alignment (score 3):
A B D (top)
A B D (side)
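The two matrices can be recomputed with a few lines of dynamic programming. The following is a minimal sketch, not the author's code: the function name `dp_matrix` is hypothetical, and it assumes a zero boundary row and column for both the global and the local case, which is what the matrix values on this slide appear to use. The local variant additionally floors every cell at 0.

```python
def dp_matrix(top, side, local=False, match=1, mismatch=-1, gap=-2):
    """Fill a similarity matrix with a linear gap penalty.

    Assumption: the boundary row and column are 0 in both modes
    (leading end gaps are free), as the slide's numbers suggest.
    """
    rows, cols = len(side) + 1, len(top) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if side[i - 1] == top[j - 1] else mismatch
            best = max(m[i - 1][j - 1] + s,   # align the two residues
                       m[i - 1][j] + gap,     # gap in the top sequence
                       m[i][j - 1] + gap)     # gap in the side sequence
            m[i][j] = max(0, best) if local else best
    # Drop the boundary so rows and columns match the printed matrices.
    return [row[1:] for row in m[1:]]

top, side = "ABDDEFGHI", "ABDEGKHI"
global_m = dp_matrix(top, side)
local_m = dp_matrix(top, side, local=True)
global_score = global_m[-1][-1]                  # bottom-right cell
local_score = max(max(row) for row in local_m)   # best cell anywhere
print(global_score, local_score)
```

The global score is read from the bottom-right cell, while the local score is the largest cell anywhere in the matrix.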

[Figure: blank global and local alignment grids for A C G T (across) against A C G G T (down). Scoring: +1 match, -1 mismatch, -2 gap.]


Smith-Waterman Space, Time Requirements

score - space: O(n); time: O(n^2)

alignment - space: O(n); time: O(n^2)
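The O(n) space bound for the score follows from the fact that each row of the matrix depends only on the row before it. A minimal sketch, assuming the +1/-1/-2 scoring used earlier in the deck (the function name is hypothetical); recovering the alignment itself in linear space needs a divide-and-conquer scheme that is not shown here.

```python
def sw_score_linear_space(a, b, match=1, mismatch=-1, gap=-2):
    """Best local similarity score, keeping only two rows in memory."""
    if len(b) > len(a):
        a, b = b, a                   # store rows for the shorter sequence
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:                       # one pass per residue: O(n^2) time overall
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur                    # the older row is no longer needed
    return best

print(sw_score_linear_space("ABDDEFGHI", "ABDEGKHI"))
```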

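FAST alignment by lookup

          1       9
GT8.7     KITQSNATQ
           .::. :::
XURT8C    LLTQTRATQ
          1       9

1. Scan query, build 2 tables:

Word table (400 entries): last query position of each dipeptide, -1 if absent

AT  7
IT  2
KI  1
LL -1
LT -1
NA  6
QS  4
QT -1
SN  5
TR -1
TQ  8

Position table (n-1 entries): earlier occurrence of the same word, -1 if none

1 -1
2 -1
3 -1
4 -1
5 -1
6 -1
7 -1
8  3

[Figure: the library words LL LT TQ QT TR RA AT TQ are looked up against the query K I T Q S N A T Q and the hits are tallied by diagonal (offsets 8 ... 0 ... 8); tallies shown: 2 2 4 2 5.]

O(n) space
O(n+m) time (if few repeat hits)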
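The lookup step can be sketched as follows. This is an illustration rather than the FASTA implementation: it swaps the slide's two arrays (word table plus position links) for a dictionary of position lists, and the function names are hypothetical. Positions are 1-based, and a diagonal is identified as query position minus library position.

```python
from collections import Counter, defaultdict

def build_lookup(query, ktup=2):
    """Map every ktup-word of the query to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i + 1)
    return table

def diagonal_hits(query, library, ktup=2):
    """Count word hits per diagonal (query position - library position)."""
    table = build_lookup(query, ktup)
    diagonals = Counter()
    for j in range(len(library) - ktup + 1):
        # One dictionary lookup per library word; repeated query words
        # contribute one hit per stored position.
        for i in table.get(library[j:j + ktup], ()):
            diagonals[i - (j + 1)] += 1
    return diagonals

lookup = build_lookup("KITQSNATQ")
hits = diagonal_hits("KITQSNATQ", "LLTQTRATQ")
print(lookup["TQ"], dict(hits))
```

For this pair the main diagonal collects three word hits (TQ, AT, TQ), and the repeated TQ adds one spurious hit on each of the diagonals five positions away, which is the "repeat hits" caveat in the time bound.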


FASTA

1. Identify identical matches (length = ktup)
2. Extend along diagonal (local maximum)
3. Join diagonal segments (DP) (maintain linearity) (optimal sum score)
4. Banded Smith-Waterman

NLPYL-I
..: . :
QVPLVEI

Outcome: one continuous, near-optimal gapped alignment

[Figure: dot plot of N L P Y L I (across) against Q V P L V E I (down) showing the four FASTA steps.]

BLAST

1. neighborhood word hits (word length)
2. extend from diagonal ends (X-drop threshold)
3. report HSP linkages (maintain linearity) (probability)

NL I     NLP     L
.: :     ..:     .
PL I     QVP     E

Outcome: multiple HSPs, multiple linkages; only partially aligned

[Figure: dot plot of N L P Y L I (across) against Q V P L V E I (down) showing the three BLAST steps.]


fasta.bioch.virginia.edu/noptalign


Similarity Scoring Matrices - Summary

• Similarity scoring matrices are “log-odds” matrices, reporting the “odds” that an alignment reflects homology rather than chance

• One can predict evolutionary changes using a simple random model, which can generate mutation frequencies at any evolutionary distance

• The optimal scoring matrix has an evolutionary distance that matches that of the alignment

• Shallower scoring matrices have more information content, or “bits/residue”, and thus can be used to find shorter domains

• Scoring matrices set evolutionary look back times


Scoring Matrices – Concepts

• Where do scoring matrices come from?
  – Transition probabilities and PAMs
  – Scoring matrices as log-odds values (log(p[related]/p[chance]))
• The PAM 250 matrix
• Scoring matrices and information content
• The BLOSUM matrices
• Effective matrices and gap penalties

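DNA transition probabilities – 1 PAM

     a      c      g      t
a  0.99   0.001  0.008  0.001  = 1.0
c  0.001  0.99   0.001  0.008  = 1.0
g  0.008  0.001  0.99   0.001  = 1.0
t  0.001  0.008  0.001  0.99   = 1.0

[Figure: transition diagram among a, c, g, t with edge probabilities 0.99 (unchanged), 0.008 (a-g, c-t) and 0.001 (all other changes).]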


Matrix multiples

M^2 = {                                PAM 2
{0.980, 0.002, 0.016, 0.002},
{0.002, 0.980, 0.002, 0.016},
{0.016, 0.002, 0.980, 0.002},
{0.002, 0.016, 0.002, 0.980}}

M^5 = {                                PAM 5
{0.952, 0.005, 0.038, 0.005},
{0.005, 0.951, 0.005, 0.038},
{0.038, 0.005, 0.952, 0.005},
{0.005, 0.038, 0.005, 0.952}}

M^10 = {                               PAM 10
{0.907, 0.010, 0.073, 0.010},
{0.010, 0.907, 0.010, 0.073},
{0.073, 0.010, 0.907, 0.010},
{0.010, 0.073, 0.010, 0.907}}

M^100 = {                              PAM 100
{0.499, 0.083, 0.336, 0.083},
{0.083, 0.499, 0.083, 0.336},
{0.336, 0.083, 0.499, 0.083},
{0.083, 0.336, 0.083, 0.499}}

M^1000 = {                             PAM 1000
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255},
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255}}

qij = M^20 = {                         PAM 20
{0.828, 0.019, 0.133, 0.019},
{0.019, 0.828, 0.019, 0.133},
{0.133, 0.019, 0.828, 0.019},
{0.019, 0.133, 0.019, 0.828}}
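These powers can be checked by repeated multiplication of the 1 PAM matrix. A minimal sketch in plain Python (helper names are hypothetical; a numerical library would normally be used instead):

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """m multiplied by itself n times, starting from the identity."""
    result = [[float(i == j) for j in range(len(m))] for i in range(len(m))]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# 1 PAM DNA transition probabilities, rows and columns in the order a, c, g, t
PAM1 = [
    [0.99, 0.001, 0.008, 0.001],
    [0.001, 0.99, 0.001, 0.008],
    [0.008, 0.001, 0.99, 0.001],
    [0.001, 0.008, 0.001, 0.99],
]

pam20 = mat_pow(PAM1, 20)
pam1000 = mat_pow(PAM1, 1000)
print([round(x, 3) for x in pam20[0]])
print([round(x, 3) for x in pam1000[0]])
```

Every row still sums to 1 after any number of multiplications, and at large distances all entries drift toward the background frequency of 0.25.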

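Where do scoring matrices come from?

p_i (a,c,g,t) = p_j = 0.25

$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_j}\right)$

q_ij: probability of mutation
p_j: probability of alignment by chance

With the PAM20 values and scores scaled as 10 log10:

$S_{a,a} = 10 \log_{10}\left(\frac{q_{a,a}}{p_a}\right) = 10 \log_{10}\left(\frac{0.828}{0.25}\right) = 5.2$

$S_{a,c} = 10 \log_{10}\left(\frac{q_{a,c}}{p_c}\right) = 10 \log_{10}\left(\frac{0.019}{0.25}\right) = -11.2$

$\lambda_2 = \frac{1}{10 \log_{10}(2)} = 0.33$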
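The arithmetic can be reproduced directly. A short sketch, assuming the scores are scaled as 10 times the base-10 logarithm (the function name is hypothetical):

```python
import math

def log_odds(q, p, scale=10.0):
    """Score = scale * log10(observed probability / chance probability)."""
    return scale * math.log10(q / p)

s_match = log_odds(0.828, 0.25)      # a aligned with a at PAM20
s_mismatch = log_odds(0.019, 0.25)   # a aligned with c at PAM20

# Factor that converts a 10*log10 score into bits: lambda_2 * S = log2(q/p)
lambda_2 = 1 / (10 * math.log10(2))

print(round(s_match, 1), round(s_mismatch, 1), round(lambda_2, 2))
```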


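Two expressions for Sij

Alignment frequency (probability) - Altschul:

$\lambda S_{ij} = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right)$

Transition frequency (probability) - Durbin et al.:

$\lambda S_{ij} = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$

The two agree because the alignment frequency is the transition frequency weighted by the frequency of the starting residue:

$\lambda S_{ij} = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right) = \log\left(\frac{p_i q^{t}_{ij}}{p_i p_j}\right)$

Altschul $q^{a}_{ij} = p_i \times$ Durbin $q^{t}_{ij}$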

Scoring matrices at DNA PAMs - ratios

PAM1 = {     ratio = 1/3.13 = +1/-3     H = 1.90
{ 1.99, -6.23, -6.23, -6.22},
{-6.23,  1.99, -6.23, -6.23},
{-6.23, -6.23,  1.99, -6.23},
{-6.23, -6.23, -6.23,  1.99}}

PAM2 = {     ratio = 1/2.65 = +2/-5     H = 1.82
{ 1.97, -5.24, -5.24, -5.24},
{-5.24,  1.98, -5.24, -5.24},
{-5.24, -5.24,  1.98, -5.24},
{-5.24, -5.24, -5.24, -5.24}}

PAM10 = {    ratio = 1/1.61 = +2/-3     H = 1.40
{ 1.86, -3.00, -3.00, -3.00},
{-3.00,  1.86, -3.00, -3.00},
{-3.00, -3.00,  1.86, -3.00},
{-3.00, -3.00, -3.00,  1.86}}

PAM20 = {    ratio = 1/1.21 = +4/-5     H = 1.05
{ 1.72, -2.09, -2.09, -2.09},
{-2.09,  1.72, -2.09, -2.09},
{-2.09, -2.09,  1.72, -2.09},
{-2.09, -2.09, -2.09,  1.72}}

PAM30 = {    ratio = 1/1 = +1/-1        H = 0.80
{ 1.59, -1.59, -1.59, -1.59},
{-1.59,  1.59, -1.59, -1.59},
{-1.59, -1.59,  1.59, -1.59},
{-1.59, -1.59, -1.59,  1.59}}

blastn (DNA)

PAM45 = {    ratio = 1.23/1 = +5/-4     H = 0.54
{ 1.40, -1.14, -1.14, -1.14},
{-1.14,  1.40, -1.14, -1.14},
{-1.14, -1.14,  1.40, -1.14},
{-1.14, -1.14, -1.14,  1.40}}

fasta (DNA)
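The match and mismatch scores and the H values in these matrices are consistent with a uniform mutation model scored in bits. The sketch below rests on that assumption, which is inferred from the numbers rather than stated on the slide: every base changes with probability 0.01 per PAM, split equally among the three other bases, and all background frequencies are 0.25. The function name is hypothetical.

```python
import math

def dna_pam_scores(n, rate=0.01):
    """Match score, mismatch score (both in bits) and H for a uniform DNA PAM-n model."""
    # Closed form for the n-th power of the uniform 1 PAM matrix:
    # the non-constant eigenvalue is 1 - 4*rate/3.
    decay = (1.0 - 4.0 * rate / 3.0) ** n
    q_same = 0.25 + 0.75 * decay     # probability a base is unchanged
    q_diff = 0.25 - 0.25 * decay     # probability of one specific change
    match = math.log2(q_same / 0.25)
    mismatch = math.log2(q_diff / 0.25)
    # Relative entropy: expected score per position in related sequences.
    h = q_same * match + 3 * q_diff * mismatch
    return match, mismatch, h

for n in (1, 10, 30, 45):
    m, mm, h = dna_pam_scores(n)
    print(n, round(m, 2), round(mm, 2), round(h, 2))
```

Under this model the match-to-mismatch ratio reaches 1/1 near PAM30, which is where the +1/-1 matrix sits in the list.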


Normalized frequencies of the amino acids

Gly  0.089     Arg  0.041
Ala  0.087     Asn  0.040
Leu  0.085     Phe  0.040
Lys  0.081     Gln  0.038
Ser  0.070     Ile  0.037
Val  0.065     His  0.034
Thr  0.058     Cys  0.033
Pro  0.051     Tyr  0.030
Glu  0.050     Met  0.015
Asp  0.047     Trp  0.010

Relative mutabilities of the amino acids

Asn  134     His  66
Ser  130     Arg  65
Asp  106     Lys  56
Glu  102     Pro  56
Ala  100     Gly  49
Thr   97     Tyr  41
Ile   96     Phe  41
Met   94     Leu  40
Gln   93     Cys  20
Val   74     Trp  18

Numbers of accepted mutations

       A    R    N    D    C    Q    E    G    H    I    L
A
R     30
N    109   17
D    154    0  532
C     33   10    0    0
Q     93  120   50   76    0
E    266    0   94  831    0  422
G    579   10  156  162   10   30  112
H     21  103  226   43   10  243   23   10
I     66   30   36   13   17    8   35    0    3
L     95   17   37    0    0   75   15   17   40  253

Numbers of accepted mutations (x10) from closely related sequences. 1572 changes (20x20) tabulated.


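Mutation probability matrix for 1 PAM (values x 10,000)

        A     R     N     D     C     Q     E     G     H     I     L
A    9867     2     9    10     3     8    17    21     2     6     4
R       1  9913     1     0     1    10     0     0    10     3     1
N       4     1  9822    36     0     4     6     6    21     3     1
D       6     0    42  9859     0     6    53     6     4     1     0
C       1     1     0     0  9973     0     0     0     1     1     0
Q       3     9     4     5     0  9876    27     1    23     1     3
E      10     0     7    56     0    35  9865     4     2     3     1
G      21     1    12    11     1     3     7  9935     1     0     1
H       1     8    18     3     1    20     1     0  9912     0     1
I       2     2     3     1     2     1     2     0     0  9872     9
L       3     1     3     0     0     6     1     1     4    22  9947

Mutation probability matrix for 250 PAMs

        A     R     N     D     C     Q     E     G     H     I     L
A      13     6     9     9     5     8     9    12     6     8     6
R       3    17     4     3     2     5     3     2     6     3     2
N       4     4     6     7     2     5     6     4     6     3     2
D       5     4     8    11     1     7    10     5     6     3     2
C       2     1     1     1    52     1     1     2     2     2     1
Q       3     5     5     6     1    10     5     6     3     2     5
E       5     4     7    11     1     9    12     5     6     3     2
G      21     1    12    11     1     3     7  9935     1     0     1
H       1     8    18     3     1    20     1     0  9912     0     1
I       2     2     3     1     2     1     2     0     0  9872     9
L       3     1     3     0     0     6     1     1     4    22  9947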


The PAM250 matrix

Cys  12
Ser   0   2
Thr  -2   1   3
Pro  -1   1   0   6
Ala  -2   1   1   1   2
Gly  -3   1   0  -1   1   5
Asn  -4   1   0  -1   0   0   2
Asp  -5   0   0  -1   0   1   2   4
Glu  -5   0   0  -1   0   0   1   3   4
Gln  -5  -1  -1   0   0  -1   1   2   2   4
His  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
      C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

Pam40

      A    R    N    D    E    I    L
A     8
R    -9   12
N    -4   -7   11
D    -4  -13    3   11
E    -3  -11   -2    4   11
I    -6   -7   -7  -10   -7   12
L    -8  -11   -9  -16  -12   -1   10

Pam250

      A    R    N    D    E    I    L
A     2
R    -2    6
N     0    0    2
D     0   -1    2    4
E     0   -1    1    3    4
I    -1   -2   -2   -2   -2    5
L    -2   -3   -3   -4   -3    2    6

Where do scoring matrices come from?

• Scoring matrices can be designed for different evolutionary distances (less = shallow; more = deep)

• Deep matrices allow more substitution

$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_i p_j}\right)$

q_ij: frequency of replacement in homologs
p_i p_j: frequency of alignment by chance


Where do scoring matrices come from?

q_ij : replacement frequency at PAM40, 250

q_R:N(40)  = 0.000435      p_R = 0.051
q_R:N(250) = 0.002193      p_N = 0.043
                           p_R p_N = 0.002193

$\lambda_2 S_{ij} = \log_2\left(\frac{q_{ij}}{p_i p_j}\right)$        $\lambda_e S_{ij} = \ln\left(\frac{q_{ij}}{p_i p_j}\right)$

$\lambda_2 S_{R:N}(40) = \log_2(0.000435 / 0.002193) = -2.333$

$\lambda_2 = 1/3$;  $S_{R:N}(40) = -2.333 / \lambda_2 = -7$

$\lambda_2 S_{R:N}(250) = \log_2(0.002193 / 0.002193) = 0$

q_ij: frequency of replacement in homologs
p_i p_j: frequency of alignment by chance
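The R:N calculation can be reproduced in a few lines. A minimal sketch using only the frequencies quoted on this slide; the function name and the `units_per_bit` parameter are hypothetical, with 3 units per bit corresponding to 1/3 bit units.

```python
import math

p_r, p_n = 0.051, 0.043                   # background frequencies of Arg and Asn
q_rn = {40: 0.000435, 250: 0.002193}      # replacement frequencies at PAM40, PAM250

def score(q, p_i, p_j, units_per_bit=3):
    """Log-odds score in 1/units_per_bit bit units (1/3 bit by default)."""
    return units_per_bit * math.log2(q / (p_i * p_j))

s40 = score(q_rn[40], p_r, p_n)
s250 = score(q_rn[250], p_r, p_n)
print(round(s40), round(s250))
```

At PAM250 the R:N replacement frequency equals the chance frequency, so the score is 0: the pair is neither favored nor penalized.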

PAM and % identity


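Scoring matrix scale – revisited

$\lambda s_{i,j} = \log\left(\frac{q_{ij}}{p_i p_j}\right)$, so $s_{i,j} = \log\left(\frac{q_{ij}}{p_i p_j}\right) / \lambda$

$s'_{ij} = s_{ij} \times 10 \Rightarrow \lambda' = \lambda / 10$

$\lambda_e$ is the unique positive solution of:

$\sum_{i,j} p_i p_j e^{\lambda_e s_{i,j}} = 1$

alternatively:

$\sum_{i,j} p_i p_j 2^{\lambda_2 s_{i,j}} = 1$ gives $\lambda_2$

$S_{R,N} = 10 \times \log_{10}\left(\frac{q_{R,N}}{p_R p_N}\right) = 10 \times \log_{10}\left(\frac{0.000435}{0.051 \times 0.043}\right) = -7.03$

$\lambda_2 = \frac{1}{10 \log_{10}(2)} = 0.33$

PAM250 scores are in 1/3 bit units (scaled with λ2 = 0.33).
PAM120 and BLOSUM62 scores are scaled in 1/2 bit units (λ2 = 0.5).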
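The scale factor of any scoring matrix can be found numerically from the defining sum. The sketch below uses simple bisection; the function name is hypothetical, and it assumes the matrix has a negative expected score and at least one positive entry, so that a positive root exists. The +1/-3 DNA matrix with uniform base frequencies serves as the example.

```python
def solve_lambda(scores, freqs, base=2.0):
    """Positive root of sum_ij p_i p_j base**(lam * s_ij) = 1, by bisection."""
    n = len(freqs)

    def excess(lam):
        return sum(freqs[i] * freqs[j] * base ** (lam * scores[i][j])
                   for i in range(n) for j in range(n)) - 1.0

    lo, hi = 1e-6, 1.0           # excess(lo) < 0 when the expected score is negative
    while excess(hi) <= 0:       # grow the bracket until the sign changes
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

scores = [[1 if i == j else -3 for j in range(4)] for i in range(4)]
lam = solve_lambda(scores, [0.25] * 4)
print(round(lam, 3))   # bits per score unit for +1/-3
```

With base 2 the result is in bits per score unit; passing `base=math.e` (after importing `math`) would give λe instead.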

Local alignment scores as measures of information

Information content: number of bits required to represent all possibilities with optimal encoding

$H = -\sum_i p_i \log_2 p_i$

Coin (0, 1):
$H = -\sum_{0,1} 0.5 \log_2 0.5 = -\left(0.5(-1) + 0.5(-1)\right) = 1$ bit

DNA (a, c, g, t):
$H = -\sum_{a,c,g,t} 0.25 \log_2 \frac{1}{4} = -\sum_{a,c,g,t} 0.25(-2) = 2$ bits

Protein (20 amino acids):
$H = -\sum_{20aa} p_i \log_2 p_i = 4.19$ bits

Expected score of a scoring matrix:

$E = \sum_{20aa} p_i p_j s_{ij}$
  = -0.84 (PAM250)
  = -1.64 (PAM120)
  = -0.52 (BLOSUM62)

Average information content / position:

$H = \sum_{20aa} q^{a}_{ij} s_{ij} = \sum_{20aa} p_i \sum q^{t}_{ij} s_{ij}$
  = 0.35 (PAM250)
  = 0.98 (PAM120)
  = 0.70 (BLOSUM62)
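The coin and DNA examples follow directly from the entropy formula. A short sketch (the function name is hypothetical):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = entropy_bits([0.5, 0.5])
dna = entropy_bits([0.25] * 4)
uniform_protein = entropy_bits([1 / 20] * 20)
print(coin, dna, round(uniform_protein, 2))
```

Twenty equally frequent amino acids would need log2(20), about 4.32 bits, so the 4.19 bits quoted for real amino-acid frequencies reflects their uneven composition.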


Relative entropy H of PAM matrices

More about scoring matrices ...

PAM series:
• Evolutionary model - extrapolated from PAM1
• PAM20: 20% change (mammals)
• PAM250: 250% change (<20% identity)
• Gap penalties should vary
• shallow matrices (PAM10-40) for short sequences and short distances

BLOSUM series:
• Empirically determined, no extrapolation (no model)
• BLOSUM45-50 - distant (1/3 bits)
• BLOSUM80 - very highly conserved (not small change), high info/position
• BLOSUM62 - 1/2 bits


Scoring Matrices and Gap-penalties - BLAST vs FASTA

BLAST
• default scoring matrix: BLOSUM62 (1/2 bit)
• default gap penalty: -11 (open)/-1 (extend) (lowest -9/-1, -8/-2)

FASTA
• default matrix: BLOSUM50 (1/3 bit)
• default gap penalty: old: -12 (first residue)/-2 = new: -10 (open)/-2 (ext)
• BLOSUM62 -7/-1
• PAM120 -16/-4
• PAM20 -24/-4
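The "old" and "new" FASTA conventions describe the same affine penalty. A minimal sketch of the two bookkeeping styles (function names are hypothetical); it assumes that under the open/extend convention every gapped residue, including the first, pays the extension cost.

```python
def gap_cost_open_extend(k, gap_open=-10, gap_extend=-2):
    """Cost of a gap of k residues: one opening charge plus k extensions."""
    return gap_open + gap_extend * k

def gap_cost_first_residue(k, first=-12, extend=-2):
    """Cost of a gap of k residues: first residue, then k-1 extensions."""
    return first + extend * (k - 1)

# -12 (first residue)/-2 and -10 (open)/-2 (ext) give identical totals.
same = all(gap_cost_open_extend(k) == gap_cost_first_residue(k)
           for k in range(1, 51))
print(same, gap_cost_open_extend(1), gap_cost_open_extend(5))
```

Under the same assumption, the BLAST default of -11/-1 charges -12 for a one-residue gap and -14 for a three-residue gap.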

