Biological Sequence Comparison: Dynamic Programming...
Biological Sequence Comparison: Dynamic Programming Algorithms
Similarity Score Matrices
William R. Pearson
Algorithms for Biological Sequence Comparison

algorithm         value calculated          scoring matrix  gap penalty           time required                reference
Needleman-Wunsch  global similarity         arbitrary       penalty/gap q         O(n^2)                       Needleman and Wunsch, 1970
Sellers           (global) distance         unity           penalty/residue r k   O(n^2)                       Sellers, 1974
Smith-Waterman    local similarity          Sij < 0.0       affine q + r k        O(n^2) optimal               Smith and Waterman, 1981; Gotoh, 1982
SRCHN             approx. local similarity  Sij < 0.0       penalty/gap           O(n)-O(n^2) lookup-diagonal  Wilbur and Lipman, 1983
FASTA             approx. local similarity  Sij < 0.0       limited size q + r k  O(n^2)/K lookup-rescan       Lipman and Pearson, 1985; Pearson and Lipman, 1988
BLASTP            maximum segment score     Sij < 0.0       multiple segments     O(n^2)/K DFA-extend          Altschul et al., 1990
BLAST2.0          approx. local similarity  Sij < 0.0       q + r k               O(n^2)/K lookup-extend       Altschul et al., 1997
The sequence alignment problem:

Identities (:) only:

PMILGYWNVRGL
:
PPYTIVYFPVRG

 PMILGYWNVRGL
 :    :  :::
PPYTIVYFPVRG

 PM-ILGYWNVRGL
 :  :  :  :::
PPYTIV-YFPVRG

Identities (:) and similarities (.):

PMILGYWNVRGL
:     .
PPYTIVYFPVRG

 PMILGYWNVRGL
 :. . :. :::
PPYTIVYFPVRG

 PM-ILGYWNVRGL
 :. :. :. :::
PPYTIV-YFPVRG

[Figure: dot-matrix plot of P M I L G Y W N V R G L (horizontal) against P P Y T I V Y F P V R G (vertical); X marks identical residues, x marks similar residues]

Global:

-PMILGYWNVRGL
 :. . :. :::
PPYTIVYFPVRG-

Local:

AAAAAAAPMILGYWNVRGLBBBBB
       :. . :. :::
XXXXXXPPYTIVYFPVRGYYYYYY
Algorithms for Biological Sequence Comparison

                                                     Global   Local  Distance
HBHU vs:
  HBHU    Hemoglobin beta-chain - human                 725     725         0
  HAHU    Hemoglobin alpha-chain - human                314     322       152
  MYHU    Myoglobin - Human                             121     166       212
  GPYL    Leghemoglobin - Yellow lupin                    8      43       239
  LZCH    Lysozyme precursor - Chicken                 -107      32       220
  NRBO    Pancreatic ribonuclease - Bovine             -124      31       280
  CCHU    Cytochrome c - Human                         -160      26       321

MCHU vs:
  MCHU    Calmodulin - Human                            671     671         0
  TPHUCS  Troponin C, skeletal muscle                   395     438       161
  PVPK2   Parvalbumin beta - Pike                       -57     115       313
  CIHUH   Calpain heavy chain - Human                 -2085     100      2463
  AQJFNV  Aequorin precursor - Jelly fish               -65      76       391
  KLSWM   Calcium binding protein - Scallop             -89      52       323

QRHULD vs:
  EGMSMG  EGF precursor                                -591     655      2549
Genomic Alignments
• nw/sw/lalign - dynamic programming - O(n^2)
• searchn - lookup on diagonals - O(n)-O(n^2)
• fasta - lookup on diagonals/rescan - O(n^2)/K
• blast - DFA, extend - O(n^2)/K
• blastz - lookup/extend
• ssaha, blat, waba - lookup
• mummer, avid - suffix tree alignment
• dialign, glass, lagan
Algorithms for Global and Local Similarity Scores
Global:
Local:
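The formulas for this slide did not survive in the transcript. As a hedged reminder only, the usual textbook recurrences with a substitution score s(a_i, b_j) and a single per-residue gap penalty g < 0 (an assumption; the slide may have used the affine form q + r k) are:

```latex
% Global similarity (Needleman-Wunsch style), linear gap penalty g < 0
H_{i,j} = \max \begin{cases}
  H_{i-1,j-1} + s(a_i, b_j) \\
  H_{i-1,j} + g \\
  H_{i,j-1} + g
\end{cases}

% Local similarity (Smith-Waterman style): the same terms, floored at zero
H_{i,j} = \max \begin{cases}
  0 \\
  H_{i-1,j-1} + s(a_i, b_j) \\
  H_{i-1,j} + g \\
  H_{i,j-1} + g
\end{cases}
```

The global score is read from the last cell of the matrix; the local score is the largest value anywhere in it.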
+1 : match; -1 : mismatch; -2 : gap

Global and Local Alignment Paths

Global score matrix:

      A   B   D   D   E   F   G   H   I
A     1  -1  -1  -1  -1  -1  -1  -1  -1
B    -1   2   0  -2  -2  -2  -2  -2  -2
D    -1   0   3   1  -1  -3  -3  -3  -3
E    -1  -2   1   2   2   0  -2  -4  -4
G    -1  -2  -1   0   1   1   1  -1  -3
K    -1  -2  -3  -2  -1   0   0   0  -2
H    -1  -2  -3  -4  -3  -2  -1   1  -1
I    -1  -2  -3  -4  -5  -4  -3  -1   2

Optimum global alignment (score: 2)
A B D D E F G H I (top)
A B D - E G K H I (side)
or
A B - D E G K H I

Local score matrix:

      A   B   D   D   E   F   G   H   I
A     1   0   0   0   0   0   0   0   0
B     0   2   0   0   0   0   0   0   0
D     0   0   3   1   0   0   0   0   0
E     0   0   1   2   2   0   0   0   0
G     0   0   0   0   1   1   1   0   0
K     0   0   0   0   0   0   0   0   0
H     0   0   0   0   0   0   0   1   0
I     0   0   0   0   0   0   0   0   2

Optimal local alignment (score 3):
A B D (top)
A B D (side)
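The two matrices above can be reproduced with a few lines of Python. This is a minimal sketch, not the lecturer's code: the function name `fill_matrix` is made up for illustration, and the zero-filled first row and column are an assumption inferred from the first row of the global matrix (1, -1, -1, ...), which implies that leading gaps are not charged.

```python
def fill_matrix(top, side, match=1, mismatch=-1, gap=-2, local=False):
    """Fill a (len(side)+1) x (len(top)+1) score matrix.

    Row 0 and column 0 stay at zero (assumption: free leading gaps).
    With local=True every cell is floored at zero (Smith-Waterman style).
    """
    H = [[0] * (len(top) + 1) for _ in range(len(side) + 1)]
    for i, s in enumerate(side, start=1):
        for j, t in enumerate(top, start=1):
            diag = H[i - 1][j - 1] + (match if s == t else mismatch)
            best = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            H[i][j] = max(best, 0) if local else best
    return H


top, side = "ABDDEFGHI", "ABDEGKHI"
global_H = fill_matrix(top, side)
local_H = fill_matrix(top, side, local=True)

global_score = global_H[-1][-1]                  # last cell of the matrix
local_score = max(max(row) for row in local_H)   # best cell anywhere
print(global_score, local_score)
```

Recovering the alignments themselves would additionally require storing a traceback pointer per cell, which this sketch leaves out.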
Global Local
[Figure: empty scoring grid for aligning A C G T (top) against A C G G T (side)]
+1 : match; -1 : mismatch; -2 : gap
Smith-Waterman Space, Time Requirements
score - space: O(n); time: O(n^2)
alignment - space: O(n); time: O(n^2)
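The O(n) space bound for the score follows because each row of the matrix depends only on the row above it. A minimal sketch, assuming the simple +1/-1/-2 scoring used in the path example (the name `sw_score` is illustrative); recovering the alignment itself in linear space needs a divide-and-conquer scheme that is not shown here.

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only two rows in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]  # column 0 is always zero for local alignment
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr  # the older row is no longer needed
    return best


print(sw_score("ABDDEFGHI", "ABDEGKHI"))
```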
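FAST alignment by lookup

         1       9
GT8.7    KITQSNATQ
          .::. :::
XURT8C   LLTQTRATQ
         1       9

1. Scan query, build 2 tables:

Word table (400 entries): last query position of each dipeptide, -1 if absent
AT  7
IT  2
KI  1
LL -1
LT -1
NA  6
QS  4
QT -1
SN  5
TR -1
TQ  8

Position table (n-1 entries): previous query position holding the same dipeptide, -1 if none
1 -1
2 -1
3 -1
4 -1
5 -1
6 -1
7 -1
8  3

[Figure: library words LL LT TQ QT TR RA AT TQ are looked up against the query K I T Q S N A T Q, and hits are tallied by diagonal offset, labelled 8 down to 0 and back up to 8]

O(n) space; O(n+m) time (if few repeat hits)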
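The two tables and the diagonal tally can be sketched as follows. This is an illustration under stated assumptions: the names `build_tables` and `diagonal_hits` are hypothetical, a dict stands in for the 400-entry array, and positions are 1-based to match the slide.

```python
from collections import Counter


def build_tables(query, k=2):
    """Return (first, link) for the query.

    first[word] is the last position where word occurs (absent means -1);
    link[pos] is the previous position holding the same word, or -1.
    """
    n_words = len(query) - k + 1
    first = {}
    link = [-1] * (n_words + 1)  # index 0 is unused (1-based positions)
    for pos in range(1, n_words + 1):
        word = query[pos - 1:pos - 1 + k]
        link[pos] = first.get(word, -1)
        first[word] = pos
    return first, link


def diagonal_hits(query, library, k=2):
    """Count word hits per diagonal (library position minus query position)."""
    first, link = build_tables(query, k)
    hits = Counter()
    for lib_pos in range(1, len(library) - k + 2):
        pos = first.get(library[lib_pos - 1:lib_pos - 1 + k], -1)
        while pos != -1:  # follow the chain of repeated words
            hits[lib_pos - pos] += 1
            pos = link[pos]
    return hits


first, link = build_tables("KITQSNATQ")
hits = diagonal_hits("KITQSNATQ", "LLTQTRATQ")
print(first["TQ"], link[8], dict(hits))
```

Each library word costs one lookup plus one step per repeated query occurrence, which is where the "if few repeat hits" qualifier on the O(n+m) time comes from.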
FASTA

1. Identify identical matches (length = ktup)
2. Extend along diagonal (local maximum)
3. Join diagonal segments (DP) (maintain linearity) (optimal sum score)
4. Banded Smith-Waterman

NLPYL-I
..: . :
QVPLVEI

Outcome: one continuous, near-optimal gapped alignment

[Figure: dot plot of N L P Y L I (horizontal) against Q V P L V E I (vertical)]

BLAST

1. neighborhood word hits (word length)
2. extend from diagonal ends (X-drop threshold)
3. report HSP linkages (maintain linearity) (probability)

NL I
.: :
PL I

NLP
..:
QVP

L
.
E

Outcome: multiple HSPs, multiple linkages; only partially aligned

[Figure: dot plot of N L P Y L I (horizontal) against Q V P L V E I (vertical)]
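The "extend from diagonal ends (X-drop threshold)" step can be illustrated with an ungapped extension that stops once the running score falls too far below the best score seen. This is a sketch only: `extend_right` is a hypothetical name, and the +1/-1 scoring is an assumption standing in for the substitution-matrix scores a real search would use.

```python
def extend_right(a, b, i, j, x_drop, match=1, mismatch=-1):
    """Extend an ungapped hit rightwards from a[i], b[j].

    Returns (best_score, best_length). Extension stops when the running
    score has dropped more than x_drop below the best score so far.
    """
    score = best = best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        score += match if a[i + k] == b[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


print(extend_right("ABDDEFGHI", "ABDEGKHI", 0, 0, x_drop=1))
```

A larger `x_drop` lets the extension cross longer low-scoring stretches at the cost of more work per hit.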
fasta.bioch.virginia.edu/noptalign
Similarity Scoring Matrices - Summary
• Similarity scoring matrices are “log-odds” matrices, reporting the “odds” that an alignment reflects homology rather than chance
• One can predict evolutionary changes using a simple random model, which can generate mutation frequencies at any evolutionary distance
• The optimal scoring matrix has an evolutionary distance that matches that of the alignment
• Shallower scoring matrices have more information content, or “bits/residue”, and thus can be used to find shorter domains
• Scoring matrices set evolutionary look back times
Scoring Matrices – Concepts
• Where do scoring matrices come from
  – Transition probabilities and PAMs
  – Scoring matrices as log-odds values (log(p[related]/p[chance]))
• The PAM 250 matrix
• Scoring matrices and information content
• The BLOSUM matrices
• Effective matrices and gap penalties
DNA transition probabilities – 1 PAM

      a      c      g      t
a   0.99   0.001  0.008  0.001   = 1.0
c   0.001  0.99   0.001  0.008   = 1.0
g   0.008  0.001  0.99   0.001   = 1.0
t   0.001  0.008  0.001  0.99    = 1.0

[Figure: transition diagram among a, c, g and t with edge probabilities 0.99, 0.008, 0.001 and 0.001]
Matrix multiples

M^2 = PAM 2
{0.980, 0.002, 0.016, 0.002},
{0.002, 0.980, 0.002, 0.016},
{0.016, 0.002, 0.980, 0.002},
{0.002, 0.016, 0.002, 0.980}

M^5 = PAM 5
{0.952, 0.005, 0.038, 0.005},
{0.005, 0.952, 0.005, 0.038},
{0.038, 0.005, 0.952, 0.005},
{0.005, 0.038, 0.005, 0.952}

M^10 = PAM 10
{0.907, 0.010, 0.073, 0.010},
{0.010, 0.907, 0.010, 0.073},
{0.073, 0.010, 0.907, 0.010},
{0.010, 0.073, 0.010, 0.907}

M^100 = PAM 100
{0.499, 0.083, 0.336, 0.083},
{0.083, 0.499, 0.083, 0.336},
{0.336, 0.083, 0.499, 0.083},
{0.083, 0.336, 0.083, 0.499}

M^1000 = PAM 1000
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255},
{0.255, 0.245, 0.255, 0.245},
{0.245, 0.255, 0.245, 0.255}

qij = M^20 = PAM 20
{0.828, 0.019, 0.133, 0.019},
{0.019, 0.828, 0.019, 0.133},
{0.133, 0.019, 0.828, 0.019},
{0.019, 0.133, 0.019, 0.828}
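These matrices are repeated products of the 1 PAM matrix with itself, which is easy to check numerically. A minimal pure-Python sketch (the helper names `mat_mul` and `mat_pow` are illustrative, not from the lecture):

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """m multiplied by itself `power` times (power >= 0)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# 1 PAM DNA matrix from the slide, rows and columns in the order a, c, g, t
PAM1 = [
    [0.99, 0.001, 0.008, 0.001],
    [0.001, 0.99, 0.001, 0.008],
    [0.008, 0.001, 0.99, 0.001],
    [0.001, 0.008, 0.001, 0.99],
]

pam2 = mat_pow(PAM1, 2)
pam20 = mat_pow(PAM1, 20)
print([round(x, 3) for x in pam20[0]])
```

Every row keeps summing to 1, and at very large powers the rows approach the uniform background of 0.25 per base, as the PAM 1000 matrix shows.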
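Where do scoring matrices come from?

$$p_i(a,c,g,t) = p_j = 0.25$$

$$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_j}\right)$$

Here q_ij is the probability of mutation and p_j is the probability of alignment by chance.

With q_ij taken from PAM 20 and scores scaled as 10 log10:

$$S_{a,a} = 10\log_{10}\left(\frac{q_{a,a}}{p_a}\right) = 10\log_{10}\left(\frac{0.828}{0.25}\right) = 5.2$$

$$S_{a,c} = 10\log_{10}\left(\frac{q_{a,c}}{p_c}\right) = 10\log_{10}\left(\frac{0.019}{0.25}\right) = -11.2$$

Multiplying such a score by λ2 converts it to bits:

$$\lambda_2 = \frac{1}{10\log_{10} 2} = 0.33$$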
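The two worked scores and the bit-conversion factor can be checked directly. A minimal sketch; `log_odds` is an illustrative name, and the 10 log10 scaling is the one used in the worked example.

```python
import math


def log_odds(q, p, scale=10.0):
    """Score = scale * log10(q / p): target probability over chance."""
    return scale * math.log10(q / p)


s_match = log_odds(0.828, 0.25)         # a aligned with a at PAM 20
s_transversion = log_odds(0.019, 0.25)  # a aligned with c at PAM 20

# One score unit is worth this many bits: log2(x) = 10*log10(x) * bits_per_unit
bits_per_unit = 1 / (10 * math.log10(2))

print(round(s_match, 1), round(s_transversion, 1), round(bits_per_unit, 2))
```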
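Two expressions for Sij

Transition frequency (probability) - Durbin et al.:

$$\lambda S = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$$

Alignment frequency (probability) - Altschul:

$$\lambda S = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right)$$

The two agree because the alignment frequency is the transition frequency weighted by p_i:

$$\lambda S = \log\left(\frac{q^{a}_{ij}}{p_i p_j}\right) = \log\left(\frac{p_i\, q^{t}_{ij}}{p_i p_j}\right)$$

$$\text{Altschul } q^{a}_{ij} = p_i \times \text{Durbin } q^{t}_{ij}$$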
Scoring matrices at DNA PAMs - ratios

PAM1: ratio=1/3.13=+1/-3, H=1.90
{ 1.99, -6.23, -6.23, -6.22},
{-6.23,  1.99, -6.23, -6.23},
{-6.23, -6.23,  1.99, -6.23},
{-6.23, -6.23, -6.23,  1.99}

PAM2: ratio=1/2.65=+2/-5, H=1.82
{ 1.97, -5.24, -5.24, -5.24},
{-5.24,  1.98, -5.24, -5.24},
{-5.24, -5.24,  1.98, -5.24},
{-5.24, -5.24, -5.24,  1.98}

PAM10: ratio=1/1.61=+2/-3, H=1.40
{ 1.86, -3.00, -3.00, -3.00},
{-3.00,  1.86, -3.00, -3.00},
{-3.00, -3.00,  1.86, -3.00},
{-3.00, -3.00, -3.00,  1.86}

PAM20: ratio=1/1.21=+4/-5, H=1.05
{ 1.72, -2.09, -2.09, -2.09},
{-2.09,  1.72, -2.09, -2.09},
{-2.09, -2.09,  1.72, -2.09},
{-2.09, -2.09, -2.09,  1.72}

PAM30: ratio=1/1=+1/-1, H=0.80
{ 1.59, -1.59, -1.59, -1.59},
{-1.59,  1.59, -1.59, -1.59},
{-1.59, -1.59,  1.59, -1.59},
{-1.59, -1.59, -1.59,  1.59}

blastn (DNA)

PAM45: ratio=1.23/1=+5/-4, H=0.54
{ 1.40, -1.14, -1.14, -1.14},
{-1.14,  1.40, -1.14, -1.14},
{-1.14, -1.14,  1.40, -1.14},
{-1.14, -1.14, -1.14,  1.40}

fasta (DNA)
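The scores and H values in these tables appear to follow from a uniform mutation model in which, at 1 PAM, a base is kept with probability 0.99 and each of the three substitutions has probability 0.01/3. That model is an inference from the numbers, not something the slide states; the sketch below (with the hypothetical name `dna_pam_bits`) reproduces the tabulated values under it.

```python
import math


def dna_pam_bits(n, rate=0.01):
    """Match score, mismatch score and relative entropy H, in bits, at PAM n.

    Assumes a uniform model: identical background frequencies of 0.25 and
    equal probability rate/3 for each substitution at 1 PAM.
    """
    decay = (1 - 4 * rate / 3) ** n  # non-unit eigenvalue of the 1 PAM matrix, to the n
    q_same = (1 + 3 * decay) / 4     # probability the base is unchanged
    q_diff = (1 - decay) / 4         # probability of one specific substitution
    match = math.log2(q_same / 0.25)
    mismatch = math.log2(q_diff / 0.25)
    h = q_same * match + 3 * q_diff * mismatch
    return match, mismatch, h


for n in (1, 10, 30):
    print(n, [round(v, 2) for v in dna_pam_bits(n)])
```

As the PAM distance grows, the match score and H shrink toward zero, so each aligned position carries less information.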
Normalized frequencies of the amino-acids

Trp 0.010    Asp 0.047
Met 0.015    Glu 0.050
Tyr 0.030    Pro 0.051
Cys 0.033    Thr 0.058
His 0.033    Val 0.065
Ile 0.037    Ser 0.070
Gln 0.038    Lys 0.081
Phe 0.040    Leu 0.085
Asn 0.040    Ala 0.087
Arg 0.041    Gly 0.089

Relative mutabilities of the amino-acids

Trp  18    Val  74
Cys  20    Gln  93
Leu  40    Met  94
Phe  41    Ile  96
Tyr  41    Thr  97
Gly  49    Ala 100
Pro  56    Glu 102
Lys  56    Asp 106
Arg  65    Ser 130
His  66    Asn 134
Numbers of accepted mutations

      A    R    N    D    C    Q    E    G    H    I    L
A
R    30
N   109   17
D   154    0  532
C    33   10    0    0
Q    93  120   50   76    0
E   266    0   94  831    0  422
G   579   10  156  162   10   30  112
H    21  103  226   43   10  243   23   10
I    66   30   36   13   17    8   35    0    3
L    95   17   37    0    0   75   15   17   40  253

Numbers of accepted mutations (x10) from closely related sequences. 1572 changes (20x20) tabulated; the first 11 amino acids are shown.
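Mutation probability matrix for 1 PAM (values x 10,000)

       A     R     N     D     C     Q     E     G     H     I     L
A   9867     2     9    10     3     8    17    21     2     6     4
R      1  9913     1     0     1    10     0     0    10     3     1
N      4     1  9822    36     0     4     6     6    21     3     1
D      6     0    42  9859     0     6    53     6     4     1     0
C      1     1     0     0  9973     0     0     0     1     1     0
Q      3     9     4     5     0  9876    27     1    23     1     3
E     10     0     7    56     0    35  9865     4     2     3     1
G     21     1    12    11     1     3     7  9935     1     0     1
H      1     8    18     3     1    20     1     0  9912     0     1
I      2     2     3     1     2     1     2     0     0  9872     9
L      3     1     3     0     0     6     1     1     4    22  9947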
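Mutation probability matrix for 250 PAMs (values x 100)

      A    R    N    D    C    Q    E    G    H    I    L
A    13    6    9    9    5    8    9   12    6    8    6
R     3   17    4    3    2    5    3    2    6    3    2
N     4    4    6    7    2    5    6    4    6    3    2
D     5    4    8   11    1    7   10    5    6    3    2
C     2    1    1    1   52    1    1    2    2    2    1
Q     3    5    5    6    1   10    5    6    3    2    5
E     5    4    7   11    1    9   12    5    6    3    2

Rows G, H, I and L are not recoverable here: the transcript repeats the 1 PAM values in their place.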
The PAM250 matrix

Cys  12
Ser   0   2
Thr  -2   1   3
Pro  -1   1   0   6
Ala  -2   1   1   1   2
Gly  -3   1   0  -1   1   5
Asn  -4   1   0  -1   0   0   2
Asp  -5   0   0  -1   0   1   2   4
Glu  -5   0   0  -1   0   0   1   3   4
Gln  -5  -1  -1   0   0  -1   1   2   2   4
His  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
      C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

Pam40

     A    R    N    D    E    I    L
A    8
R   -9   12
N   -4   -7   11
D   -4  -13    3   11
E   -3  -11   -2    4   11
I   -6   -7   -7  -10   -7   12
L   -8  -11   -9  -16  -12   -1   10

Pam250

     A    R    N    D    E    I    L
A    2
R   -2    6
N    0    0    2
D    0   -1    2    4
E    0   -1    1    3    4
I   -1   -2   -2   -2   -2    5
L   -2   -3   -3   -4   -3    2    6
Where do scoring matrices come from?

• Scoring matrices can be designed for different evolutionary distances (less=shallow; more=deep)
• Deep matrices allow more substitution

$$\lambda S = \log\left(\frac{q_{ij}}{p_i p_j}\right)$$

Here q_ij is the frequency of replacement in homologs and p_i p_j is the frequency of alignment by chance.
Where do scoring matrices come from?
qij : replacement frequency at PAM40, 250

qR:N(40) = 0.000435        pR = 0.051
qR:N(250) = 0.002193       pN = 0.043
pR pN = 0.002193

λ2 Sij = lg2(qij/pipj)     λe Sij = ln(qij/pipj)

λ2 SR:N(40) = lg2(0.000435/0.002193) = -2.333
λ2 = 1/3; SR:N(40) = -2.333/λ2 = -7
λ2 SR:N(250) = lg2(0.002193/0.002193) = 0

$$\lambda S = \log\left(\frac{q_{ij}}{p_i p_j}\right)$$

Here q_ij is the frequency of replacement in homologs and p_i p_j is the frequency of alignment by chance.
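The Arg:Asn arithmetic can be replayed in a few lines. A minimal sketch using only the frequencies quoted on the slide; `scaled_score` is an illustrative name and λ2 = 1/3 is the scale stated above.

```python
import math

p_r, p_n = 0.051, 0.043  # background frequencies of Arg and Asn


def scaled_score(q, lam2=1 / 3):
    """log2 odds of replacement frequency q against chance, in 1/lam2-bit units."""
    return math.log2(q / (p_r * p_n)) / lam2


s40 = scaled_score(0.000435)   # replacement frequency at PAM40
s250 = scaled_score(0.002193)  # replacement frequency at PAM250

print(round(s40), round(s250))
```

At PAM250 the replacement frequency equals the chance frequency, so the score is zero: aligning R with N is then neither evidence for nor against homology.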
PAM and % identity
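Scoring matrix scale – revisited

$$\lambda s_{i,j} = \log\left(\frac{q_{ij}}{p_i p_j}\right) \qquad s_{i,j} = \log\left(\frac{q_{ij}}{p_i p_j}\right) / \lambda$$

$$s'_{ij} = s_{ij} \times 10 \;\Rightarrow\; \lambda' = \frac{\lambda}{10}$$

λe is the unique solution of:

$$\sum_{i,j} p_i p_j\, e^{\lambda_e s_{i,j}} = 1$$

alternatively:

$$\sum_{i,j} p_i p_j\, 2^{\lambda_2 s_{i,j}} = 1 \quad \text{gives } \lambda_2$$

With scores scaled as 10 log10:

$$S_{R,N} = 10 \times \log_{10}\left(\frac{q_{R,N}}{p_R\, p_N}\right) = 10 \times \log_{10}\left(\frac{0.000435}{0.051 \times 0.043}\right) = -7.03$$

$$\lambda_2 = \frac{1}{10\log_{10} 2} = 0.33$$

PAM250 scores are in 1/3 bit units (scaled with λ2=0.33).
PAM120, BLOSUM62 scores are scaled in 1/2 bit units (λ2=0.5).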
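Because λ is defined only implicitly, it has to be found numerically. The sketch below uses plain bisection on a toy DNA matrix (+1 match, -3 mismatch, uniform base frequencies); the function name `solve_lambda` and the toy matrix are assumptions for illustration, and the method assumes the expected score is negative so that a positive root exists.

```python
import math


def solve_lambda(freqs, score, base=math.e):
    """Positive root of sum_ij p_i p_j base**(lam * s_ij) = 1, by bisection."""
    def total(lam):
        return sum(pi * pj * base ** (lam * score(a, b))
                   for a, pi in freqs.items() for b, pj in freqs.items())

    lo, hi = 1e-6, 1.0          # total(lo) < 1 when the expected score is negative
    while total(hi) < 1.0:      # grow the bracket until it contains the root
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


dna = {base: 0.25 for base in "acgt"}


def score(a, b):
    return 1 if a == b else -3


lam_e = solve_lambda(dna, score)
lam_2 = solve_lambda(dna, score, base=2)
print(round(lam_e, 3), round(lam_2, 3))
```

The two roots differ only by the change of logarithm base, λ2 = λe / ln 2.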
Local alignment scores as measures of information

Information content: number of bits required to represent all possibilities with optimal encoding

$$H = -\sum_i p_i \log_2 p_i$$

$$H = -\sum_{0,1} 0.5 \log_2 0.5 = -\left(0.5(-1) + 0.5(-1)\right) = 1 \text{ bit}$$

$$H = -\sum_{a,c,g,t} 0.25 \log_2 \frac{1}{4} = -\sum_{a,c,g,t} 0.25(-2) = 2 \text{ bits}$$

$$H = -\sum_{20aa} p_i \log_2 p_i = 4.19 \text{ bits}$$

Expected score:

$$E = \sum_{20aa} p_i p_j s_{ij}$$

= -0.84 (PAM250)
= -1.64 (PAM120)
= -0.52 (BLOSUM62)

Average information content / position:

$$H = \sum_{20aa} q^{a}_{ij} s_{ij} = \sum_{20aa} p_i \sum_{20aa} q^{t}_{ij} s_{ij}$$

= 0.35 (PAM250)
= 0.98 (PAM120)
= 0.70 (BLOSUM62)
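The first two entropy examples are quick to verify. A minimal sketch (`entropy_bits` is an illustrative name):

```python
import math


def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2 p) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


coin = entropy_bits([0.5, 0.5])   # two equally likely symbols
dna = entropy_bits([0.25] * 4)    # four equally likely bases
print(coin, dna)
```

Twenty equally likely amino acids would give log2(20), about 4.32 bits; the 4.19 bits quoted above is lower because the observed frequencies are uneven.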
Relative entropy H of PAM matrices

More about scoring matrices ...

PAM series:
• Evolutionary model - extrapolated from PAM1
• PAM20: 20% change (mammals)
• PAM250: 250% change (<20% identity)
• Gap penalties should vary
• shallow matrices (PAM10-40) for short sequences and short distances

BLOSUM series
• Empirically determined, no extrapolation (no model)
• BLOSUM45-50 - distant (1/3 bits)
• BLOSUM80 - very highly conserved (not small change), high info/position
• BLOSUM62 - 1/2 bits
Scoring Matrices and Gap-penalties - BLAST vs FASTA

BLAST
• default scoring matrix: BLOSUM62 (1/2 bit)
• default gap penalty: -11 (open)/-1 (extend) (lowest -9/-1, -8/-2)

FASTA
• default matrix: BLOSUM50 (1/3 bit)
• default gap penalty: old: -12 (first residue)/-2 = new: -10 (open)/-2 (ext)
• BLOSUM62 -7/-1
• PAM120 -16/-4
• PAM20 -24/-4
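The "old = new" note for FASTA says the two gap conventions charge the same amount for any gap length. A minimal sketch of that equivalence; both function names are illustrative, and the formulas are the usual readings of "open/extend" and "first residue/additional residue", which is an assumption about how the slide intends them.

```python
def gap_cost_open_extend(k, gap_open=-10, gap_extend=-2):
    """Cost of a gap of k residues: one opening charge plus k extensions."""
    return gap_open + k * gap_extend


def gap_cost_first_residue(k, first=-12, extend=-2):
    """Cost of a gap of k residues: a charge for the first residue, then k-1 more."""
    return first + (k - 1) * extend


for k in (1, 2, 5):
    print(k, gap_cost_open_extend(k), gap_cost_first_residue(k))
```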
Similarity Scoring Matrices - Summary
• Similarity scoring matrices are “log-odds” matrices, reporting the “odds” that an alignment reflects homology rather than chance
• One can predict evolutionary changes using a simple random model, which can generate mutation frequencies at any evolutionary distance
• The optimal scoring matrix has an evolutionary distance that matches that of the alignment
• Shallower scoring matrices have more information content, or “bits/residue”, and thus can be used to find shorter domains
• Scoring matrices set evolutionary look back times