Outline
• Explaining the connection
• Simple RW with absorption
• Moment-generating function method
• Size and duration of excursions
• Renewal equation and general RW
• Significance of alignments in BLAST
Intuitive introduction
• Alignment as a random walk
g g a g a c t g t a g a c
g a a c g c c c t a g c c
• Scores: match = 1, mismatch = -1
• Solid symbols = ladder points, squares = excursions
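The walk for the two sequences above can be traced with a short script. This is a minimal sketch, assuming that a ladder point is a step at which the walk reaches a new strict minimum and that an excursion height is measured from the preceding ladder point; the function names are illustrative, not taken from the slides.

```python
def alignment_walk(seq1, seq2, match=1, mismatch=-1):
    """Cumulative score of an ungapped alignment, starting at 0."""
    walk = [0]
    for x, y in zip(seq1, seq2):
        walk.append(walk[-1] + (match if x == y else mismatch))
    return walk


def ladder_points_and_excursions(walk):
    """Return (ladder point steps, excursion heights).

    Assumed convention: step 0 is the first ladder point; every later
    ladder point is a step where the walk drops below all earlier values.
    """
    ladder = [0]
    heights = []
    low = peak = walk[0]
    for step, value in enumerate(walk[1:], start=1):
        if value < low:
            heights.append(peak - low)  # close the current excursion
            ladder.append(step)
            low = peak = value
        else:
            peak = max(peak, value)
    heights.append(peak - low)  # last (possibly unfinished) excursion
    return ladder, heights


walk = alignment_walk("ggagactgtagac", "gaacgccctagcc")
ladder, heights = ladder_points_and_excursions(walk)
# ladder == [0, 5, 8], heights == [1, 1, 3]
```

The largest excursion height (here 3) is the quantity whose chance distribution is studied in the rest of the slides.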
Relation to BLAST
• Quality of alignment reflected by the course of the RW.
• The distribution of the maximum excursion heights achievable by chance provides the null hypothesis.
Simple RW with absorbing boundaries
We consider the case p ≠ q only.
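\[ \{S_i\}_{i \ge 1} \text{ iid}, \quad S_i \in \{-1, 1\}, \quad \Pr[S_i = 1] = p = 1 - q \]
\[ T_n = h + \sum_{i=1}^{n} S_i, \quad \text{as long as } T_k \in (a, b) \text{ for all } k < n \]
\[ T_n = a \Rightarrow T_{n+1} = a, \qquad T_n = b \Rightarrow T_{n+1} = b \]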
Absorption probabilities
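• Consider backward equation
\[ w_h = p\,w_{h+1} + q\,w_{h-1}, \quad a < h < b, \qquad w_a = 0, \quad w_b = 1 \]
where
\[ w_h = \Pr[T_n = b,\ n \text{ large enough}], \quad u_h = \Pr[T_n = a,\ n \text{ large enough}], \quad u_h + w_h = 1 \]
• 2nd order, homogeneous, linear, difference equ. The trial solution w_h = C exp(θh) gives
\[ \exp(\theta h) = p \exp[\theta(h+1)] + q \exp[\theta(h-1)] \]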
Absorption probabilities
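• This provides
\[ \theta_1 = 0, \qquad \theta_2 = \theta^* = \ln(q/p) \]
• Constants derived from boundary conditions
\[ w_h = \frac{\exp(\theta^* h) - \exp(\theta^* a)}{\exp(\theta^* b) - \exp(\theta^* a)}, \qquad u_h = 1 - w_h \]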
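The closed form w_h = [exp(θ*h) − exp(θ*a)] / [exp(θ*b) − exp(θ*a)], with θ* = ln(q/p), can be checked numerically against the backward equation and the boundary conditions. The sketch below uses hypothetical parameter values (p = 0.3, a = -2, b = 5) and an illustrative function name.

```python
import math


def absorption_probs(p, a, b, h):
    """Return (w_h, u_h) for a simple walk with up-step probability p != 0.5."""
    q = 1.0 - p
    theta = math.log(q / p)  # theta* = ln(q/p)
    w = (math.exp(theta * h) - math.exp(theta * a)) / (
        math.exp(theta * b) - math.exp(theta * a)
    )
    return w, 1.0 - w


p, a, b = 0.3, -2, 5  # hypothetical values for illustration
w = {h: absorption_probs(p, a, b, h)[0] for h in range(a, b + 1)}

# Backward equation w_h = p*w_{h+1} + q*w_{h-1} at every interior point.
for h in range(a + 1, b):
    assert abs(w[h] - (p * w[h + 1] + (1 - p) * w[h - 1])) < 1e-12
```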
Mean number of steps to absorption
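\[ m_h = p\,m_{h+1} + q\,m_{h-1} + 1, \quad a < h < b, \qquad m_a = m_b = 0 \qquad (*) \]
• 2nd order, inhomogeneous, difference equ.
• Solution = any particular solution of (*) + general solution of the corresponding homogeneous equ.
• Verify
\[ m_h = h/(q - p) \]
is a particular solution, and therefore
\[ m_h = h/(q - p) + C_1 + C_2 \exp(\theta^* h), \qquad m_a = m_b = 0 \]
\[ m_h = \frac{w_h (b - h) + u_h (a - h)}{p - q} \]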
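As a sanity check, the formula m_h = [w_h(b − h) + u_h(a − h)] / (p − q) can be tested against the recurrence (*) and the boundary conditions m_a = m_b = 0. This is a sketch with hypothetical parameter values; the function name is illustrative.

```python
import math


def mean_steps(p, a, b, h):
    """Mean number of steps to absorption at a or b, starting from h (p != 0.5)."""
    q = 1.0 - p
    theta = math.log(q / p)  # theta* = ln(q/p)
    ea, eb, eh = (math.exp(theta * x) for x in (a, b, h))
    w = (eh - ea) / (eb - ea)  # probability of absorption at b
    u = 1.0 - w                # probability of absorption at a
    return (w * (b - h) + u * (a - h)) / (p - q)


p, a, b = 0.3, -3, 4  # hypothetical values for illustration
m = {h: mean_steps(p, a, b, h) for h in range(a, b + 1)}

# Recurrence (*): m_h = p*m_{h+1} + q*m_{h-1} + 1 at every interior point.
for h in range(a + 1, b):
    assert abs(m[h] - (p * m[h + 1] + (1 - p) * m[h - 1] + 1)) < 1e-9
```

With a = -1, b = 1 and h = 0 the walk is absorbed after exactly one step, and the formula indeed returns 1.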
Moment-generating function approach
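Theorem 1.1. If S takes values in {−c, −c+1, …, 0, …, d−1, d}, with
\[ \Pr[S = -c] > 0, \qquad \Pr[S = d] > 0, \qquad E[S] \ne 0, \]
then, for
\[ m_S(\theta) = E[\exp(\theta S)] = \sum_{i=-c}^{d} \Pr[S = i] \exp(\theta i), \]
there exists a unique θ* such that
\[ \theta^* \ne 0, \qquad m_S(\theta^*) = 1 \]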
Moment-generating function approach
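• Simple RW:
\[ m_S(\theta) = q \exp(-\theta) + p \exp(\theta), \qquad \theta^* = \ln(q/p), \qquad m_S(\theta^*) = 1 \]
• Until absorption
\[ T_N - h = \sum_{i=1}^{N} S_i \]
\[ m_{T_N - h}(\theta) = [q \exp(-\theta) + p \exp(\theta)]^N \]
\[ m_{T_N - h}(\theta^*) = [q \exp(-\theta^*) + p \exp(\theta^*)]^N = 1^N = 1 \]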
Moment-generating function approach
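• Sticky argument now: At the time of absorption,
\[ T_N - h = \begin{cases} a - h & \text{wp } 1 - w_h \\ b - h & \text{wp } w_h \end{cases} \]
\[ m_{T_N - h}(\theta^*) = w_h \exp[\theta^*(b - h)] + (1 - w_h) \exp[\theta^*(a - h)] \]
• But the latter is equal to 1
\[ w_h \exp[\theta^*(b - h)] + (1 - w_h) \exp[\theta^*(a - h)] = 1 \]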
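Solving the last identity for w_h recovers the absorption probability found earlier from the backward equation. The steps below are a worked sketch of that algebra, using only the symbols defined on these slides.

```latex
\begin{align*}
w_h\, e^{\theta^*(b-h)} + (1 - w_h)\, e^{\theta^*(a-h)} &= 1 \\
w_h \left[ e^{\theta^*(b-h)} - e^{\theta^*(a-h)} \right] &= 1 - e^{\theta^*(a-h)} \\
w_h &= \frac{1 - e^{\theta^*(a-h)}}{e^{\theta^*(b-h)} - e^{\theta^*(a-h)}}
     = \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}}
\end{align*}
```

The last equality follows by multiplying numerator and denominator by exp(θ*h).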
Stopping time (at absorption)
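Wald's Theorem 7.1: with N the stopping time and T_N − h the random displacement,
\[ E\{ m(\theta)^{-N} \exp[\theta (T_N - h)] \} = 1, \quad \text{all } \theta \]
\[ \frac{d}{d\theta} E\{ m(\theta)^{-N} \exp[\theta (T_N - h)] \} \Big|_{\theta = 0} = 0 \]
\[ E(T_N - h) - E(N)\,E(S) = 0 \]
i.e.
\[ m_h (p - q) = u_h (a - h) + w_h (b - h) \]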
Asymptotics (p < q)
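• Hypotheses
\[ h = 0, \qquad a = -1, \qquad b = y \]
\[ \Pr[\text{absorption at } y] = w_0 = \frac{1 - \exp(-\theta^*)}{\exp(\theta^* y) - \exp(-\theta^*)} \]
• So, define Y = excursion height
\[ \Pr[Y \ge y] = \Pr[\text{hitting } y \text{ at least once}] \sim [1 - \exp(-\theta^*)] \exp(-\theta^* y) = C \exp(-\theta^* y), \quad \text{as } y \to \infty \]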
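The quality of the approximation Pr[Y ≥ y] ~ C exp(−θ*y), with C = 1 − exp(−θ*), can be inspected by comparing it with the exact absorption probability w_0 for h = 0, a = -1, b = y. A minimal sketch, assuming p < q; the function names and the value p = 0.3 are illustrative.

```python
import math


def hit_prob_exact(p, y):
    """Exact w_0 for h = 0, a = -1, b = y (requires p < 0.5)."""
    theta = math.log((1.0 - p) / p)  # theta* = ln(q/p) > 0
    return (1.0 - math.exp(-theta)) / (math.exp(theta * y) - math.exp(-theta))


def hit_prob_asymptotic(p, y):
    """Geometric-like tail C * exp(-theta* y) with C = 1 - exp(-theta*)."""
    theta = math.log((1.0 - p) / p)
    return (1.0 - math.exp(-theta)) * math.exp(-theta * y)


p = 0.3  # hypothetical value for illustration
ratios = [hit_prob_exact(p, y) / hit_prob_asymptotic(p, y) for y in (1, 5, 20)]
# The ratio equals 1 / (1 - exp(-theta*(y + 1))), so it decreases towards 1.
```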
Asymptotics of the mean time to absorption
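• A = Mean{# steps before absorption at -1}
• Since
\[ \lim_{b \to \infty} u_0(b) = 1, \qquad b\,w_0(b) \to 0 \quad \text{as } b \to \infty, \]
we have
\[ A = \lim_{b \to \infty} m_0 = \lim_{b \to \infty} \frac{u_0 - b\,w_0}{q - p} = \frac{1}{q - p} \]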
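The limit A = 1/(q − p) can be observed numerically by evaluating m_0 = (u_0 − b·w_0)/(q − p) for a = -1, h = 0 and a growing upper boundary b. The sketch assumes p < q; the function name and p = 0.3 are illustrative.

```python
import math


def mean_steps_from_zero(p, b):
    """m_0 for a = -1, h = 0 and upper boundary b (requires p < 0.5)."""
    q = 1.0 - p
    theta = math.log(q / p)  # theta* = ln(q/p) > 0
    w0 = (1.0 - math.exp(-theta)) / (math.exp(theta * b) - math.exp(-theta))
    u0 = 1.0 - w0
    return (u0 - b * w0) / (q - p)


p = 0.3  # hypothetical value for illustration
limit = 1.0 / ((1.0 - p) - p)  # A = 1/(q - p)
values = [mean_steps_from_zero(p, b) for b in (1, 2, 5, 10, 50)]
# values increase towards `limit` as b grows
```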
Random walks versus alignments
Anatomy of an excursion
• Pr[Y_i ≥ y] ~ C exp(−θ* y)
• A = E[inter-ladder pts. distance]
• A and C difficult to compute
P-values for a BLAST comparison
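• Assume comparison of two sequences of length N, with expected ladder points distance A. This gives n = N/A excursions on the average. Also, let us denote
\[ K = \frac{C}{A} \exp(-\theta^*) \]
• From expression (2.134) we have (since Y is geometric-like)
\[ 1 - \exp\{-nC \exp[-\theta^*(y + 1)]\} \le \Pr[Y_{\max} \ge y] \le 1 - \exp[-nC \exp(-\theta^* y)] \]
• Making substitutions
\[ n = N/A, \qquad C/A = K \exp(\theta^*), \qquad y = [x + \ln(N)]/\theta^* \]
we obtain
\[ 1 - \exp[-K \exp(-x)] \le \Pr[\theta^* Y_{\max} - \ln(N) \ge x] \le 1 - \exp\{-K \exp[-(x - \theta^*)]\} \]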
P-values for a BLAST comparison
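• From previous slide
\[ 1 - \exp[-K \exp(-x)] \le \Pr[\theta^* Y_{\max} - \ln(N) \ge x] \le 1 - \exp\{-K \exp[-(x - \theta^*)]\} \]
• Let us assume a normalized score
\[ S' = \theta^* Y_{\max} - \ln(NK) \]
• Substituting into previous inequality, we obtain
\[ \Pr[S' \ge s] \approx 1 - \exp[-\exp(-s)] \]
• So, P-value, corresponding to an empirically obtained maximum score, equals
\[ P\text{-value} \approx 1 - \exp[-\exp(-s')], \quad \text{where } s' = \theta^* y_{\max} - \ln(NK) \]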
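The P-value formula above translates directly into code. The following is a minimal sketch: the function names are illustrative, and the parameter values in the usage lines (θ*, K, N, y_max) are hypothetical numbers chosen only to exercise the functions, not values from the slides.

```python
import math


def normalized_score(y_max, n_len, k, theta):
    """s' = theta* * y_max - ln(N * K)."""
    return theta * y_max - math.log(n_len * k)


def p_value_from_normalized(s):
    """P-value ~ 1 - exp(-exp(-s)), written with expm1 to keep precision
    when the result is very small."""
    return -math.expm1(-math.exp(-s))


# Hypothetical inputs, for illustration only.
s_prime = normalized_score(y_max=25, n_len=1000, k=0.1, theta=0.8)
p_val = p_value_from_normalized(s_prime)
```

For large s' the P-value is close to exp(−s'), which is why small P-values and small expected counts nearly coincide.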
P-values for a BLAST comparison
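• Expected value of the normalized score is equal approximately to Euler's constant
\[ E[S'] \approx \gamma \approx 0.5772 \]
• This yields
\[ E[Y_{\max}] \approx (\theta^*)^{-1} [\ln(KN) + \gamma] \]
• Both
\[ S' = \theta^* Y_{\max} - \ln(NK) \]
and
\[ \text{bit score} = \frac{\theta^* Y_{\max} - \ln(K)}{\ln(2)} \]
are invariant with respect to multiplication of the score by a constant (why?)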
P-values for a BLAST comparison
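• Expected number of excursions of height at least equal to v (using the same lower bound as for the P-value)
\[ E = n \Pr[Y \ge v] \approx \frac{N}{A}\, C \exp[-\theta^*(v + 1)] = NK \exp(-\theta^* v) \]
• For an empirically found value of the score,
\[ E' = NK \exp(-\theta^* y_{\max}) \]
• By comparison with a previous formula we see
\[ P\text{-value} \approx 1 - \exp(-E') \]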
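The link between the expected count E' and the P-value can be illustrated numerically. This sketch assumes the forms E' = N·K·exp(−θ*·y_max) and P-value ≈ 1 − exp(−E') stated on this slide; the function names and all numeric inputs are hypothetical.

```python
import math


def expected_count(y_max, n_len, k, theta):
    """E' = N * K * exp(-theta* * y_max)."""
    return n_len * k * math.exp(-theta * y_max)


def p_value_from_expected(e_prime):
    """P-value ~ 1 - exp(-E'), computed with expm1 for small E'."""
    return -math.expm1(-e_prime)


# Hypothetical inputs, for illustration only.
for y in (5, 10, 20):
    e = expected_count(y, n_len=1000, k=0.1, theta=0.8)
    p = p_value_from_expected(e)
    # p never exceeds e, and p is close to e once e is small
    assert p <= e
```

When E' is small the two numbers are practically equal; when E' is large the P-value saturates near 1 while E' keeps growing.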
From the BLAST course
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of
• (i) real but non-homologous sequences;
• (ii) real sequences that are shuffled to preserve compositional properties; or
• (iii) sequences that are generated randomly based upon a DNA or protein sequence model.
Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.
• As demonstrated above, scores of local alignments are covered by a well-developed theory.
• For global alignments, Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these cannot be generalized easily.
– It is possible to express the score of interest in terms of standard deviations from the mean, but it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown.
– The most one can say reliably is that if 100 random alignments all have scores inferior to the alignment of interest, the P-value in question is likely less than 0.01.
• One further pitfall to avoid is exaggerating the significance of a result found among multiple tests.
– When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly.
– An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
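The discounting in the last example can be reproduced with a short calculation. The sketch below assumes independent trials, so that the chance of at least one result as extreme as a single-trial P-value p among n trials is 1 − (1 − p)^n; the function name is illustrative.

```python
import math


def best_of_n_p_value(p_single, n_trials):
    """P-value of the best result among n independent trials:
    1 - (1 - p)^n, computed with log1p/expm1 for numerical stability."""
    return -math.expm1(n_trials * math.log1p(-p_single))


corrected = best_of_n_p_value(0.0001, 1000)
# corrected is about 0.095, i.e. roughly the 0.1 quoted above
```

For small p the result is close to the simple product n·p, which is where the figure 0.1 comes from.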
• The E-value of equation applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths?
• One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database.
• An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains.
• If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues.
• Examining equation this can be accomplished simply by treating the database as a single long sequence of length N.
• The BLAST programs take this approach to calculating database E-value. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.
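The two conventions for turning a pairwise E-value into a database-search E-value differ only in the multiplier. A minimal sketch of both, with illustrative function names and hypothetical numbers:

```python
def database_evalue_by_count(pairwise_e, num_sequences):
    """View 1: every database sequence is equally likely a priori to be
    related, so multiply by the number of sequences in the database."""
    return pairwise_e * num_sequences


def database_evalue_by_length(pairwise_e, seq_len, db_len):
    """View 2: prior chance of relatedness is proportional to length,
    so multiply by N/n (total database length over sequence length)."""
    return pairwise_e * db_len / seq_len


# Hypothetical database: 1000 sequences, 400000 residues in total.
e_pair = 1e-5
short_hit = database_evalue_by_length(e_pair, seq_len=100, db_len=400_000)
long_hit = database_evalue_by_length(e_pair, seq_len=2000, db_len=400_000)
by_count = database_evalue_by_count(e_pair, num_sequences=1000)
```

Under the length-based view, the same pairwise E-value is penalized more heavily when it comes from a short database sequence than from a long one, while the count-based view treats both alike.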
Comparison of two unaligned sequences
• Until now, a fixed ungapped alignment in the comparison of two sequences of length N each.
• Now, given two sequences of lengths N1 and N2 without any specific alignment (total N1 + N2 – 1 ungapped alignments).
• Theory advanced, we give only highlights of results.
• Many conclusions of the previous sections carry over with N substituted by N1N2.
Scores
• The basic score is re-defined now
Y_max = max score achieved in a RW comparing the sequences, using all possible ungapped alignments
      = max of geometric-like rv's (θ*, C and A as before)
• Mean number of (independent) ladder points in all alignments
\[ n = \frac{N_1 N_2}{A} \]
• Since the heights of excursions are geometric-like rv's (n of them),
\[ 1 - e^{-K e^{-x}} \le \Pr[\theta^* Y_{\max} - \ln(N_1 N_2) \ge x] \le 1 - e^{-K e^{-(x - \theta^*)}} \]
Scores
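• From the previous slide
\[ 1 - e^{-K e^{-x}} \le \Pr[\theta^* Y_{\max} - \ln(N_1 N_2) \ge x] \le 1 - e^{-K e^{-(x - \theta^*)}} \]
• Define standardized score
\[ S' = \theta^* Y_{\max} - \ln(K N_1 N_2) \]
• Expected count of (independent) excursions of height at least y
\[ E \approx \frac{N_1 N_2}{A}\, C e^{-\theta^*(y + 1)} = K N_1 N_2\, e^{-\theta^* y} = e^{-s}, \qquad s = \theta^* y - \ln(K N_1 N_2) \]
• Similar expressions as before for expected score and P-value
\[ E[S'] \approx \gamma, \qquad \Pr[S' \ge s] \approx 1 - e^{-e^{-s}} = 1 - e^{-E} \]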
Karlin-Altschul sum statistic
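• Idea: Add information from the r-1 "next to the highest" excursions
\[ Y_{\max} = Y_1 \ge Y_2 \ge \dots \ge Y_r \]
\[ S_i' = \theta^* Y_i - \ln(K N_1' N_2'), \quad i = 1, \dots, r \quad (N_1' \text{ and } N_2' \text{ reflect edge effects}) \]
• It was proved that
\[ f(s_1, \dots, s_r) = \exp\Big( -e^{-s_r} - \sum_{k=1}^{r} s_k \Big) \]
• The particular statistics used
\[ T_r = S_1' + \dots + S_r', \qquad \Pr[T_r \ge t] \approx \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}, \quad \text{for } t \text{ large enough} \]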
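The tail approximation Pr[T_r ≥ t] ≈ e^(−t) t^(r−1) / (r!(r−1)!) is easy to evaluate. The sketch below uses an illustrative function name and is only meaningful for large t, as the slide states.

```python
import math


def sum_statistic_tail(t, r):
    """Approximate Pr[T_r >= t] = exp(-t) * t**(r-1) / (r! * (r-1)!),
    valid for t large enough."""
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))


# For r = 1 the expression reduces to exp(-t), the large-t form of
# 1 - exp(-exp(-t)) used for a single maximal excursion.
single = sum_statistic_tail(10.0, 1)
pair = sum_statistic_tail(10.0, 2)  # 5 * exp(-10)
```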
Choice of r and multiple testing
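• Usually, all sum tests are performed for all "available" r
• The best P-value is accepted, following heuristic corrections (see Section 9.3.4),
\[ P_r' = \frac{P_r}{(1 - \pi)\,\pi^{r-1}}, \quad r > 1 \]
\[ E' = 2 N_1' N_2' K e^{-\theta^* y_{\max}}, \qquad P_1' = 1 - e^{-E'}, \quad r = 1 \]
where π is a fixed constant in (0, 1); π = 1/2 reproduces the factor 2 for r = 1.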
Comparison of a query sequence against a database
• Use Poisson distribution to obtain the following probability
\[ \Pr[\text{at least 1 HSP between query and database seq. with } Y \ge v] = 1 - e^{-E} \]
• Since database is of length D, then expected # HSPs with scores ≥ v
\[ \text{Expect} = \frac{D}{N_2} (1 - e^{-E'}) = \frac{D P_1'}{N_2}, \quad r = 1 \quad (\text{"old" Expect}) \]
• For all other r
\[ \text{Expect} = \frac{D P_r'}{N_2}, \quad r > 1 \]
• Analyze Example 9.5.2.