
Random Walks and BLAST

Marek Kimmel (Statistics, Rice)

713 348 5255


Outline

• Explaining the connection

• Simple RW with absorption

• Moment-generating function method

• Size and duration of excursions

• Renewal equation and general RW

• Significance of alignments in BLAST

Intuitive introduction

• Alignment as a random walk

g g a g a c t g t a g a c

g a a c g c c c t a g c c

• Scores: match = 1, mismatch = -1

• Solid symbols = ladder points, squares = excursions
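A minimal sketch of the walk for the two sequences on this slide, assuming a ladder point is a step at which the cumulative score reaches a new minimum, and that an excursion height is measured from the most recent ladder point. The function names are illustrative and not part of the slides.

```python
def alignment_walk(x, y, match=1, mismatch=-1):
    """Cumulative score of a fixed ungapped alignment, starting at 0."""
    walk = [0]
    for a, b in zip(x, y):
        walk.append(walk[-1] + (match if a == b else mismatch))
    return walk


def excursions(walk):
    """Return (ladder point indices, maximum height of each excursion)."""
    ladder = [0]          # the start counts as the first ladder point
    heights = [0]
    low = walk[0]
    for i, t in enumerate(walk[1:], start=1):
        if t < low:       # new minimum: a new ladder point, a new excursion
            low = t
            ladder.append(i)
            heights.append(0)
        else:             # still inside the current excursion
            heights[-1] = max(heights[-1], t - low)
    return ladder, heights


seq1 = "ggagactgtagac"
seq2 = "gaacgccctagcc"
walk = alignment_walk(seq1, seq2)
ladder, heights = excursions(walk)
print(walk)     # cumulative scores
print(ladder)   # positions of ladder points
print(heights)  # one maximum height per excursion
```

The largest entry of `heights` plays the role of the maximum excursion height discussed in the following slides.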

Relation to BLAST

• Quality of alignment reflected by the course of the RW.

• The distribution of maximum excursion heights achievable by chance provides the null hypothesis.

Simple RW with absorbing boundaries

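We consider the case $p \neq q$ only.

$$\{S_i\}_{i \ge 1}\ \text{iid}, \quad S_i \in \{-1, 1\}, \quad \Pr[S_i = 1] = p = 1 - q$$

$$T_n = h + \sum_{i=1}^{n} S_i \quad \text{until absorption}, \quad a < h < b$$

$$T_n = a \Rightarrow T_k = a \ \text{for all } k \ge n, \qquad T_n = b \Rightarrow T_k = b \ \text{for all } k \ge n$$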

Absorption probabilities

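• Consider backward equation

$$w_h = \Pr[T_n = b,\ n \text{ large enough}], \quad u_h = \Pr[T_n = a,\ n \text{ large enough}], \quad u_h + w_h = 1$$

$$w_h = p\,w_{h+1} + q\,w_{h-1}, \quad a < h < b, \quad w_a = 0,\ w_b = 1$$

• 2nd order, homogeneous, linear, difference equ.

$$w_h = C \exp(\theta h), \quad \text{where} \quad \exp(\theta h) = p \exp[\theta(h+1)] + q \exp[\theta(h-1)]$$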

Absorption probabilities

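• This provides

$$\theta_1 = 0, \quad \theta_2 = \theta^* = \ln(q/p), \quad w_h = C_1 + C_2 \exp(\theta^* h)$$

• Constants derived from boundary conditions

$$w_h = \frac{\exp(\theta^* h) - \exp(\theta^* a)}{\exp(\theta^* b) - \exp(\theta^* a)}, \quad u_h = 1 - w_h$$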
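As a numerical check, the sketch below evaluates the closed form $w_h = [\exp(\theta^* h) - \exp(\theta^* a)] / [\exp(\theta^* b) - \exp(\theta^* a)]$ with $\theta^* = \ln(q/p)$ and confirms that it satisfies the backward equation $w_h = p\,w_{h+1} + q\,w_{h-1}$. The function name and the parameter values are illustrative assumptions.

```python
import math


def absorb_top(h, a, b, p):
    """w_h: probability that the simple RW started at h reaches b before a (p != q)."""
    q = 1.0 - p
    theta = math.log(q / p)            # theta* = ln(q/p)
    ea = math.exp(theta * a)
    return (math.exp(theta * h) - ea) / (math.exp(theta * b) - ea)


def backward_equation_gap(a, b, p):
    """Largest violation of w_h = p*w_{h+1} + q*w_{h-1} over interior points."""
    q = 1.0 - p
    return max(
        abs(absorb_top(h, a, b, p)
            - p * absorb_top(h + 1, a, b, p)
            - q * absorb_top(h - 1, a, b, p))
        for h in range(a + 1, b)
    )


print(absorb_top(3, 0, 10, 0.4))        # chance of reaching 10 before 0 from h = 3
print(backward_equation_gap(0, 10, 0.4))
```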

Mean number of steps to absorption

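• 2nd order, inhomogeneous, difference equ.

$$m_h = p\,m_{h+1} + q\,m_{h-1} + 1, \quad m_a = m_b = 0 \qquad (*)$$

• Solution = any particular solution of (*) + general solution of the corresponding homogeneous equ.

• Verify that

$$m_h = \frac{h}{q - p}$$

is a particular solution, and therefore

$$m_h = \frac{h}{q - p} + C_1 + C_2 \exp(\theta^* h), \quad m_a = m_b = 0$$

$$m_h = \frac{w_h (b - h) + u_h (a - h)}{p - q}$$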
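The following sketch checks numerically that $m_h = [w_h(b-h) + u_h(a-h)]/(p-q)$ satisfies $m_h = p\,m_{h+1} + q\,m_{h-1} + 1$ with $m_a = m_b = 0$. The helper names and the values of $a$, $b$, $p$ are assumptions chosen for illustration.

```python
import math


def absorb_top(h, a, b, p):
    """w_h for the simple RW with p != q."""
    q = 1.0 - p
    theta = math.log(q / p)
    ea = math.exp(theta * a)
    return (math.exp(theta * h) - ea) / (math.exp(theta * b) - ea)


def mean_steps(h, a, b, p):
    """m_h: mean number of steps to absorption at a or b, starting from h."""
    q = 1.0 - p
    w = absorb_top(h, a, b, p)
    u = 1.0 - w
    return (w * (b - h) + u * (a - h)) / (p - q)


def recurrence_gap(a, b, p):
    """Largest violation of m_h = p*m_{h+1} + q*m_{h-1} + 1 over interior points."""
    q = 1.0 - p
    return max(
        abs(mean_steps(h, a, b, p)
            - (p * mean_steps(h + 1, a, b, p) + q * mean_steps(h - 1, a, b, p) + 1.0))
        for h in range(a + 1, b)
    )


print(mean_steps(3, 0, 10, 0.4))
print(recurrence_gap(0, 10, 0.4))
```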

Moment-generating function approach

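$$m_S(\theta) = E[\exp(\theta S)]$$

If $S \in \{-c, \dots, -1, 0, 1, \dots, d\}$, then

$$m_S(\theta) = \sum_{i=-c}^{d} \Pr[S = i] \exp(i\theta)$$

Theorem 1.1. If $\Pr[S = -c] > 0$, $\Pr[S = d] > 0$ and $E[S] \neq 0$, then there exists a unique $\theta^* \neq 0$ such that $m_S(\theta^*) = 1$.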


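• Simple RW:

$$m_S(\theta) = q \exp(-\theta) + p \exp(\theta), \quad \theta^* = \ln(q/p), \quad m_S(\theta^*) = 1$$

• Until absorption

$$T_N - h = \sum_{i=1}^{N} S_i$$

$$m_{T_N - h}(\theta) = [q \exp(-\theta) + p \exp(\theta)]^N$$

$$m_{T_N - h}(\theta^*) = [q \exp(-\theta^*) + p \exp(\theta^*)]^N = 1^N = 1$$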
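For step distributions other than the simple RW, $\theta^*$ generally has no closed form. Below is a minimal sketch that finds the positive root of $m_S(\theta) = 1$ by bisection, assuming $E[S] < 0$ and at least one positive step with positive probability; the dictionary representation of the step distribution is an illustrative choice.

```python
import math


def mgf(theta, dist):
    """m_S(theta) for a step distribution given as {step: probability}."""
    return sum(prob * math.exp(theta * step) for step, prob in dist.items())


def theta_star(dist):
    """Positive root of m_S(theta) = 1, assuming E[S] < 0 and max step > 0."""
    mean = sum(step * prob for step, prob in dist.items())
    if mean >= 0 or max(dist) <= 0:
        raise ValueError("sketch assumes E[S] < 0 and a positive step")
    lo, hi = 0.0, 1.0
    while mgf(hi, dist) < 1.0:        # m_S is convex, so it eventually exceeds 1
        lo, hi = hi, 2.0 * hi
    for _ in range(200):              # bisection: m_S < 1 left of the root
        mid = 0.5 * (lo + hi)
        if mgf(mid, dist) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


simple = {1: 0.3, -1: 0.7}
print(theta_star(simple), math.log(0.7 / 0.3))   # the two should agree
general = {2: 0.1, 1: 0.2, 0: 0.1, -1: 0.6}      # hypothetical step distribution
print(theta_star(general))
```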


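• Sticky argument now: At the time of absorption,

$$T_N - h = \begin{cases} a - h & \text{w.p. } 1 - w_h \\ b - h & \text{w.p. } w_h \end{cases}$$

$$m_{T_N - h}(\theta) = w_h \exp[\theta(b - h)] + (1 - w_h) \exp[\theta(a - h)]$$

• But the latter is equal to 1 at $\theta = \theta^*$

$$w_h \exp[\theta^*(b - h)] + (1 - w_h) \exp[\theta^*(a - h)] = 1 \ \Rightarrow\ w_h = \frac{\exp(\theta^* h) - \exp(\theta^* a)}{\exp(\theta^* b) - \exp(\theta^* a)}$$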

Stopping time (at absorption)

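$N$ = stopping time, $T_N - h = \sum_{i=1}^{N} S_i$ = random displacement

Theorem 7.1 (Wald's).

$$E\{m(\theta)^{-N} \exp[\theta(T_N - h)]\} = 1, \quad \text{all } \theta$$

$$\frac{d}{d\theta} E\{m(\theta)^{-N} \exp[\theta(T_N - h)]\}\Big|_{\theta = 0} = 0$$

$$E(T_N - h) - E(N)\,E(S) = 0$$

i.e.

$$m_h = E(N) = \frac{E(T_N - h)}{E(S)} = \frac{u_h (a - h) + w_h (b - h)}{p - q}$$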
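Wald's identity $E(N)\,E(S) = E(T_N - h)$ can also be checked by simulation. The sketch below is a seeded Monte Carlo experiment under assumed values $h = 3$, $a = 0$, $b = 10$, $p = 0.4$; it compares the simulated mean number of steps with the simulated mean displacement divided by $E(S) = p - q$.

```python
import random


def simulate_absorption(h, a, b, p, rng):
    """Run one simple RW from h until it hits a or b; return (final position, steps)."""
    t, n = h, 0
    while a < t < b:
        t += 1 if rng.random() < p else -1
        n += 1
    return t, n


def check_wald(h=3, a=0, b=10, p=0.4, trials=20000, seed=0):
    """Return (simulated E[N], simulated E[T_N - h] / E[S])."""
    rng = random.Random(seed)
    total_steps = 0
    total_disp = 0
    for _ in range(trials):
        t, n = simulate_absorption(h, a, b, p, rng)
        total_steps += n
        total_disp += t - h
    q = 1.0 - p
    return total_steps / trials, (total_disp / trials) / (p - q)


mean_n, wald_estimate = check_wald()
print(mean_n, wald_estimate)   # both should be close to m_h
```

Both numbers are estimates, so they agree only up to sampling error.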

Asymptotics (p < q)

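• Hypotheses

$$h = 0, \quad a = -1, \quad b = y$$

$$\Pr[\text{absorption at } y] = w_0 = \frac{1 - \exp(-\theta^*)}{\exp(\theta^* y) - \exp(-\theta^*)}$$

• So, define Y = excursion height

$$\Pr[Y \ge y] = \Pr[\text{hitting } y \text{ at least once}] \sim [1 - \exp(-\theta^*)] \exp(-\theta^* y) = C \exp(-\theta^* y), \quad \text{as } y \to \infty$$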
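To see how quickly the geometric-like approximation takes hold, the sketch below compares the exact hitting probability $w_0$ (with $h = 0$, $a = -1$, $b = y$) against $C \exp(-\theta^* y)$, $C = 1 - \exp(-\theta^*)$. The value $p = 0.3$ is an arbitrary illustration.

```python
import math


def hit_prob_exact(y, p):
    """Exact Pr[walk from 0 reaches y before -1] for the simple RW with p < q."""
    q = 1.0 - p
    theta = math.log(q / p)
    return (1.0 - math.exp(-theta)) / (math.exp(theta * y) - math.exp(-theta))


def hit_prob_asymptotic(y, p):
    """Geometric-like approximation C * exp(-theta* y)."""
    q = 1.0 - p
    theta = math.log(q / p)
    return (1.0 - math.exp(-theta)) * math.exp(-theta * y)


for y in (1, 2, 5, 10):
    exact = hit_prob_exact(y, 0.3)
    approx = hit_prob_asymptotic(y, 0.3)
    print(y, exact, approx, exact / approx)   # the ratio tends to 1
```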

Asymptotics of the mean time to absorption

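• A = Mean{# steps before absorption at -1}

• Since

$$\lim_{b \to \infty} u_0 = 1, \quad \lim_{b \to \infty} b\,w_0 = 0$$

we have

$$A = \lim_{b \to \infty} m_0 = \lim_{b \to \infty} \frac{b\,w_0 - u_0}{p - q} = \frac{-1}{p - q} = \frac{1}{q - p}$$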
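A quick numerical illustration of the limit, assuming the mean-time formula $m_0 = (b\,w_0 - u_0)/(p - q)$ with $h = 0$, $a = -1$: as $b$ grows, $m_0$ approaches $1/(q - p)$. The function name and $p = 0.3$ are illustrative.

```python
import math


def mean_excursion_length(p, b):
    """m_0 for h = 0, a = -1 and upper barrier b (simple RW, p < q)."""
    q = 1.0 - p
    theta = math.log(q / p)
    w0 = (1.0 - math.exp(-theta)) / (math.exp(theta * b) - math.exp(-theta))
    u0 = 1.0 - w0
    return (w0 * b - u0) / (p - q)


for b in (2, 5, 20, 200):
    print(b, mean_excursion_length(0.3, b))
print(1.0 / (0.7 - 0.3))   # limiting value A = 1/(q - p)
```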

Random walks versus alignments

Anatomy of an excursion

• $\Pr[Y_i \ge y] \sim C \exp(-\theta^* y)$

• A = E[inter-ladder pts. distance]

• A and C difficult to compute

P-values for a BLAST comparison

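• Assume comparison of two sequences of length N, with expected ladder points distance A. This gives n = N/A excursions on the average. Also, let us denote

$$Y_{\max} = \max_{1 \le i \le n} Y_i$$

• From expression (2.134) we have (since Y is geometric-like)

$$1 - \exp[-nC \exp(-\theta^* y)] \le \Pr[Y_{\max} \ge y] \le 1 - \exp\{-nC \exp[-\theta^*(y - 1)]\}$$

• Making substitutions

$$K = \frac{C}{A} \exp(-\theta^*), \quad y = x + (\theta^*)^{-1} \ln(N)$$

and applying the bounds at $y + 1$ (for an integer-valued maximum, $\Pr[Y_{\max} > y] = \Pr[Y_{\max} \ge y + 1]$), we obtain

$$1 - \exp[-K \exp(-\theta^* x)] \le \Pr[Y_{\max} > (\theta^*)^{-1} \ln(N) + x] \le 1 - \exp\{-K \exp[-\theta^*(x - 1)]\}$$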
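The algebra behind the substitution can be checked numerically. The sketch below assumes the bounds have the form $1 - \exp[-nC e^{-\theta^*(y+1)}] \le \Pr[Y_{\max} > y] \le 1 - \exp[-nC e^{-\theta^* y}]$, with $n = N/A$, $K = (C/A) e^{-\theta^*}$ and $y = x + \ln(N)/\theta^*$, and confirms that they coincide with the bounds written in terms of $K$ and $x$. All parameter values are hypothetical.

```python
import math


def ymax_bounds(x, N, A, C, theta):
    """Return ((lower, upper) via n and C, (lower, upper) via K and x)."""
    n = N / A
    K = (C / A) * math.exp(-theta)
    y = x + math.log(N) / theta
    lower_n = 1.0 - math.exp(-n * C * math.exp(-theta * (y + 1.0)))
    upper_n = 1.0 - math.exp(-n * C * math.exp(-theta * y))
    lower_k = 1.0 - math.exp(-K * math.exp(-theta * x))
    upper_k = 1.0 - math.exp(-K * math.exp(-theta * (x - 1.0)))
    return (lower_n, upper_n), (lower_k, upper_k)


# Hypothetical values, for illustration only.
via_n, via_k = ymax_bounds(x=2.0, N=1000, A=2.5, C=0.57, theta=0.85)
print(via_n)
print(via_k)
```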


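• From previous slide

$$1 - \exp[-K \exp(-\theta^* x)] \le \Pr[Y_{\max} > (\theta^*)^{-1} \ln(N) + x] \le 1 - \exp\{-K \exp[-\theta^*(x - 1)]\}$$

• Let us assume a normalized score

$$S' = \theta^* Y_{\max} - \ln(NK)$$

• Substituting into previous inequality, we obtain

$$\Pr[S' \ge s] \approx 1 - \exp[-\exp(-s)]$$

• So, P-value, corresponding to an empirically obtained maximum score $y_{\max}$, equals

$$P\text{-value} \approx 1 - \exp[-\exp(-s')], \quad \text{where } s' = \theta^* y_{\max} - \ln(NK)$$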


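• Expected value of the normalized score is approximately equal to Euler's constant $\gamma \approx 0.5772$

$$E[S'] \approx \gamma$$

• This yields

$$E[Y_{\max}] \approx (\theta^*)^{-1} [\ln(NK) + \gamma] = (\theta^*)^{-1} \ln(NK) + (\theta^*)^{-1} \gamma$$

• Both

$$S' = \theta^* Y_{\max} - \ln(NK)$$

and

$$\text{bit score} = \frac{\theta^* Y_{\max} - \ln(K)}{\ln(2)}$$

are invariant with respect to multiplication of the score by a constant (why?)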
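One way to explore the invariance numerically: multiplying every step score by a constant $c$ multiplies $Y_{\max}$ by $c$ and divides $\theta^*$ by $c$, so the product $\theta^* Y_{\max}$ is unchanged. The sketch below illustrates this for $c = 3$, taking $K$ and $N$ as given hypothetical constants (the assumption that $K$ is unaffected by rescaling is made here for illustration and is not derived).

```python
import math


def theta_star(dist):
    """Positive root of sum_i Pr[S=i] exp(theta*i) = 1; assumes E[S] < 0, max step > 0."""
    def mgf(t):
        return sum(pr * math.exp(t * s) for s, pr in dist.items())
    lo, hi = 0.0, 1.0
    while mgf(hi) < 1.0:
        lo, hi = hi, 2.0 * hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mgf(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def normalized_score(y_max, theta, N, K):
    return theta * y_max - math.log(N * K)


def bit_score(y_max, theta, K):
    return (theta * y_max - math.log(K)) / math.log(2)


base = {1: 0.3, -1: 0.7}       # match = 1, mismatch = -1
scaled = {3: 0.3, -3: 0.7}     # same walk with every score multiplied by 3
N, K = 1000, 0.1               # hypothetical values
s_base = normalized_score(12, theta_star(base), N, K)
s_scaled = normalized_score(36, theta_star(scaled), N, K)
print(s_base, s_scaled)
print(bit_score(12, theta_star(base), K), bit_score(36, theta_star(scaled), K))
```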


• Expected number of excursions of height exceeding v

$$E = \frac{N}{A}\, C \exp[-\theta^*(v + 1)] = NK \exp(-\theta^* v)$$

• For an empirically found value of the score,

$$E' = NK \exp(-\theta^* y_{\max})$$

• By comparison with a previous formula we see

$$P\text{-value} \approx 1 - \exp(-E')$$
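A minimal sketch relating the two quantities, assuming $E' = NK \exp(-\theta^* y_{\max})$ and the normalized score $s' = \theta^* y_{\max} - \ln(NK)$, so that $\exp(-s') = E'$ and the two P-value expressions coincide. The numerical values of $N$, $K$ and $\theta^*$ are hypothetical.

```python
import math


def e_value(y_max, N, K, theta):
    """E' = N K exp(-theta* y_max)."""
    return N * K * math.exp(-theta * y_max)


def p_from_e(e):
    """P-value = 1 - exp(-E'), computed with expm1 for accuracy at small E'."""
    return -math.expm1(-e)


def p_from_normalized_score(y_max, N, K, theta):
    """P-value = 1 - exp(-exp(-s')), with s' = theta* y_max - ln(N K)."""
    s = theta * y_max - math.log(N * K)
    return -math.expm1(-math.exp(-s))


N, K, theta = 1000, 0.1, 0.85     # hypothetical parameters
for y_max in (5, 10, 20):
    e = e_value(y_max, N, K, theta)
    print(y_max, e, p_from_e(e), p_from_normalized_score(y_max, N, K, theta))
```

For small $E'$ the P-value and $E'$ are nearly equal, which is why the two are often used interchangeably for highly significant alignments.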

From the BLAST course

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of

• (i) real but non-homologous sequences;

• (ii) real sequences that are shuffled to preserve compositional properties or

• (iii) sequences that are generated randomly based upon a DNA or protein sequence model.

Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.

• As demonstrated above, scores of local alignments are covered by a well-developed theory.

• For global alignments, Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these can not be generalized easily.

– It is possible to express the score of interest in terms of standard deviations from the mean, but it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown.

– The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01.

• One further pitfall to avoid is exaggerating the significance of a result found among multiple tests.

– When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly.

– An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
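The arithmetic behind the last example can be sketched as follows, assuming the 1000 trials are independent: the chance that the best of $k$ trials reaches a single-trial P-value $p$ is $1 - (1 - p)^k$. The function name is illustrative.

```python
def best_of_k_pvalue(p_single, k):
    """P-value of the best result among k independent trials."""
    return 1.0 - (1.0 - p_single) ** k


print(best_of_k_pvalue(0.0001, 1000))   # roughly 0.1
print(0.0001 * 1000)                    # simple upper bound k * p
```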


• The E-value of equation applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths?

• One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database.


• An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains.

• If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues.

• Examining equation this can be accomplished simply by treating the database as a single long sequence of length N.

• The BLAST programs take this approach to calculating database E-value. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.
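The equivalence of the two descriptions can be sketched in a few lines, assuming the pairwise E-value has the Karlin-Altschul form $E = K m n \exp(-\lambda S)$ (the equation the text refers to is not reproduced here, so this form is an assumption). Multiplying the pairwise value by $N/n$ gives the same number as evaluating the formula with $n$ replaced by $N$. All numerical values are hypothetical.

```python
import math


def pairwise_e_value(m, n, K, lam, S):
    """Assumed pairwise form E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * S)


def database_e_value(m, n, N, K, lam, S):
    """Pairwise E-value rescaled by N / n (prior proportional to sequence length)."""
    return pairwise_e_value(m, n, K, lam, S) * N / n


m, n, N = 250, 400, 5_000_000          # query, database sequence, total database length
K, lam, S = 0.1, 0.3, 80               # hypothetical statistical parameters and score
print(database_e_value(m, n, N, K, lam, S))
print(pairwise_e_value(m, N, K, lam, S))   # database treated as one long sequence
```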


Comparison of two unaligned sequences

• Until now, a fixed ungapped alignment in the comparison of two sequences of length N each.

• Now, given two sequences of lengths N1 and N2 without any specific alignment (total N1 + N2 – 1 ungapped alignments).

• Theory advanced, we give only highlights of results.

• Many conclusions of the previous sections carry over with N substituted by N1N2.

Scores

• The basic score is re-defined now

$Y_{\max}$ = max score achieved in a RW comparing the sequences, using all possible ungapped alignments

• Mean number of (independent) ladder points in all alignments

$$n = \frac{N_1 N_2}{A}$$

• Since the heights of excursions are geometric-like rv's (n of them), $Y_{\max}$ = max of geometric-like rv's (C and A as before), and

$$1 - e^{-K e^{-\theta^* x}} \le \Pr[Y_{\max} > (\theta^*)^{-1} \ln(N_1 N_2) + x] \le 1 - e^{-K e^{-\theta^*(x - 1)}}$$

Scores

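• From the previous slide

$$1 - e^{-K e^{-\theta^* x}} \le \Pr[Y_{\max} > (\theta^*)^{-1} \ln(N_1 N_2) + x] \le 1 - e^{-K e^{-\theta^*(x - 1)}}$$

• Define standardized score

$$S' = \theta^* Y_{\max} - \ln(N_1 N_2 K)$$

• Expected count of (independent) excursions of height exceeding y

$$E = \frac{N_1 N_2}{A}\, C e^{-\theta^*(y + 1)} = N_1 N_2 K e^{-\theta^* y} = e^{-s'}, \quad s' = \theta^* y - \ln(N_1 N_2 K)$$

• Similar expressions as before for expected score and P-value

$$E[S'] \approx \gamma, \quad \Pr[S' \ge s] \approx 1 - e^{-e^{-s}}$$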

Karlin-Altschul sum statistic

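• Idea: Add information from the r-1 “next to the highest” excursions

$$Y_{\max} = Y_1 \ge Y_2 \ge \dots \ge Y_r$$

$$S_i = \theta^* Y_i - \ln(N_1' N_2' K), \quad i = 1, \dots, r \quad (N_1' \text{ and } N_2' \text{ reflect edge effects})$$

• It was proved that

$$f_S(s_1, \dots, s_r) = \exp\left(-e^{-s_r} - \sum_{k=1}^{r} s_k\right)$$

• The particular statistics used

$$T_r = S_1 + \dots + S_r, \quad \Pr[T_r \ge t] \approx \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}, \quad \text{for } t \text{ large enough}$$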
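A small sketch of the tail approximation $\Pr[T_r \ge t] \approx e^{-t} t^{r-1} / [r!\,(r-1)!]$, assuming it is applied only for large $t$. For $r = 1$ it reduces to $e^{-t}$, which agrees with the large-$s$ behaviour of $1 - \exp(-e^{-s})$. The function name is illustrative.

```python
import math


def sum_statistic_tail(t, r):
    """Approximate Pr[T_r >= t] for the sum of the r highest normalized scores."""
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))


for r in (1, 2, 3):
    print(r, sum_statistic_tail(10.0, r))

# For r = 1, compare with the single-score P-value 1 - exp(-exp(-t)).
print(sum_statistic_tail(10.0, 1), -math.expm1(-math.exp(-10.0)))
```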

Choice of r and multiple testing

• Usually, all sum tests are performed for all “available” r

• The best P-value is accepted, following heuristic corrections (see Section 9.3.4),

$$P_r' = \frac{P_r}{(1 - \pi)\,\pi^{r-1}}, \quad r > 1$$

$$E = 2 N_1' N_2' K e^{-\theta^* y_{\max}}, \quad P_1' = 1 - e^{-E}, \quad r = 1$$

Comparison of a query sequence against a database

• Use Poisson distribution to obtain the following probability

$$\Pr[\text{at least 1 HSP between query and database seq. with } Y \ge v] = 1 - e^{-E}$$

• Since database is of length D, then expected # HSPs with scores $\ge v$

$$\text{Expect} = (1 - e^{-E}) \frac{D}{N_2}, \quad P = 1 - e^{-\text{Expect}}, \quad r = 1$$

• For all other r

$$\text{Expect} = P_r' \frac{D}{N_2}, \quad P = 1 - e^{-\text{Expect}}$$

• Analyze Example 9.5.2.