Copyright (c) 2002 by SNU CSE Biointelligence Lab
Biological Sequence Analysis (Ch 11. Background on probability)
Biointelligence Laboratory
School of Computer Sci. & Eng.
Seoul National University, Seoul 151-742, Korea
This slide file is available online at http://bi.snu.ac.kr/
List of Contents
- Introduction
- Random variables
- Probability density, distributions
- Transformation
- Plots of statistical distributions
- Relative entropy and mutual information
- Many random variables
- One DNA sequence
- Maximum likelihood
- Sampling
- Metropolis sampling
- EM algorithm
Introduction
- Sequence similarity test:
  g g a g a c t g t a g a c a g c t a a t g c t a t a
  g a a c g c c c t a g c c a c g a g c c c t t a t c
  P(more than 10 matches) = ? (0.04)
- Parameter, data, hypothesis, random variable
One DNA Sequence
- Shotgun sequencing: find a long DNA sequence from many overlapping short sequences (about 500 bases each).
One DNA Sequence (2)
- 1. What is the mean proportion of the genome covered by contigs?
- 2. What is the mean number of contigs?
- 3. What is the mean contig size?
(Discrete) Random Variables
- A discrete numerical quantity corresponding to an observed outcome of an experiment.
- (E.g.) Experiment: rolling two six-sided dice. Outcome: the two numbers on the dice. Discrete random variables: the sum of the two numbers, the difference of the two numbers, the smaller of the two numbers, etc.
- Random variables are usually represented by uppercase symbols (X, Y, ...) and the realized values of the random variables are represented by lowercase symbols (x, y, ...).
Probability Distributions
- For a finite set X: the probability distribution is simply an assignment of a probability p_x to each outcome x in X. (E.g.) the probabilities of the outcomes of rolling a loaded die: {1/12, 1/12, 1/12, 1/6, 1/4, 1/3}.
- For a continuous set X: the probability density f(x):
  p(x - \delta x/2 \le X \le x + \delta x/2) \approx f(x)\,\delta x
  p(x_0 \le X \le x_1) = \int_{x_0}^{x_1} f(x)\,dx
  f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1
  Note: f(x) can be greater than 1.
Probability density, distribution (1/2)
- Relation of distribution and density:
  F(t) = P(X \le t) = \sum_{x \le t} P(X = x) = \int_{x \le t} f(x)\,dx
- Expectation (mean):
  E(X) = \sum_x x\,P(X = x) = \int x f(x)\,dx
- For continuous X: P(X = x) = 0, and f(x) can be greater than 1.
Probability density, distribution (2/2)
- Relation of distribution and density:
  P(X = x) = \sum_{t \le x} P(X = t) - \sum_{t < x} P(X = t), \qquad f(t) = \frac{d}{dt} F(t)
- Conditions for the probability density:
  1. f(x) \ge 0
  2. \int f(x)\,dx = 1
- Variance:
  E(X - \mu)^2 = E X^2 - \mu^2
Transformation
- One random variable: X_2 = g(X_1) for monotone g,
  F_2(t) = P(X_2 < t) = P(g(X_1) < t) = P(X_1 < g^{-1}(t)) = F_1(g^{-1}(t))
  f_2(x) = f_1(g^{-1}(x))\,(g^{-1})'(x)
- Many random variables: for a 1-1 relation between \vec{X} = (X_1, \ldots, X_n) and \vec{U} = (U_1(X_1, \ldots, X_n), \ldots, U_n(X_1, \ldots, X_n)),
  f_U(u_1, \ldots, u_n) = f_X(x_1, \ldots, x_n)\,|J|^{-1} = f_X(x_1, \ldots, x_n)\,|J^*|
  where J is the Jacobian of the transformation.
Statistical Independence
- Two events are independent if the outcome of one event does not affect the outcome of the other event.
- Discrete random variables are independent if the value of one does not affect the probabilities associated with the values of another random variable.
- (E.g.) Experiment: rolling a fair die.
  1. A: the number is even; B: the number is greater than or equal to 3.
     P(A | B) = P(A), P(B | A) = P(B): A and B are independent.
  2. C: the number is greater than 3.
     P(C) = 1/2, P(C | A) = 2/3, P(A | C) = 2/3, P(A) = 1/2: A and C are not independent.
Uniform Distribution
- Discrete uniform: each outcome occurs with equal probability. For N outcomes: P(x) = 1/N.
- Continuous uniform: f(x) = 1/(b - a), or 1/area(A) in higher dimensions.
Bernoulli Distribution
- A Bernoulli trial is a single trial with two possible outcomes (success / failure).
- Bernoulli random variable Y: the number of successes in this trial.
  p(y \text{ successes in a trial}) = p^y (1-p)^{1-y}, \quad y = 0, 1
Binomial Distribution
- Defined on the finite set of all the possible results of N trials with a binary outcome ('0' or '1').
- The random variable is the number of successes in the fixed N trials.
  p(k \text{ successes out of } N) = \binom{N}{k} p^k (1-p)^{N-k}
  m = \sum_k k\,p(k) = \sum_k k \binom{N}{k} p^k (1-p)^{N-k} = Np
  \sigma^2 = \sum_k (k - m)^2 p(k) = \sum_k (k - m)^2 \binom{N}{k} p^k (1-p)^{N-k} = Np(1-p)
Multinomial Distribution
- Extend the binomial to K independent outcomes with probabilities \theta_i\ (i = 1, \ldots, K):
  p(n_1, \ldots, n_K \mid \theta) = \frac{N!}{n_1! \cdots n_K!} \prod_{i=1}^{K} \theta_i^{n_i}
- (E.g.) Rolling a fair die N times:
  P(rolling 12 times and getting each number twice) = 12! / (2!)^6 \cdot (1/6)^{12} \approx 3.4 \times 10^{-3}
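The dice example above can be checked numerically; a minimal sketch of the multinomial pmf in Python (the function name is ours, not from the slides):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Multinomial probability N!/(n_1!...n_K!) * prod(theta_i^n_i)."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

# Rolling a fair die 12 times and getting each number exactly twice:
p = multinomial_pmf([2] * 6, [1 / 6] * 6)
print(p)  # ~0.0034
```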
Geometric Distribution
- The random variable is the number of trials before the first failure (the length of a success run):
  P(y) = p^y (1-p), \quad y = 0, 1, 2, \ldots
  F(y) = P(Y \le y) = 1 - p^{y+1}
- Used to test the significance of long runs in sequences.
- Geometric-like random variables in BLAST theory:
  1 - F(y-1) = P(Y \ge y) \sim C p^y, \quad 0 < C < 1
Negative Binomial Distribution
- The random variable is the number of trials needed for a fixed number m of successes.
- P(N = n) = P(the first (n-1) trials result in exactly (m-1) successes and (n-m) failures, and trial n results in success):
  P(n) = \binom{n-1}{m-1} p^m (1-p)^{n-m}, \quad n = m, m+1, m+2, \ldots
Generalized Geometric Distribution
- The random variable is the number of trials before the (k+1)-th failure:
  P(y) = \binom{y}{k} p^{y-k} (1-p)^{k+1}, \quad y = k, k+1, k+2, \ldots
Poisson Distribution
- A limiting form of the binomial distribution (as n becomes large and p becomes small) with moderate \lambda = np:
  P(y) = e^{-\lambda} \frac{\lambda^y}{y!}, \quad y = 0, 1, 2, \ldots
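The limiting relation can be illustrated numerically; a small sketch comparing the two pmfs (the choice n = 1000, p = 0.002 is ours, giving λ = 2):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# large n, small p, moderate lambda = n*p = 2
n, p = 1000, 0.002
for k in range(6):
    print(k, round(binom_pmf(k, n, p), 5), round(poisson_pmf(k, n * p), 5))
# the two columns agree to ~3 decimal places
```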
Exponential Distribution
- The exponential distribution is a continuous analogue of the geometric distribution:
  f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
  F(x) = 1 - e^{-\lambda x}, \quad x \ge 0
  E(X) = 1/\lambda, \quad Var(X) = 1/\lambda^2
  Med(X) = (\log 2)/\lambda \approx 0.693\,E(X) < E(X) \quad (\text{right-skewed})
Relation with Geometric Distribution
- Suppose X has an exponential distribution and Y is the integer part of X:
  P(Y = y) = P(y \le X < y + 1) = e^{-\lambda y}(1 - e^{-\lambda}) = p^y (1 - p), \quad p = e^{-\lambda}
- The density function of the fractional part D = X - Y:
  f(d) = \frac{\lambda e^{-\lambda d}}{1 - e^{-\lambda}}, \quad 0 \le d < 1
Gaussian Distribution
- As N gets larger, the mean and variance of the binomial distribution increase linearly. Rescale k by
  u = \frac{k - m}{\sigma} = \frac{k - Np}{\sqrt{Np(1-p)}}
  then the binomial becomes a Gaussian with the density
  f(u) = \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)
Dirichlet Distribution (1/3)
- The Dirichlet distribution is a conjugate distribution for the multinomial distribution. (A Dirichlet prior wrt. the multinomial distribution gives a Dirichlet posterior again.)
- The \theta_i's correspond to the parameters of the multinomial distribution:
  D(\theta \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,\delta\!\left(\textstyle\sum_i \theta_i - 1\right), \quad 0 \le \theta_i \le 1
  Z = \int \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,\delta\!\left(\textstyle\sum_i \theta_i - 1\right) d\theta = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(\sum_i \alpha_i)}
  (using \Gamma(x + 1) = x\,\Gamma(x))
Dirichlet Distribution (2/3)
- The mean of the Dirichlet is the normalized parameter vector:
  E[\theta_i] = \frac{\alpha_i}{\sum_k \alpha_k}
- (E.g.) The three distributions with \alpha = (1,2,5); (10,20,50); (.1,.2,.5) all have the same mean (1/8, 2/8, 5/8) though with different values of \alpha.
- For K = 2, the Dirichlet reduces to the Beta distribution.
Dirichlet Distribution (3/3)
- (E.g.) The dice factory: sampling \theta_1, \ldots, \theta_6 from a Dirichlet parameterized by \alpha_1, \ldots, \alpha_6.
  Factory A: all six parameters set to 10. Factory B: all six parameters set to 2.
- On average both factories produce fair dice (average = 1/6), but if we find a loaded die with (.1 .1 .1 .1 .1 .6), it is more likely from factory B:
  D(\theta \mid \alpha_A) = \frac{\Gamma(60)}{\Gamma(10)^6}\,(.1)^{5(10-1)}\,(.6)^{10-1} = 0.119
  D(\theta \mid \alpha_B) = \frac{\Gamma(12)}{\Gamma(2)^6}\,(.1)^{5(2-1)}\,(.6)^{2-1} = 199.6
- The variance of the Dirichlet is inversely proportional to the sum of the parameters.
Gamma Distribution (1/2)
- Density:
  g(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha - 1} e^{-\beta x}, \quad 0 < x < \infty
  \text{mean} = \alpha/\beta, \qquad \text{var} = \alpha/\beta^2
Gamma Distribution (2/2)
- The gamma distribution is conjugate to the Poisson: the probability of seeing n events over some interval, when an individual event occurs at rate p in that interval, is
  f(n) = \frac{e^{-p} p^n}{n!}
- The gamma distribution is used to model probabilities of rates, e.g. the rate of evolution at different sites in DNA sequences.
Extreme Value Distribution (1/2)
- For N samples from g(x), the probability that the largest of them is less than x is G(x)^N, where
  G(x) = \int_{-\infty}^{x} g(u)\,du
- The extreme value density (EVD) for g(x) is the limit of
  h(x) = N g(x) G(x)^{N-1}
- The EVD is used to model the breaking point of a chain, and to assess the significance of the maximum score from a set of alignments.
Extreme Value Distribution (2/2)
- EVD for the exponential density g(x) = \alpha e^{-\alpha x} (Gumbel distribution):
  h(x) = N \alpha e^{-\alpha x} (1 - e^{-\alpha x})^{N-1} \to \alpha e^{-z} \exp(-e^{-z}), \quad z = \alpha(x - y)
- (E.g.) For N = 1, 2, 10, 100: N >= 10 gives a good approximation to the EVD.
Statistical Distribution (1/7)
- Binomial distribution (plots for p = 0.5, n = 30 and p = 0.8, n = 30)
Statistical Distribution (2/7)
- Poisson distribution (plot)
Statistical Distribution (3/7)
- Negative-binomial distribution (plot)
Statistical Distribution (4/7)
- Geometric distribution (plot)
Statistical Distribution (5/7)
- Exponential distribution (plot)
Statistical Distribution (6/7)
- Normal distribution (plot)
Statistical Distribution (7/7)
- Lognormal distribution (plot)
Entropy (1/2)
- A measure of the average uncertainty of an outcome.
- Shannon entropy:
  H(X) = -\sum_i P(x_i) \log P(x_i)
- Entropy is maximized when all the P(x_i) are equal, and the maximum is then \log K. If we are certain of the outcome, the entropy is zero.
- Information:
  I(X) = H_{before} - H_{after}
Entropy (2/2)
- (E.g.) The entropy of an equiprobable DNA symbol (A, C, G, T) is 2 bits.
- Information content of a conserved position: a particular position is always an A or a G, with p(A) = 0.7 and p(G) = 0.3. Thus
  I = H_{before} - H_{after} = 2 - 0.88 = 1.12 \text{ bits}
  (The more conserved the position, the higher the information content.)
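The numbers in this example can be reproduced directly; a minimal sketch:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

h_before = entropy([0.25] * 4)       # uniform over A, C, G, T -> 2 bits
h_after = entropy([0.7, 0.3])        # conserved position: A or G
print(round(h_before - h_after, 2))  # information content: 1.12 bits
```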
Relative entropy and mutual information (1/4)
- Relative entropy (KL divergence) of P with respect to Q:
  H(P \| Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}
- Information content and relative entropy are the same if Q represents the initial state.
- Not a metric:
  H(P \| Q) \ne H(Q \| P)
Relative entropy and mutual information (2/4)
- Positivity of relative entropy, using \log x \le x - 1:
  -H(P \| Q) = \sum_i P(x_i) \log \frac{Q(x_i)}{P(x_i)} \le \sum_i P(x_i) \left( \frac{Q(x_i)}{P(x_i)} - 1 \right) = 0
Relative entropy and mutual information (3/4)
- Independence can be measured by the relative entropy between P(X, Y) and P(X)P(Y):
  M(X; Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)}
- M(X; Y) can be interpreted as the amount of information that we acquire about outcome X when we are told outcome Y.
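The definition above reduces to a double sum over a joint table; a small sketch (the two joint distributions are made up for illustration):

```python
from math import log2

def mutual_information(joint):
    """M(X;Y) in bits from a joint probability table joint[i][j]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    m = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                m += pxy * log2(pxy / (px[i] * py[j]))
    return m

independent = [[0.25, 0.25], [0.25, 0.25]]  # P(x,y) = P(x)P(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y fully determined by X
print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```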
Relative entropy and mutual information (4/4)
- (E.g.) Acceptor sites: 757 acceptor sites from a database of human genes; 30 bases upstream and 20 bases downstream are extracted from each acceptor.
- Relative entropy at position i:
  \sum_a p_i(a) \log [p_i(a) / q_a]
  where q is the overall distribution of the four nucleotides in the sequences.
- Mutual information between adjacent positions:
  \sum_{a,b} p_{i,i+1}(a, b) \log [p_{i,i+1}(a, b) / (p_i(a)\,p_{i+1}(b))]
Many random variables (1/2)
- Marginal probability:
  P(Y_i = y_i, \ldots, Y_k = y_k) = \sum_{y_1, \ldots, y_{i-1}} P(Y_1 = y_1, \ldots, Y_k = y_k)
  f(x_i, \ldots, x_k) = \int \cdots \int f(x_1, \ldots, x_k)\,dx_1 \cdots dx_{i-1}
- Conditional probability:
  P(Y_{i+1} = y_{i+1}, \ldots, Y_k = y_k \mid Y_1 = y_1, \ldots, Y_i = y_i) = \frac{P(Y_1 = y_1, \ldots, Y_k = y_k)}{P(Y_1 = y_1, \ldots, Y_i = y_i)}
  f(x_{i+1}, \ldots, x_k \mid x_1, \ldots, x_i) = \frac{f(x_1, \ldots, x_k)}{f(x_1, \ldots, x_i)}
Many random variables (2/2)
- Covariance:
  \sigma_{12} = E[(Y_1 - \mu_1)(Y_2 - \mu_2)] = \sum_{y_1, y_2} (y_1 - \mu_1)(y_2 - \mu_2)\,P(Y_1 = y_1, Y_2 = y_2)
              = \int\!\!\int (x_1 - \mu_1)(x_2 - \mu_2)\,f(x_1, x_2)\,dx_1\,dx_2
- Correlation:
  \rho_{Y_1 Y_2} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}
One DNA Sequence (1/10)
- Shotgun sequencing: find a long DNA sequence from many overlapping short sequences.
One DNA Sequence (2/10)
- 1. What is the mean proportion of the genome covered by contigs?
- 2. What is the mean number of contigs?
- 3. What is the mean contig size?
One DNA Sequence (3/10)
- There are N fragments, each of length L; the whole DNA is of length G.
- The position of the left-hand end of any fragment is uniformly distributed in (0, G).
- The number of fragments falling in an interval of length h is binomial with mean Np = Nh/G.
- For large N and a small interval h, this distribution becomes Poisson with mean a = Nh/G.
One DNA Sequence (4/10)
- Mean proportion of the genome covered by one or more fragments
  = P(a point chosen at random is covered by at least one fragment)
  = P(the left end of at least one fragment is in the interval of length L starting from that point)
  = 1 - e^{-NL/G} = 1 - e^{-a}, \quad a = NL/G \quad (by Poisson)
One DNA Sequence (5/10)
- The coverage needed to cover the whole genome:
  P = .99: a = NL/G = 4.6; P = .999: a = 6.9 (about 7 times the genome length). (a: coverage)
- The human genome is 3x10^9 nucleotides: 3,000,000 of them are still missing at P = .999.
- What is the mean number of contigs? Each contig has a unique right-most fragment, so count successes over the N fragments with
  p = P(a fragment is the right-most member of a contig)
    = P(no other fragment has its left-end point on the fragment in question)
    = e^{-a}
- Mean number of contigs: Np = N e^{-a} = N e^{-NL/G}
One DNA Sequence (6/10)
- For G = 100,000 and L = 500:

  a                       0.5   0.75  1     1.5   2     3     4     5    6    7
  Mean number of contigs  60.7  70.8  73.6  66.9  54.1  29.9  14.7  6.7  3.0  1.3

- If there is a small number of fragments, there must be a small number of contigs.
- A large number of fragments tends to form a small number of large contigs.
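The closed-form mean N e^{-a} can be checked against the tabulated values; a minimal sketch, assuming G = 100,000 and L = 500 as stated (so N = aG/L):

```python
from math import exp

def mean_contigs(a, G=100_000, L=500):
    """Expected number of contigs, N * e^{-a}, with N = a*G/L fragments."""
    N = a * G / L
    return N * exp(-a)

for a in (0.5, 1, 2, 4, 7):
    print(a, round(mean_contigs(a), 1))
# reproduces the table: 60.7, 73.6, 54.1, 14.7, 1.3
```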
One DNA Sequence (7/10)
- Mean contig size? Tools needed:
- Expectation of a sum of random variables where N is random
- Moment generating function of a random variable
- Probability generating function of a random variable
- Geometric distribution
- Exponential distribution
One DNA Sequence (8/10)
- Moment generating function (mgf) of a random variable Y:
  m(\theta) = E(e^{\theta Y}) = \sum_y e^{\theta y} P(y) = \int e^{\theta y} f(y)\,dy
- Finding the mean and variance from the mgf:
  \mu = \left. \frac{dm}{d\theta} \right|_{\theta=0}, \quad \sigma^2 = \left. \frac{d^2 m}{d\theta^2} \right|_{\theta=0} - \mu^2
  or
  \mu = \left. \frac{d \log m}{d\theta} \right|_{\theta=0}, \quad \sigma^2 = \left. \frac{d^2 \log m}{d\theta^2} \right|_{\theta=0}
One DNA Sequence (9/10)
- Probability generating function (pgf) of a (discrete) random variable Y:
  p(t) = E(t^Y) = \sum_y P(y)\,t^y
  \mu = \left. \frac{d}{dt} p(t) \right|_{t=1}, \quad \sigma^2 = \left. \frac{d^2}{dt^2} p(t) \right|_{t=1} + \mu - \mu^2
One DNA Sequence (9/10, cont.)
- Expectation of a sum of random variables S = X_1 + \cdots + X_N when N is random:
  P(S = y) = \sum_n P_n\,P(S = y \mid n)
  pgf of N: p(t) = \sum_n P_n t^n; \quad pgf of S given N = n: q(t)^n
  P(S = y) = \text{coeff. of } t^y \text{ in } \sum_n P_n q(t)^n
  thus the pgf of S is p(q(t)), and by the chain rule E(S) = E(N)\,E(X)
One DNA Sequence (10/10)
- Contig length = sum of the left-end gaps of the n fragments overlapping in the contig + L.
- The distance between two successive left-hand end points ~ Exp(\lambda), \lambda = N/G (from the Poisson process).
- The number of overlapping fragments ~ Geo(p), with
  p = \int_0^L \lambda e^{-\lambda x}\,dx = 1 - e^{-a}
- Mean of the random distance (between successive left ends) ~ conditional exponential (0 < X < L):
  f(x \mid 0 < x < L) = \frac{f(x)}{\int_0^L f(x)\,dx} = \frac{\lambda e^{-\lambda x}}{1 - e^{-\lambda L}}
  E(X \mid 0 < X < L) = \frac{1}{\lambda} - \frac{L}{e^{\lambda L} - 1}
One DNA Sequence (11/10)
- Mean contig length:
  E(N)\,E(X \mid 0 < X < L) + L = (e^a - 1)\left( \frac{1}{\lambda} - \frac{L}{e^a - 1} \right) + L = \frac{e^a - 1}{\lambda}
- (E.g.) For L = 500, a = 2, G = 100,000: the mean contig size is about 1600.
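Plugging the example's numbers into the closed form; a quick check, using λ = N/G = a/L:

```python
from math import exp

def mean_contig_length(a, L):
    """Mean contig length (e^a - 1)/lambda, with lambda = a/L."""
    lam = a / L
    return (exp(a) - 1) / lam

print(round(mean_contig_length(2, 500)))  # ~1597, i.e. about 1600
```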
Maximum likelihood
- \theta^{ML} = \arg\max_\theta P(D \mid \theta, M)
- Consistent: the parameter value used to generate the data set will also be the value that maximizes the likelihood in the limit.
- (E.g.) For K observable outcomes \omega_i of the model, the frequency of occurrence of \omega_i tends to P(\omega_i \mid \theta_0, M), and the per-observation log-likelihood (1/n) \sum_i n_i \log P(\omega_i \mid \theta, M) tends to \sum_i P(\omega_i \mid \theta_0, M) \log P(\omega_i \mid \theta, M).
  The positivity of the relative entropy implies that for all \theta
  \sum_i P(\omega_i \mid \theta_0, M) \log P(\omega_i \mid \theta_0, M) \ge \sum_i P(\omega_i \mid \theta_0, M) \log P(\omega_i \mid \theta, M)
  Thus the likelihood is maximized by \theta_0.
- Drawback: can give poor estimates for scanty data.
Posterior probability distribution
- Bayes theorem:
  P(\theta \mid D, M) = \frac{P(D \mid \theta, M)\,P(\theta \mid M)}{P(D \mid M)}
- Use of posteriors: MAP (maximum a posteriori probability) estimate:
  \theta^{MAP} = \arg\max_\theta P(D \mid \theta, M)\,P(\theta \mid M)
- PME (posterior mean estimator): parameters weighted by the posterior:
  \theta^{PME} = \int \theta\,P(\theta \mid n)\,d\theta
Change of variables
- Given a density f(x) of x and a change of variable x = \phi(y), the density g(y) of y becomes
  g(y) = f(\phi(y))\,|\phi'(y)|
- The maximum of g(y), the MAP, or the PME may shift from that of f(x).
Sampling
- Given a finite set with probabilities P(x), to sample from this set means to pick elements x randomly with probability P(x).
- Sampling with a random number generator on [0, 1]:
  P(\text{selecting } x_i) = P\{ p(x_1) + \cdots + p(x_{i-1}) < \text{rand}[0,1] < p(x_1) + \cdots + p(x_{i-1}) + p(x_i) \} = P(x_i)
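The cumulative-interval scheme above is a few lines of code; a minimal sketch (the helper takes the uniform draw as an argument so the interval mapping is easy to check):

```python
import random

def pick(u, probs):
    """Map a uniform draw u in [0,1) to an outcome index via cumulative sums."""
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

probs = [0.1, 0.2, 0.3, 0.4]
print(pick(0.05, probs))  # falls in [0, 0.1)   -> 0
print(pick(0.25, probs))  # falls in [0.1, 0.3) -> 1
print(pick(0.95, probs))  # falls in [0.6, 1.0) -> 3
sample = pick(random.random(), probs)  # an actual random draw
```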
Sampling by transformation from a uniform distribution
- For a uniform density f(x) and a map x = \phi(y):
  g(y) = f(\phi(y))\,\phi'(y) = \phi'(y), \quad \text{so} \quad \phi(y) = \int_{-\infty}^{y} g(u)\,du, \quad y = \phi^{-1}(x)
- (E.g.) Sampling from a Gaussian: g(y) is given, x is uniform. Define
  \phi(y) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du
  and let y = \phi^{-1}(x).
Sampling with the Metropolis algorithm (1/3)
- For sampling when analytic methods are not available.
- Generate a sequence {y_i} from a transition process \tau(y_i \mid y_{i-1}) that approximates P as closely as we like, under the condition (detailed balance):
  P(x)\,\tau(y \mid x) = P(y)\,\tau(x \mid y)
- Detailed balance implies:
  \lim_{N \to \infty} \frac{1}{N}\,\#(y_i = x) = P(x)
- The sequence under this transition process will sample P correctly.
Sampling with the Metropolis algorithm (2/3)
Metropolis algorithm:
- Symmetric proposal mechanism: given a point x, this selects a point y with probability F(y | x), with F symmetric.
- Acceptance mechanism: accepts the proposed y with probability min(1, P(y)/P(x)).
  (A point y with larger posterior probability than the current x is always accepted, and one with lower probability is accepted randomly with probability P(y)/P(x).)
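The two mechanisms above fit in a few lines; a minimal sketch for a small discrete target (the uniform proposal over all states is our illustrative choice and is symmetric):

```python
import random

def metropolis(target, n_steps, seed=0):
    """Sample state indices from `target` with a symmetric uniform proposal."""
    rng = random.Random(seed)
    k = len(target)
    x = 0
    samples = []
    for _ in range(n_steps):
        y = rng.randrange(k)  # symmetric proposal: F(y|x) = 1/k
        # acceptance mechanism: accept y with probability min(1, P(y)/P(x))
        if rng.random() < min(1.0, target[y] / target[x]):
            x = y
        samples.append(x)
    return samples

target = [0.1, 0.2, 0.3, 0.4]
samples = metropolis(target, 50_000)
freqs = [samples.count(i) / len(samples) for i in range(4)]
print([round(f, 2) for f in freqs])  # close to the target probabilities
```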
Sampling with the Metropolis algorithm (3/3)
Metropolis algorithm (balance):
  P(x)\,\tau(y \mid x) = P(x)\,F(y \mid x)\,\min(1, P(y)/P(x))
                       = F(y \mid x)\,\min(P(x), P(y))
                       = P(y)\,F(x \mid y)\,\min(1, P(x)/P(y)) = P(y)\,\tau(x \mid y)
Gibbs sampling
- Sample from the conditional distributions for each i:
  P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N)
- The proposal distribution is the conditional distribution: always accept the sample.
Estimation of probabilities from counts (1/2)
- \theta_i^{ML} = n_i / N
- For any \theta \ne \theta^{ML} and an observation n:
  \log \frac{P(n \mid \theta^{ML})}{P(n \mid \theta)} = \log \frac{\prod_i (\theta_i^{ML})^{n_i}}{\prod_i \theta_i^{n_i}} = \sum_i n_i \log \frac{\theta_i^{ML}}{\theta_i} = N \sum_i \theta_i^{ML} \log \frac{\theta_i^{ML}}{\theta_i} > 0
  so P(n \mid \theta^{ML}) > P(n \mid \theta).
Estimation of probabilities from counts (2/2)
- For scarce data: use a prior.
  P(\theta \mid n) = \frac{P(n \mid \theta)\,P(\theta)}{P(n)}
  With P(n \mid \theta) = \frac{1}{M(n)} \prod_{i=1}^{K} \theta_i^{n_i} and the Dirichlet prior P(\theta) = \frac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}:
  P(\theta \mid n) = \frac{1}{M(n)\,Z(\alpha)\,P(n)} \prod_{i=1}^{K} \theta_i^{n_i + \alpha_i - 1} = D(\theta \mid n + \alpha)
- Posterior mean estimator with pseudocounts \alpha\ (A = \sum_i \alpha_i):
  \theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}
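The pseudocount estimator above is one line per symbol; a minimal sketch comparing it with the ML estimate on a small count vector (the counts are made up for illustration):

```python
def theta_ml(counts):
    """Maximum-likelihood estimate: n_i / N."""
    n = sum(counts)
    return [c / n for c in counts]

def theta_pme(counts, alphas):
    """Posterior mean under a Dirichlet prior: (n_i + a_i) / (N + A)."""
    n, a = sum(counts), sum(alphas)
    return [(c + al) / (n + a) for c, al in zip(counts, alphas)]

counts = [3, 0, 1, 0]              # scarce data over A, C, G, T
print(theta_ml(counts))            # ML assigns probability 0 to unseen symbols
print(theta_pme(counts, [1] * 4))  # pseudocounts keep every symbol possible
```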
Mixtures of Dirichlets
- A mixture prior and its posterior:
  P(\theta \mid \alpha^1, \ldots, \alpha^m) = \sum_k q_k D(\theta \mid \alpha^k), \quad q_k = P(\alpha^k)
  P(\theta \mid n) = \sum_k P(\alpha^k \mid n)\,P(\theta \mid n, \alpha^k) = \sum_k P(\alpha^k \mid n)\,D(\theta \mid n + \alpha^k)
  \theta_i^{PME} = \sum_k P(\alpha^k \mid n)\,\frac{n_i + \alpha_i^k}{N + A^k}
EM algorithm (1/3)
- A general algorithm for ML estimation with missing data.
- The Baum-Welch algorithm for estimating hidden Markov model probabilities is a special case of the EM algorithm.
- Maximize:
  \log P(x \mid \theta) = \log \sum_y P(x, y \mid \theta)
  x: observation, y: missing data
EM algorithm (2/3)
  \log P(x \mid \theta) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)
                        = Q(\theta \mid \theta^t) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)
  Thus
  \log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_y P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}
                                                  \ge Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)
EM algorithm (3/3)
- Choosing
  \theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)
  always makes the difference positive, and thus the likelihood of the new model is at least as large as the likelihood of the old one.
- EM algorithm:
  (E-step) Calculate the Q function.
  (M-step) Maximize Q(\theta \mid \theta^t) wrt. \theta.
- The likelihood increases in each iteration.
- Algorithms that merely increase Q in the M-step, instead of maximizing it, are called generalized EM (GEM).
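As an illustration of the E- and M-steps, a minimal sketch of EM for a mixture of two biased coins (the data, initialization, and equal mixing weights are our assumptions; the missing variable y is which coin produced each group of tosses):

```python
from math import comb, log

def binom_lik(h, n, t):
    """Likelihood of h heads in n tosses for a coin with bias t."""
    return comb(n, h) * t**h * (1 - t)**(n - h)

def loglik(data, n, tA, tB):
    """Log-likelihood of the mixture with equal mixing weights."""
    return sum(log(0.5 * binom_lik(h, n, tA) + 0.5 * binom_lik(h, n, tB))
               for h in data)

def em_step(data, n, tA, tB):
    """One E-step + M-step; returns updated coin biases."""
    hA = nA = hB = nB = 0.0
    for h in data:
        la, lb = binom_lik(h, n, tA), binom_lik(h, n, tB)
        w = la / (la + lb)            # E-step: responsibility of coin A
        hA += w * h;       nA += w * n
        hB += (1 - w) * h; nB += (1 - w) * n
    return hA / nA, hB / nB           # M-step: weighted head frequencies

data, n = [5, 9, 8, 4, 7], 10         # heads out of 10 tosses per group
tA, tB = 0.6, 0.5
for _ in range(20):
    tA, tB = em_step(data, n, tA, tB)
print(round(tA, 2), round(tB, 2))     # the two coin biases separate
```

Each iteration can only raise the log-likelihood, which is exactly the guarantee derived on the previous slides.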