Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we...
Transcript of Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we...
![Page 1: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/1.jpg)
Copyright (c) 2002 by SNU CSE Biointelligence Lab1
Biological Sequence AnalysisBiological Sequence Analysis(Ch 11. Background on probability)(Ch 11. Background on probability)
Biointelligence LaboratorySchool of Computer Sci. & Eng.
Seoul National UniversitySeoul 151-742, Korea
This slide file is available online athttp://bi.snu.ac.kr/
![Page 2: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/2.jpg)
2Copyright (c) 2002 by SNU CSE Biointelligence Lab
List of ContentsList of Contents
4 Introduction4 Random variables4 Probability density, distributions4 Transformation4 Plots of statistical distributions4 Relative entropy and mutual information4 Many random variables4 One DNA sequence4 Maximum likelihood4 Sampling4 Metropolis sampling4 EM algorithm
![Page 3: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/3.jpg)
3Copyright (c) 2002 by SNU CSE Biointelligence Lab
IntroductionIntroduction
4 Sequence similarity test: g g a g a c t g t a g a c a g c t a a t g c t a t ag a a c g c c c t a g c c a c g a g c c c t t a t cP (more than 10 matches) = ? (0.04)
4 Parameter, data, hypothesis, random variable
![Page 4: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/4.jpg)
4Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence One DNA Sequence
4 Shotgun sequencing: Find long DNA sequence by many overlapping short sequences (500 bases).
![Page 5: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/5.jpg)
5Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (2)One DNA Sequence (2)
4 1. What is the mean proportion of the genome covered by contigs?
4 2. What is the mean number of contigs?4 3. What is the mean contig size?
![Page 6: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/6.jpg)
6Copyright (c) 2002 by SNU CSE Biointelligence Lab
((Discrete) Random VariablesDiscrete) Random Variables
4 A discrete numerical quantity corresponding to an observed outcome of experiment
4 (E.g.) experiment: rolling two six-sided diceoutcome: the two numbers on the dicediscrete random variables: the sum of the two numbers, the difference of the two numbers, the smaller number of the two,
4 Random variables are usually represented by uppercase symbols (X,Y, …) and the realized values of the random variables are represented by lowercase symbols (x,y,…)
![Page 7: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/7.jpg)
7Copyright (c) 2002 by SNU CSE Biointelligence Lab
Probability DistributionsProbability Distributions
4 For a finite set X : the probability distribution is simply an assignment of a probability px to each outcome x in X. (e.g.) the probability of outcomes of rolling a dice {1/12, 1/12, 1/12, 1/6, 1/4, 1/3}
4 For a continuous set X : the probability density f(x)xxfxxxxxp δδδ )()2/2/( =+≤≤−
∫
∫
∞
∞−=
≥
=≤≤
1)(
0)(
)()( 1
010
dxxf
xf
dxxfxxxpx
x
Note: f(x) can be greater than 0
![Page 8: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/8.jpg)
8Copyright (c) 2002 by SNU CSE Biointelligence Lab
Probability density, distribution (1/2)Probability density, distribution (1/2)
4 Relation of distribution and density
4 Expectation (mean)
4 For continuous X:
==≤=
∫∑
≤
≤
tx
tx
dxxf
xXPtXPtF
)(
)()()(
==
∫∑
dxxxf
xXxPXE x
)(
)()(
1)(,0)( >== becanxfxXP
![Page 9: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/9.jpg)
9Copyright (c) 2002 by SNU CSE Biointelligence Lab
Probability density, distribution (2/2)Probability density, distribution (2/2)
4 Relation of distribution and density
4 Conditions for the probability density:
4 Variance:
xt
xtxt
tFdtdxf
tXPtXPxXP
=
<≤
=
=−=== ∑∑
)()(
)()()(
1)(.2
0)(.1
=
≥
∫ dxxf
xf
222)( µµ −=− EXXE
![Page 10: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/10.jpg)
10Copyright (c) 2002 by SNU CSE Biointelligence Lab
TransformationTransformation
4 One random variable:
4 Many random variable: For 1-1 relations between X and U,
}){()(
))(())(())(()()(
)(
1112
11
11122
12
′=
=<=<=<=
=
−−
−−
ggfxf
tgFtgXPtXgPtXPtF
gmonotoneforXgX
)),,(,),,,((
),,(
111
1
nnn
n
XXUXXUU
XXX
LLLr
Lr
=
=
*1
111 ),,(),,(),,( JxxfJxxfuuf nXnXnU LLL == −
![Page 11: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/11.jpg)
11Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical IndependenceStatistical Independence
4 Two events are independent if the outcome of one event does not affect the outcome of the other event
4 Discrete random variables are independent if the value of one does not affect the probabilities associated with the values of another random variable
4 (E.g.) Experiment: rolling a fair die1. A: the number is even, B: the number is greater than or equal to 3.P (A | B) = P (A) , P (B|A) = P (B) : A and B are independent2. C: the number is greater than 3P (C) = 1/2 , P (C | A) = 2/3, P (A | C) = 1/3, P (A) = 1/2 :A and C are not independent
![Page 12: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/12.jpg)
12Copyright (c) 2002 by SNU CSE Biointelligence Lab
Uniform DistributionUniform Distribution
4 Discrete uniform : each outcome can occur equally likely.
4 For N outcomes: P (x) = 1/N 4 Continuous uniform: f(x)=1/(b-a) or 1/area(A)
![Page 13: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/13.jpg)
13Copyright (c) 2002 by SNU CSE Biointelligence Lab
Bernoulli DistributionBernoulli Distribution
4 Bernoulli trial is a single trial with two possible outcomes (success / failure)
4 Bernoulli random variable Y: number of successes in this trial
1,0,)1()( 1 =−= − ypptrialainsuccesesyp yy
![Page 14: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/14.jpg)
14Copyright (c) 2002 by SNU CSE Biointelligence Lab
Binomial DistributionBinomial Distribution
4 Defined on a finite set of all the possible results of N trials with a binary outcome ( ‘0’ or ‘1’).
4 The random variable is the number of success in the fixed N trials.
)1()1()()()(
)1()(
)1()(
1
222
1
pNpppkN
mkkpmk
NpppkN
kkpkm
ppkN
Nofoutsucceseskp
kNkN
k
kNkN
k
kNk
−=−
−=−=
=−
==
−
=
−
=
−
=
−
∑∑
∑ ∑
σ
![Page 15: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/15.jpg)
15Copyright (c) 2002 by SNU CSE Biointelligence Lab
Multinomial DistributionMultinomial Distribution
4 Extend the binomial to K independent outcomes with probabilities
(e.g.) Rolling a fair dice N times:P (rolling 12 times and getting each number twice) = 12! / 2!6 (1/6)12 = 3.4 x 10 -3
),,1(, Kii L=θ
∏=
==
K
i
ni
Ki
i
nnn
Kinp11
)|,,1,( θθL
L
![Page 16: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/16.jpg)
16Copyright (c) 2002 by SNU CSE Biointelligence Lab
Geometric DistributionGeometric Distribution
4 Random variable is the number of trials before the first failure (the length of a success run)
4 Test of the significance of the long run in the sequences
4 Geometric-like random variables in BLAST theory:
11)()(
),2,1,0(,)1()(+−=≤=
=−=y
y
pyYPyF
yppyP L
)10(,~)()1(1 <<≥=−− CCpyYPyF y
![Page 17: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/17.jpg)
17Copyright (c) 2002 by SNU CSE Biointelligence Lab
Negative Binomial DistributionNegative Binomial Distribution
4 Random variable is the number of trials in the fixednumber of successes
4 P (N=n) = P(the first (n-1) trials result in exactly (m-1) successes and (n-m) failures and trial n results in success)
),2,1,(,)1(11
)( L++=−
−−
= − mmmnppmn
nP mnm
![Page 18: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/18.jpg)
18Copyright (c) 2002 by SNU CSE Biointelligence Lab
Generalized Geometric DistributionGeneralized Geometric Distribution
4 Random variable is the number of trials before the (k+1)-th failure.
),2,1,(,)1()( 1 L++=−
= +− kkkypp
ky
yP kky
![Page 19: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/19.jpg)
19Copyright (c) 2002 by SNU CSE Biointelligence Lab
Poisson DistributionPoisson Distribution
4 A limiting form of the binomial distribution (as nbecomes large, p becomes smaller) with moderate λ=np
),2,1,0(,!
)( L==−
yy
eyPyλλ
![Page 20: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/20.jpg)
20Copyright (c) 2002 by SNU CSE Biointelligence Lab
Exponential DistributionExponential Distribution
4 Exponential distribution is a continuous analogue of the geometric distribution
)()(301.02log)(
/1)(/1)(
0,1)(
0,)(
2
skewedrightXEXMed
XVarXE
xexF
xexfx
x
≈=
=
=
≥−=
≥=−
−
λ
λ
λ
λλ
λ
![Page 21: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/21.jpg)
21Copyright (c) 2002 by SNU CSE Biointelligence Lab
Relation with Geometric DistributionRelation with Geometric Distribution
4 Suppose X has an exponential distribution and Y is the integer part of X
4 The density function of the fractal part D = X-Y:
)()1()1()1()(
λ
λλ
−
−−
=
−=−=+<≤==
epppeeyXyPyYP yy
λ
λλ−
−
−=
eedf
d
1)(
![Page 22: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/22.jpg)
22Copyright (c) 2002 by SNU CSE Biointelligence Lab
GaussianGaussian DistributionDistribution
4 As N gets larger, mean and variance of the binomial distribution increase linearly rescale k by
then the binomial becomes a Gaussian with the density as
)1( pNpNpkmku−
−=
−=
σ
)2/exp(21)( 2uuf −=π
![Page 23: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/23.jpg)
23Copyright (c) 2002 by SNU CSE Biointelligence Lab
![Page 24: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/24.jpg)
24Copyright (c) 2002 by SNU CSE Biointelligence Lab
Dirichlet Dirichlet Distribution (1/3)Distribution (1/3)
4 The Dirichlet distribution is a conjugate distribution for the multinomial distribution. (The Dirichlet prior wrt. multinomial distribution gives Dirichlet posterior again.)
4 `s corresponds to the parameters of the multinomial distribution.
))()1((,)()(
)1()(
)10()1()()|,(
1 1
1
1 1
111
xxxdZ
ZD
K
i i i
i iK
iii
i
K
i
K
iiiK
i
i
Γ=+ΓΓ
Γ=−=
≤≤−=
∫∏ ∑∏∑
∏ ∑
= =
−
= =
−−
αα
θθδθα
θθδθααθθ
α
αL
iθ
![Page 25: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/25.jpg)
25Copyright (c) 2002 by SNU CSE Biointelligence Lab
DirichletDirichlet Distribution (2/3)Distribution (2/3)
4 The mean of Dirichlet is equal to the normalized parameters.
4 (e.g.) The three distributions all have the same mean (1/8, 2/8, 5/8) though with different values of α(1,2,5); (10,20,50); (.1, .2, .5)
4 For K=2, Dirichlet reduces to the Beta distribution
∑=
k k
iiE
ααθ ][
![Page 26: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/26.jpg)
26Copyright (c) 2002 by SNU CSE Biointelligence Lab
DirichletDirichlet Distribution (3/3)Distribution (3/3)
4 (e.g.) The dice factory: sampling fromDirichlet parameterized by 4 Factory A: all six parameters set to 10Factory B: all six parameters set to 24 On average both factories produce fair dice (average = 1/6)But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it isMore likely from factory B:4 The variance of Dirichlet is inversely proportional to the sum of parameters.
61 ,, θθ L
61 ,, αα L
6.199)5(.)1(.)2()12()|(
119.0)5(.)1(.)10()60()|(
12)12(56
110)110(56
=ΓΓ
=
=ΓΓ
=
−−
−−
B
A
D
D
αθ
αθ
![Page 27: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/27.jpg)
27Copyright (c) 2002 by SNU CSE Biointelligence Lab
Gamma Distribution (1/2)Gamma Distribution (1/2)
4 ),,0()(
),,(1
∞<<Γ
=−−
βααββα
ααβ
xxexgx
2/var/βα
βα
=
=mean
![Page 28: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/28.jpg)
28Copyright (c) 2002 by SNU CSE Biointelligence Lab
Gamma Distribution (2/2)Gamma Distribution (2/2)
4 The gamma distribution is conjugate to the Poisson: the probability of seeing n events over some interval when there is a probability p of an individual event occurring in that interval.
4 The gamma distribution is used to model the probabilities of rate.
4 The gamma distribution is used to model the rate of evolution at different sites in DNA sequences.
!)(
npenf
np−
=
![Page 29: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/29.jpg)
29Copyright (c) 2002 by SNU CSE Biointelligence Lab
Extreme Value Distribution (1/2)Extreme Value Distribution (1/2)
4 For N samples from g(x), the probability that the largest of them are less than x is G(x) N
where 4 Extreme value density (EVD) for g(x) is the limit of
h(x)4 EVD is used to model the breaking point of a chain,
to assessing the significance of the maximum score from a set of alignments
∫ ∞−=
xduugxG )()( 1)()()( −= NxGxNgxh
![Page 30: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/30.jpg)
30Copyright (c) 2002 by SNU CSE Biointelligence Lab
Extreme Value Distribution (2/2)Extreme Value Distribution (2/2)
4 EVD for exponential density
(Gumbel distribution) 4 (e.g.) For N=1,2,10,100,
N >= 10 gives a good approx. to the EVD
xexg αα −=)(
)()exp()/1()( ,1 yxzeeNeexh zzNzx −=−→−= −−−−− αααα ααα
![Page 31: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/31.jpg)
31Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical Distribution (1/7) Statistical Distribution (1/7)
4 Binomial Distribution
P = 0.5, n =30
P = 0.8, n =30
![Page 32: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/32.jpg)
32Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical Distribution (2/7) Statistical Distribution (2/7)
4 Poisson Distribution
![Page 33: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/33.jpg)
33Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical Distribution (3/7) Statistical Distribution (3/7)
4 Negative-Binomial Distribution
![Page 34: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/34.jpg)
34Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical Distribution (4/7) Statistical Distribution (4/7)
4 Geometric Distribution
![Page 35: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/35.jpg)
35Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical Distribution (5/7) Statistical Distribution (5/7)
4 Exponential Distribution
![Page 36: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/36.jpg)
36Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical Distribution (6/7) Statistical Distribution (6/7)
4 Normal Distribution
![Page 37: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/37.jpg)
37Copyright (c) 2002 by SNU CSE Biointelligence Lab
Statistical Distribution (7/7) Statistical Distribution (7/7)
4 Lognormal Distribution
![Page 38: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/38.jpg)
38Copyright (c) 2002 by SNU CSE Biointelligence Lab
Entropy (1/2)Entropy (1/2)
4 A measure of average uncertainty of an outcome4 Shannon entropy:
4 Entropy is maximized when all the P(xi) are equal and the maximum is then log K . If we are certain of the outcome, then the entropy is zero.
4 Information:
∑−=i
ii xPxPXH )(log)()(
afterbefore HHXI −=)(
![Page 39: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/39.jpg)
39Copyright (c) 2002 by SNU CSE Biointelligence Lab
Entropy (2/2)Entropy (2/2)
4 (E.g.) Entropy of equi-probable DNA symbol (A,C,G,T) is 2 bits.
4 Information content of a conserved position: Aparticular position is always an A or a G with p(A)=0.7 and p(G)=0.3. Thus (The more conserved the position, the higher the information content.)
bitsHH afterbefore 12.188.2 =−=−
![Page 40: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/40.jpg)
40Copyright (c) 2002 by SNU CSE Biointelligence Lab
Relative entropy and mutual information (1/4)Relative entropy and mutual information (1/4)
4 Relative entropy (KL divergence) P wrt. Q:
4 Information content and relative entropy are the same if Q represents the initial state
4 ( Not a metric )
∑=i i
ii xQ
xPxPQPH)()(log)()||(
)||()||( PQHQPH ≠
![Page 41: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/41.jpg)
41Copyright (c) 2002 by SNU CSE Biointelligence Lab
Relative entropy and mutual information (2/4)Relative entropy and mutual information (2/4)
4 Positive of relative entropy
0)1)(/)()((
))(/)(log()()||(1)log(
=−≤
=−
−≤
∑∑
i iii
i iii
xPxQxP
xPxQxPQPHxx
![Page 42: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/42.jpg)
42Copyright (c) 2002 by SNU CSE Biointelligence Lab
Relative entropy and mutual information (3/4)Relative entropy and mutual information (3/4)
4 Independency can be measured by the relative entropy between P(X,Y) and P(X)P(Y).
4 M (X, Y) can be interpreted as the amount of information that we acquire about outcome X when we are told outcome Y.
∑=ji ji
jiji yPxP
yxPyxPYXM
, )()(),(
log),();(
![Page 43: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/43.jpg)
43Copyright (c) 2002 by SNU CSE Biointelligence Lab
Relative entropy and mutual information (4/4)Relative entropy and mutual information (4/4)
4 (E.g.) Acceptor sites: 757 acceptor sites from a database with human genes. 30 bases upstream, 20 bases down stream are extracted from each acceptor.
4 Relative entropy:
4 Mutual information:
.
]/)(log[)(
sequencestheinsnucleotidefourtheof
ondistributioveralltheisqwhere
qapap
a
a aii∑
∑ +ba iiii bpapbapbap, 1 )]()(/),(log[),(
![Page 44: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/44.jpg)
44Copyright (c) 2002 by SNU CSE Biointelligence Lab
![Page 45: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/45.jpg)
45Copyright (c) 2002 by SNU CSE Biointelligence Lab
Many random variables (1/2)Many random variables (1/2)
4 Marginal probability
4 Conditional probability∫ ∫
∑
+=
=====+
kiki
yykkii
dxdxxxfxxf
yYyYPyYyYPki
LLLL
LLL
111
,,1111
),,(),,(
),,(),,(1
),,(),,(),,|,,(
),,(),,(
),,|,,(
1
111
11
11
1111
i
kiki
ii
kk
iikkii
xxfxxfxxxxf
yYyYPyYyYP
yYyYyYyYP
L
LLL
L
L
LL
=
====
=
====
+
++
![Page 46: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/46.jpg)
46Copyright (c) 2002 by SNU CSE Biointelligence Lab
Many random variables (2/2)Many random variables (2/2)
4 Covariance
4 Correlation: ∫∫
∑
−−=
==−−=
21212211
,22112211
),())((
),())((
21
21
21
dxdxxxfxx
yYyYPyy
XX
yyYY
µµσ
µµσ
2112
21
σσσ
ρ YY=
![Page 47: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/47.jpg)
47Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (1/10)One DNA Sequence (1/10)
4 Shotgun sequencing: Find long DNA sequence by many overlapping short sequences.
![Page 48: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/48.jpg)
48Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (2/10)One DNA Sequence (2/10)
4 1. What is the mean proportion of the genome covered by contigs?
4 2. What is the mean number of contigs?4 3. What is the mean contig size?
![Page 49: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/49.jpg)
49Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (3/10)One DNA Sequence (3/10)
4 There are N fragments, each of length L, the whole DNA is of length G.
4 The position of the left hand end of any fragment is uniformly distributed in (0, G)
4 The number of fragments falling in the interval length h becomes Np=Nh/G from binomial distribution.
4 For large N and small interval h, this distribution becomes Poisson with mean a = Nh/G
![Page 50: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/50.jpg)
50Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (4/10)One DNA Sequence (4/10)
4 Mean proportion of genome covered by one or more fragment = P (a point chosen at random is covered at least by one fragment) = P (the left end of at least one fragment is in the interval L starting from that point ), (by poisson) = )/(,11 / GNLaee aGNL =−=− −−
![Page 51: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/51.jpg)
51Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (5/10)One DNA Sequence (5/10)
4 The length of fragments to cover the whole genome:P = .99 a = NL/G = 4.6 (a : coverage)P = .999 a = 6.9 (about 7 times the genome length)
4 Human genome is 3x10^9 nucleotides : 3,000,000 of them are missing with P=.999
4 What is the mean number of contigs ? success number: a unique right most fragmenttrial number: Np = P (a fragment is the right-most member of a contig)
= P (no other fragment has its left-end point on the fragment in question)
= e-a
4 Mean number of contigs GNLa NeNe /−− =
![Page 52: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/52.jpg)
52Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (6/10)One DNA Sequence (6/10)
4 For G = 100,000 L = 500 :
4 If there is a small number of fragments , there must be a small number of contigs
4 A large number of fragments tend to form a small number of large contigs.
60.7 70.8 73.6 66.9 54.1 29.9 14.7 6.7 3.0 1.3Mean number of contigs
0.5 0.75 1 1.5 2 3 4 5 6 7a
![Page 53: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/53.jpg)
53Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (7/10)One DNA Sequence (7/10)
4 Mean contig size ?
4 Expectation of a sum of random variables where Nis random
4 Moment generating function of a random variable4 Probability generating function of a random variable4 Geometric distribution4 Exponential distribution
![Page 54: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/54.jpg)
54Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (8/10)One DNA Sequence (8/10)
4 Moment generating function (mgf) of a random variable Y:
4 Finding mean and variance from mgf.:
==∫
∑dyyfe
yPeeEm
yy
y
Y
)(
)()()(
θ
θ
θθ
02
22
0
2
02
22
0
)(log,)(log
)(,)(
==
==
=
=
−
=
=
θθ
θθ
θθσ
θθµ
µθθσ
θθµ
dmd
dmd
or
dmd
ddm
![Page 55: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/55.jpg)
55Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (9/10)One DNA Sequence (9/10)
4 Probability generating function (pgf) of a (discrete) random variable Y:
−+
=
=
==
=
=
∑
2
12
22
1
)(
)(
)()()(
µµσ
µ
t
t
y
yY
tpdtd
tpdtd
tyPtEtp
![Page 56: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/56.jpg)
56Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (9/10)One DNA Sequence (9/10)
4 Expectation of a sum of random variables when N is random :
NXXS ++= L1
)()()(
)),((:
))((.
))((.)(
))((:|
)(:
)|()(
XENESErulechainbyand
tqpSofpgfthus
tqpintofcoeff
tqPintofcoeffySP
tqnSofpgf
tPtpNofpgf
nySPPySP
yn
nn
y
nn
nn
nn
=
=
==
=
===
∑
∑
∑
![Page 57: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/57.jpg)
57Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (10/10)One DNA Sequence (10/10)
4 Contig length = sum of n successful left end segments of fragments overlapped in the contig + L
4 The distance between two successful left-hand points ~ Exp (λ) ,λ=N/G (known from Poisson process)
4 The number of overlapping fragments ~ Geo (p)
4 Mean of the random distance (any left-left end)~ conditional exponential (0 < X < L)
∫ −− −==L
ax edxep0
1λλ
11)0|(
1)(
)(
)()0|(0
−−=<<
−=
=<<
−
−
∫
L
L
x
L
eLLXXE
eexf
dxxf
xfLxxf
λ
λ
λ
λ
λ
![Page 58: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/58.jpg)
58Copyright (c) 2002 by SNU CSE Biointelligence Lab
One DNA Sequence (11/10)One DNA Sequence (11/10)
4 Mean Contig length:
4 (e.g.) For L = 500, a = 2 ,G = 100,000 : the mean contig size = 1600
Le
Le
LLXXENE
aa +
−−−=
+<<
11)1(
)0|()(
λ
![Page 59: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/59.jpg)
59Copyright (c) 2002 by SNU CSE Biointelligence Lab
Maximum likelihood Maximum likelihood
4 Consistent : parameter value used to generate the data set will also be the value that maximizes the likelihood in the limit.
4 (E.g.) For K observable outcomes of the model, the frequency of occurrence w will tend to
and log-likelihood for parameter θ : tends toBy the positivity of the relevant entropy implies that for all θ
Thus the likelihood is maximized by θ0
4 Drawbacks: can give poor estimate for scanty data
),|(maxarg MDPML θθθ
=
),|( 0 MP θω),|(log)/( MPnn ii ki θω∑ ∑
),|(log),|( 0 MPMP ii i θωθω∑
),|(log),|(),|(log),|( 000 MPMPMPMP ii iii i θωθωθωθω ∑∑ ≥
![Page 60: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/60.jpg)
60Copyright (c) 2002 by SNU CSE Biointelligence Lab
Posterior probability distributionPosterior probability distribution
4 Bayes theorem
4 Use of posteriors: MAP (maximum a posteriori probability) estimate
4 PME (posterior mean estimator) : parameters weighted by the posterior
)|()|(),|(),|(
MDPMPMDPMDP θθθ =
)|(),|(maxarg MPMDPMAP θθθθ
=
∫= θθθθ dnPPME )|(
![Page 61: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/61.jpg)
61Copyright (c) 2002 by SNU CSE Biointelligence Lab
Change of variablesChange of variables
4 Given a density of x: f(x) and a change of variablex= φ(y) , the density of y: g(y) becomes
4 The maximum of g(y) , MAP or PME may shift from that of f(x) .
|)(|)}({)( yyfyg φφ ′=
![Page 62: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/62.jpg)
62Copyright (c) 2002 by SNU CSE Biointelligence Lab
SamplingSampling
4 Given a finite set with probabilities P(x), to sample from this set means to pick elements x randomly with probability P(x) .
4 Sampling by random number generator in [0 1]:
)(
)}()()(]1,0[
)()({)(
11
11
i
ii
ii
xP
xpxpxprand
xpxpPxselectingP
=
+++<<
++=
−
−
L
L
![Page 63: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/63.jpg)
63Copyright (c) 2002 by SNU CSE Biointelligence Lab
Sampling by transformation from a Sampling by transformation from a uniform distributionuniform distribution
4 For a uniform density f(x) and a map x = φ(y),
4 (E.g.) Sampling from a Gaussian:
)(
)()(),()())(()(1 xy
duugyyyyfygy
b−=
=′=′= ∫φ
φφφφ
)(
2/)(
:.),(
1
2/2
xyletAnd
dueyDefine
yFindgivenisxygy u
−
∞−
−
=
= ∫φ
πφ
![Page 64: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/64.jpg)
64Copyright (c) 2002 by SNU CSE Biointelligence Lab
Sampling with the Metropolis algorithm (1/3)Sampling with the Metropolis algorithm (1/3)
4 For sampling when the analytic methods are not available
4 Generate a sequence {yi} to approximate P as close as we like from under a condition:
(detailed balance)4 Detailed balance implies:
4 The sequence under this transition process will sample P correctly
)|( 1−iyyτ
)|()()|()( yxyPxyxP ττ =
)()(#1lim xPxyN iN
==∞→
![Page 65: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/65.jpg)
65Copyright (c) 2002 by SNU CSE Biointelligence Lab
Sampling with the Metropolis algorithm (2/3)Sampling with the Metropolis algorithm (2/3)
Metropolis algorithm:4 Symmetric proposal mechanism: Given a point x, this
selects a point y with probability F(y|x) and symmetric4 Acceptance mechanism : accepts proposed y with
probability min (1,P(y)/P(x))(A point y with larger posterior probability than the current x is always accepted, and one with lower probability is accepted randomly with probability P(y)/P(x))
![Page 66: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/66.jpg)
66Copyright (c) 2002 by SNU CSE Biointelligence Lab
Sampling with the Metropolis algorithm (3/3)Sampling with the Metropolis algorithm (3/3)
Metropolis algorithm (balance):
)|()())(),(min()|())(),(min()|(
))(/)(,1min()|()()|()(
yxyPxPyPyxFyPxPxyF
xPyPxyFxPxyxP
τ
τ
====
![Page 67: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/67.jpg)
67Copyright (c) 2002 by SNU CSE Biointelligence Lab
Gibbs samplingGibbs sampling
4 Sampling from conditional distributions for each i.
4 Proposal distribution is the conditional distribution:Always accept the sample.
),,,,,|( 111 Niii xxxxxP LL +−
![Page 68: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/68.jpg)
68Copyright (c) 2002 by SNU CSE Biointelligence Lab
Estimation of probabilities from counts (1/2)Estimation of probabilities from counts (1/2)
4
0log
log
)(log
)|()|(log
:)|()|(
>=
=
=
≠>
∑
∑
∏∏
i
MLi
i
MLi
i
MLi
ii
nii
nMLii
ML
MLML
N
n
nPnP
nnobservatioanandanyfornPnP
i
i
θθθ
θθ
θ
θθθ
θθθθ
NniMLi /=θ
![Page 69: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/69.jpg)
69Copyright (c) 2002 by SNU CSE Biointelligence Lab
Estimation of probabilities from counts (2/2)Estimation of probabilities from counts (2/2)
4 For scarce data: use prior
)|()|(
)|()()()(
)()()()(
1)(
)|()|()|(
)(1)|(
)()|(
1
1
1
1
1
αθθ
αθα
α
θα
αθθθ
θα
αθ
θθ
α
+=
++
=
=
=
=
=
∏
∏
∏
−+
=
−
=
−
nDnP
nDZnMnP
nZZnMnP
nPDnPnP
ZD
nMnP
i
ni
K
i
ni
K
i
ni
ii
i
i
countspseudoAN
n
i
iiPMEi
:α
αθ++
=
![Page 70: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/70.jpg)
70Copyright (c) 2002 by SNU CSE Biointelligence Lab
Mixtures of Mixtures of DirichletsDirichlets
)(,)|(),,|( 1 kk
k
kk
m PqDqP ααθααθ ==∑L
ANnnP
nDnP
nPnPnP
kiik
k
PMEi
k
k
kk
kk
++
=
+=
=
∑
∑
∑
ααθ
αθα
ααθθ
)|(
)|()|(
)|(),|()|(
![Page 71: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/71.jpg)
71Copyright (c) 2002 by SNU CSE Biointelligence Lab
EM algorithm (1/3)EM algorithm (1/3)
4 A general algorithm for ML estimation with missing data
4 Baum-Welch algorithm for estimating hidden Markov model probabilities is a special case of the EM algorithm.
dataingmissynobservatiox
yxPxPMaximize
y
::
)|,(log)|(log:
∑= θθ
![Page 72: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/72.jpg)
72Copyright (c) 2002 by SNU CSE Biointelligence Lab
EM algorithm (2/3)EM algorithm (2/3)
)|()|()|(log)|(,
),|(),|(log),|()|()|(
)|(log)|(
)|,(log),|()|
|(log),|()|,(log),|(),|(log)|,(log)|(
tttt
t
y
tttt
t
y
tt
y
t
y
t
QQxPxP
xyPxyPxyPQQ
xPxP
yxPxyP
xyPxyPyxPxyPxyPyxPxP
θθθθθθ
θθθθθθθ
θθ
θθθθ
θθθ
θθθ
−≥−
+−
=−
=
−=
−=
∑
∑
∑∑
log
log
(
),log
Thus
Q
θ
![Page 73: Biological Sequence Analysis4 On average both factories produce fair dice (average = 1/6) But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it is More likely from factory B: 4](https://reader034.fdocuments.in/reader034/viewer/2022042806/5f7493dbd7e2b2087a0c6aeb/html5/thumbnails/73.jpg)
73Copyright (c) 2002 by SNU CSE Biointelligence Lab
EM algorithm (3/3)EM algorithm (3/3)
4 Choosing will always makes the difference positive and thus the likelihood of the new model is larger than the likelihood of the old one.
4 EM Algorithm:(E-step) Calculate Q function(M-step) Maximize Q(θ | θ t) wrt. θ4 The likelihood increases in each iteration4 Instead of maximizing in the (M-step), algorithms
that increase Q are called generalized EM (GEM).
)|(maxarg1 tt Q θθθθ
=+