Copyright (c) 2002 by SNU CSE Biointelligence Lab
Biological Sequence Analysis (Ch 11. Background on probability)
Biointelligence Laboratory
School of Computer Sci. & Eng.
Seoul National University, Seoul 151-742, Korea
This slide file is available online at http://bi.snu.ac.kr/
List of Contents
- Introduction
- Random variables
- Probability density, distributions
- Transformation
- Plots of statistical distributions
- Relative entropy and mutual information
- Many random variables
- One DNA sequence
- Maximum likelihood
- Sampling
- Metropolis sampling
- EM algorithm
Introduction
- Sequence similarity test:
  g g a g a c t g t a g a c a g c t a a t g c t a t a
  g a a c g c c c t a g c c a c g a g c c c t t a t c
  P(more than 10 matches) = ? (0.04)
- Parameter, data, hypothesis, random variable
One DNA Sequence
- Shotgun sequencing: find a long DNA sequence from many overlapping short sequences (about 500 bases each).
One DNA Sequence (2)
- 1. What is the mean proportion of the genome covered by contigs?
- 2. What is the mean number of contigs?
- 3. What is the mean contig size?
(Discrete) Random Variables
- A discrete numerical quantity corresponding to an observed outcome of an experiment.
- (E.g.) Experiment: rolling two six-sided dice. Outcome: the two numbers on the dice. Discrete random variables: the sum of the two numbers, the difference of the two numbers, the smaller of the two numbers, etc.
- Random variables are usually represented by uppercase symbols (X, Y, ...) and the realized values of the random variables are represented by lowercase symbols (x, y, ...).
Probability Distributions
- For a finite set X: the probability distribution is simply an assignment of a probability p_x to each outcome x in X. (E.g.) the probabilities of the outcomes of rolling a loaded die: {1/12, 1/12, 1/12, 1/6, 1/4, 1/3}.
- For a continuous set X: the probability density f(x):
  p(x - \delta x/2 \le X \le x + \delta x/2) \approx f(x)\,\delta x
  p(x_0 \le X \le x_1) = \int_{x_0}^{x_1} f(x)\,dx
  f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1
  Note: f(x) can be greater than 1.
Probability density, distribution (1/2)
- Relation of distribution and density:
  F(t) = P(X \le t) = \sum_{x \le t} P(X = x) = \int_{x \le t} f(x)\,dx
- Expectation (mean):
  E(X) = \sum_x x\,P(X = x) = \int x f(x)\,dx
- For continuous X: P(X = x) = 0, and f(x) can be greater than 1.
Probability density, distribution (2/2)
- Relation of distribution and density:
  P(X = x) = \sum_{t \le x} P(X = t) - \sum_{t < x} P(X = t), \qquad f(t) = \frac{d}{dt} F(t)
- Conditions for the probability density:
  1. f(x) \ge 0
  2. \int f(x)\,dx = 1
- Variance:
  E(X - \mu)^2 = E X^2 - \mu^2
Transformation
- One random variable: X_2 = g(X_1) for monotone g,
  F_2(t) = P(X_2 < t) = P(g(X_1) < t) = P(X_1 < g^{-1}(t)) = F_1(g^{-1}(t))
  f_2(x) = f_1(g^{-1}(x))\,(g^{-1})'(x)
- Many random variables: for a 1-1 relation between \vec{X} = (X_1, \ldots, X_n) and \vec{U} = (U_1(X_1, \ldots, X_n), \ldots, U_n(X_1, \ldots, X_n)),
  f_U(u_1, \ldots, u_n) = f_X(x_1, \ldots, x_n)\,|J|^{-1} = f_X(x_1, \ldots, x_n)\,|J^*|
  where J is the Jacobian of the transformation.
Statistical Independence
- Two events are independent if the outcome of one event does not affect the outcome of the other event.
- Discrete random variables are independent if the value of one does not affect the probabilities associated with the values of another random variable.
- (E.g.) Experiment: rolling a fair die.
  1. A: the number is even; B: the number is greater than or equal to 3.
     P(A | B) = P(A), P(B | A) = P(B): A and B are independent.
  2. C: the number is greater than 3.
     P(C) = 1/2, P(C | A) = 2/3, P(A | C) = 2/3, P(A) = 1/2: A and C are not independent.
Uniform Distribution
- Discrete uniform: each outcome occurs with equal probability. For N outcomes: P(x) = 1/N.
- Continuous uniform: f(x) = 1/(b - a), or 1/area(A) in higher dimensions.
Bernoulli Distribution
- A Bernoulli trial is a single trial with two possible outcomes (success / failure).
- Bernoulli random variable Y: the number of successes in this trial.
  p(y \text{ successes in a trial}) = p^y (1-p)^{1-y}, \quad y = 0, 1
Binomial Distribution
- Defined on the finite set of all the possible results of N trials with a binary outcome ('0' or '1').
- The random variable is the number of successes in the fixed N trials.
  p(k \text{ successes out of } N) = \binom{N}{k} p^k (1-p)^{N-k}
  m = \sum_k k\,p(k) = \sum_k k \binom{N}{k} p^k (1-p)^{N-k} = Np
  \sigma^2 = \sum_k (k - m)^2 p(k) = \sum_k (k - m)^2 \binom{N}{k} p^k (1-p)^{N-k} = Np(1-p)
Multinomial Distribution
- Extend the binomial to K independent outcomes with probabilities \theta_i\ (i = 1, \ldots, K):
  p(n_1, \ldots, n_K \mid \theta) = \frac{N!}{n_1! \cdots n_K!} \prod_{i=1}^{K} \theta_i^{n_i}
- (E.g.) Rolling a fair die N times:
  P(rolling 12 times and getting each number twice) = 12! / (2!)^6 \cdot (1/6)^{12} \approx 3.4 \times 10^{-3}
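The dice example above can be checked numerically; a minimal sketch of the multinomial pmf in Python (the function name is ours, not from the slides):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Multinomial probability N!/(n_1!...n_K!) * prod(theta_i^n_i)."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

# Rolling a fair die 12 times and getting each number exactly twice:
p = multinomial_pmf([2] * 6, [1 / 6] * 6)
print(p)  # ~0.0034
```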
Geometric Distribution
- The random variable is the number of trials before the first failure (the length of a success run):
  P(y) = p^y (1-p), \quad y = 0, 1, 2, \ldots
  F(y) = P(Y \le y) = 1 - p^{y+1}
- Used to test the significance of long runs in sequences.
- Geometric-like random variables in BLAST theory:
  1 - F(y-1) = P(Y \ge y) \sim C p^y, \quad 0 < C < 1
Negative Binomial Distribution
- The random variable is the number of trials needed for a fixed number m of successes.
- P(N = n) = P(the first (n-1) trials result in exactly (m-1) successes and (n-m) failures, and trial n results in success):
  P(n) = \binom{n-1}{m-1} p^m (1-p)^{n-m}, \quad n = m, m+1, m+2, \ldots
Generalized Geometric Distribution
- The random variable is the number of trials before the (k+1)-th failure:
  P(y) = \binom{y}{k} p^{y-k} (1-p)^{k+1}, \quad y = k, k+1, k+2, \ldots
Poisson Distribution
- A limiting form of the binomial distribution (as n becomes large and p becomes small) with moderate \lambda = np:
  P(y) = e^{-\lambda} \frac{\lambda^y}{y!}, \quad y = 0, 1, 2, \ldots
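The limiting relation can be illustrated numerically; a small sketch comparing the two pmfs (the choice n = 1000, p = 0.002 is ours, giving λ = 2):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# large n, small p, moderate lambda = n*p = 2
n, p = 1000, 0.002
for k in range(6):
    print(k, round(binom_pmf(k, n, p), 5), round(poisson_pmf(k, n * p), 5))
# the two columns agree to ~3 decimal places
```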
Exponential Distribution
- The exponential distribution is a continuous analogue of the geometric distribution:
  f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
  F(x) = 1 - e^{-\lambda x}, \quad x \ge 0
  E(X) = 1/\lambda, \quad Var(X) = 1/\lambda^2
  Med(X) = (\log 2)/\lambda \approx 0.693\,E(X) < E(X) \quad (\text{right-skewed})
Relation with Geometric Distribution
- Suppose X has an exponential distribution and Y is the integer part of X:
  P(Y = y) = P(y \le X < y + 1) = e^{-\lambda y}(1 - e^{-\lambda}) = p^y (1 - p), \quad p = e^{-\lambda}
- The density function of the fractional part D = X - Y:
  f(d) = \frac{\lambda e^{-\lambda d}}{1 - e^{-\lambda}}, \quad 0 \le d < 1
Gaussian Distribution
- As N gets larger, the mean and variance of the binomial distribution increase linearly. Rescale k by
  u = \frac{k - m}{\sigma} = \frac{k - Np}{\sqrt{Np(1-p)}}
  then the binomial becomes a Gaussian with the density
  f(u) = \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)
Dirichlet Distribution (1/3)
- The Dirichlet distribution is a conjugate distribution for the multinomial distribution. (A Dirichlet prior wrt. the multinomial distribution gives a Dirichlet posterior again.)
- The \theta_i's correspond to the parameters of the multinomial distribution:
  D(\theta \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,\delta\!\left(\textstyle\sum_i \theta_i - 1\right), \quad 0 \le \theta_i \le 1
  Z = \int \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,\delta\!\left(\textstyle\sum_i \theta_i - 1\right) d\theta = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(\sum_i \alpha_i)}
  (using \Gamma(x + 1) = x\,\Gamma(x))
Dirichlet Distribution (2/3)
- The mean of the Dirichlet is the normalized parameter vector:
  E[\theta_i] = \frac{\alpha_i}{\sum_k \alpha_k}
- (E.g.) The three distributions with \alpha = (1,2,5); (10,20,50); (.1,.2,.5) all have the same mean (1/8, 2/8, 5/8) though with different values of \alpha.
- For K = 2, the Dirichlet reduces to the Beta distribution.
Dirichlet Distribution (3/3)
- (E.g.) The dice factory: sampling \theta_1, \ldots, \theta_6 from a Dirichlet parameterized by \alpha_1, \ldots, \alpha_6.
  Factory A: all six parameters set to 10. Factory B: all six parameters set to 2.
- On average both factories produce fair dice (average = 1/6), but if we find a loaded die with (.1 .1 .1 .1 .1 .6), it is more likely from factory B:
  D(\theta \mid \alpha_A) = \frac{\Gamma(60)}{\Gamma(10)^6}\,(.1)^{5(10-1)}\,(.6)^{10-1} = 0.119
  D(\theta \mid \alpha_B) = \frac{\Gamma(12)}{\Gamma(2)^6}\,(.1)^{5(2-1)}\,(.6)^{2-1} = 199.6
- The variance of the Dirichlet is inversely proportional to the sum of the parameters.
Gamma Distribution (1/2)
- Density:
  g(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha - 1} e^{-\beta x}, \quad 0 < x < \infty
  \text{mean} = \alpha/\beta, \qquad \text{var} = \alpha/\beta^2
Gamma Distribution (2/2)
- The gamma distribution is conjugate to the Poisson: the probability of seeing n events over some interval, when an individual event occurs at rate p in that interval, is
  f(n) = \frac{e^{-p} p^n}{n!}
- The gamma distribution is used to model probabilities of rates, e.g. the rate of evolution at different sites in DNA sequences.
Extreme Value Distribution (1/2)
- For N samples from g(x), the probability that the largest of them is less than x is G(x)^N, where
  G(x) = \int_{-\infty}^{x} g(u)\,du
- The extreme value density (EVD) for g(x) is the limit of
  h(x) = N g(x) G(x)^{N-1}
- The EVD is used to model the breaking point of a chain, and to assess the significance of the maximum score from a set of alignments.
Extreme Value Distribution (2/2)
- EVD for the exponential density g(x) = \alpha e^{-\alpha x} (Gumbel distribution):
  h(x) = N \alpha e^{-\alpha x} (1 - e^{-\alpha x})^{N-1} \to \alpha e^{-z} \exp(-e^{-z}), \quad z = \alpha(x - y)
- (E.g.) For N = 1, 2, 10, 100: N >= 10 gives a good approximation to the EVD.
Statistical Distribution (1/7)
- Binomial distribution (plots for p = 0.5, n = 30 and p = 0.8, n = 30)
Statistical Distribution (2/7)
- Poisson distribution (plot)
Statistical Distribution (3/7)
- Negative-binomial distribution (plot)
Statistical Distribution (4/7)
- Geometric distribution (plot)
Statistical Distribution (5/7)
- Exponential distribution (plot)
Statistical Distribution (6/7)
- Normal distribution (plot)
Statistical Distribution (7/7)
- Lognormal distribution (plot)
Entropy (1/2)
- A measure of the average uncertainty of an outcome.
- Shannon entropy:
  H(X) = -\sum_i P(x_i) \log P(x_i)
- Entropy is maximized when all the P(x_i) are equal, and the maximum is then \log K. If we are certain of the outcome, the entropy is zero.
- Information:
  I(X) = H_{before} - H_{after}
Entropy (2/2)
- (E.g.) The entropy of an equiprobable DNA symbol (A, C, G, T) is 2 bits.
- Information content of a conserved position: a particular position is always an A or a G, with p(A) = 0.7 and p(G) = 0.3. Thus
  I = H_{before} - H_{after} = 2 - 0.88 = 1.12 \text{ bits}
  (The more conserved the position, the higher the information content.)
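The numbers in this example can be reproduced directly; a minimal sketch:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

h_before = entropy([0.25] * 4)       # uniform over A, C, G, T -> 2 bits
h_after = entropy([0.7, 0.3])        # conserved position: A or G
print(round(h_before - h_after, 2))  # information content: 1.12 bits
```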
Relative entropy and mutual information (1/4)
- Relative entropy (KL divergence) of P with respect to Q:
  H(P \| Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}
- Information content and relative entropy are the same if Q represents the initial state.
- Not a metric:
  H(P \| Q) \ne H(Q \| P)
Relative entropy and mutual information (2/4)
- Positivity of relative entropy, using \log x \le x - 1:
  -H(P \| Q) = \sum_i P(x_i) \log \frac{Q(x_i)}{P(x_i)} \le \sum_i P(x_i) \left( \frac{Q(x_i)}{P(x_i)} - 1 \right) = 0
Relative entropy and mutual information (3/4)
- Independence can be measured by the relative entropy between P(X, Y) and P(X)P(Y):
  M(X; Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)}
- M(X; Y) can be interpreted as the amount of information that we acquire about outcome X when we are told outcome Y.
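The definition above reduces to a double sum over a joint table; a small sketch (the two joint distributions are made up for illustration):

```python
from math import log2

def mutual_information(joint):
    """M(X;Y) in bits from a joint probability table joint[i][j]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    m = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                m += pxy * log2(pxy / (px[i] * py[j]))
    return m

independent = [[0.25, 0.25], [0.25, 0.25]]  # P(x,y) = P(x)P(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y fully determined by X
print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```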
Relative entropy and mutual information (4/4)
- (E.g.) Acceptor sites: 757 acceptor sites from a database of human genes; 30 bases upstream and 20 bases downstream are extracted from each acceptor.
- Relative entropy at position i:
  \sum_a p_i(a) \log [p_i(a) / q_a]
  where q is the overall distribution of the four nucleotides in the sequences.
- Mutual information between adjacent positions:
  \sum_{a,b} p_{i,i+1}(a, b) \log [p_{i,i+1}(a, b) / (p_i(a)\,p_{i+1}(b))]
Many random variables (1/2)
- Marginal probability:
  P(Y_i = y_i, \ldots, Y_k = y_k) = \sum_{y_1, \ldots, y_{i-1}} P(Y_1 = y_1, \ldots, Y_k = y_k)
  f(x_i, \ldots, x_k) = \int \cdots \int f(x_1, \ldots, x_k)\,dx_1 \cdots dx_{i-1}
- Conditional probability:
  P(Y_{i+1} = y_{i+1}, \ldots, Y_k = y_k \mid Y_1 = y_1, \ldots, Y_i = y_i) = \frac{P(Y_1 = y_1, \ldots, Y_k = y_k)}{P(Y_1 = y_1, \ldots, Y_i = y_i)}
  f(x_{i+1}, \ldots, x_k \mid x_1, \ldots, x_i) = \frac{f(x_1, \ldots, x_k)}{f(x_1, \ldots, x_i)}
Many random variables (2/2)
- Covariance:
  \sigma_{12} = E[(Y_1 - \mu_1)(Y_2 - \mu_2)] = \sum_{y_1, y_2} (y_1 - \mu_1)(y_2 - \mu_2)\,P(Y_1 = y_1, Y_2 = y_2)
              = \int\!\!\int (x_1 - \mu_1)(x_2 - \mu_2)\,f(x_1, x_2)\,dx_1\,dx_2
- Correlation:
  \rho_{Y_1 Y_2} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}
One DNA Sequence (1/10)
- Shotgun sequencing: find a long DNA sequence from many overlapping short sequences.
One DNA Sequence (2/10)
- 1. What is the mean proportion of the genome covered by contigs?
- 2. What is the mean number of contigs?
- 3. What is the mean contig size?
One DNA Sequence (3/10)
- There are N fragments, each of length L; the whole DNA is of length G.
- The position of the left-hand end of any fragment is uniformly distributed in (0, G).
- The number of fragments falling in an interval of length h is binomial with mean Np = Nh/G.
- For large N and a small interval h, this distribution becomes Poisson with mean a = Nh/G.
One DNA Sequence (4/10)
- Mean proportion of the genome covered by one or more fragments
  = P(a point chosen at random is covered by at least one fragment)
  = P(the left end of at least one fragment is in the interval of length L starting from that point)
  = 1 - e^{-NL/G} = 1 - e^{-a}, \quad a = NL/G \quad (by Poisson)
One DNA Sequence (5/10)
- The coverage needed to cover the whole genome:
  P = .99: a = NL/G = 4.6; P = .999: a = 6.9 (about 7 times the genome length). (a: coverage)
- The human genome is 3x10^9 nucleotides: 3,000,000 of them are still missing at P = .999.
- What is the mean number of contigs? Each contig has a unique right-most fragment, so count successes over the N fragments with
  p = P(a fragment is the right-most member of a contig)
    = P(no other fragment has its left-end point on the fragment in question)
    = e^{-a}
- Mean number of contigs: Np = N e^{-a} = N e^{-NL/G}
One DNA Sequence (6/10)
- For G = 100,000 and L = 500:

  a                       0.5   0.75  1     1.5   2     3     4     5    6    7
  Mean number of contigs  60.7  70.8  73.6  66.9  54.1  29.9  14.7  6.7  3.0  1.3

- If there is a small number of fragments, there must be a small number of contigs.
- A large number of fragments tends to form a small number of large contigs.
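The closed-form mean N e^{-a} can be checked against the tabulated values; a minimal sketch, assuming G = 100,000 and L = 500 as stated (so N = aG/L):

```python
from math import exp

def mean_contigs(a, G=100_000, L=500):
    """Expected number of contigs, N * e^{-a}, with N = a*G/L fragments."""
    N = a * G / L
    return N * exp(-a)

for a in (0.5, 1, 2, 4, 7):
    print(a, round(mean_contigs(a), 1))
# reproduces the table: 60.7, 73.6, 54.1, 14.7, 1.3
```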
One DNA Sequence (7/10)
- Mean contig size? Tools needed:
- Expectation of a sum of random variables where N is random
- Moment generating function of a random variable
- Probability generating function of a random variable
- Geometric distribution
- Exponential distribution
One DNA Sequence (8/10)
- Moment generating function (mgf) of a random variable Y:
  m(\theta) = E(e^{\theta Y}) = \sum_y e^{\theta y} P(y) = \int e^{\theta y} f(y)\,dy
- Finding the mean and variance from the mgf:
  \mu = \left. \frac{dm}{d\theta} \right|_{\theta=0}, \quad \sigma^2 = \left. \frac{d^2 m}{d\theta^2} \right|_{\theta=0} - \mu^2
  or
  \mu = \left. \frac{d \log m}{d\theta} \right|_{\theta=0}, \quad \sigma^2 = \left. \frac{d^2 \log m}{d\theta^2} \right|_{\theta=0}
One DNA Sequence (9/10)
- Probability generating function (pgf) of a (discrete) random variable Y:
  p(t) = E(t^Y) = \sum_y P(y)\,t^y
  \mu = \left. \frac{d}{dt} p(t) \right|_{t=1}, \quad \sigma^2 = \left. \frac{d^2}{dt^2} p(t) \right|_{t=1} + \mu - \mu^2
One DNA Sequence (9/10, cont.)
- Expectation of a sum of random variables S = X_1 + \cdots + X_N when N is random:
  P(S = y) = \sum_n P_n\,P(S = y \mid n)
  pgf of N: p(t) = \sum_n P_n t^n; \quad pgf of S given N = n: q(t)^n
  P(S = y) = \text{coeff. of } t^y \text{ in } \sum_n P_n q(t)^n
  thus the pgf of S is p(q(t)), and by the chain rule E(S) = E(N)\,E(X)
One DNA Sequence (10/10)
- Contig length = sum of the left-end gaps of the n fragments overlapping in the contig + L.
- The distance between two successive left-hand end points ~ Exp(\lambda), \lambda = N/G (from the Poisson process).
- The number of overlapping fragments ~ Geo(p), with
  p = \int_0^L \lambda e^{-\lambda x}\,dx = 1 - e^{-a}
- Mean of the random distance (between successive left ends) ~ conditional exponential (0 < X < L):
  f(x \mid 0 < x < L) = \frac{f(x)}{\int_0^L f(x)\,dx} = \frac{\lambda e^{-\lambda x}}{1 - e^{-\lambda L}}
  E(X \mid 0 < X < L) = \frac{1}{\lambda} - \frac{L}{e^{\lambda L} - 1}
One DNA Sequence (11/10)
- Mean contig length:
  E(N)\,E(X \mid 0 < X < L) + L = (e^a - 1)\left( \frac{1}{\lambda} - \frac{L}{e^a - 1} \right) + L = \frac{e^a - 1}{\lambda}
- (E.g.) For L = 500, a = 2, G = 100,000: the mean contig size is about 1600.
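Plugging the example's numbers into the closed form; a quick check, using λ = N/G = a/L:

```python
from math import exp

def mean_contig_length(a, L):
    """Mean contig length (e^a - 1)/lambda, with lambda = a/L."""
    lam = a / L
    return (exp(a) - 1) / lam

print(round(mean_contig_length(2, 500)))  # ~1597, i.e. about 1600
```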
Maximum likelihood
- \theta^{ML} = \arg\max_\theta P(D \mid \theta, M)
- Consistent: the parameter value used to generate the data set will also be the value that maximizes the likelihood in the limit.
- (E.g.) For K observable outcomes \omega_i of the model, the frequency of occurrence of \omega_i tends to P(\omega_i \mid \theta_0, M), and the per-observation log-likelihood (1/n) \sum_i n_i \log P(\omega_i \mid \theta, M) tends to \sum_i P(\omega_i \mid \theta_0, M) \log P(\omega_i \mid \theta, M).
  The positivity of the relative entropy implies that for all \theta
  \sum_i P(\omega_i \mid \theta_0, M) \log P(\omega_i \mid \theta_0, M) \ge \sum_i P(\omega_i \mid \theta_0, M) \log P(\omega_i \mid \theta, M)
  Thus the likelihood is maximized by \theta_0.
- Drawback: can give poor estimates for scanty data.
Posterior probability distribution
- Bayes theorem:
  P(\theta \mid D, M) = \frac{P(D \mid \theta, M)\,P(\theta \mid M)}{P(D \mid M)}
- Use of posteriors: MAP (maximum a posteriori probability) estimate:
  \theta^{MAP} = \arg\max_\theta P(D \mid \theta, M)\,P(\theta \mid M)
- PME (posterior mean estimator): parameters weighted by the posterior:
  \theta^{PME} = \int \theta\,P(\theta \mid n)\,d\theta
Change of variables
- Given a density f(x) of x and a change of variable x = \phi(y), the density g(y) of y becomes
  g(y) = f(\phi(y))\,|\phi'(y)|
- The maximum of g(y), the MAP, or the PME may shift from that of f(x).
Sampling
- Given a finite set with probabilities P(x), to sample from this set means to pick elements x randomly with probability P(x).
- Sampling with a random number generator on [0, 1]:
  P(\text{selecting } x_i) = P\{ p(x_1) + \cdots + p(x_{i-1}) < \text{rand}[0,1] < p(x_1) + \cdots + p(x_{i-1}) + p(x_i) \} = P(x_i)
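The cumulative-interval scheme above is a few lines of code; a minimal sketch (the helper takes the uniform draw as an argument so the interval mapping is easy to check):

```python
import random

def pick(u, probs):
    """Map a uniform draw u in [0,1) to an outcome index via cumulative sums."""
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

probs = [0.1, 0.2, 0.3, 0.4]
print(pick(0.05, probs))  # falls in [0, 0.1)   -> 0
print(pick(0.25, probs))  # falls in [0.1, 0.3) -> 1
print(pick(0.95, probs))  # falls in [0.6, 1.0) -> 3
sample = pick(random.random(), probs)  # an actual random draw
```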
Sampling by transformation from a uniform distribution
- For a uniform density f(x) and a map x = \phi(y):
  g(y) = f(\phi(y))\,\phi'(y) = \phi'(y), \quad \text{so} \quad \phi(y) = \int_{-\infty}^{y} g(u)\,du, \quad y = \phi^{-1}(x)
- (E.g.) Sampling from a Gaussian: g(y) is given, x is uniform. Define
  \phi(y) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du
  and let y = \phi^{-1}(x).
Sampling with the Metropolis algorithm (1/3)
- For sampling when analytic methods are not available.
- Generate a sequence {y_i} from a transition process \tau(y_i \mid y_{i-1}) that approximates P as closely as we like, under the condition (detailed balance):
  P(x)\,\tau(y \mid x) = P(y)\,\tau(x \mid y)
- Detailed balance implies:
  \lim_{N \to \infty} \frac{1}{N}\,\#(y_i = x) = P(x)
- The sequence under this transition process will sample P correctly.
Sampling with the Metropolis algorithm (2/3)
Metropolis algorithm:
- Symmetric proposal mechanism: given a point x, this selects a point y with probability F(y | x), with F symmetric.
- Acceptance mechanism: accepts the proposed y with probability min(1, P(y)/P(x)).
  (A point y with larger posterior probability than the current x is always accepted, and one with lower probability is accepted randomly with probability P(y)/P(x).)
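The two mechanisms above fit in a few lines; a minimal sketch for a small discrete target (the uniform proposal over all states is our illustrative choice and is symmetric):

```python
import random

def metropolis(target, n_steps, seed=0):
    """Sample state indices from `target` with a symmetric uniform proposal."""
    rng = random.Random(seed)
    k = len(target)
    x = 0
    samples = []
    for _ in range(n_steps):
        y = rng.randrange(k)  # symmetric proposal: F(y|x) = 1/k
        # acceptance mechanism: accept y with probability min(1, P(y)/P(x))
        if rng.random() < min(1.0, target[y] / target[x]):
            x = y
        samples.append(x)
    return samples

target = [0.1, 0.2, 0.3, 0.4]
samples = metropolis(target, 50_000)
freqs = [samples.count(i) / len(samples) for i in range(4)]
print([round(f, 2) for f in freqs])  # close to the target probabilities
```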
Sampling with the Metropolis algorithm (3/3)
Metropolis algorithm (balance):
  P(x)\,\tau(y \mid x) = P(x)\,F(y \mid x)\,\min(1, P(y)/P(x))
                       = F(y \mid x)\,\min(P(x), P(y))
                       = P(y)\,F(x \mid y)\,\min(1, P(x)/P(y)) = P(y)\,\tau(x \mid y)
Gibbs sampling
- Sample from the conditional distributions for each i:
  P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N)
- The proposal distribution is the conditional distribution: always accept the sample.
Estimation of probabilities from counts (1/2)
- \theta_i^{ML} = n_i / N
- For any \theta \ne \theta^{ML} and an observation n:
  \log \frac{P(n \mid \theta^{ML})}{P(n \mid \theta)} = \log \frac{\prod_i (\theta_i^{ML})^{n_i}}{\prod_i \theta_i^{n_i}} = \sum_i n_i \log \frac{\theta_i^{ML}}{\theta_i} = N \sum_i \theta_i^{ML} \log \frac{\theta_i^{ML}}{\theta_i} > 0
  so P(n \mid \theta^{ML}) > P(n \mid \theta).
Estimation of probabilities from counts (2/2)
- For scarce data: use a prior.
  P(\theta \mid n) = \frac{P(n \mid \theta)\,P(\theta)}{P(n)}
  With P(n \mid \theta) = \frac{1}{M(n)} \prod_{i=1}^{K} \theta_i^{n_i} and the Dirichlet prior P(\theta) = \frac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}:
  P(\theta \mid n) = \frac{1}{M(n)\,Z(\alpha)\,P(n)} \prod_{i=1}^{K} \theta_i^{n_i + \alpha_i - 1} = D(\theta \mid n + \alpha)
- Posterior mean estimator with pseudocounts \alpha\ (A = \sum_i \alpha_i):
  \theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}
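The pseudocount estimator above is one line per symbol; a minimal sketch comparing it with the ML estimate on a small count vector (the counts are made up for illustration):

```python
def theta_ml(counts):
    """Maximum-likelihood estimate: n_i / N."""
    n = sum(counts)
    return [c / n for c in counts]

def theta_pme(counts, alphas):
    """Posterior mean under a Dirichlet prior: (n_i + a_i) / (N + A)."""
    n, a = sum(counts), sum(alphas)
    return [(c + al) / (n + a) for c, al in zip(counts, alphas)]

counts = [3, 0, 1, 0]              # scarce data over A, C, G, T
print(theta_ml(counts))            # ML assigns probability 0 to unseen symbols
print(theta_pme(counts, [1] * 4))  # pseudocounts keep every symbol possible
```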
Mixtures of Dirichlets
- A mixture prior and its posterior:
  P(\theta \mid \alpha^1, \ldots, \alpha^m) = \sum_k q_k D(\theta \mid \alpha^k), \quad q_k = P(\alpha^k)
  P(\theta \mid n) = \sum_k P(\alpha^k \mid n)\,P(\theta \mid n, \alpha^k) = \sum_k P(\alpha^k \mid n)\,D(\theta \mid n + \alpha^k)
  \theta_i^{PME} = \sum_k P(\alpha^k \mid n)\,\frac{n_i + \alpha_i^k}{N + A^k}
EM algorithm (1/3)
- A general algorithm for ML estimation with missing data.
- The Baum-Welch algorithm for estimating hidden Markov model probabilities is a special case of the EM algorithm.
- Maximize:
  \log P(x \mid \theta) = \log \sum_y P(x, y \mid \theta)
  x: observation, y: missing data
EM algorithm (2/3)
  \log P(x \mid \theta) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)
                        = Q(\theta \mid \theta^t) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)
  Thus
  \log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_y P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}
                                                  \ge Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)
EM algorithm (3/3)
- Choosing
  \theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)
  always makes the difference positive, and thus the likelihood of the new model is at least as large as the likelihood of the old one.
- EM algorithm:
  (E-step) Calculate the Q function.
  (M-step) Maximize Q(\theta \mid \theta^t) wrt. \theta.
- The likelihood increases in each iteration.
- Algorithms that merely increase Q in the M-step, instead of maximizing it, are called generalized EM (GEM).
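As an illustration of the E- and M-steps, a minimal sketch of EM for a mixture of two biased coins (the data, initialization, and equal mixing weights are our assumptions; the missing variable y is which coin produced each group of tosses):

```python
from math import comb, log

def binom_lik(h, n, t):
    """Likelihood of h heads in n tosses for a coin with bias t."""
    return comb(n, h) * t**h * (1 - t)**(n - h)

def loglik(data, n, tA, tB):
    """Log-likelihood of the mixture with equal mixing weights."""
    return sum(log(0.5 * binom_lik(h, n, tA) + 0.5 * binom_lik(h, n, tB))
               for h in data)

def em_step(data, n, tA, tB):
    """One E-step + M-step; returns updated coin biases."""
    hA = nA = hB = nB = 0.0
    for h in data:
        la, lb = binom_lik(h, n, tA), binom_lik(h, n, tB)
        w = la / (la + lb)            # E-step: responsibility of coin A
        hA += w * h;       nA += w * n
        hB += (1 - w) * h; nB += (1 - w) * n
    return hA / nA, hB / nB           # M-step: weighted head frequencies

data, n = [5, 9, 8, 4, 7], 10         # heads out of 10 tosses per group
tA, tB = 0.6, 0.5
for _ in range(20):
    tA, tB = em_step(data, n, tA, tB)
print(round(tA, 2), round(tB, 2))     # the two coin biases separate
```

Each iteration can only raise the log-likelihood, which is exactly the guarantee derived on the previous slides.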