
Copyright (c) 2002 by SNU CSE Biointelligence Lab

Biological Sequence Analysis (Ch 11. Background on probability)

Biointelligence Laboratory, School of Computer Sci. & Eng.
Seoul National University, Seoul 151-742, Korea

This slide file is available online at http://bi.snu.ac.kr/


List of Contents

- Introduction
- Random variables
- Probability density, distributions
- Transformation
- Plots of statistical distributions
- Relative entropy and mutual information
- Many random variables
- One DNA sequence
- Maximum likelihood
- Sampling
- Metropolis sampling
- EM algorithm


Introduction

- Sequence similarity test:
  g g a g a c t g t a g a c a g c t a a t g c t a t a
  g a a c g c c c t a g c c a c g a g c c c t t a t c
  P(more than 10 matches) = ? (0.04; checked numerically below)

- Parameter, data, hypothesis, random variable
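The 0.04 can be checked directly. Assuming the two length-26 sequences are unrelated and each position matches independently with probability 1/4, the number of matches is Binomial(26, 1/4); a minimal sketch (SciPy assumed available):

```python
from scipy.stats import binom

# Matches between two random length-26 DNA sequences: X ~ Binomial(26, 1/4).
# "More than 10 matches" is P(X >= 11) = sf(10).
print(round(binom.sf(10, n=26, p=0.25), 3))  # ~0.04
```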


One DNA Sequence

- Shotgun sequencing: reconstruct a long DNA sequence from many overlapping short reads (about 500 bases each).


One DNA Sequence (2)

- 1. What is the mean proportion of the genome covered by contigs?

- 2. What is the mean number of contigs?
- 3. What is the mean contig size?


(Discrete) Random Variables

- A discrete numerical quantity corresponding to an observed outcome of an experiment

- (E.g.) Experiment: rolling two six-sided dice.
  Outcome: the two numbers on the dice.
  Discrete random variables: the sum of the two numbers, the difference of the two numbers, the smaller of the two numbers, etc.

- Random variables are usually represented by uppercase symbols (X, Y, ...) and the realized values of the random variables by lowercase symbols (x, y, ...).


Probability Distributions

- For a finite set $\mathcal{X}$: the probability distribution is simply an assignment of a probability $p_x$ to each outcome $x$ in $\mathcal{X}$. (E.g.) the outcome probabilities when rolling a loaded die: {1/12, 1/12, 1/12, 1/6, 1/4, 1/3}

- For a continuous set $\mathcal{X}$: the probability density $f(x)$ satisfies
  $P(x - \delta x/2 \le X \le x + \delta x/2) \approx f(x)\,\delta x$
  $P(x_0 \le X \le x_1) = \int_{x_0}^{x_1} f(x)\,dx$
  $f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$
  Note: $f(x)$ can be greater than 1.


Probability density, distribution (1/2)

- Relation of distribution and density:
  $F(t) = P(X \le t) = \sum_{x \le t} P(X = x) = \int_{-\infty}^{t} f(x)\,dx$
- Expectation (mean):
  $E(X) = \sum_x x\,P(X = x) = \int x f(x)\,dx$
- For continuous $X$: $P(X = x) = 0$, and $f(x)$ can be greater than 1.


Probability density, distribution (2/2)

- Relation of distribution and density:
  $P(X = t) = \sum_{x \le t} P(X = x) - \sum_{x < t} P(X = x), \qquad f(t) = \frac{d}{dt} F(t)$
- Conditions for the probability density:
  1. $f(x) \ge 0$
  2. $\int f(x)\,dx = 1$
- Variance:
  $E(X - \mu)^2 = E(X^2) - \mu^2$


Transformation

- One random variable: $X_2 = g(X_1)$ for monotone $g$,
  $F_2(t) = P(X_2 < t) = P(g(X_1) < t) = P(X_1 < g^{-1}(t)) = F_1(g^{-1}(t))$
  $f_2(x) = f_1(g^{-1}(x))\,(g^{-1})'(x)$
- Many random variables: for 1-1 relations between $\vec{X} = (X_1, \dots, X_n)$ and $\vec{U} = (U_1(X_1, \dots, X_n), \dots, U_n(X_1, \dots, X_n))$,
  $f_U(u_1, \dots, u_n) = f_X(x_1, \dots, x_n)\,J^{-1} = f_X(x_1, \dots, x_n)\,J^{*}$
  where $J$ is the Jacobian of the transformation and $J^{*}$ that of its inverse.


Statistical Independence

- Two events are independent if the outcome of one event does not affect the outcome of the other event.

- Discrete random variables are independent if the value of one does not affect the probabilities associated with the values of another random variable.

- (E.g.) Experiment: rolling a fair die.
  1. A: the number is even; B: the number is greater than or equal to 3.
     P(A | B) = P(A), P(B | A) = P(B): A and B are independent.
  2. C: the number is greater than 3.
     P(C) = 1/2, P(C | A) = 2/3, P(A | C) = 2/3, P(A) = 1/2: A and C are not independent.
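These conditional probabilities can be verified by enumerating the six outcomes of a fair die; a minimal sketch:

```python
from fractions import Fraction

outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}   # even
B = {x for x in outcomes if x >= 3}
C = {x for x in outcomes if x > 3}

def p(event, given=outcomes):
    # P(event | given) under a fair die.
    return Fraction(len(event & given), len(given))

print(p(A), p(A, B), p(B), p(B, A))     # 1/2 1/2 2/3 2/3 -> A, B independent
print(p(C), p(C, A), p(A, C))           # 1/2 2/3 2/3     -> A, C not independent
```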


Uniform Distribution

- Discrete uniform: each outcome is equally likely.

- For N outcomes: $P(x) = 1/N$
- Continuous uniform: $f(x) = 1/(b - a)$ or $1/\mathrm{area}(A)$


Bernoulli Distribution

- A Bernoulli trial is a single trial with two possible outcomes (success / failure).

- Bernoulli random variable Y: the number of successes in this trial
  $P(y \text{ successes in a trial}) = p^y (1 - p)^{1 - y}, \quad y = 0, 1$


Binomial Distribution

- Defined on a finite set of all the possible results of N trials with a binary outcome ('0' or '1').

- The random variable is the number of successes in the fixed N trials.

$P(k \text{ successes out of } N) = \binom{N}{k} p^k (1 - p)^{N - k}$
$m = \sum_k k\,P(k) = \sum_k k \binom{N}{k} p^k (1 - p)^{N - k} = Np$
$\sigma^2 = \sum_k (k - m)^2 P(k) = Np(1 - p)$
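The mean and variance formulas can be sanity-checked by computing the sums directly; a small sketch (N = 30 and p = 0.8 are arbitrary choices):

```python
from math import comb

N, p = 30, 0.8
pmf = [comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))
print(mean, N * p)           # both 24.0
print(var, N * p * (1 - p))  # both 4.8
```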


Multinomial Distribution

- Extend the binomial to K independent outcomes with probabilities $\theta_i\ (i = 1, \dots, K)$:
  $P(n_1, \dots, n_K \mid \theta) = \frac{n!}{n_1! \cdots n_K!} \prod_{i=1}^{K} \theta_i^{n_i}$
- (E.g.) Rolling a fair die N times:
  P(rolling 12 times and getting each number twice) = $\frac{12!}{(2!)^6} \left(\frac{1}{6}\right)^{12} = 3.4 \times 10^{-3}$
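The 3.4 × 10^-3 can be reproduced with SciPy's multinomial pmf (a sketch):

```python
from scipy.stats import multinomial

# 12 rolls of a fair die, each face appearing exactly twice.
p = multinomial.pmf([2] * 6, n=12, p=[1/6] * 6)
print(f"{p:.2e}")  # ~3.44e-03
```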


Geometric Distribution

- The random variable is the number of trials before the first failure (the length of a success run):
  $P(y) = p^y (1 - p), \quad y = 0, 1, 2, \dots$
  $F(y) = P(Y \le y) = 1 - p^{y + 1}$
- Used to test the significance of long success runs in sequences
- Geometric-like random variables in BLAST theory:
  $1 - F(y - 1) = P(Y \ge y) \sim C p^y \quad (0 < C < 1)$


Negative Binomial Distribution

- The random variable is the number of trials needed to reach a fixed number m of successes
- P(N = n) = P(the first (n-1) trials result in exactly (m-1) successes and (n-m) failures, and trial n results in success):
  $P(N = n) = \binom{n - 1}{m - 1} p^m (1 - p)^{n - m}, \quad n = m, m+1, m+2, \dots$


Generalized Geometric Distribution

- The random variable is the number of trials before the (k+1)-th failure:
  $P(y) = \binom{y}{k} p^{y - k} (1 - p)^{k + 1}, \quad y = k, k+1, k+2, \dots$


Poisson Distribution

- A limiting form of the binomial distribution (as n becomes large and p becomes small) with moderate λ = np:

  $P(y) = \frac{\lambda^y\,e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \dots$
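The limit is easy to see numerically: for large n and small p, the binomial pmf approaches the Poisson pmf with λ = np (a sketch; n = 1000 and p = 0.005 are arbitrary):

```python
from scipy.stats import binom, poisson

n, p = 1000, 0.005              # large n, small p, lambda = np = 5
for k in (0, 3, 5, 10):
    print(k, binom.pmf(k, n, p), poisson.pmf(k, n * p))
# the two columns agree to about three decimal places
```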


Exponential Distribution

- Exponential distribution is a continuous analogue of the geometric distribution:

  $f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$
  $F(x) = 1 - e^{-\lambda x}, \quad x \ge 0$
  $E(X) = 1/\lambda, \qquad \mathrm{Var}(X) = 1/\lambda^2$
  $\mathrm{Med}(X) = E(X)\,\ln 2 \approx 0.69\,E(X) < E(X)$ (right-skewed)


Relation with Geometric Distribution

- Suppose X has an exponential distribution and Y is the integer part of X:
  $P(Y = y) = P(y \le X < y + 1) = e^{-\lambda y}(1 - e^{-\lambda}) = p^y (1 - p), \quad p = e^{-\lambda}$
  so Y is geometrically distributed.
- The density function of the fractional part D = X - Y:
  $f(d) = \frac{\lambda e^{-\lambda d}}{1 - e^{-\lambda}}$
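This relation can be checked by simulation: the integer parts of exponential draws occur with geometric frequencies, with p = e^-λ (a sketch; λ = 0.7 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.7
y = np.floor(rng.exponential(scale=1 / lam, size=200_000))

p = np.exp(-lam)
for k in range(4):
    print(k, (y == k).mean(), p**k * (1 - p))  # empirical vs p^k (1 - p)
```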


Gaussian Distribution

- As N gets larger, the mean and variance of the binomial distribution increase linearly; rescale k by
  $u = \frac{k - m}{\sigma} = \frac{k - Np}{\sqrt{Np(1 - p)}}$
- Then the binomial becomes a Gaussian with density
  $f(u) = \frac{1}{\sqrt{2\pi}} \exp(-u^2 / 2)$


Dirichlet Distribution (1/3)

- The Dirichlet distribution is a conjugate distribution for the multinomial distribution. (A Dirichlet prior with respect to a multinomial likelihood gives a Dirichlet posterior again.)

- The $\theta_i$'s correspond to the parameters of the multinomial distribution:
  $\mathcal{D}(\theta_1, \dots, \theta_K \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,\delta\left(\sum_{i=1}^{K} \theta_i - 1\right) \quad (0 \le \theta_i \le 1)$
  $Z = \int \prod_i \theta_i^{\alpha_i - 1}\,\delta\left(\sum_i \theta_i - 1\right) d\theta = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\left(\sum_i \alpha_i\right)}$
  (using $\Gamma(x + 1) = x\,\Gamma(x)$)


Dirichlet Distribution (2/3)

- The mean of the Dirichlet is equal to the normalized parameters:
  $E[\theta_i] = \frac{\alpha_i}{\sum_k \alpha_k}$
- (E.g.) The three distributions all have the same mean (1/8, 2/8, 5/8) though with different values of α: (1, 2, 5); (10, 20, 50); (.1, .2, .5)
- For K = 2, the Dirichlet reduces to the Beta distribution


Dirichlet Distribution (3/3)

4 (e.g.) The dice factory: sampling fromDirichlet parameterized by 4 Factory A: all six parameters set to 10Factory B: all six parameters set to 24 On average both factories produce fair dice (average = 1/6)But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it isMore likely from factory B:4 The variance of Dirichlet is inversely proportional to the sum of parameters.

61 ,, θθ L

61 ,, αα L

6.199)5(.)1(.)2()12()|(

119.0)5(.)1(.)10()60()|(

12)12(56

110)110(56

=ΓΓ

=

=ΓΓ

=

−−

−−

B

A

D

D

αθ

αθ
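Both density values can be reproduced with SciPy (a sketch):

```python
from scipy.stats import dirichlet

theta = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]   # the loaded die
print(dirichlet.pdf(theta, [10] * 6))    # ~0.119  (factory A)
print(dirichlet.pdf(theta, [2] * 6))     # ~199.6  (factory B)
```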


Gamma Distribution (1/2)

- $g(x; \alpha, \beta) = \frac{\beta^{\alpha}\,x^{\alpha - 1}\,e^{-\beta x}}{\Gamma(\alpha)}, \quad 0 < x < \infty$
  $\text{mean} = \alpha / \beta, \qquad \text{var} = \alpha / \beta^2$


Gamma Distribution (2/2)

- The gamma distribution is conjugate to the Poisson: the probability of seeing n events over some interval, when there is a probability p of an individual event occurring in that interval, is
  $f(n) = \frac{e^{-p}\,p^n}{n!}$
- The gamma distribution is used to model the probabilities of rates.
- The gamma distribution is used to model the rate of evolution at different sites in DNA sequences.


Extreme Value Distribution (1/2)

- For N samples from g(x), the probability that the largest of them is less than x is $G(x)^N$, where $G(x) = \int_{-\infty}^{x} g(u)\,du$
- The extreme value density (EVD) for g(x) is the limit of
  $h(x) = N g(x)\,G(x)^{N - 1}$
- The EVD is used to model the breaking point of a chain, and to assess the significance of the maximum score from a set of alignments.


Extreme Value Distribution (2/2)

- EVD for the exponential density $g(x) = \alpha e^{-\alpha x}$:
  $h(x) = N \alpha e^{-\alpha x}\left(1 - e^{-\alpha x}\right)^{N - 1} = \alpha e^{-z}\left(1 - e^{-z}/N\right)^{N - 1} \to \alpha\,e^{-z}\exp(-e^{-z}), \quad z = \alpha(x - y)$
  (Gumbel distribution)
- (E.g.) For N = 1, 2, 10, 100: N >= 10 gives a good approximation to the EVD
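The convergence to the Gumbel form can be checked by simulating maxima of N exponentials and shifting by ln N (a sketch with α = 1, N = 10):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
m = rng.exponential(size=(100_000, N)).max(axis=1) - np.log(N)  # z = x - ln N

# Compare the empirical CDF of the shifted maxima with the Gumbel CDF exp(-e^-z).
for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, (m <= z).mean(), np.exp(-np.exp(-z)))
```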


Statistical Distribution (1/7)

- Binomial distribution [plots: p = 0.5, n = 30 and p = 0.8, n = 30]

Statistical Distribution (2/7)

- Poisson distribution [plot]

Statistical Distribution (3/7)

- Negative binomial distribution [plot]

Statistical Distribution (4/7)

- Geometric distribution [plot]

Statistical Distribution (5/7)

- Exponential distribution [plot]

Statistical Distribution (6/7)

- Normal distribution [plot]

Statistical Distribution (7/7)

- Lognormal distribution [plot]

Entropy (1/2)

- A measure of the average uncertainty of an outcome
- Shannon entropy:
  $H(X) = -\sum_i P(x_i) \log P(x_i)$
- Entropy is maximized when all the $P(x_i)$ are equal, and the maximum is then $\log K$. If we are certain of the outcome, the entropy is zero.
- Information:
  $I(X) = H_{\text{before}} - H_{\text{after}}$


Entropy (2/2)

- (E.g.) The entropy of equiprobable DNA symbols (A, C, G, T) is 2 bits.
- Information content of a conserved position: a particular position is always an A or a G, with p(A) = 0.7 and p(G) = 0.3. Thus
  $H_{\text{before}} - H_{\text{after}} = 2 - 0.88 = 1.12 \text{ bits}$
  (The more conserved the position, the higher the information content.)
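The 0.88 is the base-2 entropy of the (0.7, 0.3) distribution; a minimal check:

```python
from math import log2

def H(ps):
    # Shannon entropy in bits.
    return -sum(p * log2(p) for p in ps if p > 0)

print(H([0.25] * 4))                   # 2.0 bits (before)
print(H([0.7, 0.3]))                   # ~0.881 bits (after)
print(H([0.25] * 4) - H([0.7, 0.3]))   # ~1.12 bits of information
```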


Relative entropy and mutual information (1/4)

- Relative entropy (KL divergence) of P with respect to Q:
  $H(P \,\|\, Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}$
- Information content and relative entropy are the same if Q represents the initial state
- Not a metric:
  $H(P \,\|\, Q) \ne H(Q \,\|\, P)$


Relative entropy and mutual information (2/4)

- Positivity of relative entropy (using $\log x \le x - 1$):
  $-H(P \,\|\, Q) = \sum_i P(x_i) \log \frac{Q(x_i)}{P(x_i)} \le \sum_i P(x_i) \left( \frac{Q(x_i)}{P(x_i)} - 1 \right) = 0$
  hence $H(P \,\|\, Q) \ge 0$.


Relative entropy and mutual information (3/4)

- Independence can be measured by the relative entropy between P(X, Y) and P(X)P(Y):
  $M(X; Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)}$
- M(X; Y) can be interpreted as the amount of information that we acquire about outcome X when we are told outcome Y.


Relative entropy and mutual information (4/4)

- (E.g.) Acceptor sites: 757 acceptor sites from a database of human genes; 30 bases upstream and 20 bases downstream are extracted from each acceptor.
- Relative entropy at position i:
  $\sum_a p_i(a) \log\left[ p_i(a) / q_a \right]$
  where q is the overall distribution of the four nucleotides in the sequences.
- Mutual information between adjacent positions:
  $\sum_{a,b} p_{i,i+1}(a, b) \log\left[ p_{i,i+1}(a, b) / \left( p_i(a)\,p_{i+1}(b) \right) \right]$


Many random variables (1/2)

- Marginal probability:
  $P(Y_1 = y_1, \dots, Y_i = y_i) = \sum_{y_{i+1}, \dots, y_k} P(Y_1 = y_1, \dots, Y_k = y_k)$
  $f(x_1, \dots, x_i) = \int \cdots \int f(x_1, \dots, x_k)\,dx_{i+1} \cdots dx_k$
- Conditional probability:
  $P(Y_{i+1} = y_{i+1}, \dots, Y_k = y_k \mid Y_1 = y_1, \dots, Y_i = y_i) = \frac{P(Y_1 = y_1, \dots, Y_k = y_k)}{P(Y_1 = y_1, \dots, Y_i = y_i)}$
  $f(x_{i+1}, \dots, x_k \mid x_1, \dots, x_i) = \frac{f(x_1, \dots, x_k)}{f(x_1, \dots, x_i)}$


Many random variables (2/2)

- Covariance:
  $\sigma_{Y_1 Y_2} = \sum_{y_1, y_2} (y_1 - \mu_1)(y_2 - \mu_2)\,P(Y_1 = y_1, Y_2 = y_2) = \iint (x_1 - \mu_1)(x_2 - \mu_2)\,f(x_1, x_2)\,dx_1\,dx_2$
- Correlation:
  $\rho_{Y_1 Y_2} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$


One DNA Sequence (1/10)

- Shotgun sequencing: reconstruct a long DNA sequence from many overlapping short reads.


One DNA Sequence (2/10)

- 1. What is the mean proportion of the genome covered by contigs?

- 2. What is the mean number of contigs?
- 3. What is the mean contig size?


One DNA Sequence (3/10)

- There are N fragments, each of length L; the whole DNA is of length G.

- The position of the left-hand end of any fragment is uniformly distributed in (0, G).

- The number of fragments falling in an interval of length h has mean Np = Nh/G, from the binomial distribution.

- For large N and a small interval h, this distribution becomes Poisson with mean a = Nh/G.


One DNA Sequence (4/10)

- Mean proportion of the genome covered by one or more fragments
  = P(a point chosen at random is covered by at least one fragment)
  = P(the left end of at least one fragment is in the interval of length L ending at that point) (by Poisson)
  = $1 - e^{-NL/G} = 1 - e^{-a}, \quad a = NL/G$


One DNA Sequence (5/10)

- The coverage needed for the fragments to cover the whole genome:
  P = .99 → a = NL/G = 4.6 (a: coverage)
  P = .999 → a = 6.9 (about 7 times the genome length)

- The human genome is 3×10^9 nucleotides: with P = .999, 3,000,000 of them are still missing.

- What is the mean number of contigs? Each contig has a unique rightmost fragment.
  Number of trials: N fragments; success probability:
  p = P(a fragment is the rightmost member of a contig)
  = P(no other fragment has its left-end point on the fragment in question)
  = $e^{-a}$
- Mean number of contigs = $Np = N e^{-a} = N e^{-NL/G}$


One DNA Sequence (6/10)

- For G = 100,000 and L = 500:

  a                        0.5   0.75  1     1.5   2     3     4     5    6    7
  mean number of contigs   60.7  70.8  73.6  66.9  54.1  29.9  14.7  6.7  3.0  1.3

- If there is a small number of fragments, there must be a small number of contigs.
- A large number of fragments tends to form a small number of large contigs.
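The table above is just $N e^{-a}$ with $N = aG/L$; a sketch reproducing it:

```python
from math import exp

G, L = 100_000, 500
for a in (0.5, 0.75, 1, 1.5, 2, 3, 4, 5, 6, 7):
    N = a * G / L                    # number of fragments at coverage a
    print(a, round(N * exp(-a), 1))  # mean number of contigs
```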


One DNA Sequence (7/10)

- Mean contig size?
- Expectation of a sum of random variables where N is random
- Moment generating function of a random variable
- Probability generating function of a random variable
- Geometric distribution
- Exponential distribution


One DNA Sequence (8/10)

- Moment generating function (mgf) of a random variable Y:
  $m(\theta) = E(e^{\theta Y}) = \sum_y e^{\theta y} P(y) = \int e^{\theta y} f(y)\,dy$
- Finding the mean and variance from the mgf:
  $\mu = \left.\frac{dm(\theta)}{d\theta}\right|_{\theta = 0}, \qquad \sigma^2 = \left.\frac{d^2 m(\theta)}{d\theta^2}\right|_{\theta = 0} - \mu^2$
  or
  $\mu = \left.\frac{d}{d\theta}\log m(\theta)\right|_{\theta = 0}, \qquad \sigma^2 = \left.\frac{d^2}{d\theta^2}\log m(\theta)\right|_{\theta = 0}$


One DNA Sequence (9/10)

- Probability generating function (pgf) of a (discrete) random variable Y:
  $p(t) = E(t^Y) = \sum_y P(y)\,t^y$
  $\mu = \left.\frac{d}{dt}p(t)\right|_{t = 1}, \qquad \sigma^2 = \left.\frac{d^2}{dt^2}p(t)\right|_{t = 1} + \mu - \mu^2$


One DNA Sequence (9/10)

- Expectation of a sum of random variables when N is random: $S = X_1 + \cdots + X_N$
  pgf of N: $p(t) = \sum_n P_n t^n$
  pgf of S given N = n: $(q(t))^n$, so $P(S = y \mid n) = \text{coeff. of } t^y \text{ in } (q(t))^n$
  $P(S = y) = \sum_n P_n\,P(S = y \mid n) = \text{coeff. of } t^y \text{ in } \sum_n P_n (q(t))^n$
  Thus the pgf of S is $p(q(t))$, and by the chain rule $E(S) = E(N)\,E(X)$


One DNA Sequence (10/10)

- Contig length = (sum of the n left-end spacings of the fragments overlapping in the contig) + L
- The distance between two successive left-hand end points ~ Exp(λ), λ = N/G (known from the Poisson process)
- The number of overlapping fragments ~ Geo(p), with
  $p = \int_0^L \lambda e^{-\lambda x}\,dx = 1 - e^{-a}$
- Mean of the random distance (between successive left ends) ~ conditional exponential on (0 < X < L):
  $f(x \mid 0 < x < L) = \frac{f(x)}{\int_0^L f(x)\,dx} = \frac{\lambda e^{-\lambda x}}{1 - e^{-\lambda L}}$
  $E(X \mid 0 < X < L) = \frac{1}{\lambda} - \frac{L}{e^{\lambda L} - 1}$


One DNA Sequence (11/10)

- Mean contig length:
  $E(N)\,E(X \mid 0 < X < L) + L = (e^{a} - 1)\left(\frac{1}{\lambda} - \frac{L}{e^{a} - 1}\right) + L$
- (E.g.) For L = 500, a = 2, G = 100,000: the mean contig size ≈ 1600
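Plugging the numbers into the formula above (a sketch):

```python
from math import exp

L, a, G = 500, 2, 100_000
lam = a / L                              # lambda = N/G = a/L
mean_gap = 1 / lam - L / (exp(a) - 1)    # E(X | 0 < X < L)
mean_contig = (exp(a) - 1) * mean_gap + L
print(round(mean_contig))                # ~1597, i.e. about 1600
```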


Maximum likelihood

- $\theta^{ML} = \arg\max_\theta P(D \mid \theta, M)$
- Consistent: the parameter value used to generate the data set will also be the value that maximizes the likelihood, in the limit.
- (E.g.) For K observable outcomes of the model, the frequency of occurrence of $\omega_i$ will tend to $P(\omega_i \mid \theta^0, M)$, and the log-likelihood for parameter θ,
  $(1/n)\sum_i n_i \log P(\omega_i \mid \theta, M)$, tends to $\sum_i P(\omega_i \mid \theta^0, M)\,\log P(\omega_i \mid \theta, M)$.
  The positivity of the relative entropy implies that for all θ
  $\sum_i P(\omega_i \mid \theta^0, M) \log P(\omega_i \mid \theta^0, M) \ge \sum_i P(\omega_i \mid \theta^0, M) \log P(\omega_i \mid \theta, M)$
  Thus the likelihood is maximized by $\theta^0$.
- Drawback: can give poor estimates for scanty data


Posterior probability distribution

- Bayes theorem:
  $P(\theta \mid D, M) = \frac{P(D \mid \theta, M)\,P(\theta \mid M)}{P(D \mid M)}$
- Use of posteriors: MAP (maximum a posteriori probability) estimate
  $\theta^{MAP} = \arg\max_\theta P(D \mid \theta, M)\,P(\theta \mid M)$
- PME (posterior mean estimator): parameters weighted by the posterior
  $\theta^{PME} = \int \theta\,P(\theta \mid n)\,d\theta$


Change of variables

- Given a density f(x) of x and a change of variable x = φ(y), the density g(y) of y becomes
  $g(y) = f(\varphi(y))\,|\varphi'(y)|$
- The maximum of g(y), MAP, or PME may shift from that of f(x).


Sampling

- Given a finite set with probabilities P(x), to sample from this set means to pick elements x randomly with probability P(x).

- Sampling with a random number generator on [0, 1]:
  $P(\text{selecting } x_i) = P\left( p(x_1) + \cdots + p(x_{i-1}) < \mathrm{rand}[0,1] < p(x_1) + \cdots + p(x_{i-1}) + p(x_i) \right) = P(x_i)$
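A minimal sketch of this cumulative-sum sampler:

```python
import random
from itertools import accumulate

def sample(xs, ps):
    # Pick x_i with probability p_i using one uniform draw on [0, 1].
    u = random.random()
    for x, cum in zip(xs, accumulate(ps)):
        if u < cum:
            return x
    return xs[-1]  # guard against floating-point round-off

counts = {}
for _ in range(100_000):
    x = sample(["A", "C", "G", "T"], [0.1, 0.2, 0.3, 0.4])
    counts[x] = counts.get(x, 0) + 1
print(counts)  # frequencies ~ 0.1, 0.2, 0.3, 0.4
```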


Sampling by transformation from a uniform distribution

- For a uniform density f(x) and a map x = φ(y):
  $g(y) = f(\varphi(y))\,\varphi'(y) = \varphi'(y)$, so $\varphi(y) = \int_{-\infty}^{y} g(u)\,du$ and $y = \varphi^{-1}(x)$
- (E.g.) Sampling from a Gaussian: g(y) is given, x is uniform; find y.
  Define $\varphi(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-u^2/2}\,du$, and let $y = \varphi^{-1}(x)$.


Sampling with the Metropolis algorithm (1/3)

- For sampling when analytic methods are not available
- Generate a sequence {y_i} from a transition process $\tau(y \mid y_{i-1})$ that approximates P as closely as we like, under the condition (detailed balance):
  $P(x)\,\tau(y \mid x) = P(y)\,\tau(x \mid y)$
- Detailed balance implies:
  $\lim_{N \to \infty} \frac{1}{N} \#\{y_i = x\} = P(x)$
- The sequence under this transition process will sample P correctly


Sampling with the Metropolis algorithm (2/3)

Metropolis algorithm:
- Symmetric proposal mechanism: given a point x, select a point y with probability F(y | x), with F symmetric
- Acceptance mechanism: accept the proposed y with probability min(1, P(y)/P(x))
  (A point y with larger posterior probability than the current x is always accepted; one with lower probability is accepted randomly with probability P(y)/P(x))


Sampling with the Metropolis algorithm (3/3)

Metropolis algorithm (balance):
$P(x)\,\tau(y \mid x) = P(x)\,F(y \mid x)\,\min(1, P(y)/P(x)) = F(y \mid x)\,\min(P(x), P(y))$
$= F(x \mid y)\,\min(P(y), P(x)) = P(y)\,\tau(x \mid y)$


Gibbs sampling

- Sampling from the conditional distributions for each i:
  $P(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N)$
- The proposal distribution is the conditional distribution: always accept the sample.


Estimation of probabilities from counts (1/2)

- The maximum likelihood estimate from counts is $\theta_i^{ML} = n_i / N$. For any $\theta \ne \theta^{ML}$ and an observation n, $P(n \mid \theta^{ML}) > P(n \mid \theta)$:
  $\log \frac{P(n \mid \theta^{ML})}{P(n \mid \theta)} = \log \prod_i \left( \frac{\theta_i^{ML}}{\theta_i} \right)^{n_i} = N \sum_i \theta_i^{ML} \log \frac{\theta_i^{ML}}{\theta_i} > 0$
  by the positivity of the relative entropy.


Estimation of probabilities from counts (2/2)

- For scarce data: use a prior
  $P(\theta \mid n) = \frac{P(n \mid \theta)\,P(\theta)}{P(n)}$
  with the multinomial likelihood $P(n \mid \theta) = \frac{1}{M(n)} \prod_i \theta_i^{n_i}$ and the Dirichlet prior $P(\theta) = \mathcal{D}(\theta \mid \alpha) = \frac{1}{Z(\alpha)} \prod_i \theta_i^{\alpha_i - 1}$.
  Since $P(n) = \frac{Z(n + \alpha)}{M(n)\,Z(\alpha)}$, the posterior is again Dirichlet:
  $P(\theta \mid n) = \mathcal{D}(\theta \mid n + \alpha)$
  and the posterior mean estimator is
  $\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}$  ($\alpha_i$: pseudocounts, $A = \sum_i \alpha_i$)


Mixtures of Dirichlets

- Prior: $P(\theta \mid \vec{\alpha}^1, \dots, \vec{\alpha}^m) = \sum_k q_k\,\mathcal{D}(\theta \mid \vec{\alpha}^k), \quad q_k = P(\vec{\alpha}^k)$
- Posterior: $P(\theta \mid n) = \sum_k P(\vec{\alpha}^k \mid n)\,P(\theta \mid n, \vec{\alpha}^k) = \sum_k P(\vec{\alpha}^k \mid n)\,\mathcal{D}(\theta \mid n + \vec{\alpha}^k)$
- $\theta_i^{PME} = \sum_k P(\vec{\alpha}^k \mid n)\,\frac{n_i + \alpha_i^k}{N + A^k}$


EM algorithm (1/3)

- A general algorithm for ML estimation with missing data
- The Baum-Welch algorithm for estimating hidden Markov model probabilities is a special case of the EM algorithm.
- Maximize: $\log P(x \mid \theta) = \log \sum_y P(x, y \mid \theta)$
  (x: observation, y: missing data)


EM algorithm (2/3)

Write $\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)$, multiply both sides by $P(y \mid x, \theta^t)$, and sum over y:
$\log P(x \mid \theta) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)$
Define $Q(\theta \mid \theta^t) = \sum_y P(y \mid x, \theta^t)\,\log P(x, y \mid \theta)$. Then
$\log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_y P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}$
and since the last term is a relative entropy (non-negative),
$\log P(x \mid \theta) - \log P(x \mid \theta^t) \ge Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)$


EM algorithm (3/3)

- Choosing $\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$ always makes the difference non-negative, and thus the likelihood of the new model is at least as large as that of the old one.
- EM algorithm:
  (E-step) Calculate the Q function $Q(\theta \mid \theta^t)$
  (M-step) Maximize $Q(\theta \mid \theta^t)$ with respect to θ
- The likelihood increases in each iteration
- Algorithms that merely increase Q in the M-step, instead of maximizing it, are called generalized EM (GEM).
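A toy EM run makes the E- and M-steps concrete: a mixture of two biased coins, where the missing data y is which coin produced each block of flips. All numbers here are illustrative, not from the slides:

```python
import numpy as np
from scipy.stats import binom

# Each observation is the number of heads in n = 10 flips of one of two coins
# (true biases 0.3 and 0.8); the coin identity is the hidden y.
rng = np.random.default_rng(2)
n = 10
heads = np.concatenate([rng.binomial(n, 0.3, 300), rng.binomial(n, 0.8, 200)])

w, p = np.array([0.5, 0.5]), np.array([0.4, 0.6])  # initial guesses
for _ in range(50):
    # E-step: responsibilities P(y = k | x, theta_t) for each observation.
    lik = w * binom.pmf(heads[:, None], n, p)       # shape (N, 2)
    r = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximizing Q gives weighted ML updates.
    w = r.mean(axis=0)
    p = (r * heads[:, None]).sum(axis=0) / (r.sum(axis=0) * n)
print(w, p)  # ~[0.6 0.4] and ~[0.3 0.8] (up to label swap)
```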