
Copyright (c) 2002 by SNU CSE Biointelligence Lab

Biological Sequence Analysis (Ch 11. Background on probability)

Biointelligence Laboratory, School of Computer Sci. & Eng.
Seoul National University, Seoul 151-742, Korea

This slide file is available online at http://bi.snu.ac.kr/


List of Contents

- Introduction
- Random variables
- Probability density, distributions
- Transformation
- Plots of statistical distributions
- Relative entropy and mutual information
- Many random variables
- One DNA sequence
- Maximum likelihood
- Sampling
- Metropolis sampling
- EM algorithm


Introduction

- Sequence similarity test:
  g g a g a c t g t a g a c a g c t a a t g c t a t a
  g a a c g c c c t a g c c a c g a g c c c t t a t c
  P(more than 10 matches) = ? (0.04; checked numerically below)

- Parameter, data, hypothesis, random variable
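The 0.04 can be checked directly. Assuming the two length-26 sequences are unrelated and each position matches independently with probability 1/4, the number of matches is Binomial(26, 1/4); a minimal sketch (SciPy assumed available):

```python
from scipy.stats import binom

# Matches between two random length-26 DNA sequences: X ~ Binomial(26, 1/4).
# "More than 10 matches" is P(X >= 11) = sf(10).
print(round(binom.sf(10, n=26, p=0.25), 3))  # ~0.04
```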


One DNA Sequence

- Shotgun sequencing: reconstruct a long DNA sequence from many overlapping short reads (about 500 bases each).


One DNA Sequence (2)

- 1. What is the mean proportion of the genome covered by contigs?

- 2. What is the mean number of contigs?
- 3. What is the mean contig size?


(Discrete) Random Variables

- A discrete numerical quantity corresponding to an observed outcome of an experiment

- (E.g.) Experiment: rolling two six-sided dice.
  Outcome: the two numbers on the dice.
  Discrete random variables: the sum of the two numbers, the difference of the two numbers, the smaller of the two numbers, etc.

- Random variables are usually represented by uppercase symbols (X, Y, ...) and the realized values of the random variables by lowercase symbols (x, y, ...).


Probability Distributions

- For a finite set $\mathcal{X}$: the probability distribution is simply an assignment of a probability $p_x$ to each outcome $x$ in $\mathcal{X}$. (E.g.) the outcome probabilities when rolling a loaded die: {1/12, 1/12, 1/12, 1/6, 1/4, 1/3}

- For a continuous set $\mathcal{X}$: the probability density $f(x)$ satisfies
  $P(x - \delta x/2 \le X \le x + \delta x/2) \approx f(x)\,\delta x$
  $P(x_0 \le X \le x_1) = \int_{x_0}^{x_1} f(x)\,dx$
  $f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$
  Note: $f(x)$ can be greater than 1.


Probability density, distribution (1/2)

- Relation of distribution and density:
  $F(t) = P(X \le t) = \sum_{x \le t} P(X = x) = \int_{-\infty}^{t} f(x)\,dx$
- Expectation (mean):
  $E(X) = \sum_x x\,P(X = x) = \int x f(x)\,dx$
- For continuous $X$: $P(X = x) = 0$, and $f(x)$ can be greater than 1.


Probability density, distribution (2/2)

- Relation of distribution and density:
  $P(X = t) = \sum_{x \le t} P(X = x) - \sum_{x < t} P(X = x), \qquad f(t) = \frac{d}{dt} F(t)$
- Conditions for the probability density:
  1. $f(x) \ge 0$
  2. $\int f(x)\,dx = 1$
- Variance:
  $E(X - \mu)^2 = E(X^2) - \mu^2$


Transformation

- One random variable: $X_2 = g(X_1)$ for monotone $g$,
  $F_2(t) = P(X_2 < t) = P(g(X_1) < t) = P(X_1 < g^{-1}(t)) = F_1(g^{-1}(t))$
  $f_2(x) = f_1(g^{-1}(x))\,(g^{-1})'(x)$
- Many random variables: for 1-1 relations between $\vec{X} = (X_1, \dots, X_n)$ and $\vec{U} = (U_1(X_1, \dots, X_n), \dots, U_n(X_1, \dots, X_n))$,
  $f_U(u_1, \dots, u_n) = f_X(x_1, \dots, x_n)\,J^{-1} = f_X(x_1, \dots, x_n)\,J^{*}$
  where $J$ is the Jacobian of the transformation and $J^{*}$ that of its inverse.


Statistical Independence

- Two events are independent if the outcome of one event does not affect the outcome of the other event.

- Discrete random variables are independent if the value of one does not affect the probabilities associated with the values of another random variable.

- (E.g.) Experiment: rolling a fair die.
  1. A: the number is even; B: the number is greater than or equal to 3.
     P(A | B) = P(A), P(B | A) = P(B): A and B are independent.
  2. C: the number is greater than 3.
     P(C) = 1/2, P(C | A) = 2/3, P(A | C) = 2/3, P(A) = 1/2: A and C are not independent.
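These conditional probabilities can be verified by enumerating the six outcomes of a fair die; a minimal sketch:

```python
from fractions import Fraction

outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}   # even
B = {x for x in outcomes if x >= 3}
C = {x for x in outcomes if x > 3}

def p(event, given=outcomes):
    # P(event | given) under a fair die.
    return Fraction(len(event & given), len(given))

print(p(A), p(A, B), p(B), p(B, A))     # 1/2 1/2 2/3 2/3 -> A, B independent
print(p(C), p(C, A), p(A, C))           # 1/2 2/3 2/3     -> A, C not independent
```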


Uniform Distribution

- Discrete uniform: each outcome is equally likely.

- For N outcomes: $P(x) = 1/N$
- Continuous uniform: $f(x) = 1/(b - a)$ or $1/\mathrm{area}(A)$


Bernoulli Distribution

- A Bernoulli trial is a single trial with two possible outcomes (success / failure).

- Bernoulli random variable Y: the number of successes in this trial
  $P(y \text{ successes in a trial}) = p^y (1 - p)^{1 - y}, \quad y = 0, 1$


Binomial Distribution

- Defined on a finite set of all the possible results of N trials with a binary outcome ('0' or '1').

- The random variable is the number of successes in the fixed N trials.

$P(k \text{ successes out of } N) = \binom{N}{k} p^k (1 - p)^{N - k}$
$m = \sum_k k\,P(k) = \sum_k k \binom{N}{k} p^k (1 - p)^{N - k} = Np$
$\sigma^2 = \sum_k (k - m)^2 P(k) = Np(1 - p)$
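The mean and variance formulas can be sanity-checked by computing the sums directly; a small sketch (N = 30 and p = 0.8 are arbitrary choices):

```python
from math import comb

N, p = 30, 0.8
pmf = [comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))
print(mean, N * p)           # both 24.0
print(var, N * p * (1 - p))  # both 4.8
```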


Multinomial Distribution

- Extend the binomial to K independent outcomes with probabilities $\theta_i\ (i = 1, \dots, K)$:
  $P(n_1, \dots, n_K \mid \theta) = \frac{n!}{n_1! \cdots n_K!} \prod_{i=1}^{K} \theta_i^{n_i}$
- (E.g.) Rolling a fair die N times:
  P(rolling 12 times and getting each number twice) = $\frac{12!}{(2!)^6} \left(\frac{1}{6}\right)^{12} = 3.4 \times 10^{-3}$
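The 3.4 × 10^-3 can be reproduced with SciPy's multinomial pmf (a sketch):

```python
from scipy.stats import multinomial

# 12 rolls of a fair die, each face appearing exactly twice.
p = multinomial.pmf([2] * 6, n=12, p=[1/6] * 6)
print(f"{p:.2e}")  # ~3.44e-03
```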


Geometric Distribution

- The random variable is the number of trials before the first failure (the length of a success run):
  $P(y) = p^y (1 - p), \quad y = 0, 1, 2, \dots$
  $F(y) = P(Y \le y) = 1 - p^{y + 1}$
- Used to test the significance of long success runs in sequences
- Geometric-like random variables in BLAST theory:
  $1 - F(y - 1) = P(Y \ge y) \sim C p^y \quad (0 < C < 1)$


Negative Binomial Distribution

- The random variable is the number of trials needed to reach a fixed number m of successes
- P(N = n) = P(the first (n-1) trials result in exactly (m-1) successes and (n-m) failures, and trial n results in success):
  $P(N = n) = \binom{n - 1}{m - 1} p^m (1 - p)^{n - m}, \quad n = m, m+1, m+2, \dots$


Generalized Geometric Distribution

- The random variable is the number of trials before the (k+1)-th failure:
  $P(y) = \binom{y}{k} p^{y - k} (1 - p)^{k + 1}, \quad y = k, k+1, k+2, \dots$


Poisson Distribution

- A limiting form of the binomial distribution (as n becomes large and p becomes small) with moderate λ = np:

  $P(y) = \frac{\lambda^y\,e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \dots$
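The limit is easy to see numerically: for large n and small p, the binomial pmf approaches the Poisson pmf with λ = np (a sketch; n = 1000 and p = 0.005 are arbitrary):

```python
from scipy.stats import binom, poisson

n, p = 1000, 0.005              # large n, small p, lambda = np = 5
for k in (0, 3, 5, 10):
    print(k, binom.pmf(k, n, p), poisson.pmf(k, n * p))
# the two columns agree to about three decimal places
```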


Exponential Distribution

- Exponential distribution is a continuous analogue of the geometric distribution:

  $f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$
  $F(x) = 1 - e^{-\lambda x}, \quad x \ge 0$
  $E(X) = 1/\lambda, \qquad \mathrm{Var}(X) = 1/\lambda^2$
  $\mathrm{Med}(X) = E(X)\,\ln 2 \approx 0.69\,E(X) < E(X)$ (right-skewed)


Relation with Geometric Distribution

- Suppose X has an exponential distribution and Y is the integer part of X:
  $P(Y = y) = P(y \le X < y + 1) = e^{-\lambda y}(1 - e^{-\lambda}) = p^y (1 - p), \quad p = e^{-\lambda}$
  so Y is geometrically distributed.
- The density function of the fractional part D = X - Y:
  $f(d) = \frac{\lambda e^{-\lambda d}}{1 - e^{-\lambda}}$
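This relation can be checked by simulation: the integer parts of exponential draws occur with geometric frequencies, with p = e^-λ (a sketch; λ = 0.7 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.7
y = np.floor(rng.exponential(scale=1 / lam, size=200_000))

p = np.exp(-lam)
for k in range(4):
    print(k, (y == k).mean(), p**k * (1 - p))  # empirical vs p^k (1 - p)
```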


Gaussian Distribution

- As N gets larger, the mean and variance of the binomial distribution increase linearly; rescale k by
  $u = \frac{k - m}{\sigma} = \frac{k - Np}{\sqrt{Np(1 - p)}}$
- Then the binomial becomes a Gaussian with density
  $f(u) = \frac{1}{\sqrt{2\pi}} \exp(-u^2 / 2)$


Dirichlet Distribution (1/3)

- The Dirichlet distribution is a conjugate distribution for the multinomial distribution. (A Dirichlet prior with respect to a multinomial likelihood gives a Dirichlet posterior again.)

- The $\theta_i$'s correspond to the parameters of the multinomial distribution:
  $\mathcal{D}(\theta_1, \dots, \theta_K \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\,\delta\left(\sum_{i=1}^{K} \theta_i - 1\right) \quad (0 \le \theta_i \le 1)$
  $Z = \int \prod_i \theta_i^{\alpha_i - 1}\,\delta\left(\sum_i \theta_i - 1\right) d\theta = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\left(\sum_i \alpha_i\right)}$
  (using $\Gamma(x + 1) = x\,\Gamma(x)$)


Dirichlet Distribution (2/3)

- The mean of the Dirichlet is equal to the normalized parameters:
  $E[\theta_i] = \frac{\alpha_i}{\sum_k \alpha_k}$
- (E.g.) The three distributions all have the same mean (1/8, 2/8, 5/8) though with different values of α: (1, 2, 5); (10, 20, 50); (.1, .2, .5)
- For K = 2, the Dirichlet reduces to the Beta distribution


Dirichlet Distribution (3/3)

4 (e.g.) The dice factory: sampling fromDirichlet parameterized by 4 Factory A: all six parameters set to 10Factory B: all six parameters set to 24 On average both factories produce fair dice (average = 1/6)But if we find a loaded dice with (.1 .1 .1 .1 .1 .6), it isMore likely from factory B:4 The variance of Dirichlet is inversely proportional to the sum of parameters.

61 ,, θθ L

61 ,, αα L

6.199)5(.)1(.)2()12()|(

119.0)5(.)1(.)10()60()|(

12)12(56

110)110(56

=ΓΓ

=

=ΓΓ

=

−−

−−

B

A

D

D

αθ

αθ
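Both density values can be reproduced with SciPy (a sketch):

```python
from scipy.stats import dirichlet

theta = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]   # the loaded die
print(dirichlet.pdf(theta, [10] * 6))    # ~0.119  (factory A)
print(dirichlet.pdf(theta, [2] * 6))     # ~199.6  (factory B)
```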


Gamma Distribution (1/2)

- $g(x; \alpha, \beta) = \frac{\beta^{\alpha}\,x^{\alpha - 1}\,e^{-\beta x}}{\Gamma(\alpha)}, \quad 0 < x < \infty$
  $\text{mean} = \alpha / \beta, \qquad \text{var} = \alpha / \beta^2$


Gamma Distribution (2/2)

- The gamma distribution is conjugate to the Poisson: the probability of seeing n events over some interval, when there is a probability p of an individual event occurring in that interval, is
  $f(n) = \frac{e^{-p}\,p^n}{n!}$
- The gamma distribution is used to model the probabilities of rates.
- The gamma distribution is used to model the rate of evolution at different sites in DNA sequences.


Extreme Value Distribution (1/2)

- For N samples from g(x), the probability that the largest of them is less than x is $G(x)^N$, where $G(x) = \int_{-\infty}^{x} g(u)\,du$
- The extreme value density (EVD) for g(x) is the limit of
  $h(x) = N g(x)\,G(x)^{N - 1}$
- The EVD is used to model the breaking point of a chain, and to assess the significance of the maximum score from a set of alignments.


Extreme Value Distribution (2/2)

- EVD for the exponential density $g(x) = \alpha e^{-\alpha x}$:
  $h(x) = N \alpha e^{-\alpha x}\left(1 - e^{-\alpha x}\right)^{N - 1} = \alpha e^{-z}\left(1 - e^{-z}/N\right)^{N - 1} \to \alpha\,e^{-z}\exp(-e^{-z}), \quad z = \alpha(x - y)$
  (Gumbel distribution)
- (E.g.) For N = 1, 2, 10, 100: N >= 10 gives a good approximation to the EVD
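The convergence to the Gumbel form can be checked by simulating maxima of N exponentials and shifting by ln N (a sketch with α = 1, N = 10):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
m = rng.exponential(size=(100_000, N)).max(axis=1) - np.log(N)  # z = x - ln N

# Compare the empirical CDF of the shifted maxima with the Gumbel CDF exp(-e^-z).
for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, (m <= z).mean(), np.exp(-np.exp(-z)))
```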


Statistical Distribution (1/7)

- Binomial distribution [plots: p = 0.5, n = 30 and p = 0.8, n = 30]

Statistical Distribution (2/7)

- Poisson distribution [plot]

Statistical Distribution (3/7)

- Negative binomial distribution [plot]

Statistical Distribution (4/7)

- Geometric distribution [plot]

Statistical Distribution (5/7)

- Exponential distribution [plot]

Statistical Distribution (6/7)

- Normal distribution [plot]

Statistical Distribution (7/7)

- Lognormal distribution [plot]

Entropy (1/2)

- A measure of the average uncertainty of an outcome
- Shannon entropy:
  $H(X) = -\sum_i P(x_i) \log P(x_i)$
- Entropy is maximized when all the $P(x_i)$ are equal, and the maximum is then $\log K$. If we are certain of the outcome, the entropy is zero.
- Information:
  $I(X) = H_{\text{before}} - H_{\text{after}}$


Entropy (2/2)

- (E.g.) The entropy of equiprobable DNA symbols (A, C, G, T) is 2 bits.
- Information content of a conserved position: a particular position is always an A or a G, with p(A) = 0.7 and p(G) = 0.3. Thus
  $H_{\text{before}} - H_{\text{after}} = 2 - 0.88 = 1.12 \text{ bits}$
  (The more conserved the position, the higher the information content.)
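The 0.88 is the base-2 entropy of the (0.7, 0.3) distribution; a minimal check:

```python
from math import log2

def H(ps):
    # Shannon entropy in bits.
    return -sum(p * log2(p) for p in ps if p > 0)

print(H([0.25] * 4))                   # 2.0 bits (before)
print(H([0.7, 0.3]))                   # ~0.881 bits (after)
print(H([0.25] * 4) - H([0.7, 0.3]))   # ~1.12 bits of information
```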


Relative entropy and mutual information (1/4)

- Relative entropy (KL divergence) of P with respect to Q:
  $H(P \,\|\, Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}$
- Information content and relative entropy are the same if Q represents the initial state
- Not a metric:
  $H(P \,\|\, Q) \ne H(Q \,\|\, P)$


Relative entropy and mutual information (2/4)

- Positivity of relative entropy (using $\log x \le x - 1$):
  $-H(P \,\|\, Q) = \sum_i P(x_i) \log \frac{Q(x_i)}{P(x_i)} \le \sum_i P(x_i) \left( \frac{Q(x_i)}{P(x_i)} - 1 \right) = 0$
  hence $H(P \,\|\, Q) \ge 0$.


Relative entropy and mutual information (3/4)

- Independence can be measured by the relative entropy between P(X, Y) and P(X)P(Y):
  $M(X; Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)}$
- M(X; Y) can be interpreted as the amount of information that we acquire about outcome X when we are told outcome Y.


Relative entropy and mutual information (4/4)

- (E.g.) Acceptor sites: 757 acceptor sites from a database of human genes; 30 bases upstream and 20 bases downstream are extracted from each acceptor.
- Relative entropy at position i:
  $\sum_a p_i(a) \log\left[ p_i(a) / q_a \right]$
  where q is the overall distribution of the four nucleotides in the sequences.
- Mutual information between adjacent positions:
  $\sum_{a,b} p_{i,i+1}(a, b) \log\left[ p_{i,i+1}(a, b) / \left( p_i(a)\,p_{i+1}(b) \right) \right]$


Many random variables (1/2)

- Marginal probability:
  $P(Y_1 = y_1, \dots, Y_i = y_i) = \sum_{y_{i+1}, \dots, y_k} P(Y_1 = y_1, \dots, Y_k = y_k)$
  $f(x_1, \dots, x_i) = \int \cdots \int f(x_1, \dots, x_k)\,dx_{i+1} \cdots dx_k$
- Conditional probability:
  $P(Y_{i+1} = y_{i+1}, \dots, Y_k = y_k \mid Y_1 = y_1, \dots, Y_i = y_i) = \frac{P(Y_1 = y_1, \dots, Y_k = y_k)}{P(Y_1 = y_1, \dots, Y_i = y_i)}$
  $f(x_{i+1}, \dots, x_k \mid x_1, \dots, x_i) = \frac{f(x_1, \dots, x_k)}{f(x_1, \dots, x_i)}$


Many random variables (2/2)

- Covariance:
  $\sigma_{Y_1 Y_2} = \sum_{y_1, y_2} (y_1 - \mu_1)(y_2 - \mu_2)\,P(Y_1 = y_1, Y_2 = y_2) = \iint (x_1 - \mu_1)(x_2 - \mu_2)\,f(x_1, x_2)\,dx_1\,dx_2$
- Correlation:
  $\rho_{Y_1 Y_2} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$


One DNA Sequence (1/10)

- Shotgun sequencing: reconstruct a long DNA sequence from many overlapping short reads.


One DNA Sequence (2/10)

- 1. What is the mean proportion of the genome covered by contigs?

- 2. What is the mean number of contigs?
- 3. What is the mean contig size?


One DNA Sequence (3/10)

- There are N fragments, each of length L; the whole DNA is of length G.

- The position of the left-hand end of any fragment is uniformly distributed in (0, G).

- The number of fragments falling in an interval of length h has mean Np = Nh/G, from the binomial distribution.

- For large N and a small interval h, this distribution becomes Poisson with mean a = Nh/G.


One DNA Sequence (4/10)

- Mean proportion of the genome covered by one or more fragments
  = P(a point chosen at random is covered by at least one fragment)
  = P(the left end of at least one fragment is in the interval of length L ending at that point) (by Poisson)
  = $1 - e^{-NL/G} = 1 - e^{-a}, \quad a = NL/G$


One DNA Sequence (5/10)

- The coverage needed for the fragments to cover the whole genome:
  P = .99 → a = NL/G = 4.6 (a: coverage)
  P = .999 → a = 6.9 (about 7 times the genome length)

- The human genome is 3×10^9 nucleotides: with P = .999, 3,000,000 of them are still missing.

- What is the mean number of contigs? Each contig has a unique rightmost fragment.
  Number of trials: N fragments; success probability:
  p = P(a fragment is the rightmost member of a contig)
  = P(no other fragment has its left-end point on the fragment in question)
  = $e^{-a}$
- Mean number of contigs = $Np = N e^{-a} = N e^{-NL/G}$


One DNA Sequence (6/10)

- For G = 100,000 and L = 500:

  a                        0.5   0.75  1     1.5   2     3     4     5    6    7
  mean number of contigs   60.7  70.8  73.6  66.9  54.1  29.9  14.7  6.7  3.0  1.3

- If there is a small number of fragments, there must be a small number of contigs.
- A large number of fragments tends to form a small number of large contigs.
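The table above is just $N e^{-a}$ with $N = aG/L$; a sketch reproducing it:

```python
from math import exp

G, L = 100_000, 500
for a in (0.5, 0.75, 1, 1.5, 2, 3, 4, 5, 6, 7):
    N = a * G / L                    # number of fragments at coverage a
    print(a, round(N * exp(-a), 1))  # mean number of contigs
```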


One DNA Sequence (7/10)

- Mean contig size?
- Expectation of a sum of random variables where N is random
- Moment generating function of a random variable
- Probability generating function of a random variable
- Geometric distribution
- Exponential distribution


One DNA Sequence (8/10)

- Moment generating function (mgf) of a random variable Y:
  $m(\theta) = E(e^{\theta Y}) = \sum_y e^{\theta y} P(y) = \int e^{\theta y} f(y)\,dy$
- Finding the mean and variance from the mgf:
  $\mu = \left.\frac{dm(\theta)}{d\theta}\right|_{\theta = 0}, \qquad \sigma^2 = \left.\frac{d^2 m(\theta)}{d\theta^2}\right|_{\theta = 0} - \mu^2$
  or
  $\mu = \left.\frac{d}{d\theta}\log m(\theta)\right|_{\theta = 0}, \qquad \sigma^2 = \left.\frac{d^2}{d\theta^2}\log m(\theta)\right|_{\theta = 0}$


One DNA Sequence (9/10)

- Probability generating function (pgf) of a (discrete) random variable Y:
  $p(t) = E(t^Y) = \sum_y P(y)\,t^y$
  $\mu = \left.\frac{d}{dt}p(t)\right|_{t = 1}, \qquad \sigma^2 = \left.\frac{d^2}{dt^2}p(t)\right|_{t = 1} + \mu - \mu^2$


One DNA Sequence (9/10)

- Expectation of a sum of random variables when N is random: $S = X_1 + \cdots + X_N$
  pgf of N: $p(t) = \sum_n P_n t^n$
  pgf of S given N = n: $(q(t))^n$, so $P(S = y \mid n) = \text{coeff. of } t^y \text{ in } (q(t))^n$
  $P(S = y) = \sum_n P_n\,P(S = y \mid n) = \text{coeff. of } t^y \text{ in } \sum_n P_n (q(t))^n$
  Thus the pgf of S is $p(q(t))$, and by the chain rule $E(S) = E(N)\,E(X)$


One DNA Sequence (10/10)

- Contig length = (sum of the n left-end spacings of the fragments overlapping in the contig) + L
- The distance between two successive left-hand end points ~ Exp(λ), λ = N/G (known from the Poisson process)
- The number of overlapping fragments ~ Geo(p), with
  $p = \int_0^L \lambda e^{-\lambda x}\,dx = 1 - e^{-a}$
- Mean of the random distance (between successive left ends) ~ conditional exponential on (0 < X < L):
  $f(x \mid 0 < x < L) = \frac{f(x)}{\int_0^L f(x)\,dx} = \frac{\lambda e^{-\lambda x}}{1 - e^{-\lambda L}}$
  $E(X \mid 0 < X < L) = \frac{1}{\lambda} - \frac{L}{e^{\lambda L} - 1}$


One DNA Sequence (11/10)

- Mean contig length:
  $E(N)\,E(X \mid 0 < X < L) + L = (e^{a} - 1)\left(\frac{1}{\lambda} - \frac{L}{e^{a} - 1}\right) + L$
- (E.g.) For L = 500, a = 2, G = 100,000: the mean contig size ≈ 1600
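Plugging the numbers into the formula above (a sketch):

```python
from math import exp

L, a, G = 500, 2, 100_000
lam = a / L                              # lambda = N/G = a/L
mean_gap = 1 / lam - L / (exp(a) - 1)    # E(X | 0 < X < L)
mean_contig = (exp(a) - 1) * mean_gap + L
print(round(mean_contig))                # ~1597, i.e. about 1600
```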


Maximum likelihood

- $\theta^{ML} = \arg\max_\theta P(D \mid \theta, M)$
- Consistent: the parameter value used to generate the data set will also be the value that maximizes the likelihood, in the limit.
- (E.g.) For K observable outcomes of the model, the frequency of occurrence of $\omega_i$ will tend to $P(\omega_i \mid \theta^0, M)$, and the log-likelihood for parameter θ,
  $(1/n)\sum_i n_i \log P(\omega_i \mid \theta, M)$, tends to $\sum_i P(\omega_i \mid \theta^0, M)\,\log P(\omega_i \mid \theta, M)$.
  The positivity of the relative entropy implies that for all θ
  $\sum_i P(\omega_i \mid \theta^0, M) \log P(\omega_i \mid \theta^0, M) \ge \sum_i P(\omega_i \mid \theta^0, M) \log P(\omega_i \mid \theta, M)$
  Thus the likelihood is maximized by $\theta^0$.
- Drawback: can give poor estimates for scanty data


Posterior probability distribution

- Bayes theorem:
  $P(\theta \mid D, M) = \frac{P(D \mid \theta, M)\,P(\theta \mid M)}{P(D \mid M)}$
- Use of posteriors: MAP (maximum a posteriori probability) estimate
  $\theta^{MAP} = \arg\max_\theta P(D \mid \theta, M)\,P(\theta \mid M)$
- PME (posterior mean estimator): parameters weighted by the posterior
  $\theta^{PME} = \int \theta\,P(\theta \mid n)\,d\theta$


Change of variables

- Given a density f(x) of x and a change of variable x = φ(y), the density g(y) of y becomes
  $g(y) = f(\varphi(y))\,|\varphi'(y)|$
- The maximum of g(y), MAP, or PME may shift from that of f(x).


Sampling

- Given a finite set with probabilities P(x), to sample from this set means to pick elements x randomly with probability P(x).

- Sampling with a random number generator on [0, 1]:
  $P(\text{selecting } x_i) = P\left( p(x_1) + \cdots + p(x_{i-1}) < \mathrm{rand}[0,1] < p(x_1) + \cdots + p(x_{i-1}) + p(x_i) \right) = P(x_i)$
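A minimal sketch of this cumulative-sum sampler:

```python
import random
from itertools import accumulate

def sample(xs, ps):
    # Pick x_i with probability p_i using one uniform draw on [0, 1].
    u = random.random()
    for x, cum in zip(xs, accumulate(ps)):
        if u < cum:
            return x
    return xs[-1]  # guard against floating-point round-off

counts = {}
for _ in range(100_000):
    x = sample(["A", "C", "G", "T"], [0.1, 0.2, 0.3, 0.4])
    counts[x] = counts.get(x, 0) + 1
print(counts)  # frequencies ~ 0.1, 0.2, 0.3, 0.4
```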


Sampling by transformation from a uniform distribution

- For a uniform density f(x) and a map x = φ(y):
  $g(y) = f(\varphi(y))\,\varphi'(y) = \varphi'(y)$, so $\varphi(y) = \int_{-\infty}^{y} g(u)\,du$ and $y = \varphi^{-1}(x)$
- (E.g.) Sampling from a Gaussian: g(y) is given, x is uniform; find y.
  Define $\varphi(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-u^2/2}\,du$, and let $y = \varphi^{-1}(x)$.


Sampling with the Metropolis algorithm (1/3)

- For sampling when analytic methods are not available
- Generate a sequence {y_i} from a transition process $\tau(y \mid y_{i-1})$ that approximates P as closely as we like, under the condition (detailed balance):
  $P(x)\,\tau(y \mid x) = P(y)\,\tau(x \mid y)$
- Detailed balance implies:
  $\lim_{N \to \infty} \frac{1}{N} \#\{y_i = x\} = P(x)$
- The sequence under this transition process will sample P correctly


Sampling with the Metropolis algorithm (2/3)

Metropolis algorithm:
- Symmetric proposal mechanism: given a point x, select a point y with probability F(y | x), with F symmetric
- Acceptance mechanism: accept the proposed y with probability min(1, P(y)/P(x))
  (A point y with larger posterior probability than the current x is always accepted; one with lower probability is accepted randomly with probability P(y)/P(x))


Sampling with the Metropolis algorithm (3/3)

Metropolis algorithm (balance):
$P(x)\,\tau(y \mid x) = P(x)\,F(y \mid x)\,\min(1, P(y)/P(x)) = F(y \mid x)\,\min(P(x), P(y))$
$= F(x \mid y)\,\min(P(y), P(x)) = P(y)\,\tau(x \mid y)$


Gibbs sampling

- Sampling from the conditional distributions for each i:
  $P(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N)$
- The proposal distribution is the conditional distribution: always accept the sample.


Estimation of probabilities from counts (1/2)

- The maximum likelihood estimate from counts is $\theta_i^{ML} = n_i / N$. For any $\theta \ne \theta^{ML}$ and an observation n, $P(n \mid \theta^{ML}) > P(n \mid \theta)$:
  $\log \frac{P(n \mid \theta^{ML})}{P(n \mid \theta)} = \log \prod_i \left( \frac{\theta_i^{ML}}{\theta_i} \right)^{n_i} = N \sum_i \theta_i^{ML} \log \frac{\theta_i^{ML}}{\theta_i} > 0$
  by the positivity of the relative entropy.


Estimation of probabilities from counts (2/2)

- For scarce data: use a prior
  $P(\theta \mid n) = \frac{P(n \mid \theta)\,P(\theta)}{P(n)}$
  with the multinomial likelihood $P(n \mid \theta) = \frac{1}{M(n)} \prod_i \theta_i^{n_i}$ and the Dirichlet prior $P(\theta) = \mathcal{D}(\theta \mid \alpha) = \frac{1}{Z(\alpha)} \prod_i \theta_i^{\alpha_i - 1}$.
  Since $P(n) = \frac{Z(n + \alpha)}{M(n)\,Z(\alpha)}$, the posterior is again Dirichlet:
  $P(\theta \mid n) = \mathcal{D}(\theta \mid n + \alpha)$
  and the posterior mean estimator is
  $\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}$  ($\alpha_i$: pseudocounts, $A = \sum_i \alpha_i$)


Mixtures of Dirichlets

- Prior: $P(\theta \mid \vec{\alpha}^1, \dots, \vec{\alpha}^m) = \sum_k q_k\,\mathcal{D}(\theta \mid \vec{\alpha}^k), \quad q_k = P(\vec{\alpha}^k)$
- Posterior: $P(\theta \mid n) = \sum_k P(\vec{\alpha}^k \mid n)\,P(\theta \mid n, \vec{\alpha}^k) = \sum_k P(\vec{\alpha}^k \mid n)\,\mathcal{D}(\theta \mid n + \vec{\alpha}^k)$
- $\theta_i^{PME} = \sum_k P(\vec{\alpha}^k \mid n)\,\frac{n_i + \alpha_i^k}{N + A^k}$


EM algorithm (1/3)

- A general algorithm for ML estimation with missing data
- The Baum-Welch algorithm for estimating hidden Markov model probabilities is a special case of the EM algorithm.
- Maximize: $\log P(x \mid \theta) = \log \sum_y P(x, y \mid \theta)$
  (x: observation, y: missing data)


EM algorithm (2/3)

Write $\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)$, multiply both sides by $P(y \mid x, \theta^t)$, and sum over y:
$\log P(x \mid \theta) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)$
Define $Q(\theta \mid \theta^t) = \sum_y P(y \mid x, \theta^t)\,\log P(x, y \mid \theta)$. Then
$\log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_y P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)}$
and since the last term is a relative entropy (non-negative),
$\log P(x \mid \theta) - \log P(x \mid \theta^t) \ge Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)$


EM algorithm (3/3)

- Choosing $\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$ always makes the difference non-negative, and thus the likelihood of the new model is at least as large as that of the old one.
- EM algorithm:
  (E-step) Calculate the Q function $Q(\theta \mid \theta^t)$
  (M-step) Maximize $Q(\theta \mid \theta^t)$ with respect to θ
- The likelihood increases in each iteration
- Algorithms that merely increase Q in the M-step, instead of maximizing it, are called generalized EM (GEM).
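A toy EM run makes the E- and M-steps concrete: a mixture of two biased coins, where the missing data y is which coin produced each block of flips. All numbers here are illustrative, not from the slides:

```python
import numpy as np
from scipy.stats import binom

# Each observation is the number of heads in n = 10 flips of one of two coins
# (true biases 0.3 and 0.8); the coin identity is the hidden y.
rng = np.random.default_rng(2)
n = 10
heads = np.concatenate([rng.binomial(n, 0.3, 300), rng.binomial(n, 0.8, 200)])

w, p = np.array([0.5, 0.5]), np.array([0.4, 0.6])  # initial guesses
for _ in range(50):
    # E-step: responsibilities P(y = k | x, theta_t) for each observation.
    lik = w * binom.pmf(heads[:, None], n, p)       # shape (N, 2)
    r = lik / lik.sum(axis=1, keepdims=True)
    # M-step: maximizing Q gives weighted ML updates.
    w = r.mean(axis=0)
    p = (r * heads[:, None]).sum(axis=0) / (r.sum(axis=0) * n)
print(w, p)  # ~[0.6 0.4] and ~[0.3 0.8] (up to label swap)
```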