Model Inference and Averaging
Prof. Liqing Zhang
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking
• Stochastic Search: Bumping
Bootstrap by Basis Expansions
• Consider a linear expansion
• The least squares solution
• The covariance of $\hat\beta$
$$\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$$
$$\hat\beta = (H^{T}H)^{-1}H^{T}y$$
$$\widehat{\mathrm{Cov}}(\hat\beta) = (H^{T}H)^{-1}\hat\sigma^{2}, \qquad \hat\sigma^{2} = \sum_{i}\bigl(y_i - \hat\mu(x_i)\bigr)^{2}/N$$
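As a hedged illustration of how these quantities can be computed and then bootstrapped, here is a minimal NumPy sketch; the polynomial basis h_j(x) = x^j, the sine-plus-noise data, and all variable names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y = sin(x) + noise (an assumption, not from the slides)
N = 50
x = np.sort(rng.uniform(0.0, 3.0, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

def basis(x, degree=4):
    """Polynomial basis h_j(x) = x^j, standing in for a generic expansion."""
    return np.vander(x, degree + 1, increasing=True)   # rows are h(x_i)^T

H = basis(x)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)         # (H^T H)^{-1} H^T y
sigma2_hat = np.sum((y - H @ beta_hat) ** 2) / N          # noise variance estimate
cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat            # Cov(beta_hat)

# Nonparametric bootstrap: resample (x_i, y_i) pairs and refit B times
B = 200
boot_betas = np.empty((B, beta_hat.size))
for b in range(B):
    idx = rng.integers(0, N, size=N)
    boot_betas[b], *_ = np.linalg.lstsq(basis(x[idx]), y[idx], rcond=None)

print("least-squares covariance:\n", cov_beta)
print("bootstrap covariance:\n", np.cov(boot_betas, rowvar=False))
```

The two printed covariance matrices should be broadly similar, which previews the bootstrap/Bayes connection discussed later in the deck.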
Parametric Model
• Assume a parameterized probability density (parametric model) for observations
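A compact way to write this assumption (the density symbol $g_\theta$ and observation symbols $z_i$ are notation I am supplying, not shown on the slide):

$$z_i \sim g_\theta(z), \qquad i = 1,\dots,N,$$

where θ collects the unknown parameters, e.g. θ = (μ, σ²) for a Gaussian model.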
Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity (xT).
– We make repeated measurements of this quantity {x1, x2, … xn}.
– The standard way to estimate xT from our measurements is to calculate the mean value:
and set $\hat{x}_T = \bar{x}$.
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
DOES THIS PROCEDURE MAKE SENSE?
The maximum likelihood method (MLM) answers this question and provides a
general method for estimating parameters of interest from data.
The Maximum Likelihood Method
• Statement of the Maximum Likelihood Method
  – Assume we have made N measurements of x: {x1, x2, …, xn}.
  – Assume we know the probability distribution function that describes x: f(x, α).
  – Assume we want to determine the parameter α.
• MLM: pick α to maximize the probability of getting the measurements (the xi's) we did!
The MLM Implementation
• The probability of measuring x1 is f(x1, α)dx
• The probability of measuring x2 is f(x2, α)dx
• The probability of measuring xn is f(xn, α)dx
• If the measurements are independent, the probability of getting the measurements we did is:
$$L = f(x_1,\alpha)dx\cdot f(x_2,\alpha)dx \cdots f(x_n,\alpha)dx = \bigl[f(x_1,\alpha)f(x_2,\alpha)\cdots f(x_n,\alpha)\bigr](dx)^{n}$$
• We can drop the $(dx)^{n}$ term as it is only a proportionality constant:
$$L = \prod_{i=1}^{N} f(x_i, \alpha) \quad\text{is called the Likelihood Function}$$
Log Maximum Likelihood Method
• We want to pick the α that maximizes L:
$$\frac{\partial L}{\partial\alpha}\bigg|_{\alpha=\alpha^{*}} = 0$$
• Often easier to maximize ln L.
• L and ln L are both maximum at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:
$$\ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)$$
Log Maximum Likelihood Method
• The new maximization condition is:
$$\frac{\partial \ln L}{\partial\alpha}\bigg|_{\alpha=\alpha^{*}} = \sum_{i=1}^{N}\frac{\partial \ln f(x_i,\alpha)}{\partial\alpha}\bigg|_{\alpha=\alpha^{*}} = 0$$
• α could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations that determine α range from simple linear equations to coupled non-linear equations.
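As a hedged numerical illustration of this condition, the sketch below (NumPy/SciPy; the simulated data and parameterization are my own choices) minimizes the negative log-likelihood of a Gaussian model and checks the optimum against the closed-form answer derived on the following slides.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # illustrative measurements

def neg_log_likelihood(params):
    """-ln L for a Gaussian, parameterized as alpha = (mu, log sigma)."""
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = optimize.minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(mu_hat, x.mean())    # numeric optimum vs. the sample mean
print(sigma_hat, x.std())  # vs. the (1/n) sample standard deviation
```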
• Let f(x, α) be given by a Gaussian distribution function.
• Let α = μ be the mean of the Gaussian. We want to use our data + MLM to find the mean.
• We want the best estimate of α = μ from our set of n measurements {x1, x2, …, xn}.
• Let's assume that σ is the same for each measurement.
An Example: Gaussian
An Example: Gaussian
• Gaussian PDF:
$$f(x_i, \mu) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x_i-\mu)^{2}}{2\sigma^{2}}}$$
• The likelihood function for this problem is:
$$L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x_i-\mu)^{2}}{2\sigma^{2}}} = \left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{\!n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^{2}}{2\sigma^{2}}}$$
An Example: Gaussian
• We want to find the α = μ that maximizes the log likelihood function:
$$\ln L = \sum_{i=1}^{n}\ln\!\left[\frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x_i-\mu)^{2}}{2\sigma^{2}}}\right] = n\ln\frac{1}{\sqrt{2\pi\sigma^{2}}} - \sum_{i=1}^{n}\frac{(x_i-\mu)^{2}}{2\sigma^{2}}$$
$$\frac{\partial\ln L}{\partial\mu} = -\frac{1}{2\sigma^{2}}\sum_{i=1}^{n} 2(x_i-\mu)(-1) = 0 \;\;\Rightarrow\;\; \sum_{i=1}^{n}(x_i-\mu) = 0 \;\;\Rightarrow\;\; \hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
An Example: Gaussian (Weighted Average)
• If the σ_i are different for each data point, then $\hat\mu$ is just the weighted average:
$$\hat\mu = \frac{\sum_{i=1}^{n} x_i/\sigma_i^{2}}{\sum_{i=1}^{n} 1/\sigma_i^{2}}$$
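A two-line numerical check of this formula (the measurement values and per-point uncertainties below are made up):

```python
import numpy as np

x = np.array([10.2, 9.8, 10.5])      # measurements (illustrative)
sigma = np.array([0.5, 0.2, 1.0])    # per-point uncertainties (illustrative)

w = 1.0 / sigma**2
mu_hat = np.sum(w * x) / np.sum(w)   # the weighted average from the slide
print(mu_hat)
```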
An Example: Poisson
• Let f(x, α) be given by a Poisson distribution.
• Let α = μ be the mean of the Poisson.
• We want the best estimate of α = μ from our set of n measurements {x1, x2, …, xn}.
• Poisson PDF:
$$f(x, \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$
An Example: Poisson
• The likelihood function for this problem is:
$$L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n}\frac{e^{-\mu}\,\mu^{x_i}}{x_i!} = \frac{e^{-\mu}\,\mu^{x_1}}{x_1!}\cdot\frac{e^{-\mu}\,\mu^{x_2}}{x_2!}\cdots\frac{e^{-\mu}\,\mu^{x_n}}{x_n!} = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n}x_i}}{x_1!\,x_2!\cdots x_n!}$$
An Example: Poisson
• Find the α = μ that maximizes the log likelihood function:
$$\frac{d\ln L}{d\mu} = \frac{d}{d\mu}\left[-n\mu + \ln\mu\sum_{i=1}^{n}x_i - \ln(x_1!\,x_2!\cdots x_n!)\right] = -n + \frac{1}{\mu}\sum_{i=1}^{n}x_i = 0$$
$$\Rightarrow\;\; \hat\mu = \frac{1}{n}\sum_{i=1}^{n}x_i \qquad\text{(Average)}$$
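A short numeric confirmation (SciPy; the event counts are made up): the μ that maximizes the Poisson log-likelihood coincides with the sample average, as derived above.

```python
import numpy as np
from scipy import optimize, stats

counts = np.array([3, 7, 4, 6, 5, 2, 8, 5])   # illustrative Poisson counts

def neg_log_likelihood(mu):
    return -np.sum(stats.poisson.logpmf(counts, mu))

res = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print(res.x, counts.mean())   # the two numbers should agree
```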
• For large data samples (large n) the likelihood function, L, approaches a Gaussian distribution.
• Maximum likelihood estimates are usually consistent.
– For large n the estimates converge to the true value of the parameters we wish to determine.
• Maximum likelihood estimates are usually unbiased.
– For all sample sizes the parameter of interest is calculated correctly.
General properties of MLM
General properties of MLM
• Maximum likelihood estimate is efficient: the estimate has the smallest variance.
• Maximum likelihood estimate is sufficient: it uses all the information in the observations (the xi’s).
• The solution from MLM is unique.
• Bad news: we must know the correct probability distribution for the problem at hand!
Maximum Likelihood
• We maximize the likelihood function
• Log-likelihood function
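In the parametric-model notation used earlier (observations Z = {z_1, …, z_N} with density $g_\theta$, symbols I am supplying), the two functions can be written as:

$$L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i), \qquad \ell(\theta; Z) = \sum_{i=1}^{N} \log g_\theta(z_i).$$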
Score Function
• Assess the precision of $\hat\theta$ using the likelihood function
• Assume that the likelihood takes its maximum in the interior of the parameter space; then the score vanishes at $\hat\theta$
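A standard way to write the score and this condition (notation as assumed above):

$$\dot\ell(\theta; Z) = \sum_{i=1}^{N} \frac{\partial \ell(\theta; z_i)}{\partial\theta}, \qquad \dot\ell(\hat\theta; Z) = 0.$$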
Likelihood Function
• We maximize the likelihood function
• We omit the normalization since it only adds a constant factor
• Think of L as a function of θ with Z fixed
• Log-likelihood function
Fisher Information
• The negative sum of second derivatives of the log-likelihood is the information matrix
• Evaluated at $\hat\theta$ it is called the observed information, and it should be greater than 0
• The Fisher information (expected information) is its expectation over the data
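Written out in the same assumed notation:

$$\mathbf{I}(\theta) = -\sum_{i=1}^{N} \frac{\partial^{2}\ell(\theta; z_i)}{\partial\theta\,\partial\theta^{T}}, \qquad \mathbf{i}(\theta) = \mathrm{E}_\theta\bigl[\mathbf{I}(\theta)\bigr],$$

with $\mathbf{I}(\hat\theta)$ the observed information and $\mathbf{i}(\theta)$ the Fisher (expected) information.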
Sampling Theory
• Basic result of sampling theory
• The sampling distribution of the maximum-likelihood estimator approaches the following normal distribution as N → ∞, when we sample independently from the true density $g_{\theta_0}(z)$
• This suggests approximating the sampling distribution by plugging in $\hat\theta$
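The standard statement of this result, with $\theta_0$ denoting the true parameter (a reconstruction, not copied from the slide):

$$\hat\theta \;\longrightarrow\; N\bigl(\theta_0,\; \mathbf{i}(\theta_0)^{-1}\bigr) \quad \text{as } N \to \infty,$$

so the sampling distribution of $\hat\theta$ may be approximated by $N\bigl(\hat\theta, \mathbf{i}(\hat\theta)^{-1}\bigr)$ or $N\bigl(\hat\theta, \mathbf{I}(\hat\theta)^{-1}\bigr)$.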
Error Bound
• The corresponding error estimates are obtained from
• The confidence points have the form
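Concretely, in standard form (my reconstruction):

$$\widehat{\mathrm{se}}(\hat\theta_j) = \sqrt{\bigl[\mathbf{I}(\hat\theta)^{-1}\bigr]_{jj}}, \qquad \hat\theta_j \;\pm\; z^{(1-\alpha)}\,\sqrt{\bigl[\mathbf{I}(\hat\theta)^{-1}\bigr]_{jj}},$$

where $z^{(1-\alpha)}$ is the corresponding standard-normal quantile (e.g. 1.96 for a 95% interval); $\mathbf{i}(\hat\theta)$ may be used in place of $\mathbf{I}(\hat\theta)$.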
Simplified form of the Fisher information
Suppose, in addition, that the operations of integration and differentiation can be swapped for the second derivative of f(x;θ) as well, i.e.,
In this case, it can be shown that the Fisher information equals
The Cramér–Rao bound can then be written as
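In standard single-parameter notation, the three displays referenced above are (my reconstruction, not copied verbatim from the slide):

$$\frac{\partial^{2}}{\partial\theta^{2}}\int f(x;\theta)\,dx = \int \frac{\partial^{2} f(x;\theta)}{\partial\theta^{2}}\,dx = 0,$$
$$\mathcal{I}(\theta) = -\,\mathrm{E}\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right],$$
$$\mathrm{var}(\hat\theta) \;\ge\; \frac{1}{-\,\mathrm{E}\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right]}.$$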
Single-parameter proof
If the expectation of T is denoted by ψ(θ), then, for all θ,
$$\mathrm{var}\bigl(t(X)\bigr) \;\ge\; \frac{[\psi'(\theta)]^{2}}{I(\theta)}.$$
Let X be a random variable with probability density function f(x;θ). Here T = t(X) is a statistic, which is used as an estimator for ψ(θ). If V is the score, i.e.
$$V = \frac{\partial}{\partial\theta}\ln f(X;\theta),$$
then the expectation of V, written E(V), is zero. If we consider the covariance cov(V,T) of V and T, we have cov(V,T) = E(VT), because E(V) = 0. Expanding this expression we have
$$\mathrm{cov}(V,T) = \mathrm{E}\!\left[T\cdot\frac{\partial}{\partial\theta}\ln f(X;\theta)\right] = \int t(x)\,\frac{\partial}{\partial\theta}\ln f(x;\theta)\, f(x;\theta)\,dx.$$
Using the chain rule
$$\frac{\partial}{\partial\theta}\ln f(x;\theta) = \frac{1}{f(x;\theta)}\,\frac{\partial f(x;\theta)}{\partial\theta}$$
and the definition of expectation gives, after cancelling f(x;θ),
$$\mathrm{cov}(V,T) = \int t(x)\,\frac{\partial f(x;\theta)}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int t(x)\,f(x;\theta)\,dx = \psi'(\theta),$$
because the integration and differentiation operations commute (second condition).
The Cauchy–Schwarz inequality shows that
$$\sqrt{\mathrm{var}(T)\,\mathrm{var}(V)} \;\ge\; \bigl|\mathrm{cov}(V,T)\bigr| = \bigl|\psi'(\theta)\bigr|.$$
Therefore
$$\mathrm{var}(T) \;\ge\; \frac{[\psi'(\theta)]^{2}}{\mathrm{var}(V)} = \frac{[\psi'(\theta)]^{2}}{I(\theta)}.$$
An Example
• Consider a linear expansion
• The least squares solution
• The covariance of $\hat\beta$
$$\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$$
$$\hat\beta = (H^{T}H)^{-1}H^{T}y$$
$$\widehat{\mathrm{Cov}}(\hat\beta) = (H^{T}H)^{-1}\hat\sigma^{2}, \qquad \hat\sigma^{2} = \sum_{i}\bigl(y_i - \hat\mu(x_i)\bigr)^{2}/N$$
An Example
• The confidence region
Consider the prediction model
$$\hat\mu(x) = \sum_{j=1}^{N}\hat\beta_j h_j(x) = h(x)^{T}\hat\beta$$
The standard deviation:
$$\widehat{\mathrm{se}}[\hat\mu(x)] = \bigl[h(x)^{T}(H^{T}H)^{-1}h(x)\bigr]^{1/2}\,\hat\sigma$$
The (approximately 95%) confidence band:
$$\hat\mu(x) \pm 1.96\cdot\widehat{\mathrm{se}}[\hat\mu(x)]$$
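A hedged NumPy sketch of this band, reusing the same illustrative polynomial basis and toy data as the earlier bootstrap sketch (all of it assumed, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 3.0, 50))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)       # toy data (assumption)

H = np.vander(x, 5, increasing=True)                      # rows are h(x_i)^T
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
sigma_hat = np.sqrt(np.sum((y - H @ beta_hat) ** 2) / y.size)

HtH_inv = np.linalg.inv(H.T @ H)
x_grid = np.linspace(0.0, 3.0, 100)
Hg = np.vander(x_grid, 5, increasing=True)                # h(x) on a grid
mu_hat = Hg @ beta_hat
se = np.sqrt(np.einsum("ij,jk,ik->i", Hg, HtH_inv, Hg)) * sigma_hat

lower, upper = mu_hat - 1.96 * se, mu_hat + 1.96 * se     # pointwise ~95% band
print(np.column_stack([x_grid, lower, upper])[:5])
```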
Bayesian Methods
• Given a sampling model Pr(Z | θ) and a prior Pr(θ) for the parameters, estimate the posterior probability
$$\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta)\,\Pr(\theta)\,d\theta}$$
• By drawing samples from it, or by estimating its mean or mode
• Differences to mere counting (frequentist approach):
  – Prior: allows for uncertainties present before seeing the data
  – Posterior: allows for uncertainties present after seeing the data
Bayesian Methods
• The posterior distribution also affords a predictive distribution for seeing future values:
$$\Pr(z^{\mathrm{new}} \mid Z) = \int \Pr(z^{\mathrm{new}} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta$$
• In contrast, the max-likelihood approach would predict future data on the basis of $\Pr(z^{\mathrm{new}} \mid \hat\theta)$, not accounting for the uncertainty in estimating θ.
An Example
• Consider a linear expansion
$$\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$$
• The least squares solution
$$\hat\beta = (H^{T}H)^{-1}H^{T}y$$
• A Gaussian prior for the coefficients β:
$$\Pr(\beta) = \frac{1}{(2\pi)^{p/2}\,|\tau\Sigma|^{1/2}}\,\exp\!\left(-\frac{\beta^{T}\Sigma^{-1}\beta}{2\tau}\right)$$
• The posterior distribution for β is also Gaussian, with the mean and covariance given below
• The corresponding posterior mean and covariance of μ(x) then follow
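A sketch of these expressions, assuming Gaussian noise with variance σ² and the Gaussian prior β ~ N(0, τΣ) from the previous slide (assumptions consistent with, but not read off, the slides):

$$\mathrm{E}(\beta \mid Z) = \Bigl(H^{T}H + \tfrac{\sigma^{2}}{\tau}\Sigma^{-1}\Bigr)^{-1} H^{T}y, \qquad \mathrm{cov}(\beta \mid Z) = \Bigl(H^{T}H + \tfrac{\sigma^{2}}{\tau}\Sigma^{-1}\Bigr)^{-1}\sigma^{2},$$

and correspondingly

$$\mathrm{E}\bigl(\mu(x) \mid Z\bigr) = h(x)^{T}\,\mathrm{E}(\beta \mid Z), \qquad \mathrm{cov}\bigl(\mu(x), \mu(x') \mid Z\bigr) = h(x)^{T}\,\mathrm{cov}(\beta \mid Z)\,h(x').$$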
Bootstrap vs Bayesian
• The bootstrap mean is an approximate posterior average
• Simple example (a reconstruction in standard form follows this list):
  – Single observation z drawn from a normal distribution
  – Assume a normal prior for θ
  – Resulting posterior distribution
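A hedged reconstruction of this example, assuming unit noise variance and prior variance τ:

$$z \mid \theta \sim N(\theta, 1), \qquad \theta \sim N(0, \tau) \;\;\Longrightarrow\;\; \theta \mid z \sim N\!\left(\frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau}\right).$$

As τ → ∞ (a noninformative prior) the posterior tends to N(z, 1), which is exactly the distribution produced by the parametric bootstrap $z^{*} \sim N(\hat\theta, 1)$ with $\hat\theta = z$.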
Bootstrap vs Bayesian
• Three ingredients make this work
  – The choice of a noninformative prior for θ
  – The dependence of the log-likelihood ℓ(θ; Z) on the data Z only through the maximum-likelihood estimate $\hat\theta$; thus we can write it as ℓ(θ; $\hat\theta$)
  – The symmetry of the log-likelihood in θ and $\hat\theta$, i.e. ℓ(θ; $\hat\theta$) = ℓ($\hat\theta$; θ)
Bootstrap vs Bayesian
• The bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter.
• But this bootstrap distribution is obtained painlessly, without having to formally specify a prior and without having to sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.
The EM Algorithm
• The EM algorithm for two-component Gaussian mixtures
  – Take initial guesses for the parameters $\hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2, \hat\pi$
  – Expectation Step: compute the responsibilities
The EM Algorithm
– Maximization Step: Compute the weighted means and variances
– Iterate 2 and 3 until convergence
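A compact, hedged NumPy implementation of these steps for a two-component Gaussian mixture; the synthetic data, initialization, and fixed iteration count are my own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 1-D data from two Gaussian components (illustrative)
y = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 0.7, 100)])

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# 1. Initial guesses for mu1, mu2, var1, var2, pi
mu1, mu2 = y.min(), y.max()
var1 = var2 = y.var()
pi = 0.5

for _ in range(200):
    # 2. Expectation step: responsibility of component 2 for each point
    p1 = (1.0 - pi) * normal_pdf(y, mu1, var1)
    p2 = pi * normal_pdf(y, mu2, var2)
    gamma = p2 / (p1 + p2)

    # 3. Maximization step: weighted means, variances and mixing proportion
    mu1 = np.sum((1.0 - gamma) * y) / np.sum(1.0 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    var1 = np.sum((1.0 - gamma) * (y - mu1) ** 2) / np.sum(1.0 - gamma)
    var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()

print(mu1, mu2, var1, var2, pi)
```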
The EM Algorithm in General
• Baum-Welch algorithm
• Applicable to problems for which maximizing the log-likelihood is difficult, but which are simplified by enlarging the sample set with unobserved (latent) data (data augmentation).
The EM Algorithm in General
• Start with initial params $\hat\theta^{(0)}$
• Expectation Step: at the j-th step compute
$$Q(\theta', \hat\theta^{(j)}) = \mathrm{E}\bigl[\ell_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \hat\theta^{(j)}\bigr]$$
as a function of the dummy argument θ' (here ℓ₀ is the complete-data log-likelihood and T denotes the augmented data)
• Maximization Step: determine the new params $\hat\theta^{(j+1)}$ by maximizing $Q(\theta', \hat\theta^{(j)})$ over θ'
• Iterate 2 and 3 until convergence
Model Averaging and Stacking
• Given predictions $\hat f_1(x), \hat f_2(x), \dots, \hat f_M(x)$
• Under squared-error loss, seek weights $w = (w_1, \dots, w_M)$
• Such that the expected squared prediction error is minimized (the criterion is written out below)
• Here the input x is fixed and the N observations in Z are distributed according to P
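Written out in standard form (my reconstruction; $\hat f_1(x), \dots, \hat f_M(x)$ denote the predictions and $\hat{F}(x) = (\hat f_1(x), \dots, \hat f_M(x))^{T}$):

$$\hat{w} = \underset{w}{\mathrm{argmin}}\; \mathrm{E}_{\mathcal{P}}\Bigl[\,Y - \sum_{m=1}^{M} w_m \hat f_m(x)\Bigr]^{2},$$

whose solution is the population linear regression of Y on $\hat F(x)$:

$$\hat{w} = \mathrm{E}_{\mathcal{P}}\bigl[\hat F(x)\hat F(x)^{T}\bigr]^{-1}\,\mathrm{E}_{\mathcal{P}}\bigl[\hat F(x)\,Y\bigr].$$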