Post on 12-Jan-2016
The Triangle of Statistical Inference: Likelihood
Data
Scientific Model
Probability Model
Inference
An example...
The Data: xi = measurements of DBH on 50 trees; yi = measurements of crown radius on those trees
The Scientific Model: yi = α + βxi + ε (a linear relationship, with 2 parameters (α and β) and an error term (ε, the residuals))
The Probability Model: ε is normally distributed, with E[ε] = 0 and variance estimated from the observed variance of the residuals...
So what is likelihood, and what is it good for?
1. Probability based ("inverse probability"): a "mathematical quantity that appears to be appropriate for measuring our order of preference among different possible populations but does not in fact obey the laws of probability" -- R.A. Fisher
2. The foundation of the theory of statistics.
3. Enables comparison of alternate models.
So what is likelihood, and what is it good for?
Scientific hypotheses cannot be treated as outcomes of trials (probabilities), because we will never have the full set of possible outcomes.
However, we can calculate the probability of obtaining the observed results, given our model (scientific hypothesis): P(data|model).
Likelihood is proportional to this probability.
Likelihood is proportional to probability
P(data | hypothesis (θ)) ∝ L(θ | data)
P(data | hypothesis (θ)) = k · L(θ | data)
In plain English: "The likelihood (L) of the set of parameters (θ) (in the scientific model), given the data (x), is proportional to the probability of observing the data, given the parameters..."
{and this probability is something we can calculate, using the appropriate underlying probability model (i.e. a PDF)}
Parameter values can specify your hypotheses
P(data | θ) = k · L(θ | data)
Probability: the parameter is fixed and the data are variable. What is the probability of observing the data if our model and parameters are correct?
Likelihood: the parameter is variable and the data are fixed. What is the likelihood of the parameter given the data?
General Likelihood Function
L(θ | x) = c · g(x | θ)
where L(θ | x) is the likelihood function, x is the data (the xi), θ is the set of parameters in the probability model, and g is a probability density function or discrete density function.
c is a constant, and thus unimportant in comparisons of alternate hypotheses or models, as long as the data remain constant.
[Figure: a probability density curve (probability vs. x, for x from -4 to 5)]
General Likelihood Function
L(θ | x) = ∏_{i=1}^{n} g(xi | θ)
where L(θ | x) is the likelihood function, the xi are the data, θ is the set of parameters in the probability model, and g is a probability density function or discrete density function.
[Figure: a probability density curve (probability vs. x, for x from -4 to 5)]
The parameters of the pdf are determined by the data and by the values of the parameters in the scientific model!
Likelihood Axiom
"Within the framework of a statistical model, a set of data supports one statistical hypothesis better than the other if the likelihood of the first hypothesis, on the data, exceeds the likelihood of the second hypothesis."
(Edwards 1972)
How to derive a likelihood function: Binomial
Event: 10 trees die out of a population of 50.
Question: What is the mortality rate (p)?
Probability Density Function: g(x) = (n choose x) · p^x · (1-p)^(n-x)
Likelihood: L(p | x) = c · g(x | p)
L(p | 10) = g(10 | p) = (50 choose 10) · p^10 · (1-p)^40
The most likely parameter value is 10/50 = 0.20.
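The binomial example can be checked numerically: evaluate the likelihood over a grid of candidate mortality rates and keep the best one. A minimal Python sketch (the grid resolution is an arbitrary illustrative choice):

```python
from math import comb

n, x = 50, 10  # 10 of 50 trees died

def binom_lik(p):
    # Binomial probability of observing x deaths out of n, given mortality rate p
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Evaluate the likelihood over a grid of candidate mortality rates
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=binom_lik)
print(mle)  # 0.2, i.e. x/n
```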
Likelihood Profile: Binomial
L(p | 10) = g(10 | p) = (50 choose 10) · p^10 · (1-p)^40
[Figure: likelihood (constant term omitted) vs. the value of the estimated parameter (p) from 0 to 1, peaking at about 1.4e-11 near p = 0.2]
The model (parameter p) is defined by the data!
An example: Can we predict tree fecundity as a function of tree size?
The Data: xi = measurements of DBH on 50 trees; yi = counts of seeds produced by those trees
The Scientific Model: yi = DBH^β (an exponential relationship, with 1 parameter (β) and an error term (ε))
The Probability Model: the data follow a Poisson distribution, with E[x] = λ and variance = λ
Iterative process
1. Pick a value for the parameter (β) in your scientific model. Recall the scientific model is yi = DBH^β.
2. For each data point, calculate the expected (predicted) value for that value of β.
3. Calculate the probability of observing what you observed, given that parameter value and your probability model.
4. Multiply together the probabilities of the individual observations.
5. Go back to 1 until you find the maximum likelihood estimate for the parameter.
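The five steps above can be sketched as a grid search in Python. The power form y = DBH^β is an assumption reconstructed from the garbled slides, and the DBH values and seed counts below are invented for illustration:

```python
from math import exp, factorial, log

# Hypothetical data: DBH (cm) and observed seed counts
dbh = [12.0, 18.0, 25.0, 31.0, 40.0]
seeds = [2, 3, 4, 6, 9]

def poisson_prob(x, lam):
    # Step 3: probability of one observation under the Poisson model
    return exp(-lam) * lam**x / factorial(x)

def log_likelihood(beta):
    # Steps 2-4: predicted value per tree, then combine the probabilities
    # (here as a sum of logs rather than a product, for numerical safety)
    return sum(log(poisson_prob(x, d**beta)) for d, x in zip(dbh, seeds))

# Steps 1 and 5: scan candidate betas, keep the best
betas = [i / 1000 for i in range(100, 1500)]
mle = max(betas, key=log_likelihood)
```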
Likelihood: Poisson Process
P(X = x) = (e^(-λ) · λ^x) / x!
E[x] = λ
First pass...
Model: yi = DBH^β
Predicted = 0.0617; Observed = 2
[Figure: probability vs. number of seeds (0, 1, 2) for a Poisson random variable with E[x1] = 0.0617]
Do for n observations...
P(X = x) = (e^(-λ) · λ^x) / x!, with λ = the predicted value
P(X = 2) = (e^(-0.0617) · 0.0617^2) / 2! ≈ 0.0018
Pick a new value of beta...
Model: yi = DBH^β
Predicted = 0.498; Observed = 2
[Figure: probability vs. number of seeds (0 to 4) for a Poisson random variable with E[x1] = 0.498]
Do for n observations...
P(X = 2) = (e^(-0.498) · 0.498^2) / 2! ≈ 0.075
Probability and Likelihood
1. Multiplying probabilities is not convenient from a computational point of view.
2. Instead, we take the log of each probability and maximize the sum of the logs (the log-likelihood).
3. This gives us the Maximum Likelihood Estimate of the parameter.
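The computational point is easy to demonstrate: a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays well behaved. A minimal Python illustration:

```python
import math

probs = [1e-4] * 200  # 200 observations, each with probability 1e-4

product = math.prod(probs)               # (1e-4)^200 = 1e-800: below float range
log_sum = sum(math.log(p) for p in probs)  # 200 * ln(1e-4): perfectly representable

print(product)  # 0.0 (underflow)
print(log_sum)  # about -1842.07
```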
Likelihood Profile
Model: yi = DBH^β
[Figure: log likelihood (from about -170.5 to -165.0) vs. beta (0 to 1.2); the peak of the curve marks the ML estimate]
Model comparison
The Data: xi = measurements of DBH on 50 trees; yi = counts of seeds produced by those trees
The Scientific Models:
yi = DBH^β (an exponential relationship, with 1 parameter (β))
OR
yi = β · DBH (a linear relationship, with 1 parameter (β))
The Probability Model: the data follow a Poisson distribution, with E[x] = λ and variance = λ
Model comparison
The Data: xi = measurements of DBH on 50 trees; yi = counts of seeds produced by those trees
The Scientific Model: yi = DBH^β (an exponential relationship, with 1 parameter (β))
The Probability Models: the data follow a Poisson distribution, with E[x] = λ and variance = λ
OR
the data follow a negative binomial distribution with E[x] = m and clumping parameter k (the variance is defined by m and k, both estimated).
Determination of the appropriate likelihood function
FIRST PRINCIPLES
1. Proportions: Binomial
2. Several categories: Multinomial
3. Count events: Poisson, Negative binomial
4. Continuous data, additive processes: Normal
5. Quantities arising from multiplicative probabilities: Lognormal, Gamma
EMPIRICAL
1. Examine the residuals.
2. Test different probability distributions for the model errors.
Probability models can be thought of as competing hypotheses in exactly the same way that different parameter values (structural models) are competing hypotheses.
Likelihood functions: An aside about logarithms
Basic Log Operations
log(a · b) = log(a) + log(b)
log(a / b) = log(a) - log(b)
log(b^a) = a · log(b)
log_a(a) = 1
Taking the logarithm in base a of a number is the inverse of raising a to that power. Example: log10(1000) = 3.
Poisson Likelihood Function
Discrete Density Function: P(X = x) = (e^(-λ) · λ^x) / x!, with E[X] = λ = Variance
Likelihood: L(λ | x) = ∏_{i=1}^{n} (e^(-λ_i) · λ_i^(x_i)) / x_i!
Loglikelihood(λ | x) = Σ_{i=1}^{n} [x_i · ln(λ_i) - λ_i - ln(x_i!)]
Negative Binomial Distribution Likelihood Function
Discrete Density Function: Pr(X = n) = (Γ(k+n) / (Γ(k) · n!)) · (m / (m+k))^n · (k / (m+k))^k
E[X] = m; Variance = m + m²/k
Likelihood: L(k, m | n) = ∏_{i=1}^{N} (Γ(k+n_i) / (Γ(k) · Γ(n_i+1))) · (m_i / (m_i+k))^(n_i) · (k / (m_i+k))^k
Loglikelihood(x | m, k) = N · [k · ln(k) - ln Γ(k)] + Σ_{i=1}^{N} [ln Γ(k+n_i) - ln(n_i!) + n_i · ln(m_i) - (k+n_i) · ln(m_i+k)]
k is an estimated parameter!
Normal Distribution Likelihood Function
Prob. Density Function: f(x) = (1 / √(2πσ²)) · exp(-(x-μ)² / (2σ²))
E[x] = μ; Variance = σ²
Likelihood: L(μ, σ | x) = ∏_{i=1}^{n} (1 / √(2πσ²)) · exp(-(x_i-μ)² / (2σ²))
LogLikelihood(μ, σ | x) = -(n/2) · [ln(2π) + ln(σ²)] - Σ_{i=1}^{n} (x_i-μ)² / (2σ²)
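The normal log-likelihood formula translates directly into code. A Python sketch (the sample values are invented), which also illustrates that for fixed σ the sample mean maximizes the log-likelihood over μ:

```python
from math import log, pi

def normal_loglik(mu, sigma, xs):
    # -(n/2)[ln(2*pi) + ln(sigma^2)] - sum((x_i - mu)^2) / (2*sigma^2)
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -(n / 2) * (log(2 * pi) + log(sigma ** 2)) - ss / (2 * sigma ** 2)

xs = [4.1, 5.0, 5.3, 6.2, 4.8]  # invented sample
mean = sum(xs) / len(xs)        # candidate MLE for mu (sigma held at 1.0)
at_mean = normal_loglik(mean, 1.0, xs)
off_mean = normal_loglik(mean + 0.5, 1.0, xs)  # should be lower
```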
Lognormal Distribution Likelihood Function
Prob. Density Function: f(x) = (1 / (x · σ · √(2π))) · exp(-(ln(x) - μ)² / (2σ²))
Likelihood: L(μ, σ | x) = ∏_{i=1}^{n} (1 / (x_i · σ · √(2π))) · exp(-(ln(x_i) - μ)² / (2σ²))
Loglikelihood(μ, σ | x) = -(n/2) · [ln(2π) + ln(σ²)] - Σ_{i=1}^{n} ln(x_i) - Σ_{i=1}^{n} (ln(x_i) - ln(x̂_i))² / (2σ²)
with E[ln(x)] = ln(x̂) and σ² = Σ_{i=1}^{n} (ln(x_i) - ln(x̂_i))² / n
Gamma Distribution Likelihood Function
Prob. Density Function: f(x) = (1 / (s^a · Γ(a))) · x^(a-1) · e^(-x/s)
a = shape parameter; s = scale parameter
E[X] = a·s; Var[X] = a·s²
LogLik = Σ_{i=1}^{n} [(a-1) · ln(x_i) - x_i/s - a · ln(s) - ln Γ(a)]
Exponential Distribution Likelihood Function
Prob. Density Function: f(x) = λ · e^(-λx)
E[x] = 1/λ; Variance = 1/λ²
Likelihood: L = ∏_{i=1}^{n} λ · e^(-λ·x_i)
LogLikelihood(x | λ) = n · ln(λ) - λ · Σ_{i=1}^{n} x_i
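For the exponential, setting the derivative of n·ln(λ) - λ·Σx_i to zero gives a closed-form MLE, λ̂ = n / Σx_i = 1 / (sample mean). A quick Python check against a grid search (the waiting times below are invented):

```python
from math import log

xs = [0.8, 1.3, 2.1, 0.4, 1.9, 0.5]  # hypothetical waiting times

def exp_loglik(lam):
    # n*ln(lambda) - lambda * sum(x_i)
    return len(xs) * log(lam) - lam * sum(xs)

closed_form = len(xs) / sum(xs)  # 1 / sample mean = 6/7

# A brute-force grid search should land next to the closed form
grid = [i / 1000 for i in range(1, 5000)]
grid_mle = max(grid, key=exp_loglik)
```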
Evaluating the strength of evidence for the MLE
Now that you have an MLE, how should you evaluate it?
Two purposes of support/confidence intervals
• Measure of support for alternate parameter estimates.
• Help with fitting when something goes wrong.
Methods of calculating support intervals
• Bootstrapping
• Likelihood curves and profiles
Bootstrapping
• Resample the data with replacement and record the number of times the parameter estimate falls within an interval.
• A frequentist approach: if I sampled my data a large number of times, what would my confidence in the estimate be?
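A sketch of the bootstrap for the earlier binomial example (10 deaths in 50 trees). The number of resamples and the percentile interval are common conventional choices, not the only ones:

```python
import random

random.seed(1)  # reproducible resamples

# Original data: 1 = died, 0 = survived (10 of 50 died)
data = [1] * 10 + [0] * 40

estimates = []
for _ in range(2000):
    # Resample the 50 observations with replacement
    sample = random.choices(data, k=len(data))
    estimates.append(sum(sample) / len(sample))  # MLE of p for this resample

estimates.sort()
# Percentile 95% bootstrap interval for p
lo, hi = estimates[int(0.025 * 2000)], estimates[int(0.975 * 2000)]
```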
General method
• Draw the likelihood curve (one parameter), surface (two parameters), or n-dimensional space (n parameters).
• Figure out how much the likelihood changes as the parameter of interest moves away from the MLE.
Strength of evidence for particular parameter estimates – “Support”
• Likelihood provides an objective measure of the strength of evidence for different parameter estimates...
Log-likelihood = “Support” (Edwards 1992)
[Figure: log-likelihood (about -155 to -147) vs. parameter estimate (2.0 to 2.8)]
Asymptotic vs. Simultaneous M-Unit Support Limits
• Asymptotic: hold all other parameters at their MLE values, and systematically vary the remaining parameter until the likelihood declines by a chosen amount (m).
[Figure: log-likelihood vs. parameter estimate, marking the maximum likelihood estimate and the 2-unit support interval around it]
What should “m” be? 1.92 is a good number, and is roughly analogous to a 95% CI.
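For a one-parameter model, the procedure is simply: walk the parameter away from the MLE and record where the log-likelihood has dropped by m. A Python sketch using the earlier binomial example (10 deaths in 50 trees) and m = 1.92:

```python
from math import comb, log

n, x, m = 50, 10, 1.92

def loglik(p):
    # Binomial log-likelihood
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)

mle = x / n  # 0.2
max_ll = loglik(mle)

# Keep every grid value whose log-likelihood is within m units of the maximum
grid = [i / 1000 for i in range(1, 1000)]
inside = [p for p in grid if loglik(p) >= max_ll - m]
lo, hi = min(inside), max(inside)  # the 1.92-unit support interval
```

Note that the interval is asymmetric around 0.2, unlike a Wald interval: the likelihood surface itself determines its shape.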
An aside on the Likelihood Ratio Test
• Twice the difference in log-likelihoods between two nested models (the likelihood-ratio statistic, R) follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between models A and B.
R = 2 · [ln L(Y | M_A) - ln L(Y | M_B)]
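A sketch of the test with hypothetical fitted log-likelihood values (the numbers are invented; 3.84 is the 95th percentile of the chi-square distribution with 1 df, hard-coded here rather than computed):

```python
# Hypothetical fitted log-likelihoods: model A has one more parameter than B
loglik_A = -165.2
loglik_B = -168.9

R = 2 * (loglik_A - loglik_B)  # likelihood-ratio statistic, about 7.4

df = 1           # difference in number of parameters
critical = 3.84  # chi-square 95th percentile for df = 1

# R exceeds the critical value: the extra parameter significantly improves the fit
better = R > critical
```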
Asymptotic vs. Simultaneous M-Unit Support Limits
• Simultaneous:
– Resampling method: draw a very large number of random sets of parameters and calculate the log-likelihood. The m-unit simultaneous support limits for parameter x_i are the upper and lower values that do not differ from the maximum by more than m units of support.
– Alternatively, set the focal parameter to a range of values and, for each value, optimize the likelihood over all the other parameters.
In practice, this can require an enormous number of iterations if there are more than a few parameters.
Asymptotic vs. Simultaneous Support Limits
[Figure: a hypothetical likelihood surface for 2 parameters (parameter 1 vs. parameter 2), showing the 2-unit drop in support, the asymptotic 2-unit support limits for P1, and the wider simultaneous 2-unit support limits for P1]
Other measures of the strength of evidence for different parameter estimates
• Edwards (1992, Chapter 5): various measures of the “shape” of the likelihood surface in the vicinity of the MLE... How pointed is the peak?
Evaluating Support for Parameter Estimates
• Traditional confidence intervals and standard errors of the parameter estimates can be generated from the Hessian matrix
– Hessian = the matrix of second partial derivatives of the likelihood function with respect to the parameters, evaluated at the maximum likelihood estimates
– Also called the “Information Matrix” by Fisher
– Provides a measure of the steepness of the likelihood surface in the region of the optimum
– Evaluated at the MLE, it is the observed information matrix
– Can be generated in R using optim(..., hessian = TRUE)
An example from R
• The Hessian matrix (when maximizing a log-likelihood) is a numerical approximation of Fisher's Information Matrix (i.e. the matrix of second partial derivatives of the likelihood function), evaluated at the maximum likelihood estimates. Thus, it is a measure of the steepness of the drop in the likelihood surface as you move away from the MLE.

> res$hessian
           a          b       sd
a   -150.182  -2758.360   -0.201
b  -2758.360 -67984.416   -5.925
sd    -0.202     -5.926 -299.422
The Hessian CI
• Now invert the negative of the Hessian matrix to get the matrix of parameter variances and covariances.
• The square roots of the diagonals of the inverted negative Hessian are the standard errors (and ±1.96 · S.E. gives a 95% CI).
• Are we reverting to a frequentist framework?

> solve(-1*res$hessian)
               a             b            sd
a   2.613229e-02 -1.060277e-03  3.370998e-06
b  -1.060277e-03  5.772835e-05 -4.278866e-07
sd  3.370998e-06 -4.278866e-07  3.339775e-03

> sqrt(diag(solve(-1*res$hessian)))
       a        b       sd
  0.1616 0.007597  0.05779
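The same computation can be sketched outside R. For a one-parameter normal-mean model with known σ = 1 (data invented), the observed information has a known closed form (n/σ²), so a finite-difference second derivative of the log-likelihood can be checked against it:

```python
from math import log, pi, sqrt

xs = [4.1, 5.0, 5.3, 6.2, 4.8]  # invented sample, sigma known and fixed at 1
n = len(xs)

def loglik(mu):
    # Normal log-likelihood with sigma = 1
    return -(n / 2) * log(2 * pi) - sum((x - mu) ** 2 for x in xs) / 2

mle = sum(xs) / n  # the sample mean maximizes the normal log-likelihood

# Second derivative by central finite differences at the MLE (the "Hessian")
h = 1e-4
d2 = (loglik(mle + h) - 2 * loglik(mle) + loglik(mle - h)) / h**2

se = sqrt(1 / -d2)  # invert the negative Hessian, take the square root
# Closed form: se = sigma / sqrt(n) = 1 / sqrt(5), about 0.447
```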
Some references
Edwards, A.W.F. 1972. Likelihood. Cambridge University Press.
Feller, W. 1968. An Introduction to Probability Theory and Its Applications. Wiley & Sons.