
Page 1: Mathematical Statistics - Nanyang Technological University

Mathematical Statistics

MAS 713

Chapter 6

Page 2: Mathematical Statistics - Nanyang Technological University

Previous lecture

Point estimators
- Estimation and sampling distribution
- Point estimation
- Properties of estimators

Sufficient Statistics
- Factorization Theorem

Any questions?

Page 3: Mathematical Statistics - Nanyang Technological University

This lecture

1 6.1 Maximum Likelihood Estimation
  6.1.1 Introduction
  6.1.3 Maximum Likelihood Principle

2 6.2 Cramér-Rao Lower Bound
  6.2.1 Introduction
  6.2.2 Examples

3 6.3 Method of Moments

4 6.4 Examples: MLE and Methods of Moments

Additional reading: Chapter 7

Page 4: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.1 Introduction

Intuition of MLE

A patient visits a physician and complains about the following symptoms: "I have a headache, I'm feeling weak and have no appetite."

The Doctor's diagnostic options:
1 You have a brain tumor.
2 You broke your foot.
3 You have a cold.

The Doctor's job is to determine the most likely illness. We'll revisit this example later.

Page 12: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.1 Introduction

We have seen that there are plenty of choices for an estimator θ̂ of an unknown parameter θ.

=⇒ How to choose θ̂?

One possible approach:

Given observations x1, x2, . . . , xn, choose the unknown parameter θ̂ = θ̂(x1, . . . , xn) in such a way that it maximizes the probability of the occurrence of our observed values x1, x2, . . . , xn.

=⇒ choose θ̂ such that

P(X1 = x1, . . . , Xn = xn | θ̂) = max_θ P(X1 = x1, . . . , Xn = xn | θ)

This is the intuition behind the Maximum Likelihood Estimator (MLE).

Page 13: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

The Maximum Likelihood Principle

The main ingredients:
1 X: a random variable.
2 θ: parameter to estimate (restricted to a parameter space Sθ).
3 p(X; θ) (or p(X | θ)): a statistical model (pmf or pdf).
4 X1, . . . , Xn: a random sample from X.

We want to construct good estimators for θ.

Notation: Given observations x1, . . . , xn, we write

p(x | θ) = the joint probability mass function if X is discrete,
           the joint probability density function if X is continuous.

Page 14: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

The Maximum Likelihood Principle

Definition
Let X = (X1, . . . , Xn) have joint pdf/pmf p(x; θ), where θ ∈ Sθ. The likelihood function (or simply likelihood) is defined by

Sθ ∋ θ ↦ L(θ) := L(θ; x) = p(x; θ)

Note: x is fixed and θ varies in Sθ.

The likelihood is a function of θ.
The likelihood is not a pdf/pmf (as a function of θ, for fixed x).
If the data are i.i.d., then

L(θ; x) = ∏_{i=1}^n p(xi; θ)

Page 15: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

The Maximum Likelihood Principle

Choose θ̂ = θ̂(x) which maximizes the likelihood function, i.e.

L(θ̂; x) = max_{θ ∈ Sθ} L(θ; x)

By definition of the arg max, this means

θ̂(x) ∈ arg max_{θ ∈ Sθ} L(θ; x)

Definition of Maximum Likelihood Estimator (MLE)
Let X = (X1, . . . , Xn) be a random sample. If

θ̂(X) ∈ arg max_{θ ∈ Sθ} L(θ; X)

then we call θ̂(X) a Maximum Likelihood Estimator (MLE) for θ.
Note: the MLE may not be unique, or may not exist.

Remark: arg max_θ f(θ) is the set of points θ at which f attains its largest value.

Page 16: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Intuition of MLE

The data:
x = "I have a headache, I'm feeling weak and have no appetite."

The (discrete) parameter space Sθ:
1 You have a brain tumor.
2 You broke your foot.
3 You have a cold.

The likelihood under each parameter:
P("headache, weakness, no appetite" | θ = brain tumor) = 0.2
P("headache, weakness, no appetite" | θ = broken foot) = 0.05
P("headache, weakness, no appetite" | θ = cold) = 0.4

The ML estimate:
The likelihood of having a cold is the highest, so θ̂ = cold.
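This "pick the parameter with the highest likelihood of the data" step is literally an arg max over a finite set. A minimal Python sketch (the probabilities are the ones from the slide; the dictionary is just an illustration):

```python
# Likelihood of the observed symptoms under each candidate diagnosis
likelihood = {
    "brain tumor": 0.20,
    "broken foot": 0.05,
    "cold": 0.40,
}

# The ML estimate is the parameter value that maximizes the likelihood
theta_hat = max(likelihood, key=likelihood.get)
print(theta_hat)  # -> cold
```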


Page 20: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

The Maximum Likelihood Principle

We may apply any monotone increasing function and still achieve maximization. Very often it is more convenient to consider the logarithm of the likelihood function (the log-likelihood function)

log L(θ; x) = log p(x | θ)

Since the logarithm is a monotone function, maximization of the likelihood and of the log-likelihood are equivalent; that is, θ̂ maximizes the likelihood function if and only if it also maximizes the log-likelihood function:

arg max_{θ ∈ Sθ} L(θ; X) = arg max_{θ ∈ Sθ} log L(θ; X)

or, in other words,

θ̂ ∈ arg max_{θ ∈ Sθ} L(θ; X) ⇐⇒ θ̂ ∈ arg max_{θ ∈ Sθ} log L(θ; X)

Page 21: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Maximum Likelihood Estimation

[Figure: the likelihood p(x; θ) and the log-likelihood log p(x; θ) plotted as functions of θ; both attain their maximum at the same point θ̂.]

Page 22: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Example
Suppose that X is a discrete random variable with the following probability mass function:

  x      0      1      2           3
  p(x)   2θ/3   θ/3    2(1−θ)/3    (1−θ)/3

where 0 < θ < 1 is a parameter. The following 10 independent observations were taken from such a distribution:
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).
Find a point estimate of θ using the MLE.

Page 23: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Solution:

The likelihood function given the observations
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is given by

L(θ; x) = ∏_{i=1}^n p(xi | θ)
        = p(X = 3 | θ) p(X = 0 | θ) p(X = 2 | θ) p(X = 1 | θ) p(X = 3 | θ)
          × p(X = 2 | θ) p(X = 1 | θ) p(X = 0 | θ) p(X = 2 | θ) p(X = 1 | θ)
        = (2θ/3)² (θ/3)³ (2(1−θ)/3)³ ((1−θ)/3)²

=⇒ θ̂ ∈ arg max_{θ ∈ (0,1)} (2θ/3)² (θ/3)³ (2(1−θ)/3)³ ((1−θ)/3)²

Clearly, the likelihood function is not easy to maximize.
Let's look at the log-likelihood.

Page 24: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

The log-likelihood function given the observations
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is

log L(θ; x) = log ∏_{i=1}^n p(xi | θ)
            = 2 (log(2/3) + log θ) + 3 (log(1/3) + log θ) + 3 (log(2/3) + log(1−θ)) + 2 (log(1/3) + log(1−θ))
            = Constant + 5 log θ + 5 log(1−θ)

Setting the derivative to 0 and solving:

d log L(θ)/dθ = 5 (1/θ − 1/(1−θ)) = 0

θ̂ = θ̂(x) = 0.5
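As a numeric sanity check (a sketch assuming numpy and scipy are available; not part of the slides), we can maximize this log-likelihood directly over θ ∈ (0, 1) and compare with the closed-form answer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([3, 0, 2, 1, 3, 2, 1, 0, 2, 1])

def neg_log_lik(theta):
    # pmf from the example: p(0) = 2θ/3, p(1) = θ/3, p(2) = 2(1−θ)/3, p(3) = (1−θ)/3
    pmf = np.array([2 * theta / 3, theta / 3, 2 * (1 - theta) / 3, (1 - theta) / 3])
    return -np.sum(np.log(pmf[x]))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # ≈ 0.5, matching the analytic MLE
```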

Page 25: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Example: Estimating mean and variance in a normal population

Given a random sample X = (X1, . . . , Xn) of size n, where

Xi i.i.d.∼ N(µ, σ),

derive the Maximum Likelihood estimator for the mean and variance of a normal random variable.

Solution:

θ = (µ, σ²), Sθ = ℝ × (0, ∞)

We need to find

(µ̂, σ̂²) ∈ arg max_{(µ,σ²)} p(x | µ, σ²)

Notation: We write φ(x | µ, σ) for the pdf of an N(µ, σ)-distributed random variable, i.e.

φ(x | µ, σ) := (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

Page 27: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

θ̂ := (µ̂, σ̂²) ∈ arg max_{µ,σ²} p(x | µ, σ²)

  (i.i.d.) = arg max_{µ,σ²} ∏_{i=1}^n p(xi | µ, σ²)

  = arg max_{µ,σ²} ∏_{i=1}^n φ(xi | µ, σ)

  = arg max_{µ,σ²} ∑_{i=1}^n log φ(xi | µ, σ)

  = arg max_{µ,σ²} ∑_{i=1}^n log( (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²)) )

  = arg max_{µ,σ²} [ −(n/2)(log(2π) + log(σ²)) − ∑_{i=1}^n (xi − µ)²/(2σ²) ]

Page 33: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

To find the maximizer, we calculate

∂/∂µ [ −(n/2)(log(2π) + log(σ²)) − ∑_{i=1}^n (xi − µ)²/(2σ²) ] = ∑_{i=1}^n (xi − µ)/σ².

Similarly, setting v := σ² and taking the derivative yields

∂/∂σ² [ −(n/2)(log(2π) + log(σ²)) − ∑_{i=1}^n (xi − µ)²/(2σ²) ]
  = ∂/∂v [ −(n/2)(log(2π) + log(v)) − ∑_{i=1}^n (xi − µ)²/(2v) ]
  = −(n/2)(1/v) + (1/(2v²)) ∑_{i=1}^n (xi − µ)²
  = −(n/2)(1/σ²) + (1/(2σ⁴)) ∑_{i=1}^n (xi − µ)²

Page 34: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Setting both derivatives equal to 0 implies

∑_{i=1}^n (xi − µ)/σ² = 0 =⇒ µ̂ = (1/n) ∑_{i=1}^n xi = x̄n

−(n/2)(1/v) + (1/(2v²)) ∑_{i=1}^n (xi − µ̂)² = 0 =⇒ v̂ = σ̂² = (1/n) ∑_{i=1}^n (xi − µ̂)² = (1/n) ∑_{i=1}^n (xi − x̄n)²

Therefore, we obtain the estimators

µ̂ = (1/n) ∑_{i=1}^n Xi = X̄n

σ̂² = (1/n) ∑_{i=1}^n (Xi − X̄n)²

Note: Don't forget, estimators are random variables!
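A quick simulation sketch (assuming numpy; not from the slides) confirming that these formulas are just the sample mean and the 1/n-normalized sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # true µ = 2, σ² = 9

mu_hat = x.mean()                        # MLE of µ: the sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of σ²: (1/n) ∑ (xi − x̄)²

# np.var uses the same 1/n normalization by default (ddof=0)
print(mu_hat, sigma2_hat, np.var(x))
```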

Page 35: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Note:

E[µ̂] = E[(1/n) ∑_{i=1}^n Xi] = µ =⇒ µ̂ is unbiased.

But one can show that

E[σ̂²] = E[(1/n) ∑_{i=1}^n (Xi − X̄n)²] = ((n−1)/n) σ² =⇒ σ̂² is biased.

Observe: In this setting, S² := (1/(n−1)) ∑_{i=1}^n (Xi − X̄n)² is an unbiased estimator for σ².
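The bias factor (n − 1)/n can be seen in a Monte Carlo experiment; a sketch (numpy assumed), averaging both estimators over many samples of size n = 5:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = samples.var(axis=1, ddof=0)  # (1/n) ∑ (Xi − X̄n)²
s2 = samples.var(axis=1, ddof=1)          # (1/(n−1)) ∑ (Xi − X̄n)²

print(sigma2_mle.mean())  # ≈ ((n−1)/n) σ² = 3.2  (biased low)
print(s2.mean())          # ≈ σ² = 4.0            (unbiased)
```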

Page 36: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Some issues to consider:

1 How do we guarantee that the MLE exists?
2 How do we guarantee that the MLE is unique?
3 How do we guarantee that the calculation of the MLE is tractable?
4 Is the likelihood function convex (related to uniqueness)?
5 Boundary conditions?
6 Numerical sensitivity: in many cases the likelihood function is flat...

These are not statistical questions, but mathematical ones, from functional analysis, convex analysis, ...

Page 37: Mathematical Statistics - Nanyang Technological University

6.1 Maximum Likelihood Estimation 6.1.3 Maximum Likelihood Principle

Cramér-Rao Bound (CRLB)


Page 38: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.1 Introduction

Cramér-Rao Lower Bound (CRLB)

The Cramér-Rao Lower Bound (CRLB) sets a lower bound on the variance of any unbiased estimator. This can be extremely useful in several ways:

1 If we find an estimator that achieves the CRLB, then we know that we have found a Minimum Variance Unbiased Estimator (MVUE)!
2 The CRLB provides a benchmark against which we can compare the performance of any unbiased estimator (we know we're doing very well if our estimator is "close" to the CRLB).
3 The CRLB enables us to rule out impossible estimators: we know that it is physically impossible to find an unbiased estimator that beats the CRLB. This is useful in feasibility studies.
4 The theory behind the CRLB can tell us if an estimator exists which achieves the bound.

Page 42: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.1 Introduction

Cramér-Rao Lower Bound (CRLB)

Theorem: Cramér-Rao Lower Bound

If θ̂ is any unbiased estimator of θ based on the random sample X, then the variance of the error in the estimator is bounded from below by the inverse of the Fisher information I:

E[‖θ̂ − θ‖²] = Var(θ̂) ≥ I⁻¹,

where I is given by

I = −E[ d² log p(X | θ) / dθ² ].

Page 43: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.1 Introduction

Cramér-Rao Bound (CRLB)

Definition: Efficient Estimator
An unbiased estimator θ̂ is called efficient if

Var(θ̂) = I⁻¹

An efficient estimator is an unbiased estimator with the minimal possible variance.

Theorem: Sufficient condition for efficiency
If θ̂ is an unbiased estimator of θ and

∂ log p(Y | θ)/∂θ = c(θ) (θ̂ − θ),

then θ̂ is an efficient estimator.

Page 44: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Example
Suppose that X ∼ Bin(m, p), where m is known. The pmf is given by

p(x; p) = (m choose x) p^x (1−p)^(m−x),  x = 0, 1, . . . , m.

Find the CRLB.

Note: The range of X depends on m, but not on the unknown parameter p. Also, the sample size equals n = 1.

Page 45: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Solution:

The log-likelihood is given by

log p(x; p) = log (m choose x) + x log p + (m − x) log(1 − p)

The first derivative is given by

∂ log p(x; p)/∂p = x/p − (m − x)/(1 − p)

The second derivative is given by

∂² log p(x; p)/∂p² = −x/p² − (m − x)/(1 − p)²

Page 46: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Therefore the Fisher information I satisfies

I := −E[ −X/p² − (m − X)/(1 − p)² ] = E[X]/p² + (m − E[X])/(1 − p)²
   = mp/p² + (m − mp)/(1 − p)²
   = m/(p(1 − p))

It follows that the CRLB is given by

Var(p̂) ≥ I⁻¹ = p(1 − p)/m
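The natural unbiased estimator here is p̂ = X/m, whose variance is mp(1 − p)/m² = p(1 − p)/m, so it attains the bound exactly. A Monte Carlo sketch (numpy assumed; the efficiency claim is standard but not spelled out on the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, reps = 20, 0.3, 500_000

x = rng.binomial(m, p, size=reps)
p_hat = x / m  # unbiased estimator of p

crlb = p * (1 - p) / m
print(p_hat.var(), crlb)  # both ≈ 0.0105: p̂ attains the CRLB
```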

Page 47: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Cramér-Rao Bound (CRLB)

Example
Consider n observations such that

Yk = m + Wk,  k = 1, . . . , n,

where Wk i.i.d.∼ N(0, σ²).

1 Find the MLE for m.
2 Is m̂ an efficient estimator?

Page 49: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Cramér-Rao Bound (CRLB)

Solution:

1) As Yk i.i.d.∼ N(m, σ²), we know from Slide 18 that

m̂ = (∑_{i=1}^n Yi)/n = Ȳn

2) m̂ is unbiased, as E[m̂] = (1/n) ∑_{i=1}^n E[Yi] = m.
Moreover, from the calculation on Slides 16–17,

∂ log p(Y | m, σ²)/∂m = ∑_{i=1}^n (Yi − m)/σ² = (n/σ²) ( (1/n) ∑_{i=1}^n Yi − m ) = c (m̂ − m)

=⇒ m̂ is an efficient estimator.
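A simulation sketch (numpy assumed) of this efficiency statement: over many replications, the variance of m̂ = Ȳn matches the CRLB σ²/n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, sigma2, reps = 10, 1.5, 2.0, 200_000

y = m + rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
m_hat = y.mean(axis=1)  # the MLE Ȳn in each replication

print(m_hat.var(), sigma2 / n)  # both ≈ 0.2: Ȳn attains the CRLB σ²/n
```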

Page 50: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Properties of MLE


Page 51: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

The concept of MLE makes sense, but can we scientifically justify it?

Bad news: no optimum properties for finite samples.
Good news: it has a few attractive limiting properties.

Page 54: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Properties of MLE

What are the criteria for a "good" estimator?
- Unbiasedness.
- Consistency.
- (Asymptotic) normality.
- (Asymptotic) efficiency.

Page 58: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

The MLE satisfies the following 4 asymptotic properties (under some additional regularity and integrability conditions):

Consistency: the sequence of MLEs converges in probability to the value being estimated,

lim_{n→∞} P(|θ̂(n) − θ| > ε) = 0  ∀ε > 0.

Asymptotic unbiasedness: the MLE satisfies

lim_{n→∞} E(θ̂(n) − θ) = 0

Page 59: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Asymptotic normality: a consistent estimator is called asymptotically normal if for some σ²∞ > 0 the limiting distribution of √n (θ̂(n) − θ) is N(0, σ²∞), i.e.

√n (θ̂(n) − θ) →d N(0, σ²∞) as n → ∞.

Asymptotic efficiency: moreover, we call a consistent estimator asymptotically efficient if σ²∞ = I⁻¹, meaning that

√n (θ̂(n) − θ) →d N(0, I⁻¹) as n → ∞.
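These limits can be watched in simulation. A sketch (numpy assumed, not from the slides) for the Bernoulli MLE p̂ = X̄n, whose per-observation Fisher information is I = 1/(p(1 − p)):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 2_000, 0.3, 100_000

x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)        # the MLE of p in each replication
z = np.sqrt(n) * (p_hat - p)  # √n (θ̂(n) − θ)

# Asymptotic efficiency predicts Var(z) → I⁻¹ = p(1 − p)
print(z.mean(), z.var(), p * (1 - p))  # ≈ 0, ≈ 0.21, 0.21
```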

Page 60: Mathematical Statistics - Nanyang Technological University

6.2 Cramér-Rao Lower Bound 6.2.2 Examples

Method of Moments Estimator


Page 61: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Method of Moments Estimator

Facts:
- Moments give good (but not always full!) information about a distribution.
- If the distribution has bounded support, then the moments uniquely determine the law.

Idea:
=⇒ match sample moments with population moments

Theorem: Law of Large Numbers
Let X1, . . . , Xn be i.i.d. random variables with E[|X1|] < ∞ and denote the mean µ = E[X1]. Then

(1/n) ∑_{i=1}^n Xi → µ in probability.
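A quick illustration of the theorem (a sketch, numpy assumed): the running sample mean of i.i.d. Exp(1) draws drifts toward µ = 1.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=100_000)  # i.i.d. with mean µ = 1

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[9, 99, 9_999, 99_999]])  # approaches 1.0
```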

Page 62: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Method of Moments

Let X1, X2, . . . , Xn be a sample from a population with pdf or pmf p(x | θ1, θ2, . . . , θk). Let the unknown parameter θ = (θ1, θ2, . . . , θk) be k-dimensional.

The method of moments estimator is found by:
1) equating the first k sample moments to the corresponding k population moments,
2) solving the resulting system of simultaneous equations.

The k-th theoretical/population moment of this random variable is defined as

µk = E[X^k] = ∫ x^k p(x | θ1, θ2, . . . , θk) dx  if X is continuous,
µk = E[X^k] = ∑_x x^k p(x | θ1, θ2, . . . , θk)   if X is discrete.

If X1, X2, . . . , Xn are i.i.d. random variables from that distribution, the k-th sample moment is defined as

mk = (1/n) ∑_{i=1}^n Xi^k,

thus mk can be viewed as an estimator for µk. From the law of large numbers, we have mk → µk in probability as n → ∞.

Page 63: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Method of Moments:

E[X]   = (1/n) ∑_{i=1}^n Xi
E[X²]  = (1/n) ∑_{i=1}^n Xi²
  ...
E[X^k] = (1/n) ∑_{i=1}^n Xi^k

=⇒ Solve this set of k equations and find θ̂1, . . . , θ̂k.
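For a one-parameter model this system is a single equation, E_θ[X] = m1, which can be solved numerically when no closed form exists. A generic sketch (scipy assumed; mom_estimate and theoretical_mean are hypothetical names introduced here, not from the slides):

```python
import numpy as np
from scipy.optimize import brentq

def mom_estimate(data, theoretical_mean, lo, hi):
    """Solve theoretical_mean(theta) = sample mean for theta on (lo, hi)."""
    m1 = np.mean(data)
    return brentq(lambda theta: theoretical_mean(theta) - m1, lo, hi)

# Example: Exp(θ) has E[X] = 1/θ; here the true rate is θ = 0.5
data = np.random.default_rng(6).exponential(scale=2.0, size=10_000)
print(mom_estimate(data, lambda t: 1 / t, 1e-6, 100.0))  # ≈ 0.5
```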

Page 64: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Example
Suppose that X is a discrete random variable with the following probability mass function:

  x      0      1      2           3
  p(x)   2θ/3   θ/3    2(1−θ)/3    (1−θ)/3

where θ is a parameter in (0, 1). The following 10 independent observations were taken from such a distribution:
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1).

Find a point estimate of θ using the method of moments and the MLE.

Page 65: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Solution:

We have only a single parameter to estimate
=⇒ we need to calculate only the first moment.

The theoretical mean value is

E[X] = ∑_{x=0}^3 x p(x; θ) = 0 · (2θ/3) + 1 · (θ/3) + 2 · (2(1−θ)/3) + 3 · ((1−θ)/3) = 7/3 − 2θ

The sample mean is

x̄ = (1/n) ∑_{i=1}^n xi = (3 + 0 + 2 + 1 + 3 + 2 + 1 + 0 + 2 + 1)/10 = 1.5

We solve the single equation

7/3 − 2θ = 1.5

and find that θ̂ = 5/12.
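Checking this in a few lines of Python (a sketch, numpy assumed): solve 7/3 − 2θ = x̄ for θ.

```python
import numpy as np

x = np.array([3, 0, 2, 1, 3, 2, 1, 0, 2, 1])
theta_mom = (7 / 3 - x.mean()) / 2
print(theta_mom, 5 / 12)  # both ≈ 0.41666...
```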

Page 66: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

The likelihood function of X given the observations
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is

L(θ; x) = ∏_{i=1}^n p(xi | θ)
        = p(X = 3 | θ) p(X = 0 | θ) p(X = 2 | θ) p(X = 1 | θ) p(X = 3 | θ)
          × p(X = 2 | θ) p(X = 1 | θ) p(X = 0 | θ) p(X = 2 | θ) p(X = 1 | θ)
        = (2θ/3)² (θ/3)³ (2(1−θ)/3)³ ((1−θ)/3)²

θ̂ = arg max_{θ ∈ (0,1)} (2θ/3)² (θ/3)³ (2(1−θ)/3)³ ((1−θ)/3)²

Clearly, the likelihood function is not easy to maximize.
Let's look at the log-likelihood.

Page 67: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

The log-likelihood function of X given the observations
x = (x1, . . . , x10) = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) is

log L(θ) = log ∏_{i=1}^n p(xi | θ)
         = 2 (log(2/3) + log θ) + 3 (log(1/3) + log θ) + 3 (log(2/3) + log(1−θ)) + 2 (log(1/3) + log(1−θ))
         = Constant + 5 log θ + 5 log(1−θ)

Setting the derivative to 0 and solving:

d log L(θ)/dθ = 5 (1/θ − 1/(1−θ)) = 0

θ̂ = 0.5

(The Method of Moments yields θ̂ = 5/12, which is different from the MLE.)

Page 68: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Example

Use the Method of Moments to estimate the parameters µ and σ² for the normal density

p(x | µ, σ²) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))

based on an i.i.d. random sample X1, . . . , Xn.

Page 69: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Solution:

The first and second theoretical moments for the normal distribution are

µ1 = E[X] = µ
µ2 = E[X²] = σ² + µ².

The first and second sample moments are

m1 = (1/n) ∑_{i=1}^n Xi
m2 = (1/n) ∑_{i=1}^n Xi².

Page 70: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Solving the equations

µ = (1/n) ∑_{i=1}^n Xi
σ² + µ² = (1/n) ∑_{i=1}^n Xi²,

we obtain the Method of Moments estimators

µ̂ = (1/n) ∑_{i=1}^n Xi

σ̂² = (1/n) ∑_{i=1}^n Xi² − ((1/n) ∑_{i=1}^n Xi)² = (1/n) ∑_{i=1}^n Xi² − (X̄n)² = (1/n) ∑_{i=1}^n (Xi − X̄n)²

In this case the MLE and the MME yield the same estimators.
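The algebraic identity m2 − m1² = (1/n) ∑ (Xi − X̄n)² used in the last step is easy to verify numerically (numpy assumed):

```python
import numpy as np

x = np.random.default_rng(7).normal(size=1_000)
print(np.mean(x**2) - np.mean(x)**2)  # m2 − m1²
print(np.mean((x - x.mean())**2))     # (1/n) ∑ (xi − x̄)², equal up to rounding
```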

Page 71: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Example
Let X1, . . . , Xn be i.i.d. samples from a uniform distribution on the interval [a, b], that is

p(x | a, b) = 1/(b − a)  if a ≤ x ≤ b,
              0          otherwise.

Find the Method of Moments estimator for a, b.

Page 72: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

Solution:

The first two moments are

µ1 = E[X] = ∫_a^b x · 1/(b − a) dx = (a + b)/2

µ2 = E[X²] = ∫_a^b x² · 1/(b − a) dx = (a² + ab + b²)/3.

The corresponding sample moments are

m1 = (1/n) ∑_{i=1}^n Xi

m2 = (1/n) ∑_{i=1}^n Xi²

Page 73: Mathematical Statistics - Nanyang Technological University

6.3 Method of Moments

We solve the equations

µ1 = m1
µ2 = m2

and obtain

â = m1 − √(3 (m2 − m1²))

b̂ = m1 + √(3 (m2 − m1²))
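A sketch (numpy assumed) applying these formulas to simulated Uniform[2, 5] data:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(2.0, 5.0, size=10_000)

m1, m2 = x.mean(), np.mean(x**2)
half_width = np.sqrt(3 * (m2 - m1**2))  # √(3(m2 − m1²))
a_hat, b_hat = m1 - half_width, m1 + half_width
print(a_hat, b_hat)  # ≈ 2 and ≈ 5
```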

Page 74: Mathematical Statistics - Nanyang Technological University

6.4 Examples: MLE and Methods of Moments

Example
Let X1, . . . , Xn be i.i.d. samples from a beta distribution (X ∼ β(θ, 1)) with pdf

p(x | θ) = θ x^(θ−1),  0 ≤ x ≤ 1,  0 < θ < ∞

1 Find the MLE for θ.
2 Find the Method of Moments estimator for θ.

Page 75: Mathematical Statistics - Nanyang Technological University

6.4 Examples: MLE and Methods of Moments

The likelihood function is given by

p(x | θ) = ∏_{i=1}^n θ xi^(θ−1) = θ^n ∏_{i=1}^n xi^(θ−1) = θ^n (∏_{i=1}^n xi)^(θ−1)

The derivative of its logarithm is given by

d/dθ log p(x | θ) = d/dθ log( θ^n (∏_{i=1}^n xi)^(θ−1) )
                  = d/dθ ( n log θ + (θ − 1) ∑_{i=1}^n log xi )
                  = n/θ + ∑_{i=1}^n log xi

Page 76: Mathematical Statistics - Nanyang Technological University

6.4 Examples: MLE and Methods of Moments

Set the derivative equal to zero, solve for θ, and replace xi by Xi to obtain

θ̂ = −n / ∑_{i=1}^n log Xi

Is this the maximum? Let's calculate the second derivative:

d²/dθ² log p(x | θ) = d/dθ ( n/θ + ∑_{i=1}^n log xi ) = −n/θ² < 0,

so this is the MLE.

Page 77: Mathematical Statistics - Nanyang Technological University

6.4 Examples: MLE and Methods of Moments

The Method of Moments for θ:

The first moment of X ∼ β(θ, 1) is

E[X] = θ/(θ + 1)

The first sample moment is

m1 = (1/n) ∑_{i=1}^n Xi

We solve the equation

θ/(θ + 1) = (1/n) ∑_{i=1}^n Xi

which yields

θ̂ = ∑_{i=1}^n Xi / (n − ∑_{i=1}^n Xi)
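A sketch (numpy assumed) comparing the two estimators on simulated β(θ, 1) data:

```python
import numpy as np

rng = np.random.default_rng(9)
theta, n = 3.0, 10_000
x = rng.beta(theta, 1.0, size=n)  # density θ x^(θ−1) on [0, 1]

theta_mle = -n / np.sum(np.log(x))   # MLE: −n / ∑ log Xi
theta_mom = x.sum() / (n - x.sum())  # MoM: ∑ Xi / (n − ∑ Xi)
print(theta_mle, theta_mom)          # both ≈ 3
```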

Page 78: Mathematical Statistics - Nanyang Technological University

6.4 Examples: MLE and Methods of Moments

Objectives

Now you should be able to:
- Understand the likelihood principle
- Understand how to formulate the MLE procedure
- Apply the CRLB
- Understand how to formulate the Method of Moments estimation procedure

Put yourself to the test! Q7.1 p.355, Q7.2 p.355, Q7.6 p.355, Q7.8 p.355, Q7.10 p.355, Q7.15 p.355