PHIL 6334 - Probability/Statistics Lecture Notes 3:
Estimation (Point and Interval)
Aris Spanos [Spring 2014]
1 Introduction
In this lecture we will consider point estimation in its simplest
form by focusing the discussion on simple statistical models,
whose generic form is given in table 1.
Table 1 — Simple (generic) Statistical Model
[i] Probability model: Φ = {f(x; θ), θ ∈ Θ, x ∈ R_X}
[ii] Sampling model: X := (X₁, X₂, ..., Xₙ) is a random (IID) sample.
What makes this type of statistical model ‘simple’ is the no-
tion of random (IID) sample.
1.1 Random sample (IID)
The notion of a random sample is defined in terms of the joint
distribution of the sample X := (X₁, X₂, ..., Xₙ), say f(x₁, x₂, ..., xₙ; θ),
for all x := (x₁, x₂, ..., xₙ) ∈ Rⁿ_X, by imposing two probabilistic
assumptions:
(I) Independence: the sample X is said to be Independent (I) if,
for all x ∈ Rⁿ_X, the joint distribution splits up into a product
of marginal distributions:
f(x; θ) = f₁(x₁; θ₁)·f₂(x₂; θ₂)· ··· ·fₙ(xₙ; θₙ) := ∏ₖ₌₁ⁿ fₖ(xₖ; θₖ).
(ID) Identically Distributed: the sample X is said to be
Identically Distributed (ID) if the marginal distributions
are identical:
fₖ(xₖ; θₖ) = f(xₖ; θ), for all k = 1, 2, ..., n.
Note that this means two things: the density functions have
the same form and the unknown parameters are common to
all of them.
For a better understanding of these two crucial probabilistic
assumptions we need to simplify the discussion by focusing
first on the two random variable case, where we denote the
r.v.'s by X and Y to avoid subscripts.
First, let us revisit the notion of a random variable in order
to motivate the notions of marginal and joint distributions.
Example 5. Tossing a coin twice and noting the outcome.
In this case S = {(HH), (HT), (TH), (TT)}, and let us assume
that the events of interest are A = {(HH), (HT), (TH)} and
B = {(HT), (TH), (TT)}. Using these two events we can generate
the event space of interest F by applying the set-theoretic
operations of union (∪), intersection (∩), and complementation (−).
That is, F = {S, ∅, A, B, A̅, B̅, A∩B, ...}; convince yourself that
this will give rise to the set of all subsets of S. Let us define
the real-valued functions X(·) and Y(·) on S as follows:
X(HH) = X(HT) = X(TH) = 1, X(TT) = 0,
Y(HT) = Y(TH) = Y(TT) = 1, Y(HH) = 0.
Do these two functions define proper r.v.'s with respect to F?
To check, we define all possible events generated by these
functions and check whether they belong to F:
{s: X(s)=0} = {(TT)} = A̅ ∈ F, {s: X(s)=1} = A ∈ F,
{s: Y(s)=0} = {(HH)} = B̅ ∈ F, {s: Y(s)=1} = B ∈ F.
Hence, both functions do define proper r.v.'s with respect to F.
To derive their distributions we assume that we have a fair coin,
i.e. each outcome in S has probability .25 of occurring. Hence:
P({s: X(s)=0}) = P(X=0) = .25, P({s: X(s)=1}) = P(X=1) = .75,
P({s: Y(s)=0}) = P(Y=0) = .25, P({s: Y(s)=1}) = P(Y=1) = .75.
Hence, their ‘marginal’ density functions take the form:

x       0    1          y       0    1
f(x)   .25  .75         f(y)   .25  .75        (1)
How can one define the joint distribution of these two r.v.'s?
To define the joint density function we need to specify all the
events:
(X = x, Y = y), x ∈ R_X, y ∈ R_Y,
denoting ‘their joint occurrence’, and then attach probabilities
to these events. These events belong to F by definition,
because F as a field is closed under the set-theoretic operations
∪, ∩, −, so that:
{X=0, Y=0} = ∅,            P(X=0, Y=0) = 0,
{X=0, Y=1} = {(TT)},       P(X=0, Y=1) = .25,
{X=1, Y=0} = {(HH)},       P(X=1, Y=0) = .25,
{X=1, Y=1} = {(HT),(TH)},  P(X=1, Y=1) = .50.
Hence, the joint density is defined by:

y\x     0     1
0       0    .25
1      .25   .50        (2)
How is the joint density (2) connected to the individual (marginal)
densities given in (1)? It turns out that if we sum across the rows
of the above table for each value of y, i.e. use
∑_{x∈R_X} f(x, y) = f(y), we will get the marginal distribution
of Y: f(y), y ∈ R_Y, and if we sum down the columns for each
value of x, i.e. use ∑_{y∈R_Y} f(x, y) = f(x), we will get the
marginal distribution of X: f(x), x ∈ R_X:

y\x     0     1    f(y)
0       0    .25   .25
1      .25   .50   .75
f(x)   .25   .75    1        (3)

Note: E(X) = 0(.25) + 1(.75) = .75 = E(Y),
Var(X) = (0−.75)²(.25) + (1−.75)²(.75) = .1875 = Var(Y).
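The row/column bookkeeping above is easy to mechanize. As an illustrative sketch (not part of the original notes; the array layout and variable names are my own), the joint density in (3) can be stored as a NumPy array and the marginals and moments recovered by summing:

```python
import numpy as np

# Joint density f(x, y) from table (3): rows index y, columns index x.
joint = np.array([[0.00, 0.25],
                  [0.25, 0.50]])
vals = np.array([0, 1])          # values taken by both X and Y

f_x = joint.sum(axis=0)          # sum down each column (over y) -> f(x)
f_y = joint.sum(axis=1)          # sum across each row (over x)  -> f(y)

E_x = (vals * f_x).sum()                   # E(X) = .75
Var_x = ((vals - E_x) ** 2 * f_x).sum()    # Var(X) = .1875

print(f_x, f_y)   # both [0.25 0.75]
print(E_x, Var_x)
```

The same two lines of `sum` reproduce the "sum the rows / sum the columns" rule for any finite bivariate table.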
Armed with the joint distribution we can proceed to define
the notions of Independence and Identically Distributed for
the r.v.'s X and Y.
Independence. Two r.v.'s X and Y are said to be Independent iff:
f(x, y) = f(x)·f(y), for all values (x, y) ∈ R_X × R_Y.  (4)
That is, to verify that these two r.v.'s are independent we need
to confirm that the probability of all possible pairs of values
(x, y) satisfies (4).
Example. In the case of the joint distribution in (3)
we can show that the r.v.'s are not independent because for
(x, y) = (0, 0):
f(0, 0) = 0 ≠ f(0)·f(0) = (.25)(.25).
It is important to emphasize that the above condition of
Independence is not equivalent to the two random variables
being uncorrelated:
Corr(X, Y) = 0 ⇏ f(x, y) = f(x)·f(y) for all (x, y) ∈ R_X × R_Y,
where ‘⇏’ denotes ‘does not imply’. This is because Corr(X, Y)
is a measure of linear dependence between X and Y, since it
is based on the covariance defined by:
Cov(X, Y) = E[(X − E(X))(Y − E(Y))] =
= (0)(0−.75)(0−.75) + (.25)(0−.75)(1−.75) +
+ (.25)(1−.75)(0−.75) + (.50)(1−.75)(1−.75) = −.0625.
A standardized covariance yields the correlation:
Corr(X, Y) = Cov(X, Y)/√(Var(X)·Var(Y)) = −.0625/.1875 = −1/3.
The intuition underlying this result is that the correlation involves
only the first two moments [mean, variance, covariance]
of X and Y, but independence is defined in terms of the density
functions; the latter, in principle, involves all moments,
not just the first two!
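Both checks, independence cell-by-cell and the covariance/correlation computation, can be replicated numerically. A minimal sketch, assuming the same table (3) layout as before (the encoding and names are mine, not from the notes):

```python
import numpy as np

# Joint density f(x, y) from table (3): rows index y, columns index x.
joint = np.array([[0.00, 0.25],
                  [0.25, 0.50]])
vals = np.array([0, 1])

f_x, f_y = joint.sum(axis=0), joint.sum(axis=1)

# Independence requires f(x,y) = f(x)*f(y) in every cell; here f(0,0)=0 != .0625.
independent = np.allclose(joint, np.outer(f_y, f_x))

E_x, E_y = (vals * f_x).sum(), (vals * f_y).sum()
cov = sum(joint[j, i] * (vals[i] - E_x) * (vals[j] - E_y)
          for i in range(2) for j in range(2))
corr = cov / np.sqrt(((vals - E_x) ** 2 * f_x).sum()
                     * ((vals - E_y) ** 2 * f_y).sum())

print(independent)   # False
print(cov, corr)     # -0.0625, -0.333...
```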
Identically Distributed. Two r.v.'s X and Y are said
to be Identically Distributed iff:
f_X(x; θ) = f_Y(y; θ), for all x = y, (x, y) ∈ R_X × R_Y.  (5)
Example. In the case of the joint distribution in (3) we
can show that the r.v.'s are identically distributed because (5)
holds. In particular, both r.v.'s X and Y take the same values
with the same probabilities.
To shed further light on the notion of IID, consider the
three bivariate distributions given below.

y\x     1    2   f(y)
0     .18  .42   .6
2     .12  .28   .4
f(x)   .3   .7    1        (A)

y\x     0    1   f(y)
0     .18  .42   .6
1     .12  .28   .4
f(x)   .3   .7    1        (B)

y\x     0    1   f(y)
0     .36  .24   .6
1     .24  .16   .4
f(x)   .6   .4    1        (C)
(I) X and Y are Independent iff:
f(x, y) = f(x)·f(y), for all (x, y) ∈ R_X × R_Y.  (6)
(ID) X and Y are Identically Distributed iff:
f_X(x) = f_Y(y), for all x = y, (x, y) ∈ R_X × R_Y, and R_X = R_Y.
The random variables X and Y are independent in all three
cases since they satisfy (4) (verify!).
The random variables in (A) are not Identically Distributed
because R_X ≠ R_Y and f_X(x) ≠ f_Y(y) for some (x, y) ∈ R_X × R_Y.
The random variables in (B) are not Identically Distributed
because, even though R_X = R_Y, f_X(x) ≠ f_Y(y) for some
(x, y) ∈ R_X × R_Y.
Finally, the random variables in (C) are Identically Distributed
because R_X = R_Y and f_X(x) = f_Y(y) for all (x, y) ∈ R_X × R_Y.
2 Point Estimation: an overview
It turns out that all forms of frequentist inference, which include
point and interval estimation, hypothesis testing and
prediction, are defined in terms of two sets:
X — sample space: the set of all possible values of the sample X,
Θ — parameter space: the set of all possible values of θ.
Note that the sample space X is always a subset of Rⁿ and
denoted by Rⁿ_X.
In estimation the objective is to use the statistical information
to infer the ‘true’ value θ* of the unknown parameter θ,
whatever that happens to be, as long as it belongs to Θ.
In general, an estimator θ̂ of θ is a mapping (function) from the
sample space to the parameter space:
θ̂(·): X → Θ.  (7)
Example 1. Let the statistical model of interest be the
simple Bernoulli model (table 2) and consider the question of
estimating the unknown parameter θ, whose parameter space
is Θ := [0, 1]. Note that the sample space is X := {0, 1}ⁿ.

Table 2 - Simple Bernoulli Model
Statistical GM: Xₖ = θ + uₖ, k ∈ N.
[1] Bernoulli: Xₖ ∼ Ber(θ), xₖ = 0, 1
[2] constant mean: E(Xₖ) = θ, k ∈ N
[3] constant variance: Var(Xₖ) = θ(1−θ), k ∈ N
[4] Independence: {Xₖ, k ∈ N} is an independent process

The notation θ̂(X) is used to denote an estimator in order
to bring out the fact that it is a function of the sample
X, and for different values of θ it generates the sampling
distribution f(θ̂(x); θ), for x ∈ X. Post-data, θ̂(X) yields an
estimate θ̂(x₀), which constitutes a particular value of θ̂(X)
corresponding to data x₀. Crucial distinction: θ̂(X) - estimator
(Plato's world), θ̂(x₀) - estimate (real world), and θ - unknown
constant (Plato's world); Fisher (1922).
In light of the definition in (7), which of the following mappings
constitute potential estimators of θ?

Table 3: Estimators of θ?
[a] θ̂₁(X) = X₁
[b] θ̂₂(X) = X₁ − Xₙ
[c] θ̂₃(X) = (X₁ + Xₙ)/2
[d] θ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ, for n ≥ 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))∑ₖ₌₁ⁿ Xₖ

Do the mappings [a]-[e] in table 3 constitute estimators of
θ? All five functions [a]-[e] have X as their domain, but is the
range of each mapping a subset of Θ := [0, 1]? Mappings [a],
[c]-[e] can be possible estimators of θ because their ranges are
subsets of [0, 1], but [b] cannot because it can take the
value −1 [ensure you understand why!], which lies outside the
parameter space of θ.
One can easily think of many more functions from X to Θ
that will qualify as possible estimators of θ. Given the plethora
of such possible estimators, how does one decide which one is
the most appropriate?
To answer that question let us think about the possibility
of an ideal estimator, θ*(·): X → {θ*}, i.e., θ*(x) = θ* for
all values x ∈ X. That is, θ*(X) pinpoints the true value θ*
of θ whatever the data. A moment's reflection reveals that
no such estimator could exist, because X is a random vector
with its own distribution f(x; θ), for all x ∈ X. Moreover, in
view of the randomness of X, any mapping of the form (7)
will be a random variable with its own sampling distribution,
f(θ̂(x); θ), which is directly derivable from f(x; θ).
Let us take stock of these distributions and where they
come from. The distribution of the sample, f(x; θ) for all x ∈ X,
is given by the assumptions of the statistical model in question.
In the above case of the simple Bernoulli model, we can
combine assumptions [2]-[4] to give us:
f(x; θ) = ∏ₖ₌₁ⁿ f(xₖ; θ),
and then use [1]: f(xₖ; θ) = θ^xₖ(1−θ)^(1−xₖ), xₖ = 0, 1,
k = 1, 2, ..., n, to determine f(x; θ):
f(x; θ) = θ^(∑ₖ₌₁ⁿ xₖ)·(1−θ)^(n−∑ₖ₌₁ⁿ xₖ) = θ^y(1−θ)^(n−y),
where y = ∑ₖ₌₁ⁿ xₖ, and one can show that:
Y = ∑ₖ₌₁ⁿ Xₖ ∼ Bin(nθ, nθ(1−θ)),  (8)
i.e. Y is Binomially distributed, with the first argument denoting
its mean and the second its variance. Note that the means and
variances are derived using the two formulae (for independent
X₁, X₂, ..., Xₙ):
(i) E(a₁X₁ + a₂X₂ + ··· + aₙXₙ) = a₁E(X₁) + a₂E(X₂) + ··· + aₙE(Xₙ),
(ii) Var(a₁X₁ + a₂X₂ + ··· + aₙXₙ) = a₁²Var(X₁) + a₂²Var(X₂) + ··· + aₙ²Var(Xₙ).  (9)
To derive the mean and variance of Y:
E(Y) = E(∑ₖ₌₁ⁿ Xₖ) =(i)= ∑ₖ₌₁ⁿ E(Xₖ) = ∑ₖ₌₁ⁿ θ = nθ,
Var(Y) = Var(∑ₖ₌₁ⁿ Xₖ) =(ii)= ∑ₖ₌₁ⁿ Var(Xₖ) = ∑ₖ₌₁ⁿ θ(1−θ) = nθ(1−θ).
The result in (8) is a special case of a general result.
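One can corroborate (8) by simulation. A small sketch (the values n = 20, θ = .3 and the seed are my own choices, not from the notes), drawing many replications of Y = ∑ₖ₌₁ⁿ Xₖ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 20, 0.3, 200_000

# Each row is one IID Bernoulli(theta) sample of size n; Y is the row sum.
Y = rng.binomial(1, theta, size=(reps, n)).sum(axis=1)

print(Y.mean())  # close to n*theta = 6.0
print(Y.var())   # close to n*theta*(1-theta) = 4.2
```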
The sampling distribution of any (well-behaved) function
of the sample, say Y = h(X₁, X₂, ..., Xₙ), can be derived from
f(x; θ), x ∈ X, using the formula:
F(y) = P(Y ≤ y) = ∫···∫_{x: h(x)≤y} f(x; θ)dx, y ∈ R.  (10)
In the Bernoulli case, all the estimators [a], [c]-[e] are linear
functions of (X₁, X₂, ..., Xₙ) and thus, by (8), their distribution
is Binomial. In particular:

Table 4: Estimators and their sampling distributions
[a] θ̂₁(X) = X₁ ∼ Ber(θ, θ(1−θ))
[c] θ̂₃(X) = (X₁ + Xₙ)/2 ∼ Bin(θ, θ(1−θ)/2)
[d] θ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ Bin(θ, θ(1−θ)/n), for n ≥ 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))∑ₖ₌₁ⁿ Xₖ ∼ Bin(nθ/(n+1), nθ(1−θ)/(n+1)²)
(11)

It is important to emphasize at the outset that the sampling
distributions in [a]-[e] are evaluated under θ = θ*, where θ* is
the true value of θ.
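The means and variances in table 4 can be checked by simulation. An illustrative sketch (n = 50, θ = .3 chosen arbitrarily; names are mine) estimating each estimator's mean and variance from many samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 50, 0.3, 200_000
X = rng.binomial(1, theta, size=(reps, n))

est = {
    "[a] X1":         X[:, 0].astype(float),     # mean theta, var theta(1-theta)
    "[c] (X1+Xn)/2":  (X[:, 0] + X[:, -1]) / 2,  # mean theta, var theta(1-theta)/2
    "[d] (1/n)sum":   X.mean(axis=1),            # mean theta, var theta(1-theta)/n
    "[e] sum/(n+1)":  X.sum(axis=1) / (n + 1),   # mean n*theta/(n+1): biased
}
for name, draws in est.items():
    print(name, round(draws.mean(), 3), round(draws.var(), 5))
```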
It is clear that none of the sampling distributions of the
estimators in table 4 resembles that of the ideal estimator,
θ*(X), whose sampling distribution, if it existed, would be of
the form:
[i] P(θ*(X) = θ*) = 1.  (12)
In terms of its first two moments, the ideal estimator satisfies
[ii] E(θ*(X)) = θ* and [iii] Var(θ*(X)) = 0. In contrast to the
(infeasible) ideal estimator in (12), when the estimators in
table 4 infer θ using an outcome x, the inference is always
subject to some error because the variance is not zero. The
sampling distributions of these estimators provide the basis
for evaluating such errors.
In the statistics literature the evaluation of inferential
errors in estimation is accomplished in two interconnected
stages.
The objective of the first stage is to narrow down the set
of all possible estimators of θ to an optimal subset, where
optimality is assessed by how closely the sampling distribution
of an estimator approximates that of the ideal estimator in
(12); the subject matter of section 3.
The second stage is concerned with using optimal estimators
to construct the shortest Confidence Intervals (CI)
for the unknown parameter θ, based on prespecifying the error
of covering (encompassing) θ* within a random interval of
the form (L(X), U(X)); the subject matter of section 4.
3 Properties of point estimators
As mentioned above, the notion of an optimal estimator can
be motivated by how well the sampling distribution of an
estimator θ̂(X) approximates that of the ideal estimator in (12).
In particular, the three features of the ideal estimator [i]-[iii]
motivate the following optimal properties of feasible estimators.
Condition [ii] motivates the property known as:
[I] Unbiasedness: An estimator θ̂(X) is said to be an unbiased
estimator of θ if:
E(θ̂(X)) = θ*.  (13)
That is, the mean of the sampling distribution of θ̂(X) coincides
with the true value of the unknown parameter θ.
Example. In the case of the simple Bernoulli model,
we can see from table 4 that the estimators θ̂₁(X), θ̂₃(X)
and θ̂ₙ(X) are unbiased, since in all three cases (13) is satisfied.
In contrast, the estimator θ̂ₙ₊₁(X) is not unbiased because
E(θ̂ₙ₊₁(X)) = nθ/(n+1) ≠ θ.
Condition [iii] motivates the property known as:
[II] Full Efficiency: An unbiased estimator θ̂(X) is said
to be a fully efficient estimator of θ if its variance is as small
as it can be, where the latter is expressed by:
Var(θ̂(X)) = CR(θ) := [E(−∂² ln f(x; θ)/∂θ²)]⁻¹,
where ‘CR(θ)’ stands for the Cramer-Rao lower bound; note
that f(x; θ) is given by the assumed model.
Example (the derivations are not important!). In the case
of the simple Bernoulli model:
ln f(x; θ) = y ln θ + (n−y) ln(1−θ), where y = ∑ₖ₌₁ⁿ xₖ, E(Y) = nθ,
∂ ln f(x; θ)/∂θ = y(1/θ) − (n−y)(1/(1−θ)),
∂² ln f(x; θ)/∂θ² = −y(1/θ²) − (n−y)(1/(1−θ))²,
E(−∂² ln f(x; θ)/∂θ²) = (1/θ²)(nθ) + [n − nθ](1/(1−θ))² = n/(θ(1−θ)),
and thus the Cramer-Rao lower bound is:
CR(θ) := θ(1−θ)/n.
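The Fisher-information calculation above can be sanity-checked numerically. A sketch (n = 40, θ = .3 and the seed are my own choices) that averages −∂² ln f/∂θ² over simulated samples:

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 40, 0.3, 200_000

y = rng.binomial(n, theta, size=reps)   # number of successes in each sample

# -d^2 ln f(x; theta)/d theta^2 = y/theta^2 + (n - y)/(1 - theta)^2
neg_hessian = y / theta**2 + (n - y) / (1 - theta) ** 2

fisher_info = neg_hessian.mean()
print(fisher_info)       # close to n/(theta*(1-theta)) = 190.476...
print(1 / fisher_info)   # close to CR(theta) = theta*(1-theta)/n
```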
Looking at the estimators of θ in (11), it is clear that only
one unbiased estimator achieves that bound, θ̂ₙ(X). Hence,
θ̂ₙ(X) is the only estimator of θ which is both unbiased and
fully efficient.
Comparisons between unbiased estimators can be made in
terms of relative efficiency:
Var(θ̂₃(X)) < Var(θ̂₁(X)), for n ≥ 2,
asserting that θ̂₃(X) is relatively more efficient than θ̂₁(X).
But one needs to be careful with such comparisons because
they can be very misleading when both estimators are bad,
as in the case above; the fact that θ̂₃(X) is relatively more
efficient than θ̂₁(X) does not mean that the former is even
an adequate estimator. Hence, relative efficiency is not
something to write home about!
What renders these two estimators practically useless?
An asymptotic property motivated by condition [i] of
the ideal estimator, known as consistency.
Intuitively, an estimator θ̂ₙ(X) is consistent when its precision
(how close to θ* it is) improves as the sample size n increases.
Condition [i] of the ideal estimator motivates the property
known as:
[III] Consistency: an estimator θ̂ₙ(X) is consistent if:
Strong: P(limₙ→∞ θ̂ₙ(X) = θ*) = 1,
Weak: limₙ→∞ P(|θ̂ₙ(X) − θ*| ≤ ε) = 1, for any ε > 0.  (14)
That is, an estimator θ̂ₙ(X) is consistent if it approximates
(probabilistically) the sampling distribution of the ideal
estimator asymptotically, as n → ∞. The difference between
strong and weak consistency stems from the form of probabilistic
convergence they involve, with the former being stronger
than the latter. Both of these properties constitute an extension
of the Strong and Weak Law of Large Numbers (LLN),
which hold for the sample mean X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ of a process
{Xₖ, k = 1, 2, ...}, under certain probabilistic assumptions,
the most restrictive being that the process is IID; see Spanos
(1999), ch. 8.
[Fig. 1: t-plot of the sample average X̄ₙ for a Bernoulli IID realization with n = 200]
[Fig. 2: t-plot of the sample average X̄ₙ for a Bernoulli IID realization with n = 1000]
In practice, it is non-trivial to prove that a particular
estimator is consistent or not by verifying directly the conditions
in (14). However, there is often a short-cut for verifying
consistency in the case of unbiased estimators, using the
sufficient condition:
limₙ→∞ Var(θ̂ₙ(X)) = 0.  (15)
Example. In the case of the simple Bernoulli model, one
can verify that the estimators θ̂₁(X) and θ̂₃(X) are inconsistent
because:
limₙ→∞ Var(θ̂₁(X)) = θ(1−θ) ≠ 0, limₙ→∞ Var(θ̂₃(X)) = θ(1−θ)/2 ≠ 0,
i.e. their variances do not decrease to zero as the sample size
goes to infinity.
In contrast, the estimators θ̂ₙ(X) and θ̂ₙ₊₁(X) are consistent
because:
limₙ→∞ Var(θ̂ₙ(X)) = limₙ→∞ θ(1−θ)/n = 0, limₙ→∞ MSE(θ̂ₙ₊₁(X)) = 0.
Note that ‘MSE’ denotes the ‘Mean Square Error’, defined by:
MSE(θ̂; θ*) = Var(θ̂) + [B(θ̂; θ*)]²,
where B(θ̂; θ*) = E(θ̂) − θ* is the bias. Hence:
limₙ→∞ MSE(θ̂) = 0 if (a) limₙ→∞ Var(θ̂) = 0 and (b) limₙ→∞ E(θ̂) = θ*,
where (b) is equivalent to limₙ→∞ B(θ̂; θ*) = 0.
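The contrast between a consistent and an inconsistent estimator shows up clearly in simulated MSEs. A sketch (θ = .3 and the sample sizes are my own choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, reps = 0.3, 10_000
mse_n, mse_3 = {}, {}

for n in (10, 100, 1000):
    X = rng.binomial(1, theta, size=(reps, n))
    t_n = X.mean(axis=1)               # theta_hat_n: consistent
    t_3 = (X[:, 0] + X[:, -1]) / 2     # theta_hat_3: variance never shrinks
    mse_n[n] = ((t_n - theta) ** 2).mean()
    mse_3[n] = ((t_3 - theta) ** 2).mean()
    print(n, round(mse_n[n], 5), round(mse_3[n], 5))

# mse_n falls roughly like theta*(1-theta)/n; mse_3 stays near .105 for all n.
```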
Let us take stock of the above properties and how they
can be used by the practitioner in deciding which estimator is
optimal. The property which defines minimal reliability for
an estimator is that of consistency. Intuitively, consistency
indicates that as the sample size increases [as n → ∞] the
estimator θ̂ₙ(X) approaches θ*, the true value of θ, in some
probabilistic sense: convergence almost surely or convergence
in probability. Hence, if an estimator θ̂ₙ(X) is not consistent,
it is automatically excluded from the subset of potentially
optimal estimators, irrespective of any other properties this
estimator might enjoy. In particular, an unbiased estimator
which is inconsistent is practically useless. On the other hand,
just because an estimator θ̂ₙ(X) is consistent does not imply
that it's a ‘good’ estimator; it only implies that it's minimally
acceptable.
It is important to emphasize that the properties of unbiasedness
and full efficiency hold for any sample size n ≥ 1,
and thus we call them finite sample properties, but consistency
is an asymptotic property because it holds as n → ∞.
Example. In the case of the simple Bernoulli model, if
the choice between estimators is confined (artificially) to
the estimators θ̂₁(X), θ̂₃(X) and θ̂ₙ₊₁(X), the latter estimator
should be chosen, despite being biased, because it's a consistent
estimator of θ. On the other hand, among the estimators
given in table 4, θ̂ₙ(X) is clearly the best because it satisfies
all three properties. In particular, θ̂ₙ(X) not only satisfies
the minimal property of consistency, but it also
has the smallest variance possible, which means that it comes
closer to the ideal estimator than any of the others, for any
sample size n > 2. The sampling distribution of θ̂ₙ(X), when
evaluated under θ = θ*, takes the form:
[d] θ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ Bin(θ*, θ*(1−θ*)/n) under θ = θ*,  (16)
whatever the ‘true’ value θ* happens to be.
Additional Asymptotic properties
In addition to the properties of estimators mentioned above,
there are certain other properties which are often used in
practice to decide on the optimality of an estimator. The most
important is given below for completeness.
[V] Asymptotic Normality: an estimator θ̂ₙ(X) is said
to be asymptotically Normal if:
√n(θ̂ₙ(X) − θ) ≃ N(0, V∞(θ)), V∞(θ) ≠ 0,  (17)
where ‘≃’ stands for ‘can be asymptotically approximated by’.
This property is an extension of a well-known result in
probability theory: the Central Limit Theorem (CLT).
The CLT asserts that, under certain probabilistic assumptions
on the process {Xₖ, k = 1, 2, ...}, the most restrictive
being that the process is IID, the sampling distribution of
X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ for a ‘large enough’ n can be approximated
by the Normal distribution (Spanos, 1999, ch. 8):
(X̄ₙ − E(X̄ₙ))/√Var(X̄ₙ) ≃ N(0, 1).  (18)
Note that the important difference between (17) and (18) is
that θ̂ₙ(X) in the former does not have to coincide with X̄ₙ;
it can be any well-behaved function g(X) of the sample X.
Example. In the case of the simple Bernoulli model, the
sampling distribution of θ̂ₙ(X), which we know is Binomial
(see (16)), can also be approximated using (18). In the
graphs below we compare the Normal approximation to the
Binomial for n = 10 and n = 20 in the case where θ = .5, and the
improvement is clearly noticeable.

[Fig.: Normal approx. of Bin(θ = .5, n = 10): Binomial(n = 10, p = .5) vs Normal(Mean = 5, StDev ≈ 1.58)]
[Fig.: Normal approx. of Bin(θ = .5, n = 20): Binomial(n = 20, p = .5) vs Normal(Mean = 10, StDev = 2.236)]
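The quality of the Normal approximation in the two panels can be quantified directly. A stdlib-only sketch (the error metric and helper names are mine) comparing the Bin(n, .5) pmf with the N(nθ, nθ(1−θ)) density at the integer points:

```python
import math

def binom_pmf(y, n, p):
    # Exact Binomial probability P(Y = y)
    return math.comb(n, y) * p**y * (1 - p) ** (n - y)

def norm_pdf(x, mu, sigma):
    # Normal density with mean mu, standard deviation sigma
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

theta, max_err = 0.5, {}
for n in (10, 20):
    mu, sigma = n * theta, math.sqrt(n * theta * (1 - theta))
    max_err[n] = max(abs(binom_pmf(y, n, theta) - norm_pdf(y, mu, sigma))
                     for y in range(n + 1))
    print(n, round(max_err[n], 4))   # the n = 20 error is the smaller of the two
```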
4 Confidence Intervals (CIs): an overview
4.1 An optimal CI begins with an optimal point estimator
Example 2. Let us summarize the discussion concerning
point estimation by briefly discussing the simple (one
parameter) Normal model, where σ² = 1 (table 5).

Table 5 - Simple Normal Model (one unknown parameter)
Statistical GM: Xₖ = μ + uₖ, k ∈ N := {1, 2, ...}
[1] Normality: Xₖ ∼ N(μ, σ²), xₖ ∈ R
[2] Constant mean: E(Xₖ) = μ, k ∈ N
[3] Constant variance: Var(Xₖ) = σ² (known), k ∈ N
[4] Independence: {Xₖ, k ∈ N} is an independent process
In section 3 we discussed the question of choosing among
numerous possible estimators of μ, such as [a]-[e] (table 6), using
their sampling distributions. These results stem from the following
theorem. If X := (X₁, X₂, ..., Xₙ) is a random (IID)
sample from the Normal distribution, i.e.
Xₖ ∼ NIID(μ, σ²), k ∈ N := (1, 2, ..., n, ...),
then the sampling distribution of ∑ₖ₌₁ⁿ Xₖ is:
∑ₖ₌₁ⁿ Xₖ ∼ N(nμ, nσ²).  (19)
Among the above estimators the sample mean, for σ² = 1:
μ̂ₙ(X) := X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n),
constitutes the optimal point estimator of μ because it is:
[U] Unbiased (E(X̄ₙ) = μ*),
[FE] Fully Efficient (Var(X̄ₙ) = CR(μ)), and
[SC] Strongly Consistent (P(limₙ→∞ X̄ₙ = μ*) = 1).
Table 6: Estimators                                          UN  FE  SC
[a] μ̂₁(X) = X₁ ∼ N(μ, 1)                                    ✓   ×   ×
[b] μ̂₂(X) = X₁ − Xₙ ∼ N(0, 2)                               ×   ×   ×
[c] μ̂₃(X) = (X₁ + Xₙ)/2 ∼ N(μ, 1/2)                         ✓   ×   ×
[d] μ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n)                       ✓   ✓   ✓
[e] μ̂ₙ₊₁(X) = (1/(n+1))∑ₖ₌₁ⁿ Xₖ ∼ N(nμ/(n+1), n/(n+1)²)     ×   ×   ✓
Given that any ‘decent’ estimator μ̂(X) of μ is likely to
yield any value in the interval (−∞, ∞), can one say something
more about its reliability than just that ‘on average’ its
values μ̂(x), for x ∈ X, are more likely to occur around μ* (the
true value) than those further away?
4.2 What is a Confidence Interval?
This is what a Confidence Interval (CI) proposes to address.
In general, a (1−α) CI for μ takes the generic form:
P(L(X) ≤ μ* ≤ U(X)) = 1−α,
where L(X) and U(X) denote the lower and upper (random)
bounds of this CI. The (1−α) is referred to as the confidence
level and represents the coverage probability of the CI:
CI(X; α) = (L(X), U(X)),
in the sense that the probability that the random interval
CI(X; α) covers (overlays) the true μ* is equal to (1−α).
This is often envisioned in terms of a long-run metaphor
of repeating the experiment underlying the statistical model
in question in order to get a sequence of outcomes (realizations
of X) xᵢ, i = 1, 2, ..., N, each of which will yield an observed
CI(x₀; α), i = 1, 2, ..., N.
In the context of this metaphor, (1−α) denotes the relative
frequency of the observed CIs that will include (overlay) μ*.
Example 2. In the case of the simple (one parameter)
Normal model (table 5), let us consider the question of
constructing .95 CIs using the different unbiased estimators of μ
in table 6:
[a] P(μ̂₁(X) − 1.96 ≤ μ* ≤ μ̂₁(X) + 1.96) = .95
[c] P(μ̂₃(X) − 1.96(1/√2) ≤ μ* ≤ μ̂₃(X) + 1.96(1/√2)) = .95
[d] P(X̄ₙ − 1.96(1/√n) ≤ μ* ≤ X̄ₙ + 1.96(1/√n)) = .95  (20)
How do these CIs differ? The answer is in terms of their
precision (accuracy).
One way to measure precision for CIs is to evaluate their
length:
[a]: 2(1.96) = 3.92, [c]: 2(1.96(1/√2)) = 2.772, [d]: 2(1.96(1/√n)) = 3.92/√n.
It is clear from this evaluation that the CI associated with
X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ is the shortest for any n > 2; e.g. for n = 100
the length of this CI is 3.92/√100 = .392.
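The claimed coverage and length of CI [d] can be checked by simulation. A sketch under table 5's assumptions (μ* = .5, the seed, and the replication count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma, n, reps = 0.5, 1.0, 100, 100_000

X = rng.normal(mu_true, sigma, size=(reps, n))
xbar = X.mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)   # half-length of CI [d]

# Fraction of realized intervals that cover the true mu
covered = (xbar - half <= mu_true) & (mu_true <= xbar + half)
print(covered.mean())   # close to .95
print(2 * half)         # length 3.92/sqrt(100) = 0.392
```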
[Figure: 20 realized .95 confidence intervals plotted against the true value μ*; roughly 1 in 20 (here interval 13) fails to cover μ*]
4.3 Constructing Confidence Intervals (CIs)
More generally, the sampling distribution of the optimal estimator
gives rise to a pivot (a function of the sample and μ whose
distribution is known):
√n(X̄ₙ − μ) ∼ N(0, 1), evaluated under μ = μ*,  (21)
which can be used to construct the shortest CI among all
(1−α) CIs for μ:
P(X̄ₙ − c_{α/2}(1/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(1/√n)) = 1−α,  (22)
where P(|Z| ≤ c_{α/2}) = 1−α for Z ∼ N(0, 1) (figures 1-2).
Example 3. In the case where σ² is unknown, and we use
s² = [1/(n−1)]∑ₖ₌₁ⁿ(Xₖ − X̄ₙ)² to estimate it, the pivot in
(21) takes the form:
√n(X̄ₙ − μ)/s ∼ St(n−1), evaluated under μ = μ*,  (23)
where St(n−1) denotes the Student's t distribution with (n−1)
degrees of freedom.
Step 1. Attach a (1−α) coverage probability using (23):
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) = 1−α,
where P(|t| ≤ c_{α/2}) = 1−α for t ∼ St(n−1).
Step 2. Re-arrange √n(X̄ₙ − μ)/s to isolate μ and derive the CI:
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) =
= P(−c_{α/2}(s/√n) ≤ X̄ₙ − μ ≤ c_{α/2}(s/√n)) =
= P(−X̄ₙ − c_{α/2}(s/√n) ≤ −μ ≤ −X̄ₙ + c_{α/2}(s/√n)) =
= P(X̄ₙ − c_{α/2}(s/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(s/√n)) = (1−α).
In figures 1-2 the underlying distribution is Normal and in
figures 3-4 it's Student's t with 19 degrees of freedom. One
can see that, while the tail areas are the same for each α, the
threshold values c_{α/2} for the Normal are smaller for each α than
the corresponding values c*_{α/2} for the Student's t, because the
latter has heavier tails due to the randomness of s².
[Fig. 1: P(|Z| ≤ 1.96) = .95 for Z ∼ N(0, 1)]
[Fig. 2: P(|Z| ≤ 1.64) = .90 for Z ∼ N(0, 1)]
[Fig. 3: P(|t| ≤ 2.09) = .95 for t ∼ St(19)]
[Fig. 4: P(|t| ≤ 1.72) = .90 for t ∼ St(19)]
5 Summary and conclusions
The primary objective in frequentist estimation is to learn
about θ*, the true value of the unknown parameter of interest,
using its sampling distribution f(θ̂; θ*) associated with a
particular sample size n. The finite sample properties are defined
directly in terms of f(θ̂; θ*), and the asymptotic properties
are defined in terms of the asymptotic sampling distribution
f∞(θ̂; θ*), aiming to approximate f(θ̂; θ*) at the limit
as n → ∞.
The question that needs to be considered at this stage is:
what combination of the above mentioned
properties specifies an ‘optimal’ estimator?
A necessary but minimal property for an estimator is
consistency (preferably strong). By itself, however, consistency
does not secure learning from data for a given n; it's a
promissory note for potential learning. Hence, for actual learning
one needs to supplement consistency with certain finite sample
properties, like unbiasedness and full efficiency, to ensure that
learning can take place with the particular data x₀ := (x₁, x₂, ..., xₙ)
of sample size n.
Among finite sample properties, full efficiency is clearly
the most important because it secures the highest degree of
learning for a given n, since it offers the best possible precision.
Relative efficiency, although desirable, needs to be
investigated further to find out how large the class of
estimators being compared is before passing judgement. Being the
best econometrician in my family, although worthy of something,
does not make me a good econometrician!!
Unbiasedness, although desirable, is not considered
indispensable by itself. Indeed, as shown above, an unbiased
but inconsistent estimator is practically useless, and a consistent
but biased estimator is always preferable.
Hence, a consistent, unbiased and fully efficient estimator
sets the gold standard in estimation.
In conclusion, it is important to emphasize that point
estimation is often considered inadequate for the purposes of
scientific inquiry because a ‘good’ point estimator θ̂(X), by
itself, does not provide any measure of the reliability and
precision associated with the estimate θ̂(x₀); one would be
wrong to assume that θ̂(x₀) ≈ θ*. This is the reason why
θ̂(x₀) is often accompanied by its standard error [the estimated
standard deviation √Var(θ̂(X))] or the p-value of
some test of significance associated with the generic hypothesis
θ = 0.
Interval estimation rectifies this weakness of point estimation
by providing the relevant error probabilities associated
with inferences pertaining to ‘covering’ the true value
θ* of θ.