PHIL 6334 - Probability/Statistics Lecture Notes 3:
Estimation (Point and Interval)
Aris Spanos [Spring 2014]
1 Introduction
In this lecture we will consider point estimation in its simplest
form by focusing the discussion on simple statistical models,
whose generic form is given in table 1.
Table 1 — Simple (generic) Statistical Model
[i] Probability model: Φ = {f(x; θ), θ ∈ Θ, x ∈ R_X}
[ii] Sampling model: X := (X₁, X₂, ..., Xₙ) is a random (IID) sample.
What makes this type of statistical model ‘simple’ is the no-
tion of random (IID) sample.
1.1 Random sample (IID)
The notion of a random sample is defined in terms of the joint
distribution of the sample X := (X₁, X₂, ..., Xₙ), say f(x₁, x₂, ..., xₙ; θ),
for all x := (x₁, x₂, ..., xₙ) ∈ Rⁿ_X, by imposing two probabilistic
assumptions:
(I) Independence: the sample X is said to be Independent (I) if,
for all x ∈ Rⁿ_X, the joint distribution splits up into a product
of marginal distributions:
f(x; θ) = f₁(x₁; θ₁)·f₂(x₂; θ₂)· ··· ·fₙ(xₙ; θₙ) := ∏ₖ₌₁ⁿ fₖ(xₖ; θₖ).
(ID) Identically Distributed: the sample X is said to be
Identically Distributed (ID) if the marginal distributions
are identical:
fₖ(xₖ; θₖ) = f(xₖ; θ), for all k = 1, 2, ..., n.
Note that this means two things: the density functions have
the same form and the unknown parameters are common to
all of them.
For a better understanding of these two crucial probabilistic
assumptions we need to simplify the discussion by focusing
first on the two random variable case, where we denote the
r.v.'s by X and Y to avoid subscripts.
First, let us revisit the notion of a random variable in order
to motivate the notions of marginal and joint distributions.
Example 5. Tossing a coin twice and noting the outcome.
In this case S = {(HH), (HT), (TH), (TT)}, and let us assume
that the events of interest are A = {(HH), (HT), (TH)} and
B = {(HT), (TH), (TT)}. Using these two events we can generate
the event space of interest F by applying the set-theoretic
operations of union (∪), intersection (∩), and complementation (−).
That is, F = {S, ∅, A, B, A̅, B̅, A∩B, ...}; convince yourself that
this will give rise to the set of all subsets of S. Let us define
the real-valued functions X(·) and Y(·) on S as follows:
X(HH) = X(HT) = X(TH) = 1, X(TT) = 0,
Y(HT) = Y(TH) = Y(TT) = 1, Y(HH) = 0.
Do these two functions define proper r.v.'s with respect to F?
To check, we define all possible events generated by these
functions and check whether they belong to F:
{s: X(s)=0} = {(TT)} = A̅ ∈ F, {s: X(s)=1} = A ∈ F,
{s: Y(s)=0} = {(HH)} = B̅ ∈ F, {s: Y(s)=1} = B ∈ F.
Hence, both functions do define proper r.v.'s with respect to F.
To derive their distributions we assume that we have a fair coin,
i.e. each outcome in S has probability .25 of occurring. Hence:
P({s: X(s)=0}) = P(X=0) = .25, P({s: X(s)=1}) = P(X=1) = .75,
P({s: Y(s)=0}) = P(Y=0) = .25, P({s: Y(s)=1}) = P(Y=1) = .75.
Hence, their ‘marginal’ density functions take the form:

x       0    1          y       0    1
f(x)   .25  .75         f(y)   .25  .75        (1)
How can one define the joint distribution of these two r.v.'s?
To define the joint density function we need to specify all the
events:
(X = x, Y = y), x ∈ R_X, y ∈ R_Y,
denoting ‘their joint occurrence’, and then attach probabilities
to these events. These events belong to F by definition,
because F as a field is closed under the set-theoretic operations
∪, ∩, −, so that:
{X=0, Y=0} = ∅,            P(X=0, Y=0) = 0,
{X=0, Y=1} = {(TT)},       P(X=0, Y=1) = .25,
{X=1, Y=0} = {(HH)},       P(X=1, Y=0) = .25,
{X=1, Y=1} = {(HT),(TH)},  P(X=1, Y=1) = .50.
Hence, the joint density is defined by:

y\x     0     1
0       0    .25
1      .25   .50        (2)
How is the joint density (2) connected to the individual (marginal)
densities given in (1)? It turns out that if we sum across the rows
of the above table for each value of y, i.e. use
∑_{x∈R_X} f(x, y) = f(y), we will get the marginal distribution
of Y: f(y), y ∈ R_Y, and if we sum down the columns for each
value of x, i.e. use ∑_{y∈R_Y} f(x, y) = f(x), we will get the
marginal distribution of X: f(x), x ∈ R_X:

y\x     0     1    f(y)
0       0    .25   .25
1      .25   .50   .75
f(x)   .25   .75    1        (3)

Note: E(X) = 0(.25) + 1(.75) = .75 = E(Y),
Var(X) = (0−.75)²(.25) + (1−.75)²(.75) = .1875 = Var(Y).
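The row/column bookkeeping above is easy to mechanize. As an illustrative sketch (not part of the original notes; the array layout and variable names are my own), the joint density in (3) can be stored as a NumPy array and the marginals and moments recovered by summing:

```python
import numpy as np

# Joint density f(x, y) from table (3): rows index y, columns index x.
joint = np.array([[0.00, 0.25],
                  [0.25, 0.50]])
vals = np.array([0, 1])          # values taken by both X and Y

f_x = joint.sum(axis=0)          # sum down each column (over y) -> f(x)
f_y = joint.sum(axis=1)          # sum across each row (over x)  -> f(y)

E_x = (vals * f_x).sum()                   # E(X) = .75
Var_x = ((vals - E_x) ** 2 * f_x).sum()    # Var(X) = .1875

print(f_x, f_y)   # both [0.25 0.75]
print(E_x, Var_x)
```

The same two lines of `sum` reproduce the "sum the rows / sum the columns" rule for any finite bivariate table.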
Armed with the joint distribution we can proceed to define
the notions of Independence and Identically Distributed for
the r.v.'s X and Y.
Independence. Two r.v.'s X and Y are said to be Independent iff:
f(x, y) = f(x)·f(y), for all values (x, y) ∈ R_X × R_Y.  (4)
That is, to verify that these two r.v.'s are independent we need
to confirm that the probability of all possible pairs of values
(x, y) satisfies (4).
Example. In the case of the joint distribution in (3)
we can show that the r.v.'s are not independent because for
(x, y) = (0, 0):
f(0, 0) = 0 ≠ f(0)·f(0) = (.25)(.25).
It is important to emphasize that the above condition of
Independence is not equivalent to the two random variables
being uncorrelated:
Corr(X, Y) = 0 ⇏ f(x, y) = f(x)·f(y) for all (x, y) ∈ R_X × R_Y,
where ‘⇏’ denotes ‘does not imply’. This is because Corr(X, Y)
is a measure of linear dependence between X and Y, since it
is based on the covariance defined by:
Cov(X, Y) = E[(X − E(X))(Y − E(Y))] =
= (0)(0−.75)(0−.75) + (.25)(0−.75)(1−.75) +
+ (.25)(1−.75)(0−.75) + (.50)(1−.75)(1−.75) = −.0625.
A standardized covariance yields the correlation:
Corr(X, Y) = Cov(X, Y)/√(Var(X)·Var(Y)) = −.0625/.1875 = −1/3.
The intuition underlying this result is that the correlation involves
only the first two moments [mean, variance, covariance]
of X and Y, but independence is defined in terms of the density
functions; the latter, in principle, involves all moments,
not just the first two!
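Both checks, independence cell-by-cell and the covariance/correlation computation, can be replicated numerically. A minimal sketch, assuming the same table (3) layout as before (the encoding and names are mine, not from the notes):

```python
import numpy as np

# Joint density f(x, y) from table (3): rows index y, columns index x.
joint = np.array([[0.00, 0.25],
                  [0.25, 0.50]])
vals = np.array([0, 1])

f_x, f_y = joint.sum(axis=0), joint.sum(axis=1)

# Independence requires f(x,y) = f(x)*f(y) in every cell; here f(0,0)=0 != .0625.
independent = np.allclose(joint, np.outer(f_y, f_x))

E_x, E_y = (vals * f_x).sum(), (vals * f_y).sum()
cov = sum(joint[j, i] * (vals[i] - E_x) * (vals[j] - E_y)
          for i in range(2) for j in range(2))
corr = cov / np.sqrt(((vals - E_x) ** 2 * f_x).sum()
                     * ((vals - E_y) ** 2 * f_y).sum())

print(independent)   # False
print(cov, corr)     # -0.0625, -0.333...
```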
Identically Distributed. Two r.v.'s X and Y are said
to be Identically Distributed iff:
f_X(x; θ) = f_Y(y; θ), for all x = y, (x, y) ∈ R_X × R_Y.  (5)
Example. In the case of the joint distribution in (3) we
can show that the r.v.'s are identically distributed because (5)
holds. In particular, both r.v.'s X and Y take the same values
with the same probabilities.
To shed further light on the notion of IID, consider the
three bivariate distributions given below.

y\x     1    2   f(y)
0     .18  .42   .6
2     .12  .28   .4
f(x)   .3   .7    1        (A)

y\x     0    1   f(y)
0     .18  .42   .6
1     .12  .28   .4
f(x)   .3   .7    1        (B)

y\x     0    1   f(y)
0     .36  .24   .6
1     .24  .16   .4
f(x)   .6   .4    1        (C)
(I) X and Y are Independent iff:
f(x, y) = f(x)·f(y), for all (x, y) ∈ R_X × R_Y.  (6)
(ID) X and Y are Identically Distributed iff:
f_X(x) = f_Y(y), for all x = y, (x, y) ∈ R_X × R_Y, and R_X = R_Y.
The random variables X and Y are independent in all three
cases since they satisfy (4) (verify!).
The random variables in (A) are not Identically Distributed
because R_X ≠ R_Y and f_X(x) ≠ f_Y(y) for some (x, y) ∈ R_X × R_Y.
The random variables in (B) are not Identically Distributed
because, even though R_X = R_Y, f_X(x) ≠ f_Y(y) for some
(x, y) ∈ R_X × R_Y.
Finally, the random variables in (C) are Identically Distributed
because R_X = R_Y and f_X(x) = f_Y(y) for all (x, y) ∈ R_X × R_Y.
2 Point Estimation: an overview
It turns out that all forms of frequentist inference, which include
point and interval estimation, hypothesis testing and
prediction, are defined in terms of two sets:
X — sample space: the set of all possible values of the sample X,
Θ — parameter space: the set of all possible values of θ.
Note that the sample space X is always a subset of Rⁿ and
denoted by Rⁿ_X.
In estimation the objective is to use the statistical information
to infer the ‘true’ value θ* of the unknown parameter θ,
whatever that happens to be, as long as it belongs to Θ.
In general, an estimator θ̂ of θ is a mapping (function) from the
sample space to the parameter space:
θ̂(·): X → Θ.  (7)
Example 1. Let the statistical model of interest be the
simple Bernoulli model (table 2) and consider the question of
estimating the unknown parameter θ, whose parameter space
is Θ := [0, 1]. Note that the sample space is X := {0, 1}ⁿ.

Table 2 - Simple Bernoulli Model
Statistical GM: Xₖ = θ + uₖ, k ∈ N.
[1] Bernoulli: Xₖ ∼ Ber(θ), xₖ = 0, 1
[2] constant mean: E(Xₖ) = θ, k ∈ N
[3] constant variance: Var(Xₖ) = θ(1−θ), k ∈ N
[4] Independence: {Xₖ, k ∈ N} is an independent process

The notation θ̂(X) is used to denote an estimator in order
to bring out the fact that it is a function of the sample
X, and for different values of θ it generates the sampling
distribution f(θ̂(x); θ), for x ∈ X. Post-data, θ̂(X) yields an
estimate θ̂(x₀), which constitutes a particular value of θ̂(X)
corresponding to data x₀. Crucial distinction: θ̂(X) - estimator
(Plato's world), θ̂(x₀) - estimate (real world), and θ - unknown
constant (Plato's world); Fisher (1922).
In light of the definition in (7), which of the following mappings
constitute potential estimators of θ?

Table 3: Estimators of θ?
[a] θ̂₁(X) = X₁
[b] θ̂₂(X) = X₁ − Xₙ
[c] θ̂₃(X) = (X₁ + Xₙ)/2
[d] θ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ, for n ≥ 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))∑ₖ₌₁ⁿ Xₖ

Do the mappings [a]-[e] in table 3 constitute estimators of
θ? All five functions [a]-[e] have X as their domain, but is the
range of each mapping a subset of Θ := [0, 1]? Mappings [a],
[c]-[e] can be possible estimators of θ because their ranges are
subsets of [0, 1], but [b] cannot because it can take the
value −1 [ensure you understand why!], which lies outside the
parameter space of θ.
One can easily think of many more functions from X to Θ
that will qualify as possible estimators of θ. Given the plethora
of such possible estimators, how does one decide which one is
the most appropriate?
To answer that question let us think about the possibility
of an ideal estimator, θ*(·): X → {θ*}, i.e., θ*(x) = θ* for
all values x ∈ X. That is, θ*(X) pinpoints the true value θ*
of θ whatever the data. A moment's reflection reveals that
no such estimator could exist, because X is a random vector
with its own distribution f(x; θ), for all x ∈ X. Moreover, in
view of the randomness of X, any mapping of the form (7)
will be a random variable with its own sampling distribution,
f(θ̂(x); θ), which is directly derivable from f(x; θ).
Let us take stock of these distributions and where they
come from. The distribution of the sample, f(x; θ) for all x ∈ X,
is given by the assumptions of the statistical model in question.
In the above case of the simple Bernoulli model, we can
combine assumptions [2]-[4] to give us:
f(x; θ) = ∏ₖ₌₁ⁿ f(xₖ; θ),
and then use [1]: f(xₖ; θ) = θ^xₖ(1−θ)^(1−xₖ), xₖ = 0, 1,
k = 1, 2, ..., n, to determine f(x; θ):
f(x; θ) = θ^(∑ₖ₌₁ⁿ xₖ)·(1−θ)^(n−∑ₖ₌₁ⁿ xₖ) = θ^y(1−θ)^(n−y),
where y = ∑ₖ₌₁ⁿ xₖ, and one can show that:
Y = ∑ₖ₌₁ⁿ Xₖ ∼ Bin(nθ, nθ(1−θ)),  (8)
i.e. Y is Binomially distributed, with the first argument denoting
its mean and the second its variance. Note that the means and
variances are derived using the two formulae (for independent
X₁, X₂, ..., Xₙ):
(i) E(a₁X₁ + a₂X₂ + ··· + aₙXₙ) = a₁E(X₁) + a₂E(X₂) + ··· + aₙE(Xₙ),
(ii) Var(a₁X₁ + a₂X₂ + ··· + aₙXₙ) = a₁²Var(X₁) + a₂²Var(X₂) + ··· + aₙ²Var(Xₙ).  (9)
To derive the mean and variance of Y:
E(Y) = E(∑ₖ₌₁ⁿ Xₖ) =(i)= ∑ₖ₌₁ⁿ E(Xₖ) = ∑ₖ₌₁ⁿ θ = nθ,
Var(Y) = Var(∑ₖ₌₁ⁿ Xₖ) =(ii)= ∑ₖ₌₁ⁿ Var(Xₖ) = ∑ₖ₌₁ⁿ θ(1−θ) = nθ(1−θ).
The result in (8) is a special case of a general result.
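One can corroborate (8) by simulation. A small sketch (the values n = 20, θ = .3 and the seed are my own choices, not from the notes), drawing many replications of Y = ∑ₖ₌₁ⁿ Xₖ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 20, 0.3, 200_000

# Each row is one IID Bernoulli(theta) sample of size n; Y is the row sum.
Y = rng.binomial(1, theta, size=(reps, n)).sum(axis=1)

print(Y.mean())  # close to n*theta = 6.0
print(Y.var())   # close to n*theta*(1-theta) = 4.2
```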
The sampling distribution of any (well-behaved) function
of the sample, say Y = h(X₁, X₂, ..., Xₙ), can be derived from
f(x; θ), x ∈ X, using the formula:
F(y) = P(Y ≤ y) = ∫···∫_{x: h(x)≤y} f(x; θ)dx, y ∈ R.  (10)
In the Bernoulli case, all the estimators [a], [c]-[e] are linear
functions of (X₁, X₂, ..., Xₙ) and thus, by (8), their distribution
is Binomial. In particular:

Table 4: Estimators and their sampling distributions
[a] θ̂₁(X) = X₁ ∼ Ber(θ, θ(1−θ))
[c] θ̂₃(X) = (X₁ + Xₙ)/2 ∼ Bin(θ, θ(1−θ)/2)
[d] θ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ Bin(θ, θ(1−θ)/n), for n ≥ 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))∑ₖ₌₁ⁿ Xₖ ∼ Bin(nθ/(n+1), nθ(1−θ)/(n+1)²)
(11)

It is important to emphasize at the outset that the sampling
distributions in [a]-[e] are evaluated under θ = θ*, where θ* is
the true value of θ.
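The means and variances in table 4 can be checked by simulation. An illustrative sketch (n = 50, θ = .3 chosen arbitrarily; names are mine) estimating each estimator's mean and variance from many samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 50, 0.3, 200_000
X = rng.binomial(1, theta, size=(reps, n))

est = {
    "[a] X1":         X[:, 0].astype(float),     # mean theta, var theta(1-theta)
    "[c] (X1+Xn)/2":  (X[:, 0] + X[:, -1]) / 2,  # mean theta, var theta(1-theta)/2
    "[d] (1/n)sum":   X.mean(axis=1),            # mean theta, var theta(1-theta)/n
    "[e] sum/(n+1)":  X.sum(axis=1) / (n + 1),   # mean n*theta/(n+1): biased
}
for name, draws in est.items():
    print(name, round(draws.mean(), 3), round(draws.var(), 5))
```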
It is clear that none of the sampling distributions of the
estimators in table 4 resembles that of the ideal estimator,
θ*(X), whose sampling distribution, if it existed, would be of
the form:
[i] P(θ*(X) = θ*) = 1.  (12)
In terms of its first two moments, the ideal estimator satisfies
[ii] E(θ*(X)) = θ* and [iii] Var(θ*(X)) = 0. In contrast to the
(infeasible) ideal estimator in (12), when the estimators in
table 4 infer θ using an outcome x, the inference is always
subject to some error because the variance is not zero. The
sampling distributions of these estimators provide the basis
for evaluating such errors.
In the statistics literature the evaluation of inferential
errors in estimation is accomplished in two interconnected
stages.
The objective of the first stage is to narrow down the set
of all possible estimators of θ to an optimal subset, where
optimality is assessed by how closely the sampling distribution
of an estimator approximates that of the ideal estimator in
(12); the subject matter of section 3.
The second stage is concerned with using optimal estimators
to construct the shortest Confidence Intervals (CI)
for the unknown parameter θ, based on prespecifying the error
of covering (encompassing) θ* within a random interval of
the form (L(X), U(X)); the subject matter of section 4.
3 Properties of point estimators
As mentioned above, the notion of an optimal estimator can
be motivated by how well the sampling distribution of an
estimator θ̂(X) approximates that of the ideal estimator in (12).
In particular, the three features of the ideal estimator [i]-[iii]
motivate the following optimal properties of feasible estimators.
Condition [ii] motivates the property known as:
[I] Unbiasedness: An estimator θ̂(X) is said to be an unbiased
estimator of θ if:
E(θ̂(X)) = θ*.  (13)
That is, the mean of the sampling distribution of θ̂(X) coincides
with the true value of the unknown parameter θ.
Example. In the case of the simple Bernoulli model,
we can see from table 4 that the estimators θ̂₁(X), θ̂₃(X)
and θ̂ₙ(X) are unbiased, since in all three cases (13) is satisfied.
In contrast, the estimator θ̂ₙ₊₁(X) is not unbiased because
E(θ̂ₙ₊₁(X)) = nθ/(n+1) ≠ θ.
Condition [iii] motivates the property known as:
[II] Full Efficiency: An unbiased estimator θ̂(X) is said
to be a fully efficient estimator of θ if its variance is as small
as it can be, where the latter is expressed by:
Var(θ̂(X)) = CR(θ) := [E(−∂² ln f(x; θ)/∂θ²)]⁻¹,
where ‘CR(θ)’ stands for the Cramer-Rao lower bound; note
that f(x; θ) is given by the assumed model.
Example (the derivations are not important!). In the case
of the simple Bernoulli model:
ln f(x; θ) = y ln θ + (n−y) ln(1−θ), where y = ∑ₖ₌₁ⁿ xₖ, E(Y) = nθ,
∂ ln f(x; θ)/∂θ = y(1/θ) − (n−y)(1/(1−θ)),
∂² ln f(x; θ)/∂θ² = −y(1/θ²) − (n−y)(1/(1−θ))²,
E(−∂² ln f(x; θ)/∂θ²) = (1/θ²)(nθ) + [n − nθ](1/(1−θ))² = n/(θ(1−θ)),
and thus the Cramer-Rao lower bound is:
CR(θ) := θ(1−θ)/n.
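The Fisher-information calculation above can be sanity-checked numerically. A sketch (n = 40, θ = .3 and the seed are my own choices) that averages −∂² ln f/∂θ² over simulated samples:

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 40, 0.3, 200_000

y = rng.binomial(n, theta, size=reps)   # number of successes in each sample

# -d^2 ln f(x; theta)/d theta^2 = y/theta^2 + (n - y)/(1 - theta)^2
neg_hessian = y / theta**2 + (n - y) / (1 - theta) ** 2

fisher_info = neg_hessian.mean()
print(fisher_info)       # close to n/(theta*(1-theta)) = 190.476...
print(1 / fisher_info)   # close to CR(theta) = theta*(1-theta)/n
```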
Looking at the estimators of θ in (11), it is clear that only
one unbiased estimator achieves that bound, θ̂ₙ(X). Hence,
θ̂ₙ(X) is the only estimator of θ which is both unbiased and
fully efficient.
Comparisons between unbiased estimators can be made in
terms of relative efficiency:
Var(θ̂₃(X)) < Var(θ̂₁(X)), for n ≥ 2,
asserting that θ̂₃(X) is relatively more efficient than θ̂₁(X).
But one needs to be careful with such comparisons because
they can be very misleading when both estimators are bad,
as in the case above; the fact that θ̂₃(X) is relatively more
efficient than θ̂₁(X) does not mean that the former is even
an adequate estimator. Hence, relative efficiency is not
something to write home about!
What renders these two estimators practically useless?
An asymptotic property motivated by condition [i] of
the ideal estimator, known as consistency.
Intuitively, an estimator θ̂ₙ(X) is consistent when its precision
(how close to θ* it is) improves as the sample size n increases.
Condition [i] of the ideal estimator motivates the property
known as:
[III] Consistency: an estimator θ̂ₙ(X) is consistent if:
Strong: P(limₙ→∞ θ̂ₙ(X) = θ*) = 1,
Weak: limₙ→∞ P(|θ̂ₙ(X) − θ*| ≤ ε) = 1, for any ε > 0.  (14)
That is, an estimator θ̂ₙ(X) is consistent if it approximates
(probabilistically) the sampling distribution of the ideal
estimator asymptotically, as n → ∞. The difference between
strong and weak consistency stems from the form of probabilistic
convergence they involve, with the former being stronger
than the latter. Both of these properties constitute an extension
of the Strong and Weak Law of Large Numbers (LLN),
which hold for the sample mean X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ of a process
{Xₖ, k = 1, 2, ...}, under certain probabilistic assumptions,
the most restrictive being that the process is IID; see Spanos
(1999), ch. 8.
[Fig. 1: t-plot of the sample average X̄ₙ for a Bernoulli IID realization with n = 200]
[Fig. 2: t-plot of the sample average X̄ₙ for a Bernoulli IID realization with n = 1000]
In practice, it is non-trivial to prove that a particular
estimator is consistent or not by verifying directly the conditions
in (14). However, there is often a short-cut for verifying
consistency in the case of unbiased estimators, using the
sufficient condition:
limₙ→∞ Var(θ̂ₙ(X)) = 0.  (15)
Example. In the case of the simple Bernoulli model, one
can verify that the estimators θ̂₁(X) and θ̂₃(X) are inconsistent
because:
limₙ→∞ Var(θ̂₁(X)) = θ(1−θ) ≠ 0, limₙ→∞ Var(θ̂₃(X)) = θ(1−θ)/2 ≠ 0,
i.e. their variances do not decrease to zero as the sample size
goes to infinity.
In contrast, the estimators θ̂ₙ(X) and θ̂ₙ₊₁(X) are consistent
because:
limₙ→∞ Var(θ̂ₙ(X)) = limₙ→∞ θ(1−θ)/n = 0, limₙ→∞ MSE(θ̂ₙ₊₁(X)) = 0.
Note that ‘MSE’ denotes the ‘Mean Square Error’, defined by:
MSE(θ̂; θ*) = Var(θ̂) + [B(θ̂; θ*)]²,
where B(θ̂; θ*) = E(θ̂) − θ* is the bias. Hence:
limₙ→∞ MSE(θ̂) = 0 if (a) limₙ→∞ Var(θ̂) = 0 and (b) limₙ→∞ E(θ̂) = θ*,
where (b) is equivalent to limₙ→∞ B(θ̂; θ*) = 0.
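The contrast between a consistent and an inconsistent estimator shows up clearly in simulated MSEs. A sketch (θ = .3 and the sample sizes are my own choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, reps = 0.3, 10_000
mse_n, mse_3 = {}, {}

for n in (10, 100, 1000):
    X = rng.binomial(1, theta, size=(reps, n))
    t_n = X.mean(axis=1)               # theta_hat_n: consistent
    t_3 = (X[:, 0] + X[:, -1]) / 2     # theta_hat_3: variance never shrinks
    mse_n[n] = ((t_n - theta) ** 2).mean()
    mse_3[n] = ((t_3 - theta) ** 2).mean()
    print(n, round(mse_n[n], 5), round(mse_3[n], 5))

# mse_n falls roughly like theta*(1-theta)/n; mse_3 stays near .105 for all n.
```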
Let us take stock of the above properties and how they
can be used by the practitioner in deciding which estimator is
optimal. The property which defines minimal reliability for
an estimator is that of consistency. Intuitively, consistency
indicates that as the sample size increases [as n → ∞] the
estimator θ̂ₙ(X) approaches θ*, the true value of θ, in some
probabilistic sense: convergence almost surely or convergence
in probability. Hence, if an estimator θ̂ₙ(X) is not consistent,
it is automatically excluded from the subset of potentially
optimal estimators, irrespective of any other properties this
estimator might enjoy. In particular, an unbiased estimator
which is inconsistent is practically useless. On the other hand,
just because an estimator θ̂ₙ(X) is consistent does not imply
that it's a ‘good’ estimator; it only implies that it's minimally
acceptable.
It is important to emphasize that the properties of unbiasedness
and full efficiency hold for any sample size n ≥ 1,
and thus we call them finite sample properties, but consistency
is an asymptotic property because it holds as n → ∞.
Example. In the case of the simple Bernoulli model, if
the choice between estimators is confined (artificially) to
the estimators θ̂₁(X), θ̂₃(X) and θ̂ₙ₊₁(X), the latter estimator
should be chosen, despite being biased, because it's a consistent
estimator of θ. On the other hand, among the estimators
given in table 4, θ̂ₙ(X) is clearly the best because it satisfies
all three properties. In particular, θ̂ₙ(X) not only satisfies
the minimal property of consistency, but it also
has the smallest variance possible, which means that it comes
closer to the ideal estimator than any of the others, for any
sample size n > 2. The sampling distribution of θ̂ₙ(X), when
evaluated under θ = θ*, takes the form:
[d] θ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ Bin(θ*, θ*(1−θ*)/n) under θ = θ*,  (16)
whatever the ‘true’ value θ* happens to be.
Additional Asymptotic properties
In addition to the properties of estimators mentioned above,
there are certain other properties which are often used in
practice to decide on the optimality of an estimator. The most
important is given below for completeness.
[V] Asymptotic Normality: an estimator θ̂ₙ(X) is said
to be asymptotically Normal if:
√n(θ̂ₙ(X) − θ) ≃ N(0, V∞(θ)), V∞(θ) ≠ 0,  (17)
where ‘≃’ stands for ‘can be asymptotically approximated by’.
This property is an extension of a well-known result in
probability theory: the Central Limit Theorem (CLT).
The CLT asserts that, under certain probabilistic assumptions
on the process {Xₖ, k = 1, 2, ...}, the most restrictive
being that the process is IID, the sampling distribution of
X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ for a ‘large enough’ n can be approximated
by the Normal distribution (Spanos, 1999, ch. 8):
(X̄ₙ − E(X̄ₙ))/√Var(X̄ₙ) ≃ N(0, 1).  (18)
Note that the important difference between (17) and (18) is
that θ̂ₙ(X) in the former does not have to coincide with X̄ₙ;
it can be any well-behaved function g(X) of the sample X.
Example. In the case of the simple Bernoulli model, the
sampling distribution of θ̂ₙ(X), which we know is Binomial
(see (16)), can also be approximated using (18). In the
graphs below we compare the Normal approximation to the
Binomial for n = 10 and n = 20 in the case where θ = .5, and the
improvement is clearly noticeable.

[Fig.: Normal approx. of Bin(θ = .5, n = 10): Binomial(n = 10, p = .5) vs Normal(Mean = 5, StDev ≈ 1.58)]
[Fig.: Normal approx. of Bin(θ = .5, n = 20): Binomial(n = 20, p = .5) vs Normal(Mean = 10, StDev = 2.236)]
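The quality of the Normal approximation in the two panels can be quantified directly. A stdlib-only sketch (the error metric and helper names are mine) comparing the Bin(n, .5) pmf with the N(nθ, nθ(1−θ)) density at the integer points:

```python
import math

def binom_pmf(y, n, p):
    # Exact Binomial probability P(Y = y)
    return math.comb(n, y) * p**y * (1 - p) ** (n - y)

def norm_pdf(x, mu, sigma):
    # Normal density with mean mu, standard deviation sigma
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

theta, max_err = 0.5, {}
for n in (10, 20):
    mu, sigma = n * theta, math.sqrt(n * theta * (1 - theta))
    max_err[n] = max(abs(binom_pmf(y, n, theta) - norm_pdf(y, mu, sigma))
                     for y in range(n + 1))
    print(n, round(max_err[n], 4))   # the n = 20 error is the smaller of the two
```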
4 Confidence Intervals (CIs): an overview
4.1 An optimal CI begins with an optimal point estimator
Example 2. Let us summarize the discussion concerning
point estimation by briefly discussing the simple (one
parameter) Normal model, where σ² = 1 (table 5).

Table 5 - Simple Normal Model (one unknown parameter)
Statistical GM: Xₖ = μ + uₖ, k ∈ N := {1, 2, ...}
[1] Normality: Xₖ ∼ N(μ, σ²), xₖ ∈ R
[2] Constant mean: E(Xₖ) = μ, k ∈ N
[3] Constant variance: Var(Xₖ) = σ² (known), k ∈ N
[4] Independence: {Xₖ, k ∈ N} is an independent process
In section 3 we discussed the question of choosing among
numerous possible estimators of μ, such as [a]-[e] (table 6), using
their sampling distributions. These results stem from the following
theorem. If X := (X₁, X₂, ..., Xₙ) is a random (IID)
sample from the Normal distribution, i.e.
Xₖ ∼ NIID(μ, σ²), k ∈ N := (1, 2, ..., n, ...),
then the sampling distribution of ∑ₖ₌₁ⁿ Xₖ is:
∑ₖ₌₁ⁿ Xₖ ∼ N(nμ, nσ²).  (19)
Among the above estimators the sample mean, for σ² = 1:
μ̂ₙ(X) := X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n),
constitutes the optimal point estimator of μ because it is:
[U] Unbiased (E(X̄ₙ) = μ*),
[FE] Fully Efficient (Var(X̄ₙ) = CR(μ)), and
[SC] Strongly Consistent (P(limₙ→∞ X̄ₙ = μ*) = 1).
Table 6: Estimators                                          UN  FE  SC
[a] μ̂₁(X) = X₁ ∼ N(μ, 1)                                    ✓   ×   ×
[b] μ̂₂(X) = X₁ − Xₙ ∼ N(0, 2)                               ×   ×   ×
[c] μ̂₃(X) = (X₁ + Xₙ)/2 ∼ N(μ, 1/2)                         ✓   ×   ×
[d] μ̂ₙ(X) = (1/n)∑ₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n)                       ✓   ✓   ✓
[e] μ̂ₙ₊₁(X) = (1/(n+1))∑ₖ₌₁ⁿ Xₖ ∼ N(nμ/(n+1), n/(n+1)²)     ×   ×   ✓
Given that any ‘decent’ estimator μ̂(X) of μ is likely to
yield any value in the interval (−∞, ∞), can one say something
more about its reliability than just that ‘on average’ its
values μ̂(x), for x ∈ X, are more likely to occur around μ* (the
true value) than those further away?
4.2 What is a Confidence Interval?
This is what a Confidence Interval (CI) proposes to address.
In general, a (1−α) CI for μ takes the generic form:
P(L(X) ≤ μ* ≤ U(X)) = 1−α,
where L(X) and U(X) denote the lower and upper (random)
bounds of this CI. The (1−α) is referred to as the confidence
level and represents the coverage probability of the CI:
CI(X; α) = (L(X), U(X)),
in the sense that the probability that the random interval
CI(X; α) covers (overlays) the true μ* is equal to (1−α).
This is often envisioned in terms of a long-run metaphor
of repeating the experiment underlying the statistical model
in question in order to get a sequence of outcomes (realizations
of X) xᵢ, i = 1, 2, ..., N, each of which will yield an observed
CI(x₀; α), i = 1, 2, ..., N.
In the context of this metaphor, (1−α) denotes the relative
frequency of the observed CIs that will include (overlay) μ*.
Example 2. In the case of the simple (one parameter)
Normal model (table 5), let us consider the question of
constructing .95 CIs using the different unbiased estimators of μ
in table 6:
[a] P(μ̂₁(X) − 1.96 ≤ μ* ≤ μ̂₁(X) + 1.96) = .95
[c] P(μ̂₃(X) − 1.96(1/√2) ≤ μ* ≤ μ̂₃(X) + 1.96(1/√2)) = .95
[d] P(X̄ₙ − 1.96(1/√n) ≤ μ* ≤ X̄ₙ + 1.96(1/√n)) = .95  (20)
How do these CIs differ? The answer is in terms of their
precision (accuracy).
One way to measure precision for CIs is to evaluate their
length:
[a]: 2(1.96) = 3.92, [c]: 2(1.96(1/√2)) = 2.772, [d]: 2(1.96(1/√n)) = 3.92/√n.
It is clear from this evaluation that the CI associated with
X̄ₙ = (1/n)∑ₖ₌₁ⁿ Xₖ is the shortest for any n > 2; e.g. for n = 100
the length of this CI is 3.92/√100 = .392.
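The claimed coverage and length of CI [d] can be checked by simulation. A sketch under table 5's assumptions (μ* = .5, the seed, and the replication count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma, n, reps = 0.5, 1.0, 100, 100_000

X = rng.normal(mu_true, sigma, size=(reps, n))
xbar = X.mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)   # half-length of CI [d]

# Fraction of realized intervals that cover the true mu
covered = (xbar - half <= mu_true) & (mu_true <= xbar + half)
print(covered.mean())   # close to .95
print(2 * half)         # length 3.92/sqrt(100) = 0.392
```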
[Figure: 20 realized .95 confidence intervals plotted against the true value μ*; roughly 1 in 20 (here interval 13) fails to cover μ*]
4.3 Constructing Confidence Intervals (CIs)
More generally, the sampling distribution of the optimal estimator
gives rise to a pivot (a function of the sample and μ whose
distribution is known):
√n(X̄ₙ − μ) ∼ N(0, 1), evaluated under μ = μ*,  (21)
which can be used to construct the shortest CI among all
(1−α) CIs for μ:
P(X̄ₙ − c_{α/2}(1/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(1/√n)) = 1−α,  (22)
where P(|Z| ≤ c_{α/2}) = 1−α for Z ∼ N(0, 1) (figures 1-2).
Example 3. In the case where σ² is unknown, and we use
s² = [1/(n−1)]∑ₖ₌₁ⁿ(Xₖ − X̄ₙ)² to estimate it, the pivot in
(21) takes the form:
√n(X̄ₙ − μ)/s ∼ St(n−1), evaluated under μ = μ*,  (23)
where St(n−1) denotes the Student's t distribution with (n−1)
degrees of freedom.
Step 1. Attach a (1−α) coverage probability using (23):
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) = 1−α,
where P(|t| ≤ c_{α/2}) = 1−α for t ∼ St(n−1).
Step 2. Re-arrange √n(X̄ₙ − μ)/s to isolate μ and derive the CI:
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) =
= P(−c_{α/2}(s/√n) ≤ X̄ₙ − μ ≤ c_{α/2}(s/√n)) =
= P(−X̄ₙ − c_{α/2}(s/√n) ≤ −μ ≤ −X̄ₙ + c_{α/2}(s/√n)) =
= P(X̄ₙ − c_{α/2}(s/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(s/√n)) = (1−α).
In figures 1-2 the underlying distribution is Normal and in
figures 3-4 it's Student's t with 19 degrees of freedom. One
can see that, while the tail areas are the same for each α, the
threshold values c_{α/2} for the Normal are smaller for each α than
the corresponding values c*_{α/2} for the Student's t, because the
latter has heavier tails due to the randomness of s².
[Fig. 1: P(|Z| ≤ 1.96) = .95 for Z ∼ N(0, 1)]
[Fig. 2: P(|Z| ≤ 1.64) = .90 for Z ∼ N(0, 1)]
[Fig. 3: P(|t| ≤ 2.09) = .95 for t ∼ St(19)]
[Fig. 4: P(|t| ≤ 1.72) = .90 for t ∼ St(19)]
5 Summary and conclusions
The primary objective in frequentist estimation is to learn
about θ*, the true value of the unknown parameter of interest,
using its sampling distribution f(θ̂; θ*) associated with a
particular sample size n. The finite sample properties are defined
directly in terms of f(θ̂; θ*), and the asymptotic properties
are defined in terms of the asymptotic sampling distribution
f∞(θ̂; θ*), aiming to approximate f(θ̂; θ*) at the limit
as n → ∞.
The question that needs to be considered at this stage is:
what combination of the above mentioned
properties specifies an ‘optimal’ estimator?
A necessary but minimal property for an estimator is
consistency (preferably strong). By itself, however, consistency
does not secure learning from data for a given n; it's a
promissory note for potential learning. Hence, for actual learning
one needs to supplement consistency with certain finite sample
properties, like unbiasedness and full efficiency, to ensure that
learning can take place with the particular data x₀ := (x₁, x₂, ..., xₙ)
of sample size n.
Among finite sample properties, full efficiency is clearly
the most important because it secures the highest degree of
learning for a given n, since it offers the best possible precision.
Relative efficiency, although desirable, needs to be
investigated further to find out how large the class of
estimators being compared is before passing judgement. Being the
best econometrician in my family, although worthy of something,
does not make me a good econometrician!!
Unbiasedness, although desirable, is not considered
indispensable by itself. Indeed, as shown above, an unbiased
but inconsistent estimator is practically useless, and a consistent
but biased estimator is always preferable.
Hence, a consistent, unbiased and fully efficient estimator
sets the gold standard in estimation.
In conclusion, it is important to emphasize that point
estimation is often considered inadequate for the purposes of
scientific inquiry because a ‘good’ point estimator θ̂(X), by
itself, does not provide any measure of the reliability and
precision associated with the estimate θ̂(x₀); one would be
wrong to assume that θ̂(x₀) ≈ θ*. This is the reason why
θ̂(x₀) is often accompanied by its standard error [the estimated
standard deviation √Var(θ̂(X))] or the p-value of
some test of significance associated with the generic hypothesis
θ = 0.
Interval estimation rectifies this weakness of point estimation
by providing the relevant error probabilities associated
with inferences pertaining to ‘covering’ the true value
θ* of θ.