Notes on Maximum Likelihood and Time
Series
Antonis Demos
(Athens University of Economics and Business)
December 2006
0.1 Conditional Probability and Independence
In many statistical applications we have variables X and Y (or events A and B)
and want to explain or predict Y or A from X or B. We are then interested not only in
marginal probabilities but in conditional ones as well, i.e., we want to incorporate
some information into our predictions. Let A and B be two events in A and let P(·) be a probability
function. The conditional probability of A given the event B, denoted by
P[A|B], is defined as follows:
Definition 1 The probability of an event A given an event B, denoted by P(A|B),
is given by
P(A|B) = P(A ∩ B) / P(B)   if P(B) > 0
and is left undefined if P(B) = 0.
From the above formula it is evident that P[AB] = P[A|B]P[B] = P[B|A]P[A] if
both P[A] and P[B] are nonzero. Notice that when speaking of conditional proba-
bilities we are conditioning on some given event B; that is, we are assuming that the
experiment has resulted in some outcome in B. B, in effect, then becomes our "new"
sample space. All probability properties of the previous section apply to conditional
probabilities as well, i.e., P(·|B) is a probability measure. In particular:
1. P (A|B) ≥ 0
2. P (S|B) = 1
3. P(∪_{i=1}^∞ A_i | B) = Σ_{i=1}^∞ P(A_i|B) for any pairwise disjoint events {A_i}_{i=1}^∞.
Note that if A and B are mutually exclusive events, P(A|B) = 0. When
A ⊆ B, P(A|B) = P(A)/P(B) ≥ P(A), with strict inequality unless P(B) = 1. When
B ⊆ A, P(A|B) = 1.
However, there is an additional property (law), called the Law of Total
Probability, which states that:
LAW OF TOTAL PROBABILITY:
P (A) = P (A ∩B) + P (A ∩Bc)
For a given probability space (Ω, A, P[·]), if B_1, B_2, ..., B_n is a collection of mutually
exclusive events in A satisfying ∪_{i=1}^n B_i = Ω and P[B_i] > 0 for i = 1, 2, ..., n, then for
every A ∈ A,
P[A] = Σ_{i=1}^n P[A|B_i]P[B_i]
Another important theorem in probability is the so called Bayes’ Theorem
which states:
BAYES RULE: Given a probability space (Ω, A, P[·]), if B_1, B_2, ..., B_n is a
collection of mutually exclusive events in A satisfying ∪_{i=1}^n B_i = Ω and P[B_i] > 0 for
i = 1, 2, ..., n, then for every A ∈ A for which P[A] > 0 we have:
P[B_j|A] = P[A|B_j]P[B_j] / Σ_{i=1}^n P[A|B_i]P[B_i]
Notice that for events A and B ∈ A which satisfy P[A] > 0 and P[B] > 0 we have:
P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)].
This follows from the definition of conditional probability and the law of total
probability. The probability P(B) is a prior probability and P(A|B) is frequently a
likelihood, while P(B|A) is the posterior.
Finally, the Multiplication Rule states:
Given a probability space (Ω, A, P[·]), if A_1, A_2, ..., A_n are events in A for
which P[A_1 A_2 ⋯ A_{n−1}] > 0, then:
P[A_1 A_2 ⋯ A_n] = P[A_1]P[A_2|A_1]P[A_3|A_1 A_2] ⋯ P[A_n|A_1 A_2 ⋯ A_{n−1}]
Example: A plant has two machines. Machine A produces 60% of the total
output with a fraction defective of 0.02. Machine B produces the remaining output with a fraction
defective of 0.04. If a single unit of output is observed to be defective, what is the
probability that this unit was produced by machine A?
4
Let A be the event that the unit was produced by machine A, B the event that
it was produced by machine B, and D the event that the unit is defective. We
ask for P[A|D]. But P[A|D] = P[AD]/P[D]. Now P[AD] = P[D|A]P[A] = 0.02 ∗
0.6 = 0.012. Also P[D] = P[D|A]P[A] + P[D|B]P[B] = 0.012 + 0.04 ∗ 0.4 = 0.028.
Consequently, P[A|D] = 0.012/0.028 ≈ 0.429. Notice that P[B|D] = 1 − P[A|D] ≈ 0.571. We can
also use a tree diagram to evaluate P[AD] and P[BD].
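The arithmetic of this example can also be checked with a few lines of code (a sketch; the probabilities are those stated in the example, and note that P[A|D] = 0.012/0.028 = 3/7 ≈ 0.429):

```python
# Two-machine example: P(A) = 0.6 with defect rate P(D|A) = 0.02,
# P(B) = 0.4 with defect rate P(D|B) = 0.04.
p_A, p_B = 0.6, 0.4
p_D_given_A, p_D_given_B = 0.02, 0.04

# Law of total probability: P(D) = P(D|A)P(A) + P(D|B)P(B)
p_D = p_D_given_A * p_A + p_D_given_B * p_B

# Bayes' rule: P(A|D) = P(D|A)P(A) / P(D)
p_A_given_D = p_D_given_A * p_A / p_D

print(round(p_D, 3))          # 0.028
print(round(p_A_given_D, 3))  # 0.429
```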
Example: A marketing manager believes the market demand potential of a
new product to be high with a probability of 0.30, or average with probability of 0.50,
or to be low with a probability of 0.20. From a sample of 20 employees, 14 indicated
a very favorable reception to the new product. In the past such an employee response
(14 out of 20 favorable) has occurred with the following probabilities: if the actual
demand is high, the probability of favorable reception is 0.80; if the actual demand is
average, the probability of favorable reception is 0.55; and if the actual demand is low,
the probability of the favorable reception is 0.30. Thus given a favorable reception,
what is the probability of actual high demand?
Again, what we ask for is P[H|F] = P[HF]/P[F]. Now P[F] = P[H]P[F|H] + P[A]P[F|A] +
P[L]P[F|L] = 0.24 + 0.275 + 0.06 = 0.575. Also P[HF] = P[F|H]P[H] = 0.24. Hence
P[H|F] = 0.24/0.575 ≈ 0.4174
Example: There are five boxes and they are numbered 1 to 5. Each box
contains 10 balls. Box i has i defective balls and 10−i non-defective balls, i = 1, 2, .., 5.
Consider the following random experiment: First a box is selected at random, and
then a ball is selected at random from the selected box. 1) What is the probability
that a defective ball will be selected? 2) If we have already selected the ball and
noted that it is defective, what is the probability that it came from box 5?
Let A denote the event that a defective ball is selected and Bi the event
that box i is selected, i = 1, 2, .., 5. Note that P [Bi] = 1/5, for i = 1, 2, .., 5, and
P [A|Bi] = i/10. Question 1) is what is P [A]? Using the theorem of total probabilities
we have:
P[A] = Σ_{i=1}^5 P[A|B_i]P[B_i] = Σ_{i=1}^5 (i/10)(1/5) = 3/10. Notice that the total number of
defective balls is 15 out of 50. Hence in this case we can say that P[A] = 15/50 = 3/10.
This is true as the probabilities of choosing each of the 5 boxes are the same. Question
2) asks what is P[B_5|A]. Since box 5 contains more defective balls than box 4, which
contains more defective balls than box 3 and so on, we expect to find that P[B_5|A] >
P[B_4|A] > P[B_3|A] > P[B_2|A] > P[B_1|A]. We apply Bayes' theorem:
P[B_5|A] = P[A|B_5]P[B_5] / Σ_{i=1}^5 P[A|B_i]P[B_i] = (1/2)(1/5) / (3/10) = 1/3
Similarly, P[B_j|A] = P[A|B_j]P[B_j] / Σ_{i=1}^5 P[A|B_i]P[B_i] = (j/10)(1/5) / (3/10) = j/15 for j = 1, 2, ..., 5. Notice that uncon-
ditionally all the B_i's were equally likely.
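The same computation can be carried out exactly in rational arithmetic (a sketch of the example above, using Python's fractions module):

```python
from fractions import Fraction

# Box i (i = 1..5) holds i defective balls out of 10; each box is
# chosen with probability 1/5, as in the example above.
prior = {i: Fraction(1, 5) for i in range(1, 6)}        # P(B_i)
likelihood = {i: Fraction(i, 10) for i in range(1, 6)}  # P(A | B_i)

# Theorem of total probability: P(A) = sum_i P(A|B_i) P(B_i)
p_A = sum(likelihood[i] * prior[i] for i in prior)

# Bayes' theorem: P(B_i|A) = P(A|B_i) P(B_i) / P(A)
posterior = {i: likelihood[i] * prior[i] / p_A for i in prior}

print(p_A)           # 3/10
print(posterior[5])  # 1/3
```

The loop over boxes reproduces the closed form P[B_j|A] = j/15 for every j.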
Let A and B be two events in A and let P(·) be a probability function. Events
A and B are defined to be independent if and only if one of the following conditions is
satisfied:
(i) P [AB] = P [A]P [B].
(ii) P [A|B] = P [A] if P [B] > 0.
(iii) P [B|A] = P [B] if P [A] > 0.
These are equivalent definitions, except that (i) does not really require P(A),
P(B) > 0. Notice that the property that two events A and B are independent and the property that
A and B are mutually exclusive are distinct, though related, properties. We know
that if A and B are mutually exclusive then P[AB] = 0. Now if these events are
also independent then P[AB] = P[A]P[B], and consequently P[A]P[B] = 0, which
means that either P[A] = 0 or P[B] = 0. Hence two mutually exclusive events are
independent if and only if P[A] = 0 or P[B] = 0. On the other hand, if P[A] ≠ 0 and P[B] ≠ 0,
then if A and B are independent they cannot be mutually exclusive, and conversely, if they
are mutually exclusive they cannot be independent. Also notice that independence is not
transitive, i.e., A independent of B and B independent of C does not imply that A
is independent of C.
Example: Consider tossing two dice. Let A denote the event of an odd total,
B the event of an ace on the first die, and C the event of a total of seven. We ask
the following:
(i) Are A and B independent?
(ii) Are A and C independent?
(iii) Are B and C independent?
(i) P [A|B] = 1/2, P [A] = 1/2 hence P [A|B] = P [A] and consequently A and
B are independent.
(ii) P[A|C] = 1 ≠ P[A] = 1/2, hence A and C are not independent.
(iii) P [C|B] = 1/6 = P [C] = 1/6 hence B and C are independent.
Notice that although A and B are independent and C and B are independent
A and C are not independent.
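These three checks can be verified by enumerating the 36 equally likely outcomes (a small sketch of the example above):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of tossing two dice
outcomes = list(product(range(1, 7), repeat=2))
A = {o for o in outcomes if (o[0] + o[1]) % 2 == 1}  # odd total
B = {o for o in outcomes if o[0] == 1}               # ace on first die
C = {o for o in outcomes if o[0] + o[1] == 7}        # total of seven

def prob(event):
    return Fraction(len(event), 36)

# Independence of E and F means P(E ∩ F) = P(E)P(F)
print(prob(A & B) == prob(A) * prob(B))  # True
print(prob(A & C) == prob(A) * prob(C))  # False
print(prob(B & C) == prob(B) * prob(C))  # True
```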
Let us extend the independence of two events to several ones:
For a given probability space (Ω, A, P[·]), let A_1, A_2, ..., A_n be n events in A.
Events A_1, A_2, ..., A_n are defined to be independent if and only if:
P[A_i A_j] = P[A_i]P[A_j] for i ≠ j
P[A_i A_j A_k] = P[A_i]P[A_j]P[A_k] for i ≠ j, i ≠ k, k ≠ j
and so on
P[∩_{i=1}^n A_i] = Π_{i=1}^n P[A_i]
Notice that pairwise independence does not imply independence, as the fol-
lowing example shows.
Example: Consider tossing two dice. Let A_1 denote the event of an odd face
on the first die, A_2 the event of an odd face on the second die, and A_3 the event of
an odd total. Then we have: P[A_1]P[A_2] = (1/2)(1/2) = P[A_1 A_2], P[A_1]P[A_3] = (1/2)(1/2) =
P[A_3|A_1]P[A_1] = P[A_1 A_3], and P[A_2 A_3] = 1/4 = P[A_2]P[A_3], hence A_1, A_2, A_3 are
pairwise independent. However, notice that P[A_1 A_2 A_3] = 0 ≠ 1/8 = P[A_1]P[A_2]P[A_3].
Hence A_1, A_2, A_3 are not independent.
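Enumeration confirms the pairwise factorizations and the failure of the triple one (the key point being that two odd faces force an even total):

```python
from fractions import Fraction
from itertools import product, combinations

outcomes = list(product(range(1, 7), repeat=2))
A1 = {o for o in outcomes if o[0] % 2 == 1}           # odd first face
A2 = {o for o in outcomes if o[1] % 2 == 1}           # odd second face
A3 = {o for o in outcomes if (o[0] + o[1]) % 2 == 1}  # odd total

def prob(event):
    return Fraction(len(event), 36)

# Every pair factorizes: P(E ∩ F) = P(E)P(F)
pairwise = all(prob(E & F) == prob(E) * prob(F)
               for E, F in combinations([A1, A2, A3], 2))
print(pairwise)                        # True

# ... but the triple intersection is empty, while P(A1)P(A2)P(A3) = 1/8
print(prob(A1 & A2 & A3))              # 0
print(prob(A1) * prob(A2) * prob(A3))  # 1/8
```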
0.2 Random Variables, Distribution Functions, and Densities
The probability space (S,A, P ) is not particularly easy to work with. In practice, we
often need to work with spaces with some structure (metric spaces). It is convenient
therefore to work with a cardinalization of S by using the notion of random variable.
Formally, a random variable X is just a mapping from the sample space to
the real line, i.e.,
X : S −→ R,
with a certain property: it is a measurable mapping, i.e.,
A_X = {X^{−1}(B) : B ∈ B} ⊆ A,
where B is a sigma-algebra on R; that is, for any B in B the inverse image X^{−1}(B) belongs to A. The
probability measure P_X can then be defined by
P_X(X ∈ B) = P(X^{−1}(B)).
It is straightforward to show that A_X is a σ-algebra whenever B is. Therefore, P_X
is a probability measure obeying Kolmogorov's axioms. Hence we have transferred
(S, A, P) −→ (R, B, P_X), where B is the Borel σ-algebra when X(S) = R or any
uncountable set, and B is the power set of X(S) when X(S) is finite. The function X(·) must be
such that the set A_r, defined by A_r = {ω : X(ω) ≤ r}, belongs to A for every real
number r, since the intervals (−∞, r] generate the Borel σ-algebra B.
The important part of the definition is that in terms of a random experiment,
S is the totality of outcomes of that random experiment, and the function, or random
variable, X(.) with domain S makes some real number correspond to each outcome
of the experiment. The fact that we also require the collection of ω's for which
X(ω) ≤ r to be an event (i.e., an element of A) for each real number r is not much of
a restriction, since the use of random variables is, in our case, to describe only events.
Example: Consider the experiment of tossing a single coin. Let the random
variable X denote the number of heads. In this case S = {head, tail}, and X(ω) = 1
if ω = head, and X(ω) = 0 if ω = tail. So the random variable X associates a
real number with each outcome of the experiment. To show that X satisfies the
definition we should show that {ω : X(ω) ≤ r} belongs to A for every real number
r. Take A = {∅, {head}, {tail}, S}. Now if r < 0, {ω : X(ω) ≤ r} = ∅; if 0 ≤ r < 1 then
{ω : X(ω) ≤ r} = {tail}; and if r ≥ 1 then {ω : X(ω) ≤ r} = {head, tail} = S.
Hence, for each r the set {ω : X(ω) ≤ r} belongs to A and consequently X(·) is a
random variable.
In the above example the random variable is described in terms of the random
experiment as opposed to its functional form, which is the usual case.
We can now work with (R, B, P_X), which has metric structure and algebra.
For example, toss two dice, in which case the sample space is
S = {(1, 1), (1, 2), ..., (6, 6)}.
We can define two random variables, the sum X and the product Y, with ranges:
X(S) = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Y(S) = {1, 2, 3, 4, 5, 6, 8, 9, 10, ..., 36}
The simplest random variables are the indicators I_A:
I_A(s) = 1 if s ∈ A, and I_A(s) = 0 if s ∉ A.
The associated sigma-algebra in S is
{∅, S, A, A^c}
Finally, we give formal definition of a continuous real-valued random variable.
Definition 2 A random variable is continuous if its probability measure PX is ab-
solutely continuous with respect to Lebesgue measure, i.e., PX(A) = 0 whenever
λ(A) = 0.
0.2.1 Distribution Functions
Associated with each random variable there is the distribution function
FX (x) = PX (X ≤ x)
defined for all x ∈ R. This function effectively replaces PX . Note that we can
reconstruct PX from FX .
EXAMPLE. S = {H, T}, X(H) = 1, X(T) = 0, (p = 1/2).
If x < 0, F_X(x) = 0.
If 0 ≤ x < 1, F_X(x) = 1/2.
If x ≥ 1, F_X(x) = 1.
EXAMPLE. The logit c.d.f. is
F_X(x) = 1 / (1 + e^{−x})
It is continuous everywhere, strictly increasing, and asymptotes to 0 and 1 at −∞ and +∞
respectively.
Note that the distribution function FX(x) of a continuous random variable is
a continuous function. The distribution function of a discrete random variable is a
step function.
Theorem 3 A function F(·) is a c.d.f. of a random variable X if and only if the
following three conditions hold:
1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
2. F is a nondecreasing function in x
3. F is right-continuous, i.e., for all x_0, lim_{x→x_0^+} F(x) = F(x_0)
Note also that any such F is continuous except at a set of points of Lebesgue measure zero.
0.2.2 Continuous Random Variables
A random variable X is called continuous if there exist a function fX(.) such that
FX(x) =xR
−∞fX(u)du for every real number x. In such a case FX(x) is the cumulative
distribution and the function fX(.) is the density function.
10
Notice that according to the above definition the density function is not
uniquely determined. The idea is that if the a function change value if a few points
its integral is unchanged. Furthermore, notice that fX(x) = dFX(x)/dx.
The notations for discrete and continuous density functions are the same,
yet they have different interpretations. We know that for discrete random variables
fX(x) = P [X = x], which is not true for continuous random variables. Further-
more, for discrete random variables fX(.) is a function with domain the real line and
counterdomain the interval [0, 1], whereas, for continuous random variables fX(.) is a
function with domain the real line and counterdomain the interval [0,∞). Note that
for a continuous r.v.
P(X = x) ≤ P(x − ε ≤ X ≤ x) = F_X(x) − F_X(x − ε) → 0
as ε → 0, by the continuity of F_X(x). The set {X = x} is an example of a set of
measure (in this case the measure is P or P_X) zero. In fact, any countable set is
of measure zero under a distribution which is absolutely continuous with respect to
Lebesgue measure. Because the probability of a singleton is zero,
P (a ≤ X ≤ b) = P (a ≤ X < b) = P (a < X < b)
for any a, b.
Example: Let X be the random variable representing the length of a tele-
phone conversation. One could model this experiment by assuming that the distribu-
tion of X is given by F_X(x) = 1 − e^{−λx}, where λ is some positive number and the
random variable can take values only in the interval [0, ∞). The density function
is dF_X(x)/dx = f_X(x) = λe^{−λx}. If we assume that telephone conversations are mea-
sured in minutes, P[5 < X ≤ 10] = ∫_5^{10} f_X(x)dx = ∫_5^{10} λe^{−λx}dx = e^{−5λ} − e^{−10λ}, and
for λ = 1/5 we have that P[5 < X ≤ 10] = e^{−1} − e^{−2} ≈ 0.23.
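A quick numerical check of this probability (a sketch using the values from the example):

```python
from math import exp

# P(5 < X <= 10) = F(10) - F(5) = e^{-5λ} - e^{-10λ} for the
# exponential c.d.f. F(x) = 1 - e^{-λx}
lam = 1 / 5
prob = exp(-5 * lam) - exp(-10 * lam)
print(round(prob, 4))  # 0.2325
```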
The example above indicates that the density functions of continuous random
variables are used to calculate probabilities of events defined in terms of the corre-
sponding continuous random variable X, i.e., P[a < X ≤ b] = ∫_a^b f_X(x)dx. Again we
can give the definition of the density function without any reference to the random
variable, i.e., any function f(·) with domain the real line and counterdomain [0, ∞) is
defined to be a probability density function iff
(i) f(x) ≥ 0 for all x
(ii) ∫_{−∞}^∞ f(x)dx = 1.
In practice when we refer to the certain distribution of a random variable, we
state its density or cumulative distribution function. However, notice that not all
random variables are either discrete or continuous.
0.3 Expectations and Moments of Random Variables
An extremely useful concept in problems involving random variables or distributions
is that of expectation.
0.3.1 Mean or Expectation
Let X be a random variable. The mean or the expected value of X, denoted by
E[X] or μ_X, is defined by:
(i) E[X] = Σ_j x_j P[X = x_j] = Σ_j x_j f_X(x_j)
if X is a discrete random variable with counterdomain the countable set
{x_1, ..., x_j, ...};
(ii) E[X] = ∫_{−∞}^∞ x f_X(x)dx
if X is a continuous random variable with density function f_X(x) and if either
|∫_0^∞ x f_X(x)dx| < ∞ or |∫_{−∞}^0 x f_X(x)dx| < ∞ or both;
(iii) E[X] = ∫_0^∞ [1 − F_X(x)]dx − ∫_{−∞}^0 F_X(x)dx
for an arbitrary random variable X.
(i) and (ii) are used in practice to find the mean for discrete and continuous
random variables, respectively. (iii) is used for the mean of a random variable that is
neither discrete nor continuous.
(i) and (ii) are used in practice to find the mean for discrete and continuous
random variables, respectively. (iii) is used for the mean of a random variable that is
neither discrete nor continuous.
Notice that in the above definition we assume that the sum and the integrals
exist. Also that the summation in (i) runs over the possible values of j and the
jth term is the value of the random variable multiplied by the probability that the
random variable takes this value. Hence E[X] is an average of the values that the
random variable takes on, where each value is weighted by the probability that the
random variable takes this value. Values that are more probable receive more weight.
The same is true in the integral form in (ii). There the value x is multiplied by the
approximate probability that X equals the value x, i.e. fX(x)dx, and then integrated
over all values.
Notice that in the definition of a mean of a random variable, only density
functions or cumulative distributions were used. Hence we have really defined the
mean for these functions without reference to random variables. We then call the
defined mean the mean of the cumulative distribution or the appropriate density
function. Hence, we can speak of the mean of a distribution or density function as
well as the mean of a random variable.
Notice that E[X] is the center of gravity (or centroid) of the unit mass that
is determined by the density function of X. So the mean of X is a measure of where
the values of the random variable are centered or located i.e. is a measure of central
location.
Example: Consider the experiment of tossing two dice. Let X denote the
total of the upturned faces. Then for this case we have:
E[X] = Σ_{i=2}^{12} i f_X(i) = 7
Example: Consider an X that can take only two possible values, 1 and −1, each
with probability 0.5. Then the mean of X is:
E[X] = 1 ∗ 0.5 + (−1) ∗ 0.5 = 0
Notice that the mean in this case is not one of the possible values of X.
Example: Consider a continuous random variable X with density function
f_X(x) = λe^{−λx} for x ∈ [0, ∞). Then
E[X] = ∫_{−∞}^∞ x f_X(x)dx = ∫_0^∞ xλe^{−λx}dx = 1/λ
Example: Consider a continuous random variable X with density function
f_X(x) = x^{−2} for x ∈ [1, ∞). Then
E[X] = ∫_{−∞}^∞ x f_X(x)dx = ∫_1^∞ x · x^{−2}dx = lim_{b→∞} log b = ∞
so we say that the mean does not exist, or that it is infinite.
Median of X: When F_X is continuous and strictly increasing, we can define
the median of X, denoted m(X), as the unique solution to
F_X(m) = 1/2.
Since in this case F_X^{−1}(·) exists, we can alternatively write m = F_X^{−1}(1/2). For a discrete
r.v., there may be many m that satisfy this, or there may be none. Suppose X takes the
values 0, 1, 2, each with probability 1/3; then there does not exist an m with F_X(m) = 1/2.
Also, if X takes the values 0, 1, 2, 3, each with probability 1/4,
then any 1 ≤ m ≤ 2 is an adequate median.
Note that if E(X^n) exists, then so does E(X^{n−1}), but not vice versa (n > 0).
Also, when the support is infinite, the expectation does not necessarily exist.
If ∫_0^∞ x f_X(x)dx = ∞ but ∫_{−∞}^0 x f_X(x)dx > −∞, then E(X) = ∞.
If ∫_0^∞ x f_X(x)dx = ∞ and ∫_{−∞}^0 x f_X(x)dx = −∞, then E(X) is not defined.
Example: [Cauchy] f_X(x) = (1/π) · 1/(1 + x²). This density function is symmetric
about zero, and one is tempted to say that E(X) = 0. But ∫_0^∞ x f_X(x)dx = ∞ and
∫_{−∞}^0 x f_X(x)dx = −∞, so E(X) does not exist according to the above definition.
Now consider Y = g(X), where g is a (piecewise) monotonic continuous func-
tion. Then
E(Y) = ∫_{−∞}^∞ y f_Y(y)dy = ∫_{−∞}^∞ g(x) f_X(x)dx = E(g(X))
Theorem 4 Expectation has the following properties:
1. [Linearity] E (a1g1(X) + a2g2(X) + a3) = a1E (g1(X)) + a2E (g2(X)) + a3
2. [Monotonicity] If g1(x) ≥ g2(x)⇒ E (g1(X)) ≥ E (g2(X))
3. Jensen's inequality. If g(x) is a weakly convex function, i.e., g(λx + (1 − λ)y) ≤
λg(x) + (1 − λ)g(y) for all x, y and all λ with 0 ≤ λ ≤ 1, then
E (g(X)) ≥ g (E (X)) .
An Interpretation of Expectation
We claim that E(X) is the unique minimizer of E(X − θ)² with respect to θ, assum-
ing that the second moment of X is finite.
Theorem 5 Suppose that E(X²) exists and is finite; then E(X) is the unique min-
imizer of E(X − θ)² with respect to θ.
This theorem says that the expectation is the constant closest to X in mean
square error.
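Since E(X − θ)² = Var(X) + (E X − θ)², the minimum is attained at θ = E X. A small numeric sketch on an illustrative (hypothetical) discrete distribution:

```python
# Hypothetical three-point distribution, used only for illustration
support = [(1, 0.2), (2, 0.5), (4, 0.3)]
mean = sum(x * p for x, p in support)  # E[X] = 2.4

def mse(theta):
    """E(X - theta)^2 for the distribution above."""
    return sum((x - theta) ** 2 * p for x, p in support)

# The mean is no worse than any candidate value of theta we try
candidates = [0.0, 1.0, 2.0, 2.5, 3.0, 5.0]
print(all(mse(mean) <= mse(t) for t in candidates))  # True
print(round(mean, 2))                                # 2.4
```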
0.3.2 Variance
Let X be a random variable and μ_X be E[X]. The variance of X, denoted by σ²_X or
var[X], is defined by:
(i) var[X] = Σ_j (x_j − μ_X)² P[X = x_j] = Σ_j (x_j − μ_X)² f_X(x_j)
if X is a discrete random variable with counterdomain the countable set
{x_1, ..., x_j, ...};
(ii) var[X] = ∫_{−∞}^∞ (x − μ_X)² f_X(x)dx
if X is a continuous random variable with density function f_X(x);
(iii) var[X] = ∫_0^∞ 2x[1 − F_X(x) + F_X(−x)]dx − μ²_X
for an arbitrary random variable X.
The variances are defined only if the series in (i) is convergent or if the integrals
in (ii) or (iii) exist. Again, the variance of a random variable is defined in terms of
the density function or cumulative distribution function of the random variable and
consequently, variance can be defined in terms of these functions without reference
to a random variable.
Notice that variance is a measure of spread since if the values of the random
variable X tend to be far from their mean, the variance of X will be larger than the
variance of a comparable random variable whose values tend to be near their mean.
It is clear from (i), (ii) and (iii) that the variance is a nonnegative number.
If X is a random variable with variance σ²_X, then the standard deviation of
X, denoted by σ_X, is defined as √var(X).
The standard deviation of a random variable, like the variance, is a measure
of spread or dispersion of the values of a random variable. In many applications it
is preferable to the variance since it will have the same measurement units as the
random variable itself.
Example: Consider the experiment of tossing two dice. Let X denote the
total of the upturned faces. Then for this case we have (μX = 7):
var[X] = Σ_{i=2}^{12} (i − μ_X)² f_X(i) = 210/36
Example: Consider an X that can take only two possible values, 1 and −1, each
with probability 0.5. Then the variance of X is (μ_X = 0):
var[X] = 0.5 ∗ 1² + 0.5 ∗ (−1)² = 1
Example: Consider an X that can take only two possible values, 10 and −10,
each with probability 0.5. Then we have:
μ_X = E[X] = 10 ∗ 0.5 + (−10) ∗ 0.5 = 0
var[X] = 0.5 ∗ 10² + 0.5 ∗ (−10)² = 100
Notice that in examples 2 and 3 the two random variables have the same mean
but different variance, larger being the variance of the random variable with values
further away from the mean.
Example: Consider a continuous random variable X with density function
fX(x) = λe−λx for x ∈ [0,∞). Then (μX = 1/λ):
var[X] = ∫_{−∞}^∞ (x − μ_X)² f_X(x)dx = ∫_0^∞ (x − 1/λ)² λe^{−λx}dx = 1/λ²
Example: Consider a continuous random variable X with density function
fX(x) = x−2 for x ∈ [1,∞). Then we know that the mean of X does not exist.
Consequently, we can not define the variance.
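The exponential results above (mean 1/λ, variance 1/λ²) can be checked by crude numerical integration; λ = 0.5 and the truncation point 60 are illustrative choices, not part of the example:

```python
from math import exp

# Midpoint-rule integration of x f(x) and (x - mean)^2 f(x) for the
# exponential density f(x) = λ e^{-λx}, truncated at x = 60 where the
# remaining mass is negligible.
lam = 0.5
h, upper = 0.001, 60.0
xs = [h * (i + 0.5) for i in range(int(upper / h))]
fs = [lam * exp(-lam * x) for x in xs]

mean = h * sum(x * f for x, f in zip(xs, fs))
var = h * sum((x - mean) ** 2 * f for x, f in zip(xs, fs))

print(round(mean, 3))  # 2.0   (= 1/λ)
print(round(var, 3))   # 4.0   (= 1/λ²)
```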
Notice that
Var(X) = E[(X − E(X))²] = E(X²) − E²(X)
and that
Var(aX + b) = a²Var(X),  SD = √Var,  SD(aX + b) = |a| SD(X),
i.e., SD(X) changes proportionally under scaling. Variance and standard deviation measure disper-
sion: the higher the variance, the more spread out the distribution. The interquartile range, F_X^{−1}(3/4) − F_X^{−1}(1/4), the
range of the middle half, always exists and is an alternative measure of dispersion.
0.3.3 Higher Moments of a Random Variable
If X is a random variable, the rth raw moment of X, denoted by μ′_r, is defined as:
μ′_r = E[X^r]
if this expectation exists. Notice that μ′_1 = E[X] = μ_X, the mean of X.
If X is a random variable, the rth central moment of X about α is defined
as E[(X − α)^r]. If α = μ_X, we have the rth central moment of X about μ_X, denoted
by μ_r, which is:
μ_r = E[(X − μ_X)^r]
We also have measures defined in terms of quantiles to describe some of the char-
acteristics of random variables or density functions. The qth quantile of a random
variable X, or of its corresponding distribution, is denoted by ξ_q and is defined as the
smallest number ξ satisfying F_X(ξ) ≥ q. If X is a continuous random variable, then
the qth quantile of X is a number ξ satisfying F_X(ξ) = q.
The median of a random variable X, denoted by med X or med(X), or ξ_{0.5}, is
the 0.5th quantile. Notice that if X is a continuous random variable the median of X
satisfies:
∫_{−∞}^{med(X)} f_X(x)dx = 1/2 = ∫_{med(X)}^∞ f_X(x)dx
so the median of X is any number that has half the mass of X to its right and the
other half to its left. The median and the mean are measures of central location.
The third moment about the mean, μ_3 = E(X − E(X))³, is called a measure
of asymmetry, or skewness. Symmetrical distributions can be shown to have μ_3 = 0.
Distributions can be skewed to the left or to the right. However, knowledge of the
third moment gives no clue as to the shape of the distribution; i.e., it could be the
case that μ_3 = 0 but the distribution is far from symmetrical. The ratio μ_3/σ³ is
unitless and is called the coefficient of skewness. An alternative measure of skewness
is provided by the ratio (mean − median)/(standard deviation).
The fourth moment about the mean, μ_4 = E(X − E(X))⁴, is used as a measure
of kurtosis, the degree of flatness of a density near its center. The coefficient
of kurtosis is defined as μ_4/σ⁴ − 3, and positive values are sometimes used to indicate
that a density function is more peaked around its center than the normal (leptokur-
tic distributions). A negative value of the coefficient of kurtosis is indicative of a
distribution which is flatter around its center than the standard normal (platykurtic
distributions). This measure suffers from the same failing as the measure of skewness,
i.e., it does not always measure what it is supposed to.
While a particular moment, or a few of the moments, may give little information
about a distribution, under suitable conditions the entire set of moments determines the distribution exactly.
In applied statistics the first two moments are of great importance, but the third and
fourth are also useful.
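The skewness and kurtosis coefficients are easy to compute for a simple discrete distribution; here a two-point distribution with P(X = 1) = 0.2 is used purely as an illustration:

```python
# Two-point distribution: P(X=0) = 0.8, P(X=1) = 0.2 (illustrative).
support = [(0, 0.8), (1, 0.2)]

mean = sum(x * p for x, p in support)
mu2 = sum((x - mean) ** 2 * p for x, p in support)  # variance σ²
mu3 = sum((x - mean) ** 3 * p for x, p in support)  # third central moment
mu4 = sum((x - mean) ** 4 * p for x, p in support)  # fourth central moment

skew = mu3 / mu2 ** 1.5           # coefficient of skewness μ3/σ³
excess_kurt = mu4 / mu2 ** 2 - 3  # coefficient of kurtosis μ4/σ⁴ − 3

print(round(skew, 3))         # 1.5
print(round(excess_kurt, 3))  # 0.25
```

Both agree with the closed forms for a Bernoulli(p) variable, (1 − 2p)/√(p(1 − p)) and (1 − 6p(1 − p))/(p(1 − p)) at p = 0.2.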
0.3.4 Moment Generating Functions
Finally we turn to the moment generating function (mgf) and characteristic Function
(cf). The mgf is defined as
MX (t) = E¡etX¢=
Z ∞
−∞etxfX(x)dx
for any real t, provided this integral exists in some neighborhood of 0. It is the
Laplace transform of the function fX(·) with argument −t. We have the useful
inversion formula
fX(x) =
Z ∞
−∞MX (t) e
−txdt
The mgf is of limited use, since it does not exist for many r.v.'s; the cf is applicable
more generally, since it always exists:
φ_X(t) = E(e^{itX}) = ∫_{−∞}^∞ e^{itx} f_X(x)dx = ∫_{−∞}^∞ cos(tx) f_X(x)dx + i ∫_{−∞}^∞ sin(tx) f_X(x)dx
This essentially is the Fourier transform of the function f_X(·), and there is a well
defined inversion formula
f_X(x) = (1/2π) ∫_{−∞}^∞ e^{−itx} φ_X(t)dt
If X is symmetric about zero, the complex part of the cf is zero. Also,
(d^r/dt^r) φ_X(t) |_{t=0} = E(i^r X^r e^{itX}) |_{t=0} = i^r E(X^r),  r = 1, 2, 3, ...
Thus the moments of X are related to the derivatives of the cf at the origin.
If
c(t) = ∫_{−∞}^∞ exp(itx) dF(x),
notice that
d^r c(t)/dt^r = ∫_{−∞}^∞ (ix)^r exp(itx) dF(x)
and
d^r c(t)/dt^r |_{t=0} = ∫_{−∞}^∞ (ix)^r dF(x) = i^r μ′_r  ⇒  μ′_r = (−i)^r d^r c(t)/dt^r |_{t=0},
the rth uncentered moment. Now expanding c(t) in powers of t we get
c(t) = c(0) + dc(t)/dt |_{t=0} t + ... + d^r c(t)/dt^r |_{t=0} t^r/r! + ... = 1 + μ′_1 (it) + ... + μ′_r (it)^r/r! + ...
The cumulants are defined as the coefficients κ_1, κ_2, ..., κ_r in the identity in it
exp(κ_1 (it) + κ_2 (it)²/2! + ... + κ_r (it)^r/r! + ...) = 1 + μ′_1 (it) + ... + μ′_r (it)^r/r! + ... = c(t) = ∫_{−∞}^∞ exp(itx) dF(x)
The cumulant-moment connection:
Suppose X is a random variable with n moments a_1, ..., a_n. Then X has n
cumulants k_1, ..., k_n and
a_{r+1} = Σ_{j=0}^r (r choose j) a_j k_{r+1−j}  for r = 0, ..., n − 1, with a_0 = 1.
Writing this out for r = 0, ..., 3 produces:
a_1 = k_1
a_2 = k_2 + a_1 k_1
a_3 = k_3 + 2a_1 k_2 + a_2 k_1
a_4 = k_4 + 3a_1 k_3 + 3a_2 k_2 + a_3 k_1.
These recursive formulas can be used to calculate the a's efficiently from the
k's, and vice versa. When X has mean 0, that is, when a_1 = 0 = k_1, a_j becomes
μ_j = E((X − E(X))^j),
so the above formulas simplify to:
μ_2 = k_2
μ_3 = k_3
μ_4 = k_4 + 3k_2².
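The recursion above is straightforward to implement; here is a sketch, checked against the exponential distribution with rate 1, whose rth cumulant is (r − 1)! and whose rth raw moment is r!:

```python
from math import comb, factorial

def raw_moments_from_cumulants(k):
    """Raw moments a_1..a_n from cumulants k_1..k_n via the recursion
    a_{r+1} = sum_{j=0}^{r} C(r, j) a_j k_{r+1-j}, with a_0 = 1."""
    a = [1]  # a_0 = 1
    for r in range(len(k)):
        # k is 0-indexed, so k_{r+1-j} is k[r - j]
        a.append(sum(comb(r, j) * a[j] * k[r - j] for j in range(r + 1)))
    return a[1:]

# Exp(1): cumulants k_r = (r-1)!  ->  raw moments a_r = r!
cumulants = [factorial(r - 1) for r in range(1, 5)]  # [1, 1, 2, 6]
print(raw_moments_from_cumulants(cumulants))         # [1, 2, 6, 24]
```

For a standard normal (cumulants 0, 1, 0, 0) the same function returns the familiar raw moments 0, 1, 0, 3.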
0.3.5 Expectations of Functions of Random Variables
Product and Quotient
Let E(X) = μ_X and E(Y) = μ_Y, and consider the quotient f(X, Y) = X/Y. Expanding f(X, Y)
around (μ_X, μ_Y) to second order, we have
f(X, Y) ≈ μ_X/μ_Y + (1/μ_Y)(X − μ_X) − (μ_X/μ_Y²)(Y − μ_Y) + (μ_X/μ_Y³)(Y − μ_Y)² − (1/μ_Y²)(X − μ_X)(Y − μ_Y),
as ∂f/∂X = 1/Y, ∂f/∂Y = −X/Y², ∂²f/∂X² = 0, ∂²f/∂X∂Y = ∂²f/∂Y∂X = −1/Y², and ∂²f/∂Y² = 2X/Y³. Taking
expectations we have
E(X/Y) ≈ μ_X/μ_Y + (μ_X/μ_Y³) Var(Y) − (1/μ_Y²) Cov(X, Y).
For the variance, take the variance of the Taylor expansion, keeping only
terms up to order 2:
Var(X/Y) ≈ (μ_X/μ_Y)² [ Var(X)/μ_X² + Var(Y)/μ_Y² − 2 Cov(X, Y)/(μ_X μ_Y) ].
0.3.6 Continuous Distributions
UNIFORM ON [a, b].
A very simple distribution for a continuous random variable is the uniform distribution. Its density function is:
f(x|a, b) = 1/(b − a) if x ∈ [a, b], and 0 otherwise,
and
F(x|a, b) = ∫_a^x f(z|a, b)dz = (x − a)/(b − a)  for x ∈ [a, b],
where −∞ < a < b < ∞. Then the random variable X is defined to be uniformly
distributed over the interval [a, b]. Now if X is uniformly distributed over [a, b], then
E(X) = (a + b)/2,  median = (a + b)/2,  Var(X) = (b − a)²/12.
If X ∼ U[a, b] =⇒ X − a ∼ U[0, b − a] =⇒ (X − a)/(b − a) ∼ U[0, 1]. Notice that if
a random variable is uniformly distributed over one of the intervals [a, b),
(a, b], or (a, b), the density function, expected value, and variance do not change.
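A seeded Monte Carlo sketch of the uniform moments (the endpoints a = 2, b = 10 are an arbitrary illustrative choice):

```python
import random

random.seed(0)  # fixed seed so the check is reproducible
a, b = 2.0, 10.0
draws = [random.uniform(a, b) for _ in range(200_000)]

m = sum(draws) / len(draws)
v = sum((x - m) ** 2 for x in draws) / len(draws)

# Compare with E(X) = (a+b)/2 = 6 and Var(X) = (b-a)^2/12 ≈ 5.333
print(abs(m - (a + b) / 2) < 0.05)       # True
print(abs(v - (b - a) ** 2 / 12) < 0.1)  # True
```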
Exponential Distribution
If a random variable X has a density function given by:
f_X(x) = f_X(x; λ) = λe^{−λx} for 0 ≤ x < ∞,
where λ > 0, then X is defined to have a (negative) exponential distribution. For
this random variable we have
E[X] = 1/λ  and  var[X] = 1/λ²
Pareto-Levy or Stable Distributions
The stable distributions are a natural generalization of the normal in that, as their
name suggests, they are stable under addition, i.e. a sum of stable random variables
is also a random variable of the same type. However, nonnormal stable distributions
have more probability mass in the tail areas than the normal. In fact, the nonnormal
stable distributions are so fat-tailed that their variance and all higher moments are
infinite.
Closed form expressions for the density functions of stable random variables
are available for only the cases of normal and Cauchy.
If a random variable X has a density function given by:
f_X(x) = f_X(x; γ, δ) = (1/π) · γ/(γ² + (x − δ)²) for −∞ < x < ∞,
where −∞ < δ < ∞ and 0 < γ < ∞, then X is defined to have a Cauchy distribu-
tion. Notice that for this random variable not even the mean exists.
Normal or Gaussian:
We say that X ∼ N[μ, σ²] when
f(x|μ, σ²) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)},  −∞ < x < ∞,
E(X) = μ,  Var(X) = σ².
The distribution is symmetric about μ; it is also unimodal and positive everywhere.
Notice that
(X − μ)/σ = Z ∼ N[0, 1]
is the standard normal distribution.
Lognormal Distribution
Let X be a positive random variable, and let a new random variable Y be defined
as Y = log X. If Y has a normal distribution, then X is said to have a lognormal
distribution. The density function of a lognormal distribution is given by
f_X(x; μ, σ²) = (1/(x√(2πσ²))) e^{−(log x−μ)²/(2σ²)} for 0 < x < ∞,
where μ and σ² are parameters such that −∞ < μ < ∞ and σ² > 0. We have
E[X] = e^{μ+σ²/2}  and  var[X] = e^{2μ+2σ²} − e^{2μ+σ²}
Notice that if X is lognormally distributed then
E[log X] = μ and var[log X] = σ²
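The mean formula E[X] = e^{μ+σ²/2} can be checked by simulating Y ∼ N(μ, σ²) and exponentiating; μ = 0.3 and σ = 0.5 are illustrative values:

```python
import random
from math import exp

random.seed(1)  # reproducible
mu, sigma = 0.3, 0.5
# X = e^Y with Y ~ N(mu, sigma^2) is lognormal
xs = [exp(random.gauss(mu, sigma)) for _ in range(400_000)]

sample_mean = sum(xs) / len(xs)
theory = exp(mu + sigma ** 2 / 2)  # e^{μ + σ²/2}

# Sample mean agrees with the formula to well within 1%
print(abs(sample_mean / theory - 1) < 0.01)  # True
```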
Gamma-χ²
f(x|α, β) = (1/(Γ(α)β^α)) x^{α−1} e^{−x/β},  0 < x < ∞,  α, β > 0.
α is a shape parameter and β is a scale parameter. Here Γ(α) = ∫_0^∞ t^{α−1}e^{−t}dt is the
Gamma function, with Γ(n) = (n − 1)! for integer n. The χ²_k distribution is the special case with α = k/2 and β = 2.
Notice that we can approximate the Poisson and binomial distributions by the normal, in the sense that if a random variable X is distributed as Poisson with parameter λ, then (X − λ)/√λ is distributed approximately as standard normal. On the other hand, if Y ∼ Binomial(n, p), then (Y − np)/√(np(1 − p)) ∼ N(0, 1) approximately.
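A small standard-library sketch compares the exact Poisson and binomial c.d.f.'s with the normal approximation (the parameter values are illustrative; a continuity correction of 0.5 is applied):

```python
import math

# Normal c.d.f. via the error function
def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Exact Poisson c.d.f., accumulating P(X = i) iteratively
def poisson_cdf(k, lam):
    term, total = math.exp(-lam), math.exp(-lam)   # P(X = 0)
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

# Exact binomial c.d.f., accumulating P(Y = i) iteratively
def binom_cdf(k, n, p):
    term, total = (1.0 - p) ** n, (1.0 - p) ** n   # P(Y = 0)
    for i in range(1, k + 1):
        term *= (n - i + 1) / i * p / (1.0 - p)
        total += term
    return total

lam = 100.0
pois_exact = poisson_cdf(110, lam)
pois_norm = normal_cdf((110 + 0.5 - lam) / math.sqrt(lam))

n, p = 400, 0.5
bin_exact = binom_cdf(210, n, p)
bin_norm = normal_cdf((210 + 0.5 - n * p) / math.sqrt(n * p * (1 - p)))
```

For these parameter values the approximation error is well below one percentage point.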
The standard normal is an important distribution for another reason, as well.
Assume that we have a sample of n independent random variables, x1, x2, ..., xn, which are coming from the same distribution with mean m and variance s². Then

(1/√n) Σ_{i=1}^n (x_i − m)/s ∼ N(0, 1) approximately, for large n.

This is the well-known Central Limit Theorem for independent observations.
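The CLT can be illustrated by simulation: standardized sums of uniform draws should have roughly zero mean, unit variance, and about half their mass below zero (a sketch with illustrative sample sizes; the seed only fixes the draws):

```python
import random, math

# CLT sketch: standardized sums of n i.i.d. Uniform(0,1) draws,
# which have mean m = 1/2 and variance s^2 = 1/12
random.seed(0)
n, reps = 100, 20_000
m, s = 0.5, math.sqrt(1.0 / 12.0)

z = []
for _ in range(reps):
    total = sum(random.random() for _ in range(n))
    z.append((total - n * m) / (s * math.sqrt(n)))

z_mean = sum(z) / reps
z_var = sum(v * v for v in z) / reps - z_mean ** 2
frac_below_0 = sum(1 for v in z if v <= 0.0) / reps   # should be near Phi(0) = 0.5
```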
0.4 Multivariate Random Variables
We now consider the extension to multiple random variables, i.e.,

X = (X1, X2, ..., Xk) ∈ R^k.

The joint p.m.f., f_X(x), is a function with

P(X ∈ A) = Σ_{x∈A} f_X(x).

The joint p.d.f., f_X(x), is a function with

P(X ∈ A) = ∫_{x∈A} f_X(x) dx.

This is a multivariate integral, and in general difficult to compute. If A is a rectangle A = [a1, b1] × ... × [ak, bk], then

∫_{x∈A} f_X(x) dx = ∫_{ak}^{bk} ... ∫_{a1}^{b1} f_X(x) dx1 ... dxk.

The joint c.d.f. is defined similarly. In the discrete case

F_X(x) = Σ_{z1≤x1,...,zk≤xk} f_X(z1, z2, ..., zk),

and in the continuous case

F_X(x) = P(X1 ≤ x1, ..., Xk ≤ xk) = ∫_{−∞}^{x1} ... ∫_{−∞}^{xk} f_X(z1, z2, ..., zk) dz1 ... dzk.

The multivariate c.d.f. has coordinate-wise properties similar to those of a univariate c.d.f. For continuously differentiable c.d.f.'s

f_X(x) = ∂^k F_X(x) / (∂x1 ∂x2 ... ∂xk).
0.4.1 Conditional Distributions and Independence
We defined conditional probability P(A|B) = P(A ∩ B)/P(B) for events with P(B) ≠ 0. We now want to define conditional distributions of Y given X. In the discrete case there is no problem:

f_{Y|X}(y|x) = P(Y = y|X = x) = f(y, x)/f_X(x)

when the event X = x has nonzero probability. Likewise we can define

F_{Y|X}(y|x) = P(Y ≤ y|X = x) = Σ_{z≤y} f(z, x)/f_X(x).

Note that f_{Y|X}(y|x) is a density function and F_{Y|X}(y|x) is a c.d.f.:
1) f_{Y|X}(y|x) ≥ 0 for all y;
2) Σ_y f_{Y|X}(y|x) = Σ_y f(y, x)/f_X(x) = f_X(x)/f_X(x) = 1.
In the continuous case it appears a bit anomalous to talk about P(Y ∈ A|X = x), since X = x itself has zero probability of occurring. Still, we define the conditional density function

f_{Y|X}(y|x) = f(y, x)/f_X(x)

in terms of the joint and marginal densities. It turns out that f_{Y|X}(y|x) has the properties of a p.d.f.:
1) f_{Y|X}(y|x) ≥ 0;
2) ∫_{−∞}^{∞} f_{Y|X}(y|x) dy = ∫_{−∞}^{∞} f(y, x) dy / f_X(x) = f_X(x)/f_X(x) = 1.
We can define expectations within the conditional distribution,

E(Y|X = x) = ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy = ∫_{−∞}^{∞} y f(y, x) dy / ∫_{−∞}^{∞} f(y, x) dy,

and higher moments of the conditional distribution.
0.4.2 Independence
We say that Y and X are independent (denoted by Y ⊥⊥ X) if

P(Y ∈ A, X ∈ B) = P(Y ∈ A)P(X ∈ B)

for all events A, B in the relevant sigma-algebras. This is equivalent to the c.d.f. version, which is simpler to state and apply:

F_{YX}(y, x) = F(y)F(x) for all y, x.

In fact, we also work with the equivalent density versions:

f(y, x) = f(y)f(x) for all y, x;
f_{Y|X}(y|x) = f(y) for all y;
f_{X|Y}(x|y) = f(x) for all x.

If Y ⊥⊥ X, then g(X) ⊥⊥ h(Y) for any measurable functions g and h.
We can generalise the notion of independence to multiple random variables. Thus Y, X, and Z are mutually independent if:

f(y, x, z) = f(y)f(x)f(z);
f(y, x) = f(y)f(x);
f(x, z) = f(x)f(z);
f(y, z) = f(y)f(z);

for all y, x, z.
0.4.3 Examples of Multivariate Distributions
Multivariate Normal
We say that X = (X1, X2, ..., Xk) ∼ MVN_k(μ, Σ) when

f_X(x|μ, Σ) = (2π)^{−k/2} [det(Σ)]^{−1/2} exp(−(1/2)(x − μ)′Σ^{−1}(x − μ)),

where Σ is the k × k covariance matrix, with (i, j) element σ_ij, and det(Σ) is the determinant of Σ.

Theorem 6 (a) If X ∼ MVN_k(μ, Σ) then X_i ∼ N(μ_i, σ_ii) (this is shown by integration of the joint density with respect to the other variables).
(b) The conditional distributions of a partition X = (X1, X2) are normal too:

f_{X1|X2}(x1|x2) ∼ N(μ_{X1|X2}, Σ_{X1|X2}), where
μ_{X1|X2} = μ1 + Σ12 Σ22^{−1}(x2 − μ2), Σ_{X1|X2} = Σ11 − Σ12 Σ22^{−1} Σ21.

(c) Iff Σ is diagonal, then X1, X2, ..., Xk are mutually independent. In this case

det(Σ) = σ11 σ22 ... σkk,
−(1/2)(x − μ)′Σ^{−1}(x − μ) = −(1/2) Σ_{j=1}^k (x_j − μ_j)²/σ_jj,

so that

f_X(x|μ, Σ) = Π_{j=1}^k (1/√(2πσ_jj)) exp(−(x_j − μ_j)²/(2σ_jj)).
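Part (c) of Theorem 6 can be verified numerically: with a diagonal Σ, the joint MVN density equals the product of the univariate normal densities (a numpy sketch with arbitrary illustrative parameters):

```python
import numpy as np

# Theorem 6(c): with a diagonal covariance matrix, the joint MVN density
# factors into univariate normal densities (illustrative parameters)
mu = np.array([1.0, -2.0, 0.5])
var = np.array([2.0, 0.5, 1.5])   # diagonal entries sigma_jj
Sigma = np.diag(var)

def mvn_pdf(x):
    k = len(mu)
    d = x - mu
    quad = float(d @ np.linalg.inv(Sigma) @ d)
    return float(np.exp(-0.5 * quad)
                 / ((2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))))

def product_of_univariates(x):
    return float(np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)))

x = np.array([0.3, -1.1, 2.0])
joint = mvn_pdf(x)
prod = product_of_univariates(x)
```

The two evaluations agree to floating-point precision, illustrating the factorization.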
0.4.4 More on Conditional Distributions
We now consider the relationship between two, or more, random variables when they are not independent. In this case the conditional density f_{Y|X} and c.d.f. F_{Y|X} in general vary with the conditioning point x. Likewise for the conditional mean E(Y|X), conditional median M(Y|X), conditional variance V(Y|X), conditional characteristic function E(e^{itY}|X), and other functionals, all of which characterize the relationship between Y and X. Note that this is a directional concept, unlike covariance, and so for example E(Y|X) can be very different from E(X|Y).
Regression Models:
We start with a random vector (Y, X). We can write for any such pair

Y = m(X) + ε,

where m(X) = E(Y|X) is the systematic part and ε = Y − E(Y|X) is the random part. By construction ε satisfies E(ε|X) = 0, but ε is not necessarily independent of X. For example, Var(ε|X) = Var(Y − E(Y|X)|X) = Var(Y|X) = σ²(X) can be expected to vary with X as much as m(X) = E(Y|X). A convenient and popular simplification is to assume that

E(Y|X) = α + βX,
Var(Y|X) = σ².

For example, in the bivariate normal distribution Y|X has

E(Y|X) = μ_Y + ρ_YX (σ_Y/σ_X)(X − μ_X),
Var(Y|X) = σ_Y²(1 − ρ_YX²),

and in fact ε ⊥⊥ X.
We have the following result about conditional expectations
Theorem 7 (1) E(Y) = E[E(Y|X)].
(2) E(Y|X) minimizes E[(Y − g(X))²] over all measurable functions g(·).
(3) Var(Y) = E[Var(Y|X)] + Var[E(Y|X)].

Proof. (1) Write f_YX(y, x) = f_{Y|X}(y|x) f_X(x); then we have

E(Y) = ∫ y f_Y(y) dy = ∫ y (∫ f_YX(y, x) dx) dy = ∫ y (∫ f_{Y|X}(y|x) f_X(x) dx) dy
= ∫ (∫ y f_{Y|X}(y|x) dy) f_X(x) dx = ∫ E(Y|X = x) f_X(x) dx = E(E(Y|X)).

(2) E[(Y − g(X))²] = E[(Y − E(Y|X) + E(Y|X) − g(X))²]
= E[Y − E(Y|X)]² + 2E[(Y − E(Y|X))(E(Y|X) − g(X))] + E[E(Y|X) − g(X)]².

Now, as E(Y E(Y|X)) = E[(E(Y|X))²] and E(Y g(X)) = E(E(Y|X) g(X)), the cross term is zero, and we get

E[(Y − g(X))²] = E[Y − E(Y|X)]² + E[E(Y|X) − g(X)]² ≥ E[Y − E(Y|X)]².

(3) Var(Y) = E[Y − E(Y)]² = E[Y − E(Y|X)]² + E[E(Y|X) − E(Y)]² + 2E[(Y − E(Y|X))(E(Y|X) − E(Y))].

The first term is E[Y − E(Y|X)]² = E(E[(Y − E(Y|X))²|X]) = E[Var(Y|X)].
The second term is E[E(Y|X) − E(Y)]² = Var[E(Y|X)].
The third term is zero, as ε = Y − E(Y|X) satisfies E(ε|X) = 0 and E(Y|X) − E(Y) is measurable with respect to X.
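The variance decomposition in Theorem 7(3) can be checked by simulation for a simple linear model Y = α + βX + ε with ε independent of X, where Var(Y|X) = σ²_ε is constant and E(Y|X) = α + βX (a sketch; all parameter values are illustrative):

```python
import random

# Simulation check of Var(Y) = E[Var(Y|X)] + Var[E(Y|X)] for
# Y = alpha + beta*X + eps, with X ~ N(0,1) and eps ~ N(0, sigma_eps^2)
# independent of X; here Var(Y|X) = sigma_eps^2 is constant
random.seed(1)
n = 200_000
alpha, beta, sigma_eps = 2.0, 0.5, 0.3

xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [alpha + beta * x + random.gauss(0.0, sigma_eps) for x in xs]

def var(v):
    mean = sum(v) / len(v)
    return sum((u - mean) ** 2 for u in v) / len(v)

var_y = var(ys)                                       # total variance of Y
var_cond_mean = var([alpha + beta * x for x in xs])   # Var[E(Y|X)] = beta^2 Var(X)
e_cond_var = sigma_eps ** 2                           # E[Var(Y|X)], constant here
```

With these values the population variance of Y is β²·1 + σ²_ε = 0.34, and the two simulated pieces add up to the simulated total.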
Covariance

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y)

Note that if X or Y is a constant then Cov(X, Y) = 0. Also

Cov(aX + b, cY + d) = ac Cov(X, Y).

An alternative measure of association is given by the correlation coefficient

ρ_XY = Cov(X, Y)/(σ_X σ_Y).

Note that

ρ_{aX+b, cY+d} = sign(a) × sign(c) × ρ_XY.

If E(Y|X) = a = E(Y) almost surely, then Cov(X, Y) = 0. Also, if X and Y are independent random variables, then Cov(X, Y) = 0.
Both the covariance and the correlation of random variables X and Y are measures of a linear relationship between X and Y in the following sense: cov[X, Y] will be positive when (X − μ_X) and (Y − μ_Y) tend to have the same sign with high probability, and cov[X, Y] will be negative when (X − μ_X) and (Y − μ_Y) tend to have opposite signs with high probability. The actual magnitude of cov[X, Y], however, does not say much about how strong the linear relationship between X and Y is, because the variability of X and Y also matters. The correlation coefficient does not have this problem, as we divide the covariance by the product of the standard deviations. Furthermore, the correlation is unitless and −1 ≤ ρ ≤ 1.
These properties are very useful for evaluating the expected return and standard deviation of a portfolio. Assume r_a and r_b are the returns on assets A and B, and their variances are σ²_a and σ²_b, respectively. Assume that we form a portfolio of the two assets with weights w_a and w_b, respectively. If the correlation of the returns of these assets is ρ, find the expected return and standard deviation of the portfolio.
If R_p is the return of the portfolio, then R_p = w_a r_a + w_b r_b. The expected portfolio return is E[R_p] = w_a E[r_a] + w_b E[r_b]. The variance of the portfolio is

var[R_p] = var[w_a r_a + w_b r_b] = E[(w_a r_a + w_b r_b)²] − (E[w_a r_a + w_b r_b])²
= w²_a E[r²_a] + w²_b E[r²_b] + 2w_a w_b E[r_a r_b] − w²_a (E[r_a])² − w²_b (E[r_b])² − 2w_a w_b E[r_a]E[r_b]
= w²_a (E[r²_a] − (E[r_a])²) + w²_b (E[r²_b] − (E[r_b])²) + 2w_a w_b (E[r_a r_b] − E[r_a]E[r_b])
= w²_a var[r_a] + w²_b var[r_b] + 2w_a w_b cov[r_a, r_b] = w²_a σ²_a + w²_b σ²_b + 2w_a w_b ρ σ_a σ_b.

In vector format we have:

E[R_p] = (w_a w_b)(E[r_a], E[r_b])′

and

var[R_p] = (w_a w_b) ( σ²_a ρσ_aσ_b ; ρσ_aσ_b σ²_b ) (w_a, w_b)′.
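The scalar and vector forms of the portfolio variance can be compared directly (a numpy sketch; the weights, means, volatilities, and correlation are illustrative numbers):

```python
import numpy as np

# Two-asset portfolio: the expanded scalar formula equals the vector form w' Sigma w
w = np.array([0.6, 0.4])        # weights w_a, w_b
mu = np.array([0.08, 0.05])     # expected returns E[r_a], E[r_b]
sig = np.array([0.20, 0.10])    # standard deviations sigma_a, sigma_b
rho = 0.3

Sigma = np.array([[sig[0] ** 2,            rho * sig[0] * sig[1]],
                  [rho * sig[0] * sig[1],  sig[1] ** 2          ]])

exp_ret = float(w @ mu)              # E[R_p]
var_vec = float(w @ Sigma @ w)       # vector form w' Sigma w
var_scalar = (w[0] ** 2 * sig[0] ** 2 + w[1] ** 2 * sig[1] ** 2
              + 2 * w[0] * w[1] * rho * sig[0] * sig[1])
sd_port = var_vec ** 0.5             # portfolio standard deviation
```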
From the above example we can see that var[aX + bY] = a²var[X] + b²var[Y] + 2ab cov[X, Y] for random variables X and Y and constants a and b. In fact, we can generalize this formula to several random variables X1, X2, ..., Xn and constants a1, a2, ..., an, i.e.

var[a1X1 + a2X2 + ... + anXn] = Σ_{i=1}^n a²_i var[X_i] + 2 Σ_{i<j} a_i a_j cov[X_i, X_j].
0.5 Sampling Theory and Sample Statistics
To proceed we shall recall the following definitions.
Let X1, X2, ..., Xk be k random variables all defined on the same probability space (Ω, A, P[.]). The joint cumulative distribution function of X1, X2, ..., Xk, denoted by F_{X1,X2,...,Xk}(•, •, ..., •), is defined as

F_{X1,X2,...,Xk}(x1, x2, ..., xk) = P[X1 ≤ x1; X2 ≤ x2; ...; Xk ≤ xk]

for all (x1, x2, ..., xk).
Let X1, X2, ..., Xk be k discrete random variables. Then the joint discrete density function of these, denoted by f_{X1,X2,...,Xk}(•, •, ..., •), is defined to be

f_{X1,X2,...,Xk}(x1, x2, ..., xk) = P[X1 = x1; X2 = x2; ...; Xk = xk]

for (x1, x2, ..., xk) a value of (X1, X2, ..., Xk), and is 0 otherwise.
Let X1, X2, ..., Xk be k continuous random variables. Then the joint continuous density function of these, denoted by f_{X1,X2,...,Xk}(•, •, ..., •), is defined to be a function such that

F_{X1,X2,...,Xk}(x1, x2, ..., xk) = ∫_{−∞}^{xk} ... ∫_{−∞}^{x1} f_{X1,X2,...,Xk}(u1, u2, ..., uk) du1 ... duk

for all (x1, x2, ..., xk).
The totality of elements which are under discussion and about which infor-
mation is desired will be called the target population. The statistical problem is
to find out something about a certain target population. It is generally impossible
or impractical to examine the entire population, but one may examine a part of it
(a sample from it) and, on the basis of this limited investigation, make inferences
regarding the entire target population.
The problem immediately arises as to how the sample of the population should
be selected. Of practical importance is the case of a simple random sample, usually
called a random sample, which can be defined as follows:
Let the random variables X1, X2, ..., Xn have a joint density f_{X1,X2,...,Xn}(x1, x2, ..., xn) that factors as follows:

f_{X1,X2,...,Xn}(x1, x2, ..., xn) = f(x1)f(x2)...f(xn),

where f(.) is the common density of each X_i. Then X1, X2, ..., Xn is defined to be a random sample of size n from a population with density f(.). Note that the identical-distribution requirement can be weakened (we could have a different population for each j), reflecting heterogeneous individuals. Also, in time series we might want to allow dependence, i.e., X_j and X_k are dependent. When we are dealing with a finite population, sampling without replacement causes some heterogeneity, since if X1 = x1, then the distribution of X2 must be affected.
A sample statistic is a function of observable random variables which is itself an observable random variable and does not contain any unknown parameters; i.e., a sample statistic is any quantity we can write as a measurable function T(X1, ..., Xn). For example, let X1, X2, ..., Xn be a random sample from the density f(.). Then the rth sample moment, denoted by M′_r, is defined as:

M′_r = (1/n) Σ_{i=1}^n X^r_i.

In particular, if r = 1, we get the sample mean, which is usually denoted by X̄ or X̄n; that is:

X̄n = (1/n) Σ_{i=1}^n X_i.

Also the rth sample central moment (about X̄n), denoted by M_r, is defined as:

M_r = (1/n) Σ_{i=1}^n (X_i − X̄n)^r.

In particular, if r = 2, we get the sample variance, and the sample standard deviation

s² = (1/n) Σ_{i=1}^n (X_i − X̄)², s = √s²,

or maybe another sample statistic for the variance,

s²∗ = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².

We can also get the sample median,

M = median{X1, ..., Xn} = X_(r) if n = 2r − 1, and (1/2)[X_(r) + X_(r+1)] if n = 2r,

the empirical cumulative distribution function

F_n(x) = (1/n) Σ_{i=1}^n 1(X_i ≤ x),

and the empirical characteristic function

φ_n(t) = (1/n) Σ_{i=1}^n e^{itX_i} = (1/n) Σ_{i=1}^n cos(tX_i) + i (1/n) Σ_{i=1}^n sin(tX_i).

These are analogues of the corresponding population characteristics and will be shown to be similar to them when n is large. We calculate the properties of these statistics: (1) exact properties; (2) asymptotic properties.
0.5.1 Means and Variances
We can prove the following theorems:
Theorem 8 Let X1, X2, ..., Xn be a random sample from the density f(.). The expected value of the rth sample moment is equal to the rth population moment; i.e., the rth sample moment is an unbiased estimator of the rth population moment (proof omitted).
Theorem 9 Let X1, X2, ..., Xn be a random sample from a density f(.), and let X̄n = (1/n) Σ_{i=1}^n X_i be the sample mean. Then

E[X̄n] = μ and var[X̄n] = σ²/n,

where μ and σ² are the mean and variance of f(.), respectively. Notice that this is true for any distribution f(.), provided σ² is not infinite.
Proof
E[X̄n] = E[(1/n) Σ_{i=1}^n X_i] = (1/n) Σ_{i=1}^n E[X_i] = (1/n) Σ_{i=1}^n μ = (1/n)nμ = μ. Also

var[X̄n] = var[(1/n) Σ_{i=1}^n X_i] = (1/n²) Σ_{i=1}^n var[X_i] = (1/n²) Σ_{i=1}^n σ² = (1/n²)nσ² = σ²/n.
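Theorem 9 is easy to see in simulation: for an exponential population with λ = 2 (so μ = 1/2 and σ² = 1/4), the simulated mean and variance of X̄n over many samples are close to μ and σ²/n (a sketch with illustrative sizes):

```python
import random

# E[Xbar_n] = mu and var[Xbar_n] = sigma^2/n, checked for an exponential
# population with lambda = 2, i.e. mu = 0.5 and sigma^2 = 0.25
random.seed(2)
lam, n, reps = 2.0, 50, 40_000

means = []
for _ in range(reps):
    sample = [random.expovariate(lam) for _ in range(n)]
    means.append(sum(sample) / n)

mean_of_means = sum(means) / reps
var_of_means = sum((x - mean_of_means) ** 2 for x in means) / reps
# population values: mu = 0.5, sigma^2/n = 0.25/50 = 0.005
```

Note that the result does not use normality of the population at all, exactly as the theorem states.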
Theorem 10 Let X1, X2, ..., Xn be a random sample from a density f(.), and let s²∗ be defined as above. Then

E[s²∗] = σ² and var[s²∗] = (1/n)(μ4 − ((n − 3)/(n − 1))σ⁴),

where σ² and μ4 are the variance and the 4th central moment of f(.), respectively. Notice that this is true for any distribution f(.), provided that μ4 is not infinite.
Proof
We shall first prove the following identity, which will be used later:

Σ_{i=1}^n (X_i − μ)² = Σ_{i=1}^n (X_i − X̄n)² + n(X̄n − μ)².

Indeed,

Σ(X_i − μ)² = Σ(X_i − X̄n + X̄n − μ)² = Σ[(X_i − X̄n) + (X̄n − μ)]²
= Σ[(X_i − X̄n)² + 2(X_i − X̄n)(X̄n − μ) + (X̄n − μ)²]
= Σ(X_i − X̄n)² + 2(X̄n − μ)Σ(X_i − X̄n) + n(X̄n − μ)²
= Σ(X_i − X̄n)² + n(X̄n − μ)²,

since Σ(X_i − X̄n) = 0. Using the above identity we obtain:

E[s²∗] = E[(1/(n − 1)) Σ_{i=1}^n (X_i − X̄n)²] = (1/(n − 1)) E[Σ_{i=1}^n (X_i − μ)² − n(X̄n − μ)²]
= (1/(n − 1)) [Σ_{i=1}^n E(X_i − μ)² − nE(X̄n − μ)²] = (1/(n − 1)) [nσ² − n var(X̄n)]
= (1/(n − 1)) [nσ² − n(σ²/n)] = σ².

The derivation of the variance of s²∗ is omitted.
0.5.2 Sampling from the Normal Distribution
Theorem 11 Let X̄n denote the sample mean of a random sample of size n from a normal distribution with mean μ and variance σ². Then
(1) X̄ ∼ N(μ, σ²/n).
(2) X̄ and s² are independent.
(3) (n − 1)s²∗/σ² ∼ χ²_{n−1}.
(4) (X̄ − μ)/(s∗/√n) ∼ t_{n−1}.
Proof
(1) We know that

φ_X̄(t) = [φ_X(t/n)]^n.

Now φ_X(t/n) = exp(iμ(t/n) − (1/2)σ²(t/n)²). Hence

φ_X̄(t) = [exp(iμ(t/n) − (1/2)σ²(t/n)²)]^n = exp(iμt − (1/2)(σ²/n)t²),

which is the c.f. of a normal distribution with mean μ and variance σ²/n.
(2) For n = 2 we have that if X1 ∼ N(0, 1) and X2 ∼ N(0, 1), then X̄ = (X1 + X2)/2 and s² = (X1 − X2)²/4. Define Z1 = (X1 + X2)/2 and Z2 = (X1 − X2)/2. Then Z1 and Z2 are uncorrelated, and by normality independent.
0.5.3 The Gamma Function
The gamma function is defined as:

Γ(t) = ∫₀^∞ x^{t−1} e^{−x} dx for t > 0.

Notice that Γ(t + 1) = tΓ(t), since integrating by parts,

Γ(t + 1) = ∫₀^∞ x^t e^{−x} dx = [−x^t e^{−x}]₀^∞ + t ∫₀^∞ x^{t−1} e^{−x} dx = tΓ(t),

and if t is an integer then Γ(t + 1) = t!. Also, if t is an integer, then

Γ(t + 1/2) = (1·3·5···(2t − 1)/2^t) √π.

Finally, Γ(1/2) = √π.
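The gamma-function identities above can be confirmed with the standard library's math.gamma (the test values t = 3.7 and t = 4 are arbitrary):

```python
import math

# Recurrence Gamma(t+1) = t*Gamma(t) at a non-integer point
t = 3.7
recurrence_gap = math.gamma(t + 1.0) - t * math.gamma(t)

# Gamma(n) = (n-1)! for integer n
fact_gap = math.gamma(6) - math.factorial(5)

# Gamma(t + 1/2) = (1*3*5*...*(2t-1) / 2^t) * sqrt(pi) for integer t
t_int = 4
odd_prod = 1
for j in range(1, 2 * t_int, 2):   # 1, 3, 5, 7
    odd_prod *= j
half_formula = odd_prod / 2 ** t_int * math.sqrt(math.pi)
half_gap = math.gamma(t_int + 0.5) - half_formula

# Gamma(1/2) = sqrt(pi)
root_gap = math.gamma(0.5) - math.sqrt(math.pi)
```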
Recall that if X is a random variable with density

f_X(x) = (1/Γ(k/2)) (1/2)^{k/2} x^{k/2 − 1} e^{−x/2} for 0 < x < ∞,
where Γ(.) is the gamma function, then X is defined to have a chi-square distrib-
ution with k degrees of freedom.
Notice that if X is distributed as above, then:

E[X] = k and var[X] = 2k.
We can prove the following theorem
Theorem 12 If the random variables X_i, i = 1, 2, ..., k, are normally and independently distributed with means μ_i and variances σ²_i, then

U = Σ_{i=1}^k ((X_i − μ_i)/σ_i)²

has a chi-square distribution with k degrees of freedom. Proof omitted.
Furthermore,
Theorem 13 If the random variables X_i, i = 1, 2, ..., n, are normally and independently distributed with mean μ and variance σ², and S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄n)², then

U = (n − 1)S²/σ² ∼ χ²_{n−1},

where χ²_{n−1} is the chi-square distribution with n − 1 degrees of freedom. Proof omitted.
0.5.4 The F Distribution
If X is a random variable with density

f_X(x) = (Γ[(m + n)/2]/(Γ(m/2)Γ(n/2))) (m/n)^{m/2} x^{m/2 − 1}/[1 + (m/n)x]^{(m+n)/2} for 0 < x < ∞,

where Γ(.) is the gamma function, then X is defined to have an F distribution with m and n degrees of freedom.
Notice that if X is distributed as above then:

E[X] = n/(n − 2) (for n > 2) and var[X] = 2n²(m + n − 2)/(m(n − 2)²(n − 4)) (for n > 4).
Theorem 14 If the random variables U and V are independently distributed as chi-square with m and n degrees of freedom, respectively, i.e. U ∼ χ²_m and V ∼ χ²_n independently, then

(U/m)/(V/n) = X ∼ F_{m,n},

where F_{m,n} is the F distribution with m, n degrees of freedom. Proof omitted.
0.5.5 The Student-t Distribution
If X is a random variable with density

f_X(x) = (Γ[(k + 1)/2]/Γ(k/2)) (1/√(kπ)) 1/[1 + x²/k]^{(k+1)/2} for −∞ < x < ∞,

where Γ(.) is the gamma function, then X is defined to have a t distribution with k degrees of freedom.
Notice that if X is distributed as above then:

E[X] = 0 and var[X] = k/(k − 2) (for k > 2).

Theorem 15 If the random variables Z and V are independently distributed as standard normal and chi-square with k degrees of freedom, respectively, i.e. Z ∼ N(0, 1) and V ∼ χ²_k independently, then

Z/√(V/k) = X ∼ t_k,

where t_k is the t distribution with k degrees of freedom. Proof omitted.
The above theorems are very useful, especially for deriving the distributions of various test statistics and constructing confidence intervals.
0.6 Parametric Point Estimation
Point estimation involves two problems. The first is to devise some means of obtaining a statistic to use as an estimator. The second is to select criteria and techniques to define and find a "best" estimator among the many possible estimators.
0.6.1 Methods of Finding Estimators
Any statistic (known function of observable random variables that is itself a random
variable) whose values are used to estimate τ(θ), where τ(.) is some function of the
parameter θ, is defined to be an estimator of τ(θ).
Notice that for specific values of the realized random sample the estimator takes a specific value, called the estimate.
0.6.2 Method of Moments
Let f(.; θ1, θ2, ..., θk) be a density of a random variable X which has k parameters θ1, θ2, ..., θk. As before, let μ′_r denote the rth moment, i.e. μ′_r = E[X^r]. In general μ′_r will be a known function of the k parameters θ1, θ2, ..., θk. Denote this by writing μ′_r = μ′_r(θ1, θ2, ..., θk). Let X1, X2, ..., Xn be a random sample from the density f(.; θ1, θ2, ..., θk), and, as before, let M′_j be the jth sample moment, i.e. M′_j = (1/n) Σ_{i=1}^n X^j_i. Then, equating sample moments to population ones, we get k equations in k unknowns, i.e.

M′_j = μ′_j(θ1, θ2, ..., θk) for j = 1, 2, ..., k.

Let the solution of these equations be θ̂1, θ̂2, ..., θ̂k. We say that these k estimators are the estimators of θ1, θ2, ..., θk obtained by the method of moments.
Example: Let X1, X2, ..., Xn be a random sample from a normal distribution with mean μ and variance σ². Let (θ1, θ2) = (μ, σ²). Estimate the parameters μ and σ by the method of moments. Recall that σ² = μ′_2 − (μ′_1)² and μ = μ′_1. The method of moments equations become:

(1/n) Σ_{i=1}^n X_i = X̄ = M′_1 = μ′_1(μ, σ²) = μ
(1/n) Σ_{i=1}^n X²_i = M′_2 = μ′_2(μ, σ²) = σ² + μ²

Solving the two equations for μ and σ we get:

μ̂ = X̄ and σ̂ = √((1/n) Σ_{i=1}^n (X_i − X̄)²),

which are the method-of-moments estimators of μ and σ.
Example: Let X1, X2, ..., Xn be a random sample from a Poisson distribution with parameter λ. There is only one parameter, hence only one equation:

(1/n) Σ_{i=1}^n X_i = X̄ = M′_1 = μ′_1(λ) = λ.

Hence the method-of-moments estimator of λ is λ̂ = X̄.
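Both examples reduce to matching sample moments to population moments; for the normal case the estimates can be computed directly (a sketch; the true parameter values are illustrative):

```python
import random, math

# Method-of-moments estimates for a normal sample (illustrative true values)
random.seed(3)
mu_true, sigma_true, n = 1.5, 2.0, 100_000
xs = [random.gauss(mu_true, sigma_true) for _ in range(n)]

m1 = sum(xs) / n                    # M'_1, first sample moment
m2 = sum(x * x for x in xs) / n     # M'_2, second sample moment

mu_hat = m1                               # solves mu = mu'_1
sigma_hat = math.sqrt(m2 - m1 ** 2)       # solves sigma^2 = mu'_2 - (mu'_1)^2
```

For a sample this large both estimates land close to the true values used to generate the data.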
0.6.3 Maximum Likelihood
Consider the following estimation problem. Suppose that a box contains a number of black and a number of white balls, and suppose that it is known that the ratio of the numbers is 3:1, but it is not known whether the black or the white balls are more numerous; i.e., the probability of drawing a black ball is either 1/4 or 3/4. If n balls are drawn with replacement from the box, the distribution of X, the number of black balls, is given by the binomial distribution

f(x; p) = (n choose x) p^x (1 − p)^{n−x} for x = 0, 1, 2, ..., n,

where p is the probability of drawing a black ball. Here p = 1/4 or p = 3/4. We shall draw a sample of three balls, i.e. n = 3, with replacement and attempt to estimate the unknown parameter p of the distribution. The estimation is simple in this case, as we have to choose only between the two numbers 1/4 = 0.25 and 3/4 = 0.75. The possible outcomes and their probabilities are given below:

outcome x:    0      1      2      3
f(x; 0.75):   1/64   9/64   27/64  27/64
f(x; 0.25):   27/64  27/64  9/64   1/64

In the present example, if we found x = 0 in a sample of 3, the estimate 0.25 for p would be preferred over 0.75, because the probability 27/64 is greater than 1/64. In general we should estimate p by 0.25 when x = 0 or 1 and by 0.75 when x = 2 or 3. The estimator may be defined as

p̂ = p̂(x) = 0.25 for x = 0, 1; 0.75 for x = 2, 3.
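The decision rule p̂(x) amounts to picking, for each observed x, whichever admissible p gives the larger likelihood; a short sketch reproduces the table and the rule:

```python
from math import comb

# Likelihood table for the two admissible values of p, and the resulting rule
n = 3

def f(x, p):
    # binomial probability of x black balls out of n draws
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def p_hat(x):
    # pick the value of p with the larger likelihood at the observed x
    return 0.25 if f(x, 0.25) >= f(x, 0.75) else 0.75

rule = {x: p_hat(x) for x in range(4)}
```

Evaluating `rule` reproduces the estimator defined above: 0.25 for x = 0, 1 and 0.75 for x = 2, 3.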
The estimator thus selects, for every possible x, the value of p, say p̂, such that

f(x; p̂) > f(x; p′),

where p′ is the other value of p.
More generally, if several values of p were possible, we might reasonably proceed in the same manner. Thus, if we found x = 2 in a sample of 3 from a binomial population, we should substitute all possible values of p in the expression

f(2; p) = (3 choose 2) p²(1 − p) for 0 ≤ p ≤ 1

and choose as our estimate the value of p which maximizes f(2; p). The position of the maximum of the function above is found by setting the first derivative with respect to p equal to zero, i.e.

(d/dp) f(2; p) = 6p − 9p² = 3p(2 − 3p) = 0 ⇒ p = 0 or p = 2/3.

The second derivative is (d²/dp²) f(2; p) = 6 − 18p. Hence (d²/dp²) f(2; 0) = 6 > 0, so the value p = 0 represents a minimum, whereas (d²/dp²) f(2; 2/3) = −6 < 0, so p = 2/3 represents the maximum. Hence p̂ = 2/3 is our estimate, which has the property

f(x; p̂) > f(x; p′),

where p′ is any other value in the interval 0 ≤ p ≤ 1.
The likelihood function of n random variables X1, X2, ..., Xn is defined to be the joint density of the n random variables, say f_{X1,X2,...,Xn}(x1, x2, ..., xn; θ), which is considered as a function of θ. In particular, if X1, X2, ..., Xn is a random sample from the density f(x; θ), then the likelihood function is f(x1; θ)f(x2; θ)...f(xn; θ). To think of the likelihood function as a function of θ, we shall use the notation L(θ; x1, x2, ..., xn) or L(•; x1, x2, ..., xn) for the likelihood function in general.
The likelihood is a value of a density function; consequently, for discrete random variables it is a probability. Suppose for the moment that θ is known, denoted by θ0. The particular value of the random variables which is "most likely to occur" is the value x′1, x′2, ..., x′n such that f_{X1,X2,...,Xn}(x′1, x′2, ..., x′n; θ0) is a maximum. For example, for simplicity let us assume that n = 1 and X1 has the normal density with mean 0 and variance 1. Then the value of the random variable which is most likely to occur is X1 = 0. By "most likely to occur" we mean the value x′1 of X1 such that φ_{0,1}(x′1) > φ_{0,1}(x1) for any other value x1. Now let us suppose that the joint density of n random variables is f_{X1,X2,...,Xn}(x1, x2, ..., xn; θ), where θ is unknown. Let the particular values which are observed be represented by x′1, x′2, ..., x′n. We want to know from which density this particular set of values is most likely to have come; that is, for which value of θ is the likelihood largest that the set x′1, x′2, ..., x′n was obtained. In other words, we want to find the value of θ in the admissible set, denoted by θ̂, which maximizes the likelihood function L(θ; x′1, x′2, ..., x′n). The value θ̂ which maximizes the likelihood function is, in general, a function of x1, x2, ..., xn, say θ̂ = θ̂(x1, x2, ..., xn). Hence we have the following definition:
Let L(θ) = L(θ; x1, x2, ..., xn) be the likelihood function for the random variables X1, X2, ..., Xn. If θ̂ [where θ̂ = θ̂(x1, x2, ..., xn) is a function of the observations x1, x2, ..., xn] is the value of θ in the admissible range which maximizes L(θ), then Θ̂ = θ̂(X1, X2, ..., Xn) is the maximum likelihood estimator of θ, and θ̂ = θ̂(x1, x2, ..., xn) is the maximum likelihood estimate of θ for the sample x1, x2, ..., xn.
The most important cases which we shall consider are those in which X1, X2, ..., Xn is a random sample from some density function f(x; θ), so that the likelihood function is

L(θ) = f(x1; θ)f(x2; θ)...f(xn; θ).

Many likelihood functions satisfy regularity conditions under which the maximum likelihood estimator is the solution of the equation

dL(θ)/dθ = 0.

Also, L(θ) and log L(θ) have their maxima at the same value of θ, and it is sometimes easier to find the maximum of the logarithm of the likelihood. Notice also that if the likelihood function contains k parameters, then we find the estimator from the solution of the k first-order conditions.
Example: Let a random sample of size n be drawn from the Bernoulli distribution

f(x; p) = p^x(1 − p)^{1−x},

where 0 ≤ p ≤ 1. The sample values x1, x2, ..., xn will be a sequence of 0s and 1s, and the likelihood function is

L(p) = Π_{i=1}^n p^{x_i}(1 − p)^{1−x_i} = p^{Σx_i}(1 − p)^{n−Σx_i}.

Letting y = Σx_i, we obtain

log L(p) = y log p + (n − y) log(1 − p),

and

d log L(p)/dp = y/p − (n − y)/(1 − p).

Setting this expression equal to zero we get

p̂ = y/n = (1/n) Σx_i = x̄,

which is intuitively what the estimate of this parameter should be.
Example: Let a random sample of size n be drawn from the normal distribution with density

f(x; μ, σ²) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}.

The likelihood function is

L(μ, σ²) = Π_{i=1}^n (1/√(2πσ²)) e^{−(x_i−μ)²/(2σ²)} = (1/(2πσ²))^{n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (x_i − μ)²).

The logarithm of the likelihood function is

log L(μ, σ²) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (x_i − μ)².

To find the maximum with respect to μ and σ² we compute

∂ log L/∂μ = (1/σ²) Σ_{i=1}^n (x_i − μ)

and

∂ log L/∂σ² = −(n/2)(1/σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − μ)²,

and setting these derivatives equal to 0 and solving the resulting equations we find the estimates

μ̂ = (1/n) Σ_{i=1}^n x_i = x̄ and σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)²,

which turn out to be the sample moments corresponding to μ and σ².
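The closed-form estimates x̄ and (1/n)Σ(x_i − x̄)² do maximize the log-likelihood; a sketch checks this by comparing them against nearby parameter values (the sample and the perturbation grid are illustrative):

```python
import random, math

# Check that the closed-form ML estimates maximize the normal log-likelihood
random.seed(4)
xs = [random.gauss(0.7, 1.3) for _ in range(5_000)]
n = len(xs)

def loglik(mu, sigma2):
    # normal log-likelihood as derived above
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

mu_hat = sum(xs) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n

best = loglik(mu_hat, sigma2_hat)
# every nearby perturbed parameter pair gives a strictly lower log-likelihood
worse = max(loglik(mu_hat + d1, sigma2_hat + d2)
            for d1 in (-0.05, 0.0, 0.05)
            for d2 in (-0.05, 0.0, 0.05)
            if (d1, d2) != (0.0, 0.0))
```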
0.6.4 Properties of Point Estimators
One needs to define criteria so that various estimators can be compared. One of these
is the unbiasedness. An estimator T = t(X1,X2, ...,Xn) is defined to be an unbiased
estimator of τ(θ) if and only if
Eθ[T ] = Eθ[t(X1,X2, ..., Xn)] = τ(θ)
for all θ in the admissible space.
Other criteria are consistency, mean square error etc.
0.7 Maximum Likelihood Estimation
Let the observations be x = (x1, x2, ..., xn), and let the likelihood function be denoted by L(θ) = d(x; θ), θ ∈ Θ ⊂ R^k. Then the Maximum Likelihood Estimator (MLE) is:

θ̂ = argmax_{θ∈Θ} d(x; θ) ⇔ d(x; θ̂) ≥ d(x; θ) ∀θ ∈ Θ.

If θ̂ is unique, then the model is identified. Let ℓ(θ) = ln d(x; θ); then for local identification we have the following Lemma.

Lemma 16 The model is locally identified iff the Hessian is negative definite with probability 1, i.e.

Pr[H(θ̂) = ∂²ℓ(θ̂)/∂θ∂θ′ < 0] = 1.

Assume that the log-likelihood function can be written in the following form:

ℓ(θ) = Σ_{i=1}^n ln d(x_i; θ).
Then we usually make the following assumptions:

Assumption A1. The range of the random variable x, say C, i.e.

C = {x ∈ R^n : d(x; θ) > 0},

is independent of the parameter θ.

Assumption A2. The log-likelihood function ln d(x; θ) has partial derivatives with respect to θ up to third order, which are bounded and integrable with respect to x.

The k × 1 vector

s(x; θ) = ∂ℓ(x; θ)/∂θ

is called the score vector. We can state the following Lemma.

Lemma 17 Under assumptions A1 and A2 we have that

E(s) = 0, E(ss′) = −E[∂²ℓ/∂θ∂θ′].
Proof: As L(x, θ) is a density function, it follows that

∫_C L(x, θ) dx = 1,

where C = {x ∈ R^n : L(x, θ) > 0} ⊂ R^n. Under A1, C is independent of θ. Consequently, the derivative with respect to θ of the integral is equal to the integral of the derivative.* Consequently, taking derivatives of the above integral, we have:

(∂/∂θ) ∫_C L(x, θ) dx = 0 ⇒ ∫_C (∂L(x, θ)/∂θ) dx = 0 ⇒ ∫_C (∂L/∂θ)(1/L)L dx = 0
⇒ ∫_C (∂ ln L/∂θ) L dx = 0 ⇒ ∫_C (∂ℓ/∂θ) L dx = 0 ⇒ ∫_C sL dx = 0 ⇒ E(s) = 0.
Hence, we also have that E(s′) = 0, and taking derivatives with respect to θ we have

(∂/∂θ) ∫_C s′L dx = 0 ⇒ (∂/∂θ) ∫_C (∂ℓ/∂θ′)L dx = 0 ⇒ ∫_C (∂/∂θ)[(∂ℓ/∂θ′)L] dx = 0
⇒ ∫_C [(∂²ℓ/∂θ∂θ′)L + (∂L/∂θ)(∂ℓ/∂θ′)] dx = 0
⇒ ∫_C [(∂²ℓ/∂θ∂θ′)L + (∂L/∂θ)(1/L)L(∂ℓ/∂θ′)] dx = 0
⇒ ∫_C [(∂²ℓ/∂θ∂θ′)L + (∂ℓ/∂θ)(∂ℓ/∂θ′)L] dx = 0
⇒ ∫_C (∂ℓ/∂θ)(∂ℓ/∂θ′)L dx = −∫_C (∂²ℓ/∂θ∂θ′)L dx
⇒ ∫_C ss′L dx = −∫_C (∂²ℓ/∂θ∂θ′)L dx ⇒ E(ss′) = −E(∂²ℓ/∂θ∂θ′),

which is the second result. ■
The matrix

J(θ) = E(ss′) = E[(∂ℓ/∂θ)(∂ℓ/∂θ′)] = −E(∂²ℓ/∂θ∂θ′)

is called the (Fisher) Information Matrix and is a measure of the information that the sample x contains about the parameters in θ. In case ℓ(x, θ) can be written as

*In case C were dependent on θ, we would have to apply the Second Fundamental Theorem of Analysis to find the derivative of the integral.
ℓ(x, θ) = Σ_{i=1}^n ℓ(x_i, θ), we have that

J(θ) = −E((∂²/∂θ∂θ′) Σ_{i=1}^n ℓ(x_i, θ)) = −Σ_{i=1}^n E(∂²ℓ(x_i, θ)/∂θ∂θ′) = −nE(∂²ℓ(x_i, θ)/∂θ∂θ′).

Consequently the Information matrix is proportional to the sample size. Furthermore, by Assumption A2, E(∂²ℓ(x_i, θ)/∂θ∂θ′) is bounded. Hence

J(θ) = O(n).
Now we can state the following Lemma that will be needed in the sequel.
Lemma 18 Under assumptions A1 and A2 we have that, for any unbiased estimator of θ, say θ̃,

E(θ̃s′) = I_k,

where I_k is the identity matrix of order k.

Proof: As θ̃ is an unbiased estimator of θ, we have that

E(θ̃) = θ ⇒ ∫_C θ̃ L(x; θ) dx = θ.

Taking derivatives with respect to θ′, we have that

(∂/∂θ′) ∫_C θ̃(x) L(x; θ) dx = I_k ⇒ ∫_C θ̃(x) (∂L(x; θ)/∂θ′)(1/L(x; θ)) L(x; θ) dx = I_k
⇒ ∫_C θ̃(x) (∂ℓ(x, θ)/∂θ′) L(x; θ) dx = I_k ⇒ ∫_C θ̃(x) s′ L(x; θ) dx = I_k
⇒ E(θ̃s′) = I_k,

which is the result. ■

We can now prove the Cramér-Rao Theorem.
Theorem 19 Under the regularity assumptions A1 and A2, for any unbiased estimator θ̃ we have that

J^{−1}(θ) ≤ V(θ̃).
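For a concrete case of the bound: sampling from N(μ, σ²) with σ² known, the information for μ is n/σ², so the Cramér-Rao bound is σ²/n, and the sample mean (unbiased, with variance σ²/n) attains it. A simulation sketch with illustrative values:

```python
import random

# Cramer-Rao illustration: for N(mu, sigma^2) with sigma^2 known, the
# information for mu is n/sigma^2, so the bound is sigma^2/n, which the
# unbiased sample mean attains
random.seed(5)
mu, sigma2, n, reps = 0.0, 4.0, 25, 40_000
crlb = sigma2 / n                     # = 0.16

means = []
sd = sigma2 ** 0.5
for _ in range(reps):
    sample = [random.gauss(mu, sd) for _ in range(n)]
    means.append(sum(sample) / n)

mb = sum(means) / reps                                   # near mu (unbiasedness)
var_xbar = sum((x - mb) ** 2 for x in means) / reps      # near crlb
```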
Proof: Let us define the following 2k × 1 vector: ξ′ = (δ′, s′) = (θ̃′ − θ′, s′). Now we have that

V(ξ) = E(ξξ′) = E[ δδ′ δs′ ; sδ′ ss′ ] = [ V(θ̃) I_k ; I_k J(θ) ].

For the above result we took into consideration that θ̃ is unbiased, i.e. E(θ̃) = θ, the above Lemma, i.e. E(θ̃s′) = I_k, E(s) = 0, and E(ss′) = J(θ). It is known that all variance-covariance matrices are positive semi-definite; hence V(ξ) ≥ 0. Let us define the following matrix:

B = [I_k −J^{−1}(θ)].

The matrix B is k × 2k and of rank k. Consequently, as V(ξ) ≥ 0, we have that BV(ξ)B′ ≥ 0. Hence, as I_k and J^{−1}(θ) are symmetric, we have

BV(ξ)B′ = [I_k −J^{−1}(θ)] [ V(θ̃) I_k ; I_k J(θ) ] [ I_k ; −J^{−1}(θ) ]
= [ V(θ̃) − J^{−1}(θ) 0 ] [ I_k ; −J^{−1}(θ) ] = V(θ̃) − J^{−1}(θ) ≥ 0.

Consequently, V(θ̃) ≥ J^{−1}(θ), in the sense that V(θ̃) − J^{−1}(θ) is a positive semi-definite matrix. ■

The matrix J^{−1}(θ) is called the Cramér-Rao Lower Bound, as it is the lower bound for the variance of any unbiased estimator (either linear or non-linear).
Let now θ0 denote the true parameter value of θ, and d(x; θ0) the likelihood function evaluated at the true parameter value. Then for any function f(x) we define

E0[f(x)] = ∫ f(x) d(x; θ0) dx.

Let ℓ(θ)/n be the average log-likelihood and define a function z : Θ → R as

z(θ) = E0(ℓ(θ)/n) = (1/n) ∫ ℓ(θ) d(x; θ0) dx.

Then we can state the following Lemma:
Lemma 20 ∀θ ∈ Θ we have that

z(θ) ≤ z(θ0),

with strict inequality if

Pr[x ∈ S : d(x; θ) ≠ d(x; θ0)] > 0.

Proof: From the definition of z(θ) we have that

n[z(θ) − z(θ0)] = E0[ℓ(θ) − ℓ(θ0)] = E0[ln(d(x; θ)/d(x; θ0))] ≤ ln E0[d(x; θ)/d(x; θ0)]
= ln ∫ (d(x; θ)/d(x; θ0)) d(x; θ0) dx = ln ∫ d(x; θ) dx = ln 1 = 0,

where the inequality is due to Jensen's inequality. The inequality is strict when the ratio d(x; θ)/d(x; θ0) is non-constant with probability greater than 0. ■

When the observations x = (x1, x2, ..., xn) are randomly sampled, we have that

ℓ(θ) = Σ_{i=1}^n z_i, where z_i = ln d(x_i; θ),

and the z_i random variables are independent and have the same distribution, with mean E(z_i) = z(θ). Then from the Weak Law of Large Numbers we have that

plim (1/n) Σ_{i=1}^n z_i = plim (1/n) ℓ(θ) = z(θ).

However, the above is true under weaker assumptions, e.g. for dependent observations or for non-identically distributed random variables, etc. To avoid a lengthy exposition of the various cases we make the following assumption:

Assumption A3. Θ is a compact subset of R^k, and

plim (1/n) ℓ(θ) = z(θ) ∀θ ∈ Θ.

Recall that a closed and bounded subset of R^k is compact.
Theorem 21. Under the above assumption, and if the statistical model is identified, we have that
$$\hat\theta \xrightarrow{p} \theta_0,$$
where $\hat\theta$ is the MLE and $\theta_0$ the true parameter values.

Proof: Let $N$ be an open sphere with centre $\theta_0$ and radius $\varepsilon$, i.e. $N = \{\theta : \|\theta - \theta_0\| < \varepsilon\}$. Then its complement $\bar N$ is closed, and consequently $A = \bar N \cap \Theta$ is closed and bounded, i.e. compact. Hence
$$\max_{\theta\in A} z(\theta)$$
exists, and since $\theta_0 \notin A$, identification together with Lemma 20 gives $z(\theta) < z(\theta_0)$ for all $\theta \in A$, so that we can define
$$\delta = z(\theta_0) - \max_{\theta\in A} z(\theta) > 0. \tag{1}$$
Let $T_\delta \subset S$ be the event (a subset of the sample space) defined by
$$\left|\frac{1}{n}\ell(\theta) - z(\theta)\right| < \frac{\delta}{2} \quad \forall\theta\in\Theta. \tag{2}$$
Hence (2) applies for $\theta = \hat\theta$ as well, so on $T_\delta$
$$z(\hat\theta) > \frac{1}{n}\ell(\hat\theta) - \frac{\delta}{2}.$$
Given now that $\ell(\hat\theta) \ge \ell(\theta_0)$, we have on $T_\delta$
$$z(\hat\theta) > \frac{1}{n}\ell(\theta_0) - \frac{\delta}{2}.$$
Furthermore, as $\theta_0 \in \Theta$, on $T_\delta$
$$\frac{1}{n}\ell(\theta_0) > z(\theta_0) - \frac{\delta}{2},$$
from the relationship $|x| < d \Rightarrow -d < x < d$. Adding the above two inequalities, on $T_\delta$
$$z(\hat\theta) > z(\theta_0) - \delta.$$
Substituting out $\delta$, employing (1), we get on $T_\delta$
$$z(\hat\theta) > \max_{\theta\in A} z(\theta).$$
Hence $\hat\theta \notin A$, i.e. $\hat\theta \notin \bar N \cap \Theta$; as $\hat\theta \in \Theta$, this implies $\hat\theta \in N \cap \Theta$, so $\hat\theta \in N$. Consequently we have shown that when $T_\delta$ occurs, $\hat\theta \in N$. This implies that
$$\Pr(\hat\theta \in N) \ge \Pr(T_\delta),$$
and taking limits as $n \to \infty$ we have
$$\lim_{n\to\infty}\Pr\left(\|\hat\theta - \theta_0\| < \varepsilon\right) \ge \lim_{n\to\infty}\Pr(T_\delta) = 1,$$
by the definition of $N$ and by Assumption A3. Hence, as $\varepsilon$ is an arbitrary small positive number, we have
$$\forall\varepsilon > 0 \quad \lim_{n\to\infty}\Pr\left(\|\hat\theta - \theta_0\| < \varepsilon\right) = 1,$$
which is the definition of the probability limit. $\blacksquare$

When the observations $x = (x_1, x_2, \ldots, x_n)$ are randomly sampled, we have:
$$\ell(\theta) = \sum_{i=1}^n \ln d(x_i;\theta),$$
$$s(\theta) = \frac{\partial\ell(\theta)}{\partial\theta} = \sum_{i=1}^n \frac{\partial\ln d(x_i;\theta)}{\partial\theta},$$
$$H(\theta) = \frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta'} = \sum_{i=1}^n \frac{\partial^2\ln d(x_i;\theta)}{\partial\theta\partial\theta'}.$$
Given now that the observations $x_i$ are independent and have the same distribution, with density $d(x_i;\theta)$, the same is true for the vectors $\frac{\partial\ln d(x_i;\theta)}{\partial\theta}$ and the matrices $\frac{\partial^2\ln d(x_i;\theta)}{\partial\theta\partial\theta'}$. Consequently we can apply a Central Limit Theorem to get
$$n^{-1/2}s(\theta) = n^{-1/2}\sum_{i=1}^n \frac{\partial\ln d(x_i;\theta)}{\partial\theta} \xrightarrow{d} N\left(0, \bar J(\theta)\right),$$
and from the Law of Large Numbers
$$n^{-1}H(\theta) = n^{-1}\sum_{i=1}^n \frac{\partial^2\ln d(x_i;\theta)}{\partial\theta\partial\theta'} \xrightarrow{p} -\bar J(\theta),$$
where
$$\bar J(\theta) = n^{-1}J(\theta) = -n^{-1}E\left(\frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta'}\right),$$
i.e. the average Information matrix. However, the two asymptotic results apply even if the observations are dependent or non-identically distributed. To avoid a lengthy exposition of the various cases we make the following assumptions. As $n \to \infty$ we have:
Assumption A4.
$$n^{-1/2}s(\theta) \xrightarrow{d} N\left(0, \bar J(\theta)\right),$$
and

Assumption A5.
$$n^{-1}H(\theta) \xrightarrow{p} -\bar J(\theta),$$
where $\bar J(\theta) = J(\theta)/n = E(ss')/n = -E(H)/n$.
We can now state the following Theorem.

Theorem 22. Under Assumptions A2 and A3, the two assumptions above (A4 and A5) and identification, we have:
$$\sqrt n\left(\hat\theta - \theta_0\right) \xrightarrow{d} N\left[0, \left(\bar J(\theta_0)\right)^{-1}\right],$$
where $\hat\theta$ is the MLE and $\theta_0$ the true parameter values.

Proof: As $\hat\theta$ maximises the Likelihood Function, we have from the first-order conditions that
$$s(\hat\theta) = \frac{\partial\ell(\hat\theta)}{\partial\theta} = 0.$$
From the Mean Value Theorem, around $\theta_0$, we have that
$$s(\theta_0) + H(\theta^*)\left(\hat\theta - \theta_0\right) = 0, \tag{3}$$
where $\theta^* \in [\hat\theta, \theta_0]$, i.e. $\|\theta^* - \theta_0\| \le \|\hat\theta - \theta_0\|$. Now from the consistency of the MLE we have that
$$\hat\theta = \theta_0 + o_p(1),$$
where $o_p(1)$ denotes a random variable that goes to 0 in probability as $n \to \infty$. As now $\theta^* \in [\hat\theta, \theta_0]$, we have that
$$\theta^* = \theta_0 + o_p(1)$$
as well. Hence from (3) we have that
$$\sqrt n\left(\hat\theta - \theta_0\right) = -\left[H(\theta^*)/n\right]^{-1}s(\theta_0)/\sqrt n.$$
As now $\theta^* = \theta_0 + o_p(1)$, under Assumption A5 we have that
$$\sqrt n\left(\hat\theta - \theta_0\right) = \left[\bar J(\theta_0)\right]^{-1}s(\theta_0)/\sqrt n + o_p(1).$$
Under Assumption A4 the above equation implies that $\sqrt n\left(\hat\theta - \theta_0\right) \xrightarrow{d} N\left[0, \left(\bar J(\theta_0)\right)^{-1}\right]$. $\blacksquare$
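A small simulation (an added sketch, not from the notes; the exponential model, parameter values and seed are illustrative) shows Theorem 22 at work. For an $\mathrm{Exponential}(\lambda_0)$ sample the MLE is $\hat\lambda = 1/\bar x$ and $\bar J(\lambda) = 1/\lambda^2$, so $\sqrt n(\hat\lambda - \lambda_0)$ should be approximately $N(0, \lambda_0^2)$:

```python
import numpy as np

# Monte Carlo sketch of Theorem 22 for X_i ~ Exponential(rate lam0):
# the MLE is lam_hat = 1/mean(X), and Jbar(lam) = 1/lam**2, hence
# sqrt(n)*(lam_hat - lam0) ~ N(0, lam0**2) approximately.
rng = np.random.default_rng(1)
lam0, n, reps = 2.0, 500, 10_000

samples = rng.exponential(1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / samples.mean(axis=1)      # MLE of the rate
z = np.sqrt(n) * (lam_hat - lam0)

print(z.mean(), z.var())  # close to 0 and to lam0**2 = 4.0
```

The small positive sample mean of `z` reflects the finite-sample bias of $\hat\lambda$, which vanishes at rate $1/n$ and so does not affect the limit distribution.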
Example: Let $y_t \sim N(\mu, \sigma^2)$ i.i.d. for $t = 1, \ldots, T$. Then
$$\ell(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^T (y_t - \mu)^2,$$
where $\theta = (\mu, \sigma^2)'$. Now
$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{t=1}^T (y_t - \mu)$$
and
$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^T (y_t - \mu)^2.$$
Now
$$\frac{\partial\ell(\hat\theta)}{\partial\mu} = \frac{\partial\ell(\hat\theta)}{\partial\sigma^2} = 0.$$
Hence
$$\hat\mu = \frac{1}{T}\sum_{t=1}^T y_t \quad\text{and}\quad \hat\sigma^2 = \frac{1}{T}\sum_{t=1}^T (y_t - \hat\mu)^2.$$
Now
$$H(\theta) = \frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta'} = \begin{bmatrix} -\frac{T}{\sigma^2} & -\frac{1}{\sigma^4}\sum_{t=1}^T (y_t - \mu) \\ -\frac{1}{\sigma^4}\sum_{t=1}^T (y_t - \mu) & \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{t=1}^T (y_t - \mu)^2 \end{bmatrix},$$
and consequently, evaluating $H(\theta)$ at $\hat\theta$, we have
$$H(\hat\theta) = \begin{bmatrix} -\frac{T}{\hat\sigma^2} & 0 \\ 0 & -\frac{T}{2\hat\sigma^4} \end{bmatrix},$$
which is clearly negative definite. Now the Information matrix is:
$$J(\theta) = -E(H(\theta)) = E\begin{bmatrix} \frac{T}{\sigma^2} & \frac{1}{\sigma^4}\sum_{t=1}^T (y_t - \mu) \\ \frac{1}{\sigma^4}\sum_{t=1}^T (y_t - \mu) & -\frac{T}{2\sigma^4} + \frac{1}{\sigma^6}\sum_{t=1}^T (y_t - \mu)^2 \end{bmatrix} = \begin{bmatrix} \frac{T}{\sigma^2} & 0 \\ 0 & \frac{T}{2\sigma^4} \end{bmatrix}.$$
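The first-order conditions of the example can be checked numerically. The following Python sketch (added for illustration; data and seed are arbitrary) evaluates the score at the closed-form MLEs by central finite differences and verifies that it vanishes:

```python
import numpy as np

def loglik(theta, y):
    """Gaussian log-likelihood with theta = (mu, sigma2), as in the example."""
    mu, s2 = theta
    T = y.size
    return -T/2*np.log(2*np.pi) - T/2*np.log(s2) - ((y - mu)**2).sum()/(2*s2)

rng = np.random.default_rng(2)
y = rng.normal(3.0, 2.0, size=1_000)

# Closed-form MLEs from the first-order conditions.
mu_hat = y.mean()
s2_hat = ((y - mu_hat)**2).mean()
theta_hat = np.array([mu_hat, s2_hat])

# The score should vanish at the MLE; check by central finite differences.
h = 1e-6
grad = np.empty(2)
for i in range(2):
    e = np.zeros(2); e[i] = h
    grad[i] = (loglik(theta_hat + e, y) - loglik(theta_hat - e, y)) / (2*h)

print(mu_hat, s2_hat, grad)  # gradient numerically zero
```

At $\hat\theta$ the Hessian is diagonal with entries $-T/\hat\sigma^2$ and $-T/(2\hat\sigma^4)$, both negative, confirming a maximum.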
0.8 The Classical Tests

Let the null hypothesis be represented by
$$\Omega = \{\theta \in \Theta : \varphi(\theta) = 0\},$$
where $\theta$ is the vector of parameters and $\varphi(\theta) = 0$ are the restrictions. Consequently the Neyman ratio is given by:
$$\lambda(x) = \frac{\sup_{\theta\in\Theta} L(\theta)}{\sup_{\theta\in\Omega} L(\theta)} = \frac{L(\hat\theta)}{L(\tilde\theta)},$$
where $L(\theta)$ is the Likelihood function, $\hat\theta$ the unrestricted and $\tilde\theta$ the restricted MLE. As now $\ln(\cdot)$ is a monotonic, strictly increasing function, an equivalent test can be based on
$$LR = 2\ln\lambda(x) = 2\left[\ell(\hat\theta) - \ell(\tilde\theta)\right],$$
where $LR$ is the well-known Likelihood Ratio test and $\ell(\theta)$ is the log-likelihood function.
Using a Taylor expansion of $\ell(\tilde\theta)$ around $\hat\theta$ and employing the Mean Value Theorem we get:
$$\ell(\tilde\theta) = \ell(\hat\theta) + \frac{\partial\ell(\hat\theta)}{\partial\theta'}\left(\tilde\theta - \hat\theta\right) + \frac{1}{2}\left(\tilde\theta - \hat\theta\right)'\frac{\partial^2\ell(\theta^*)}{\partial\theta\partial\theta'}\left(\tilde\theta - \hat\theta\right),$$
where $\|\theta^* - \hat\theta\| \le \|\tilde\theta - \hat\theta\|$. Now $\frac{\partial\ell(\hat\theta)}{\partial\theta'} = s(\hat\theta)' = 0$, due to the fact that the first-order conditions are satisfied by the ML estimator $\hat\theta$. Consequently, the LR test is given by:
$$LR = 2\left[\ell(\hat\theta) - \ell(\tilde\theta)\right] = -\left(\tilde\theta - \hat\theta\right)'\frac{\partial^2\ell(\theta^*)}{\partial\theta\partial\theta'}\left(\tilde\theta - \hat\theta\right).$$
Now we know that
$$\sqrt n\left(\tilde\theta - \theta_0\right) = \left\{I_k - \left[\bar J(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\right\}\left[\bar J(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt n} + o_p(1)$$
and
$$\sqrt n\left(\hat\theta - \theta_0\right) = \left[\bar J(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt n} + o_p(1).$$
Hence
$$\sqrt n\left(\tilde\theta - \hat\theta\right) = -\left[\bar J(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar J(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt n} + o_p(1),$$
and consequently
$$LR = \left(\left[\bar J(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar J(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt n}\right)'\left(-\frac{1}{n}\frac{\partial^2\ell(\theta^*)}{\partial\theta\partial\theta'}\right)\left[\bar J(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar J(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt n} + o_p(1).$$
Now from Assumption A5 we have
$$n^{-1}H(\theta) = -\bar J(\theta) + o_p(1), \qquad \theta^* = \theta_0 + o_p(1), \qquad P(\theta_0) = F(\theta_0)\left[\bar J(\theta_0)\right]^{-1}F'(\theta_0).$$
Hence
$$LR = \left(\frac{s(\theta_0)}{\sqrt n}\right)'\left[\bar J(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar J(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt n} + o_p(1).$$
We can now state the following Theorem:

Theorem 23. Under the usual assumptions and under the null hypothesis we have that
$$LR = 2\left[\ell(\hat\theta) - \ell(\tilde\theta)\right] \xrightarrow{d} \chi^2_r.$$
Proof: The Likelihood Ratio is written as
$$LR = \xi_0'\, Z_0\left[Z_0'Z_0\right]^{-1}Z_0'\,\xi_0 + o_p(1),$$
where
$$\xi_0 = \left[\bar J(\theta_0)\right]^{-1/2}\frac{s(\theta_0)}{\sqrt n} \quad\text{and}\quad Z_0 = \left[\bar J(\theta_0)\right]^{-1/2}F'(\theta_0).$$
Now
$$\left[\bar J(\theta_0)\right]^{-1/2}\frac{s(\theta_0)}{\sqrt n} \xrightarrow{d} N(0, I_k),$$
and $Z_0\left[Z_0'Z_0\right]^{-1}Z_0'$ is symmetric and idempotent. Hence its rank equals its trace:
$$r\left(Z_0\left[Z_0'Z_0\right]^{-1}Z_0'\right) = \operatorname{tr}\left(Z_0\left[Z_0'Z_0\right]^{-1}Z_0'\right) = \operatorname{tr}\left(\left[Z_0'Z_0\right]^{-1}Z_0'Z_0\right) = \operatorname{tr}(I_r) = r.$$
Consequently, we get the result. $\blacksquare$

The Wald test is based on the idea that, if the restrictions are correct, the vector $\varphi(\hat\theta)$ should be close to zero.
Expanding $\varphi(\hat\theta)$ around $\varphi(\theta_0)$ we get:
$$\varphi(\hat\theta) = \varphi(\theta_0) + \frac{\partial\varphi(\theta^*)}{\partial\theta'}\left(\hat\theta - \theta_0\right) = F(\theta^*)\left(\hat\theta - \theta_0\right),$$
as under the null $\varphi(\theta_0) = 0$. Hence
$$\sqrt n\,\varphi(\hat\theta) = \sqrt n\, F(\theta^*)\left(\hat\theta - \theta_0\right),$$
and consequently
$$\sqrt n\,\varphi(\hat\theta) = \sqrt n\, F(\theta_0)\left(\hat\theta - \theta_0\right) + o_p(1).$$
Furthermore, recall that
$$\sqrt n\left(\hat\theta - \theta_0\right) \xrightarrow{d} N\left[0, \left(\bar J(\theta_0)\right)^{-1}\right].$$
Hence
$$\sqrt n\,\varphi(\hat\theta) \xrightarrow{d} N\left[0,\ F(\theta_0)\left(\bar J(\theta_0)\right)^{-1}F'(\theta_0)\right].$$
Let us now consider the following quadratic form:
$$n\left[\varphi(\hat\theta)\right]'\left[F(\theta_0)\left(\bar J(\theta_0)\right)^{-1}F'(\theta_0)\right]^{-1}\varphi(\hat\theta),$$
which is the squared Mahalanobis distance of the vector $\sqrt n\,\varphi(\hat\theta)$ from zero. However, the above quantity cannot be considered a statistic, as it is a function of the unknown parameter $\theta_0$. The Wald test is given by the above quantity with the unknown vector of parameters $\theta_0$ substituted by the ML estimator $\hat\theta$, i.e.
$$W = \left[\varphi(\hat\theta)\right]'\left[F(\hat\theta)\left(n\bar J(\hat\theta)\right)^{-1}F'(\hat\theta)\right]^{-1}\varphi(\hat\theta) = \left[\varphi(\hat\theta)\right]'\left[F(\hat\theta)\left(J(\hat\theta)\right)^{-1}F'(\hat\theta)\right]^{-1}\varphi(\hat\theta),$$
where $J(\hat\theta)$ is the estimated information matrix. In case $J(\hat\theta)$ does not have an explicit formula, it can be substituted by a consistent estimator, e.g. by
$$\hat J = -\sum_{i=1}^n \frac{\partial^2\ln d(x_i;\hat\theta)}{\partial\theta\partial\theta'}$$
or by the asymptotically equivalent outer-product-of-gradients form
$$\hat J = \sum_{i=1}^n \frac{\partial\ln d(x_i;\hat\theta)}{\partial\theta}\,\frac{\partial\ln d(x_i;\hat\theta)}{\partial\theta'}.$$
Hence the Wald statistic is given by
$$W = \left[\varphi(\hat\theta)\right]'\left[F(\hat\theta)\,\hat J^{-1}F'(\hat\theta)\right]^{-1}\varphi(\hat\theta).$$
Now we can prove the following Theorem:

Theorem 24. Under the usual regularity assumptions and the null hypothesis we have that
$$W = \left[\varphi(\hat\theta)\right]'\left[F(\hat\theta)\,\hat J^{-1}F'(\hat\theta)\right]^{-1}\varphi(\hat\theta) \xrightarrow{d} \chi^2_r.$$
Proof: For any consistent estimator of $\theta_0$ we have that
$$F(\hat\theta)\,\hat J^{-1}F'(\hat\theta) = F(\theta_0)\left(n\bar J(\theta_0)\right)^{-1}F'(\theta_0) + o_p(1).$$
Hence
$$W = n\left[\varphi(\hat\theta)\right]'\left[F(\theta_0)\left(\bar J(\theta_0)\right)^{-1}F'(\theta_0)\right]^{-1}\varphi(\hat\theta) + o_p(1).$$
Furthermore,
$$\sqrt n\,\varphi(\hat\theta) \xrightarrow{d} N\left[0,\ F(\theta_0)\left(\bar J(\theta_0)\right)^{-1}F'(\theta_0)\right],$$
and the result follows. $\blacksquare$

The Lagrange Multiplier (LM) test considers the distance from zero of the estimated Lagrange multipliers. Recall that
$$\frac{\tilde\lambda}{\sqrt n} \xrightarrow{d} N\left(0, \left[P(\theta_0)\right]^{-1}\right).$$
Consequently, the squared Mahalanobis distance is
$$\left(\frac{\tilde\lambda}{\sqrt n}\right)' P(\theta_0)\left(\frac{\tilde\lambda}{\sqrt n}\right) = \tilde\lambda'\, F(\theta_0)\left[n\bar J(\theta_0)\right]^{-1}F'(\theta_0)\,\tilde\lambda.$$
The Classical Tests 57
Again, the above quantity is not a statistic as it is a function of the unknown
parameters θ0. However, we can employ the restricted ML estimates of θ0 to find the
the unknown quantities, i.e. eF = F³eθ´ and eJ = J
³eθ´. Hence we can prove thefollowing:
Theorem 25 Under the usual regularity assumptions and the null hypothesis we have
LM =³eλ´/ eF h eJi−1 eF /
³eλ´ d→ χ2r.
Proof: Again we have that for any consistent estimator of θ0, as is the restricted
MLE eθ, we have thatLM =
³eλ´/ eF h eJi−1 eF /³eλ´ = Ã eλ√
n
!/
P (θ0)
à eλ√n
!+ op (1)
and by the asymptotic distribution of the Lagrange Multipliers we get the result. ¥
Now we have that the Restricted MLE satisfy the first order conditions of the
Lagrangian, i.e.
s³eθ´+ F /
³eθ´eλ = 0.Consequently the LM test can be expressed as:
LM =³s³eθ´´/ h eJi−1 s³eθ´ .
Now Rao has suggested to find the score vector and the information matrix of the
unrestricted model and evaluate them at the restricted MLE. Under this form the
LM statistic is called efficient score statistic as it measures the distance of the
score vector, evaluated at the restricted MLE, from zero.
0.8.1 The Linear Regression

Let us consider the classical linear regression model:
$$y = X\beta + u, \qquad u|X \sim N\left(0, \sigma^2 I_n\right),$$
where $y$ is the $n \times 1$ vector of endogenous variables, $X$ is the $n \times k$ matrix of weakly exogenous explanatory variables, $\beta$ is the $k \times 1$ vector of mean parameters and $u$ is the $n \times 1$ vector of errors. Let us call the vector of parameters $\theta$, i.e. $\theta' = (\beta', \sigma^2)$, a $(k+1) \times 1$ vector. The log-likelihood function is:
$$\ell(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2}\,\frac{(y - X\beta)'(y - X\beta)}{\sigma^2}.$$
The first-order conditions are:
$$\frac{\partial\ell(\theta)}{\partial\beta} = \frac{X'(y - X\beta)}{\sigma^2} = 0$$
and
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2}\,\frac{(y - X\beta)'(y - X\beta)}{\sigma^4} = 0.$$
Solving the equations we get:
$$\hat\beta = (X'X)^{-1}X'y, \qquad \hat\sigma^2 = \frac{\hat u'\hat u}{n}, \qquad \hat u = y - X\hat\beta.$$
Notice that the MLE of $\beta$ is the same as the OLS estimator, something which is not true for the MLE of $\sigma^2$ (OLS divides by $n - k$ rather than $n$).

The Hessian is
$$H(\theta) = \frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta'} = \begin{pmatrix} \frac{\partial^2\ell(\theta)}{\partial\beta\partial\beta'} & \frac{\partial^2\ell(\theta)}{\partial\beta\partial\sigma^2} \\ \frac{\partial^2\ell(\theta)}{\partial\sigma^2\partial\beta'} & \frac{\partial^2\ell(\theta)}{\partial(\sigma^2)^2} \end{pmatrix} = \begin{pmatrix} -\frac{1}{\sigma^2}X'X & -\frac{1}{\sigma^4}X'u \\ -\frac{1}{\sigma^4}u'X & \frac{n}{2\sigma^4} - \frac{u'u}{\sigma^6} \end{pmatrix}.$$
Hence the Information matrix is
$$J(\theta) = E[-H(\theta)] = \begin{pmatrix} \frac{1}{\sigma^2}X'X & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix},$$
and the Cramer-Rao limit is
$$J^{-1}(\theta) = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix}.$$
Notice that under normality of the errors the OLS estimator is asymptotically efficient.
Let us now consider $r$ linear constraints on the parameter vector $\beta$, i.e.
$$\varphi(\beta) = Q\beta - q = 0, \tag{4}$$
where $Q$ is the $r \times k$ matrix of the restrictions (with $r < k$) and $q$ a known vector. Let us now form the Lagrangian, i.e.
$$\mathcal{L} = \ell(\theta) + \lambda'\varphi(\beta) = \ell(\theta) + \varphi'(\beta)\lambda = \ell(\theta) + (Q\beta - q)'\lambda,$$
where $\lambda$ is the vector of the $r$ Lagrange multipliers. The first-order conditions are:
$$\frac{\partial\mathcal{L}}{\partial\beta} = \frac{\partial\ell(\theta)}{\partial\beta} + Q'\lambda = \frac{X'(y - X\beta)}{\sigma^2} + Q'\lambda = 0, \tag{5}$$
$$\frac{\partial\mathcal{L}}{\partial\sigma^2} = \frac{\partial\ell(\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2}\,\frac{(y - X\beta)'(y - X\beta)}{\sigma^4} = 0, \tag{6}$$
and
$$\frac{\partial\mathcal{L}}{\partial\lambda} = Q\beta - q = 0. \tag{7}$$
Now from (5) we have that
$$X'y = X'X\beta - \sigma^2 Q'\lambda, \tag{8}$$
and it follows that
$$Q(X'X)^{-1}X'y = Q(X'X)^{-1}X'X\beta - \sigma^2 Q(X'X)^{-1}Q'\lambda.$$
Hence
$$Q(X'X)^{-1}X'y = Q\beta - \sigma^2 Q(X'X)^{-1}Q'\lambda.$$
It follows that
$$Q\hat\beta = Q\beta - QVQ'\lambda,$$
where $\hat\beta = (X'X)^{-1}X'y$ and $V = \sigma^2(X'X)^{-1}$. Now from (7) we have that $Q\beta = q$. Hence we get
$$\lambda = -\left[QVQ'\right]^{-1}\left(Q\hat\beta - q\right). \tag{9}$$
Substituting out $\lambda$ from (8), employing the above, and solving for $\beta$ we get:
$$\tilde\beta = \hat\beta - (X'X)^{-1}Q'\left[Q(X'X)^{-1}Q'\right]^{-1}\left(Q\hat\beta - q\right).$$
Solving (6) we get that
$$\tilde\sigma^2 = \frac{\tilde u'\tilde u}{n}, \qquad \tilde u = y - X\tilde\beta,$$
and from (9) we get:
$$\tilde\lambda = -\left[Q\tilde V Q'\right]^{-1}\left(Q\hat\beta - q\right), \qquad \tilde V = \tilde\sigma^2(X'X)^{-1}.$$
The above three formulae give the restricted MLEs.
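The closed forms above can be checked numerically. The following Python sketch (added for illustration; the model, restriction and seed are arbitrary) computes $\tilde\beta$, $\tilde\sigma^2$ and $\tilde\lambda$ and verifies that the constraint and the Lagrangian first-order condition for $\beta$ hold:

```python
import numpy as np

# Sketch: restricted MLE in y = X beta + u under Q beta = q, using the
# closed forms derived above.  Checks: Q beta_tilde = q, and the FOC
# X'(y - X beta_tilde)/sigma2_tilde + Q' lambda_tilde = 0.
rng = np.random.default_rng(7)
n, k = 120, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

Q = np.array([[1.0, 1.0, 0.0]])   # illustrative restriction: beta_1 + beta_2 = 0.5
q = np.array([0.5])

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y
A = np.linalg.inv(Q @ XtX_inv @ Q.T)
b_til = b_hat - XtX_inv @ Q.T @ A @ (Q @ b_hat - q)
u_til = y - X @ b_til
s2_til = (u_til @ u_til) / n
lam_til = -np.linalg.inv(Q @ (s2_til * XtX_inv) @ Q.T) @ (Q @ b_hat - q)

foc = X.T @ u_til / s2_til + Q.T @ lam_til
print(Q @ b_til, foc)  # constraint value q and a numerically zero vector
```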
Now the Wald test for the linear restrictions in (4) is given by
$$W = \left(Q\hat\beta - q\right)'\left[Q\hat V Q'\right]^{-1}\left(Q\hat\beta - q\right), \qquad \hat V = \hat\sigma^2(X'X)^{-1}.$$
The restricted and unrestricted residuals are given by
$$\tilde u = y - X\tilde\beta \quad\text{and}\quad \hat u = y - X\hat\beta.$$
Hence
$$\tilde u = \hat u + X\left(\hat\beta - \tilde\beta\right),$$
and consequently, as $X'\hat u = 0$ by the OLS first-order conditions, we have that
$$\tilde u'\tilde u = \hat u'\hat u + \left(\hat\beta - \tilde\beta\right)'X'X\left(\hat\beta - \tilde\beta\right).$$
It follows, substituting the expression for $\tilde\beta$, that
$$\tilde u'\tilde u - \hat u'\hat u = \left(Q\hat\beta - q\right)'\left[Q(X'X)^{-1}Q'\right]^{-1}\left(Q\hat\beta - q\right).$$
Hence the Wald test is given by
$$W = n\,\frac{\tilde u'\tilde u - \hat u'\hat u}{\hat u'\hat u}.$$
The LR test is given by
$$LR = 2\left[\ell(\hat\theta) - \ell(\tilde\theta)\right] = n\ln\left(\frac{\tilde u'\tilde u}{\hat u'\hat u}\right),$$
and the LM test is
$$LM = n\,\frac{\tilde u'\tilde u - \hat u'\hat u}{\tilde u'\tilde u},$$
as
$$LM = \tilde\lambda'\,\tilde F\,\tilde J^{-1}\tilde F'\,\tilde\lambda = \left(Q\hat\beta - q\right)'\left[Q\tilde V Q'\right]^{-1}\left(Q\hat\beta - q\right).$$
We can now state a well-known result.

Theorem 26. Under the classical assumptions of the Linear Regression Model we have that
$$W \ge LR \ge LM.$$
Proof: The three tests can be written as
$$W = n(r - 1), \qquad LR = n\ln(r), \qquad LM = n\left(1 - \frac{1}{r}\right),$$
where $r = \tilde u'\tilde u/\hat u'\hat u \ge 1$. Now we know that $\ln(x) \ge \frac{x-1}{x}$ for $x > 0$, and the result follows by considering $x = r$ and $x = 1/r$. $\blacksquare$
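The ordering in Theorem 26 can be seen numerically. The following Python sketch (illustrative data, restriction and seed, not from the notes) computes the three statistics via the residual-based formulas:

```python
import numpy as np

# Sketch: W, LR and LM for H0: Q beta = q in y = X beta + u, using
# W = n(r-1), LR = n ln(r), LM = n(1 - 1/r) with r = (restricted RSS /
# unrestricted RSS).  Theorem 26 says W >= LR >= LM.
rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

Q = np.array([[0.0, 0.0, 1.0]])   # restriction: third coefficient is zero
q = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y
u_hat = y - X @ b_hat
# Restricted estimator from the Lagrangian solution derived above.
b_til = b_hat - XtX_inv @ Q.T @ np.linalg.solve(Q @ XtX_inv @ Q.T, Q @ b_hat - q)
u_til = y - X @ b_til

r = (u_til @ u_til) / (u_hat @ u_hat)
W, LR, LM = n * (r - 1), n * np.log(r), n * (1 - 1 / r)
print(W, LR, LM)
```

With $r \ge 1$ the ordering is an algebraic identity, so it holds for every sample, not just asymptotically.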
0.8.2 Autocorrelation

Apply the LM test to test the hypothesis that $\rho = 0$ in the following model:
$$y_t = x_t'\beta + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \overset{i.i.d.}{\sim} N(0, \sigma^2).$$
Discuss the advantages of this LM test over the Wald and LR tests of this hypothesis.

First notice that from $u_t = \rho u_{t-1} + \varepsilon_t$ we get that
$$E(u_t) = \rho E(u_{t-1}) + E(\varepsilon_t) = \rho E(u_{t-1}),$$
as $E(\varepsilon_t) = 0$, and for $|\rho| < 1$ we get that
$$E(u_t) - \rho E(u_{t-1}) = 0 \Rightarrow E(u_t) = 0,$$
as $E(u_t) = E(u_{t-1})$, independent of $t$. Furthermore,
$$Var(u_t) = E(u_t^2) = \rho^2 E(u_{t-1}^2) + E(\varepsilon_t^2) + 2\rho E(u_{t-1}\varepsilon_t) = \rho^2 E(u_{t-1}^2) + \sigma^2,$$
where the first equality follows from the fact that $E(u_t) = 0$, and the last from the fact that
$$E(u_{t-1}\varepsilon_t) = E[u_{t-1}E(\varepsilon_t|I_{t-1})] = E[u_{t-1}\cdot 0] = 0,$$
where $I_{t-1}$ is the information set at time $t-1$, i.e. the sigma-field generated by $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots$. Hence
$$E(u_t^2) - \rho^2 E(u_{t-1}^2) = \sigma^2 \Rightarrow E(u_t^2) = \frac{\sigma^2}{1-\rho^2},$$
as $E(u_t^2) = E(u_{t-1}^2)$, independent of $t$.

Substituting out $u_t$ we get
$$y_t = x_t'\beta + \rho u_{t-1} + \varepsilon_t,$$
and observing that $u_{t-1} = y_{t-1} - x_{t-1}'\beta$ we get
$$y_t = x_t'\beta + \rho\left(y_{t-1} - x_{t-1}'\beta\right) + \varepsilon_t \Rightarrow \varepsilon_t = y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta,$$
where by assumption the $\varepsilon_t$'s are i.i.d. Hence the log-likelihood function is
$$\ell(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \sum_{t=1}^T \frac{\left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)^2}{2\sigma^2},$$
where we set $y_0 = 0$ and $x_0 = 0$, as we do not have any observations for $t = 0$. In any case, given that $|\rho| < 1$, the first observation will not affect the distribution of the LM test, as it is based on asymptotic theory, i.e. $T \to \infty$. The first-order conditions are:
$$\frac{\partial\ell}{\partial\beta} = \sum_{t=1}^T \frac{\left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)(x_t - \rho x_{t-1})}{\sigma^2} = \sum_{t=1}^T \frac{\varepsilon_t(x_t - \rho x_{t-1})}{\sigma^2},$$
$$\frac{\partial\ell}{\partial\rho} = \sum_{t=1}^T \frac{\left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)\left(y_{t-1} - x_{t-1}'\beta\right)}{\sigma^2} = \sum_{t=1}^T \frac{\varepsilon_t u_{t-1}}{\sigma^2},$$
$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \sum_{t=1}^T \frac{\left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)^2}{2\sigma^4} = -\frac{T}{2\sigma^2} + \sum_{t=1}^T \frac{\varepsilon_t^2}{2\sigma^4}.$$
The second derivatives are:
$$\frac{\partial^2\ell}{\partial\beta\partial\beta'} = -\sum_{t=1}^T \frac{(x_t - \rho x_{t-1})(x_t - \rho x_{t-1})'}{\sigma^2},$$
$$\frac{\partial^2\ell}{\partial\rho^2} = -\sum_{t=1}^T \frac{\left(y_{t-1} - x_{t-1}'\beta\right)^2}{\sigma^2} = -\sum_{t=1}^T \frac{u_{t-1}^2}{\sigma^2},$$
$$\frac{\partial^2\ell}{\partial(\sigma^2)^2} = \frac{T}{2\sigma^4} - \sum_{t=1}^T \frac{\left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)^2}{\sigma^6} = \frac{T}{2\sigma^4} - \sum_{t=1}^T \frac{\varepsilon_t^2}{\sigma^6},$$
$$\frac{\partial^2\ell}{\partial\beta\partial\rho} = -\sum_{t=1}^T \frac{\left(y_{t-1} - x_{t-1}'\beta\right)(x_t - \rho x_{t-1}) + \left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)x_{t-1}}{\sigma^2} = -\sum_{t=1}^T \frac{u_{t-1}(x_t - \rho x_{t-1}) + \varepsilon_t x_{t-1}}{\sigma^2},$$
$$\frac{\partial^2\ell}{\partial\rho\partial\sigma^2} = -\sum_{t=1}^T \frac{\left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)\left(y_{t-1} - x_{t-1}'\beta\right)}{\sigma^4} = -\sum_{t=1}^T \frac{\varepsilon_t u_{t-1}}{\sigma^4},$$
$$\frac{\partial^2\ell}{\partial\beta\partial\sigma^2} = -\sum_{t=1}^T \frac{\left(y_t - x_t'\beta - \rho y_{t-1} + \rho x_{t-1}'\beta\right)(x_t - \rho x_{t-1})}{\sigma^4} = -\sum_{t=1}^T \frac{\varepsilon_t(x_t - \rho x_{t-1})}{\sigma^4}.$$
Notice now that the Information Matrix $J$ is
$$J(\theta) = -E[H(\theta)] = \begin{bmatrix} \sum_{t=1}^T \frac{(x_t - \rho x_{t-1})(x_t - \rho x_{t-1})'}{\sigma^2} & 0 & 0 \\ 0 & \frac{T}{1-\rho^2} & 0 \\ 0 & 0 & \frac{T}{2\sigma^4} \end{bmatrix},$$
as
$$E\left[\frac{u_{t-1}(x_t - \rho x_{t-1}) + \varepsilon_t x_{t-1}}{\sigma^2}\right] = 0, \quad E\left[\frac{\varepsilon_t(x_t - \rho x_{t-1})}{\sigma^4}\right] = 0, \quad E\left[\sum_{t=1}^T \frac{\varepsilon_t u_{t-1}}{\sigma^4}\right] = 0,$$
$$E\left[\frac{\varepsilon_t^2}{\sigma^6}\right] = \frac{1}{\sigma^4}, \quad E\left[\frac{u_{t-1}^2}{\sigma^2}\right] = \frac{E(u_{t-1}^2)}{\sigma^2} = \frac{1}{1-\rho^2},$$
i.e. the matrix is block diagonal between $\beta$, $\rho$, and $\sigma^2$.
Consequently the LM test has the form
$$LM = s_\rho'\, J_{\rho\rho}^{-1}\, s_\rho = \frac{(s_\rho)^2}{J_{\rho\rho}},$$
as
$$s_\rho = \sum_{t=1}^T \frac{\varepsilon_t u_{t-1}}{\sigma^2}, \qquad J_{\rho\rho} = \frac{T}{1-\rho^2},$$
with all these quantities evaluated under the null. Hence under $H_0: \rho = 0$ we have that
$$J_{\rho\rho} = T \quad\text{and}\quad u_t = \varepsilon_t,$$
i.e. there is no autocorrelation. Consequently, we can estimate $\beta$ by simple OLS, as OLS and ML result in the same estimator, and $\sigma^2$ by the ML estimator, i.e.
$$\tilde\beta = \left(\sum_{t=1}^T x_t x_t'\right)^{-1}\sum_{t=1}^T x_t y_t, \qquad \tilde\sigma^2 = \frac{\tilde u'\tilde u}{T} = \frac{\sum_{t=1}^T \tilde u_t^2}{T},$$
where $\tilde u_t = y_t - x_t'\tilde\beta = \tilde\varepsilon_t$ are the OLS residuals. Hence
$$LM = \frac{\left(\sum_{t=1}^T \frac{\tilde u_t\tilde u_{t-1}}{\tilde\sigma^2}\right)^2}{T} = T\left(\sum_{t=1}^T \tilde u_t\tilde u_{t-1}\right)^2\left(\sum_{t=1}^T \tilde u_t^2\right)^{-2}.$$
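The final formula needs nothing beyond OLS residuals. The following Python sketch (illustrative data and seed, not from the notes) computes it and confirms that it equals $T\hat\rho^2$, where $\hat\rho$ is the first-order sample autocorrelation of the residuals:

```python
import numpy as np

# Sketch: the LM statistic for H0: rho = 0 computed from OLS residuals as
# LM = T * (sum u_t u_{t-1})^2 / (sum u_t^2)^2; under H0 it is
# asymptotically chi-squared with 1 degree of freedom.
rng = np.random.default_rng(4)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)   # H0 true: no AR(1) errors

u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]    # OLS residuals
LM = T * (u[1:] @ u[:-1]) ** 2 / (u @ u) ** 2

# LM is exactly T * rho_hat**2 with rho_hat the lag-1 autocorrelation.
rho_hat = (u[1:] @ u[:-1]) / (u @ u)
print(LM, T * rho_hat**2)
```

This is the main practical advantage of the LM form: only the restricted (OLS) model needs to be estimated, whereas W and LR require the unrestricted autoregressive model as well.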
0.9 Time Series

0.9.1 Orthogonal Projections

Assume the usual linear regression setup, i.e.
$$y = X\beta + u, \qquad u|X \sim D\left(0, \sigma^2 I_n\right),$$
where $y$ is the $n \times 1$ vector of endogenous variables, $X$ is the $n \times k$ matrix of weakly exogenous explanatory variables, $\beta$ is the $k \times 1$ vector of mean parameters and $u$ is the $n \times 1$ vector of errors.

When we estimate a linear regression model, we simply map the regressand $y$ into a vector of fitted values $X\hat\beta$ and a vector of residuals $\hat u = y - X\hat\beta$. Geometrically, these mappings are examples of orthogonal projections. A projection is a mapping that takes each point of $E^n$ into a point in a subset of $E^n$, while leaving all the points of that subset unchanged; here $E^n$ is the usual Euclidean vector space, i.e. the set of all vectors in $\mathbb{R}^n$ where the addition, the scalar multiplication and the inner product (hence the norm) are defined. Because of this invariance the subspace is called the invariant subspace of the projection. An orthogonal projection maps any point into the point of the subspace that is closest to it. If a point is already in the invariant subspace, it is mapped into itself.

Algebraically, an orthogonal projection on to a given subspace can be performed by premultiplying the vector to be projected by a suitable projection matrix. In the case of OLS, the two projection matrices that yield the vector of fitted values and the vector of residuals, respectively, are
$$P_X = X(X'X)^{-1}X'$$
and
$$M_X = I_n - P_X = I_n - X(X'X)^{-1}X'.$$
To see this, notice that the fitted values are
$$\hat y = X\hat\beta = X(X'X)^{-1}X'y = P_X y.$$
Hence the projection matrix $P_X$ projects on to $S(X)$, i.e. the subspace of $E^n$ spanned by the columns of $X$. Notice that for any vector $\alpha \in \mathbb{R}^k$ the vector $X\alpha$ belongs to $S(X)$. As now $X\alpha \in S(X)$, it should be the case, due to the invariance of $P_X$, that
$$P_X X\alpha = X\alpha.$$
But notice that
$$P_X X = X(X'X)^{-1}X'X = XI_k = X.$$
It is clear that when $P_X$ is applied to $y$ it yields the vector of fitted values.

On the other hand, the projection matrix $M_X$ yields the vector of residuals, as
$$M_X y = \left[I_n - X(X'X)^{-1}X'\right]y = y - P_X y = y - X\hat\beta = \hat u.$$
The image of $M_X$ is $S^\perp(X)$, the orthogonal complement of the image of $P_X$. To see this, consider any vector $w \in S^\perp(X)$. It must satisfy the condition $X'w = 0$, which implies that $P_X w = 0$, by the definition of $P_X$. Consequently $(I_n - P_X)w = M_X w = w$, and $S^\perp(X)$ must be contained in the image of $M_X$, i.e. $S^\perp(X) \subseteq \operatorname{Im}(M_X)$. Now consider any element of the image of $M_X$. It must take the form $M_X z$. But then
$$(M_X z)'X = z'M_X X = 0,$$
as $M_X$ is symmetric and $M_X X = 0$. Hence $M_X z$ belongs to $S^\perp(X)$, for any $z$. Consequently $\operatorname{Im}(M_X) \subseteq S^\perp(X)$, and hence the image of $M_X$ coincides with $S^\perp(X)$.

For any matrix to represent a projection, it must be idempotent. This is because, if we project a vector on to $S(X)$ and then project it again, the second projection should have no effect, i.e. $P_X P_X z = P_X z$ for any $z$. It is easy to prove that this is the case with $P_X$ and $M_X$, as
$$P_X P_X = P_X \quad\text{and}\quad M_X M_X = M_X.$$
By the definition of $M_X$ it is obvious that
$$M_X = I_n - X(X'X)^{-1}X' = I_n - P_X \Rightarrow M_X + P_X = I_n,$$
and consequently for any vector $z \in E^n$ we have
$$M_X z + P_X z = z.$$
The pair of projections $M_X$ and $P_X$ are called complementary projections, since the sum of $M_X z$ and $P_X z$ restores the original vector $z$.
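All of the properties above can be verified numerically. The following Python sketch (illustrative dimensions and seed, not from the notes) builds $P_X$ and $M_X$ for a random design matrix and checks idempotency, complementarity, invariance of $S(X)$, and the orthogonality of fitted values and residuals:

```python
import numpy as np

# Sketch: verify that P_X and M_X behave as the orthogonal projections
# described above.
rng = np.random.default_rng(5)
n, k = 50, 4
X = rng.normal(size=(n, k))

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

y = rng.normal(size=n)
fitted, resid = P @ y, M @ y
print(fitted @ resid)  # numerically zero: fitted values and residuals are orthogonal
```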
Assume that we have the following linear regression model:
$$y = X\beta + \varepsilon,$$
where $y$ and $\varepsilon$ are $N \times 1$, $\beta$ is $k \times 1$, and $X$ is $N \times k$. For $k = 2$, with the first variable a constant, we have that:
$$y_i = \beta_0 + x_i\beta_1 + \varepsilon_i \quad\text{for } i = 1, 2, \ldots, N.$$
Now
$$\hat\beta = \begin{pmatrix}\hat\beta_0 \\ \hat\beta_1\end{pmatrix} = (X'X)^{-1}X'y = \begin{pmatrix} N & \sum x \\ \sum x & \sum x^2 \end{pmatrix}^{-1}\begin{pmatrix}\sum y \\ \sum xy\end{pmatrix} = \frac{1}{N\sum x^2 - \left(\sum x\right)^2}\begin{pmatrix}\sum x^2 & -\sum x \\ -\sum x & N\end{pmatrix}\begin{pmatrix}\sum y \\ \sum xy\end{pmatrix} = \begin{pmatrix}\dfrac{\sum y\sum x^2 - \sum x\sum xy}{N\sum x^2 - (\sum x)^2} \\[6pt] \dfrac{N\sum xy - \sum x\sum y}{N\sum x^2 - (\sum x)^2}\end{pmatrix}.$$
Notice however that
$$N\sum x^2 - \left(\sum x\right)^2 = N\left[\sum x^2 - N\left(\frac{\sum x}{N}\right)^2\right] = N\left[\sum x^2 - N\bar x^2\right] = N\left[\sum(x - \bar x)^2 + 2\bar x\sum(x - \bar x)\right] = N\sum(x - \bar x)^2,$$
since $\sum(x - \bar x) = 0$, and
$$N\sum xy - \sum x\sum y = N\sum(x - \bar x)(y - \bar y).$$
Hence
$$\begin{pmatrix}\hat\beta_0 \\ \hat\beta_1\end{pmatrix} = \begin{pmatrix}\bar y - \hat\beta_1\bar x \\[4pt] \dfrac{\sum(x - \bar x)(y - \bar y)}{\sum(x - \bar x)^2}\end{pmatrix},$$
i.e. the slope is $\hat\beta_1 = \sum(x - \bar x)(y - \bar y)/\sum(x - \bar x)^2$ and the intercept is $\hat\beta_0 = \bar y - \hat\beta_1\bar x$.