
LECTURE NOTES ON STATISTICAL INFERENCE

KRZYSZTOF PODGÓRSKI

    Department of Mathematics and Statistics

    University of Limerick, Ireland

    November 23, 2009


    Contents

    1 Introduction 4

    1.1 Models of Randomness and Statistical Inference . . . . . . . . . . . . 4

    1.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    1.2.1 Probability vs. likelihood . . . . . . . . . . . . . . . . . . . . 8

    1.2.2 More data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    1.3 Likelihood and theory of statistics . . . . . . . . . . . . . . . . . . . 15

    1.4 Computationally intensive methods of statistics . . . . . . . . . . . . 15

1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples . . . . . . 16

1.4.2 Bootstrap: performing statistical inference using computers . . . 18

    2 Review of Probability 21

    2.1 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . 21

    2.2 Distribution of a Function of a Random Variable . . . . . . . . . . . . 22

2.3 Transforms Method: Characteristic, Probability Generating and Moment Generating Functions . . . . . . 24

    2.4 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.4.1 Sums of Independent Random Variables . . . . . . . . . . . . 26

    2.4.2 Covariance and Correlation . . . . . . . . . . . . . . . . . . 27

    2.4.3 The Bivariate Change of Variables Formula . . . . . . . . . . 28

    2.5 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . 29

    2.5.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . 29


    2.5.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . 29

2.5.3 Negative Binomial and Geometric Distribution . . . . . . . . . . 30

2.5.4 Hypergeometric Distribution . . . . . . . . . . . . . . . . . . 31

    2.5.5 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . 32

    2.5.6 Discrete Uniform Distribution . . . . . . . . . . . . . . . . . 33

    2.5.7 The Multinomial Distribution . . . . . . . . . . . . . . . . . 33

    2.6 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . 34

    2.6.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . 34

    2.6.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . 35

    2.6.3 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . 35

    2.6.4 Gaussian (Normal) Distribution . . . . . . . . . . . . . . . . 36

    2.6.5 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . 38

    2.6.6 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . 38

    2.6.7 Chi-square Distribution . . . . . . . . . . . . . . . . . . . . . 39

    2.6.8 The Bivariate Normal Distribution . . . . . . . . . . . . . . . 39

    2.6.9 The Multivariate Normal Distribution . . . . . . . . . . . . . 40

2.7 Distributions: further properties . . . . . . . . . . . . . . . . . . . 42

2.7.1 Sum of Independent Random Variables: special cases . . . . 42

2.7.2 Common Distributions: Summarizing Tables . . . . . . . . 45

    3 Likelihood 48

    3.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 48

    3.2 Multi-parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . 55

    3.3 The Invariance Principle . . . . . . . . . . . . . . . . . . . . . . . . 59

    4 Estimation 61

    4.1 General properties of estimators . . . . . . . . . . . . . . . . . . . . 61

    4.2 Minimum-Variance Unbiased Estimation . . . . . . . . . . . . . . . . 64

    4.3 Optimality Properties of the MLE . . . . . . . . . . . . . . . . . . . 69

    5 The Theory of Confidence Intervals 71

    5.1 Exact Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 71


    5.2 Pivotal Quantities for Use with Normal Data . . . . . . . . . . . . . . 75

    5.3 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . 80

    6 The Theory of Hypothesis Testing 87

    6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

    6.2 Hypothesis Testing for Normal Data . . . . . . . . . . . . . . . . . . 92

    6.3 Generally Applicable Test Procedures . . . . . . . . . . . . . . . . . 97

    6.4 The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . 101

    6.5 Goodness of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.6 The χ² Test for Contingency Tables . . . . . . . . . . . . . . . . . 109


    Chapter 1

    Introduction

    Everything existing in the universe is the fruit of chance.

    Democritus, the 5th Century BC

    1.1 Models of Randomness and Statistical Inference

Statistics is a discipline that provides a methodology for making inferences from real random data about the parameters of the probabilistic models that are believed to generate such data. The position of statistics in relation to real-world data and the corresponding mathematical models of probability theory is presented in the following diagram.

The following is a short list of the many phenomena to which randomness is attributed.

    Games of chance

    Tossing a coin

    Rolling a die

    Playing Poker

    Natural Sciences


[Diagram, Figure 1.1: the Real World (Random Phenomena, Data Samples, Prediction and Discovery) is linked by Statistics to Science & Mathematics (Probability Theory, Models, Statistical Inference).]

Figure 1.1: Position of statistics in the context of real world phenomena and mathematical models representing them.


Physics (notably Quantum Physics)

    Genetics

    Climate

    Engineering

    Risk and safety analysis

    Ocean engineering

    Economics and Social Sciences

    Currency exchange rates

Stock market fluctuations

    Insurance claims

    Polls and election results

    etc.

    1.2 Motivating Example

    Let X denote the number of particles that will be emitted from a radioactive source

    in the next one minute period. We know that X will turn out to be equal to one of

    the non-negative integers but, apart from that, we know nothing about which of the

    possible values are more or less likely to occur. The quantity X is said to be a random

    variable.

Suppose we are told that the random variable X has a Poisson distribution with parameter θ = 2. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x) = θ^x exp(−θ)/x!,

where θ = 2. So, for instance, the probability that X takes the value x = 4 is

P(X = 4) = 2^4 exp(−2)/4! = 0.0902.
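For illustration, this probability can be checked numerically; a minimal R sketch using the built-in Poisson probability mass function (nothing beyond the model just described is assumed):

    dpois(4, lambda = 2)   # theta^x * exp(-theta) / x! with theta = 2, x = 4
    # [1] 0.09022352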


We have here a probability model for the random variable X. Note that we are using upper case letters for random variables and lower case letters for the values taken by random variables. We shall persist with this convention throughout the course.

Let us now assume that the random variable X has a Poisson distribution with parameter θ, where θ is some unspecified positive number. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x | θ) = θ^x exp(−θ)/x!,    (1.1)

for θ ∈ R+. However, we cannot calculate probabilities such as the probability that X takes the value x = 4 without knowing the value of θ.

Suppose that, in order to learn something about the value of θ, we decide to measure the value of X for each of the next 5 one-minute time periods. Let us use the notation X1 to denote the number of particles emitted in the first period, X2 to denote the number emitted in the second period, and so forth. We shall end up with data consisting of a random vector X = (X1, X2, . . . , X5). Consider x = (x1, x2, x3, x4, x5) = (2, 1, 0, 3, 4). Then x is a possible value for the random vector X. We know that the probability that X1 takes the value x1 = 2 is given by the formula

P(X1 = 2 | θ) = θ^2 exp(−θ)/2!

and similarly that the probability that X2 takes the value x2 = 1 is given by

P(X2 = 1 | θ) = θ exp(−θ)/1!

and so on. However, what about the probability that X takes the value x? In order for this probability to be specified we need to know something about the joint distribution of the random variables X1, X2, . . . , X5. A simple assumption to make is that the random variables X1, X2, . . . , X5 are mutually independent. (Note that this assumption may not be correct, since X2 may tend to be more similar to X1 than it would be to X5.) However, with this assumption we can say that the probability that X takes the value x is given by

P(X = x | θ) = ∏_{i=1}^{5} θ^{x_i} exp(−θ)/x_i!
             = (θ^2 exp(−θ)/2!) (θ^1 exp(−θ)/1!) (θ^0 exp(−θ)/0!) (θ^3 exp(−θ)/3!) (θ^4 exp(−θ)/4!)
             = θ^{10} exp(−5θ)/288.

In general, if x = (x1, x2, x3, x4, x5) is any vector of 5 non-negative integers, then the probability that X takes the value x is given by

P(X = x | θ) = ∏_{i=1}^{5} θ^{x_i} exp(−θ)/x_i!
             = θ^{Σ_{i=1}^{5} x_i} exp(−5θ) / ∏_{i=1}^{5} x_i!.

    We have here a probability model for the random vector X.
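As a small numerical illustration, the joint probability above can be evaluated in R for an assumed parameter value, say θ = 2 (the value 2 is only an illustration, not part of the model):

    x <- c(2, 1, 0, 3, 4)            # the observed counts in the five periods
    theta <- 2                        # an assumed value of the parameter
    prod(dpois(x, lambda = theta))    # theta^10 * exp(-5 * theta) / 288
    # [1] 0.000161422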

Our plan is to use the value x of X that we actually observe to learn something about the value of θ. The ways and means to accomplish this task make up the subject

    matter of this course. The central tool for various statistical inference techniques is

    the likelihood method. Below we present a simple introduction to it using the Poisson

    model for radioactive decay.

    1.2.1 Probability vs. likelihood

In the Poisson model introduced above, for a given θ, say θ = 2, we can consider the function p(x) giving the probabilities of observing the values x = 0, 1, 2, . . . . This function is referred to as the probability mass function. Its graph is presented below.

Such a function can be used when betting on the recorded result of a future experiment. If one wants to optimize the chances of correctly predicting the future outcome, the best choice for the number of recorded particles would be either 1 or 2.

So far, we have been told that the random variable X has a Poisson distribution with parameter θ, where θ is some positive number, and there are physical reasons to assume that such a model is correct. However, we have arbitrarily set θ = 2, and this is more questionable. How can we know that it is a correct value of the parameter? Let us analyze this issue in detail.

Figure 1.2: Probability mass function for the Poisson model with θ = 2.

If x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x | θ) = θ^x e^{−θ}/x!,

for θ > 0. But without knowing the true value of θ, we cannot calculate probabilities such as the probability that X takes the value x = 1.

Suppose that, in order to learn something about the value of θ, an experiment is performed and a value of X = 5 is recorded. Let us take a look at the probability mass function for θ = 2 in Figure 1.2. What is the probability of X taking the value 5? Do we like what we see? Why? Would you bet on 1 or 2 in the next experiment?

We certainly have some serious doubt about our choice of θ = 2, which was arbitrary anyway. One can consider, for example, θ = 7 as an alternative to θ = 2. Figure 1.3 shows the graphs of the pmf for the two cases. Which of the two choices do we like? Since it was more probable to get X = 5 under the assumption θ = 7 than when θ = 2, we say that θ = 7 is more likely to produce X = 5 than θ = 2. Based on this observation we can develop a general strategy for choosing θ.

Figure 1.3: The probability mass function for the Poisson model with θ = 2 vs. the one with θ = 7.

Let us summarize our position. So far we know (or assume) that the radioactive emission follows a Poisson model with some unknown θ > 0, and the value x = 5 has been observed once. Our goal is to utilize this knowledge somehow. First, we note that the Poisson model is in fact a function not only of x but also of θ:

p(x | θ) = θ^x e^{−θ}/x!.

Let us plug in the observed x = 5, so that we get a function of θ that is called the likelihood function:

l(θ) = θ^5 e^{−θ}/120.

Its graph is presented in the next figure. Can you locate on this graph the values of the probabilities that were used to choose θ = 7 over θ = 2? What value of θ appears to be the most preferable if the same argument is extended to all possible values of θ? We observe that the value θ = 5 is most likely to produce the value x = 5. As a result of our likelihood approach we have used the data x = 5 and the Poisson model to make an inference: an example of statistical inference.
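The likelihood curve of Figure 1.4 and its maximizer can be reproduced with a few lines of R (the grid of θ values below is an arbitrary choice):

    theta <- seq(0.01, 15, by = 0.01)    # grid of candidate parameter values
    lik   <- dpois(5, lambda = theta)    # l(theta) = theta^5 * exp(-theta) / 120
    plot(theta, lik, type = "l", xlab = "theta", ylab = "Likelihood")
    theta[which.max(lik)]                # the most likely value
    # [1] 5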


    Figure 1.4: Likelihood function for the Poisson model when the observed value is

    x = 5.

Likelihood: the Poisson model backward

The Poisson model can be stated as a probability mass function that maps possible values x into probabilities p(x) or, if we emphasize the dependence on θ, into p(x | θ), as given below:

p(x | θ) = l(θ | x) = θ^x e^{−θ}/x!.

With the Poisson model for a given θ, one can compute the probabilities with which various possible numbers x of emitted particles will be recorded, i.e. we consider the mapping

x ↦ p(x | θ)

with θ fixed. We get the answer to how probable the various outcomes x are.

With the Poisson model where x is observed and thus fixed, one can evaluate how likely it would be to get x under various values of θ, i.e. we consider the mapping

θ ↦ l(θ | x)

with x fixed. We get the answer to how likely the various θ could have produced the observed x.

Exercise 1. For the general Poisson model

p(x | θ) = l(θ | x) = θ^x e^{−θ}/x!,

1. for a given θ > 0, find the most probable value of the observation x;
2. for a given observation x, find the most likely value of θ.

Give a mathematical argument for your claims.

    1.2.2 More data

    Suppose that we perform another measurement of the number of emitted particles. Let

    us use the notation X1 to denote the number of particles emitted in the first period, X2

    to denote the number emitted in the second period. We shall end up with data consisting

    of a random vector X = (X1, X2). The second measurement yielded x2 = 2, so that

    x = (x1, x2) = (5, 2).

We know that the probability that X1 takes the value x1 = 5 is given by the formula

P(X1 = 5 | θ) = θ^5 e^{−θ}/5!

and similarly that the probability that X2 takes the value x2 = 2 is given by

P(X2 = 2 | θ) = θ^2 e^{−θ}/2!.

    However, what about the probability that X takes the value x = (5, 2)? In order for

    this probability to be specified we need to know something about the joint distribution

    of the random variables X1, X2. A simple assumption to make is that the random

    variables X1, X2 are mutually independent. In such a case the probability that X takes

the value x = (x1, x2) is given by

P(X = (x1, x2) | θ) = (θ^{x1} e^{−θ}/x1!) (θ^{x2} e^{−θ}/x2!) = e^{−2θ} θ^{x1+x2} / (x1! x2!).

After a little algebra we easily find the likelihood function of observing X = (5, 2) to be

l(θ | (5, 2)) = e^{−2θ} θ^7 / 240,

and its graph is presented in Figure 1.5 in comparison with the previous likelihood for a single observation.

Figure 1.5: Likelihood of observing (5, 2) (top) vs. the one of observing 5 (bottom).

Two important effects of adding the extra information should be noted:

We observe that the location of the maximum shifted from 5 to 3.5 compared to the single observation.
We also note that the range of likely values for θ has diminished.

Let us suppose that eventually we decide to measure three more values of X. Let us use the vector notation X = (X1, X2, . . . , X5) to denote the observable random vector. Assume that the three extra measurements yielded 3, 7, 7, so that we have x = (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7). Under the assumption of independence the probability that X takes the value x is given by

P(X = x | θ) = ∏_{i=1}^{5} θ^{x_i} e^{−θ} / x_i!.

The likelihood function of observing X = (5, 2, 3, 7, 7) under independence can easily be derived to be

l(θ | (5, 2, 3, 7, 7)) = θ^{24} e^{−5θ} / (5! 2! 3! 7! 7!).

In general, if x = (x1, . . . , xn) is any vector of n non-negative integers, then the likelihood is given by

l(θ | (x1, . . . , xn)) = θ^{Σ_{i=1}^{n} x_i} e^{−nθ} / ∏_{i=1}^{n} x_i!.

The value of θ that maximizes this likelihood is called the maximum likelihood estimator of θ.

In order to find the value that maximizes the likelihood, the methods of calculus can be applied. We note that in our example we deal with only one variable, so the computation of the derivative is rather straightforward.
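For orientation only (the full derivation is the subject of Exercise 2 below), taking logarithms turns the product into a sum, which is what makes the calculus straightforward:

\[
\log l(\theta \mid x_1,\ldots,x_n) \;=\; \Big(\sum_{i=1}^{n} x_i\Big)\log\theta \;-\; n\theta \;-\; \sum_{i=1}^{n}\log x_i! .
\]

Differentiating this expression with respect to θ and setting the derivative equal to zero yields the estimator asked for in the exercise.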

Exercise 2. For the general case of the likelihood based on the Poisson model,

l(θ | x1, . . . , xn) = θ^{Σ_{i=1}^{n} x_i} e^{−nθ} / ∏_{i=1}^{n} x_i!,

use the methods of calculus to derive a general formula for the maximum likelihood estimator of θ. Using the result, find the estimate for (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7).

Exercise 3. It is generally believed that the time X that passes until half of the original radioactive material remains follows an exponential distribution with density f(x | θ) = θ e^{−θx}, x > 0. For beryllium-11, five experiments have been performed and the values 13.21, 13.12, 13.95, 13.54, 13.88 seconds have been obtained. Find and plot the likelihood function for θ and, based on this, determine the most likely θ.

    1.3 Likelihood and theory of statistics

The strategy of making statistical inference based on the likelihood function as described above is a recurrent theme in mathematical statistics and thus in our lectures. Using mathematical arguments we will compare various strategies for inferring about the parameters, and often we will demonstrate that the likelihood-based methods are optimal. The likelihood will also show its strength as a criterion for deciding between various claims about parameters of the model, which is the leading story of hypothesis testing.

In modern times, the role of computers in statistical methodology has increased. New computationally intensive methods of data exploration have become one of the central areas of modern statistics. Even there, methods that refer to the likelihood play dominant roles, in particular in Bayesian methodology.

Despite this extensive penetration of statistical methodology by likelihood techniques, statistics can by no means be reduced to the analysis of likelihood. In every area of statistics there are important aspects that require reaching beyond the likelihood; in many cases the likelihood is not even a focus of study and development. The purpose of this course is to present the importance of the likelihood approach across statistics, but also to present topics for which the likelihood plays a secondary role, if any.

    1.4 Computationally intensive methods of statistics

The second part of our presentation of modern statistical inference is devoted to computationally intensive statistical methods. The area of data exploration is rapidly growing in importance due to

common access to inexpensive but advanced computing tools,
the emergence of new challenges associated with massive, highly dimensional data far exceeding the traditional assumptions on which classical methods of statistics have been based.

    In this introduction we give two examples that illustrate the power of modern computers

    and computing software both in analysis of statistical models and in performing actual


statistical inference. We start by analyzing the performance of a statistical procedure using random sample generation.

1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples

Randomness can be used to study properties of a mathematical model. The model itself may be probabilistic or not, but here we focus on probabilistic ones. Essentially, the approach is based on repeated simulation of random samples corresponding to the model and observation of the behavior of the objects of interest. An example of the Monte Carlo method is approximating the area of a circle by tossing points at random (typically computer generated) onto the paper on which the circle is drawn. The percentage of points that fall inside the circle represents (approximately) the percentage of the area covered by the circle, as illustrated in Figure 1.6.
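A minimal R sketch of this circle experiment; the sample size of 10000 matches the caption of Figure 1.6, while the seed is an arbitrary choice:

    set.seed(1)
    n <- 10000
    x <- runif(n, -1, 1)               # random points in the square [-1, 1] x [-1, 1]
    y <- runif(n, -1, 1)
    inside <- x^2 + y^2 <= 1           # TRUE for points falling inside the unit circle
    4 * mean(inside)                   # fraction inside times the area of the square
    # close to pi = 3.141593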

Exercise 4. Write R code that would explore the area of an ellipsoid using the Monte Carlo method.

Below we present an application of the Monte Carlo approach to studying fitting methods for the Poisson model.

    Deciding for Poisson model

Recall that the Poisson model is given by

P(X = x | θ) = θ^x e^{−θ} / x!.

It is relatively easy to demonstrate that the mean value of this distribution is equal to θ and the variance is also equal to θ.

Exercise 5. Present a formal argument showing that for a Poisson random variable X with parameter θ, EX = θ and VarX = θ.

Figure 1.6: Monte Carlo study of the circle area approximation; for a sample size of 10000 the approximation is 3.1248, which compares to the true value of π = 3.141593.

Thus for a sample of observations x = (x1, . . . , xn) it is reasonable to consider both

θ̂1 = x̄   and   θ̂2 = (1/n) Σ_{i=1}^{n} x_i^2 − x̄^2

as estimators of θ. We want to employ the Monte Carlo method to decide which one is better. In the process we generate many samples from the Poisson distribution and check which of the estimators performs better. The resulting histograms of the values of the estimators are presented in Figure 1.7. It is quite clear from the graphs that the estimator based on the mean is better than the one based on the variance.

Figure 1.7: Monte Carlo results of comparing estimation of θ = 4 by the sample mean (left) vs. estimation using the sample variance (right).
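A sketch of such a Monte Carlo comparison in R; the true value θ = 4 is taken from the caption of Figure 1.7, while the sample size per replication (50) and the number of replications (1000) are illustrative assumptions:

    set.seed(1)
    theta <- 4; n <- 50; B <- 1000
    means <- numeric(B); vars <- numeric(B)
    for (b in 1:B) {
      x <- rpois(n, lambda = theta)
      means[b] <- mean(x)                  # estimator based on the sample mean
      vars[b]  <- mean(x^2) - mean(x)^2    # estimator based on the sample variance
    }
    par(mfrow = c(1, 2))
    hist(means, main = "Histogram of means")
    hist(vars,  main = "Histogram of vars")  # visibly more spread out around theta = 4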

1.4.2 Bootstrap: performing statistical inference using computers

Bootstrap (resampling) methods are one example of Monte Carlo based statistical analysis. The methodology can be summarized as follows:

Collect a statistical sample, i.e. the same type of data as in classical statistics.
Use a properly chosen Monte Carlo based resampling scheme on the data (using a random number generator) to create so-called bootstrap samples.
Analyze the bootstrap samples to draw conclusions about the random mechanism that produced the original statistical data.

    This way randomness is used to analyze statistical samples that, by the way, are also a

    result of randomness. An example illustrating the approach is presented next.

    Estimating nitrate ion concentration

Nitrate ion concentration measurements in a certain chemical lab have been collected and the results are given in the following table. The goal is to estimate, based on

    0.51 0.51 0.51 0.50 0.51 0.49 0.52 0.53 0.50 0.47

    0.51 0.52 0.53 0.48 0.49 0.50 0.52 0.49 0.49 0.50

    0.49 0.48 0.46 0.49 0.49 0.48 0.49 0.49 0.51 0.47

    0.51 0.51 0.51 0.48 0.50 0.47 0.50 0.51 0.49 0.48

    0.51 0.50 0.50 0.53 0.52 0.52 0.50 0.50 0.51 0.51

Table 1.1: Results of 50 determinations of nitrate ion concentration in µg per ml.

these values, the actual nitrate ion concentration. The overall mean of all observations is 0.4998. It is natural to ask what the error of this determination of the nitrate concentration is. If we were to repeat our experiment of collecting 50 nitrate concentration measurements many times, we would see the range of error that is made. However, that would be a waste of resources and not a viable method at all. Instead we resample new data from our data and use the samples so obtained to assess the error, comparing the resulting means (bootstrap means) with the original one. The differences between these represent the bootstrap estimation errors; their distribution is viewed as a good representation of the distribution of the true error. In Figure 1.8 we see the bootstrap counterpart of the distribution of the estimation error.
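A sketch of this resampling in R, using the 50 values of Table 1.1; the number of bootstrap samples (1000) is an illustrative assumption:

    nitrate <- c(0.51, 0.51, 0.51, 0.50, 0.51, 0.49, 0.52, 0.53, 0.50, 0.47,
                 0.51, 0.52, 0.53, 0.48, 0.49, 0.50, 0.52, 0.49, 0.49, 0.50,
                 0.49, 0.48, 0.46, 0.49, 0.49, 0.48, 0.49, 0.49, 0.51, 0.47,
                 0.51, 0.51, 0.51, 0.48, 0.50, 0.47, 0.50, 0.51, 0.49, 0.48,
                 0.51, 0.50, 0.50, 0.53, 0.52, 0.52, 0.50, 0.50, 0.51, 0.51)
    B <- 1000
    boot_err <- replicate(B, mean(sample(nitrate, replace = TRUE)) - mean(nitrate))
    hist(boot_err, main = "Histogram of bootstrap")   # compare with Figure 1.8
    quantile(boot_err, c(0.025, 0.975))               # a rough 95% range of the error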

Based on this we can safely say that the nitrate concentration is 0.4998 ± 0.005.

Figure 1.8: Bootstrap estimation error distribution.

Exercise 6. Consider a sample of the daily number of buyers in a furniture store:

8, 5, 2, 3, 1, 3, 9, 5, 5, 2, 3, 3, 8, 4, 7, 11, 7, 5, 12, 5

Consider the two estimators of θ for a Poisson distribution as discussed in the previous section. Describe formally the procedure (in steps) for obtaining a bootstrap confidence interval for θ using each of the discussed estimators, and provide 95% bootstrap confidence intervals for each of them.


    Chapter 2

    Review of Probability

    2.1 Expectation and Variance

The expected value E[Y] of a random variable Y is defined as

E[Y] = Σ_i y_i P(y_i)

if Y is discrete, and

E[Y] = ∫ y f(y) dy

if Y is continuous, where f(y) is the probability density function. The variance Var[Y] of a random variable Y is defined as

Var[Y] = E(Y − E[Y])^2,

or

Var[Y] = Σ_i (y_i − E[Y])^2 P(y_i)

if Y is discrete, and

Var[Y] = ∫ (y − E[Y])^2 f(y) dy

if Y is continuous. When there is no ambiguity we often write EY for E[Y], and VarY for Var[Y].

A function of a random variable is itself a random variable. If h(Y) is a function of the random variable Y, then the expected value of h(Y) is given by

E[h(Y)] = Σ_i h(y_i) P(y_i)

if Y is discrete, and

E[h(Y)] = ∫ h(y) f(y) dy

if Y is continuous.

It is relatively straightforward to derive the following results for the expectation and variance of a linear function of Y:

E[aY + b] = a E[Y] + b,
Var[aY + b] = a^2 Var[Y],

where a and b are constants. Also

Var[Y] = E[Y^2] − (E[Y])^2.    (2.1)

For expectations, it can be shown more generally that

E[ Σ_{i=1}^{k} a_i h_i(Y) ] = Σ_{i=1}^{k} a_i E[h_i(Y)],

where a_i, i = 1, 2, . . . , k, are constants and h_i(Y), i = 1, 2, . . . , k, are functions of the random variable Y.
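As a quick check, identity (2.1) follows by expanding the square inside the expectation and using this linearity (a standard one-line derivation, written out here for completeness):

\[
\mathrm{Var}[Y] = E\big[(Y-E[Y])^2\big]
              = E[Y^2] - 2E[Y]\,E[Y] + (E[Y])^2
              = E[Y^2] - (E[Y])^2 .
\]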

    2.2 Distribution of a Function of a Random Variable

If Y is a random variable, then for any regular function g the quantity X = g(Y) is also a random variable. The cumulative distribution function of X is given as

F_X(x) = P(X ≤ x) = P(Y ∈ g^{−1}((−∞, x])).

The density function of X, if it exists, can be found by differentiating the right hand side of the above equality.

Example 1. Let Y have a density f_Y and let X = Y^2. Then

F_X(x) = P(Y^2 < x) = P(−√x ≤ Y ≤ √x) = F_Y(√x) − F_Y(−√x).

By taking the derivative in x we obtain

f_X(x) = (1/(2√x)) [ f_Y(√x) + f_Y(−√x) ].

If additionally the distribution of Y is symmetric around zero, i.e. f_Y(−y) = f_Y(y), then

f_X(x) = (1/√x) f_Y(√x).

Exercise 7. Let Z be a random variable with the density f_Z(z) = e^{−z^2/2}/√(2π), the so-called standard normal (Gaussian) random variable. Show that Z^2 is a Gamma(1/2, 1/2) random variable, i.e. that it has the density given by

(1/√(2π)) x^{−1/2} e^{−x/2}.

The distribution of Z^2 is also called the chi-square distribution with one degree of freedom.

Exercise 8. Let F_Y(y) be the cumulative distribution function of some random variable Y that with probability one takes values in a set R_Y. Assume that there is an inverse function F_Y^{−1} : [0, 1] → R_Y such that F_Y(F_Y^{−1}(u)) = u for u ∈ [0, 1]. Check that for U ~ Unif(0, 1) the random variable Y = F_Y^{−1}(U) has F_Y as its cumulative distribution function.

The density of g(Y) is particularly easy to express if g is strictly monotone, as shown in the next result.

Theorem 2.2.1. Let Y be a continuous random variable with probability density function f_Y. Suppose that g(y) is a strictly monotone (increasing or decreasing), differentiable (and hence continuous) function of y. The random variable Z defined by Z = g(Y) has probability density function given by

f_Z(z) = f_Y(g^{−1}(z)) | (d/dz) g^{−1}(z) |,    (2.2)

where g^{−1}(z) is defined to be the inverse function of g(y).

Proof. Let g(y) be a monotone increasing (decreasing) function and let F_Y(y) and F_Z(z) denote the probability distribution functions of the random variables Y and Z. Then

F_Z(z) = P(Z ≤ z) = P(g(Y) ≤ z) = P(Y ≤ (≥) g^{−1}(z)) = F_Y(g^{−1}(z))   (respectively 1 − F_Y(g^{−1}(z))).

By the chain rule,

f_Z(z) = (d/dz) F_Z(z) = (±)(d/dz) F_Y(g^{−1}(z)) = f_Y(g^{−1}(z)) | (d/dz) g^{−1}(z) |.

Exercise 9. (The Log-Normal Distribution) Suppose Z has a standard normal distribution and g(z) = e^{az+b}. Then Y = g(Z) is called a log-normal random variable. Demonstrate that the density of Y is given by

f_Y(y) = (1/√(2π a^2)) y^{−1} exp( − log^2(y/e^b) / (2a^2) ).

2.3 Transforms Method: Characteristic, Probability Generating and Moment Generating Functions

The probability generating function (p.g.f.) of a random variable Y is the function denoted by G_Y(t) and defined by

G_Y(t) = E(t^Y),

for those t ∈ R for which the above expectation is convergent. The expectation defining G_Y(t) converges absolutely if |t| ≤ 1. As the name implies, the p.g.f. generates the probabilities associated with a discrete distribution P(Y = j) = p_j, j = 0, 1, 2, . . . :

G_Y(0) = p_0,   G_Y'(0) = p_1,   G_Y''(0) = 2! p_2.

In general the k-th derivative of the p.g.f. of Y satisfies

G_Y^{(k)}(0) = k! p_k.

The p.g.f. can be used to calculate the mean and variance of a random variable Y. Note that in the discrete case G_Y'(t) = Σ_{j=1} j p_j t^{j−1} for −1 < t < 1. Letting t approach one from the left, t ↑ 1, we obtain

G_Y'(1) = Σ_{j=1} j p_j = E(Y) = μ_Y.

The second derivative of G_Y(t) satisfies

G_Y''(t) = Σ_{j=1} j(j − 1) p_j t^{j−2},

and consequently

G_Y''(1) = Σ_{j=1} j(j − 1) p_j = E(Y^2) − E(Y).

The variance of Y satisfies

σ_Y^2 = EY^2 − EY + EY − E^2Y = G_Y''(1) + G_Y'(1) − (G_Y'(1))^2.
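A worked example, using the Poisson distribution reviewed later in this chapter: for Y ~ Pois(λ),

\[
G_Y(t) = E(t^Y) = \sum_{y=0}^{\infty} t^y\,\frac{\lambda^y e^{-\lambda}}{y!}
       = e^{-\lambda}\sum_{y=0}^{\infty}\frac{(\lambda t)^y}{y!} = e^{\lambda(t-1)},
\qquad
G_Y'(1) = \lambda = E(Y),
\]

in agreement with the mean given in the summary tables at the end of the chapter.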

The moment generating function (m.g.f.) of a random variable Y is denoted by M_Y(t) and defined as

M_Y(t) = E(e^{tY}),

for those t ∈ R for which the expectation exists. The moment generating function generates the moments EY^k:

M_Y(0) = 1,   M_Y'(0) = μ_Y = E(Y),   M_Y''(0) = EY^2,

and, in general,

M_Y^{(k)}(0) = EY^k.

The characteristic function (ch.f.) of a random variable Y is defined by

φ_Y(t) = E(e^{itY}),

where i = √(−1).

    A very important result concerning generating functions states that the moment

    generating function uniquely defines the probability distribution (provided it exists in

    an open interval around zero). The characteristic function also uniquely defines the

    probability distribution.


Property 1. If Y has characteristic function φ_Y(t) and moment generating function M_Y(t), then for X = a + bY

φ_X(t) = e^{ait} φ_Y(bt),
M_X(t) = e^{at} M_Y(bt).

    2.4 Random Vectors

    2.4.1 Sums of Independent Random Variables

Suppose that Y1, Y2, . . . , Yn are independent random variables. Then the moment generating function of the linear combination Z = Σ_{i=1}^{n} a_i Y_i is the product of the individual moment generating functions:

M_Z(t) = E(e^{t Σ a_i Y_i}) = E(e^{a_1 t Y_1}) E(e^{a_2 t Y_2}) · · · E(e^{a_n t Y_n}) = ∏_{i=1}^{n} M_{Y_i}(a_i t).

The same argument also gives φ_Z(t) = ∏_{i=1}^{n} φ_{Y_i}(a_i t).

When X and Y are discrete random variables, the condition of independence is equivalent to p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y. In the jointly continuous case the condition of independence is equivalent to f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.

Consider random variables X and Y with probability densities f_X(x) and f_Y(y) respectively. We seek the probability density of the random variable X + Y. Our general result follows from

F_{X+Y}(a) = P(X + Y < a) = ∫∫_{x+y<a} f_X(x) f_Y(y) dx dy = ∫ F_X(a − y) f_Y(y) dy.

Thus the density function is f_{X+Y}(z) = ∫ f_X(z − y) f_Y(y) dy, which is called the convolution of the densities f_X and f_Y.
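A short worked instance of this convolution formula, anticipating the exponential and gamma densities reviewed below: if X and Y are independent Exp(λ) random variables, then for z > 0

\[
f_{X+Y}(z) = \int_0^{z} \lambda e^{-\lambda(z-y)}\,\lambda e^{-\lambda y}\,dy
           = \lambda^2 e^{-\lambda z}\int_0^{z} dy = \lambda^2 z\,e^{-\lambda z},
\]

which is the Gamma(2, λ) density; compare with Section 2.7.1.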

    2.4.2 Covariance and Correlation

Suppose that X and Y are real-valued random variables for some random experiment. The covariance of X and Y is defined by

Cov(X, Y) = E[(X − EX)(Y − EY)]

and (assuming the variances are positive) the correlation of X and Y is defined by

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

Note that the covariance and correlation always have the same sign (positive, negative, or 0). When the sign is positive, the variables are said to be positively correlated; when the sign is negative, the variables are said to be negatively correlated; and when the sign is 0, the variables are said to be uncorrelated. For an intuitive understanding of correlation, suppose that we run the experiment a large number of times and that for each run we plot the values (X, Y) in a scatterplot. The scatterplot for positively correlated variables shows a linear trend with positive slope, while the scatterplot for negatively correlated variables shows a linear trend with negative slope. For uncorrelated variables, the scatterplot should look like an amorphous blob of points with no discernible linear trend.

Property 2. You should satisfy yourself that the following are true:

Cov(X, Y) = EXY − EX EY,
Cov(X, Y) = Cov(Y, X),
Cov(Y, Y) = Var(Y),
Cov(aX + bY + c, Z) = a Cov(X, Z) + b Cov(Y, Z),
Var( Σ_{i=1}^{n} Y_i ) = Σ_{i,j=1}^{n} Cov(Y_i, Y_j).

    If X and Y are independent, then they are uncorrelated. The converse is not true

    however.


    2.4.3 The Bivariate Change of Variables Formula

Suppose that (X, Y) is a random vector taking values in a subset S of R^2 with probability density function f. Suppose that U and V are random variables that are functions of X and Y:

U = U(X, Y),   V = V(X, Y).

If these functions have derivatives, there is a simple way to get the joint probability density function g of (U, V). First, we will assume that the transformation (x, y) → (u, v) is one-to-one and maps S onto a subset T of R^2. Thus, the inverse transformation (u, v) → (x, y) is well defined and maps T onto S. We will assume that the inverse transformation is smooth, in the sense that the partial derivatives

∂x/∂u,  ∂x/∂v,  ∂y/∂u,  ∂y/∂v

exist on T, and the Jacobian

∂(x, y)/∂(u, v) = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u)

is nonzero on T. Now, let B be an arbitrary subset of T. The inverse transformation

maps B onto a subset A of S. Therefore,

P((U, V) ∈ B) = P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.

But, by the change of variables formula for double integrals, this can be written as

P((U, V) ∈ B) = ∫∫_B f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)| du dv.

By the very meaning of density, it follows that the probability density function of (U, V) is

g(u, v) = f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)|,   (u, v) ∈ T.

The change of variables formula generalizes to R^n.

Exercise 10. Let U1 and U2 be independent random variables with density equal to one over [0, 1], i.e. standard uniform random variables. Find the density of the following vector of variables:

(Z1, Z2) = ( √(−2 log U1) cos(2π U2),  √(−2 log U1) sin(2π U2) ).


    2.5 Discrete Random Variables

    2.5.1 Bernoulli Distribution

A Bernoulli trial is a probabilistic experiment which can have one of two outcomes, success (Y = 1) or failure (Y = 0), and in which the probability of success is θ. We refer to θ as the Bernoulli probability parameter. The value of the random variable Y is used as an indicator of the outcome, which may also be interpreted as the presence or absence of a particular characteristic. A Bernoulli random variable Y has probability mass function

P(Y = y | θ) = θ^y (1 − θ)^{1−y}    (2.4)

for y = 0, 1 and some θ ∈ (0, 1). The notation Y ~ Ber(θ) should be read as: the random variable Y follows a Bernoulli distribution with parameter θ.

A Bernoulli random variable Y has expected value E[Y] = 0 · P(Y = 0) + 1 · P(Y = 1) = 0 · (1 − θ) + 1 · θ = θ, and variance Var[Y] = (0 − θ)^2 (1 − θ) + (1 − θ)^2 θ = θ(1 − θ).

    2.5.2 Binomial Distribution

Consider independent repetitions of Bernoulli experiments, each with probability of success θ. Next consider the random variable Y, defined as the number of successes in a fixed number n of independent Bernoulli trials. That is,

Y = Σ_{i=1}^{n} X_i,

where X_i ~ Bernoulli(θ) for i = 1, . . . , n. Each sequence of length n containing y ones and (n − y) zeros occurs with probability θ^y (1 − θ)^{n−y}. The number of sequences with y successes, and consequently (n − y) failures, is

n! / (y! (n − y)!) = ( n choose y ).

The random variable Y can take on the values y = 0, 1, 2, . . . , n with probabilities

P(Y = y | θ) = ( n choose y ) θ^y (1 − θ)^{n−y}.    (2.5)

The notation Y ~ Bin(n, θ) should be read as: the random variable Y follows a binomial distribution with parameters n and θ. Finally, using the fact that Y is the sum of n independent Bernoulli random variables, we can calculate the expected value as E[Y] = E[Σ X_i] = Σ E[X_i] = Σ θ = nθ and the variance as Var[Y] = Var[Σ X_i] = Σ Var[X_i] = Σ θ(1 − θ) = nθ(1 − θ).

    2.5.3 Negative Binomial and Geometric Distribution

Instead of fixing the number of trials, suppose now that the number of successes, r, is fixed, and that the sample size required in order to reach this fixed number is the random variable N. This is sometimes called inverse sampling. In the case of r = 1, using the independence argument again leads to the geometric distribution

P(N = n | θ) = θ (1 − θ)^{n−1},   n = 1, 2, . . . ,    (2.6)

which is the geometric probability function with parameter θ. The distribution is so named as the successive probabilities form a geometric series. The notation N ~ Geo(θ) should be read as: the random variable N follows a geometric distribution with parameter θ. Write (1 − θ) = q. Then

E[N] = Σ_{n=1}^{∞} n θ q^{n−1} = θ Σ_{n=1}^{∞} (d/dq)(q^n) = θ (d/dq) Σ_{n=0}^{∞} q^n = θ (d/dq) ( 1/(1 − q) ) = θ/(1 − q)^2 = 1/θ.

Also,

E[N^2] = Σ_{n=1}^{∞} n^2 θ q^{n−1} = θ Σ_{n=1}^{∞} (d/dq)(n q^n) = θ (d/dq) Σ_{n=1}^{∞} n q^n
       = θ (d/dq) ( q/(1 − q)^2 ) = θ ( 1/(1 − q)^2 + 2q/(1 − q)^3 ) = 1/θ + 2(1 − θ)/θ^2 = 2/θ^2 − 1/θ.

Using Var[N] = E[N^2] − (E[N])^2, we get Var[N] = (1 − θ)/θ^2.

Consider now sampling that continues until a total of r successes are observed. Again, let the random variable N denote the number of trials required. If the r-th success occurs on the n-th trial, then this implies that a total of (r − 1) successes are observed by the (n − 1)-th trial. The probability of this happening can be calculated using the binomial distribution as

( n − 1 choose r − 1 ) θ^{r−1} (1 − θ)^{n−r}.

The probability that the n-th trial is a success is θ. As these two events are independent, we have that

P(N = n | r, θ) = ( n − 1 choose r − 1 ) θ^r (1 − θ)^{n−r}    (2.7)

for n = r, r + 1, . . . . The notation N ~ NegBin(r, θ) should be read as: the random variable N follows a negative binomial distribution with parameters r and θ. This is also known as the Pascal distribution. For the moments,

E[N^k] = Σ_{n=r}^{∞} n^k ( n − 1 choose r − 1 ) θ^r (1 − θ)^{n−r}
       = (r/θ) Σ_{n=r}^{∞} n^{k−1} ( n choose r ) θ^{r+1} (1 − θ)^{n−r}        since n ( n − 1 choose r − 1 ) = r ( n choose r )
       = (r/θ) Σ_{m=r+1}^{∞} (m − 1)^{k−1} ( m − 1 choose r ) θ^{r+1} (1 − θ)^{m−(r+1)}
       = (r/θ) E[ (X − 1)^{k−1} ],

where X ~ NegBin(r + 1, θ). Setting k = 1 we get E(N) = r/θ. Setting k = 2 gives

E[N^2] = (r/θ) E(X − 1) = (r/θ) ( (r + 1)/θ − 1 ).

Therefore Var[N] = r(1 − θ)/θ^2.

    2.5.4 Hypergeometric Distribution

The hypergeometric distribution is used to describe sampling without replacement. Consider an urn containing b balls, of which w are white and b − w are red. We intend to draw a sample of size n from the urn. Let Y denote the number of white balls selected. Then, for y = 0, 1, 2, . . . , n, we have

P(Y = y | b, w, n) = ( w choose y ) ( b − w choose n − y ) / ( b choose n ).    (2.8)

The j-th moment of a hypergeometric random variable is

E[Y^j] = Σ_{y=0}^{n} y^j P(Y = y) = Σ_{y=1}^{n} y^j ( w choose y ) ( b − w choose n − y ) / ( b choose n ).

The identities

y ( w choose y ) = w ( w − 1 choose y − 1 ),      n ( b choose n ) = b ( b − 1 choose n − 1 )

can be used to obtain

E[Y^j] = (nw/b) Σ_{y=1}^{n} y^{j−1} ( w − 1 choose y − 1 ) ( b − w choose n − y ) / ( b − 1 choose n − 1 )
       = (nw/b) Σ_{x=0}^{n−1} (x + 1)^{j−1} ( w − 1 choose x ) ( b − w choose n − 1 − x ) / ( b − 1 choose n − 1 )
       = (nw/b) E[(X + 1)^{j−1}],

where X is a hypergeometric random variable with parameters n − 1, b − 1, w − 1. From this it is easy to establish that E[Y] = nθ and Var[Y] = nθ(1 − θ)(b − n)/(b − 1), where θ = w/b is the fraction of white balls in the population.

    2.5.5 Poisson Distribution

Certain problems involve counting the number of events that have occurred in a fixed time period. A random variable Y, taking on one of the values 0, 1, 2, . . . , is said to be a Poisson random variable with parameter λ if for some λ > 0,

P(Y = y | λ) = (λ^y / y!) e^{−λ},   y = 0, 1, 2, . . .    (2.9)

The notation Y ~ Pois(λ) should be read as: the random variable Y follows a Poisson distribution with parameter λ. Equation (2.9) defines a probability mass function, since

Σ_{y=0}^{∞} (λ^y / y!) e^{−λ} = e^{−λ} Σ_{y=0}^{∞} λ^y / y! = e^{−λ} e^{λ} = 1.

The expected value of a Poisson random variable is

E[Y] = Σ_{y=0}^{∞} y e^{−λ} λ^y / y! = λ e^{−λ} Σ_{y=1}^{∞} λ^{y−1} / (y − 1)! = λ e^{−λ} Σ_{j=0}^{∞} λ^j / j! = λ.

To get the variance we first compute the second moment:

E[Y^2] = Σ_{y=0}^{∞} y^2 e^{−λ} λ^y / y! = λ Σ_{y=1}^{∞} y e^{−λ} λ^{y−1} / (y − 1)! = λ Σ_{j=0}^{∞} (j + 1) e^{−λ} λ^j / j! = λ(λ + 1).

Since we already have E[Y] = λ, we obtain Var[Y] = E[Y^2] − (E[Y])^2 = λ.

Suppose that Y ~ Binomial(n, p), and let λ = np. Then

P(Y = y | n, p) = ( n choose y ) p^y (1 − p)^{n−y}
               = ( n choose y ) (λ/n)^y (1 − λ/n)^{n−y}
               = [ n(n − 1) · · · (n − y + 1) / n^y ] (λ^y / y!) (1 − λ/n)^n / (1 − λ/n)^y.

For n large and λ moderate, we have that

(1 − λ/n)^n ≈ e^{−λ},   n(n − 1) · · · (n − y + 1) / n^y ≈ 1,   (1 − λ/n)^y ≈ 1.

Our result is that a binomial random variable Bin(n, p) is well approximated by a Poisson random variable Pois(λ = np) when n is large and p is small. That is,

P(Y = y | n, p) ≈ e^{−np} (np)^y / y!.
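A quick numerical illustration of this approximation in R (the values n = 1000 and p = 0.002 are chosen purely for illustration):

    n <- 1000; p <- 0.002                       # large n, small p
    y <- 0:6
    round(cbind(binomial = dbinom(y, n, p),
                poisson  = dpois(y, lambda = n * p)), 5)
    # the two columns agree to several decimal places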

    2.5.6 Discrete Uniform Distribution

The discrete uniform distribution with integer parameter N has a random variable Y that can take the values y = 1, 2, . . . , N with equal probability 1/N. It is easy to show that the mean and variance of Y are E[Y] = (N + 1)/2 and Var[Y] = (N^2 − 1)/12.

    2.5.7 The Multinomial Distribution

Suppose that we perform n independent and identical experiments, where each experiment can result in any one of r possible outcomes, with respective probabilities p1, p2, . . . , pr, where Σ_{i=1}^{r} p_i = 1. If we denote by Y_i the number of the n experiments that result in outcome number i, then

P(Y1 = n1, Y2 = n2, . . . , Yr = nr) = ( n! / (n1! n2! · · · nr!) ) p1^{n1} p2^{n2} · · · pr^{nr},    (2.10)

where Σ_{i=1}^{r} n_i = n. Equation (2.10) is justified by noting that any sequence of outcomes that leads to outcome i occurring n_i times for i = 1, 2, . . . , r will, by the assumption of independence of the experiments, have probability p1^{n1} p2^{n2} · · · pr^{nr} of occurring. As there are n!/(n1! n2! · · · nr!) such sequences of outcomes, equation (2.10) is established.

    2.6 Continuous Random Variables

    2.6.1 Uniform Distribution

A random variable Y is said to be uniformly distributed over the interval (a, b) if its probability density function is given by

f(y | a, b) = 1/(b − a),   if a < y < b,

and equals 0 for all other values of y. Since F(u) = ∫_{−∞}^{u} f(y) dy, the distribution function of a uniform random variable on the interval (a, b) is

F(u) = 0 for u ≤ a;   F(u) = (u − a)/(b − a) for a < u ≤ b;   F(u) = 1 for u > b.

The expected value of a uniform random variable turns out to be the mid-point of the interval, that is,

E[Y] = ∫ y f(y) dy = ∫_a^b y/(b − a) dy = (b^2 − a^2)/(2(b − a)) = (b + a)/2.

The second moment is calculated as

E[Y^2] = ∫_a^b y^2/(b − a) dy = (b^3 − a^3)/(3(b − a)) = (1/3)(b^2 + ab + a^2),

hence the variance is

Var[Y] = E[Y^2] − (E[Y])^2 = (b − a)^2/12.

The notation Y ~ U(a, b) should be read as: the random variable Y follows a uniform distribution on the interval (a, b).


    2.6.2 Exponential Distribution

A random variable Y is said to be an exponential random variable if its probability density function is given by

f(y | λ) = λ e^{−λy},   y > 0, λ > 0.

The cumulative distribution function of an exponential random variable is given by

F(a) = ∫_0^a λ e^{−λy} dy = −e^{−λy} |_0^a = 1 − e^{−λa},   a > 0.

The expected value E[Y] = ∫_0^∞ y λ e^{−λy} dy requires integration by parts, yielding

E[Y] = −y e^{−λy} |_0^∞ + ∫_0^∞ e^{−λy} dy = −e^{−λy}/λ |_0^∞ = 1/λ.

Integration by parts can be used to verify that E[Y^2] = 2/λ^2. Hence Var[Y] = 1/λ^2. The notation Y ~ Exp(λ) should be read as: the random variable Y follows an exponential distribution with parameter λ.

Exercise 11. Let U ~ U[0, 1]. Find the distribution of Y = −log U. Can you identify it as one of the common distributions?

    2.6.3 Gamma Distribution

A random variable Y is said to have a gamma distribution if its density function is given by

f(y | α, λ) = λ e^{−λy} (λy)^{α−1} / Γ(α),   0 < y, α > 0, λ > 0,

where Γ(α) is called the gamma function and is defined by

Γ(α) = ∫_0^∞ e^{−u} u^{α−1} du.

Integration by parts of Γ(α) yields the recursive relationship

Γ(α) = −e^{−u} u^{α−1} |_0^∞ + ∫_0^∞ e^{−u} (α − 1) u^{α−2} du    (2.11)
     = (α − 1) ∫_0^∞ e^{−u} u^{α−2} du = (α − 1) Γ(α − 1).    (2.12)

For integer values α = n, this recursive relationship reduces to Γ(n + 1) = n!. Note that by setting α = 1 the gamma distribution reduces to an exponential distribution. The expected value of a gamma random variable is given by

E[Y] = (λ^α/Γ(α)) ∫_0^∞ y^α e^{−λy} dy = (1/(λ Γ(α))) ∫_0^∞ u^α e^{−u} du,

after the change of variable u = λy. Hence E[Y] = Γ(α + 1)/(λ Γ(α)) = α/λ. Using the same substitution,

E[Y^2] = (λ^α/Γ(α)) ∫_0^∞ y^{α+1} e^{−λy} dy = α(α + 1)/λ^2,

so that Var[Y] = α/λ^2. The notation Y ~ Gamma(α, λ) should be read as: the random variable Y follows a gamma distribution with parameters α and λ.

Exercise 12. Let Y ~ Gamma(α, λ). Show that the moment generating function of Y is given, for t < λ, by

M_Y(t) = 1/(1 − t/λ)^α.

    2.6.4 Gaussian (Normal) Distribution

A random variable Z is a standard normal (or Gaussian) random variable if the density of Z is specified by

f(z) = (1/√(2π)) e^{−z^2/2}.    (2.13)

It is not immediately obvious that (2.13) specifies a probability density. To show that this is the case we need to prove that

(1/√(2π)) ∫_{−∞}^{∞} e^{−z^2/2} dz = 1

or, equivalently, that I = ∫_{−∞}^{∞} e^{−z^2/2} dz = √(2π). This is a classic result and so is well worth confirming. Consider

I^2 = ( ∫_{−∞}^{∞} e^{−z^2/2} dz ) ( ∫_{−∞}^{∞} e^{−w^2/2} dw ) = ∫∫ e^{−(z^2+w^2)/2} dz dw.

The double integral can be evaluated by a change of variables to polar coordinates. Substituting z = r cos θ, w = r sin θ, and dz dw = r dθ dr, we get

I^2 = ∫_0^∞ ∫_0^{2π} e^{−r^2/2} r dθ dr = 2π ∫_0^∞ r e^{−r^2/2} dr = 2π ( −e^{−r^2/2} ) |_0^∞ = 2π.

Taking the square root we get I = √(2π). The result I = √(2π) can also be used to establish that Γ(1/2) = √π. To prove that this is the case, note that, with the substitution u = z^2,

Γ(1/2) = ∫_0^∞ e^{−u} u^{−1/2} du = 2 ∫_0^∞ e^{−z^2} dz = √π.

The expected value of Z equals zero because z e^{−z^2/2} is integrable and antisymmetric around zero. The variance of Z is given by

Var[Z] = (1/√(2π)) ∫_{−∞}^{∞} z^2 e^{−z^2/2} dz.

Thus, integrating by parts,

Var[Z] = (1/√(2π)) ∫_{−∞}^{∞} z^2 e^{−z^2/2} dz
       = (1/√(2π)) [ −z e^{−z^2/2} |_{−∞}^{+∞} + ∫_{−∞}^{∞} e^{−z^2/2} dz ]
       = (1/√(2π)) ∫_{−∞}^{∞} e^{−z^2/2} dz
       = 1.

If Z is a standard normal random variable then Y = μ + σZ is called a general normal (Gaussian) random variable with parameters μ and σ. The density of Y is given by

f(y | μ, σ) = (1/√(2πσ^2)) e^{−(y−μ)^2/(2σ^2)}.

We obviously have E[Y] = μ and Var[Y] = σ^2. The notation Y ~ N(μ, σ^2) should be read as: the random variable Y follows a normal distribution with mean parameter μ and variance parameter σ^2. From the definition of Y it follows immediately that a + bY, where a and b are known constants, again has a normal distribution.

Exercise 13. Let Y ~ N(μ, σ^2). What is the distribution of X = a + bY?

Exercise 14. Let Y ~ N(μ, σ^2). Show that the moment generating function of Y is given by

M_Y(t) = e^{μt + σ^2 t^2/2}.

Hint: consider first the standard normal variable and then apply Property 1.


    2.6.5 Weibull Distribution

The Weibull distribution function has the form

F(y) = 1 − exp( −(y/b)^a ),   y > 0.

The Weibull density can be obtained by differentiation as

f(y | a, b) = (a/b) (y/b)^{a−1} exp( −(y/b)^a ).

To calculate the expected value

E[Y] = ∫_0^∞ y (a/b^a) y^{a−1} exp( −(y/b)^a ) dy

we use the substitution u = (y/b)^a, with du = (a/b^a) y^{a−1} dy. This yields

E[Y] = b ∫_0^∞ u^{1/a} e^{−u} du = b Γ( (a + 1)/a ).

In a similar manner, it is straightforward to verify that

E[Y^2] = b^2 Γ( (a + 2)/a ),

and thus

Var[Y] = b^2 [ Γ( (a + 2)/a ) − Γ^2( (a + 1)/a ) ].

    2.6.6 Beta Distribution

A random variable is said to have a beta distribution if its density is given by

f(y | a, b) = (1/B(a, b)) y^{a−1} (1 − y)^{b−1},   0 < y < 1.

Here the function

B(a, b) = ∫_0^1 u^{a−1} (1 − u)^{b−1} du

is the beta function, and it is related to the gamma function through

B(a, b) = Γ(a) Γ(b) / Γ(a + b).

Proceeding in the usual manner, we can show that

E[Y] = a/(a + b),      Var[Y] = ab / ( (a + b)^2 (a + b + 1) ).


    2.6.7 Chi-square Distribution

Let Z ~ N(0, 1), and let Y = Z^2. Then the cumulative distribution function is

F_Y(y) = P(Y ≤ y) = P(Z^2 ≤ y) = P(−√y ≤ Z ≤ √y) = F_Z(√y) − F_Z(−√y),

so that by differentiating in y we arrive at the density

f_Y(y) = (1/(2√y)) [ f_Z(√y) + f_Z(−√y) ] = (1/√(2πy)) e^{−y/2},

in which we recognize Gamma(1/2, 1/2). Suppose that Y = Σ_{i=1}^{n} Z_i^2, where the Z_i ~ N(0, 1), i = 1, . . . , n, are independent. From results on the sum of independent Gamma random variables, Y ~ Gamma(n/2, 1/2). This density has the form

f_Y(y | n) = e^{−y/2} y^{n/2−1} / ( 2^{n/2} Γ(n/2) ),   y > 0    (2.14)

and is referred to as the chi-squared distribution on n degrees of freedom. The notation Y ~ Chi(n) should be read as: the random variable Y follows a chi-squared distribution with n degrees of freedom. Later we will show that if X ~ Chi(u) and Y ~ Chi(v), then it follows that X + Y ~ Chi(u + v).

    2.6.8 The Bivariate Normal Distribution

Suppose that U and V are independent random variables, each with the standard normal distribution. We will need the following parameters: μ_X, μ_Y, σ_X > 0, σ_Y > 0, ρ ∈ [−1, 1]. Now let X and Y be new random variables defined by

X = μ_X + σ_X U,
Y = μ_Y + ρ σ_Y U + σ_Y √(1 − ρ^2) V.

Using basic properties of the mean, variance, covariance, and the normal distribution, satisfy yourself of the following.

Property 3. The following properties hold:

1. X is normally distributed with mean μ_X and standard deviation σ_X,
2. Y is normally distributed with mean μ_Y and standard deviation σ_Y,
3. Corr(X, Y) = ρ,
4. X and Y are independent if and only if ρ = 0.
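A small R sketch of this construction; the parameter values below are arbitrary illustrations, not taken from the notes:

    set.seed(1)
    n <- 100000
    muX <- 1; muY <- -2; sX <- 2; sY <- 3; rho <- 0.6
    U <- rnorm(n); V <- rnorm(n)                   # independent standard normals
    X <- muX + sX * U
    Y <- muY + rho * sY * U + sY * sqrt(1 - rho^2) * V
    c(mean(X), sd(X), mean(Y), sd(Y), cor(X, Y))   # close to 1, 2, -2, 3, 0.6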

The inverse transformation is

u = (x − μ_X)/σ_X,
v = (y − μ_Y)/(σ_Y √(1 − ρ^2)) − ρ (x − μ_X)/(σ_X √(1 − ρ^2)),

so that the Jacobian of the inverse transformation is

∂(u, v)/∂(x, y) = 1/( σ_X σ_Y √(1 − ρ^2) ).

Since U and V are independent standard normal variables, their joint probability density function is

g(u, v) = (1/(2π)) e^{−(u^2+v^2)/2}.

Using the bivariate change of variables formula, the joint density of (X, Y) is

f(x, y) = (1/(2π σ_X σ_Y √(1 − ρ^2))) exp{ −[ (x − μ_X)^2/(2σ_X^2(1 − ρ^2)) − ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y(1 − ρ^2)) + (y − μ_Y)^2/(2σ_Y^2(1 − ρ^2)) ] }.

    Bivariate Normal Conditional Distributions

In the last section we derived the joint probability density function f of the bivariate normal random variables X and Y. The marginal densities are known. Then

f_{Y|X}(y | x) = f_{Y,X}(y, x) / f_X(x) = (1/√(2π σ_Y^2 (1 − ρ^2))) exp{ −( y − (μ_Y + ρ σ_Y (x − μ_X)/σ_X) )^2 / (2σ_Y^2(1 − ρ^2)) }.

Thus the conditional distribution of Y given X = x is also Gaussian, with

E(Y | X = x) = μ_Y + ρ σ_Y (x − μ_X)/σ_X,
Var(Y | X = x) = σ_Y^2 (1 − ρ^2).

    2.6.9 The Multivariate Normal Distribution

Let Σ denote the 2 × 2 symmetric matrix

Σ = [ σ_X^2        ρ σ_X σ_Y ;
      ρ σ_Y σ_X    σ_Y^2     ].

Then

det Σ = σ_X^2 σ_Y^2 − (ρ σ_X σ_Y)^2 = σ_X^2 σ_Y^2 (1 − ρ^2)

and

Σ^{−1} = (1/(1 − ρ^2)) [ 1/σ_X^2          −ρ/(σ_X σ_Y) ;
                         −ρ/(σ_X σ_Y)     1/σ_Y^2      ].

Hence the bivariate normal density of (X, Y) can be written in matrix notation as

f_{(X,Y)}(x, y) = (1/(2π √(det Σ))) exp{ −(1/2) (x − μ_X, y − μ_Y)^T Σ^{−1} (x − μ_X, y − μ_Y) }.

Let Y = (Y1, . . . , Yp) be a random vector. Let E(Y_i) = μ_i, i = 1, . . . , p, and define the p-length vector μ = (μ_1, . . . , μ_p). Define the p × p matrix Σ through its elements Cov(Y_i, Y_j), i, j = 1, . . . , p. Then the random vector Y has a p-dimensional multivariate Gaussian distribution if its density function is specified by

f_Y(y) = (1/( (2π)^{p/2} |Σ|^{1/2} )) exp{ −(1/2) (y − μ)^T Σ^{−1} (y − μ) }.    (2.15)

The notation Y ~ MVN_p(μ, Σ) should be read as: the random vector Y follows a multivariate Gaussian (normal) distribution with p-vector mean μ and p × p variance-covariance matrix Σ.


2.7 Distributions: further properties

2.7.1 Sum of Independent Random Variables: special cases

    Poisson variables

Suppose X ~ Pois(λ) and Y ~ Pois(μ). Assume that X and Y are independent. Then

P(X + Y = n) = Σ_{k=0}^{n} P(X = k, Y = n − k)
             = Σ_{k=0}^{n} P(X = k) P(Y = n − k)
             = Σ_{k=0}^{n} (e^{−λ} λ^k / k!) (e^{−μ} μ^{n−k} / (n − k)!)
             = (e^{−(λ+μ)} / n!) Σ_{k=0}^{n} (n! / (k!(n − k)!)) λ^k μ^{n−k}
             = e^{−(λ+μ)} (λ + μ)^n / n!.

That is, X + Y ~ Pois(λ + μ).

    Binomial Random Variables

We seek the distribution of X + Y, where Y ~ Bin(n, θ) and X ~ Bin(m, θ). The sum X + Y models the situation where the total number of trials is fixed at n + m and the probability of a success in a single trial equals θ. Without performing any calculations, we expect to find that X + Y ~ Bin(n + m, θ). To verify this, note that X = X1 + · · · + Xm, where the Xi are independent Bernoulli variables with parameter θ, while Y = Y1 + · · · + Yn, where the Yi are also independent Bernoulli variables with parameter θ. Assuming that the Xi's are independent of the Yi's, we obtain that X + Y is the sum of n + m independent Bernoulli random variables with parameter θ, i.e. X + Y has the Bin(n + m, θ) distribution.

    Gamma, Chi-square, and Exponential Random Variables

Let $X \sim \mathrm{Gamma}(\alpha, \lambda)$ and $Y \sim \mathrm{Gamma}(\beta, \lambda)$ be independent. Then the moment generating function of $X + Y$ is given by
\[
M_{X+Y}(t) = M_X(t)M_Y(t)
= \frac{1}{(1-t/\lambda)^{\alpha}}\,\frac{1}{(1-t/\lambda)^{\beta}}
= \frac{1}{(1-t/\lambda)^{\alpha+\beta}}.
\]
But this is the moment generating function of a Gamma random variable distributed as $\mathrm{Gamma}(\alpha+\beta, \lambda)$. The result $X + Y \sim \chi^2(u+v)$, where $X \sim \chi^2(u)$ and $Y \sim \chi^2(v)$, follows as a corollary.

Let $Y_1, \ldots, Y_n$ be $n$ independent exponential random variables, each with parameter $\lambda$. Then $Z = Y_1 + Y_2 + \cdots + Y_n$ is a $\mathrm{Gamma}(n, \lambda)$ random variable. To see that this is indeed the case, write $Y_i \sim \mathrm{Exp}(\lambda)$, or alternatively, $Y_i \sim \mathrm{Gamma}(1, \lambda)$. Then $Y_1 + Y_2 \sim \mathrm{Gamma}(2, \lambda)$, and by induction $\sum_{i=1}^{n} Y_i \sim \mathrm{Gamma}(n, \lambda)$.
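The same fact can be checked by simulation; in the sketch below (arbitrary values of $\lambda$ and $n$, SciPy assumed) a Kolmogorov-Smirnov test does not distinguish sums of $n$ i.i.d. $\mathrm{Exp}(\lambda)$ variables from $\mathrm{Gamma}(n, \lambda)$ draws.

import numpy as np
from scipy.stats import gamma, kstest

lam, n, reps = 1.5, 5, 200_000    # illustrative choices
rng = np.random.default_rng(1)

# Sum n iid Exp(lam) variables; NumPy parameterizes the exponential by scale = 1/lam.
z = rng.exponential(scale=1 / lam, size=(reps, n)).sum(axis=1)

# Compare with Gamma(n, lam): in SciPy this is shape a = n with scale = 1/lam.
print(kstest(z, gamma(a=n, scale=1 / lam).cdf))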

    Gaussian Random Variables

Let $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ be independent. Then the moment generating function of $X + Y$ is given by
\[
M_{X+Y}(t) = M_X(t)M_Y(t)
= e^{\mu_X t + \sigma_X^2 t^2/2}\, e^{\mu_Y t + \sigma_Y^2 t^2/2}
= e^{(\mu_X+\mu_Y)t + (\sigma_X^2+\sigma_Y^2)t^2/2},
\]
which proves that $X + Y \sim N(\mu_X + \mu_Y,\, \sigma_X^2 + \sigma_Y^2)$.


2.7.2 Common Distributions – Summarizing Tables

Discrete Distributions

Bernoulli($\theta$)
  pmf:            $P(Y=y|\theta) = \theta^y(1-\theta)^{1-y}$,  $y = 0, 1$;  $0 \le \theta \le 1$
  mean/variance:  $E[Y] = \theta$,  $\mathrm{Var}[Y] = \theta(1-\theta)$
  mgf:            $M_Y(t) = \theta e^t + (1-\theta)$

Binomial($n, \theta$)
  pmf:            $P(Y=y|\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$,  $y = 0, 1, \ldots, n$;  $0 \le \theta \le 1$
  mean/variance:  $E[Y] = n\theta$,  $\mathrm{Var}[Y] = n\theta(1-\theta)$
  mgf:            $M_Y(t) = [\theta e^t + (1-\theta)]^n$

Discrete uniform($N$)
  pmf:            $P(Y=y|N) = 1/N$,  $y = 1, 2, \ldots, N$
  mean/variance:  $E[Y] = (N+1)/2$,  $\mathrm{Var}[Y] = (N+1)(N-1)/12$
  mgf:            $M_Y(t) = \frac{1}{N}\,\frac{e^t(1-e^{Nt})}{1-e^t}$

Geometric($\theta$)
  pmf:            $P(Y=y|\theta) = \theta(1-\theta)^{y-1}$,  $y = 1, 2, \ldots$;  $0 \le \theta \le 1$
  mean/variance:  $E[Y] = 1/\theta$,  $\mathrm{Var}[Y] = (1-\theta)/\theta^2$
  mgf:            $M_Y(t) = \theta e^t/[1-(1-\theta)e^t]$,  $t < -\log(1-\theta)$
  notes:          The random variable $X = Y - 1$ is $\mathrm{NegBin}(1, \theta)$.

Hypergeometric($b, w, n$)
  pmf:            $P(Y=y|b,w,n) = \binom{w}{y}\binom{b-w}{n-y}\big/\binom{b}{n}$,  $y = 0, 1, \ldots, n$,
                  with $\max(0,\, n-(b-w)) \le y \le \min(n, w)$;  $b, w, n \ge 0$
  mean/variance:  $E[Y] = nw/b$,  $\mathrm{Var}[Y] = nw(b-w)(b-n)/(b^2(b-1))$

Negative binomial($r, \theta$)
  pmf:            $P(Y=y|r,\theta) = \binom{r+y-1}{y}\theta^r(1-\theta)^y$,  $y = 0, 1, 2, \ldots$;  $0 < \theta \le 1$
  mean/variance:  $E[Y] = r(1-\theta)/\theta$,  $\mathrm{Var}[Y] = r(1-\theta)/\theta^2$
  mgf:            $M_Y(t) = \bigl[\theta/(1-(1-\theta)e^t)\bigr]^r$,  $t < -\log(1-\theta)$
  notes:          An alternative form of the pmf, used in the derivation in our notes, is given by
                  $P(N=n|r,\theta) = \binom{n-1}{r-1}\theta^r(1-\theta)^{n-r}$,  $n = r, r+1, \ldots$,
                  where the random variable $N = Y + r$. The negative binomial can also be
                  derived as a mixture of Poisson random variables.

Poisson($\lambda$)
  pmf:            $P(Y=y|\lambda) = \lambda^y e^{-\lambda}/y!$,  $y = 0, 1, 2, \ldots$;  $0 < \lambda$
  mean/variance:  $E[Y] = \lambda$,  $\mathrm{Var}[Y] = \lambda$
  mgf:            $M_Y(t) = e^{\lambda(e^t-1)}$


Continuous Distributions

Uniform $U(a, b)$
  pdf:            $f(y|a,b) = 1/(b-a)$,  $a < y < b$
  mean/variance:  $E[Y] = (b+a)/2$,  $\mathrm{Var}[Y] = (b-a)^2/12$
  mgf:            $M_Y(t) = (e^{bt}-e^{at})/((b-a)t)$
  notes:          A uniform distribution with $a = 0$ and $b = 1$ is a special case of the
                  beta distribution (with $\alpha = \beta = 1$).

Exponential $E(\lambda)$
  pdf:            $f(y|\lambda) = \lambda e^{-\lambda y}$,  $y > 0$,  $\lambda > 0$
  mean/variance:  $E[Y] = 1/\lambda$,  $\mathrm{Var}[Y] = 1/\lambda^2$
  mgf:            $M_Y(t) = 1/(1-t/\lambda)$
  notes:          Special case of the gamma distribution. $X = Y^{1/\gamma}$ is Weibull,
                  $X = \sqrt{2Y}$ is Rayleigh, $X = -\log(\lambda Y)$ is Gumbel.

Gamma $G(\alpha, \lambda)$
  pdf:            $f(y|\alpha,\lambda) = \lambda^{\alpha}e^{-\lambda y}y^{\alpha-1}/\Gamma(\alpha)$,  $y > 0$;  $\alpha, \lambda > 0$
  mean/variance:  $E[Y] = \alpha/\lambda$,  $\mathrm{Var}[Y] = \alpha/\lambda^2$
  mgf:            $M_Y(t) = 1/(1-t/\lambda)^{\alpha}$
  notes:          Includes the exponential ($\alpha = 1$) and chi squared ($\alpha = n/2$, $\lambda = 1/2$).

Normal $N(\mu, \sigma^2)$
  pdf:            $f(y|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y-\mu)^2/(2\sigma^2)}$,  $\sigma > 0$
  mean/variance:  $E[Y] = \mu$,  $\mathrm{Var}[Y] = \sigma^2$
  mgf:            $M_Y(t) = e^{\mu t + \sigma^2 t^2/2}$
  notes:          Often called the Gaussian distribution.

Transforms

The generating functions of the discrete and continuous random variables discussed thus far are given in Table 2.1.


Distrib.              p.g.f.                            m.g.f.                                   ch.f.
Bi($n,\theta$)        $(\theta t+\bar\theta)^n$         $(\theta e^t+\bar\theta)^n$              $(\theta e^{it}+\bar\theta)^n$
Geo($\theta$)         $\theta t/(1-\bar\theta t)$       $\theta/(e^{-t}-\bar\theta)$             $\theta/(e^{-it}-\bar\theta)$
NegBin($r,\theta$)    $\theta^r(1-\bar\theta t)^{-r}$   $\theta^r(1-\bar\theta e^t)^{-r}$        $\theta^r(1-\bar\theta e^{it})^{-r}$
Poi($\lambda$)        $e^{-\lambda(1-t)}$               $e^{\lambda(e^t-1)}$                     $e^{\lambda(e^{it}-1)}$
Unif($\alpha,\beta$)  –                                 $\frac{e^{\beta t}-e^{\alpha t}}{(\beta-\alpha)t}$    $\frac{e^{i\beta t}-e^{i\alpha t}}{i(\beta-\alpha)t}$
Exp($\lambda$)        –                                 $(1-t/\lambda)^{-1}$                     $(1-it/\lambda)^{-1}$
Ga($c,\lambda$)       –                                 $(1-t/\lambda)^{-c}$                     $(1-it/\lambda)^{-c}$
N($\mu,\sigma^2$)     –                                 $\exp(\mu t+\sigma^2 t^2/2)$             $\exp(i\mu t-\sigma^2 t^2/2)$

Table 2.1: Transforms of distributions. In the formulas $\bar\theta = 1-\theta$.


    Chapter 3

    Likelihood

    3.1 Maximum Likelihood Estimation

Let $x$ be a realization of the random variable $X$ with probability density $f_X(x|\theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_m)^T$ is a vector of $m$ unknown parameters to be estimated. The set of allowable values for $\theta$, denoted by $\Theta$ (or sometimes by $\Omega$), is called the parameter space. Define the likelihood function
\[
l(\theta|x) = f_X(x|\theta). \tag{3.1}
\]
It is crucial to stress that the argument of $f_X(x|\theta)$ is $x$, but the argument of $l(\theta|x)$ is $\theta$. It is therefore convenient to view the likelihood function $l(\theta)$ as the probability of the observed data $x$ considered as a function of $\theta$. Usually it is convenient to work with the natural logarithm of the likelihood, called the log-likelihood and denoted by $\log l(\theta|x)$.

When $\theta \in \mathbb{R}^1$ we can define the score function as the first derivative of the log-likelihood,
\[
S(\theta) = \frac{\partial}{\partial\theta}\log l(\theta).
\]
The maximum likelihood estimate (MLE) $\hat\theta$ of $\theta$ is the solution to the score equation
\[
S(\theta) = 0.
\]


At the maximum, the second partial derivative of the log-likelihood is negative, so we define the curvature at $\hat\theta$ as $I(\hat\theta)$, where
\[
I(\theta) = -\frac{\partial^2}{\partial\theta^2}\log l(\theta).
\]
We can check that a solution of the equation $S(\theta) = 0$ is actually a maximum by checking that $I(\hat\theta) > 0$. A large curvature $I(\hat\theta)$ is associated with a tight or strong peak, intuitively indicating less uncertainty about $\theta$.

The likelihood function $l(\theta|x)$ supplies an order of preference or plausibility among possible values of $\theta$ based on the observed $x$. It ranks the plausibility of possible values of $\theta$ by how probable they make the observed $x$. If $P(x|\theta = \theta_1) > P(x|\theta = \theta_2)$, then the observed $x$ makes $\theta = \theta_1$ more plausible than $\theta = \theta_2$, and consequently, from (3.1), $l(\theta_1|x) > l(\theta_2|x)$. The likelihood ratio $l(\theta_1|x)/l(\theta_2|x) = f(x|\theta_1)/f(x|\theta_2)$ is a measure of the plausibility of $\theta_1$ relative to $\theta_2$ based on the observed $x$. The relative likelihood $l(\theta_1|x)/l(\theta_2|x) = k$ means that the observed value $x$ will occur $k$ times more frequently in repeated samples from the population defined by the value $\theta_1$ than from the population defined by $\theta_2$. Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum.

When the random variables $X_1, \ldots, X_n$ are mutually independent we can write the joint density as
\[
f_{\mathbf{X}}(\mathbf{x}) = \prod_{j=1}^{n} f_{X_j}(x_j),
\]
where $\mathbf{x} = (x_1, \ldots, x_n)$ is a realization of the random vector $\mathbf{X} = (X_1, \ldots, X_n)$, and the likelihood function becomes
\[
l(\theta|\mathbf{x}) = \prod_{j=1}^{n} f_{X_j}(x_j|\theta).
\]
When the densities $f_{X_j}(x_j)$ are identical, we unambiguously write $f(x_j)$.

Example 2 (Bernoulli Trials). Consider $n$ independent Bernoulli trials. The $j$th observation is either a success or a failure, coded $x_j = 1$ and $x_j = 0$ respectively, and
\[
P(X_j = x_j) = \theta^{x_j}(1-\theta)^{1-x_j}
\]


for $j = 1, \ldots, n$. The vector of observations $y = (x_1, x_2, \ldots, x_n)^T$ is a sequence of ones and zeros, and is a realization of the random vector $Y = (X_1, X_2, \ldots, X_n)^T$. As the Bernoulli outcomes are assumed to be independent, we can write the joint probability mass function of $Y$ as the product of the marginal probabilities, that is
\[
l(\theta) = \prod_{j=1}^{n} P(X_j = x_j)
= \prod_{j=1}^{n} \theta^{x_j}(1-\theta)^{1-x_j}
= \theta^{\sum x_j}(1-\theta)^{n-\sum x_j}
= \theta^{r}(1-\theta)^{n-r},
\]
where $r = \sum_{j=1}^{n} x_j$ is the number of observed successes (1s) in the vector $y$. The log-likelihood function is then
\[
\log l(\theta) = r\log\theta + (n-r)\log(1-\theta),
\]
and the score function is
\[
S(\theta) = \frac{\partial}{\partial\theta}\log l(\theta) = \frac{r}{\theta} - \frac{n-r}{1-\theta}.
\]
Solving $S(\theta) = 0$ we get $\hat\theta = r/n$. We also have
\[
I(\theta) = \frac{r}{\theta^2} + \frac{n-r}{(1-\theta)^2} > 0,
\]
guaranteeing that $\hat\theta$ is the MLE. Each $X_i$ is a Bernoulli random variable with expected value $E(X_i) = \theta$ and variance $\mathrm{Var}(X_i) = \theta(1-\theta)$. The MLE $\hat\theta(y)$ is itself a random variable and has expected value
\[
E(\hat\theta) = E\left(\frac{r}{n}\right) = E\left(\frac{\sum_{i=1}^{n} X_i}{n}\right)
= \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n}\theta = \theta.
\]
If an estimator has on average the value of the parameter that it is intended to estimate, then we call it unbiased, i.e. if $E(\hat\theta) = \theta$. From the above calculation it follows that $\hat\theta(y)$ is an unbiased estimator of $\theta$. The variance of $\hat\theta(y)$ is
\[
\mathrm{Var}(\hat\theta) = \mathrm{Var}\left(\frac{\sum_{i=1}^{n} X_i}{n}\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)
= \frac{1}{n^2}\sum_{i=1}^{n}\theta(1-\theta)
= \frac{\theta(1-\theta)}{n}.
\]
□
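The closed form $\hat\theta = r/n$ can be cross-checked against a direct numerical maximization of the log-likelihood. The sketch below is only an illustration: it simulates Bernoulli data with an arbitrary true $\theta$ and assumes SciPy is available.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
theta_true, n = 0.3, 200           # illustrative values
x = rng.binomial(1, theta_true, size=n)
r = x.sum()

# Negative Bernoulli log-likelihood: -(r log(theta) + (n - r) log(1 - theta)).
def neg_loglik(theta):
    return -(r * np.log(theta) + (n - r) * np.log(1 - theta))

opt = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(opt.x, r / n)                # numerical maximizer agrees with r/n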


Example 3 (Binomial sampling). The number of successes in $n$ Bernoulli trials is a random variable $R$ taking on values $r = 0, 1, \ldots, n$ with probability mass function
\[
P(R = r) = \binom{n}{r}\theta^r(1-\theta)^{n-r}.
\]
This is exactly the same sampling scheme as in the previous example, except that instead of observing the sequence $y$ we only observe the total number of successes $r$. Hence the likelihood function has the form
\[
l(\theta|r) = \binom{n}{r}\theta^r(1-\theta)^{n-r}.
\]
The relevant mathematical calculations are as follows:
\[
\begin{aligned}
\log l(\theta|r) &= \log\binom{n}{r} + r\log\theta + (n-r)\log(1-\theta) \\
S(\theta) &= \frac{r}{\theta} - \frac{n-r}{1-\theta}, \qquad \hat\theta = \frac{r}{n} \\
I(\theta) &= \frac{r}{\theta^2} + \frac{n-r}{(1-\theta)^2} > 0 \\
E(\hat\theta) &= \frac{E(R)}{n} = \frac{n\theta}{n} = \theta \quad\text{(unbiased)} \\
\mathrm{Var}(\hat\theta) &= \frac{\mathrm{Var}(R)}{n^2} = \frac{n\theta(1-\theta)}{n^2} = \frac{\theta(1-\theta)}{n}.
\end{aligned}
\]
□

Example 4 (Prevalence of a Genotype). Geneticists interested in the prevalence of a certain genotype observe that the genotype makes its first appearance in the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function can be computed based on the geometric distribution, as $l(\theta) = \theta(1-\theta)^{n-1}$. The score function is then $S(\theta) = \theta^{-1} - (n-1)(1-\theta)^{-1}$. Setting $S(\theta) = 0$ we get $\hat\theta = n^{-1} = 22^{-1}$. Moreover, $I(\theta) = \theta^{-2} + (n-1)(1-\theta)^{-2}$, which is greater than zero for all $\theta$, implying that $\hat\theta$ is the MLE.

Suppose that the geneticists had planned to stop sampling once they observed $r = 10$ subjects with the specified genotype, and the tenth subject with the genotype was the 100th subject analysed overall. The likelihood of $\theta$ can be computed based on the negative binomial distribution, as
\[
l(\theta) = \binom{n-1}{r-1}\theta^r(1-\theta)^{n-r}
\]
for $n = 100$, $r = 10$. The usual calculation will confirm that $\hat\theta = r/n$ is the MLE. □

Example 5 (Radioactive Decay). In this classic set of data Rutherford and Geiger counted the number of scintillations in 7.5-second intervals caused by radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations during 2608 such intervals:

Count     0    1    2    3    4    5    6    7
Observed  57   203  383  525  532  408  273  139

Count     8    9    10   11   12   13   14
Observed  45   27   10   4    0    1    1

The Poisson probability mass function with mean parameter $\lambda$ is
\[
f_X(x|\lambda) = \frac{\lambda^x\exp(-\lambda)}{x!}.
\]
The likelihood function equals
\[
l(\lambda) = \prod_i\frac{\lambda^{x_i}\exp(-\lambda)}{x_i!}
= \frac{\lambda^{\sum x_i}\exp(-n\lambda)}{\prod_i x_i!}.
\]
The relevant mathematical calculations are
\[
\begin{aligned}
\log l(\lambda) &= \Bigl(\sum x_i\Bigr)\log\lambda - n\lambda - \log\Bigl(\prod_i x_i!\Bigr) \\
S(\lambda) &= \frac{\sum x_i}{\lambda} - n, \qquad \hat\lambda = \frac{\sum x_i}{n} = \bar{x} \\
I(\lambda) &= \frac{\sum x_i}{\lambda^2} > 0,
\end{aligned}
\]
implying $\hat\lambda$ is the MLE. Also $E(\hat\lambda) = \frac{1}{n}\sum E(x_i) = \lambda$, so $\hat\lambda$ is an unbiased estimator. Next, $\mathrm{Var}(\hat\lambda) = \frac{1}{n^2}\mathrm{Var}(\sum x_i) = \lambda/n$. It is always useful to compare the fitted values from a model against the observed values.

i          0   1    2    3    4    5    6    7    8    9   10  11  12  13  14
O_i        57  203  383  525  532  408  273  139  45   27  10  4   0   1   1
E_i        54  211  407  525  508  393  254  140  68   29  11  4   1   0   0
O_i - E_i  +3  -8   -24  0    +24  +15  +19  -1   -23  -2  -1  0   -1  +1  +1

The Poisson law agrees with the observed variation to within about one-twentieth of its range. □
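The fitted row $E_i$ above can be reproduced in a few lines of code (SciPy assumed); $\hat\lambda$ is just the sample mean of the tabulated counts.

import numpy as np
from scipy.stats import poisson

# Rutherford-Geiger data as tabulated above.
counts = np.arange(15)
observed = np.array([57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1])

n = observed.sum()                        # 2608 intervals
lam_hat = (counts * observed).sum() / n   # MLE: the sample mean, roughly 3.87

expected = n * poisson.pmf(counts, lam_hat)
for i, o, e in zip(counts, observed, np.round(expected).astype(int)):
    print(i, o, e, o - e)                 # reproduces the O_i vs E_i comparison up to rounding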


Example 6 (Exponential distribution). Suppose random variables $X_1, \ldots, X_n$ are i.i.d. as $\mathrm{Exp}(\lambda)$. Then
\[
\begin{aligned}
l(\lambda) &= \prod_{i=1}^{n}\lambda\exp(-\lambda x_i) = \lambda^n\exp\Bigl(-\lambda\sum x_i\Bigr) \\
\log l(\lambda) &= n\log\lambda - \lambda\sum x_i \\
S(\lambda) &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i, \qquad \hat\lambda = \frac{n}{\sum x_i} \\
I(\lambda) &= \frac{n}{\lambda^2} > 0.
\end{aligned}
\]

Exercise 15. Demonstrate that the expectation and variance of $\hat\lambda$ are given as follows:
\[
E[\hat\lambda] = \frac{n}{n-1}\,\lambda, \qquad
\mathrm{Var}[\hat\lambda] = \frac{n^2}{(n-1)^2(n-2)}\,\lambda^2.
\]
Hint: find the probability distribution of $Z = \sum_{i=1}^{n} X_i$, where $X_i \sim \mathrm{Exp}(\lambda)$.

Exercise 16. Propose the alternative estimator $\tilde\lambda = \frac{n-1}{n}\hat\lambda$. Show that $\tilde\lambda$ is an unbiased estimator of $\lambda$ with the variance
\[
\mathrm{Var}[\tilde\lambda] = \frac{\lambda^2}{n-2}.
\]

As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased estimates. If it is thought that this property is (in some sense) desirable, then some adjustments to the MLEs, usually in the form of scaling, may be required.
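The bias described in Exercises 15 and 16 shows up clearly in simulation; the following sketch uses arbitrary illustrative values of $\lambda$ and $n$ and only NumPy.

import numpy as np

lam, n, reps = 2.0, 10, 200_000    # illustrative values
rng = np.random.default_rng(3)

x = rng.exponential(scale=1 / lam, size=(reps, n))
lam_mle = n / x.sum(axis=1)               # MLE from Example 6
lam_tilde = (n - 1) / n * lam_mle         # rescaled estimator from Exercise 16

print(lam_mle.mean(), n * lam / (n - 1))       # MLE is biased upward: E = n*lam/(n-1)
print(lam_tilde.mean(), lam)                   # rescaled estimator is (nearly) unbiased
print(lam_tilde.var(), lam**2 / (n - 2))       # variance close to lam^2/(n-2)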

Example 7 (Gaussian Distribution). Consider data $X_1, X_2, \ldots, X_n$ distributed as $N(\mu, \sigma^2)$. Then the likelihood function is
\[
l(\mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n}
\exp\left\{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right\}
\]


and the log-likelihood function is
\[
\log l(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2)
- \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2. \tag{3.2}
\]

Unknown mean and known variance. As $\sigma^2$ is known we treat this parameter as a constant when differentiating wrt $\mu$. Then
\[
S(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu), \qquad
\hat\mu = \frac{1}{n}\sum_{i=1}^{n}x_i, \qquad
I(\mu) = \frac{n}{\sigma^2} > 0.
\]
Also, $E[\hat\mu] = n\mu/n = \mu$, and so the MLE of $\mu$ is unbiased. Finally,
\[
\mathrm{Var}[\hat\mu] = \frac{1}{n^2}\mathrm{Var}\Bigl(\sum_{i=1}^{n}x_i\Bigr)
= \frac{\sigma^2}{n} = \bigl(E[I(\mu)]\bigr)^{-1}.
\]

Known mean and unknown variance. Differentiating (3.2) wrt $\sigma^2$ returns
\[
S(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2,
\]
and setting $S(\sigma^2) = 0$ implies
\[
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2.
\]
Differentiating again, and multiplying by $-1$, yields
\[
I(\sigma^2) = -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6}\sum_{i=1}^{n}(x_i-\mu)^2.
\]
Clearly $\hat\sigma^2$ is the MLE since
\[
I(\hat\sigma^2) = \frac{n}{2\hat\sigma^4} > 0.
\]
Define
\[
Z_i = \frac{X_i - \mu}{\sigma},
\]
so that $Z_i \sim N(0, 1)$. From the appendix on probability, $\sum_{i=1}^{n}Z_i^2 \sim \chi^2_n$, implying $E[\sum Z_i^2] = n$ and $\mathrm{Var}[\sum Z_i^2] = 2n$. The MLE can be written
\[
\hat\sigma^2 = \frac{\sigma^2}{n}\sum_{i=1}^{n}Z_i^2.
\]


Then
\[
E[\hat\sigma^2] = E\left[\frac{\sigma^2}{n}\sum_{i=1}^{n}Z_i^2\right] = \sigma^2,
\]
and
\[
\mathrm{Var}[\hat\sigma^2] = \left(\frac{\sigma^2}{n}\right)^2\mathrm{Var}\Bigl(\sum_{i=1}^{n}Z_i^2\Bigr)
= \frac{2\sigma^4}{n}.
\]
□

    Our treatment of the two parameters of the Gaussian distribution in the last example

    was to (i) fix the variance and estimate the mean using maximum likelihood; and then

    (ii) fix the mean and estimate the variance using maximum likelihood. In practice we

    would like to consider the simultaneous estimation of these parameters. In the next

    section of these notes we extend MLE to multiple parameter estimation.

    3.2 Multi-parameter Estimation

Suppose that a statistical model specifies that the data $y$ has a probability distribution $f(y; \theta, \phi)$ depending on two unknown parameters $\theta$ and $\phi$. In this case the likelihood function is a function of the two variables $\theta$ and $\phi$ and, having observed the value $y$, is defined as $l(\theta, \phi) = f(y; \theta, \phi)$, with log-likelihood $\log l(\theta, \phi)$. The MLE of $(\theta, \phi)$ is a value $(\hat\theta, \hat\phi)$ for which $l(\theta, \phi)$, or equivalently $\log l(\theta, \phi)$, attains its maximum value.

Define $S_1(\theta, \phi) = \partial\log l/\partial\theta$ and $S_2(\theta, \phi) = \partial\log l/\partial\phi$. The MLEs $(\hat\theta, \hat\phi)$ can be obtained by solving the pair of simultaneous equations
\[
S_1(\theta, \phi) = 0, \qquad S_2(\theta, \phi) = 0.
\]
Let us consider the matrix $I(\theta, \phi)$,
\[
I(\theta, \phi) =
\begin{pmatrix} I_{11}(\theta,\phi) & I_{12}(\theta,\phi) \\ I_{21}(\theta,\phi) & I_{22}(\theta,\phi) \end{pmatrix}
=
\begin{pmatrix}
-\frac{\partial^2}{\partial\theta^2}\log l & -\frac{\partial^2}{\partial\theta\,\partial\phi}\log l \\
-\frac{\partial^2}{\partial\phi\,\partial\theta}\log l & -\frac{\partial^2}{\partial\phi^2}\log l
\end{pmatrix}.
\]
The conditions for a value $(\theta_0, \phi_0)$ satisfying $S_1(\theta_0, \phi_0) = 0$ and $S_2(\theta_0, \phi_0) = 0$ to be a MLE are that
\[
I_{11}(\theta_0, \phi_0) > 0, \qquad I_{22}(\theta_0, \phi_0) > 0,
\]


and
\[
\det I(\theta_0, \phi_0) = I_{11}(\theta_0, \phi_0)I_{22}(\theta_0, \phi_0) - I_{12}(\theta_0, \phi_0)^2 > 0.
\]
This is equivalent to requiring that both eigenvalues of the matrix $I(\theta_0, \phi_0)$ be positive.

Example 8 (Gaussian distribution). Let $X_1, X_2, \ldots, X_n$ be iid observations from a $N(\mu, \sigma^2)$ density in which both $\mu$ and $\sigma^2$ are unknown. The log-likelihood is
\[
\begin{aligned}
\log l(\mu, \sigma^2) &= \sum_{i=1}^{n}\log\left\{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Bigl[-\frac{1}{2\sigma^2}(x_i-\mu)^2\Bigr]\right\} \\
&= \sum_{i=1}^{n}\left\{-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(x_i-\mu)^2\right\} \\
&= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.
\end{aligned}
\]
Hence, for $v = \sigma^2$,
\[
S_1(\mu, v) = \frac{\partial\log l}{\partial\mu} = \frac{1}{v}\sum_{i=1}^{n}(x_i-\mu) = 0,
\]
which implies that
\[
\hat\mu = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}. \tag{3.3}
\]
Also,
\[
S_2(\mu, v) = \frac{\partial\log l}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
\]
implies that
\[
\hat\sigma^2 = \hat{v} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2. \tag{3.4}
\]
Calculating second derivatives and multiplying by $-1$ gives that $I(\mu, v)$ equals
\[
I(\mu, v) =
\begin{pmatrix}
\frac{n}{v} & \frac{1}{v^2}\sum_{i=1}^{n}(x_i-\mu) \\
\frac{1}{v^2}\sum_{i=1}^{n}(x_i-\mu) & -\frac{n}{2v^2} + \frac{1}{v^3}\sum_{i=1}^{n}(x_i-\mu)^2
\end{pmatrix}.
\]
Hence $I(\hat\mu, \hat{v})$ is given by
\[
\begin{pmatrix}
\frac{n}{\hat{v}} & 0 \\
0 & \frac{n}{2\hat{v}^2}
\end{pmatrix}.
\]


Clearly both diagonal terms are positive and the determinant is positive, and so $(\hat\mu, \hat{v})$ are, indeed, the MLEs of $(\mu, v)$.
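As a numerical cross-check of (3.3) and (3.4), one can maximize the two-parameter log-likelihood directly; the sketch below uses simulated data with arbitrary true values and assumes SciPy is available.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # illustrative data
n = len(x)

# Negative log-likelihood in (mu, v), v = sigma^2, cf. equation (3.2).
def neg_loglik(par):
    mu, v = par
    return 0.5 * n * np.log(2 * np.pi * v) + np.sum((x - mu) ** 2) / (2 * v)

opt = minimize(neg_loglik, x0=np.array([0.0, 1.0]), bounds=[(None, None), (1e-8, None)])
print(opt.x)                                   # numerical MLEs (mu-hat, v-hat)
print(x.mean(), ((x - x.mean()) ** 2).mean())  # closed forms (3.3) and (3.4)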

Go back to equation (3.3): $\bar{X} \sim N(\mu, v/n)$. Clearly $E(\bar{X}) = \mu$ (unbiased) and $\mathrm{Var}(\bar{X}) = v/n$. Now go back to equation (3.4). From Lemma 1, which is proven below, we have
\[
\frac{n\hat{v}}{v} \sim \chi^2_{n-1},
\]
so that
\[
E\left(\frac{n\hat{v}}{v}\right) = n-1, \qquad
E(\hat{v}) = \frac{n-1}{n}\,v.
\]
Instead, propose the (unbiased) estimator of $\sigma^2$
\[
S^2 = \tilde{v} = \frac{n}{n-1}\,\hat{v} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2. \tag{3.5}
\]
Observe that
\[
E(\tilde{v}) = \frac{n}{n-1}E(\hat{v}) = \frac{n}{n-1}\cdot\frac{n-1}{n}\,v = v,
\]
and $\tilde{v}$ is unbiased as suggested. We can easily show that
\[
\mathrm{Var}(\tilde{v}) = \frac{2v^2}{n-1}.
\]

Lemma 1 (Joint distribution of the sample mean and sample variance). If $X_1, \ldots, X_n$ are iid $N(\mu, v)$ then the sample mean $\bar{X}$ and sample variance $S^2$ are independent. Also $\bar{X}$ is distributed $N(\mu, v/n)$ and $(n-1)S^2/v$ is a chi-squared random variable with $n-1$ degrees of freedom.

Proof. Define
\[
W = \sum_{i=1}^{n}(X_i-\bar{X})^2 = \sum_{i=1}^{n}(X_i-\mu)^2 - n(\bar{X}-\mu)^2,
\]
so that
\[
\frac{W}{v} + \frac{(\bar{X}-\mu)^2}{v/n} = \sum_{i=1}^{n}\frac{(X_i-\mu)^2}{v}.
\]


The RHS is the sum of $n$ independent standard normal random variables squared, and so is distributed $\chi^2_n$. Also, $\bar{X} \sim N(\mu, v/n)$, therefore $(\bar{X}-\mu)^2/(v/n)$ is the square of a standard normal and so is distributed $\chi^2_1$. These chi-squared random variables have moment generating functions $(1-2t)^{-n/2}$ and $(1-2t)^{-1/2}$ respectively. Next, $W/v$ and $(\bar{X}-\mu)^2/(v/n)$ are independent. Indeed,
\[
\begin{aligned}
\mathrm{Cov}(X_i-\bar{X}, \bar{X}) &= \mathrm{Cov}(X_i, \bar{X}) - \mathrm{Cov}(\bar{X}, \bar{X})
= \mathrm{Cov}\Bigl(X_i, \frac{1}{n}\sum_j X_j\Bigr) - \mathrm{Var}(\bar{X}) \\
&= \frac{1}{n}\sum_j\mathrm{Cov}(X_i, X_j) - \frac{v}{n}
= \frac{v}{n} - \frac{v}{n} = 0.
\end{aligned}
\]
But $\mathrm{Cov}(X_i-\bar{X}, \bar{X}-\mu) = \mathrm{Cov}(X_i-\bar{X}, \bar{X}) = 0$ for each $i$, and since the vector of deviations $(X_1-\bar{X}, \ldots, X_n-\bar{X})$ and $\bar{X}$ are jointly normal, these zero covariances imply that $W/v$, which is a function of the deviations, is independent of $(\bar{X}-\mu)^2/(v/n)$.

As the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we see that
\[
E\bigl[e^{t(W/v)}\bigr](1-2t)^{-1/2} = (1-2t)^{-n/2},
\qquad\text{so}\qquad
E\bigl[e^{t(W/v)}\bigr] = (1-2t)^{-(n-1)/2}.
\]
But $(1-2t)^{-(n-1)/2}$ is the moment generating function of a $\chi^2$ random variable with $(n-1)$ degrees of freedom, and the moment generating function uniquely characterizes the distribution of the random variable $W/v = (n-1)S^2/v$.
□

Suppose that a statistical model specifies that the data $x$ has a probability distribution $f(x; \boldsymbol\theta)$ depending on a vector of $m$ unknown parameters $\boldsymbol\theta = (\theta_1, \ldots, \theta_m)$. In this case the likelihood function is a function of the $m$ parameters $\theta_1, \ldots, \theta_m$ and, having observed the value of $x$, is defined as $l(\boldsymbol\theta) = f(x; \boldsymbol\theta)$, with log-likelihood $\log l(\boldsymbol\theta)$. The MLE of $\boldsymbol\theta$ is a value $\hat{\boldsymbol\theta}$ for which $l(\boldsymbol\theta)$, or equivalently $\log l(\boldsymbol\theta)$, attains its maximum value. For $r = 1, \ldots, m$ define $S_r(\boldsymbol\theta) = \partial\log l/\partial\theta_r$. Then we can (usually) find the MLE by solving the set of $m$ simultaneous equations $S_r(\boldsymbol\theta) = 0$ for $r = 1, \ldots, m$. The matrix $I(\boldsymbol\theta)$ is defined to be the $m \times m$ matrix whose $(r, s)$ element is given by $I_{rs} = -\partial^2\log l/\partial\theta_r\,\partial\theta_s$. The conditions for a value $\hat{\boldsymbol\theta}$ satisfying $S_r(\hat{\boldsymbol\theta}) = 0$ for $r = 1, \ldots, m$ to be a MLE are that all the eigenvalues of the matrix $I(\hat{\boldsymbol\theta})$ are positive.

    3.3 The Invariance Principle

How do we deal with parameter transformations? We will assume a one-to-one transformation, but the idea applies generally. Consider a binomial sample with $n = 10$ independent trials resulting in data $x = 8$ successes. The likelihood ratio of $\theta_1 = 0.8$ versus $\theta_2 = 0.3$ is
\[
\frac{l(\theta_1 = 0.8)}{l(\theta_2 = 0.3)}
= \frac{\theta_1^8(1-\theta_1)^2}{\theta_2^8(1-\theta_2)^2}
= 208.7,
\]
that is, given the data, $\theta = 0.8$ is about 200 times more likely than $\theta = 0.3$.

Suppose we are interested in expressing $\theta$ on the logit scale as
\[
\phi \equiv \log\{\theta/(1-\theta)\};
\]
then intuitively our relative information about $\phi_1 = \log(0.8/0.2) = 1.39$ versus $\phi_2 = \log(0.3/0.7) = -0.85$ should be
\[
\frac{L(\phi_1)}{L(\phi_2)} = \frac{l(\theta_1)}{l(\theta_2)} = 208.7.
\]
That is, our information should be invariant to the choice of parameterization. (For the purposes of this example we are not too concerned about how to calculate $L(\phi)$.)

Theorem 3.3.1 (Invariance of the MLE). If $g$ is a one-to-one function, and $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$.

Proof. This is trivially true: if we let $\phi = g(\theta)$, then $f\{y|g^{-1}(\phi)\}$ is maximized in $\phi$ exactly when $\phi = g(\hat\theta)$. When $g$ is not one-to-one the discussion becomes more subtle, but we simply choose to define the MLE of $g(\theta)$ to be $g(\hat\theta)$.
□

It seems intuitive that if $\hat\theta$ is most likely for $\theta$ and our knowledge (data) remains unchanged, then $g(\hat\theta)$ is most likely for $g(\theta)$. In fact, we would find it strange if $\hat\theta$ is an estimate of $\theta$, but $\hat\theta^2$ is not an estimate of $\theta^2$. In the binomial example with $n = 10$ and $x = 8$ we get $\hat\theta = 0.8$, so the MLE of $g(\theta) = \theta/(1-\theta)$ is
\[
g(\hat\theta) = \hat\theta/(1-\hat\theta) = 0.8/0.2 = 4.
\]
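The invariance property can also be seen numerically: maximizing the binomial likelihood over the logit $\phi = \log\{\theta/(1-\theta)\}$ returns the logit of $\hat\theta$. The sketch below uses the $n = 10$, $x = 8$ example and assumes SciPy is available.

import numpy as np
from scipy.optimize import minimize_scalar

n, x = 10, 8

# Negative binomial log-likelihood as a function of phi = log(theta/(1-theta)).
def neg_loglik_logit(phi):
    theta = 1 / (1 + np.exp(-phi))
    return -(x * np.log(theta) + (n - x) * np.log(1 - theta))

phi_hat = minimize_scalar(neg_loglik_logit, bounds=(-10, 10), method="bounded").x
print(phi_hat, np.log(0.8 / 0.2))   # both approximately 1.39, the logit of theta-hat = 0.8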


    Chapter 4

    Estimation

In the previous chapter we saw an approach to estimation that is based on the likelihood of the observed results. Next we study the general theory of estimation, which is used to compare different estimators and to decide on the most efficient one.

    4.1 General properties of estimators

Suppose that we are going to observe a value of a random vector $X$. Let $\mathcal{X}$ denote the set of possible values $X$ can take and, for $x \in \mathcal{X}$, let $f(x|\theta)$ denote the probability that $X$ takes the value $x$, where the parameter $\theta$ is some unknown element of the set $\Theta$.

The problem we face is that of estimating $\theta$. An estimator $\hat\theta$ is a procedure which for each possible value $x \in \mathcal{X}$ specifies which element of $\Theta$ we should quote as an estimate of $\theta$. When we observe $X = x$ we quote $\hat\theta(x)$ as our estimate of $\theta$. Thus $\hat\theta$ is a function of the random vector $X$. Sometimes we write $\hat\theta(X)$ to emphasise this point.

Given any estimator $\hat\theta$ we can calculate its expected value for each possible value of $\theta$. As we have already mentioned when discussing maximum likelihood estimation, an estimator is said to be unbiased if this expected value is identically equal to $\theta$. If an estimator is unbiased then we can conclude that if we repeat the experiment an infinite number of times with $\theta$ fixed and calculate the value of the estimator each time, then the average of the estimator values will be exactly equal to $\theta$. To evaluate the usefulness of an estimator $\hat\theta = \hat\theta(x)$ of $\theta$, we examine the properties of the random variable $\hat\theta = \hat\theta(X)$.

Definition 1 (Unbiased estimators). An estimator $\hat\theta = \hat\theta(X)$ is said to be unbiased for a parameter $\theta$ if it equals $\theta$ in expectation:
\[
E[\hat\theta(X)] = E(\hat\theta) = \theta.
\]
Intuitively, an unbiased estimator is "right on target". □

Definition 2 (Bias of an estimator). The bias of an estimator $\hat\theta = \hat\theta(X)$ of $\theta$ is defined as $\mathrm{bias}(\hat\theta) = E[\hat\theta(X) - \theta]$. □

Note that even if $\hat\theta$ is an unbiased estimator of $\theta$, $g(\hat\theta)$ will generally not be an unbiased estimator of $g(\theta)$ unless $g$ is linear or affine. This limits the importance of the notion of unbiasedness. It might be at least as important that an estimator is accurate in the sense that its distribution is highly concentrated around $\theta$.

Exercise 17. Show that for an arbitrary distribution the estimator $S^2$ as defined in (3.5) is an unbiased estimator of the variance of this distribution.

Exercise 18. Consider the estimator $S^2$ of the variance $\sigma^2$ in the case of the normal distribution. Demonstrate that although $S^2$ is an unbiased estimator of $\sigma^2$, $S$ is not an unbiased estimator of $\sigma$. Compute its bias.

Definition 3 (Mean squared error). The mean squared error of the estimator $\hat\theta$ is defined as $\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2$. Given the same set of data, $\hat\theta_1$ is better than $\hat\theta_2$ if $\mathrm{MSE}(\hat\theta_1) \le \mathrm{MSE}(\hat\theta_2)$ (uniformly better if this holds for every value of $\theta$). □

Lemma 2 (The MSE variance-bias tradeoff). The MSE decomposes as
\[
\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2.
\]


Proof. We have
\[
\begin{aligned}
\mathrm{MSE}(\hat\theta) &= E(\hat\theta - \theta)^2 \\
&= E\bigl\{[\hat\theta - E(\hat\theta)] + [E(\hat\theta) - \theta]\bigr\}^2 \\
&= E[\hat\theta - E(\hat\theta)]^2 + E[E(\hat\theta) - \theta]^2
+ 2\underbrace{E\bigl\{[\hat\theta - E(\hat\theta)][E(\hat\theta) - \theta]\bigr\}}_{=0} \\
&= E[\hat\theta - E(\hat\theta)]^2 + [E(\hat\theta) - \theta]^2 \\
&= \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2.
\end{aligned}
\]
□

NOTE: This lemma implies that the mean squared error of an unbiased estimator is equal to the variance of the estimator.
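Lemma 2 is easy to illustrate by simulation. Below, a deliberately biased estimator of a normal mean (an arbitrary shrinkage-type choice, used purely for illustration) has its MSE computed directly and via the variance-plus-squared-bias decomposition; only NumPy is needed.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.0, 2.0, 20, 200_000   # illustrative values

x = rng.normal(mu, sigma, size=(reps, n))
est = 0.9 * x.mean(axis=1)                   # a deliberately biased estimator of mu

mse = np.mean((est - mu) ** 2)
var = est.var()
bias2 = (est.mean() - mu) ** 2
print(mse, var + bias2)                      # the two agree up to Monte Carlo error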

Exercise 19. Consider $X_1, \ldots, X_n$ where $X_i \sim N(\mu, \sigma^2)$ and $\sigma$ is known. Three estimators of $\mu$ are $\hat\mu_1 = \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i$, $\hat\mu_2 = X_1$, and $\hat\mu_3 = (X_1 + \bar{X})/2$. Discuss their properties; which one would you recommend, and why?

Example 9. Consider $X_1, \ldots, X_n$ to be independent random variables with means $E(X_i) = \mu$ and variances $\mathrm{Var}(X_i) = \sigma_i^2$. Consider pooling the estimators of $\mu$ into a common estimator using the linear combination
\[
\hat\mu = w_1X_1 + w_2X_2 + \cdots + w_nX_n.
\]
We will see that the following is true.

(i) The estimator $\hat\mu$ is unbiased if and only if $\sum_i w_i = 1$.

(ii) The estimator $\hat\mu$ has minimum variance among this class of estimators when the weights are inversely proportional to the variances $\sigma_i^2$.

(iii) The variance of $\hat\mu$ for optimal weights $w_i$ is $\mathrm{Var}(\hat\mu) = 1/\sum_i\sigma_i^{-2}$.

Indeed, we have $E(\hat\mu) = E(w_1X_1 + \cdots + w_nX_n) = \sum_i w_iE(X_i) = \sum_i w_i\mu = \mu\sum_i w_i$, so $\hat\mu$ is unbiased if and only if $\sum_i w_i = 1$. The variance of our estimator is $\mathrm{Var}(\hat\mu) = \sum_i w_i^2\sigma_i^2$, which should be minimized subject to the constraint $\sum_i w_i = 1$. Differentiating the Lagrangian $L = \sum_i w_i^2\sigma_i^2 - \lambda(\sum_i w_i - 1)$ with respect to $w_i$ and setting equal to zero yields $2w_i\sigma_i^2 = \lambda$, so that $w_i \propto \sigma_i^{-2}$ and, after imposing the constraint, $w_i = \sigma_i^{-2}/(\sum_j\sigma_j^{-2})$. Then, for optimal weights, we get $\mathrm{Var}(\hat\mu) = \sum_i w_i^2\sigma_i^2 = (\sum_i\sigma_i^{-4}\sigma_i^2)/(\sum_i\sigma_i^{-2})^2 = 1/(\sum_i\sigma_i^{-2})$.

Assume now that instead of $X_i$ we observe the biased variables $\tilde{X}_i = X_i + \delta$ for some $\delta \neq 0$. When $\sigma_i^2 = \sigma^2$ we have that $\mathrm{Var}(\hat\mu) = \sigma^2/n$, which tends to zero as $n \to \infty$, whereas $\mathrm{bias}(\hat\mu) = \delta$ and $\mathrm{MSE}(\hat\mu) = \sigma^2/n + \delta^2$. Thus in the general case, when bias is present it tends to dominate the variance as $n$ gets larger, which is very unfortunate.
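Points (i)-(iii) can be verified numerically; the sketch below (arbitrary illustrative variances, NumPy only) compares the optimally weighted combination with the equally weighted average.

import numpy as np

rng = np.random.default_rng(6)
mu = 3.0
sigmas = np.array([0.5, 1.0, 2.0, 4.0])      # illustrative standard deviations
reps = 200_000

x = mu + sigmas * rng.standard_normal((reps, sigmas.size))

w_opt = (1 / sigmas**2) / np.sum(1 / sigmas**2)   # weights proportional to 1/sigma_i^2
opt_est = x @ w_opt
eq_est = x.mean(axis=1)

print(opt_est.mean())                             # approximately mu: unbiased
print(opt_est.var(), 1 / np.sum(1 / sigmas**2))   # matches 1 / sum(sigma_i^{-2})
print(eq_est.var())                               # strictly larger than the optimal variance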

Exercise 20. Let $X_1, \ldots, X_n$ be an independent sample of size $n$ from the uniform distribution on the interval $(0, \theta)$, with density for a single observation $f(x|\theta) = 1/\theta$ for $0 < x < \theta$ and 0 otherwise, and consider $\theta > 0$ unknown.

(i) Find the expected value and variance of the estimator $\hat\theta = 2\bar{X}$.

(ii) Find the expected value of the estimator $\tilde\theta = X_{(n)}$, i.e. the largest observation.

(iii) Find an unbiased estimator of the form $\check\theta = cX_{(n)}$ and calculate its variance.

(iv) Compare the mean square errors of $\hat\theta$ and $\check\theta$.

4.2 Minimum-Variance Unbiased Estimation

Getting a small MSE often involves a tradeoff between variance and bias. For unbiased estimators, the MSE obviously equals the variance, $\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta)$, so no tradeoff can be made. One approach is to restrict ourselves to the subclass of estimators that are unbiased and minimum variance.

Definition 4 (Minimum-variance unbiased estimator). If an unbiased estimator of $g(\theta)$ has minimum variance among all unbiased estimators of $g(\theta)$ it is called a minimum variance unbiased estimator (MVUE). □

We will develop a method of finding the MVUE when it exists. When such an estimator does not exist we will be able to find a lower bound for the variance of an unbiased estimator in the class of unbiased estimators, and compare the variance of our unbiased estimator with this lower bound.
