
LECTURE NOTES ON STATISTICAL INFERENCE

KRZYSZTOF PODGÓRSKI

    Department of Mathematics and Statistics

    University of Limerick, Ireland

    November 23, 2009


    Contents

    1 Introduction 4

    1.1 Models of Randomness and Statistical Inference . . . . . . . . . . . . 4

    1.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    1.2.1 Probability vs. likelihood . . . . . . . . . . . . . . . . . . . . 8

    1.2.2 More data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    1.3 Likelihood and theory of statistics . . . . . . . . . . . . . . . . . . . 15

    1.4 Computationally intensive methods of statistics . . . . . . . . . . . . 15

1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples . . . . . . 16

1.4.2 Bootstrap: performing statistical inference using computers . . . 18

    2 Review of Probability 21

    2.1 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . 21

    2.2 Distribution of a Function of a Random Variable . . . . . . . . . . . . 22

2.3 Transforms Method: Characteristic, Probability Generating and Moment Generating Functions . . . . . . 24

    2.4 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.4.1 Sums of Independent Random Variables . . . . . . . . . . . . 26

    2.4.2 Covariance and Correlation . . . . . . . . . . . . . . . . . . 27

    2.4.3 The Bivariate Change of Variables Formula . . . . . . . . . . 28

    2.5 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . 29

    2.5.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . 29


    2.5.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . 29

2.5.3 Negative Binomial and Geometric Distribution . . . . . . . . . . 30

2.5.4 Hypergeometric Distribution . . . . . . . . . . . . . . . . . . 31

    2.5.5 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . 32

    2.5.6 Discrete Uniform Distribution . . . . . . . . . . . . . . . . . 33

    2.5.7 The Multinomial Distribution . . . . . . . . . . . . . . . . . 33

    2.6 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . 34

    2.6.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . 34

    2.6.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . 35

    2.6.3 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . 35

    2.6.4 Gaussian (Normal) Distribution . . . . . . . . . . . . . . . . 36

    2.6.5 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . 38

    2.6.6 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . 38

    2.6.7 Chi-square Distribution . . . . . . . . . . . . . . . . . . . . . 39

    2.6.8 The Bivariate Normal Distribution . . . . . . . . . . . . . . . 39

    2.6.9 The Multivariate Normal Distribution . . . . . . . . . . . . . 40

2.7 Distributions: further properties . . . . . . . . . . . . . . . . . . . 42

2.7.1 Sum of Independent Random Variables: special cases . . . . 42

2.7.2 Common Distributions: Summarizing Tables . . . . . . . . 45

    3 Likelihood 48

    3.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 48

    3.2 Multi-parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . 55

    3.3 The Invariance Principle . . . . . . . . . . . . . . . . . . . . . . . . 59

    4 Estimation 61

    4.1 General properties of estimators . . . . . . . . . . . . . . . . . . . . 61

    4.2 Minimum-Variance Unbiased Estimation . . . . . . . . . . . . . . . . 64

    4.3 Optimality Properties of the MLE . . . . . . . . . . . . . . . . . . . 69

    5 The Theory of Confidence Intervals 71

    5.1 Exact Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 71


    5.2 Pivotal Quantities for Use with Normal Data . . . . . . . . . . . . . . 75

    5.3 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . 80

    6 The Theory of Hypothesis Testing 87

    6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

    6.2 Hypothesis Testing for Normal Data . . . . . . . . . . . . . . . . . . 92

    6.3 Generally Applicable Test Procedures . . . . . . . . . . . . . . . . . 97

    6.4 The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . 101

    6.5 Goodness of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.6 The χ² Test for Contingency Tables . . . . . . . . . . . . . . . . . 109


    Chapter 1

    Introduction

    Everything existing in the universe is the fruit of chance.

    Democritus, the 5th Century BC

    1.1 Models of Randomness and Statistical Inference

Statistics is a discipline that provides a methodology for making inferences from real random data about the parameters of the probabilistic models that are believed to generate such data. The position of statistics in relation to real-world data and the corresponding mathematical models of probability theory is presented in the following diagram.

The following is a short list of the many phenomena to which randomness is attributed.

    Games of chance

    Tossing a coin

    Rolling a die

    Playing Poker

    Natural Sciences


[Diagram, Figure 1.1: the Real World (Random Phenomena, Data Samples, Prediction and Discovery) is linked by Statistics to Science & Mathematics (Probability Theory, Models, Statistical Inference).]

Figure 1.1: Position of statistics in the context of real world phenomena and mathematical models representing them.


Physics (notably Quantum Physics)

    Genetics

    Climate

    Engineering

    Risk and safety analysis

    Ocean engineering

    Economics and Social Sciences

    Currency exchange rates

Stock market fluctuations

    Insurance claims

    Polls and election results

    etc.

    1.2 Motivating Example

    Let X denote the number of particles that will be emitted from a radioactive source

    in the next one minute period. We know that X will turn out to be equal to one of

    the non-negative integers but, apart from that, we know nothing about which of the

    possible values are more or less likely to occur. The quantity X is said to be a random

    variable.

Suppose we are told that the random variable X has a Poisson distribution with parameter θ = 2. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x) = θ^x exp(−θ)/x!,

where θ = 2. So, for instance, the probability that X takes the value x = 4 is

P(X = 4) = 2^4 exp(−2)/4! = 0.0902.
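For illustration, this probability can be checked numerically; a minimal R sketch using the built-in Poisson probability mass function (nothing beyond the model just described is assumed):

    dpois(4, lambda = 2)   # theta^x * exp(-theta) / x! with theta = 2, x = 4
    # [1] 0.09022352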


We have here a probability model for the random variable X. Note that we are using upper case letters for random variables and lower case letters for the values taken by random variables. We shall persist with this convention throughout the course.

Let us now assume that the random variable X has a Poisson distribution with parameter θ, where θ is some unspecified positive number. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x | θ) = θ^x exp(−θ)/x!,    (1.1)

for θ ∈ R+. However, we cannot calculate probabilities such as the probability that X takes the value x = 4 without knowing the value of θ.

Suppose that, in order to learn something about the value of θ, we decide to measure the value of X for each of the next 5 one-minute time periods. Let us use the notation X1 to denote the number of particles emitted in the first period, X2 to denote the number emitted in the second period, and so forth. We shall end up with data consisting of a random vector X = (X1, X2, . . . , X5). Consider x = (x1, x2, x3, x4, x5) = (2, 1, 0, 3, 4). Then x is a possible value for the random vector X. We know that the probability that X1 takes the value x1 = 2 is given by the formula

P(X1 = 2 | θ) = θ^2 exp(−θ)/2!

and similarly that the probability that X2 takes the value x2 = 1 is given by

P(X2 = 1 | θ) = θ exp(−θ)/1!

and so on. However, what about the probability that X takes the value x? In order for this probability to be specified we need to know something about the joint distribution of the random variables X1, X2, . . . , X5. A simple assumption to make is that the random variables X1, X2, . . . , X5 are mutually independent. (Note that this assumption may not be correct, since X2 may tend to be more similar to X1 than it would be to X5.) However, with this assumption we can say that the probability that X takes the value x is given by

P(X = x | θ) = ∏_{i=1}^{5} θ^{x_i} exp(−θ)/x_i!
             = (θ^2 exp(−θ)/2!) (θ^1 exp(−θ)/1!) (θ^0 exp(−θ)/0!) (θ^3 exp(−θ)/3!) (θ^4 exp(−θ)/4!)
             = θ^{10} exp(−5θ)/288.

In general, if x = (x1, x2, x3, x4, x5) is any vector of 5 non-negative integers, then the probability that X takes the value x is given by

P(X = x | θ) = ∏_{i=1}^{5} θ^{x_i} exp(−θ)/x_i!
             = θ^{Σ_{i=1}^{5} x_i} exp(−5θ) / ∏_{i=1}^{5} x_i!.

    We have here a probability model for the random vector X.
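As a small numerical illustration, the joint probability above can be evaluated in R for an assumed parameter value, say θ = 2 (the value 2 is only an illustration, not part of the model):

    x <- c(2, 1, 0, 3, 4)            # the observed counts in the five periods
    theta <- 2                        # an assumed value of the parameter
    prod(dpois(x, lambda = theta))    # theta^10 * exp(-5 * theta) / 288
    # [1] 0.000161422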

Our plan is to use the value x of X that we actually observe to learn something about the value of θ. The ways and means to accomplish this task make up the subject

    matter of this course. The central tool for various statistical inference techniques is

    the likelihood method. Below we present a simple introduction to it using the Poisson

    model for radioactive decay.

    1.2.1 Probability vs. likelihood

In the Poisson model introduced above, for a given θ, say θ = 2, we can consider the function p(x) giving the probabilities of observing the values x = 0, 1, 2, . . . . This function is referred to as the probability mass function. Its graph is presented below.

Such a function can be used when betting on the recorded result of a future experiment. If one wants to optimize the chances of correctly predicting the future outcome, the best choice for the number of recorded particles would be either 1 or 2.

So far, we have been told that the random variable X has a Poisson distribution with parameter θ, where θ is some positive number, and there are physical reasons to assume that such a model is correct. However, we have arbitrarily set θ = 2, and this is more questionable. How can we know that it is a correct value of the parameter? Let us analyze this issue in detail.

Figure 1.2: Probability mass function for the Poisson model with θ = 2.

If x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x | θ) = θ^x e^{−θ}/x!,

for θ > 0. But without knowing the true value of θ, we cannot calculate probabilities such as the probability that X takes the value x = 1.

Suppose that, in order to learn something about the value of θ, an experiment is performed and a value of X = 5 is recorded. Let us take a look at the probability mass function for θ = 2 in Figure 1.2. What is the probability of X taking the value 5? Do we like what we see? Why? Would you bet on 1 or 2 in the next experiment?

We certainly have some serious doubt about our choice of θ = 2, which was arbitrary anyway. One can consider, for example, θ = 7 as an alternative to θ = 2. Figure 1.3 shows the graphs of the pmf for the two cases. Which of the two choices do we like? Since it was more probable to get X = 5 under the assumption θ = 7 than when θ = 2, we say that θ = 7 is more likely to produce X = 5 than θ = 2. Based on this observation we can develop a general strategy for choosing θ.

Figure 1.3: The probability mass function for the Poisson model with θ = 2 vs. the one with θ = 7.

Let us summarize our position. So far we know (or assume) that the radioactive emission follows a Poisson model with some unknown θ > 0, and the value x = 5 has been observed once. Our goal is to utilize this knowledge somehow. First, we note that the Poisson model is in fact a function not only of x but also of θ:

p(x | θ) = θ^x e^{−θ}/x!.

Let us plug in the observed x = 5, so that we get a function of θ that is called the likelihood function:

l(θ) = θ^5 e^{−θ}/120.

Its graph is presented in the next figure. Can you locate on this graph the values of the probabilities that were used to choose θ = 7 over θ = 2? What value of θ appears to be the most preferable if the same argument is extended to all possible values of θ? We observe that the value θ = 5 is most likely to produce the value x = 5. As a result of our likelihood approach we have used the data x = 5 and the Poisson model to make an inference: an example of statistical inference.
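The likelihood curve of Figure 1.4 and its maximizer can be reproduced with a few lines of R (the grid of θ values below is an arbitrary choice):

    theta <- seq(0.01, 15, by = 0.01)    # grid of candidate parameter values
    lik   <- dpois(5, lambda = theta)    # l(theta) = theta^5 * exp(-theta) / 120
    plot(theta, lik, type = "l", xlab = "theta", ylab = "Likelihood")
    theta[which.max(lik)]                # the most likely value
    # [1] 5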


    Figure 1.4: Likelihood function for the Poisson model when the observed value is

    x = 5.

Likelihood: the Poisson model backward

The Poisson model can be stated as a probability mass function that maps possible values x into probabilities p(x) or, if we emphasize the dependence on θ, into p(x | θ), as given below:

p(x | θ) = l(θ | x) = θ^x e^{−θ}/x!.

With the Poisson model for a given θ, one can compute the probabilities with which various possible numbers x of emitted particles will be recorded, i.e. we consider the mapping

x ↦ p(x | θ)

with θ fixed. We get the answer to how probable the various outcomes x are.

With the Poisson model where x is observed and thus fixed, one can evaluate how likely it would be to get x under various values of θ, i.e. we consider the mapping

θ ↦ l(θ | x)

with x fixed. We get the answer to how likely the various θ could have produced the observed x.

Exercise 1. For the general Poisson model

p(x | θ) = l(θ | x) = θ^x e^{−θ}/x!,

1. for a given θ > 0, find the most probable value of the observation x;
2. for a given observation x, find the most likely value of θ.

Give a mathematical argument for your claims.

    1.2.2 More data

    Suppose that we perform another measurement of the number of emitted particles. Let

    us use the notation X1 to denote the number of particles emitted in the first period, X2

    to denote the number emitted in the second period. We shall end up with data consisting

    of a random vector X = (X1, X2). The second measurement yielded x2 = 2, so that

    x = (x1, x2) = (5, 2).

We know that the probability that X1 takes the value x1 = 5 is given by the formula

P(X1 = 5 | θ) = θ^5 e^{−θ}/5!

and similarly that the probability that X2 takes the value x2 = 2 is given by

P(X2 = 2 | θ) = θ^2 e^{−θ}/2!.

    However, what about the probability that X takes the value x = (5, 2)? In order for

    this probability to be specified we need to know something about the joint distribution

    of the random variables X1, X2. A simple assumption to make is that the random

    variables X1, X2 are mutually independent. In such a case the probability that X takes

the value x = (x1, x2) is given by

P(X = (x1, x2) | θ) = (θ^{x1} e^{−θ}/x1!) (θ^{x2} e^{−θ}/x2!) = e^{−2θ} θ^{x1+x2} / (x1! x2!).

After a little algebra we easily find the likelihood function of observing X = (5, 2) to be

l(θ | (5, 2)) = e^{−2θ} θ^7 / 240,

and its graph is presented in Figure 1.5 in comparison with the previous likelihood for a single observation.

Figure 1.5: Likelihood of observing (5, 2) (top) vs. the one of observing 5 (bottom).

Two important effects of adding the extra information should be noted:

We observe that the location of the maximum shifted from 5 to 3.5 compared to the single observation.
We also note that the range of likely values for θ has diminished.

Let us suppose that eventually we decide to measure three more values of X. Let us use the vector notation X = (X1, X2, . . . , X5) to denote the observable random vector. Assume that the three extra measurements yielded 3, 7, 7, so that we have x = (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7). Under the assumption of independence the probability that X takes the value x is given by

P(X = x | θ) = ∏_{i=1}^{5} θ^{x_i} e^{−θ} / x_i!.

The likelihood function of observing X = (5, 2, 3, 7, 7) under independence can easily be derived to be

l(θ | (5, 2, 3, 7, 7)) = θ^{24} e^{−5θ} / (5! 2! 3! 7! 7!).

In general, if x = (x1, . . . , xn) is any vector of n non-negative integers, then the likelihood is given by

l(θ | (x1, . . . , xn)) = θ^{Σ_{i=1}^{n} x_i} e^{−nθ} / ∏_{i=1}^{n} x_i!.

The value of θ that maximizes this likelihood is called the maximum likelihood estimator of θ.

In order to find the value that maximizes the likelihood, the methods of calculus can be applied. We note that in our example we deal with only one variable, so the computation of the derivative is rather straightforward.
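For orientation only (the full derivation is the subject of Exercise 2 below), taking logarithms turns the product into a sum, which is what makes the calculus straightforward:

\[
\log l(\theta \mid x_1,\ldots,x_n) \;=\; \Big(\sum_{i=1}^{n} x_i\Big)\log\theta \;-\; n\theta \;-\; \sum_{i=1}^{n}\log x_i! .
\]

Differentiating this expression with respect to θ and setting the derivative equal to zero yields the estimator asked for in the exercise.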

Exercise 2. For the general case of the likelihood based on the Poisson model,

l(θ | x1, . . . , xn) = θ^{Σ_{i=1}^{n} x_i} e^{−nθ} / ∏_{i=1}^{n} x_i!,

use the methods of calculus to derive a general formula for the maximum likelihood estimator of θ. Using the result, find the estimate for (x1, x2, x3, x4, x5) = (5, 2, 3, 7, 7).

Exercise 3. It is generally believed that the time X that passes until half of the original radioactive material remains follows an exponential distribution with density f(x | θ) = θ e^{−θx}, x > 0. For beryllium-11, five experiments have been performed and the values 13.21, 13.12, 13.95, 13.54, 13.88 seconds have been obtained. Find and plot the likelihood function for θ and, based on this, determine the most likely θ.

    1.3 Likelihood and theory of statistics

The strategy of making statistical inference based on the likelihood function as described above is a recurrent theme in mathematical statistics and thus in our lectures. Using mathematical arguments we will compare various strategies for inferring about the parameters, and often we will demonstrate that the likelihood-based methods are optimal. The likelihood will also show its strength as a criterion for deciding between various claims about parameters of the model, which is the leading story of hypothesis testing.

In modern times, the role of computers in statistical methodology has increased. New computationally intensive methods of data exploration have become one of the central areas of modern statistics. Even there, methods that refer to the likelihood play dominant roles, in particular in Bayesian methodology.

Despite this extensive penetration of statistical methodology by likelihood techniques, statistics can by no means be reduced to the analysis of likelihood. In every area of statistics there are important aspects that require reaching beyond the likelihood; in many cases the likelihood is not even a focus of study and development. The purpose of this course is to present the importance of the likelihood approach across statistics, but also to present topics for which the likelihood plays a secondary role, if any.

    1.4 Computationally intensive methods of statistics

The second part of our presentation of modern statistical inference is devoted to computationally intensive statistical methods. The area of data exploration is rapidly growing in importance due to

common access to inexpensive but advanced computing tools,
the emergence of new challenges associated with massive, highly dimensional data far exceeding the traditional assumptions on which classical methods of statistics have been based.

    In this introduction we give two examples that illustrate the power of modern computers

    and computing software both in analysis of statistical models and in performing actual


statistical inference. We start by analyzing the performance of a statistical procedure using random sample generation.

1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples

Randomness can be used to study properties of a mathematical model. The model itself may be probabilistic or not, but here we focus on probabilistic ones. Essentially, the approach is based on repeated simulation of random samples corresponding to the model and observation of the behavior of the objects of interest. An example of the Monte Carlo method is approximating the area of a circle by tossing points at random (typically computer generated) onto the paper on which the circle is drawn. The percentage of points that fall inside the circle represents (approximately) the percentage of the area covered by the circle, as illustrated in Figure 1.6.
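A minimal R sketch of this circle experiment; the sample size of 10000 matches the caption of Figure 1.6, while the seed is an arbitrary choice:

    set.seed(1)
    n <- 10000
    x <- runif(n, -1, 1)               # random points in the square [-1, 1] x [-1, 1]
    y <- runif(n, -1, 1)
    inside <- x^2 + y^2 <= 1           # TRUE for points falling inside the unit circle
    4 * mean(inside)                   # fraction inside times the area of the square
    # close to pi = 3.141593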

Exercise 4. Write R code that would explore the area of an ellipsoid using the Monte Carlo method.

Below we present an application of the Monte Carlo approach to studying fitting methods for the Poisson model.

    Deciding for Poisson model

Recall that the Poisson model is given by

P(X = x | θ) = θ^x e^{−θ} / x!.

It is relatively easy to demonstrate that the mean value of this distribution is equal to θ and the variance is also equal to θ.

Exercise 5. Present a formal argument showing that for a Poisson random variable X with parameter θ, EX = θ and VarX = θ.

Figure 1.6: Monte Carlo study of the circle area approximation; for a sample size of 10000 the approximation is 3.1248, which compares to the true value of π = 3.141593.

Thus for a sample of observations x = (x1, . . . , xn) it is reasonable to consider both

θ̂1 = x̄   and   θ̂2 = (1/n) Σ_{i=1}^{n} x_i^2 − x̄^2

as estimators of θ. We want to employ the Monte Carlo method to decide which one is better. In the process we generate many samples from the Poisson distribution and check which of the estimators performs better. The resulting histograms of the values of the estimators are presented in Figure 1.7. It is quite clear from the graphs that the estimator based on the mean is better than the one based on the variance.

Figure 1.7: Monte Carlo results of comparing estimation of θ = 4 by the sample mean (left) vs. estimation using the sample variance (right).
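A sketch of such a Monte Carlo comparison in R; the true value θ = 4 is taken from the caption of Figure 1.7, while the sample size per replication (50) and the number of replications (1000) are illustrative assumptions:

    set.seed(1)
    theta <- 4; n <- 50; B <- 1000
    means <- numeric(B); vars <- numeric(B)
    for (b in 1:B) {
      x <- rpois(n, lambda = theta)
      means[b] <- mean(x)                  # estimator based on the sample mean
      vars[b]  <- mean(x^2) - mean(x)^2    # estimator based on the sample variance
    }
    par(mfrow = c(1, 2))
    hist(means, main = "Histogram of means")
    hist(vars,  main = "Histogram of vars")  # visibly more spread out around theta = 4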

1.4.2 Bootstrap: performing statistical inference using computers

Bootstrap (resampling) methods are one example of Monte Carlo based statistical analysis. The methodology can be summarized as follows:

Collect a statistical sample, i.e. the same type of data as in classical statistics.
Use a properly chosen Monte Carlo based resampling scheme on the data (using a random number generator) to create so-called bootstrap samples.
Analyze the bootstrap samples to draw conclusions about the random mechanism that produced the original statistical data.

    This way randomness is used to analyze statistical samples that, by the way, are also a

    result of randomness. An example illustrating the approach is presented next.

    Estimating nitrate ion concentration

Nitrate ion concentration measurements in a certain chemical lab have been collected and the results are given in the following table. The goal is to estimate, based on

    0.51 0.51 0.51 0.50 0.51 0.49 0.52 0.53 0.50 0.47

    0.51 0.52 0.53 0.48 0.49 0.50 0.52 0.49 0.49 0.50

    0.49 0.48 0.46 0.49 0.49 0.48 0.49 0.49 0.51 0.47

    0.51 0.51 0.51 0.48 0.50 0.47 0.50 0.51 0.49 0.48

    0.51 0.50 0.50 0.53 0.52 0.52 0.50 0.50 0.51 0.51

Table 1.1: Results of 50 determinations of nitrate ion concentration in µg per ml.

these values, the actual nitrate ion concentration. The overall mean of all observations is 0.4998. It is natural to ask what the error of this determination of the nitrate concentration is. If we were to repeat our experiment of collecting 50 nitrate concentration measurements many times, we would see the range of error that is made. However, that would be a waste of resources and not a viable method at all. Instead we resample new data from our data and use the samples so obtained to assess the error, comparing the resulting means (bootstrap means) with the original one. The differences between these represent the bootstrap estimation errors; their distribution is viewed as a good representation of the distribution of the true error. In Figure 1.8 we see the bootstrap counterpart of the distribution of the estimation error.
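A sketch of this resampling in R, using the 50 values of Table 1.1; the number of bootstrap samples (1000) is an illustrative assumption:

    nitrate <- c(0.51, 0.51, 0.51, 0.50, 0.51, 0.49, 0.52, 0.53, 0.50, 0.47,
                 0.51, 0.52, 0.53, 0.48, 0.49, 0.50, 0.52, 0.49, 0.49, 0.50,
                 0.49, 0.48, 0.46, 0.49, 0.49, 0.48, 0.49, 0.49, 0.51, 0.47,
                 0.51, 0.51, 0.51, 0.48, 0.50, 0.47, 0.50, 0.51, 0.49, 0.48,
                 0.51, 0.50, 0.50, 0.53, 0.52, 0.52, 0.50, 0.50, 0.51, 0.51)
    B <- 1000
    boot_err <- replicate(B, mean(sample(nitrate, replace = TRUE)) - mean(nitrate))
    hist(boot_err, main = "Histogram of bootstrap")   # compare with Figure 1.8
    quantile(boot_err, c(0.025, 0.975))               # a rough 95% range of the error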

Based on this we can safely say that the nitrate concentration is 0.4998 ± 0.005.

Figure 1.8: Bootstrap estimation error distribution.

Exercise 6. Consider a sample of the daily number of buyers in a furniture store:

8, 5, 2, 3, 1, 3, 9, 5, 5, 2, 3, 3, 8, 4, 7, 11, 7, 5, 12, 5

Consider the two estimators of θ for a Poisson distribution as discussed in the previous section. Describe formally the procedure (in steps) for obtaining a bootstrap confidence interval for θ using each of the discussed estimators, and provide 95% bootstrap confidence intervals for each of them.


    Chapter 2

    Review of Probability

    2.1 Expectation and Variance

The expected value E[Y] of a random variable Y is defined as

E[Y] = Σ_i y_i P(y_i)

if Y is discrete, and

E[Y] = ∫ y f(y) dy

if Y is continuous, where f(y) is the probability density function. The variance Var[Y] of a random variable Y is defined as

Var[Y] = E(Y − E[Y])^2,

or

Var[Y] = Σ_i (y_i − E[Y])^2 P(y_i)

if Y is discrete, and

Var[Y] = ∫ (y − E[Y])^2 f(y) dy

if Y is continuous. When there is no ambiguity we often write EY for E[Y], and VarY for Var[Y].

A function of a random variable is itself a random variable. If h(Y) is a function of the random variable Y, then the expected value of h(Y) is given by

E[h(Y)] = Σ_i h(y_i) P(y_i)

if Y is discrete, and

E[h(Y)] = ∫ h(y) f(y) dy

if Y is continuous.

It is relatively straightforward to derive the following results for the expectation and variance of a linear function of Y:

E[aY + b] = a E[Y] + b,
Var[aY + b] = a^2 Var[Y],

where a and b are constants. Also

Var[Y] = E[Y^2] − (E[Y])^2.    (2.1)

For expectations, it can be shown more generally that

E[ Σ_{i=1}^{k} a_i h_i(Y) ] = Σ_{i=1}^{k} a_i E[h_i(Y)],

where a_i, i = 1, 2, . . . , k, are constants and h_i(Y), i = 1, 2, . . . , k, are functions of the random variable Y.
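As a quick check, identity (2.1) follows by expanding the square inside the expectation and using this linearity (a standard one-line derivation, written out here for completeness):

\[
\mathrm{Var}[Y] = E\big[(Y-E[Y])^2\big]
              = E[Y^2] - 2E[Y]\,E[Y] + (E[Y])^2
              = E[Y^2] - (E[Y])^2 .
\]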

    2.2 Distribution of a Function of a Random Variable

If Y is a random variable, then for any regular function g the quantity X = g(Y) is also a random variable. The cumulative distribution function of X is given as

F_X(x) = P(X ≤ x) = P(Y ∈ g^{−1}((−∞, x])).

The density function of X, if it exists, can be found by differentiating the right hand side of the above equality.

Example 1. Let Y have a density f_Y and let X = Y^2. Then

F_X(x) = P(Y^2 < x) = P(−√x ≤ Y ≤ √x) = F_Y(√x) − F_Y(−√x).

By taking the derivative in x we obtain

f_X(x) = (1/(2√x)) [ f_Y(√x) + f_Y(−√x) ].

If additionally the distribution of Y is symmetric around zero, i.e. f_Y(−y) = f_Y(y), then

f_X(x) = (1/√x) f_Y(√x).

Exercise 7. Let Z be a random variable with the density f_Z(z) = e^{−z^2/2}/√(2π), the so-called standard normal (Gaussian) random variable. Show that Z^2 is a Gamma(1/2, 1/2) random variable, i.e. that it has the density given by

(1/√(2π)) x^{−1/2} e^{−x/2}.

The distribution of Z^2 is also called the chi-square distribution with one degree of freedom.

Exercise 8. Let F_Y(y) be the cumulative distribution function of some random variable Y that with probability one takes values in a set R_Y. Assume that there is an inverse function F_Y^{−1} : [0, 1] → R_Y such that F_Y(F_Y^{−1}(u)) = u for u ∈ [0, 1]. Check that for U ~ Unif(0, 1) the random variable Y = F_Y^{−1}(U) has F_Y as its cumulative distribution function.

The density of g(Y) is particularly easy to express if g is strictly monotone, as shown in the next result.

Theorem 2.2.1. Let Y be a continuous random variable with probability density function f_Y. Suppose that g(y) is a strictly monotone (increasing or decreasing), differentiable (and hence continuous) function of y. The random variable Z defined by Z = g(Y) has probability density function given by

f_Z(z) = f_Y(g^{−1}(z)) | (d/dz) g^{−1}(z) |,    (2.2)

where g^{−1}(z) is defined to be the inverse function of g(y).

Proof. Let g(y) be a monotone increasing (decreasing) function and let F_Y(y) and F_Z(z) denote the probability distribution functions of the random variables Y and Z. Then

F_Z(z) = P(Z ≤ z) = P(g(Y) ≤ z) = P(Y ≤ (≥) g^{−1}(z)) = F_Y(g^{−1}(z))   (respectively 1 − F_Y(g^{−1}(z))).

By the chain rule,

f_Z(z) = (d/dz) F_Z(z) = (±)(d/dz) F_Y(g^{−1}(z)) = f_Y(g^{−1}(z)) | (d/dz) g^{−1}(z) |.

Exercise 9. (The Log-Normal Distribution) Suppose Z has a standard normal distribution and g(z) = e^{az+b}. Then Y = g(Z) is called a log-normal random variable. Demonstrate that the density of Y is given by

f_Y(y) = (1/√(2π a^2)) y^{−1} exp( − log^2(y/e^b) / (2a^2) ).

2.3 Transforms Method: Characteristic, Probability Generating and Moment Generating Functions

The probability generating function (p.g.f.) of a random variable Y is the function denoted by G_Y(t) and defined by

G_Y(t) = E(t^Y),

for those t ∈ R for which the above expectation is convergent. The expectation defining G_Y(t) converges absolutely if |t| ≤ 1. As the name implies, the p.g.f. generates the probabilities associated with a discrete distribution P(Y = j) = p_j, j = 0, 1, 2, . . . :

G_Y(0) = p_0,   G_Y'(0) = p_1,   G_Y''(0) = 2! p_2.

In general the k-th derivative of the p.g.f. of Y satisfies

G_Y^{(k)}(0) = k! p_k.

The p.g.f. can be used to calculate the mean and variance of a random variable Y. Note that in the discrete case G_Y'(t) = Σ_{j=1} j p_j t^{j−1} for −1 < t < 1. Letting t approach one from the left, t ↑ 1, we obtain

G_Y'(1) = Σ_{j=1} j p_j = E(Y) = μ_Y.

The second derivative of G_Y(t) satisfies

G_Y''(t) = Σ_{j=1} j(j − 1) p_j t^{j−2},

and consequently

G_Y''(1) = Σ_{j=1} j(j − 1) p_j = E(Y^2) − E(Y).

The variance of Y satisfies

σ_Y^2 = EY^2 − EY + EY − E^2Y = G_Y''(1) + G_Y'(1) − (G_Y'(1))^2.
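A worked example, using the Poisson distribution reviewed later in this chapter: for Y ~ Pois(λ),

\[
G_Y(t) = E(t^Y) = \sum_{y=0}^{\infty} t^y\,\frac{\lambda^y e^{-\lambda}}{y!}
       = e^{-\lambda}\sum_{y=0}^{\infty}\frac{(\lambda t)^y}{y!} = e^{\lambda(t-1)},
\qquad
G_Y'(1) = \lambda = E(Y),
\]

in agreement with the mean given in the summary tables at the end of the chapter.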

The moment generating function (m.g.f.) of a random variable Y is denoted by M_Y(t) and defined as

M_Y(t) = E(e^{tY}),

for those t ∈ R for which the expectation exists. The moment generating function generates the moments EY^k:

M_Y(0) = 1,   M_Y'(0) = μ_Y = E(Y),   M_Y''(0) = EY^2,

and, in general,

M_Y^{(k)}(0) = EY^k.

The characteristic function (ch.f.) of a random variable Y is defined by

φ_Y(t) = E(e^{itY}),

where i = √(−1).

    A very important result concerning generating functions states that the moment

    generating function uniquely defines the probability distribution (provided it exists in

    an open interval around zero). The characteristic function also uniquely defines the

    probability distribution.


Property 1. If Y has characteristic function φ_Y(t) and moment generating function M_Y(t), then for X = a + bY

φ_X(t) = e^{ait} φ_Y(bt),
M_X(t) = e^{at} M_Y(bt).

    2.4 Random Vectors

    2.4.1 Sums of Independent Random Variables

Suppose that Y1, Y2, . . . , Yn are independent random variables. Then the moment generating function of the linear combination Z = Σ_{i=1}^{n} a_i Y_i is the product of the individual moment generating functions:

M_Z(t) = E(e^{t Σ a_i Y_i}) = E(e^{a_1 t Y_1}) E(e^{a_2 t Y_2}) · · · E(e^{a_n t Y_n}) = ∏_{i=1}^{n} M_{Y_i}(a_i t).

The same argument also gives φ_Z(t) = ∏_{i=1}^{n} φ_{Y_i}(a_i t).

When X and Y are discrete random variables, the condition of independence is equivalent to p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y. In the jointly continuous case the condition of independence is equivalent to f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.

Consider random variables X and Y with probability densities f_X(x) and f_Y(y) respectively. We seek the probability density of the random variable X + Y. Our general result follows from

F_{X+Y}(a) = P(X + Y < a) = ∫∫_{x+y<a} f_X(x) f_Y(y) dx dy = ∫ F_X(a − y) f_Y(y) dy.

Thus the density function is f_{X+Y}(z) = ∫ f_X(z − y) f_Y(y) dy, which is called the convolution of the densities f_X and f_Y.
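A short worked instance of this convolution formula, anticipating the exponential and gamma densities reviewed below: if X and Y are independent Exp(λ) random variables, then for z > 0

\[
f_{X+Y}(z) = \int_0^{z} \lambda e^{-\lambda(z-y)}\,\lambda e^{-\lambda y}\,dy
           = \lambda^2 e^{-\lambda z}\int_0^{z} dy = \lambda^2 z\,e^{-\lambda z},
\]

which is the Gamma(2, λ) density; compare with Section 2.7.1.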

    2.4.2 Covariance and Correlation

Suppose that X and Y are real-valued random variables for some random experiment. The covariance of X and Y is defined by

Cov(X, Y) = E[(X − EX)(Y − EY)]

and (assuming the variances are positive) the correlation of X and Y is defined by

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

Note that the covariance and correlation always have the same sign (positive, negative, or 0). When the sign is positive, the variables are said to be positively correlated; when the sign is negative, the variables are said to be negatively correlated; and when the sign is 0, the variables are said to be uncorrelated. For an intuitive understanding of correlation, suppose that we run the experiment a large number of times and that for each run we plot the values (X, Y) in a scatterplot. The scatterplot for positively correlated variables shows a linear trend with positive slope, while the scatterplot for negatively correlated variables shows a linear trend with negative slope. For uncorrelated variables, the scatterplot should look like an amorphous blob of points with no discernible linear trend.

Property 2. You should satisfy yourself that the following are true:

Cov(X, Y) = EXY − EX EY,
Cov(X, Y) = Cov(Y, X),
Cov(Y, Y) = Var(Y),
Cov(aX + bY + c, Z) = a Cov(X, Z) + b Cov(Y, Z),
Var( Σ_{i=1}^{n} Y_i ) = Σ_{i,j=1}^{n} Cov(Y_i, Y_j).

    If X and Y are independent, then they are uncorrelated. The converse is not true

    however.


    2.4.3 The Bivariate Change of Variables Formula

Suppose that (X, Y) is a random vector taking values in a subset S of R^2 with probability density function f. Suppose that U and V are random variables that are functions of X and Y:

U = U(X, Y),   V = V(X, Y).

If these functions have derivatives, there is a simple way to get the joint probability density function g of (U, V). First, we will assume that the transformation (x, y) → (u, v) is one-to-one and maps S onto a subset T of R^2. Thus, the inverse transformation (u, v) → (x, y) is well defined and maps T onto S. We will assume that the inverse transformation is smooth, in the sense that the partial derivatives

∂x/∂u,  ∂x/∂v,  ∂y/∂u,  ∂y/∂v

exist on T, and the Jacobian

∂(x, y)/∂(u, v) = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u)

is nonzero on T. Now, let B be an arbitrary subset of T. The inverse transformation

maps B onto a subset A of S. Therefore,

P((U, V) ∈ B) = P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.

But, by the change of variables formula for double integrals, this can be written as

P((U, V) ∈ B) = ∫∫_B f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)| du dv.

By the very meaning of density, it follows that the probability density function of (U, V) is

g(u, v) = f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)|,   (u, v) ∈ T.

The change of variables formula generalizes to R^n.

Exercise 10. Let U1 and U2 be independent random variables with density equal to one over [0, 1], i.e. standard uniform random variables. Find the density of the following vector of variables:

(Z1, Z2) = ( √(−2 log U1) cos(2π U2),  √(−2 log U1) sin(2π U2) ).


    2.5 Discrete Random Variables

    2.5.1 Bernoulli Distribution

A Bernoulli trial is a probabilistic experiment which can have one of two outcomes, success (Y = 1) or failure (Y = 0), and in which the probability of success is θ. We refer to θ as the Bernoulli probability parameter. The value of the random variable Y is used as an indicator of the outcome, which may also be interpreted as the presence or absence of a particular characteristic. A Bernoulli random variable Y has probability mass function

P(Y = y | θ) = θ^y (1 − θ)^{1−y}    (2.4)

for y = 0, 1 and some θ ∈ (0, 1). The notation Y ~ Ber(θ) should be read as: the random variable Y follows a Bernoulli distribution with parameter θ.

A Bernoulli random variable Y has expected value E[Y] = 0 · P(Y = 0) + 1 · P(Y = 1) = 0 · (1 − θ) + 1 · θ = θ, and variance Var[Y] = (0 − θ)^2 (1 − θ) + (1 − θ)^2 θ = θ(1 − θ).

    2.5.2 Binomial Distribution

Consider independent repetitions of Bernoulli experiments, each with probability of success θ. Next consider the random variable Y, defined as the number of successes in a fixed number n of independent Bernoulli trials. That is,

Y = Σ_{i=1}^{n} X_i,

where X_i ~ Bernoulli(θ) for i = 1, . . . , n. Each sequence of length n containing y ones and (n − y) zeros occurs with probability θ^y (1 − θ)^{n−y}. The number of sequences with y successes, and consequently (n − y) failures, is

n! / (y! (n − y)!) = ( n choose y ).

The random variable Y can take on the values y = 0, 1, 2, . . . , n with probabilities

P(Y = y | θ) = ( n choose y ) θ^y (1 − θ)^{n−y}.    (2.5)

The notation Y ~ Bin(n, θ) should be read as: the random variable Y follows a binomial distribution with parameters n and θ. Finally, using the fact that Y is the sum of n independent Bernoulli random variables, we can calculate the expected value as E[Y] = E[Σ X_i] = Σ E[X_i] = Σ θ = nθ and the variance as Var[Y] = Var[Σ X_i] = Σ Var[X_i] = Σ θ(1 − θ) = nθ(1 − θ).

    2.5.3 Negative Binomial and Geometric Distribution

Instead of fixing the number of trials, suppose now that the number of successes, r, is fixed, and that the sample size required in order to reach this fixed number is the random variable N. This is sometimes called inverse sampling. In the case of r = 1, using the independence argument again leads to the geometric distribution

P(N = n | θ) = θ (1 − θ)^{n−1},   n = 1, 2, . . . ,    (2.6)

which is the geometric probability function with parameter θ. The distribution is so named as the successive probabilities form a geometric series. The notation N ~ Geo(θ) should be read as: the random variable N follows a geometric distribution with parameter θ. Write (1 − θ) = q. Then

E[N] = Σ_{n=1}^{∞} n θ q^{n−1} = θ Σ_{n=1}^{∞} (d/dq)(q^n) = θ (d/dq) Σ_{n=0}^{∞} q^n = θ (d/dq) ( 1/(1 − q) ) = θ/(1 − q)^2 = 1/θ.

Also,

E[N^2] = Σ_{n=1}^{∞} n^2 θ q^{n−1} = θ Σ_{n=1}^{∞} (d/dq)(n q^n) = θ (d/dq) Σ_{n=1}^{∞} n q^n
       = θ (d/dq) ( q/(1 − q)^2 ) = θ ( 1/(1 − q)^2 + 2q/(1 − q)^3 ) = 1/θ + 2(1 − θ)/θ^2 = 2/θ^2 − 1/θ.

Using Var[N] = E[N^2] − (E[N])^2, we get Var[N] = (1 − θ)/θ^2.

Consider now sampling that continues until a total of r successes are observed. Again, let the random variable N denote the number of trials required. If the r-th success occurs on the n-th trial, then this implies that a total of (r − 1) successes are observed by the (n − 1)-th trial. The probability of this happening can be calculated using the binomial distribution as

( n − 1 choose r − 1 ) θ^{r−1} (1 − θ)^{n−r}.

The probability that the n-th trial is a success is θ. As these two events are independent, we have that

P(N = n | r, θ) = ( n − 1 choose r − 1 ) θ^r (1 − θ)^{n−r}    (2.7)

for n = r, r + 1, . . . . The notation N ~ NegBin(r, θ) should be read as: the random variable N follows a negative binomial distribution with parameters r and θ. This is also known as the Pascal distribution. For the moments,

E[N^k] = Σ_{n=r}^{∞} n^k ( n − 1 choose r − 1 ) θ^r (1 − θ)^{n−r}
       = (r/θ) Σ_{n=r}^{∞} n^{k−1} ( n choose r ) θ^{r+1} (1 − θ)^{n−r}        since n ( n − 1 choose r − 1 ) = r ( n choose r )
       = (r/θ) Σ_{m=r+1}^{∞} (m − 1)^{k−1} ( m − 1 choose r ) θ^{r+1} (1 − θ)^{m−(r+1)}
       = (r/θ) E[ (X − 1)^{k−1} ],

where X ~ NegBin(r + 1, θ). Setting k = 1 we get E(N) = r/θ. Setting k = 2 gives

E[N^2] = (r/θ) E(X − 1) = (r/θ) ( (r + 1)/θ − 1 ).

Therefore Var[N] = r(1 − θ)/θ^2.

    2.5.4 Hypergeometric Distribution

The hypergeometric distribution is used to describe sampling without replacement. Consider an urn containing b balls, of which w are white and b − w are red. We intend to draw a sample of size n from the urn. Let Y denote the number of white balls selected. Then, for y = 0, 1, 2, . . . , n, we have

P(Y = y | b, w, n) = ( w choose y ) ( b − w choose n − y ) / ( b choose n ).    (2.8)

The j-th moment of a hypergeometric random variable is

E[Y^j] = Σ_{y=0}^{n} y^j P(Y = y) = Σ_{y=1}^{n} y^j ( w choose y ) ( b − w choose n − y ) / ( b choose n ).

The identities

y ( w choose y ) = w ( w − 1 choose y − 1 ),      n ( b choose n ) = b ( b − 1 choose n − 1 )

can be used to obtain

E[Y^j] = (nw/b) Σ_{y=1}^{n} y^{j−1} ( w − 1 choose y − 1 ) ( b − w choose n − y ) / ( b − 1 choose n − 1 )
       = (nw/b) Σ_{x=0}^{n−1} (x + 1)^{j−1} ( w − 1 choose x ) ( b − w choose n − 1 − x ) / ( b − 1 choose n − 1 )
       = (nw/b) E[(X + 1)^{j−1}],

where X is a hypergeometric random variable with parameters n − 1, b − 1, w − 1. From this it is easy to establish that E[Y] = nθ and Var[Y] = nθ(1 − θ)(b − n)/(b − 1), where θ = w/b is the fraction of white balls in the population.

    2.5.5 Poisson Distribution

Certain problems involve counting the number of events that have occurred in a fixed time period. A random variable Y, taking on one of the values 0, 1, 2, . . . , is said to be a Poisson random variable with parameter λ if for some λ > 0,

P(Y = y | λ) = (λ^y / y!) e^{−λ},   y = 0, 1, 2, . . .    (2.9)

The notation Y ~ Pois(λ) should be read as: the random variable Y follows a Poisson distribution with parameter λ. Equation (2.9) defines a probability mass function, since

Σ_{y=0}^{∞} (λ^y / y!) e^{−λ} = e^{−λ} Σ_{y=0}^{∞} λ^y / y! = e^{−λ} e^{λ} = 1.

The expected value of a Poisson random variable is

E[Y] = Σ_{y=0}^{∞} y e^{−λ} λ^y / y! = λ e^{−λ} Σ_{y=1}^{∞} λ^{y−1} / (y − 1)! = λ e^{−λ} Σ_{j=0}^{∞} λ^j / j! = λ.

To get the variance we first compute the second moment:

E[Y^2] = Σ_{y=0}^{∞} y^2 e^{−λ} λ^y / y! = λ Σ_{y=1}^{∞} y e^{−λ} λ^{y−1} / (y − 1)! = λ Σ_{j=0}^{∞} (j + 1) e^{−λ} λ^j / j! = λ(λ + 1).

Since we already have E[Y] = λ, we obtain Var[Y] = E[Y^2] − (E[Y])^2 = λ.

Suppose that Y ~ Binomial(n, p), and let λ = np. Then

P(Y = y | n, p) = ( n choose y ) p^y (1 − p)^{n−y}
               = ( n choose y ) (λ/n)^y (1 − λ/n)^{n−y}
               = [ n(n − 1) · · · (n − y + 1) / n^y ] (λ^y / y!) (1 − λ/n)^n / (1 − λ/n)^y.

For n large and λ moderate, we have that

(1 − λ/n)^n ≈ e^{−λ},   n(n − 1) · · · (n − y + 1) / n^y ≈ 1,   (1 − λ/n)^y ≈ 1.

Our result is that a binomial random variable Bin(n, p) is well approximated by a Poisson random variable Pois(λ = np) when n is large and p is small. That is,

P(Y = y | n, p) ≈ e^{−np} (np)^y / y!.
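A quick numerical illustration of this approximation in R (the values n = 1000 and p = 0.002 are chosen purely for illustration):

    n <- 1000; p <- 0.002                       # large n, small p
    y <- 0:6
    round(cbind(binomial = dbinom(y, n, p),
                poisson  = dpois(y, lambda = n * p)), 5)
    # the two columns agree to several decimal places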

    2.5.6 Discrete Uniform Distribution

The discrete uniform distribution with integer parameter N has a random variable Y that can take the values y = 1, 2, . . . , N with equal probability 1/N. It is easy to show that the mean and variance of Y are E[Y] = (N + 1)/2 and Var[Y] = (N^2 − 1)/12.

    2.5.7 The Multinomial Distribution

Suppose that we perform n independent and identical experiments, where each experiment can result in any one of r possible outcomes, with respective probabilities p1, p2, . . . , pr, where Σ_{i=1}^{r} p_i = 1. If we denote by Y_i the number of the n experiments that result in outcome number i, then

P(Y1 = n1, Y2 = n2, . . . , Yr = nr) = ( n! / (n1! n2! · · · nr!) ) p1^{n1} p2^{n2} · · · pr^{nr},    (2.10)

where Σ_{i=1}^{r} n_i = n. Equation (2.10) is justified by noting that any sequence of outcomes that leads to outcome i occurring n_i times for i = 1, 2, . . . , r will, by the assumption of independence of the experiments, have probability p1^{n1} p2^{n2} · · · pr^{nr} of occurring. As there are n!/(n1! n2! · · · nr!) such sequences of outcomes, equation (2.10) is established.

    2.6 Continuous Random Variables

    2.6.1 Uniform Distribution

A random variable Y is said to be uniformly distributed over the interval (a, b) if its probability density function is given by

f(y | a, b) = 1/(b − a),   if a < y < b,

and equals 0 for all other values of y. Since F(u) = ∫_{−∞}^{u} f(y) dy, the distribution function of a uniform random variable on the interval (a, b) is

F(u) = 0 for u ≤ a;   F(u) = (u − a)/(b − a) for a < u ≤ b;   F(u) = 1 for u > b.

The expected value of a uniform random variable turns out to be the mid-point of the interval, that is,

E[Y] = ∫ y f(y) dy = ∫_a^b y/(b − a) dy = (b^2 − a^2)/(2(b − a)) = (b + a)/2.

The second moment is calculated as

E[Y^2] = ∫_a^b y^2/(b − a) dy = (b^3 − a^3)/(3(b − a)) = (1/3)(b^2 + ab + a^2),

hence the variance is

Var[Y] = E[Y^2] − (E[Y])^2 = (b − a)^2/12.

The notation Y ~ U(a, b) should be read as: the random variable Y follows a uniform distribution on the interval (a, b).


    2.6.2 Exponential Distribution

A random variable Y is said to be an exponential random variable if its probability density function is given by

f(y | λ) = λ e^{−λy},   y > 0, λ > 0.

The cumulative distribution function of an exponential random variable is given by

F(a) = ∫_0^a λ e^{−λy} dy = −e^{−λy} |_0^a = 1 − e^{−λa},   a > 0.

The expected value E[Y] = ∫_0^∞ y λ e^{−λy} dy requires integration by parts, yielding

E[Y] = −y e^{−λy} |_0^∞ + ∫_0^∞ e^{−λy} dy = −e^{−λy}/λ |_0^∞ = 1/λ.

Integration by parts can be used to verify that E[Y^2] = 2/λ^2. Hence Var[Y] = 1/λ^2. The notation Y ~ Exp(λ) should be read as: the random variable Y follows an exponential distribution with parameter λ.

Exercise 11. Let U ~ U[0, 1]. Find the distribution of Y = −log U. Can you identify it as one of the common distributions?

    2.6.3 Gamma Distribution

A random variable Y is said to have a gamma distribution if its density function is given by

f(y | α, λ) = λ e^{−λy} (λy)^{α−1} / Γ(α),   0 < y, α > 0, λ > 0,

where Γ(α) is called the gamma function and is defined by

Γ(α) = ∫_0^∞ e^{−u} u^{α−1} du.

Integration by parts of Γ(α) yields the recursive relationship

Γ(α) = −e^{−u} u^{α−1} |_0^∞ + ∫_0^∞ e^{−u} (α − 1) u^{α−2} du    (2.11)
     = (α − 1) ∫_0^∞ e^{−u} u^{α−2} du = (α − 1) Γ(α − 1).    (2.12)

For integer values α = n, this recursive relationship reduces to Γ(n + 1) = n!. Note that by setting α = 1 the gamma distribution reduces to an exponential distribution. The expected value of a gamma random variable is given by

E[Y] = (λ^α/Γ(α)) ∫_0^∞ y^α e^{−λy} dy = (1/(λ Γ(α))) ∫_0^∞ u^α e^{−u} du,

after the change of variable u = λy. Hence E[Y] = Γ(α + 1)/(λ Γ(α)) = α/λ. Using the same substitution,

E[Y^2] = (λ^α/Γ(α)) ∫_0^∞ y^{α+1} e^{−λy} dy = α(α + 1)/λ^2,

so that Var[Y] = α/λ^2. The notation Y ~ Gamma(α, λ) should be read as: the random variable Y follows a gamma distribution with parameters α and λ.

Exercise 12. Let Y ~ Gamma(α, λ). Show that the moment generating function of Y is given, for t < λ, by

M_Y(t) = 1/(1 − t/λ)^α.

    2.6.4 Gaussian (Normal) Distribution

A random variable Z is a standard normal (or Gaussian) random variable if the density of Z is specified by

f(z) = (1/√(2π)) e^{−z^2/2}.    (2.13)

It is not immediately obvious that (2.13) specifies a probability density. To show that this is the case we need to prove that

(1/√(2π)) ∫_{−∞}^{∞} e^{−z^2/2} dz = 1

or, equivalently, that I = ∫_{−∞}^{∞} e^{−z^2/2} dz = √(2π). This is a classic result and so is well worth confirming. Consider

I^2 = ( ∫_{−∞}^{∞} e^{−z^2/2} dz ) ( ∫_{−∞}^{∞} e^{−w^2/2} dw ) = ∫∫ e^{−(z^2+w^2)/2} dz dw.

The double integral can be evaluated by a change of variables to polar coordinates. Substituting z = r cos θ, w = r sin θ, and dz dw = r dθ dr, we get

I^2 = ∫_0^∞ ∫_0^{2π} e^{−r^2/2} r dθ dr = 2π ∫_0^∞ r e^{−r^2/2} dr = 2π ( −e^{−r^2/2} ) |_0^∞ = 2π.

Taking the square root we get I = √(2π). The result I = √(2π) can also be used to establish that Γ(1/2) = √π. To prove that this is the case, note that, with the substitution u = z^2,

Γ(1/2) = ∫_0^∞ e^{−u} u^{−1/2} du = 2 ∫_0^∞ e^{−z^2} dz = √π.

The expected value of Z equals zero because z e^{−z^2/2} is integrable and antisymmetric around zero. The variance of Z is given by

Var[Z] = (1/√(2π)) ∫_{−∞}^{∞} z^2 e^{−z^2/2} dz.

Thus, integrating by parts,

Var[Z] = (1/√(2π)) ∫_{−∞}^{∞} z^2 e^{−z^2/2} dz
       = (1/√(2π)) [ −z e^{−z^2/2} |_{−∞}^{+∞} + ∫_{−∞}^{∞} e^{−z^2/2} dz ]
       = (1/√(2π)) ∫_{−∞}^{∞} e^{−z^2/2} dz
       = 1.

If Z is a standard normal random variable then Y = μ + σZ is called a general normal (Gaussian) random variable with parameters μ and σ. The density of Y is given by

f(y | μ, σ) = (1/√(2πσ^2)) e^{−(y−μ)^2/(2σ^2)}.

We obviously have E[Y] = μ and Var[Y] = σ^2. The notation Y ~ N(μ, σ^2) should be read as: the random variable Y follows a normal distribution with mean parameter μ and variance parameter σ^2. From the definition of Y it follows immediately that a + bY, where a and b are known constants, again has a normal distribution.

Exercise 13. Let Y ~ N(μ, σ^2). What is the distribution of X = a + bY?

Exercise 14. Let Y ~ N(μ, σ^2). Show that the moment generating function of Y is given by

M_Y(t) = e^{μt + σ^2 t^2/2}.

Hint: consider first the standard normal variable and then apply Property 1.


    2.6.5 Weibull Distribution

The Weibull distribution function has the form

F(y) = 1 − exp( −(y/b)^a ),   y > 0.

The Weibull density can be obtained by differentiation as

f(y | a, b) = (a/b) (y/b)^{a−1} exp( −(y/b)^a ).

To calculate the expected value

E[Y] = ∫_0^∞ y (a/b^a) y^{a−1} exp( −(y/b)^a ) dy

we use the substitution u = (y/b)^a, with du = (a/b^a) y^{a−1} dy. This yields

E[Y] = b ∫_0^∞ u^{1/a} e^{−u} du = b Γ( (a + 1)/a ).

In a similar manner, it is straightforward to verify that

E[Y^2] = b^2 Γ( (a + 2)/a ),

and thus

Var[Y] = b^2 [ Γ( (a + 2)/a ) − Γ^2( (a + 1)/a ) ].

    2.6.6 Beta Distribution

A random variable is said to have a beta distribution if its density is given by

f(y | a, b) = (1/B(a, b)) y^{a−1} (1 − y)^{b−1},   0 < y < 1.

Here the function

B(a, b) = ∫_0^1 u^{a−1} (1 − u)^{b−1} du

is the beta function, and it is related to the gamma function through

B(a, b) = Γ(a) Γ(b) / Γ(a + b).

Proceeding in the usual manner, we can show that

E[Y] = a/(a + b),      Var[Y] = ab / ( (a + b)^2 (a + b + 1) ).


    2.6.7 Chi-square Distribution

Let Z ~ N(0, 1), and let Y = Z^2. Then the cumulative distribution function is

F_Y(y) = P(Y ≤ y) = P(Z^2 ≤ y) = P(−√y ≤ Z ≤ √y) = F_Z(√y) − F_Z(−√y),

so that by differentiating in y we arrive at the density

f_Y(y) = (1/(2√y)) [ f_Z(√y) + f_Z(−√y) ] = (1/√(2πy)) e^{−y/2},

in which we recognize Gamma(1/2, 1/2). Suppose that Y = Σ_{i=1}^{n} Z_i^2, where the Z_i ~ N(0, 1), i = 1, . . . , n, are independent. From results on the sum of independent Gamma random variables, Y ~ Gamma(n/2, 1/2). This density has the form

f_Y(y | n) = e^{−y/2} y^{n/2−1} / ( 2^{n/2} Γ(n/2) ),   y > 0    (2.14)

and is referred to as the chi-squared distribution on n degrees of freedom. The notation Y ~ Chi(n) should be read as: the random variable Y follows a chi-squared distribution with n degrees of freedom. Later we will show that if X ~ Chi(u) and Y ~ Chi(v), then it follows that X + Y ~ Chi(u + v).

    2.6.8 The Bivariate Normal Distribution

Suppose that U and V are independent random variables, each with the standard normal distribution. We will need the following parameters: μ_X, μ_Y, σ_X > 0, σ_Y > 0, ρ ∈ [−1, 1]. Now let X and Y be new random variables defined by

X = μ_X + σ_X U,
Y = μ_Y + ρ σ_Y U + σ_Y √(1 − ρ^2) V.

Using basic properties of the mean, variance, covariance, and the normal distribution, satisfy yourself of the following.

Property 3. The following properties hold:

1. X is normally distributed with mean μ_X and standard deviation σ_X,
2. Y is normally distributed with mean μ_Y and standard deviation σ_Y,
3. Corr(X, Y) = ρ,
4. X and Y are independent if and only if ρ = 0.
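A small R sketch of this construction; the parameter values below are arbitrary illustrations, not taken from the notes:

    set.seed(1)
    n <- 100000
    muX <- 1; muY <- -2; sX <- 2; sY <- 3; rho <- 0.6
    U <- rnorm(n); V <- rnorm(n)                   # independent standard normals
    X <- muX + sX * U
    Y <- muY + rho * sY * U + sY * sqrt(1 - rho^2) * V
    c(mean(X), sd(X), mean(Y), sd(Y), cor(X, Y))   # close to 1, 2, -2, 3, 0.6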

The inverse transformation is

u = (x − μ_X)/σ_X,
v = (y − μ_Y)/(σ_Y √(1 − ρ^2)) − ρ (x − μ_X)/(σ_X √(1 − ρ^2)),

so that the Jacobian of the inverse transformation is

∂(u, v)/∂(x, y) = 1/( σ_X σ_Y √(1 − ρ^2) ).

Since U and V are independent standard normal variables, their joint probability density function is

g(u, v) = (1/(2π)) e^{−(u^2+v^2)/2}.

Using the bivariate change of variables formula, the joint density of (X, Y) is

f(x, y) = (1/(2π σ_X σ_Y √(1 − ρ^2))) exp{ −[ (x − μ_X)^2/(2σ_X^2(1 − ρ^2)) − ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y(1 − ρ^2)) + (y − μ_Y)^2/(2σ_Y^2(1 − ρ^2)) ] }.

    Bivariate Normal Conditional Distributions

In the last section we derived the joint probability density function f of the bivariate normal random variables X and Y. The marginal densities are known. Then

f_{Y|X}(y | x) = f_{Y,X}(y, x) / f_X(x) = (1/√(2π σ_Y^2 (1 − ρ^2))) exp{ −( y − (μ_Y + ρ σ_Y (x − μ_X)/σ_X) )^2 / (2σ_Y^2(1 − ρ^2)) }.

Thus the conditional distribution of Y given X = x is also Gaussian, with

E(Y | X = x) = μ_Y + ρ σ_Y (x − μ_X)/σ_X,
Var(Y | X = x) = σ_Y^2 (1 − ρ^2).

    2.6.9 The Multivariate Normal Distribution

Let Σ denote the 2 × 2 symmetric matrix

Σ = [ σ_X^2        ρ σ_X σ_Y ;
      ρ σ_Y σ_X    σ_Y^2     ].

Then

det Σ = σ_X^2 σ_Y^2 − (ρ σ_X σ_Y)^2 = σ_X^2 σ_Y^2 (1 − ρ^2)

and

Σ^{−1} = (1/(1 − ρ^2)) [ 1/σ_X^2          −ρ/(σ_X σ_Y) ;
                         −ρ/(σ_X σ_Y)     1/σ_Y^2      ].

Hence the bivariate normal density of (X, Y) can be written in matrix notation as

f_{(X,Y)}(x, y) = (1/(2π √(det Σ))) exp{ −(1/2) (x − μ_X, y − μ_Y)^T Σ^{−1} (x − μ_X, y − μ_Y) }.

Let Y = (Y1, . . . , Yp) be a random vector. Let E(Y_i) = μ_i, i = 1, . . . , p, and define the p-length vector μ = (μ_1, . . . , μ_p). Define the p × p matrix Σ through its elements Cov(Y_i, Y_j), i, j = 1, . . . , p. Then the random vector Y has a p-dimensional multivariate Gaussian distribution if its density function is specified by

f_Y(y) = (1/( (2π)^{p/2} |Σ|^{1/2} )) exp{ −(1/2) (y − μ)^T Σ^{−1} (y − μ) }.    (2.15)

The notation Y ~ MVN_p(μ, Σ) should be read as: the random vector Y follows a multivariate Gaussian (normal) distribution with p-vector mean μ and p × p variance-covariance matrix Σ.


2.7 Distributions: further properties

2.7.1 Sum of Independent Random Variables: special cases

    Poisson variables

Suppose X ~ Pois(λ) and Y ~ Pois(μ). Assume that X and Y are independent. Then

P(X + Y = n) = Σ_{k=0}^{n} P(X = k, Y = n − k)
             = Σ_{k=0}^{n} P(X = k) P(Y = n − k)
             = Σ_{k=0}^{n} (e^{−λ} λ^k / k!) (e^{−μ} μ^{n−k} / (n − k)!)
             = (e^{−(λ+μ)} / n!) Σ_{k=0}^{n} (n! / (k!(n − k)!)) λ^k μ^{n−k}
             = e^{−(λ+μ)} (λ + μ)^n / n!.

That is, X + Y ~ Pois(λ + μ).

    Binomial Random Variables

We seek the distribution of X + Y, where Y ~ Bin(n, θ) and X ~ Bin(m, θ). The sum X + Y models the situation where the total number of trials is fixed at n + m and the probability of a success in a single trial equals θ. Without performing any calculations, we expect to find that X + Y ~ Bin(n + m, θ). To verify this, note that X = X1 + · · · + Xm, where the Xi are independent Bernoulli variables with parameter θ, while Y = Y1 + · · · + Yn, where the Yi are also independent Bernoulli variables with parameter θ. Assuming that the Xi's are independent of the Yi's, we obtain that X + Y is the sum of n + m independent Bernoulli random variables with parameter θ, i.e. X + Y has the Bin(n + m, θ) distribution.

    Gamma, Chi-square, and Exponential Random Variables

Let $X \sim \mathrm{Gamma}(\alpha, \lambda)$ and $Y \sim \mathrm{Gamma}(\beta, \lambda)$ be independent. Then the moment generating function of $X + Y$ is given by
\[
M_{X+Y}(t) = M_X(t)M_Y(t)
= \frac{1}{(1-t/\lambda)^{\alpha}}\,\frac{1}{(1-t/\lambda)^{\beta}}
= \frac{1}{(1-t/\lambda)^{\alpha+\beta}}.
\]
But this is the moment generating function of a Gamma random variable distributed as $\mathrm{Gamma}(\alpha+\beta, \lambda)$. The result $X + Y \sim \chi^2(u+v)$, where $X \sim \chi^2(u)$ and $Y \sim \chi^2(v)$, follows as a corollary.

Let $Y_1, \ldots, Y_n$ be $n$ independent exponential random variables, each with parameter $\lambda$. Then $Z = Y_1 + Y_2 + \cdots + Y_n$ is a $\mathrm{Gamma}(n, \lambda)$ random variable. To see that this is indeed the case, write $Y_i \sim \mathrm{Exp}(\lambda)$, or alternatively, $Y_i \sim \mathrm{Gamma}(1, \lambda)$. Then $Y_1 + Y_2 \sim \mathrm{Gamma}(2, \lambda)$, and by induction $\sum_{i=1}^{n} Y_i \sim \mathrm{Gamma}(n, \lambda)$.
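The same fact can be checked by simulation; in the sketch below (arbitrary values of $\lambda$ and $n$, SciPy assumed) a Kolmogorov-Smirnov test does not distinguish sums of $n$ i.i.d. $\mathrm{Exp}(\lambda)$ variables from $\mathrm{Gamma}(n, \lambda)$ draws.

import numpy as np
from scipy.stats import gamma, kstest

lam, n, reps = 1.5, 5, 200_000    # illustrative choices
rng = np.random.default_rng(1)

# Sum n iid Exp(lam) variables; NumPy parameterizes the exponential by scale = 1/lam.
z = rng.exponential(scale=1 / lam, size=(reps, n)).sum(axis=1)

# Compare with Gamma(n, lam): in SciPy this is shape a = n with scale = 1/lam.
print(kstest(z, gamma(a=n, scale=1 / lam).cdf))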

    Gaussian Random Variables

Let $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ be independent. Then the moment generating function of $X + Y$ is given by
\[
M_{X+Y}(t) = M_X(t)M_Y(t)
= e^{\mu_X t + \sigma_X^2 t^2/2}\, e^{\mu_Y t + \sigma_Y^2 t^2/2}
= e^{(\mu_X+\mu_Y)t + (\sigma_X^2+\sigma_Y^2)t^2/2},
\]
which proves that $X + Y \sim N(\mu_X + \mu_Y,\, \sigma_X^2 + \sigma_Y^2)$.


2.7.2 Common Distributions – Summarizing Tables

Discrete Distributions

Bernoulli($\theta$)
  pmf:            $P(Y=y|\theta) = \theta^y(1-\theta)^{1-y}$,  $y = 0, 1$;  $0 \le \theta \le 1$
  mean/variance:  $E[Y] = \theta$,  $\mathrm{Var}[Y] = \theta(1-\theta)$
  mgf:            $M_Y(t) = \theta e^t + (1-\theta)$

Binomial($n, \theta$)
  pmf:            $P(Y=y|\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y}$,  $y = 0, 1, \ldots, n$;  $0 \le \theta \le 1$
  mean/variance:  $E[Y] = n\theta$,  $\mathrm{Var}[Y] = n\theta(1-\theta)$
  mgf:            $M_Y(t) = [\theta e^t + (1-\theta)]^n$

Discrete uniform($N$)
  pmf:            $P(Y=y|N) = 1/N$,  $y = 1, 2, \ldots, N$
  mean/variance:  $E[Y] = (N+1)/2$,  $\mathrm{Var}[Y] = (N+1)(N-1)/12$
  mgf:            $M_Y(t) = \frac{1}{N}\,\frac{e^t(1-e^{Nt})}{1-e^t}$

Geometric($\theta$)
  pmf:            $P(Y=y|\theta) = \theta(1-\theta)^{y-1}$,  $y = 1, 2, \ldots$;  $0 \le \theta \le 1$
  mean/variance:  $E[Y] = 1/\theta$,  $\mathrm{Var}[Y] = (1-\theta)/\theta^2$
  mgf:            $M_Y(t) = \theta e^t/[1-(1-\theta)e^t]$,  $t < -\log(1-\theta)$
  notes:          The random variable $X = Y - 1$ is $\mathrm{NegBin}(1, \theta)$.

Hypergeometric($b, w, n$)
  pmf:            $P(Y=y|b,w,n) = \binom{w}{y}\binom{b-w}{n-y}\big/\binom{b}{n}$,  $y = 0, 1, \ldots, n$,
                  with $\max(0,\, n-(b-w)) \le y \le \min(n, w)$;  $b, w, n \ge 0$
  mean/variance:  $E[Y] = nw/b$,  $\mathrm{Var}[Y] = nw(b-w)(b-n)/(b^2(b-1))$

Negative binomial($r, \theta$)
  pmf:            $P(Y=y|r,\theta) = \binom{r+y-1}{y}\theta^r(1-\theta)^y$,  $y = 0, 1, 2, \ldots$;  $0 < \theta \le 1$
  mean/variance:  $E[Y] = r(1-\theta)/\theta$,  $\mathrm{Var}[Y] = r(1-\theta)/\theta^2$
  mgf:            $M_Y(t) = \bigl[\theta/(1-(1-\theta)e^t)\bigr]^r$,  $t < -\log(1-\theta)$
  notes:          An alternative form of the pmf, used in the derivation in our notes, is given by
                  $P(N=n|r,\theta) = \binom{n-1}{r-1}\theta^r(1-\theta)^{n-r}$,  $n = r, r+1, \ldots$,
                  where the random variable $N = Y + r$. The negative binomial can also be
                  derived as a mixture of Poisson random variables.

Poisson($\lambda$)
  pmf:            $P(Y=y|\lambda) = \lambda^y e^{-\lambda}/y!$,  $y = 0, 1, 2, \ldots$;  $0 < \lambda$
  mean/variance:  $E[Y] = \lambda$,  $\mathrm{Var}[Y] = \lambda$
  mgf:            $M_Y(t) = e^{\lambda(e^t-1)}$


Continuous Distributions

Uniform $U(a, b)$
  pdf:            $f(y|a,b) = 1/(b-a)$,  $a < y < b$
  mean/variance:  $E[Y] = (b+a)/2$,  $\mathrm{Var}[Y] = (b-a)^2/12$
  mgf:            $M_Y(t) = (e^{bt}-e^{at})/((b-a)t)$
  notes:          A uniform distribution with $a = 0$ and $b = 1$ is a special case of the
                  beta distribution (with $\alpha = \beta = 1$).

Exponential $E(\lambda)$
  pdf:            $f(y|\lambda) = \lambda e^{-\lambda y}$,  $y > 0$,  $\lambda > 0$
  mean/variance:  $E[Y] = 1/\lambda$,  $\mathrm{Var}[Y] = 1/\lambda^2$
  mgf:            $M_Y(t) = 1/(1-t/\lambda)$
  notes:          Special case of the gamma distribution. $X = Y^{1/\gamma}$ is Weibull,
                  $X = \sqrt{2Y}$ is Rayleigh, $X = -\log(\lambda Y)$ is Gumbel.

Gamma $G(\alpha, \lambda)$
  pdf:            $f(y|\alpha,\lambda) = \lambda^{\alpha}e^{-\lambda y}y^{\alpha-1}/\Gamma(\alpha)$,  $y > 0$;  $\alpha, \lambda > 0$
  mean/variance:  $E[Y] = \alpha/\lambda$,  $\mathrm{Var}[Y] = \alpha/\lambda^2$
  mgf:            $M_Y(t) = 1/(1-t/\lambda)^{\alpha}$
  notes:          Includes the exponential ($\alpha = 1$) and chi squared ($\alpha = n/2$, $\lambda = 1/2$).

Normal $N(\mu, \sigma^2)$
  pdf:            $f(y|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y-\mu)^2/(2\sigma^2)}$,  $\sigma > 0$
  mean/variance:  $E[Y] = \mu$,  $\mathrm{Var}[Y] = \sigma^2$
  mgf:            $M_Y(t) = e^{\mu t + \sigma^2 t^2/2}$
  notes:          Often called the Gaussian distribution.

Transforms

The generating functions of the discrete and continuous random variables discussed thus far are given in Table 2.1.


Distrib.              p.g.f.                            m.g.f.                                   ch.f.
Bi($n,\theta$)        $(\theta t+\bar\theta)^n$         $(\theta e^t+\bar\theta)^n$              $(\theta e^{it}+\bar\theta)^n$
Geo($\theta$)         $\theta t/(1-\bar\theta t)$       $\theta/(e^{-t}-\bar\theta)$             $\theta/(e^{-it}-\bar\theta)$
NegBin($r,\theta$)    $\theta^r(1-\bar\theta t)^{-r}$   $\theta^r(1-\bar\theta e^t)^{-r}$        $\theta^r(1-\bar\theta e^{it})^{-r}$
Poi($\lambda$)        $e^{-\lambda(1-t)}$               $e^{\lambda(e^t-1)}$                     $e^{\lambda(e^{it}-1)}$
Unif($\alpha,\beta$)  –                                 $\frac{e^{\beta t}-e^{\alpha t}}{(\beta-\alpha)t}$    $\frac{e^{i\beta t}-e^{i\alpha t}}{i(\beta-\alpha)t}$
Exp($\lambda$)        –                                 $(1-t/\lambda)^{-1}$                     $(1-it/\lambda)^{-1}$
Ga($c,\lambda$)       –                                 $(1-t/\lambda)^{-c}$                     $(1-it/\lambda)^{-c}$
N($\mu,\sigma^2$)     –                                 $\exp(\mu t+\sigma^2 t^2/2)$             $\exp(i\mu t-\sigma^2 t^2/2)$

Table 2.1: Transforms of distributions. In the formulas $\bar\theta = 1-\theta$.


    Chapter 3

    Likelihood

    3.1 Maximum Likelihood Estimation

Let $x$ be a realization of the random variable $X$ with probability density $f_X(x|\theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_m)^T$ is a vector of $m$ unknown parameters to be estimated. The set of allowable values for $\theta$, denoted by $\Theta$ (or sometimes by $\Omega$), is called the parameter space. Define the likelihood function
\[
l(\theta|x) = f_X(x|\theta). \tag{3.1}
\]
It is crucial to stress that the argument of $f_X(x|\theta)$ is $x$, but the argument of $l(\theta|x)$ is $\theta$. It is therefore convenient to view the likelihood function $l(\theta)$ as the probability of the observed data $x$ considered as a function of $\theta$. Usually it is convenient to work with the natural logarithm of the likelihood, called the log-likelihood and denoted by $\log l(\theta|x)$.

When $\theta \in \mathbb{R}^1$ we can define the score function as the first derivative of the log-likelihood,
\[
S(\theta) = \frac{\partial}{\partial\theta}\log l(\theta).
\]
The maximum likelihood estimate (MLE) $\hat\theta$ of $\theta$ is the solution to the score equation
\[
S(\theta) = 0.
\]


At the maximum, the second partial derivative of the log-likelihood is negative, so we define the curvature at $\hat\theta$ as $I(\hat\theta)$, where
\[
I(\theta) = -\frac{\partial^2}{\partial\theta^2}\log l(\theta).
\]
We can check that a solution of the equation $S(\theta) = 0$ is actually a maximum by checking that $I(\hat\theta) > 0$. A large curvature $I(\hat\theta)$ is associated with a tight or strong peak, intuitively indicating less uncertainty about $\theta$.

The likelihood function $l(\theta|x)$ supplies an order of preference or plausibility among possible values of $\theta$ based on the observed $x$. It ranks the plausibility of possible values of $\theta$ by how probable they make the observed $x$. If $P(x|\theta = \theta_1) > P(x|\theta = \theta_2)$, then the observed $x$ makes $\theta = \theta_1$ more plausible than $\theta = \theta_2$, and consequently, from (3.1), $l(\theta_1|x) > l(\theta_2|x)$. The likelihood ratio $l(\theta_1|x)/l(\theta_2|x) = f(x|\theta_1)/f(x|\theta_2)$ is a measure of the plausibility of $\theta_1$ relative to $\theta_2$ based on the observed $x$. The relative likelihood $l(\theta_1|x)/l(\theta_2|x) = k$ means that the observed value $x$ will occur $k$ times more frequently in repeated samples from the population defined by the value $\theta_1$ than from the population defined by $\theta_2$. Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum.

When the random variables $X_1, \ldots, X_n$ are mutually independent we can write the joint density as
\[
f_{\mathbf{X}}(\mathbf{x}) = \prod_{j=1}^{n} f_{X_j}(x_j),
\]
where $\mathbf{x} = (x_1, \ldots, x_n)$ is a realization of the random vector $\mathbf{X} = (X_1, \ldots, X_n)$, and the likelihood function becomes
\[
l(\theta|\mathbf{x}) = \prod_{j=1}^{n} f_{X_j}(x_j|\theta).
\]
When the densities $f_{X_j}(x_j)$ are identical, we unambiguously write $f(x_j)$.

Example 2 (Bernoulli Trials). Consider $n$ independent Bernoulli trials. The $j$th observation is either a success or a failure, coded $x_j = 1$ and $x_j = 0$ respectively, and
\[
P(X_j = x_j) = \theta^{x_j}(1-\theta)^{1-x_j}
\]


for $j = 1, \ldots, n$. The vector of observations $y = (x_1, x_2, \ldots, x_n)^T$ is a sequence of ones and zeros, and is a realization of the random vector $Y = (X_1, X_2, \ldots, X_n)^T$. As the Bernoulli outcomes are assumed to be independent, we can write the joint probability mass function of $Y$ as the product of the marginal probabilities, that is
\[
l(\theta) = \prod_{j=1}^{n} P(X_j = x_j)
= \prod_{j=1}^{n} \theta^{x_j}(1-\theta)^{1-x_j}
= \theta^{\sum x_j}(1-\theta)^{n-\sum x_j}
= \theta^{r}(1-\theta)^{n-r},
\]
where $r = \sum_{j=1}^{n} x_j$ is the number of observed successes (1s) in the vector $y$. The log-likelihood function is then
\[
\log l(\theta) = r\log\theta + (n-r)\log(1-\theta),
\]
and the score function is
\[
S(\theta) = \frac{\partial}{\partial\theta}\log l(\theta) = \frac{r}{\theta} - \frac{n-r}{1-\theta}.
\]
Solving $S(\theta) = 0$ we get $\hat\theta = r/n$. We also have
\[
I(\theta) = \frac{r}{\theta^2} + \frac{n-r}{(1-\theta)^2} > 0,
\]
guaranteeing that $\hat\theta$ is the MLE. Each $X_i$ is a Bernoulli random variable with expected value $E(X_i) = \theta$ and variance $\mathrm{Var}(X_i) = \theta(1-\theta)$. The MLE $\hat\theta(y)$ is itself a random variable and has expected value
\[
E(\hat\theta) = E\left(\frac{r}{n}\right) = E\left(\frac{\sum_{i=1}^{n} X_i}{n}\right)
= \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n}\theta = \theta.
\]
If an estimator has on average the value of the parameter that it is intended to estimate, then we call it unbiased, i.e. if $E(\hat\theta) = \theta$. From the above calculation it follows that $\hat\theta(y)$ is an unbiased estimator of $\theta$. The variance of $\hat\theta(y)$ is
\[
\mathrm{Var}(\hat\theta) = \mathrm{Var}\left(\frac{\sum_{i=1}^{n} X_i}{n}\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)
= \frac{1}{n^2}\sum_{i=1}^{n}\theta(1-\theta)
= \frac{\theta(1-\theta)}{n}.
\]
□
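The closed form $\hat\theta = r/n$ can be cross-checked against a direct numerical maximization of the log-likelihood. The sketch below is only an illustration: it simulates Bernoulli data with an arbitrary true $\theta$ and assumes SciPy is available.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
theta_true, n = 0.3, 200           # illustrative values
x = rng.binomial(1, theta_true, size=n)
r = x.sum()

# Negative Bernoulli log-likelihood: -(r log(theta) + (n - r) log(1 - theta)).
def neg_loglik(theta):
    return -(r * np.log(theta) + (n - r) * np.log(1 - theta))

opt = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(opt.x, r / n)                # numerical maximizer agrees with r/n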


Example 3 (Binomial sampling). The number of successes in $n$ Bernoulli trials is a random variable $R$ taking on values $r = 0, 1, \ldots, n$ with probability mass function
\[
P(R = r) = \binom{n}{r}\theta^r(1-\theta)^{n-r}.
\]
This is exactly the same sampling scheme as in the previous example, except that instead of observing the sequence $y$ we only observe the total number of successes $r$. Hence the likelihood function has the form
\[
l(\theta|r) = \binom{n}{r}\theta^r(1-\theta)^{n-r}.
\]
The relevant mathematical calculations are as follows:
\[
\begin{aligned}
\log l(\theta|r) &= \log\binom{n}{r} + r\log\theta + (n-r)\log(1-\theta) \\
S(\theta) &= \frac{r}{\theta} - \frac{n-r}{1-\theta}, \qquad \hat\theta = \frac{r}{n} \\
I(\theta) &= \frac{r}{\theta^2} + \frac{n-r}{(1-\theta)^2} > 0 \\
E(\hat\theta) &= \frac{E(R)}{n} = \frac{n\theta}{n} = \theta \quad\text{(unbiased)} \\
\mathrm{Var}(\hat\theta) &= \frac{\mathrm{Var}(R)}{n^2} = \frac{n\theta(1-\theta)}{n^2} = \frac{\theta(1-\theta)}{n}.
\end{aligned}
\]
□

Example 4 (Prevalence of a Genotype). Geneticists interested in the prevalence of a certain genotype observe that the genotype makes its first appearance in the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function can be computed based on the geometric distribution, as $l(\theta) = \theta(1-\theta)^{n-1}$. The score function is then $S(\theta) = \theta^{-1} - (n-1)(1-\theta)^{-1}$. Setting $S(\theta) = 0$ we get $\hat\theta = n^{-1} = 22^{-1}$. Moreover, $I(\theta) = \theta^{-2} + (n-1)(1-\theta)^{-2}$, which is greater than zero for all $\theta$, implying that $\hat\theta$ is the MLE.

Suppose that the geneticists had planned to stop sampling once they observed $r = 10$ subjects with the specified genotype, and the tenth subject with the genotype was the 100th subject analysed overall. The likelihood of $\theta$ can be computed based on the negative binomial distribution, as
\[
l(\theta) = \binom{n-1}{r-1}\theta^r(1-\theta)^{n-r}
\]
for $n = 100$, $r = 10$. The usual calculation will confirm that $\hat\theta = r/n$ is the MLE. □

Example 5 (Radioactive Decay). In this classic set of data Rutherford and Geiger counted the number of scintillations in 7.5-second intervals caused by radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations during 2608 such intervals:

Count     0    1    2    3    4    5    6    7
Observed  57   203  383  525  532  408  273  139

Count     8    9    10   11   12   13   14
Observed  45   27   10   4    0    1    1

The Poisson probability mass function with mean parameter $\lambda$ is
\[
f_X(x|\lambda) = \frac{\lambda^x\exp(-\lambda)}{x!}.
\]
The likelihood function equals
\[
l(\lambda) = \prod_i\frac{\lambda^{x_i}\exp(-\lambda)}{x_i!}
= \frac{\lambda^{\sum x_i}\exp(-n\lambda)}{\prod_i x_i!}.
\]
The relevant mathematical calculations are
\[
\begin{aligned}
\log l(\lambda) &= \Bigl(\sum x_i\Bigr)\log\lambda - n\lambda - \log\Bigl(\prod_i x_i!\Bigr) \\
S(\lambda) &= \frac{\sum x_i}{\lambda} - n, \qquad \hat\lambda = \frac{\sum x_i}{n} = \bar{x} \\
I(\lambda) &= \frac{\sum x_i}{\lambda^2} > 0,
\end{aligned}
\]
implying $\hat\lambda$ is the MLE. Also $E(\hat\lambda) = \frac{1}{n}\sum E(x_i) = \lambda$, so $\hat\lambda$ is an unbiased estimator. Next, $\mathrm{Var}(\hat\lambda) = \frac{1}{n^2}\mathrm{Var}(\sum x_i) = \lambda/n$. It is always useful to compare the fitted values from a model against the observed values.

i          0   1    2    3    4    5    6    7    8    9   10  11  12  13  14
O_i        57  203  383  525  532  408  273  139  45   27  10  4   0   1   1
E_i        54  211  407  525  508  393  254  140  68   29  11  4   1   0   0
O_i - E_i  +3  -8   -24  0    +24  +15  +19  -1   -23  -2  -1  0   -1  +1  +1

The Poisson law agrees with the observed variation to within about one-twentieth of its range. □
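The fitted row $E_i$ above can be reproduced in a few lines of code (SciPy assumed); $\hat\lambda$ is just the sample mean of the tabulated counts.

import numpy as np
from scipy.stats import poisson

# Rutherford-Geiger data as tabulated above.
counts = np.arange(15)
observed = np.array([57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1])

n = observed.sum()                        # 2608 intervals
lam_hat = (counts * observed).sum() / n   # MLE: the sample mean, roughly 3.87

expected = n * poisson.pmf(counts, lam_hat)
for i, o, e in zip(counts, observed, np.round(expected).astype(int)):
    print(i, o, e, o - e)                 # reproduces the O_i vs E_i comparison up to rounding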


Example 6 (Exponential distribution). Suppose random variables $X_1, \ldots, X_n$ are i.i.d. as $\mathrm{Exp}(\lambda)$. Then
\[
\begin{aligned}
l(\lambda) &= \prod_{i=1}^{n}\lambda\exp(-\lambda x_i) = \lambda^n\exp\Bigl(-\lambda\sum x_i\Bigr) \\
\log l(\lambda) &= n\log\lambda - \lambda\sum x_i \\
S(\lambda) &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i, \qquad \hat\lambda = \frac{n}{\sum x_i} \\
I(\lambda) &= \frac{n}{\lambda^2} > 0.
\end{aligned}
\]

Exercise 15. Demonstrate that the expectation and variance of $\hat\lambda$ are given as follows:
\[
E[\hat\lambda] = \frac{n}{n-1}\,\lambda, \qquad
\mathrm{Var}[\hat\lambda] = \frac{n^2}{(n-1)^2(n-2)}\,\lambda^2.
\]
Hint: find the probability distribution of $Z = \sum_{i=1}^{n} X_i$, where $X_i \sim \mathrm{Exp}(\lambda)$.

Exercise 16. Propose the alternative estimator $\tilde\lambda = \frac{n-1}{n}\hat\lambda$. Show that $\tilde\lambda$ is an unbiased estimator of $\lambda$ with the variance
\[
\mathrm{Var}[\tilde\lambda] = \frac{\lambda^2}{n-2}.
\]

As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased estimates. If it is thought that this property is (in some sense) desirable, then some adjustments to the MLEs, usually in the form of scaling, may be required.
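The bias described in Exercises 15 and 16 shows up clearly in simulation; the following sketch uses arbitrary illustrative values of $\lambda$ and $n$ and only NumPy.

import numpy as np

lam, n, reps = 2.0, 10, 200_000    # illustrative values
rng = np.random.default_rng(3)

x = rng.exponential(scale=1 / lam, size=(reps, n))
lam_mle = n / x.sum(axis=1)               # MLE from Example 6
lam_tilde = (n - 1) / n * lam_mle         # rescaled estimator from Exercise 16

print(lam_mle.mean(), n * lam / (n - 1))       # MLE is biased upward: E = n*lam/(n-1)
print(lam_tilde.mean(), lam)                   # rescaled estimator is (nearly) unbiased
print(lam_tilde.var(), lam**2 / (n - 2))       # variance close to lam^2/(n-2)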

Example 7 (Gaussian Distribution). Consider data $X_1, X_2, \ldots, X_n$ distributed as $N(\mu, \sigma^2)$. Then the likelihood function is
\[
l(\mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n}
\exp\left\{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right\}
\]


and the log-likelihood function is
\[
\log l(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2)
- \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2. \tag{3.2}
\]

Unknown mean and known variance. As $\sigma^2$ is known we treat this parameter as a constant when differentiating wrt $\mu$. Then
\[
S(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu), \qquad
\hat\mu = \frac{1}{n}\sum_{i=1}^{n}x_i, \qquad
I(\mu) = \frac{n}{\sigma^2} > 0.
\]
Also, $E[\hat\mu] = n\mu/n = \mu$, and so the MLE of $\mu$ is unbiased. Finally,
\[
\mathrm{Var}[\hat\mu] = \frac{1}{n^2}\mathrm{Var}\Bigl(\sum_{i=1}^{n}x_i\Bigr)
= \frac{\sigma^2}{n} = \bigl(E[I(\mu)]\bigr)^{-1}.
\]

Known mean and unknown variance. Differentiating (3.2) wrt $\sigma^2$ returns
\[
S(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2,
\]
and setting $S(\sigma^2) = 0$ implies
\[
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2.
\]
Differentiating again, and multiplying by $-1$, yields
\[
I(\sigma^2) = -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6}\sum_{i=1}^{n}(x_i-\mu)^2.
\]
Clearly $\hat\sigma^2$ is the MLE since
\[
I(\hat\sigma^2) = \frac{n}{2\hat\sigma^4} > 0.
\]
Define
\[
Z_i = \frac{X_i - \mu}{\sigma},
\]
so that $Z_i \sim N(0, 1)$. From the appendix on probability, $\sum_{i=1}^{n}Z_i^2 \sim \chi^2_n$, implying $E[\sum Z_i^2] = n$ and $\mathrm{Var}[\sum Z_i^2] = 2n$. The MLE can be written
\[
\hat\sigma^2 = \frac{\sigma^2}{n}\sum_{i=1}^{n}Z_i^2.
\]


Then
\[
E[\hat\sigma^2] = E\left[\frac{\sigma^2}{n}\sum_{i=1}^{n}Z_i^2\right] = \sigma^2,
\]
and
\[
\mathrm{Var}[\hat\sigma^2] = \left(\frac{\sigma^2}{n}\right)^2\mathrm{Var}\Bigl(\sum_{i=1}^{n}Z_i^2\Bigr)
= \frac{2\sigma^4}{n}.
\]
□

    Our treatment of the two parameters of the Gaussian distribution in the last example

    was to (i) fix the variance and estimate the mean using maximum likelihood; and then

    (ii) fix the mean and estimate the variance using maximum likelihood. In practice we

    would like to consider the simultaneous estimation of these parameters. In the next

    section of these notes we extend MLE to multiple parameter estimation.

    3.2 Multi-parameter Estimation

Suppose that a statistical model specifies that the data $y$ has a probability distribution $f(y; \theta, \phi)$ depending on two unknown parameters $\theta$ and $\phi$. In this case the likelihood function is a function of the two variables $\theta$ and $\phi$ and, having observed the value $y$, is defined as $l(\theta, \phi) = f(y; \theta, \phi)$, with log-likelihood $\log l(\theta, \phi)$. The MLE of $(\theta, \phi)$ is a value $(\hat\theta, \hat\phi)$ for which $l(\theta, \phi)$, or equivalently $\log l(\theta, \phi)$, attains its maximum value.

Define $S_1(\theta, \phi) = \partial\log l/\partial\theta$ and $S_2(\theta, \phi) = \partial\log l/\partial\phi$. The MLEs $(\hat\theta, \hat\phi)$ can be obtained by solving the pair of simultaneous equations
\[
S_1(\theta, \phi) = 0, \qquad S_2(\theta, \phi) = 0.
\]
Let us consider the matrix $I(\theta, \phi)$,
\[
I(\theta, \phi) =
\begin{pmatrix} I_{11}(\theta,\phi) & I_{12}(\theta,\phi) \\ I_{21}(\theta,\phi) & I_{22}(\theta,\phi) \end{pmatrix}
=
\begin{pmatrix}
-\frac{\partial^2}{\partial\theta^2}\log l & -\frac{\partial^2}{\partial\theta\,\partial\phi}\log l \\
-\frac{\partial^2}{\partial\phi\,\partial\theta}\log l & -\frac{\partial^2}{\partial\phi^2}\log l
\end{pmatrix}.
\]
The conditions for a value $(\theta_0, \phi_0)$ satisfying $S_1(\theta_0, \phi_0) = 0$ and $S_2(\theta_0, \phi_0) = 0$ to be a MLE are that
\[
I_{11}(\theta_0, \phi_0) > 0, \qquad I_{22}(\theta_0, \phi_0) > 0,
\]


and
\[
\det I(\theta_0, \phi_0) = I_{11}(\theta_0, \phi_0)I_{22}(\theta_0, \phi_0) - I_{12}(\theta_0, \phi_0)^2 > 0.
\]
This is equivalent to requiring that both eigenvalues of the matrix $I(\theta_0, \phi_0)$ be positive.

Example 8 (Gaussian distribution). Let $X_1, X_2, \ldots, X_n$ be iid observations from a $N(\mu, \sigma^2)$ density in which both $\mu$ and $\sigma^2$ are unknown. The log-likelihood is
\[
\begin{aligned}
\log l(\mu, \sigma^2) &= \sum_{i=1}^{n}\log\left\{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Bigl[-\frac{1}{2\sigma^2}(x_i-\mu)^2\Bigr]\right\} \\
&= \sum_{i=1}^{n}\left\{-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(x_i-\mu)^2\right\} \\
&= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.
\end{aligned}
\]
Hence, for $v = \sigma^2$,
\[
S_1(\mu, v) = \frac{\partial\log l}{\partial\mu} = \frac{1}{v}\sum_{i=1}^{n}(x_i-\mu) = 0,
\]
which implies that
\[
\hat\mu = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}. \tag{3.3}
\]
Also,
\[
S_2(\mu, v) = \frac{\partial\log l}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
\]
implies that
\[
\hat\sigma^2 = \hat{v} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2. \tag{3.4}
\]
Calculating second derivatives and multiplying by $-1$ gives that $I(\mu, v)$ equals
\[
I(\mu, v) =
\begin{pmatrix}
\frac{n}{v} & \frac{1}{v^2}\sum_{i=1}^{n}(x_i-\mu) \\
\frac{1}{v^2}\sum_{i=1}^{n}(x_i-\mu) & -\frac{n}{2v^2} + \frac{1}{v^3}\sum_{i=1}^{n}(x_i-\mu)^2
\end{pmatrix}.
\]
Hence $I(\hat\mu, \hat{v})$ is given by
\[
\begin{pmatrix}
\frac{n}{\hat{v}} & 0 \\
0 & \frac{n}{2\hat{v}^2}
\end{pmatrix}.
\]


Clearly both diagonal terms are positive and the determinant is positive, and so $(\hat\mu, \hat{v})$ are, indeed, the MLEs of $(\mu, v)$.
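As a numerical cross-check of (3.3) and (3.4), one can maximize the two-parameter log-likelihood directly; the sketch below uses simulated data with arbitrary true values and assumes SciPy is available.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # illustrative data
n = len(x)

# Negative log-likelihood in (mu, v), v = sigma^2, cf. equation (3.2).
def neg_loglik(par):
    mu, v = par
    return 0.5 * n * np.log(2 * np.pi * v) + np.sum((x - mu) ** 2) / (2 * v)

opt = minimize(neg_loglik, x0=np.array([0.0, 1.0]), bounds=[(None, None), (1e-8, None)])
print(opt.x)                                   # numerical MLEs (mu-hat, v-hat)
print(x.mean(), ((x - x.mean()) ** 2).mean())  # closed forms (3.3) and (3.4)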

Go back to equation (3.3): $\bar{X} \sim N(\mu, v/n)$. Clearly $E(\bar{X}) = \mu$ (unbiased) and $\mathrm{Var}(\bar{X}) = v/n$. Now go back to equation (3.4). From Lemma 1, which is proven below, we have
\[
\frac{n\hat{v}}{v} \sim \chi^2_{n-1},
\]
so that
\[
E\left(\frac{n\hat{v}}{v}\right) = n-1, \qquad
E(\hat{v}) = \frac{n-1}{n}\,v.
\]
Instead, propose the (unbiased) estimator of $\sigma^2$
\[
S^2 = \tilde{v} = \frac{n}{n-1}\,\hat{v} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2. \tag{3.5}
\]
Observe that
\[
E(\tilde{v}) = \frac{n}{n-1}E(\hat{v}) = \frac{n}{n-1}\cdot\frac{n-1}{n}\,v = v,
\]
and $\tilde{v}$ is unbiased as suggested. We can easily show that
\[
\mathrm{Var}(\tilde{v}) = \frac{2v^2}{n-1}.
\]

Lemma 1 (Joint distribution of the sample mean and sample variance). If $X_1, \ldots, X_n$ are iid $N(\mu, v)$ then the sample mean $\bar{X}$ and sample variance $S^2$ are independent. Also $\bar{X}$ is distributed $N(\mu, v/n)$ and $(n-1)S^2/v$ is a chi-squared random variable with $n-1$ degrees of freedom.

Proof. Define
\[
W = \sum_{i=1}^{n}(X_i-\bar{X})^2 = \sum_{i=1}^{n}(X_i-\mu)^2 - n(\bar{X}-\mu)^2,
\]
so that
\[
\frac{W}{v} + \frac{(\bar{X}-\mu)^2}{v/n} = \sum_{i=1}^{n}\frac{(X_i-\mu)^2}{v}.
\]


The RHS is the sum of $n$ independent standard normal random variables squared, and so is distributed $\chi^2_n$. Also, $\bar{X} \sim N(\mu, v/n)$, therefore $(\bar{X}-\mu)^2/(v/n)$ is the square of a standard normal and so is distributed $\chi^2_1$. These chi-squared random variables have moment generating functions $(1-2t)^{-n/2}$ and $(1-2t)^{-1/2}$ respectively. Next, $W/v$ and $(\bar{X}-\mu)^2/(v/n)$ are independent. Indeed,
\[
\begin{aligned}
\mathrm{Cov}(X_i-\bar{X}, \bar{X}) &= \mathrm{Cov}(X_i, \bar{X}) - \mathrm{Cov}(\bar{X}, \bar{X})
= \mathrm{Cov}\Bigl(X_i, \frac{1}{n}\sum_j X_j\Bigr) - \mathrm{Var}(\bar{X}) \\
&= \frac{1}{n}\sum_j\mathrm{Cov}(X_i, X_j) - \frac{v}{n}
= \frac{v}{n} - \frac{v}{n} = 0.
\end{aligned}
\]
But $\mathrm{Cov}(X_i-\bar{X}, \bar{X}-\mu) = \mathrm{Cov}(X_i-\bar{X}, \bar{X}) = 0$ for each $i$, and since the vector of deviations $(X_1-\bar{X}, \ldots, X_n-\bar{X})$ and $\bar{X}$ are jointly normal, these zero covariances imply that $W/v$, which is a function of the deviations, is independent of $(\bar{X}-\mu)^2/(v/n)$.

As the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we see that
\[
E\bigl[e^{t(W/v)}\bigr](1-2t)^{-1/2} = (1-2t)^{-n/2},
\qquad\text{so}\qquad
E\bigl[e^{t(W/v)}\bigr] = (1-2t)^{-(n-1)/2}.
\]
But $(1-2t)^{-(n-1)/2}$ is the moment generating function of a $\chi^2$ random variable with $(n-1)$ degrees of freedom, and the moment generating function uniquely characterizes the distribution of the random variable $W/v = (n-1)S^2/v$.
□

Suppose that a statistical model specifies that the data $x$ has a probability distribution $f(x; \boldsymbol\theta)$ depending on a vector of $m$ unknown parameters $\boldsymbol\theta = (\theta_1, \ldots, \theta_m)$. In this case the likelihood function is a function of the $m$ parameters $\theta_1, \ldots, \theta_m$ and, having observed the value of $x$, is defined as $l(\boldsymbol\theta) = f(x; \boldsymbol\theta)$, with log-likelihood $\log l(\boldsymbol\theta)$. The MLE of $\boldsymbol\theta$ is a value $\hat{\boldsymbol\theta}$ for which $l(\boldsymbol\theta)$, or equivalently $\log l(\boldsymbol\theta)$, attains its maximum value. For $r = 1, \ldots, m$ define $S_r(\boldsymbol\theta) = \partial\log l/\partial\theta_r$. Then we can (usually) find the MLE by solving the set of $m$ simultaneous equations $S_r(\boldsymbol\theta) = 0$ for $r = 1, \ldots, m$. The matrix $I(\boldsymbol\theta)$ is defined to be the $m \times m$ matrix whose $(r, s)$ element is given by $I_{rs} = -\partial^2\log l/\partial\theta_r\,\partial\theta_s$. The conditions for a value $\hat{\boldsymbol\theta}$ satisfying $S_r(\hat{\boldsymbol\theta}) = 0$ for $r = 1, \ldots, m$ to be a MLE are that all the eigenvalues of the matrix $I(\hat{\boldsymbol\theta})$ are positive.

    3.3 The Invariance Principle

How do we deal with parameter transformations? We will assume a one-to-one transformation, but the idea applies generally. Consider a binomial sample with $n = 10$ independent trials resulting in data $x = 8$ successes. The likelihood ratio of $\theta_1 = 0.8$ versus $\theta_2 = 0.3$ is
\[
\frac{l(\theta_1 = 0.8)}{l(\theta_2 = 0.3)}
= \frac{\theta_1^8(1-\theta_1)^2}{\theta_2^8(1-\theta_2)^2}
= 208.7,
\]
that is, given the data, $\theta = 0.8$ is about 200 times more likely than $\theta = 0.3$.

Suppose we are interested in expressing $\theta$ on the logit scale as
\[
\phi \equiv \log\{\theta/(1-\theta)\};
\]
then intuitively our relative information about $\phi_1 = \log(0.8/0.2) = 1.39$ versus $\phi_2 = \log(0.3/0.7) = -0.85$ should be
\[
\frac{L(\phi_1)}{L(\phi_2)} = \frac{l(\theta_1)}{l(\theta_2)} = 208.7.
\]
That is, our information should be invariant to the choice of parameterization. (For the purposes of this example we are not too concerned about how to calculate $L(\phi)$.)

Theorem 3.3.1 (Invariance of the MLE). If $g$ is a one-to-one function, and $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$.

Proof. This is trivially true: if we let $\phi = g(\theta)$, then $f\{y|g^{-1}(\phi)\}$ is maximized in $\phi$ exactly when $\phi = g(\hat\theta)$. When $g$ is not one-to-one the discussion becomes more subtle, but we simply choose to define the MLE of $g(\theta)$ to be $g(\hat\theta)$.
□

It seems intuitive that if $\hat\theta$ is most likely for $\theta$ and our knowledge (data) remains unchanged, then $g(\hat\theta)$ is most likely for $g(\theta)$. In fact, we would find it strange if $\hat\theta$ is an estimate of $\theta$, but $\hat\theta^2$ is not an estimate of $\theta^2$. In the binomial example with $n = 10$ and $x = 8$ we get $\hat\theta = 0.8$, so the MLE of $g(\theta) = \theta/(1-\theta)$ is
\[
g(\hat\theta) = \hat\theta/(1-\hat\theta) = 0.8/0.2 = 4.
\]
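The invariance property can also be seen numerically: maximizing the binomial likelihood over the logit $\phi = \log\{\theta/(1-\theta)\}$ returns the logit of $\hat\theta$. The sketch below uses the $n = 10$, $x = 8$ example and assumes SciPy is available.

import numpy as np
from scipy.optimize import minimize_scalar

n, x = 10, 8

# Negative binomial log-likelihood as a function of phi = log(theta/(1-theta)).
def neg_loglik_logit(phi):
    theta = 1 / (1 + np.exp(-phi))
    return -(x * np.log(theta) + (n - x) * np.log(1 - theta))

phi_hat = minimize_scalar(neg_loglik_logit, bounds=(-10, 10), method="bounded").x
print(phi_hat, np.log(0.8 / 0.2))   # both approximately 1.39, the logit of theta-hat = 0.8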


    Chapter 4

    Estimation

In the previous chapter we saw an approach to estimation that is based on the likelihood of the observed results. Next we study the general theory of estimation, which is used to compare different estimators and to decide on the most efficient one.

    4.1 General properties of estimators

Suppose that we are going to observe a value of a random vector $X$. Let $\mathcal{X}$ denote the set of possible values $X$ can take and, for $x \in \mathcal{X}$, let $f(x|\theta)$ denote the probability that $X$ takes the value $x$, where the parameter $\theta$ is some unknown element of the set $\Theta$.

The problem we face is that of estimating $\theta$. An estimator $\hat\theta$ is a procedure which for each possible value $x \in \mathcal{X}$ specifies which element of $\Theta$ we should quote as an estimate of $\theta$. When we observe $X = x$ we quote $\hat\theta(x)$ as our estimate of $\theta$. Thus $\hat\theta$ is a function of the random vector $X$. Sometimes we write $\hat\theta(X)$ to emphasise this point.

Given any estimator $\hat\theta$ we can calculate its expected value for each possible value of $\theta$. As we have already mentioned when discussing maximum likelihood estimation, an estimator is said to be unbiased if this expected value is identically equal to $\theta$. If an estimator is unbiased then we can conclude that if we repeat the experiment an infinite number of times with $\theta$ fixed and calculate the value of the estimator each time, then the average of the estimator values will be exactly equal to $\theta$. To evaluate the usefulness of an estimator $\hat\theta = \hat\theta(x)$ of $\theta$, we examine the properties of the random variable $\hat\theta = \hat\theta(X)$.

Definition 1 (Unbiased estimators). An estimator $\hat\theta = \hat\theta(X)$ is said to be unbiased for a parameter $\theta$ if it equals $\theta$ in expectation:
\[
E[\hat\theta(X)] = E(\hat\theta) = \theta.
\]
Intuitively, an unbiased estimator is "right on target". □

Definition 2 (Bias of an estimator). The bias of an estimator $\hat\theta = \hat\theta(X)$ of $\theta$ is defined as $\mathrm{bias}(\hat\theta) = E[\hat\theta(X) - \theta]$. □

Note that even if $\hat\theta$ is an unbiased estimator of $\theta$, $g(\hat\theta)$ will generally not be an unbiased estimator of $g(\theta)$ unless $g$ is linear or affine. This limits the importance of the notion of unbiasedness. It might be at least as important that an estimator is accurate in the sense that its distribution is highly concentrated around $\theta$.

Exercise 17. Show that for an arbitrary distribution the estimator $S^2$ as defined in (3.5) is an unbiased estimator of the variance of this distribution.

Exercise 18. Consider the estimator $S^2$ of the variance $\sigma^2$ in the case of the normal distribution. Demonstrate that although $S^2$ is an unbiased estimator of $\sigma^2$, $S$ is not an unbiased estimator of $\sigma$. Compute its bias.

Definition 3 (Mean squared error). The mean squared error of the estimator $\hat\theta$ is defined as $\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2$. Given the same set of data, $\hat\theta_1$ is better than $\hat\theta_2$ if $\mathrm{MSE}(\hat\theta_1) \le \mathrm{MSE}(\hat\theta_2)$ (uniformly better if this holds for every value of $\theta$). □

Lemma 2 (The MSE variance-bias tradeoff). The MSE decomposes as
\[
\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2.
\]


Proof. We have
\[
\begin{aligned}
\mathrm{MSE}(\hat\theta) &= E(\hat\theta - \theta)^2 \\
&= E\bigl\{[\hat\theta - E(\hat\theta)] + [E(\hat\theta) - \theta]\bigr\}^2 \\
&= E[\hat\theta - E(\hat\theta)]^2 + E[E(\hat\theta) - \theta]^2
+ 2\underbrace{E\bigl\{[\hat\theta - E(\hat\theta)][E(\hat\theta) - \theta]\bigr\}}_{=0} \\
&= E[\hat\theta - E(\hat\theta)]^2 + [E(\hat\theta) - \theta]^2 \\
&= \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2.
\end{aligned}
\]
□

NOTE: This lemma implies that the mean squared error of an unbiased estimator is equal to the variance of the estimator.
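Lemma 2 is easy to illustrate by simulation. Below, a deliberately biased estimator of a normal mean (an arbitrary shrinkage-type choice, used purely for illustration) has its MSE computed directly and via the variance-plus-squared-bias decomposition; only NumPy is needed.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.0, 2.0, 20, 200_000   # illustrative values

x = rng.normal(mu, sigma, size=(reps, n))
est = 0.9 * x.mean(axis=1)                   # a deliberately biased estimator of mu

mse = np.mean((est - mu) ** 2)
var = est.var()
bias2 = (est.mean() - mu) ** 2
print(mse, var + bias2)                      # the two agree up to Monte Carlo error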

Exercise 19. Consider $X_1, \ldots, X_n$ where $X_i \sim N(\mu, \sigma^2)$ and $\sigma$ is known. Three estimators of $\mu$ are $\hat\mu_1 = \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i$, $\hat\mu_2 = X_1$, and $\hat\mu_3 = (X_1 + \bar{X})/2$. Discuss their properties; which one would you recommend, and why?

Example 9. Consider $X_1, \ldots, X_n$ to be independent random variables with means $E(X_i) = \mu$ and variances $\mathrm{Var}(X_i) = \sigma_i^2$. Consider pooling the estimators of $\mu$ into a common estimator using the linear combination
\[
\hat\mu = w_1X_1 + w_2X_2 + \cdots + w_nX_n.
\]
We will see that the following is true.

(i) The estimator $\hat\mu$ is unbiased if and only if $\sum_i w_i = 1$.

(ii) The estimator $\hat\mu$ has minimum variance among this class of estimators when the weights are inversely proportional to the variances $\sigma_i^2$.

(iii) The variance of $\hat\mu$ for optimal weights $w_i$ is $\mathrm{Var}(\hat\mu) = 1/\sum_i\sigma_i^{-2}$.

Indeed, we have $E(\hat\mu) = E(w_1X_1 + \cdots + w_nX_n) = \sum_i w_iE(X_i) = \sum_i w_i\mu = \mu\sum_i w_i$, so $\hat\mu$ is unbiased if and only if $\sum_i w_i = 1$. The variance of our estimator is $\mathrm{Var}(\hat\mu) = \sum_i w_i^2\sigma_i^2$, which should be minimized subject to the constraint $\sum_i w_i = 1$. Differentiating the Lagrangian $L = \sum_i w_i^2\sigma_i^2 - \lambda(\sum_i w_i - 1)$ with respect to $w_i$ and setting equal to zero yields $2w_i\sigma_i^2 = \lambda$, so that $w_i \propto \sigma_i^{-2}$ and, after imposing the constraint, $w_i = \sigma_i^{-2}/(\sum_j\sigma_j^{-2})$. Then, for optimal weights, we get $\mathrm{Var}(\hat\mu) = \sum_i w_i^2\sigma_i^2 = (\sum_i\sigma_i^{-4}\sigma_i^2)/(\sum_i\sigma_i^{-2})^2 = 1/(\sum_i\sigma_i^{-2})$.

Assume now that instead of $X_i$ we observe the biased variables $\tilde{X}_i = X_i + \delta$ for some $\delta \neq 0$. When $\sigma_i^2 = \sigma^2$ we have that $\mathrm{Var}(\hat\mu) = \sigma^2/n$, which tends to zero as $n \to \infty$, whereas $\mathrm{bias}(\hat\mu) = \delta$ and $\mathrm{MSE}(\hat\mu) = \sigma^2/n + \delta^2$. Thus in the general case, when bias is present it tends to dominate the variance as $n$ gets larger, which is very unfortunate.
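Points (i)-(iii) can be verified numerically; the sketch below (arbitrary illustrative variances, NumPy only) compares the optimally weighted combination with the equally weighted average.

import numpy as np

rng = np.random.default_rng(6)
mu = 3.0
sigmas = np.array([0.5, 1.0, 2.0, 4.0])      # illustrative standard deviations
reps = 200_000

x = mu + sigmas * rng.standard_normal((reps, sigmas.size))

w_opt = (1 / sigmas**2) / np.sum(1 / sigmas**2)   # weights proportional to 1/sigma_i^2
opt_est = x @ w_opt
eq_est = x.mean(axis=1)

print(opt_est.mean())                             # approximately mu: unbiased
print(opt_est.var(), 1 / np.sum(1 / sigmas**2))   # matches 1 / sum(sigma_i^{-2})
print(eq_est.var())                               # strictly larger than the optimal variance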

Exercise 20. Let $X_1, \ldots, X_n$ be an independent sample of size $n$ from the uniform distribution on the interval $(0, \theta)$, with density for a single observation $f(x|\theta) = 1/\theta$ for $0 < x < \theta$ and 0 otherwise, and consider $\theta > 0$ unknown.

(i) Find the expected value and variance of the estimator $\hat\theta = 2\bar{X}$.

(ii) Find the expected value of the estimator $\tilde\theta = X_{(n)}$, i.e. the largest observation.

(iii) Find an unbiased estimator of the form $\check\theta = cX_{(n)}$ and calculate its variance.

(iv) Compare the mean square errors of $\hat\theta$ and $\check\theta$.

4.2 Minimum-Variance Unbiased Estimation

Getting a small MSE often involves a tradeoff between variance and bias. For unbiased estimators, the MSE obviously equals the variance, $\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta)$, so no tradeoff can be made. One approach is to restrict ourselves to the subclass of estimators that are unbiased and minimum variance.

Definition 4 (Minimum-variance unbiased estimator). If an unbiased estimator of $g(\theta)$ has minimum variance among all unbiased estimators of $g(\theta)$ it is called a minimum variance unbiased estimator (MVUE). □

We will develop a method of finding the MVUE when it exists. When such an estimator does not exist we will be able to find a lower bound for the variance of an unbiased estimator in the class of unbiased estimators, and compare the variance of our unbiased estimator with this lower bound.
