
    Lecture Notes on MS237

    Mathematical statistics

    Lecture notes by Janet Godolphin

    2010


    Contents

1 Introductory revision material
  1.1 Basic probability
    1.1.1 Terminology
    1.1.2 Probability axioms
    1.1.3 Conditional probability
    1.1.4 Self-study exercises
  1.2 Random variables and probability distributions
    1.2.1 Random variables
    1.2.2 Expectation
    1.2.3 Self-study exercises

2 Random variables and distributions
  2.1 Transformations
    2.1.1 Self-study exercises
  2.2 Some standard discrete distributions
    2.2.1 Binomial distribution
    2.2.2 Geometric distribution
    2.2.3 Poisson distribution
    2.2.4 Self-study exercises
  2.3 Some standard continuous distributions
    2.3.1 Uniform distribution
    2.3.2 Exponential distribution
    2.3.3 Pareto distribution
    2.3.4 Self-study exercises
  2.4 The normal (Gaussian) distribution
    2.4.1 Normal distribution
    2.4.2 Properties
    2.4.3 Self-study exercises
  2.5 Bivariate distributions
    2.5.1 Definitions and notation
    2.5.2 Marginal distributions
    2.5.3 Conditional distributions
    2.5.4 Covariance and correlation
    2.5.5 Self-study exercises
  2.6 Generating functions
    2.6.1 General
    2.6.2 Probability generating function
    2.6.3 Moment generating function
    2.6.4 Self-study exercises

3 Further distribution theory
  3.1 Multivariate distributions
    3.1.1 Definitions
    3.1.2 Mean and covariance matrix
    3.1.3 Properties
    3.1.4 Self-study exercises
  3.2 Transformations
    3.2.1 The univariate case
    3.2.2 The multivariate case
    3.2.3 Self-study exercises
  3.3 Moments, generating functions and inequalities
    3.3.1 Moment generating function
    3.3.2 Cumulant generating function
    3.3.3 Some useful inequalities
    3.3.4 Self-study exercises
  3.4 Some limit theorems
    3.4.1 Modes of convergence of random variables
    3.4.2 Limit theorems for sums of independent random variables
    3.4.3 Self-study exercises
  3.5 Further discrete distributions
    3.5.1 Negative binomial distribution
    3.5.2 Hypergeometric distribution
    3.5.3 Multinomial distribution
    3.5.4 Self-study exercises
  3.6 Further continuous distributions
    3.6.1 Gamma and beta functions
    3.6.2 Gamma distribution
    3.6.3 Beta distribution
    3.6.4 Self-study exercises

4 Normal and associated distributions
  4.1 The multivariate normal distribution
    4.1.1 Multivariate normal
    4.1.2 Properties
    4.1.3 Marginal and conditional distributions
    4.1.4 Self-study exercises
  4.2 The chi-square, t and F distributions
    4.2.1 Chi-square distribution
    4.2.2 Student's t distribution
    4.2.3 Variance ratio (F) distribution
  4.3 Normal theory tests and confidence intervals
    4.3.1 One-sample t-test
    4.3.2 Two samples
    4.3.3 k samples (one-way ANOVA)
    4.3.4 Normal linear regression


    MS237 Mathematical Statistics

Level 2, Spring Semester, Credits 15

Course Lecturer in 2010: D. Terhesiu, email: [email protected]

Class Test
The Class Test will be held on Thursday 11th March (week 5), starting at 12.00.

Class tests will include questions of the following types:

• examples and proofs previously worked in lectures,
• questions from the self-study exercises,
• previously unseen questions in a similar style.

The Class Test will comprise 15% of the overall assessment for the course.

Coursework
Distribution: Coursework will be distributed at 14.00 on Friday 26th March.
Collection: Coursework will be collected on Thursday 29th April in Room LTB.

The Coursework will comprise 10% of the overall assessment for the course.

Chapter 1
Chapter 1 contains and reviews prerequisite material from MS132. Due to time constraints, students are expected to work through at least part of this material independently at the start of the course.

Objectives and learning outcomes
This module provides theoretical background for many of the topics introduced in MS132 and for some of the topics which will appear in subsequent statistics modules.

At the end of the module, you should

(1) be familiar with the main results of statistical distribution theory;

(2) be able to apply this knowledge to suitable problems in statistics.

Examples, exercises, and problems
Blank spaces have been left in the notes at various positions. These are for additional material and worked examples presented in the lectures. Most chapters end with a set of self-study exercises, which you should attempt in your own study time in parallel with the lectures.

In addition, six exercise sheets will be distributed during the course. You will be given a week to complete each sheet, which will then be marked by the lecturer and returned with model solutions. It should be stressed that completion of these exercise sheets is not compulsory, but those students who complete the sheets do give themselves a considerable advantage!

Selected texts
Freund, J. E., Mathematical Statistics with Applications, Pearson (2004).
Hogg, R. V. and Tanis, E. A., Probability and Statistical Inference, Prentice Hall (1997).
Lindgren, B. W., Statistical Theory, Macmillan (1976).
Mood, A. M., Graybill, F. G. and Boes, D. C., Introduction to the Theory of Statistics, McGraw-Hill (1974).
Wackerly, D. D., Mendenhall, W. and Scheaffer, R. L., Mathematical Statistics with Applications, Duxbury (2002).

Useful series
These series will be useful during the course:

(1 − x)^{−1} = Σ_{k=0}^∞ x^k = 1 + x + x² + x³ + ···

(1 − x)^{−2} = Σ_{k=0}^∞ (k + 1) x^k = 1 + 2x + 3x² + 4x³ + ···

e^x = Σ_{k=0}^∞ x^k/k! = 1 + x + x²/2! + x³/3! + ···


    Chapter 1

    Introductory revision material

This chapter contains and reviews prerequisite material from MS132. If necessary you should review your notes for that module for additional details. Several examples, together with numerical answers, are included in this chapter. It is strongly recommended that you work independently through these examples in order to consolidate your understanding of the material.

    1.1 Basic probability

Probability or chance can be measured on a scale which runs from zero, which represents impossibility, to one, which represents certainty.

    1.1.1 Terminology

A sample space, Ω, is the set of all possible outcomes of an experiment. An event E ⊆ Ω is a subset of Ω.

Example 1: Experiment: roll a die twice. Possible events are E_1 = {1st face is a 6}, E_2 = {sum of faces = 3}, E_3 = {sum of faces is odd}, E_4 = {1st face − 2nd face = 3}. Identify the sample space and the above events. Obtain their probabilities when the die is fair.


    Answer:

              second roll
           1      2      3      4      5      6
first   1 (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
roll    2 (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
        3 (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
        4 (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
        5 (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
        6 (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

p(E_1) = 1/6;  p(E_2) = 1/18;  p(E_3) = 1/2;  p(E_4) = 1/12.
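These probabilities can also be checked by brute-force enumeration; the following short Python sketch (an illustration, not part of the original notes) lists the 36 equally likely outcomes and computes the four event probabilities exactly.

```python
from fractions import Fraction
from itertools import product

# Enumerate the sample space of Example 1: two rolls of a fair die.
omega = list(product(range(1, 7), repeat=2))      # 36 equally likely outcomes

def prob(event):
    """Exact probability of an event (a predicate on outcomes) under equal likelihood."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

print(prob(lambda w: w[0] == 6))                  # E_1: 1/6
print(prob(lambda w: w[0] + w[1] == 3))           # E_2: 1/18
print(prob(lambda w: (w[0] + w[1]) % 2 == 1))     # E_3: 1/2
print(prob(lambda w: w[0] - w[1] == 3))           # E_4: 1/12
```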

Combinations of events
Given events A and B, further events can be identified as follows.

• The complement of any event A, written Ā or A^c, means that A does not occur.
• The union of any two events A and B, written A ∪ B, means that A or B or both occur.
• The intersection of A and B, written as A ∩ B, means that both A and B occur.

    Venn diagrams are useful in this context.

    1.1.2 Probability axioms

Let F be the class of all events in Ω. A probability (measure) P on (Ω, F) is a real-valued function satisfying the following three axioms:

1. P(E) ≥ 0 for every E ∈ F
2. P(Ω) = 1
3. Suppose the events E_1 and E_2 are mutually exclusive (that is, E_1 ∩ E_2 = ∅). Then

   P(E_1 ∪ E_2) = P(E_1) + P(E_2)

Some consequences:
(i) P(Ē) = 1 − P(E) (so in particular P(∅) = 0)
(ii) For any two events E_1 and E_2 we have the addition rule

   P(E_1 ∪ E_2) = P(E_1) + P(E_2) − P(E_1 ∩ E_2)


    Example 1: (continued)

Obtain P(E_1 ∩ E_2), P(E_1 ∪ E_2), P(E_1 ∩ E_3) and P(E_1 ∪ E_3).

Answer:
P(E_1 ∩ E_2) = P(∅) = 0
P(E_1 ∪ E_2) = P(E_1) + P(E_2) = 1/6 + 1/18 = 2/9
P(E_1 ∩ E_3) = P({(6,1), (6,3), (6,5)}) = 3/36 = 1/12
P(E_1 ∪ E_3) = P(E_1) + P(E_3) − P(E_1 ∩ E_3) = 1/6 + 1/2 − 1/12 = 7/12

[Notes on axioms:
(1) In order to cope with infinite sequences of events, it is necessary to strengthen axiom 3 to
3'. P(∪_{i=1}^∞ E_i) = Σ_{i=1}^∞ P(E_i) for any sequence (E_1, E_2, ...) of mutually exclusive events.
(2) When Ω is uncountably infinite, in order to make the theory rigorous it is usually necessary to restrict the class of events F to which probabilities are assigned.]

    1.1.3 Conditional probability

Suppose P(E_2) ≠ 0. The conditional probability of the event E_1 given E_2 is defined as

P(E_1|E_2) = P(E_1 ∩ E_2)/P(E_2).

The conditional probability is undefined if P(E_2) = 0. The conditional probability formula above yields the multiplication rule:

P(E_1 ∩ E_2) = P(E_1)P(E_2|E_1) = P(E_2)P(E_1|E_2)

Independence
Events E_1 and E_2 are said to be independent if

P(E_1 ∩ E_2) = P(E_1)P(E_2).

Note that this implies that P(E_1|E_2) = P(E_1) and P(E_2|E_1) = P(E_2). Thus knowledge of the occurrence of one of the events does not affect the likelihood of occurrence of the other.
Events E_1, ..., E_k are pairwise independent if P(E_i ∩ E_j) = P(E_i)P(E_j) for all i ≠ j. They are mutually independent if for all subsets P(∩_j E_j) = ∏_j P(E_j). Clearly, mutual independence ⇒ pairwise independence, but the converse is false (see question 4 of the self-study exercises).


Example 1 (continued): Find P(E_1|E_2) and P(E_1|E_3). Are E_1, E_2 independent?

Answer: P(E_1|E_2) = P(E_1 ∩ E_2)/P(E_2) = 0, and P(E_1|E_3) = P(E_1 ∩ E_3)/P(E_3) = (1/12)/(1/2) = 1/6.
P(E_1)P(E_2) ≠ 0, so P(E_1 ∩ E_2) ≠ P(E_1)P(E_2) and thus E_1 and E_2 are not independent.

Law of total probability (partition law)
Suppose that B_1, ..., B_k are mutually exclusive and exhaustive events (i.e. B_i ∩ B_j = ∅ for all i ≠ j and ∪_i B_i = Ω). Let A be any event. Then

P(A) = Σ_{j=1}^k P(A|B_j)P(B_j).

Bayes' Rule
Suppose that events B_1, ..., B_k are mutually exclusive and exhaustive and let A be any event. Then

P(B_j|A) = P(A|B_j)P(B_j)/P(A) = P(A|B_j)P(B_j)/Σ_i P(A|B_i)P(B_i).

Example 2: (Cancer diagnosis) A screening programme for a certain type of cancer has reliabilities P(A|D) = 0.98, P(A|D̄) = 0.05, where D is the event "disease is present" and A is the event "test gives a positive result". It is known that 1 in 10,000 of the population has the disease. Suppose that an individual's test result is positive. What is the probability that that person has the disease?

Answer: We require P(D|A). First find P(A).
P(A) = P(A|D)P(D) + P(A|D̄)P(D̄) = 0.98 × 0.0001 + 0.05 × 0.9999 = 0.050093.
By Bayes' rule, P(D|A) = P(A|D)P(D)/P(A) = (0.0001 × 0.98)/0.050093 ≈ 0.002.
The person is still very unlikely to have the disease even though the test is positive.
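As a quick numerical check of Example 2 (an aside, not part of the original notes), the same calculation in a few lines of Python:

```python
# Bayes' rule check for Example 2 (cancer diagnosis).
p_D = 0.0001           # prevalence: 1 in 10,000
p_A_given_D = 0.98     # pr(positive test | disease)
p_A_given_notD = 0.05  # pr(positive test | no disease)

p_A = p_A_given_D * p_D + p_A_given_notD * (1 - p_D)   # law of total probability
p_D_given_A = p_A_given_D * p_D / p_A                  # Bayes' rule

print(p_A)          # 0.050093
print(p_D_given_A)  # approximately 0.002
```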

Example 3: (Bertrand's Box Paradox) Three indistinguishable boxes contain black and white beads as shown: [ww], [wb], [bb]. A box is chosen at random and a bead is chosen at random from the selected box. What is the probability that the [wb] box was chosen, given that the selected bead was white?

Answer: Let E ≡ 'chose the [wb] box' and W ≡ 'selected bead is white'. By the partition law: P(W) = 1 × 1/3 + 1/2 × 1/3 + 0 × 1/3 = 1/2. Now using Bayes' rule,

P(E|W) = P(E)P(W|E)/P(W) = (1/3 × 1/2)/(1/2) = 1/3

(i.e. even though a bead from the selected box has been seen, the probability that the box is [wb] is still 1/3).

    1.1.4 Self-study exercises

1. Consider families of three children, a typical outcome being bbg (boy-boy-girl in birth order) with probability 1/8. Find the probabilities of

    (i) 2 boys and 1 girl (any order),

    (ii) at least one boy,

    (iii) consecutive children of different sexes.

Answer: (i) 3/8; (ii) 7/8; (iii) 1/4.

2. Use p_A = P(A), p_B = P(B) and p_{AB} = P(A ∩ B) to obtain expressions for:

   (a) P(Ā ∪ B̄),
   (b) P(Ā ∩ B),
   (c) P(Ā ∪ B),
   (d) P(Ā ∩ B̄),
   (e) P((A ∩ B̄) ∪ (B ∩ Ā)).

   Describe each event in words. (Use a Venn diagram.)

   Answer: (a) 1 − p_{AB}; (b) p_B − p_{AB}; (c) 1 − p_A + p_{AB}; (d) 1 − p_A − p_B + p_{AB}; (e) p_A + p_B − 2p_{AB}.

3. (i) Express P(E_1 ∪ E_2 ∪ E_3) in terms of the probabilities of E_1, E_2, E_3 and their intersections only. Illustrate with a sketch.

   (ii) Three types of fault can occur which lead to the rejection of a certain manufactured item. The probabilities of each of these faults (A, B and C) occurring are 0.1, 0.05 and 0.04 respectively. The three faults are known to be interrelated; the probability that A & B both occur is 0.04, A & C 0.02, and B & C 0.02. The probability that all three faults occur together is 0.01. What percentage of items are rejected?

   Answer: (i) P(E_1) + P(E_2) + P(E_3) − P(E_1 ∩ E_2) − P(E_1 ∩ E_3) − P(E_2 ∩ E_3) + P(E_1 ∩ E_2 ∩ E_3)
   (ii) P(A ∪ B ∪ C) = 0.1 + 0.05 + 0.04 − (0.04 + 0.02 + 0.02) + 0.01 = 0.12, i.e. 12% of items are rejected.

4. Two fair dice are rolled: 36 possible outcomes, each with probability 1/36. Let E_1 = {odd face 1st}, E_2 = {odd face 2nd}, E_3 = {one odd, one even}, so P(E_1) = 1/2, P(E_2) = 1/2, P(E_3) = 1/2. Show that E_1, E_2, E_3 are pairwise independent, but not mutually independent.

   Answer: P(E_2|E_1) = 1/2 = P(E_2), P(E_3|E_1) = 1/2 = P(E_3), P(E_3|E_2) = 1/2 = P(E_3), so E_1, E_2, E_3 are pairwise independent. But P(E_1 ∩ E_2 ∩ E_3) = 0 ≠ P(E_1)P(E_2)P(E_3), so E_1, E_2, E_3 are not mutually independent.

5. An engineering company uses a 'selling aptitude test' to aid it in the selection of its sales force. Past experience has shown that only 65% of all persons applying for a sales position achieved a classification of 'satisfactory' in actual selling, and of these 80% had passed the aptitude test. Only 30% of the 'unsatisfactory' persons had passed the test.

   What is the probability that a candidate would be a 'satisfactory' salesperson given that they had passed the aptitude test?

   Answer: A = pass aptitude test, S = satisfactory. P(S) = 0.65, P(A|S) = 0.8, P(A|S̄) = 0.3. Therefore P(A) = (0.65 × 0.8) + (0.35 × 0.3) = 0.625, so P(S|A) = P(S)P(A|S)/P(A) = (0.65 × 0.8)/0.625 = 0.832.


    1.2 Random variables and probability distributions

    1.2.1 Random variables

A random variable X is a real-valued function on the sample space Ω; that is, X : Ω → R. If P is a probability measure on (Ω, F) then the induced probability measure on R is called the probability distribution of X.
A discrete random variable X takes values x_1, x_2, ... with probabilities p(x_1), p(x_2), ..., where p(x) = pr(X = x) = P({ω : X(ω) = x}) is the probability mass function (pmf) of X. (E.g. X = place of horse in race, grade of egg.)

Example 4: (i) Toss a coin twice: outcomes HH, HT, TH, TT. The random variable X = number of heads takes values 0, 1, 2.
(ii) Roll two dice: X = total score. Probabilities for X are P(X = 2) = 1/36, P(X = 3) = 2/36, P(X = 4) = 3/36, etc.

Example 5: X takes values 1, 2, 3, 4, 5 with probabilities k, 2k, 3k, 4k, 5k. Calculate k and P(2 ≤ X ≤ 4).
Answer: 1 = Σ_{x=1}^5 P(x) = k(1 + 2 + 3 + 4 + 5) = 15k, so k = 1/15.
P(2 ≤ X ≤ 4) = P(2) + P(3) + P(4) = 2/15 + 3/15 + 4/15 = 3/5.

A continuous random variable X takes values over an interval. E.g. X = time over racecourse, weight of egg. Its probability density function (pdf) f(x) is defined by

pr(a < X < b) = ∫_a^b f(x) dx.

Note that f(x) ≥ 0 for all x, and ∫_{−∞}^∞ f(x) dx = 1.

Example 6: Let f(x) = k(1 − x²) on (−1, 1). Calculate k and pr(|X| > 1/2).
Answer: 1 = ∫_{−∞}^∞ f(x) dx = ∫_{−1}^1 k(1 − x²) dx = k[x − x³/3]_{−1}^1 = 4k/3 ⇒ k = 3/4.
P(|X| > 1/2) = 1 − P(−1/2 ≤ X ≤ 1/2) = 1 − ∫_{−1/2}^{1/2} k(1 − x²) dx = 1 − 11k/12 = 5/16.

A mixed discrete/continuous random variable is such that the probability is shared between discrete and continuous components, with Σ p(x) + ∫ f(x) dx = 1; e.g. rainfall on a given day, waiting time in a queue, flow in a pipe, contents of a reservoir.

The distribution function F of the random variable X is defined as

F(x) = pr(X ≤ x) = P({ω : X(ω) ≤ x}).

Thus F(−∞) = 0, F(∞) = 1, F is monotone increasing, and pr(a < X ≤ b) = F(b) − F(a).
Discrete case: F(x) = Σ_{u≤x} p(u).
Continuous case: F(x) = ∫_{−∞}^x f(u) du and F'(x) = f(x).

    1.2.2 Expectation

The expectation (or expected value or mean) of the random variable X is defined as

µ = E(X) = Σ x p(x) (X discrete),  or  ∫ x f(x) dx (X continuous).

The variance of X is σ² = Var(X) = E{(X − µ)²}. Equivalently σ² = E(X²) − {E(X)}² (exercise: prove).
σ is called the standard deviation.

Functions of X:
(i) E{h(X)} = Σ h(x) p(x) (X discrete),  or  ∫ h(x) f(x) dx (X continuous).
(ii) E(aX + b) = aE(X) + b, Var(aX + b) = a² Var(X).

Proof (for discrete X):
(i) h(X) takes values h(x_1), h(x_2), ... with probabilities p(x_1), p(x_2), ..., so, by definition, E{h(X)} = h(x_1)p(x_1) + h(x_2)p(x_2) + ··· = Σ h(x) p(x).
(ii) E[aX + b] = Σ (ax + b) p(x) = a Σ x p(x) + b Σ p(x) = aE[X] + b.
Var[aX + b] = E[{(aX + b) − E[aX + b]}²] = E[{aX + b − aE[X] − b}²] = E[a²(X − E[X])²] = a² Var[X].

Example 7: X = 0, 1, 2 with probabilities 1/4, 1/2, 1/4. Find E(X), E(X − 1), E(X²) and Var(X).
Answer: E[X] = 0 × 1/4 + 1 × 1/2 + 2 × 1/4 = 1.
E[X − 1] = E[X] − 1 = 0, E[X²] = 0² × 1/4 + 1² × 1/2 + 2² × 1/4 = 3/2.
Var[X] = E[X²] − E[X]² = 1/2.

Example 8: f(x) = k(1 + x)^{−4} on (0, ∞). Find k and hence obtain E(X), E{(1 + X)^{−1}}, E(X²) and Var(X).
Answer: 1 = k ∫_0^∞ (1 + x)^{−4} dx = k[−(1/3)(1 + x)^{−3}]_0^∞ = k/3 ⇒ k = 3.
E[X] = 3 ∫_0^∞ x(1 + x)^{−4} dx = 3 ∫_1^∞ (u − 1)u^{−4} du = 3[−(1/2)u^{−2} + (1/3)u^{−3}]_1^∞ = 3(1/2 − 1/3) = 1/2.
E[(1 + X)^{−1}] = 3 ∫_0^∞ (1 + x)^{−5} dx = 3[−(1/4)(1 + x)^{−4}]_0^∞ = 3/4.
E[X²] = 3 ∫_0^∞ x²(1 + x)^{−4} dx = 3 ∫_1^∞ (u − 1)²u^{−4} du = 3[−u^{−1} + u^{−2} − (1/3)u^{−3}]_1^∞ = 1.
Var[X] = E[X²] − E[X]² = 3/4.
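The integrals in Example 8 can be verified numerically; a minimal sketch (not part of the original notes), assuming scipy is available for quadrature:

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of Example 8: f(x) = 3(1 + x)^(-4) on (0, infinity).
f = lambda x: 3.0 * (1.0 + x) ** (-4)

total, _ = quad(f, 0, np.inf)                    # should be 1 (valid density)
EX, _ = quad(lambda x: x * f(x), 0, np.inf)      # E(X) = 1/2
EX2, _ = quad(lambda x: x**2 * f(x), 0, np.inf)  # E(X^2) = 1

print(total, EX, EX2, EX2 - EX**2)               # variance = 3/4
```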


    1.2.3 Self-study exercises

1. X takes values 0, 1, 2, 3 with probabilities 1/4, 1/5, 3/10, 1/4. Compute (as fractions) E(X), E(2X + 3), Var(X) and Var(2X + 3).

   Answer: E(X) = 31/20, E(2X + 3) = 2E(X) + 3 = 61/10, E(X²) = 73/20, so Var(X) = E(X²) − E(X)² = 499/400, Var(2X + 3) = 4 Var(X) = 499/100.

2. The random variable X has density function f(x) = kx(1 − x) on (0, 1), f(x) = 0 elsewhere. Calculate k and sketch f(x). Compute the mean and variance of X, and pr(0.3 ≤ X ≤ 0.6).

   Answer: k = 6, E(X) = 1/2, Var(X) = 1/20, pr(0.3 ≤ X ≤ 0.6) = 0.432.


    Chapter 2

    Random variables and distributions

    2.1 Transformations

Suppose that X has distribution function F_X(x) and that the distribution function F_Y(y) of Y = h(X) is required, where h is a strictly increasing function. Then

F_Y(y) = pr(Y ≤ y) = pr(h(X) ≤ y) = pr(X ≤ x) = F_X(x),

where x ≡ x(y) = h^{−1}(y). If X is continuous and h is differentiable, then it follows that Y has density

f_Y(y) = dF_Y(y)/dy = dF_X(x)/dy = f_X(x) dx/dy.

On the other hand, if h is strictly decreasing then

F_Y(y) = pr(Y ≤ y) = pr(h(X) ≤ y) = pr(X ≥ x) = 1 − F_X(x),

which yields f_Y(y) = −f_X(x)(dx/dy). Both formulae are covered by

f_Y(y) = f_X(x) |dx/dy|.


Example 9: Suppose that X has pdf f_X(x) = 2e^{−2x} on (0, ∞). Obtain the pdf of Y = log X.

Probability integral transform. Let X be a continuous random variable with distribution function F(x). Then Y = F(X) is uniformly distributed on (0, 1).
Proof. First note that 0 ≤ Y ≤ 1. Let 0 ≤ y ≤ 1; then

pr(Y ≤ y) = pr(F(X) ≤ y) = pr(X ≤ F^{−1}(y)) = F(F^{−1}(y)) = y,

so Y has pdf f(y) = 1 on (0, 1) (by differentiation), which is the density of the uniform distribution on (0, 1).
This result has an important application to the simulation of random variables:
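The application alluded to here is filled in during lectures. As an illustration of the idea (not part of the original notes): if F is the exponential(λ) distribution function then F^{−1}(u) = −log(1 − u)/λ, so uniform draws can be turned into exponential draws.

```python
import math
import random

def sample_exponential(lam, n):
    """Inverse-cdf (probability integral transform) sampler for exponential(lam):
    if U ~ uniform(0, 1) then X = -log(1 - U)/lam has cdf 1 - exp(-lam*x)."""
    return [-math.log(1.0 - random.random()) / lam for _ in range(n)]

draws = sample_exponential(lam=2.0, n=100_000)
print(sum(draws) / len(draws))   # close to the mean 1/lam = 0.5
```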

    2.1.1 Self-study exercises

1. X takes values 1, 2, 3, 4 with probabilities 1/10, 1/5, 3/10, 2/5, and Y = (X − 2)².

   (i) Find E(Y) and Var(Y) using the formula for E{h(X)}.
   (ii) Calculate the pmf of Y and use it to calculate E(Y) and Var(Y) directly.


2. The random variable X has pdf f(x) = 1/3, x = 1, 2, 3, zero elsewhere. Find the pdf of Y = 2X + 1.

3. The random variable X has pdf f(x) = e^{−x} on (0, ∞). Obtain the pdf of Y = e^X.

4. Let X have the pdf f(x) = (1/2)^x, x = 1, 2, 3, ..., zero elsewhere. Find the pdf of Y = X³.

    2.2 Some standard discrete distributions

    2.2.1 Binomial distribution

Consider a sequence of independent trials in each of which there are only two possible results, 'success', with probability π, or 'failure', with probability 1 − π (independent Bernoulli trials).

Outcomes can be represented as binary sequences, with 1 for success and 0 for failure; e.g. 110001 has probability ππ(1 − π)(1 − π)(1 − π)π, since the trials are independent.

Let the random variable X be the number of successes in n trials, with n fixed. The probability of a particular sequence of r 1's and n − r 0's is π^r(1 − π)^{n−r}, and the event {X = r} contains C(n, r) such sequences. Hence

p(r) = pr(X = r) = C(n, r) π^r (1 − π)^{n−r},  r = 0, 1, ..., n.

This is the pmf of the binomial(n, π) distribution. The name comes from the binomial theorem

{π + (1 − π)}^n = Σ_{r=0}^n C(n, r) π^r (1 − π)^{n−r},

from which Σ_r p(r) = 1 follows.
The mean is µ = nπ:


The variance is σ² = nπ(1 − π) (see exercise 3).

Example 10: A biased coin with pr(head) = 2/3 is tossed five times. Calculate p(r).
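Example 10 is worked in lectures; as an aside (not part of the original notes), the pmf can also be tabulated directly from the formula, here with n = 5 and π = 2/3 as in the example:

```python
from math import comb

n, p = 5, 2/3   # Example 10: five tosses, pr(head) = 2/3

# Binomial pmf: p(r) = C(n, r) * p^r * (1 - p)^(n - r)
pmf = {r: comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)}

for r, pr in pmf.items():
    print(r, round(pr, 4))

print(sum(pmf.values()))                      # 1.0, by the binomial theorem
print(sum(r * pr for r, pr in pmf.items()))   # mean n*p = 10/3
```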

    2.2.2 Geometric distribution

Suppose now that, instead of a fixed number of Bernoulli trials, one continues until a success is achieved, so that the number of trials, N, is now a random variable. Then N takes the value n if and only if the previous (n − 1) trials result in failures and the n-th trial results in a success. Thus

p(n) = pr(N = n) = (1 − π)^{n−1} π,  n = 1, 2, ....

This is the pmf of the geometric(π) distribution: the probabilities are in geometric progression. Note that the sum of the probabilities over n = 1, 2, ... is 1.

The mean is µ = 1/π:


The variance is σ² = (1 − π)/π² (see exercise 4).

E.g. Toss a biased coin with pr(head) = 2/3. Then, on average, it takes three tosses to get a tail.

    2.2.3 Poisson distribution

The pmf of the Poisson(λ) distribution is defined as

p(r) = e^{−λ} λ^r / r!,  r = 0, 1, 2, ...,

where λ > 0. Note that the sum of the probabilities over r = 0, 1, 2, ... is 1 (exponential series).
The mean is µ = λ:

The variance is σ² = λ (see exercise 6).
The Poisson distribution arises in various contexts, one being the limit of a binomial(n, π) as n → ∞ and π → 0 with nπ = λ fixed.

Example 11: (Random events in time.) Cars are recorded as they pass a checkpoint. The probability π that a car is level with the checkpoint at any given instant is very small, but the number n of such instants in a given time period is large. Hence X_t, the number of cars passing the checkpoint during a time interval of t minutes, can be modelled as Poisson with mean proportional to t. For example, if the average rate is two cars per minute, find the probability of exactly 3 cars in 5 minutes.
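Example 11 is completed in lectures; as a numerical aside (not part of the original notes), and assuming the mean is λ = 2 × 5 = 10 as the statement suggests, the required probability is a single pmf evaluation:

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """Poisson pmf: p(r) = e^(-lam) * lam^r / r!"""
    return exp(-lam) * lam**r / factorial(r)

lam = 2 * 5   # two cars per minute over a 5-minute interval
print(poisson_pmf(3, lam))   # pr(exactly 3 cars), roughly 0.0076
```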

    2.2.4 Self-study exercises

1. In a large consignment of widgets 5% are defective. What is the probability of getting one or two defectives in a four-pack?

2. X is binomial with mean 2 and variance 1. Compute pr(X ≤ 1).

    3. Derive the variance of the binomial (n, π ) distribution.

   [Hint: find E{X(X − 1)}.]

4. Derive the variance of the geometric(π) distribution.

   [Hint: find E{X(X − 1)}.]

5. A leaflet contains one thousand words and the probability that any one word contains a misprint is 0.005. Use the Poisson distribution to estimate the probability of 2 or fewer misprints.

6. Derive the variance of the Poisson(λ) distribution.

   [Hint: find E{X(X − 1)}.]


    2.3 Some standard continuous distributions

    2.3.1 Uniform distribution

    The pdf of the uniform (α, β ) distribution is

f(x) = (β − α)^{−1},  α < x < β.

The mean is µ = (β + α)/2:

The variance is σ² = (β − α)²/12 (see exercise 1).
Application. Simulation of continuous random variables via the probability integral transform: see Section 2.1.

    2.3.2 Exponential distribution

    The pdf of the exponential (λ) distribution is

f(x) = λe^{−λx},  x > 0,

where λ > 0. The distribution function is F(x) = ∫_0^x λe^{−λu} du = 1 − e^{−λx} (verify).

The mean is µ = 1/λ:


The variance is σ² = 1/λ² (see exercise 4).

Lack of memory property.

pr(X > a + b | X > a) = pr(X > b)

Proof:

For example, if the lifetime of a component is exponentially distributed, then the fact that it has lasted for 100 hours does not affect its chances of failing during the next 100 hours. That is, the component is not subject to ageing.

Application to random events in time.
Example: cars passing a checkpoint. The distribution of the waiting time, T, for the first event can be obtained as follows:

pr(T > t) = pr(N_t = 0) = e^{−λt},

since N_t, the number of events occurring during the time interval (0, t), has a Poisson distribution with mean λt. Hence T has distribution function F(t) = 1 − e^{−λt}, that of the exponential(λ) distribution.

    2.3.3 Pareto distribution

    The Pareto (α, β ) distribution has pdf

f(x) = (α/β)(1 + x/β)^{−(α+1)},  x > 0,

where α > 0 and β > 0. The distribution function is F(x) = 1 − (1 + x/β)^{−α} (verify).


The mean is µ = β/(α − 1) for α > 1:

The variance is σ² = αβ²/{(α − 1)²(α − 2)} for α > 2.

    2.3.4 Self-study exercises

    1. Obtain the variance of the uniform (α, β ) distribution.

2. The lifetime of a valve has an exponential distribution with mean 350 hours. What proportion of valves will last 400 hours or longer? For how many hours should the valves be guaranteed so that only 1% are returned under guarantee?

3. A machine suffers random breakdowns at a rate of three per day. Given that it is functioning at 10am, what is the probability that

   (i) no breakdown occurs before noon?

   (ii) the first breakdown occurs between 12pm and 1pm?

4. Obtain the variance of the exponential(λ) distribution.

5. The random variable X has the Pareto distribution with α = 3, β = 1. Find the probability that X exceeds µ + 2σ, where µ, σ are respectively the mean and standard deviation of X.


    2.4 The normal (Gaussian) distribution

    2.4.1 Normal distribution

The normal distribution is the most important distribution in Statistics, for both theoretical and practical reasons. Its pdf is

f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},  −∞ < x < ∞.

The parameters µ and σ² are the mean and variance respectively. The distribution is denoted by N(µ, σ²).
Mean:

The importance of the normal distribution follows from its use as an approximation in various statistical methods (a consequence of the Central Limit Theorem: see Section 3.4.2), its convenience for theoretical manipulation, and its application to describe observed data.

Standard normal distribution
The standard normal distribution is N(0, 1), for which the distribution function has the special notation Φ(x). Thus

Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−u²/2} du.

The function Φ is tabulated widely (e.g. New Cambridge Statistical Tables). Useful values are Φ(1.64) ≈ 0.95, Φ(1.96) = 0.975.

Example 12: Suppose that X is N(0, 1) and Y is N(2, 4). Use tables to calculate pr(X < 1), pr(X < −1), pr(−1.5 < X < −0.5), pr(Y < 1) and pr(Y² > 5Y − 6).
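Where tables are unavailable, Φ can be computed from the error function; a small sketch (not part of the original notes) using only Python's standard library:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(Phi(1.96))              # 0.975, matching the tabulated value quoted above
print(Phi(1.0))               # pr(X < 1) for X ~ N(0, 1)
print(Phi(-0.5) - Phi(-1.5))  # pr(-1.5 < X < -0.5)
```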


    2.4.2 Properties

(i) If X is N(µ, σ²) then aX + b is N(aµ + b, a²σ²). In particular, the standardized variate (X − µ)/σ is N(0, 1).

(ii) If X_1 is N(µ_1, σ_1²), X_2 is N(µ_2, σ_2²) and X_1 and X_2 are independent, then X_1 + X_2 is N(µ_1 + µ_2, σ_1² + σ_2²).
[Hence, from property (i), the distribution of X_1 − X_2 is N(µ_1 − µ_2, σ_1² + σ_2²).]

(iii) If X_i, i = 1, ..., n, are independent N(µ_i, σ_i²), then Σ_i X_i is N(Σ_i µ_i, Σ_i σ_i²).

(iv) The moment generating function (see Section 2.6.3) of N(µ, σ²) is M(z) = E(e^{zX}) = e^{µz + σ²z²/2}.
(Properties (i)-(iii) are easily proved via mgfs: see Section 2.6.3.)

(v) Central moments of N(µ, σ²). Let µ_r = E{(X − µ)^r}, the r-th central moment of X. Then

µ_r = 0 for r odd,  µ_r = (σ/√2)^r r!/(r/2)! for r even.

Note that µ_2 = σ², the variance of X.

Sampling distribution of the sample mean
Let X_1, ..., X_n be independently and identically distributed (iid) as N(µ, σ²). Then the distribution of X̄ = n^{−1} Σ X_i is N(µ, n^{−1}σ²). This is the sampling distribution of the sample mean, a result of fundamental importance in Statistics.
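The sampling distribution result can be illustrated by simulation; a minimal sketch (not part of the original notes) with arbitrary illustrative values of µ, σ and n:

```python
import random
import statistics

# For iid N(mu, sigma^2) data, the sample mean has mean mu and variance sigma^2 / n.
mu, sigma, n, reps = 2.0, 3.0, 25, 20_000

means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))      # close to mu = 2.0
print(statistics.variance(means))   # close to sigma^2 / n = 0.36
```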


    2.5 Bivariate distributions

2.5.1 Definitions and notation

Suppose that X_1, X_2 are two random variables defined on the same probability space (Ω, F, P). Then P induces a joint distribution for X_1, X_2. The joint distribution function is defined as

F(x_1, x_2) = P({ω : X_1(ω) ≤ x_1, X_2(ω) ≤ x_2}) = pr(X_1 ≤ x_1, X_2 ≤ x_2).

In the discrete case the joint pmf is p(x_1, x_2) = pr(X_1 = x_1, X_2 = x_2). In the continuous case, the joint pdf is f(x_1, x_2) = ∂²F(x_1, x_2)/∂x_1∂x_2.

Example 13: (discrete) Two biased coins are tossed. Score heads = 1 (with probability π), tails = 0 (with probability 1 − π). Let X_1 = sum of scores, X_2 = difference of scores (1st − 2nd). The tables below show

(i) the possible values of X_1, X_2 and their probabilities,
(ii) the joint probability table for X_1, X_2.

(i)
Outcome   00   01   10   11
X_1
X_2
Prob

(ii)
           X_2
          −1    0    1
X_1   0
      1
      2

Example 14: (continuous) Suppose X_1 and X_2 have joint pdf f(x_1, x_2) = k(1 − x_1 x_2²) on (0, 1)². Obtain the value of k.


    2.5.2 Marginal distributions

These follow from the law of total probability.
Discrete case. Marginal probability mass functions

p_1(x_1) = pr(X_1 = x_1) = Σ_{x_2} p(x_1, x_2)  and  p_2(x_2) = pr(X_2 = x_2) = Σ_{x_1} p(x_1, x_2).

Continuous case. Marginal probability density functions

f_1(x_1) = ∫ f(x_1, x_2) dx_2  and  f_2(x_2) = ∫ f(x_1, x_2) dx_1.

Marginal means and variances.
µ_1 = E(X_1) = Σ x_1 p_1(x_1) (discrete) or ∫ x_1 f_1(x_1) dx_1 (continuous),
σ_1² = Var(X_1) = E{(X_1 − µ_1)²} = E(X_1²) − µ_1².
Likewise µ_2 and σ_2².

    2.5.3 Conditional distributions

These follow from the definition of conditional probability.
Discrete case. Conditional probability mass function of X_1 given X_2 is

p_1(x_1|X_2 = x_2) = pr(X_1 = x_1|X_2 = x_2) = pr(X_1 = x_1, X_2 = x_2)/pr(X_2 = x_2) = p(x_1, x_2)/p_2(x_2).

Similarly

p_2(x_2|X_1 = x_1) = p(x_1, x_2)/p_1(x_1).

Continuous case. Conditional probability density function of X_1 given X_2 is

f_1(x_1|X_2 = x_2) = f(x_1, x_2)/f_2(x_2).

Similarly

f_2(x_2|X_1 = x_1) = f(x_1, x_2)/f_1(x_1).

Independence. X_1 and X_2 are said to be independent if F(x_1, x_2) = F_1(x_1)F_2(x_2). Equivalently, p(x_1, x_2) = p_1(x_1)p_2(x_2) (discrete), or f(x_1, x_2) = f_1(x_1)f_2(x_2) (continuous).


Example 15: Suppose that R and N have a joint distribution in which R|N is binomial(N, π) and N is Poisson(λ). Show that R is Poisson(λπ).
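Example 15 is proved in lectures; as an empirical illustration only (not part of the original notes), the two-stage experiment can be simulated and the mean and variance of R compared with λπ, using arbitrary illustrative values of λ and π:

```python
import random
import statistics

def poisson_draw(lam):
    """Poisson(lam) draw: count exponential(lam) inter-arrival times within (0, 1)."""
    t, n = random.expovariate(lam), 0
    while t < 1.0:
        n += 1
        t += random.expovariate(lam)
    return n

lam, pi, reps = 4.0, 0.3, 50_000
r_draws = []
for _ in range(reps):
    n = poisson_draw(lam)                                         # N ~ Poisson(lam)
    r_draws.append(sum(random.random() < pi for _ in range(n)))   # R | N ~ binomial(N, pi)

print(statistics.fmean(r_draws))      # close to lam * pi = 1.2
print(statistics.variance(r_draws))   # also close to 1.2, as for a Poisson variable
```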

    2.5.4 Covariance and correlation

The covariance between X_1 and X_2 is defined as

σ_{12} = Cov(X_1, X_2) = E{(X_1 − µ_1)(X_2 − µ_2)} = E(X_1 X_2) − µ_1 µ_2,

where E(X_1 X_2) = Σ x_1 x_2 p(x_1, x_2) (discrete) or ∫∫ x_1 x_2 f(x_1, x_2) dx_1 dx_2 (continuous).
The correlation between X_1 and X_2 is

ρ = Corr(X_1, X_2) = σ_{12}/(σ_1 σ_2).

Example 13: (continued)
Marginal distributions:
x_1 = 0, 1, 2 with p_1(x_1) =
x_2 = −1, 0, 1 with p_2(x_2) =
Marginal means:
µ_1 = Σ x_1 p_1(x_1) =
µ_2 = Σ x_2 p_2(x_2) =
Variances:


σ_1² = Σ x_1² p_1(x_1) − µ_1² =
σ_2² = Σ x_2² p_2(x_2) − µ_2² =

Conditional distributions: e.g.
p(x_1|X_2 = 0):   x_1 = 0:        x_1 = 2:

Independence: e.g. p(0, 1) = 0 but p_1(0)p_2(1) ≠ 0, so X_1, X_2 are not independent.

Covariance: σ_{12} = Σ x_1 x_2 p(x_1, x_2) − µ_1 µ_2 =

Example 14: (continued)
Marginal distributions:
f_1(x_1) = ∫_0^1 k(1 − x_1 x_2²) dx_2 =
f_2(x_2) = ∫_0^1 k(1 − x_1 x_2²) dx_1 =
Marginal means:
µ_1 = ∫_0^1 x_1 f_1(x_1) dx_1 =
µ_2 = ∫_0^1 x_2 f_2(x_2) dx_2 =
Variances:
σ_1² = ∫_0^1 x_1² f_1(x_1) dx_1 − µ_1² =
σ_2² = ∫_0^1 x_2² f_2(x_2) dx_2 − µ_2² =

Conditional distributions: e.g.
f(x_2|X_1 = 1/3) =

Independence:
f(x_1, x_2) = k(1 − x_1 x_2²), which does not factorise into f_1(x_1)f_2(x_2), so X_1, X_2 are not independent.

Covariance:
σ_{12} = ∫∫ x_1 x_2 f(x_1, x_2) dx_1 dx_2 − µ_1 µ_2 =


Properties

(i) E(aX_1 + bX_2) = aµ_1 + bµ_2, Var(aX_1 + bX_2) = a²σ_1² + 2abσ_{12} + b²σ_2²,
Cov(aX_1 + b, cX_2 + d) = acσ_{12}, Corr(aX_1 + b, cX_2 + d) = Corr(X_1, X_2) provided ac > 0
(note: invariance under linear transformation)
Proof:

(ii) X_1, X_2 independent ⇒ Cov(X_1, X_2) = 0. The converse is false.
Proof:

(iii) −1 ≤ Corr(X_1, X_2) ≤ +1, with equality if and only if X_1, X_2 are linearly dependent.
Proof:


    2.6 Generating functions

    2.6.1 General

The generating function for a sequence (a_n : n ≥ 0) is A(z) = a_0 + a_1 z + a_2 z² + ··· = Σ_{n=0}^∞ a_n z^n. Here z is a dummy variable. The definition is useful only if the series converges. The idea is to replace the sequence (a_n) by the function A(z), which may be easier to analyse than the original sequence.

Examples:

(i) If a_n = 1 for n = 0, 1, 2, ..., then A(z) = (1 − z)^{−1} for |z| < 1 (geometric series).

(ii) If a_n = C(m, n) for n = 0, 1, ..., m, and a_n = 0 for n > m, then A(z) = (1 + z)^m (binomial series).

    2.6.2 Probability generating function

Let (p_n) be the pmf of some discrete random variable X, so p_n = pr(X = n) ≥ 0 and Σ_n p_n = 1. Define the probability generating function (pgf) of X by

P(z) = E(z^X) = Σ_n p_n z^n.

Properties
(i) |P(z)| ≤ 1 for |z| ≤ 1.
Proof:

(ii) µ = E(X) = P'(1).
Proof:


(iii) σ² = Var(X) = P''(1) + P'(1) − {P'(1)}².
Proof:

(iv) Let X and Y be independent random variables with pgfs P_X and P_Y respectively. Then the pgf of X + Y is given by P_{X+Y}(z) = P_X(z)P_Y(z).
Proof:

Example 16: (i) Find the pgf of the Poisson(λ) distribution.
(ii) Let X_1, X_2 be independent Poisson random variables with parameters λ_1, λ_2 respectively. Obtain the distribution of X_1 + X_2.
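Example 16 is worked in lectures. As a numerical aside (not part of the original notes), the Poisson pgf has the closed form exp{−λ(1 − z)} (the expression quoted in the Poisson limit proof of Section 3.4.2), which can be checked against a truncated version of the defining sum:

```python
from math import exp, factorial

# Compare the truncated sum  sum_n p_n z^n  with the closed form exp(-lam*(1 - z)).
lam, z = 2.5, 0.7
truncated = sum(exp(-lam) * lam**n / factorial(n) * z**n for n in range(60))

print(truncated)             # ~ 0.4724
print(exp(-lam * (1 - z)))   # same value
```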

    2.6.3 Moment generating function

The moment generating function (mgf) is defined as

M(z) = E(e^{zX}).

The pgf tends to be used more for discrete distributions, and the mgf for continuous ones, although note that the two are related by M(z) = P(e^z).


Properties

(i) µ = E(X) = M'(0), σ² = Var(X) = M''(0) − µ².
Proof:

(ii) Let X and Y be independent random variables with mgfs M_X(z), M_Y(z) respectively. Then the mgf of X + Y is given by M_{X+Y}(z) = M_X(z)M_Y(z).
Proof:

Normal distribution. We prove properties (i)-(iv) of Section 2.4.2.


    2.6.4 Self-study exercises

1. Show that the pgf of the binomial(n, π) distribution is {πz + (1 − π)}^n.

2. (Zero-truncated Poisson distribution) Find the pgf of the discrete distribution with pmf p(r) = e^{−λ}λ^r/{r!(1 − e^{−λ})} for r = 1, 2, .... Deduce the mean and variance.

3. The random variable X has density f(x) = k(1 + x)e^{−λx} on (0, ∞) with λ > 0. Find the value of k. Show that the moment generating function is M(z) = k{(z − λ)^{−2} − (z − λ)^{−1}}. Use it to calculate the mean and standard deviation of X.


    Chapter 3

    Further distribution theory

    3.1 Multivariate distributions

Let X_1, ..., X_p be p real-valued random variables on (Ω, F) and consider the joint distribution of X_1, ..., X_p. Equivalently, consider the distribution of the random vector

X = (X_1, X_2, ..., X_p)^T.

3.1.1 Definitions

The joint distribution function:

F(x) = pr(X ≤ x) = pr(X_1 ≤ x_1, ..., X_p ≤ x_p).

The joint probability mass function (pmf) (discrete case):

p(x) = pr(X = x) = pr(X_1 = x_1, ..., X_p = x_p).

The joint probability density function (pdf) f(x) (continuous case) is such that

pr(X ∈ A) = ∫_A f(x) dx.


The marginal distributions are those of the individual components:

F_j(x_j) = pr(X_j ≤ x_j) = F(∞, ..., x_j, ..., ∞).

The conditional distributions are those of one component given another:

F(x_j|x_k) = pr(X_j ≤ x_j|X_k = x_k).

The X_j's are independent if F(x) = ∏_j F_j(x_j). Equivalently, p(x) = ∏_j p_j(x_j) (discrete case), or f(x) = ∏_j f_j(x_j) (continuous case).

Means: µ_j = E(X_j)
Variances: σ_j² = Var(X_j) = E{(X_j − µ_j)²} = E(X_j²) − µ_j²
Covariances: σ_{jk} = Cov(X_j, X_k) = E{(X_j − µ_j)(X_k − µ_k)} = E(X_j X_k) − µ_j µ_k
Correlations: ρ_{jk} = Corr(X_j, X_k) = σ_{jk}/(σ_j σ_k)

    3.1.2 Mean and covariance matrix

The mean vector of X is µ = E(X) = (µ_1, µ_2, ..., µ_p)^T.

The covariance matrix (variance-covariance matrix, dispersion matrix) of X is the p × p matrix with (j, k)th element σ_{jk}:

Σ = [ σ_{11}  σ_{12}  ···  σ_{1p}
      σ_{21}  σ_{22}  ···  σ_{2p}
        ·       ·             ·
      σ_{p1}  σ_{p2}  ···  σ_{pp} ]

Since the (i, j)th element of (X − µ)(X − µ)^T is (X_i − µ_i)(X_j − µ_j), we see that Σ = E{(X − µ)(X − µ)^T} = E(XX^T) − µµ^T.

    3.1.3 Properties

Let X have mean µ and covariance matrix Σ. Let a, b be p-vectors and A be a q × p matrix. Then

(i) E(a^T X) = a^T µ

(ii) Var(a^T X) = a^T Σ a. It follows that Σ is positive semi-definite.

(iii) Cov(a^T X, b^T X) = a^T Σ b


(iv) Cov(AX) = AΣA^T

(v) E(X^T AX) = trace(AΣ) + µ^T Aµ

(A numerical illustration of property (iv) is sketched after the self-study exercises below.)

Proof:


    3.1.4 Self-study exercises

1. Let X_1 = I_1 Y, X_2 = I_2 Y, where I_1, I_2 and Y are independent and I_1 and I_2 take values ±1 each with probability 1/2. Show that E(X_j) = 0, Var(X_j) = E(Y²), Cov(X_1, X_2) = 0.

2. Verify that E(X_1 + ··· + X_p) = µ_1 + ··· + µ_p and Var(X_1 + ··· + X_p) = Σ_{ij} σ_{ij}, where µ_i = E(X_i) and σ_{ij} = Cov(X_i, X_j). Suppose now that the X_i's are iid. Verify that X̄ has mean µ and variance σ²/p, where µ = E(X_i) and σ² = Var(X_i).

    3.2 Transformations

    3.2.1 The univariate case

Problem: to find the distribution of Y = h(X) from the known distribution of X. The case where h is a one-to-one function was treated in Section 2.1. When h is many-to-one we use the following generalised formulae:

Discrete case: p_Y(y) = Σ p_X(x)

Continuous case: f_Y(y) = Σ f_X(x) |dx/dy|

where in both cases the summations are over the set {x : h(x) = y}. That is, we add up the contributions to the mass or density at y from all x values which map to y.

Example 17: (discrete) Suppose p_X(x) = p_x for x = 0, 1, 2, 3, 4, 5 and let Y = (X − 2)². Obtain the pmf of Y.


Example 18: (continuous) Suppose f_X(x) = 2x on (0, 1) and let Y = (X − 1/2)². Obtain the pdf of Y.

    3.2.2 The multivariate case

Problem: to find the distribution of Y = h(X), where Y is s × 1 and X is r × 1, from the known distribution of X.
Discrete case: p_Y(y) = Σ p_X(x) with the summation over the set {x : h(x) = y}.
Continuous case:
Case (i): h is a one-to-one transformation (so that s = r). Then the rule is

f_Y(y) = f_X(x(y)) |dx/dy|_+,

where dx/dy is the Jacobian of the transformation, with (dx/dy)_{ij} = ∂x_i/∂y_j, and |·|_+ denotes the absolute value of the determinant.

Case (ii): s < r. First transform the s-vector Y to an r-vector Y*, where Y*_i = Y_i, i = 1, ..., s, and Y*_i, i = s + 1, ..., r, are chosen for convenience. Now find the density of Y* as above and then integrate out Y*_{s+1}, ..., Y*_r to obtain the marginal density of Y, as required.

Case (iii): s = r but h(·) is not monotonic. Then there will generally be more than one value of x corresponding to a given y and we need to add the probability contributions from all relevant x's.


Example 19: (linear transformation) Suppose that Y = AX, where A is an r × r nonsingular matrix. Then f_Y(y) = f_X(A^{−1}y) |A|_+^{−1}.

Example 20: Suppose f_X(x) = e^{−x_1−x_2} on (0, ∞)². Obtain the density of Y_1 = (X_1 + X_2)/2.

Sums and products. If X_1 and X_2 are independent random variables with densities f_1 and f_2, then
(i) X_1 + X_2 has density g(u) = ∫ f_1(u − v) f_2(v) dv (convolution integral);
(ii) X_1 X_2 has density g(u) = ∫ f_1(u/v) f_2(v) |v|^{−1} dv.
Proof:
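The proof is given in lectures. Separately, as an empirical illustration of (i) (not part of the original notes): the convolution of two exponential(λ) densities is λ²u e^{−λu} (a gamma density, cf. Section 3.6.2), so the sum should have mean 2/λ and variance 2/λ²:

```python
import random
import statistics

lam, reps = 1.5, 100_000
sums = [random.expovariate(lam) + random.expovariate(lam) for _ in range(reps)]

print(statistics.fmean(sums))      # close to 2/lam = 1.333...
print(statistics.variance(sums))   # close to 2/lam^2 = 0.888...
```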


    3.2.3 Self-study exercises

1. If f_X(x) = (2/9)(x + 1) on (−1, 2) and Y = X², find f_Y(y).

2. If X has density f(x), calculate the density g(y) of Y = X² when

   (i) f(x) = 2xe^{−x²} on (0, ∞);
   (ii) f(x) = (1/2)(1 + x) on |x| ≤ 1;
   (iii) f(x) = 1/2 on −1/2 ≤ x ≤ 3/2.

3. Let X_1 and X_2 be independent exponential(λ), and let Y_1 = X_1 + X_2 and Y_2 = X_1/X_2. Show that Y_1 and Y_2 are independent and find their densities.

    3.3 Moments, generating functions and inequalities

    3.3.1 Moment generating function

The moment generating function of the random vector X is defined as

M(z) = E(e^{z^T X}).

Here z^T X = Σ_j z_j X_j.

Properties
Suppose X has mgf M(z). Then
(i) X + a has mgf e^{a^T z} M(z) and aX has mgf M(az).

(ii) The mgf of Σ_{j=1}^k X_j is M(z, ..., z).

(iii) If X_1, ..., X_k are independent random variables with mgfs M_j(z_j), j = 1, ..., k, then the mgf of X = (X_1, ..., X_k)^T is M(z) = ∏_{j=1}^k M_j(z_j), the product of the individual mgfs.

Proof:


    3.3.2 Cumulant generating function

The cumulant generating function (cgf) of X is defined as K(z) = log M(z). The cumulants of X are defined as the coefficients κ_j in the power series expansion K(z) = Σ_{j=1}^∞ κ_j z^j/j!.

The first two cumulants are

κ_1 = µ = E(X),  κ_2 = σ² = Var(X).

Similarly, the third and fourth cumulants are found to be κ_3 = E(X − µ)³, κ_4 = E(X − µ)⁴ − 3σ⁴. These are used to define the skewness, γ_1 = κ_3/κ_2^{3/2}, and the kurtosis, γ_2 = κ_4/κ_2².

Cumulants of the sample mean. Suppose that X_1, ..., X_n is a random sample from a distribution with cgf K(z) and cumulants κ_j. Then the mgf of X̄ = n^{−1} Σ_{j=1}^n X_j is {M(n^{−1}z)}^n, so the cgf is

log{M(n^{−1}z)}^n = nK(n^{−1}z) = n Σ_{j=1}^∞ κ_j (n^{−1}z)^j/j!.

Hence the j-th cumulant of X̄ is κ_j/n^{j−1} and it follows that X̄ has mean κ_1 = µ, variance κ_2/n = σ²/n, skewness (κ_3/n²)/(κ_2/n)^{3/2} = γ_1/n^{1/2} and kurtosis (κ_4/n³)/(κ_2/n)² = γ_2/n.

    3.3.3 Some useful inequalities

Markov's inequality
Let X be any random variable with finite mean. Then for all a > 0,

pr(|X| ≥ a) ≤ E|X|/a.


    Proof:

Cauchy-Schwarz inequality
Let X, Y be any two random variables with finite variances. Then

{E(XY)}² ≤ E(X²)E(Y²).

Proof:

Jensen's inequality
If u(x) is a convex function then

E{u(X)} ≥ u(E(X)).

Note that u(·) is convex if the curve y = u(x) has a supporting line underneath at each point, e.g. bowl-shaped.

Proof:


    Examples

1. Chebyshev's inequality.
   Let Y be any random variable with finite variance. Then for all a > 0,

   pr(|Y − µ| ≥ a) ≤ σ²/a².

2. Correlation inequality.

   {Cov(X, Y)}² ≤ σ_X² σ_Y² (which implies that |Corr(X, Y)| ≤ 1).

3. |E(X)| ≤ E(|X|).
   [It follows that |E{h(Y)}| ≤ E{|h(Y)|} for any function h(·).]


4. E{(|X|^s)^{r/s}} ≥ {E(|X|^s)}^{r/s} for r ≥ s > 0.
   [Thus {E(|X|^r)}^{1/r} ≥ {E(|X|^s)}^{1/s} and it follows that {E(|X|^r)}^{1/r} is an increasing function of r.]

5. A cumulant generating function is a convex function; i.e. K''(z) ≥ 0.
   Proof. K(z) = log M(z), so K' = M'/M and K'' = {MM'' − (M')²}/M². Hence M(z)²K''(z) = E(e^{zX})E(X²e^{zX}) − {E(Xe^{zX})}² ≥ 0, by the Cauchy-Schwarz inequality (on writing Xe^{zX} = (e^{zX/2})(Xe^{zX/2})).

    3.3.4 Self-study exercises

1. Find the joint mgf M(z) of (X, Y) when the pdf is f(x, y) = (1/2)λ³(x + y)e^{−λ(x+y)} on (0, ∞)². Deduce the mgf of U = X + Y.

2. Find all the cumulants of the N(µ, σ²) distribution.
   [You may assume the mgf e^{µz + σ²z²/2}.]

3. Suppose that X is such that E(X) = 3 and E(X²) = 13. Use Chebyshev's inequality to determine a lower bound for pr(−2 < X < 8).

4. Show that {E(|X|)}^{−1} ≤ E(|X|^{−1}).

    3.4 Some limit theorems

    3.4.1 Modes of convergence of random variables

Let X_1, X_2, ... be a sequence of random variables. There are a number of alternative modes of convergence of (X_n) to a limit random variable X. Suppose first that X_1, X_2, ... and X are all defined on the same sample space Ω.


Convergence in probability
X_n →^p X if pr(|X_n − X| > ε) → 0 as n → ∞ for all ε > 0. Equivalently, pr(|X_n − X| ≤ ε) → 1. Often X = c, a constant.

Almost sure convergence
X_n →^{a.s.} X if pr(X_n → X) = 1. Again, often X = c. Also referred to as convergence 'with probability one'.

Almost sure convergence is a stronger property than convergence in probability, i.e. a.s. ⇒ p, but p ⇏ a.s.

Example 21: Consider independent Bernoulli trials with constant probability of success 1/2. A typical sequence would be 01001001110101100010... Here the first 20 trials resulted in 9 successes, giving an observed proportion of X̄_20 = 0.45 successes. Intuitively, as we increase n we would expect this proportion to get closer to 1/2. However, this will not be the case for all sequences: for example, the sequence 11111111111111111111 has exactly the same probability as the earlier sequence, but X̄_20 = 1.

It can be shown that the total probability of all infinite sequences for which the proportion of successes does not converge to 1/2 is zero; i.e. pr(X̄_n → 1/2) = 1, so X̄_n →^{a.s.} 1/2 (and hence also X̄_n →^p 1/2).

Convergence in r-th mean
X_n →^r X if E|X_n − X|^r → 0 as n → ∞.
[r-th mean ⇒ p, but r-th mean and a.s. convergence do not imply each other.]

Suppose now that the distribution functions are F_1, F_2, ... and F. The random variables need not be defined on the same sample spaces for the following definition.

Convergence in distribution
X_n →^d X if F_n(x) → F(x) as n → ∞ at each continuity point of F. We say that the asymptotic distribution of X_n is F.
[p ⇒ d, but d ⇏ p.]

A useful result.
Let (X_n), (Y_n) be two sequences of random variables such that


X_n →^d X and Y_n →^p c, a constant. Then

X_n + Y_n →^d X + c,  X_n Y_n →^d cX,  X_n/Y_n →^d X/c (c ≠ 0).

3.4.2 Limit theorems for sums of independent random variables

Let X_1, X_2, ... be a sequence of iid random variables with (common) mean µ. Let S_n = Σ_{i=1}^n X_i, X̄_n = n^{−1} S_n.

Weak Law of Large Numbers (WLLN). If E|X_i| < ∞ then X̄_n →^p µ.
Proof (case σ² = Var(X_i) < ∞). Use Chebyshev's inequality: since E(X̄_n) = µ we have, for every ε > 0,

pr(|X̄_n − µ| > ε) ≤ Var(X̄_n)/ε² = σ²/(nε²) → 0

as n → ∞.

Example 21: (continued). Here σ² = Var(X_i) = 1/4 (Bernoulli r.v.), so the WLLN applies to X̄_n, the proportion of successes.

Strong Law of Large Numbers (SLLN). If E|X_i| < ∞ then X̄_n →^{a.s.} µ.
[The proof is more tricky and is omitted.]

Central Limit Theorem (CLT). If σ² = Var(X_i) < ∞ then

(S_n − nµ)/(σ√n) →^d N(0, 1).

Equivalently,

(X̄_n − µ)/(σ/√n) →^d N(0, 1).

Proof. Suppose that X_i has mgf M(z). Write Z_n = (S_n − nµ)/(σ√n). The mgf of Z_n is given by

M_{Z_n}(z) = E(e^{zZ_n}) = exp(−zµ√n/σ) {M(z/(σ√n))}^n.

Therefore the cgf of Z_n is

K_{Z_n}(z) = log M_{Z_n}(z) = −zµ√n/σ + nK(z/(σ√n))
           = −zµ√n/σ + n{µ(z/(σ√n)) + (σ²/2)(z/(σ√n))²} + O(1/√n)
           = −zµ√n/σ + zµ√n/σ + z²/2 + O(1/√n) → z²/2

as n → ∞, which is the cgf of the N(0, 1) distribution, as required.

[Note on the proof of the CLT. In cases where the mgf does not exist, a similar proof can be given in terms of the function φ(z) = E(e^{izX_j}) where i = √−1. φ(·) is called the characteristic function and always exists.]

Example 21: (continued). Normal approximation to the binomial
Suppose now that the success probability is π, so that pr(X_i = 1) = π. Then µ = π and σ² = π(1 − π), so the CLT gives that √n(X̄_n − π)/√{π(1 − π)} is approximately N(0, 1).
Furthermore, X̄_n →^p π by the WLLN, and it follows from the 'useful result' that √n(X̄_n − π)/√{X̄_n(1 − X̄_n)} is also approximately N(0, 1).
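As an illustration of the approximation (not part of the original notes), the exact binomial cdf can be compared with its normal approximation; the numbers below are arbitrary and a continuity correction of 0.5 is applied:

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """Exact binomial cdf pr(X <= k)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, p, k = 40, 0.3, 10                    # illustrative values only
mu, sigma = n * p, sqrt(n * p * (1 - p))

print(binom_cdf(k, n, p))                # exact probability
print(Phi((k + 0.5 - mu) / sigma))       # CLT approximation, close to the exact value
```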

Poisson limit of binomial. Suppose that X_n is binomial(n, π) where π is such that nπ → λ as n → ∞. Then X_n →^d Poisson(λ).
Proof. X_n is expressible as Σ_{i=1}^n Y_i, where the Y_i are independent Bernoulli random variables with pr(Y_i = 1) = π. Thus X_n has pgf

(1 − π + πz)^n = {1 − n^{−1}λ(1 − z) + o(n^{−1})}^n → exp{−λ(1 − z)}

as n → ∞, which is the pgf of the Poisson(λ) distribution.

3.4.3 Self-study exercises

1. In a large consignment of manufactured items 25% are defective. A random sample of 50 is drawn. Use the binomial distribution to compute the exact probability that the number of defectives in the sample is five or fewer. Use the CLT to approximate this answer.

2. The random variable Y has the Poisson(50) distribution. Use the CLT to find pr(Y = 50), pr(Y ≤ 45) and pr(Y > 60).


3. A machine in continuous use contains a certain critical component which has an exponential lifetime distribution with mean 100 hours. When a component fails it is immediately replaced by one from the stock, originally of 90 such components. Use the CLT to find the probability that the machine can be kept running for a year without the stock running out.

    3.5 Further discrete distributions

    3.5.1 Negative binomial distribution

    Let X be the number of Bernoulli trials until the kth success. Then

pr(X = x) = pr(k − 1 successes in the first x − 1 trials, followed by a success on the x th trial)
           = C(x − 1, k − 1) π^{k−1}(1 − π)^{x−k} × π

(where the first factor, with C(n, r) denoting the binomial coefficient, comes from the binomial distribution). Hence define the pmf of the negative binomial(k, π) distribution as

    p(x) = C(x − 1, k − 1) π^k (1 − π)^{x−k},   x = k, k + 1, . . .

The mean is k/π:
The variance is k(1 − π)/π² (see exercise 1).
The pgf is {π/(z^{−1} − 1 + π)}^k:

The name “negative binomial” comes from the binomial expansion

    1 = π^k {1 − (1 − π)}^{−k} = ∑_{x=k}^∞ p(x)


    where p(x) are the negative binomial probabilities. (Exercise: verify)
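For numerical work it may help to know that scipy's nbinom counts the number of failures before the kth success rather than the total number of trials, so the pmf above corresponds to a shift by k. A minimal sketch (assuming scipy; k and π are illustrative) checks the two against each other:

    from math import comb
    from scipy.stats import nbinom

    k, pi = 3, 0.4
    for x in range(k, k + 6):            # x = total number of trials to the kth success
        p_trials = comb(x - 1, k - 1) * pi**k * (1 - pi)**(x - k)   # pmf as defined above
        p_scipy = nbinom.pmf(x - k, k, pi)                          # scipy counts the x - k failures
        print(x, p_trials, p_scipy)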

    3.5.2 Hypergeometric distribution

An urn contains n1 red beads and n2 black beads. Suppose that m beads are drawn without replacement and let X be the number of red beads in the sample. Note that, since X ≤ n1 and X ≤ m, the possible values of X are 0, 1, . . . , min(n1, m). Then

    p(x) = pr(X = x) = (no. of selections of x reds and m − x blacks)/(total no. of selections of m beads)
         = C(n1, x) C(n2, m − x) / C(n1 + n2, m),   x = 0, 1, . . . , min(n1, m).

This is the pmf of the hypergeometric(n1, n2, m) distribution.
The mean is n1 m/(n1 + n2) and the variance is n1 n2 m (n1 + n2 − m)/{(n1 + n2)²(n1 + n2 − 1)}.
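A small numerical sketch (assuming scipy; the urn sizes are invented for illustration): scipy's hypergeom is parameterised by the total population size, the number of red beads and the sample size.

    from scipy.stats import hypergeom

    n1, n2, m = 7, 13, 5                        # 7 red, 13 black, sample of 5 (illustrative)
    X = hypergeom(n1 + n2, n1, m)               # (population size, no. of reds, sample size)
    print([X.pmf(x) for x in range(min(n1, m) + 1)])
    print(X.mean(), n1 * m / (n1 + n2))         # agrees with the mean formula above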

    3.5.3 Multinomial distribution

An urn contains n_j beads of colour j (j = 1, . . . , k). Suppose that m beads are drawn with replacement and let X_j be the number of beads of colour j in the sample. Then, for x_j = 0, 1, . . . , m and ∑_{j=1}^k x_j = m,

    p(x) = pr(X = x) = C(m; x) π1^{x1} π2^{x2} · · · πk^{xk},

where π_j = n_j / ∑_{i=1}^k n_i. This is the pmf of the multinomial(k, m, π) distribution. Here

    C(m; x) = no. of different orderings of x1 + · · · + xk beads = m!/(x1! · · · xk!)


and the probability of any given order is π1^{x1} π2^{x2} · · · πk^{xk}. The name “multinomial” comes from the multinomial expansion of (π1 + · · · + πk)^m, in which the coefficient of π1^{x1} π2^{x2} · · · πk^{xk} is C(m; x).

The means are mπ_j:
The covariances are σ_jk = m(δ_jk π_j − π_j π_k).
The joint pgf is E(∏_j z_j^{X_j}) = (∑_{j=1}^k π_j z_j)^m:
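A short sketch of the pmf formula in plain Python (the counts and probabilities are invented for illustration):

    from math import factorial

    def multinomial_pmf(x, pi):
        """pmf of the multinomial(k, m, pi) distribution at the vector x, with m = sum(x)."""
        m = sum(x)
        coef = factorial(m)
        for xj in x:
            coef //= factorial(xj)               # multinomial coefficient m!/(x1!...xk!)
        prob = 1.0
        for xj, pj in zip(x, pi):
            prob *= pj ** xj
        return coef * prob

    print(multinomial_pmf((2, 1, 2), (0.5, 0.3, 0.2)))   # k = 3, m = 5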

3.5.4 Self-study exercises

1. Derive the variance of the negative binomial(k, π) distribution. [You may assume the formula for the pgf.]

2. Suppose that X_1, . . . , X_k are independent geometric(π) random variables. Using pgfs, show that ∑_{j=1}^k X_j is negative binomial(k, π). [Hence, the waiting times X_j between successes in Bernoulli trials are independent geometric, and the overall waiting time to the kth success is negative binomial.]

3. If X is multinomial(k, m, π), show that X_j is binomial(m, π_j), X_j + X_k is binomial(m, π_j + π_k), etc. [Either by direct calculation or using the pgf.]


    3.6 Further continuous distributions

    3.6.1 Gamma and beta functions

Gamma function: Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx for a > 0.
Integration by parts gives Γ(a) = (a − 1)Γ(a − 1).
In particular, for integer a, Γ(a) = (a − 1)! (since Γ(1) = 1). Also, Γ(1/2) = √π.

Beta function: B(a, b) = ∫_0^1 x^{a−1}(1 − x)^{b−1} dx for a > 0, b > 0.
Relationship with the Gamma function: B(a, b) = Γ(a)Γ(b)/Γ(a + b).

    3.6.2 Gamma distribution

The pdf of the gamma(α, β) distribution is defined as

    f(x) = β^α x^{α−1} e^{−βx} / Γ(α),   x > 0,

where α > 0 and β > 0. When α = 1, this is the exponential(β) distribution.
The mean is α/β:
The variance is α/β² (see exercise 2).
The mgf is (1 − z/β)^{−α}:

Note that the mode is (α − 1)/β if α ≥ 1, but f(x) → ∞ as x → 0 if α < 1.


Example 22: The journey time of a bus on a nominal half-hour route has the gamma(3, 6) distribution. What is the probability that the bus is over half an hour late?
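As a numerical sketch for this example (assuming scipy, and assuming time is measured in hours so that the rate is 6 per hour), being over half an hour late on a half-hour route means a journey time exceeding 1 hour:

    from scipy.stats import gamma

    alpha, beta = 3, 6                              # shape alpha, rate beta
    print(gamma.sf(1.0, a=alpha, scale=1 / beta))   # pr(journey time > 1 hour) = 25 e^{-6} ≈ 0.062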

Sums of exponential random variables. Suppose that X_1, . . . , X_n are iid exponential(λ) random variables. Then ∑_{i=1}^n X_i is gamma(n, λ).

    Proof:

    3.6.3 Beta distribution

The pdf of the beta(α, β) distribution is

    f(x) = x^{α−1}(1 − x)^{β−1} / B(α, β),   0 < x < 1,

where α > 0 and β > 0.


The mean is α/(α + β):
The variance is αβ/{(α + β)²(α + β + 1)}.
The mode is (α − 1)/(α + β − 2) if α ≥ 1 and α + β > 2.

Property. If X_1 and X_2 are independent, respectively gamma(ν1, λ) and gamma(ν2, λ), then U_1 = X_1 + X_2 and U_2 = X_1/(X_1 + X_2) are independent, respectively gamma(ν1 + ν2, λ) and beta(ν1, ν2).

Proof. The inverse transformation is

    X_1 = U_1 U_2,    X_2 = U_1(1 − U_2),

with Jacobian

    dx/du = det [ u2  u1 ; 1 − u2  −u1 ] = −u1.


Therefore

    f_U(u) = [λ^{ν1} (u1 u2)^{ν1−1} e^{−λ u1 u2} / Γ(ν1)] × [λ^{ν2} {u1(1 − u2)}^{ν2−1} e^{−λ u1 (1 − u2)} / Γ(ν2)] × |−u1|
           = [λ^{ν1+ν2} u1^{ν1+ν2−1} e^{−λ u1} / Γ(ν1 + ν2)] × [Γ(ν1 + ν2)/{Γ(ν1)Γ(ν2)}] u2^{ν1−1} (1 − u2)^{ν2−1}

on (0, ∞) × (0, 1) and the result follows.

    3.6.4 Self-study exercises

1. Suppose X has the gamma(2, 4) distribution. Find the probability that X exceeds µ + 2σ, where µ, σ are respectively the mean and standard deviation of X.

2. Derive the variance of the gamma(α, β) distribution. [Either by direct calculation or using the mgf.]

3. Find the distribution of −log X when X is uniform(0, 1). Hence show that if X_1, . . . , X_k are iid uniform(0, 1) then −log(X_1 X_2 · · · X_k) is gamma(k, 1).

4. If X is gamma(ν, λ), show that log X has mgf λ^{−z} Γ(z + ν)/Γ(ν).

5. Suppose X is uniform(0, 1) and γ > 0. Show that Y = X^{1/γ} is beta(γ, 1).


    Chapter 4

    Normal and associated distributions

    4.1 The multivariate normal distribution

    4.1.1 Multivariate normal

The multivariate normal distribution, denoted N_p(µ, Σ), has pdf

    f(x) = |2πΣ|^{−1/2} exp{−(1/2)(x − µ)^T Σ^{−1}(x − µ)}

on (−∞, ∞)^p.
The mean is µ (p × 1) and the covariance matrix is Σ (p × p) (see property (v)).

Bivariate case, p = 2. Here

    X = (X1, X2)^T,   µ = (µ1, µ2)^T,   Σ = [ Σ11  Σ12 ; Σ21  Σ22 ] = [ σ1²  ρσ1σ2 ; ρσ1σ2  σ2² ],

    |2πΣ| = (2π)² σ1² σ2² (1 − ρ²),

    Σ^{−1} = (1 − ρ²)^{−1} [ 1/σ1²  −ρ/(σ1σ2) ; −ρ/(σ1σ2)  1/σ2² ],

giving

    f(x1, x2) = exp( −[((x1 − µ1)/σ1)² − 2ρ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)²] / {2(1 − ρ²)} ) / {2πσ1σ2 √(1 − ρ²)}.
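The component form can be checked against a library implementation; the sketch below (assuming numpy and scipy, with arbitrary illustrative parameter values) evaluates both expressions at one point:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.0, 2.0, 0.5        # illustrative values
    Sigma = np.array([[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]])

    def f(x1, x2):
        # bivariate normal density written out in component form
        z = ((x1-mu1)/s1)**2 - 2*rho*((x1-mu1)/s1)*((x2-mu2)/s2) + ((x2-mu2)/s2)**2
        return np.exp(-z / (2*(1 - rho**2))) / (2*np.pi*s1*s2*np.sqrt(1 - rho**2))

    x = (0.3, 1.7)
    print(f(*x), multivariate_normal(mean=[mu1, mu2], cov=Sigma).pdf(x))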


    4.1.2 Properties

(i) Suppose X is N_p(µ, Σ) and let Y = T^{−1}(X − µ), where Σ = TT^T. Then Y_i, i = 1, . . . , p, are independent N(0, 1).

(ii) The joint mgf of N_p(µ, Σ) is exp(µ^T z + (1/2) z^T Σ z). (C.f. property (iv), Section 2.4.2.)


(iii) If X is N_p(µ, Σ) then AX + b (where A is q × p and b is q × 1) is N_q(Aµ + b, AΣA^T). (C.f. property (i), Section 2.4.2.)

(iv) If X_i, i = 1, . . . , n, are independent N_p(µ_i, Σ_i), then ∑_i X_i is N_p(∑_i µ_i, ∑_i Σ_i). (C.f. property (iii), Section 2.4.2.)


(v) Moments of N_p(µ, Σ). Obtain by differentiation of the mgf. In particular, differentiating w.r.t. z_j and z_k gives E(X_j) = µ_j, Var(X_j) = Σ_jj and Cov(X_j, X_k) = Σ_jk.
Note that if X_1, . . . , X_p are all uncorrelated (i.e. Σ_jk = 0 for j ≠ k) then X_1, . . . , X_p are independent, with X_j ∼ N(µ_j, σ_j²) where σ_j² = Σ_jj.

(vi) If X is N_p(µ, Σ) then a^T X and b^T X are independent if and only if a^T Σ b = 0. Similarly for A^T X and B^T X.

    4.1.3 Marginal and conditional distributions

Suppose that X is N_p(µ, Σ). Partition X^T as (X_1^T, X_2^T), where X_1 is p1 × 1, X_2 is p2 × 1 and p1 + p2 = p. Correspondingly, µ^T = (µ_1^T, µ_2^T) and

    Σ = [ Σ11  Σ12 ; Σ21  Σ22 ].

Note that Σ21^T = Σ12, and X_1 and X_2 are independent if and only if Σ12 = 0 (since the joint density factorises if and only if Σ12 = 0).

The marginal distribution of X_1 is N_{p1}(µ_1, Σ11).
Proof:

The conditional distribution of X_2 | X_1 is N_{p2}(µ_{2.1}, Σ_{22.1}), where

    µ_{2.1} = µ_2 + Σ21 Σ11^{−1}(X_1 − µ_1),
    Σ_{22.1} = Σ22 − Σ21 Σ11^{−1} Σ12

(proof omitted). Note that µ_{2.1} is linear in X_1.
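The conditional mean and covariance are straightforward to compute numerically; the sketch below (numpy only, with made-up parameter values and partition p1 = 2, p2 = 1) evaluates µ_{2.1} and Σ_{22.1}:

    import numpy as np

    mu = np.array([1.0, 0.0, 2.0])
    Sigma = np.array([[2.0, 0.5, 0.3],
                      [0.5, 1.0, 0.4],
                      [0.3, 0.4, 1.5]])
    p1 = 2                                                       # X1 = first two components, X2 = last one
    S11, S12 = Sigma[:p1, :p1], Sigma[:p1, p1:]
    S21, S22 = Sigma[p1:, :p1], Sigma[p1:, p1:]

    x1 = np.array([1.2, -0.3])                                   # observed value of X1
    mu_2_1 = mu[p1:] + S21 @ np.linalg.solve(S11, x1 - mu[:p1])  # conditional mean
    Sigma_22_1 = S22 - S21 @ np.linalg.solve(S11, S12)           # conditional covariance
    print(mu_2_1, Sigma_22_1)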

    4.1.4 Self-study exercises

1. Write down the joint density of the N_2( (0, 1)^T, [ 1  1 ; 1  4 ] ) distribution in component form.

2. Suppose that X_i, i = 1, . . . , n, are independent N_p(µ, Σ). Show that the sample mean vector, X̄ = n^{−1} ∑_i X_i, is N_p(µ, n^{−1}Σ).

3. For the distribution in exercise 1, obtain the marginal distributions of X_1 and X_2 and the conditional distributions of X_2 given X_1 = x1 and X_1 given X_2 = x2.


    4.2 The chi-square, t and F distributions

    4.2.1 Chi-square distribution

The pdf of the chi-square distribution with ν degrees of freedom (ν > 0) is

    f(u) = u^{ν/2 − 1} e^{−u/2} / {2^{ν/2} Γ(ν/2)},   u > 0.

Denoted by χ²_ν. Note that the χ²_ν distribution is identical to the gamma(ν/2, 1/2) distribution (c.f. Section 3.6). It follows that the mean is ν, the variance is 2ν and the mgf is (1 − 2z)^{−ν/2}.

Properties
(i) Let ν be a positive integer and suppose that X_1, . . . , X_ν are iid N(0, 1). Then ∑_{i=1}^ν X_i² is χ²_ν. In particular, if X is N(0, 1) then X² is χ²_1.

(ii) If U_i, i = 1, . . . , n, are independent χ²_{ν_i} then ∑_{i=1}^n U_i is χ²_ν with ν = ∑_{i=1}^n ν_i.


(iii) If X is N_p(µ, Σ) then (X − µ)^T Σ^{−1}(X − µ) is χ²_p.

Theorem (Joint distribution of the sample mean and variance)
Suppose that X_1, . . . , X_n are iid N(µ, σ²). Let X̄ = n^{−1} ∑_i X_i be the sample mean and S² = (n − 1)^{−1} ∑_i (X_i − X̄)² the sample variance.
Then X̄ is N(µ, σ²/n), (n − 1)S²/σ² is χ²_{n−1}, and X̄ and S² are independent.
Proof:


    4.2.2 Student’s t distribution

    The pdf of the Student’s t distribution with ν degrees of freedom (ν > 0) is

    f(t) = (1 + t²/ν)^{−(ν+1)/2} / {B(1/2, ν/2) ν^{1/2}},   −∞ < t < ∞.

Denoted by t_ν. The mean is 0 (provided ν > 1):
The variance is ν/(ν − 2) (provided ν > 2).

Theorem. If X is N(0, 1), U is χ²_ν and X and U are independent, then

    T ≡ X/√(U/ν) ∼ t_ν.

Proof:


    4.2.3 Variance ratio (F) distribution

The pdf of the variance ratio, or F, distribution with ν1, ν2 degrees of freedom (ν1, ν2 > 0) is

    f(x) = (ν1/ν2)^{ν1/2} x^{ν1/2 − 1} / [ B(ν1/2, ν2/2) (1 + ν1 x/ν2)^{(ν1+ν2)/2} ],   x > 0.

Denoted by F_{ν1,ν2}. The mean is ν2/(ν2 − 2) (provided ν2 > 2) and the variance is 2ν2²(ν1 + ν2 − 2)/{ν1(ν2 − 2)²(ν2 − 4)} (provided ν2 > 4).

Theorem. If U_1 and U_2 are independent, respectively χ²_{ν1} and χ²_{ν2}, then

    F ≡ (U_1/ν1)/(U_2/ν2) ∼ F_{ν1,ν2}.

Proof:

It follows from the above result that (i) F_{ν1,ν2} ≡ 1/F_{ν2,ν1} and (ii) F_{1,ν} ≡ t²_ν. (Exercise: check)

4.3 Normal theory tests and confidence intervals

    4.3.1 One-sample t-test

Suppose that Y_1, . . . , Y_n are iid N(µ, σ²). Then, from Section 4.2, Ȳ = n^{−1} ∑_i Y_i (the sample mean) and S² = (n − 1)^{−1} ∑_i (Y_i − Ȳ)² (the sample variance) are independent, respectively N(µ, σ²/n) and σ²χ²_{n−1}/(n − 1). Hence

    Z = (Ȳ − µ)/(σ/√n)

is N(0, 1),

    U = (n − 1)S²/σ²

is χ²_{n−1}, and Z, U are independent. It follows that

    T = (Ȳ − µ)/(S/√n) = Z/√{U/(n − 1)}

is t_{n−1}.

Applications:
Inference about µ: one-sample z-test (σ known) and t-test (σ unknown).
Inference about σ²: χ² test.
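A minimal sketch of the one-sample t-test (assuming numpy and scipy; the data are simulated rather than real):

    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(2)
    y = rng.normal(loc=0.3, scale=1.0, size=20)      # simulated N(0.3, 1) sample

    res = ttest_1samp(y, popmean=0.0)                # test H0: mu = 0, two-sided
    print(res.statistic, res.pvalue)

    # the same statistic from the formula T = (Ybar - mu0)/(S/sqrt(n))
    n, ybar, s = y.size, y.mean(), y.std(ddof=1)
    print((ybar - 0.0) / (s / np.sqrt(n)))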

    4.3.2 Two-samples

Two independent samples. Suppose that Y_11, . . . , Y_{1n1} are iid N(µ1, σ1²) and Y_21, . . . , Y_{2n2} are iid N(µ2, σ2²).
Summary statistics: (n1, Ȳ_1, S_1²) and (n2, Ȳ_2, S_2²).
Pooled sample variance: S² = {(n1 − 1)S_1² + (n2 − 1)S_2²}/(n1 + n2 − 2).

From Section 4.2, if σ1² = σ2² = σ², say, then Ȳ_1 and (n1 − 1)S_1² are independent, respectively N(µ1, n1^{−1}σ²) and σ²χ²_{n1−1}, and Ȳ_2 and (n2 − 1)S_2² are independent, respectively N(µ2, n2^{−1}σ²) and σ²χ²_{n2−1}.
Furthermore, (Ȳ_1, (n1 − 1)S_1²) and (Ȳ_2, (n2 − 1)S_2²) are independent.
Therefore (Ȳ_1 − Ȳ_2) is N(µ1 − µ2, (n1^{−1} + n2^{−1})σ²), (n1 + n2 − 2)S² is σ²χ²_{n1+n2−2}, and (Ȳ_1 − Ȳ_2) and (n1 + n2 − 2)S² are independent.
Therefore

    T ≡ {(Ȳ_1 − Ȳ_2) − (µ1 − µ2)}/{S √(1/n1 + 1/n2)}

is t_{n1+n2−2}.
Also, since S_1² and S_2² are independent,

    F ≡ S_1²/S_2² ∼ F_{n1−1, n2−1}

when σ1² = σ2².

Applications:
Inference about µ1 − µ2: two-sample z-test (σ known) and t-test (σ unknown).
Inference about σ1²/σ2²: F (variance ratio) test.
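A sketch of the two-sample calculations (assuming numpy and scipy for the pooled t-test; the samples are simulated):

    import numpy as np
    from scipy.stats import ttest_ind, f

    rng = np.random.default_rng(3)
    y1 = rng.normal(5.0, 2.0, size=12)
    y2 = rng.normal(4.0, 2.0, size=15)

    res = ttest_ind(y1, y2, equal_var=True)          # pooled t-test of H0: mu1 = mu2
    print(res.statistic, res.pvalue)

    F_stat = y1.var(ddof=1) / y2.var(ddof=1)         # variance-ratio statistic S1^2/S2^2
    print(F_stat, f.sf(F_stat, len(y1) - 1, len(y2) - 1))   # upper-tail probability under F_{n1-1, n2-1}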


Matched pairs. Observations (Y_i1, Y_i2 : i = 1, . . . , n) where the differences D_i = Y_i1 − Y_i2 are independent N(µ, σ²). Then

    T = (D̄ − µ)/(S/√n)

is t_{n−1}, where S² is the sample variance of the D_i's.

    Application:

    Inference about µ from paired observations: paired-sample t-test.

    4.3.3 k samples (One-way Anova)

Suppose we have k groups, with group means µ1, . . . , µk.
Denote the independent observations by (Y_i1, . . . , Y_{in_i} : i = 1, . . . , k), with Y_ij ∼ N(µ_i, σ²), j = 1, . . . , n_i, i = 1, . . . , k.
Summary statistics: ((n_i, Ȳ_i, S_i²) : i = 1, . . . , k).
Total sum of squares: ssT = ∑_ij (Y_ij − Ȳ)², where Ȳ = n^{−1} ∑_ij Y_ij (the overall mean) and n = ∑_i n_i.
Then ssT = ssW + ssB, where

    ssW = ∑_ij (Y_ij − Ȳ_i)² = ∑_i (n_i − 1)S_i²   (the within-samples ss)
    ssB = ∑_i n_i (Ȳ_i − Ȳ)²   (the between-samples ss)

From Sections 4.1 and 4.2, (n_i − 1)S_i²/σ² is χ²_{n_i−1}, independent of Ȳ_i.
Hence ssW/σ² is χ²_{n−k}, independent of ssB.
Also, by a similar argument to that of the Theorem in Section 4.2 (proof omitted), ssB is σ²χ²_{k−1} when µ_i = µ, say, for all i.
Hence we obtain the F-test for equality of the group means µ_i:

    F = {ssB/(k − 1)}/{ssW/(n − k)}

is F_{k−1, n−k} under the null hypothesis µ1 = · · · = µk.
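A sketch of the one-way ANOVA F-test (assuming numpy and scipy; three simulated groups), comparing scipy's result with the sums-of-squares formulas above:

    import numpy as np
    from scipy.stats import f_oneway

    rng = np.random.default_rng(4)
    groups = [rng.normal(10.0, 2.0, size=8),
              rng.normal(11.0, 2.0, size=10),
              rng.normal(10.5, 2.0, size=9)]

    res = f_oneway(*groups)                          # F statistic and p-value
    print(res.statistic, res.pvalue)

    # the same F built from ssB and ssW
    all_y = np.concatenate(groups)
    n, k = all_y.size, len(groups)
    grand = all_y.mean()
    ssB = sum(g.size * (g.mean() - grand) ** 2 for g in groups)
    ssW = sum(((g - g.mean()) ** 2).sum() for g in groups)
    print((ssB / (k - 1)) / (ssW / (n - k)))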


    4.3.4 Normal linear regression

Observations Y_1, . . . , Y_n are independently N(α + βx_i, σ²), where x_1, . . . , x_n are given constants.
The least-squares estimator (α̂, β̂) is found by minimizing the sum of squares Q(α, β) = ∑_{i=1}^n (Y_i − α − βx_i)².
By partial differentiation with respect to α and β, we obtain

    β̂ = T_xy/T_xx,   α̂ = Ȳ − β̂ x̄,

where T_xx = ∑_i (x_i − x̄)² and T_xy = ∑_i (x_i − x̄)(Y_i − Ȳ).

Note that, since both α̂ and β̂ are linear combinations of Y = (Y_1, . . . , Y_n)^T, they are jointly normally distributed.
Using properties of expectation and covariance matrices, we find that (α̂, β̂)^T is bivariate normal with mean (α, β) and covariance matrix

    V = (σ²/T_xx) [ n^{−1} ∑_i x_i²   −x̄ ; −x̄   1 ].

Sums of squares

Total ss: T_yy = ∑_i (Y_i − Ȳ)²;
Residual ss: Q(α̂, β̂);
Regression ss: T_yy − Q(α̂, β̂).

Results:
(a) Residual ss = T_yy − T_xx β̂², Regression ss = T_xx β̂² = T²_xy/T_xx.
(b) E(Total ss) = T_xx β² + (n − 1)σ², E(Regression ss) = T_xx β² + σ², E(Residual ss) = (n − 2)σ².
(c) By a similar argument to that of the Theorem in Section 4.2 (proof omitted), Residual ss is σ²χ²_{n−2} and, if β = 0, Regression ss is σ²χ²_1, independently of Residual ss.

Application:
The residual mean square, S² = Residual ss/(n − 2), is an unbiased estimator of σ²; β̂ is an unbiased estimator of β with estimated standard error S/√T_xx; and α̂ is an unbiased estimator of α with estimated standard error (S/√T_xx)(∑_i x_i²/n)^{1/2}.

If β = β_0 then

    T = (β̂ − β_0)/(S/√T_xx)


is t_{n−2}, giving rise to tests and confidence intervals about β.
If β = 0 then

    F = Regression ss/S²

is F_{1,n−2}, hence a test for β = 0.
(Alternatively, and equivalently, use T = β̂/(S/√T_xx) as t_{n−2}.)

The coefficient of determination is

    r² = Regression ss/Total ss = T²_xy/(T_xx T_yy)

(the square of the sample correlation coefficient). The coefficient of determination gives the proportion of Y-variation attributable to regression on x.
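Finally, a sketch of these regression formulas in code (numpy only; x and Y are simulated, so the particular numbers carry no meaning):

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(0, 10, 25)
    Y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)   # true alpha = 1, beta = 0.5

    xbar, Ybar = x.mean(), Y.mean()
    Txx = ((x - xbar) ** 2).sum()
    Txy = ((x - xbar) * (Y - Ybar)).sum()
    Tyy = ((Y - Ybar) ** 2).sum()

    beta_hat = Txy / Txx
    alpha_hat = Ybar - beta_hat * xbar
    resid_ss = Tyy - Txx * beta_hat ** 2          # Residual ss
    S2 = resid_ss / (x.size - 2)                  # unbiased estimator of sigma^2
    t_beta = beta_hat / np.sqrt(S2 / Txx)         # T for testing beta = 0
    r2 = Txy ** 2 / (Txx * Tyy)                   # coefficient of determination

    print(alpha_hat, beta_hat, S2, t_beta, r2)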