
    COURSE NOTES

    STATS 210

    Statistical Theory

    Department of Statistics

    University of Auckland


    Contents

1. Probability
  1.1 Introduction
  1.2 Sample spaces
  1.3 Events
  1.4 Partitioning sets and events
  1.5 Probability: a way of measuring sets
  1.6 Probabilities of combined events
  1.7 The Partition Theorem
  1.8 Examples of basic probability calculations
  1.9 Formal probability proofs: non-examinable
  1.10 Conditional Probability
  1.11 Examples of conditional probability and partitions
  1.12 Bayes' Theorem: inverting conditional probabilities
  1.13 Statistical Independence
  1.14 Random Variables
  1.15 Key Probability Results for Chapter 1
  1.16 Chains of events and probability trees: non-examinable
  1.17 Equally likely outcomes and combinatorics: non-examinable

2. Discrete Probability Distributions
  2.1 Introduction
  2.2 The probability function, fX(x)
  2.3 Bernoulli trials
  2.4 Example of the probability function: the Binomial Distribution
  2.5 The cumulative distribution function, FX(x)
  2.6 Hypothesis testing
  2.7 Example: Presidents and deep-sea divers
  2.8 Example: Birthdays and sports professionals
  2.9 Likelihood and estimation
  2.10 Random numbers and histograms
  2.11 Expectation
  2.12 Variable transformations
  2.13 Variance
  2.14 Mean and variance of the Binomial(n, p) distribution

3. Modelling with Discrete Probability Distributions
  3.1 Binomial distribution
  3.2 Geometric distribution
  3.3 Negative Binomial distribution
  3.4 Hypergeometric distribution: sampling without replacement
  3.5 Poisson distribution
  3.6 Subjective modelling

4. Continuous Random Variables
  4.1 Introduction
  4.2 The probability density function
  4.3 The Exponential distribution
  4.4 Likelihood and estimation for continuous random variables
  4.5 Hypothesis tests
  4.6 Expectation and variance
  4.7 Exponential distribution mean and variance
  4.8 The Uniform distribution
  4.9 The Change of Variable Technique: finding the distribution of g(X)
  4.10 Change of variable for non-monotone functions: non-examinable
  4.11 The Gamma distribution
  4.12 The Beta Distribution: non-examinable

5. The Normal Distribution and the Central Limit Theorem
  5.1 The Normal Distribution
  5.2 The Central Limit Theorem (CLT)

6. Wrapping Up
  6.1 Estimators: the good, the bad, and the estimator PDF
  6.2 Hypothesis tests: in search of a distribution


    Chapter 1: Probability

    1.1 Introduction

Definition: A probability is a number between 0 and 1 representing how likely it is that an event will occur.

Probabilities can be:

1. Frequentist (based on frequencies),
e.g. (number of times event occurs) / (number of opportunities for event to occur);

2. Subjective: probability represents a person's degree of belief that an event will occur,
e.g. "I think there is an 80% chance it will rain today,"

written as P(rain) = 0.80.

Regardless of how we obtain probabilities, we always combine and manipulate them according to the same rules.

    1.2 Sample spaces

Definition: A random experiment is an experiment whose outcome is not known until it is observed.

Definition: A sample space, Ω, is a set of outcomes of a random experiment.

Every possible outcome must be listed once and only once.

Definition: A sample point is an element of the sample space.

For example, if the sample space is Ω = {s1, s2, s3}, then each si is a sample point.


    Examples:

Experiment: Toss a coin twice and observe the result.
Sample space: Ω = {HH, HT, TH, TT}
An example of a sample point is: HT

Experiment: Toss a coin twice and count the number of heads.
Sample space: Ω = {0, 1, 2}

Experiment: Toss a coin twice and observe whether the two tosses are the same (e.g. HH or TT).
Sample space: Ω = {same, different}

Discrete and continuous sample spaces

Definition: A sample space is finite if it has a finite number of elements.

Definition: A sample space is discrete if there are gaps between the different elements, or if the elements can be listed, even if the list is infinite (e.g. 1, 2, 3, . . .).

In mathematical language, a sample space is discrete if it is finite or countable.

Definition: A sample space is continuous if there are no gaps between the elements, so the elements cannot be listed (e.g. the interval [0, 1]).

Examples:

Ω = {0, 1, 2, 3}  (discrete and finite)
Ω = {0, 1, 2, 3, . . .}  (discrete, infinite)
Ω = {4.5, 4.6, 4.7}  (discrete, finite)
Ω = {HH, HT, TH, TT}  (discrete, finite)
Ω = [0, 1] = {all numbers between 0 and 1 inclusive}  (continuous, infinite)
Ω = {[0, 90), [90, 360)}  (discrete, finite)


    1.3 Events

Kolmogorov (1903-1987), one of the founders of probability theory.

Suppose you are setting out to create a science of randomness. Somehow you need to harness the idea of randomness, which is all about the unknown, and express it in terms of mathematics.

How would you do it?

So far, we have introduced the sample space, Ω, which lists all possible outcomes of a random experiment, and might seem unexciting.

However, Ω is a set. It lays the ground for a whole mathematical formulation of randomness, in terms of set theory.

The next concept that you would need to formulate is that of something that happens at random, or an event.

    How would you express the idea of an event in terms of set theory?

    Definition: An event is a subset of the sample space.

    That is, any collection of outcomes forms an event.

Example: Toss a coin twice. Sample space: Ω = {HH, HT, TH, TT}.

Let event A be the event that there is exactly one head.

We write: A = "exactly one head".

Then A = {HT, TH}. A is a subset of Ω, as in the definition. We write A ⊆ Ω.

Definition: Event A occurs if we observe an outcome that is a member of the set A.

Note: Ω is a subset of itself, so Ω is an event. The empty set, ∅ = {}, is also a subset of Ω. This is called the null event, or the event with no outcomes.


1. Alternatives: the union "or" operator, ∪

We wish to describe an event that is composed of several different alternatives.

For example, the event that you used a motor vehicle to get to campus is the event that your journey involved a car, or a bus, or both.

To represent the set of journeys involving both alternatives, we shade all outcomes in Bus and all outcomes in Car.

(Venn diagram: events Bus, Bike, Walk, Car, Train among the people in class; Bus and Car are shaded.)

Overall, we have shaded all outcomes in the UNION of Bus and Car.

We write the event that you used a motor vehicle as the event Bus ∪ Car, read as "Bus UNION Car".

The union operator, ∪, denotes Bus OR Car OR both.

Note: Be careful not to confuse Or and And. To shade the union of Bus and Car, we had to shade everything in Bus AND everything in Car.

To remember whether union refers to Or or And, you have to consider what an outcome needs to satisfy for the shaded event to occur.

The answer is Bus, OR Car, OR both. NOT Bus AND Car.

Definition: Let A and B be events on the same sample space Ω: so A ⊆ Ω and B ⊆ Ω.

The union of events A and B is written A ∪ B, and is given by

A ∪ B = {s : s ∈ A or s ∈ B or both}.


2. Concurrences and coincidences: the intersection "and" operator, ∩

The intersection is an event that occurs when two or more events ALL occur together.

For example, consider the event that your journey today involved BOTH a car AND a train. To represent this event, we shade all outcomes in the OVERLAP of Car and Train.

(Venn diagram: the overlap of Car and Train is shaded.)

We write the event that you used both car and train as Car ∩ Train, read as "Car INTERSECT Train".

The intersection operator, ∩, denotes both Car AND Train together.

Definition: The intersection of events A and B is written A ∩ B and is given by

A ∩ B = {s : s ∈ A AND s ∈ B}.

3. Opposites: the complement or "not" operator

The complement of an event is the opposite of the event: everything EXCEPT the event.

For example, consider the event that your journey today did NOT involve walking. To represent this event, we shade all outcomes in Ω except those in the event Walk.


(Venn diagram: everything in Ω except the event Walk is shaded.)

We write the event "not Walk" as W̅a̅l̅k̅.

Definition: The complement of event A is written A̅ and is given by

A̅ = {s : s ∉ A}.

Examples:

Experiment: Pick a person in this class at random.

Sample space: Ω = {all people in class}.

Let event A = "person is male" and event B = "person travelled by bike today".

Suppose I pick a male who did not travel by bike. Say whether the following events have occurred:

1) A: Yes.    2) B: No.
3) A̅: No.    4) B̅: Yes.
5) A̅ ∪ B = {female or bike rider or both}: No.
6) A ∩ B̅ = {male and non-biker}: Yes.
7) A ∩ B = {male and bike rider}: No.
8) A̅ ∪ B̅ = everything outside A ∩ B. A ∩ B did not occur, so A̅ ∪ B̅ did occur: Yes.

Question: What is the event Ω̅?  Ω̅ = ∅.

Challenge: can you express A ∩ B using only unions and complements?

Answer: A ∩ B = the complement of (A̅ ∪ B̅).
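These set operations translate directly into code. Below is a small sketch (not part of the notes; the outcome labels are invented for illustration) that represents the class example with Python sets, using | for union, & for intersection, and set difference from Ω for the complement.

    # Sketch only: events as Python sets (outcome labels are hypothetical).
    omega = {"male biker", "male non-biker", "female biker", "female non-biker"}
    A = {"male biker", "male non-biker"}      # A = person is male
    B = {"male biker", "female biker"}        # B = person travelled by bike
    outcome = "male non-biker"                # the person we picked

    A_comp = omega - A                        # complement of A
    B_comp = omega - B                        # complement of B

    print(outcome in A)                       # True:  A occurred
    print(outcome in B)                       # False: B did not occur
    print(outcome in (A_comp | B))            # False: (not A) union B did not occur
    print(outcome in (A & B_comp))            # True:  A intersect (not B) occurred
    print(outcome in (omega - (A & B)))       # True:  complement of (A intersect B) occurred
    print((A & B) == omega - (A_comp | B_comp))  # True: A ∩ B = complement of (A̅ ∪ B̅)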


    Limitations of Venn diagrams

Venn diagrams are generally useful for up to 3 events, although they are not used to provide formal proofs. For more than 3 events, the diagram might not be able to represent all possible overlaps of events. (This was probably the case for our transport Venn diagram.)

Example: (Venn diagrams for three events: (a) A ∪ B ∪ C;  (b) A ∩ B ∩ C.)

Properties of union, intersection, and complement

The following properties hold.

(i) Ω̅ = ∅ and ∅̅ = Ω.    (The opposite of everything is nothing.)

(ii) For any event A,
A ∪ A̅ = Ω    (everything is either A or not A),
and A ∩ A̅ = ∅    (nothing is both A and not A).

(iii) For any events A and B, A ∪ B = B ∪ A, and A ∩ B = B ∩ A.    (Commutative.)

(iv) (a) The complement of (A ∪ B) is A̅ ∩ B̅.    (b) The complement of (A ∩ B) is A̅ ∪ B̅.


    Distributive laws

    We are familiar with the fact that multiplication is distributive over addition.

    This means that, if a, b, and c are any numbers, then

a × (b + c) = a × b + a × c.

However, addition is not distributive over multiplication:

a + (b × c) ≠ (a + b) × (a + c).

For set union and set intersection, union is distributive over intersection, AND intersection is distributive over union.

Thus, for any sets A, B, and C:

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C),

and A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).

More generally, for several events A and B1, B2, . . . , Bn,

A ∪ (B1 ∩ B2 ∩ . . . ∩ Bn) = (A ∪ B1) ∩ (A ∪ B2) ∩ . . . ∩ (A ∪ Bn),

i.e.  A ∪ ( ∩_{i=1}^{n} Bi ) = ∩_{i=1}^{n} (A ∪ Bi),

and

A ∩ (B1 ∪ B2 ∪ . . . ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ . . . ∪ (A ∩ Bn),

i.e.  A ∩ ( ∪_{i=1}^{n} Bi ) = ∪_{i=1}^{n} (A ∩ Bi).
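As a quick sanity check, the distributive laws can be verified on any concrete finite sets. The sketch below (not from the notes; the sets are arbitrary invented examples) uses Python's built-in set operations.

    # Sketch only: checking the distributive laws on arbitrary finite sets.
    A = {1, 2, 3, 4}
    B1, B2, B3 = {2, 3}, {3, 5}, {4, 6}

    lhs = A & (B1 | B2 | B3)                 # A intersect (B1 union B2 union B3)
    rhs = (A & B1) | (A & B2) | (A & B3)     # union over i of (A intersect Bi)
    print(lhs == rhs)                        # True

    lhs = A | (B1 & B2 & B3)
    rhs = (A | B1) & (A | B2) & (A | B3)
    print(lhs == rhs)                        # True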


    1.4 Partitioning sets and events

The idea of a partition is fundamental in probability manipulations. Later in this chapter we will encounter the important Partition Theorem. For now, we give some background definitions.

Definition: Two events A and B are mutually exclusive, or disjoint, if A ∩ B = ∅.

This means events A and B cannot happen together. If A happens, it excludes B from happening, and vice-versa.

Note: Does this mean that A and B are independent?

No: quite the opposite. A EXCLUDES B from happening, so B depends strongly on whether or not A happens.

Definition: Any number of events A1, A2, . . . , Ak are mutually exclusive if every pair of the events is mutually exclusive: i.e. Ai ∩ Aj = ∅ for all i, j with i ≠ j.

Definition: A partition of the sample space Ω is a collection of mutually exclusive events whose union is Ω.

That is, sets B1, B2, . . . , Bk form a partition of Ω if

Bi ∩ Bj = ∅ for all i, j with i ≠ j,
and B1 ∪ B2 ∪ . . . ∪ Bk = Ω.


    Examples:

(Venn diagrams: B1, B2, B3, B4 form a partition of Ω; similarly, B1, . . . , B5 partition Ω.)

Important: B and B̅ partition Ω for any event B.

Partitioning an event A

Any set or event A can be partitioned: it doesn't have to be Ω. If B1, . . . , Bk form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bk) form a partition of A.

We will see that this is very useful for finding the probability of event A.

This is because it is often easier to find the probability of small chunks of A (the partitioned sections) than to find the whole probability of A at once. The partition idea shows us how to add the probabilities of these chunks together: see later.


    1.5 Probability: a way of measuring sets

Remember that you are given the job of building the science of randomness. This means somehow measuring chance.

If I sent you away to measure heights, the first thing you would ask is what you are supposed to be measuring the heights of. People? Trees? Mountains?

We have the same question when setting out to measure chance.

Chance of what?

The answer is sets.

It was clever to formulate our notions of events and sample spaces in terms of sets: it gives us something to measure. Probability, the name that we give to our chance-measure, is a way of measuring sets.

You probably already have a good idea for a suitable way to measure the size of a set or event. Why not just count the number of elements in it?

In fact, this is often what we do to measure probability (although counting the number of elements can be far from easy!). But there are circumstances where this is not appropriate.

What happens, for example, if one set is far more likely than another, but they have the same number of elements? Should they have the same probability?

First set: {Springboks win}.
Second set: {All Blacks win}.
Both sets have just one element, but we need to give them different probabilities!

More problems arise when the sets are infinite or continuous.

Should the intervals [3, 4] and [13, 14] have the same probability, just because they are the same length? Yes they should, if (say) our random experiment is to pick a random number on [0, 20]; but no they shouldn't (hopefully!) if our experiment was the time in years taken by a student to finish their degree.


    Most of this course is about probability distributions.

A probability distribution is a rule according to which probability is apportioned, or distributed, among the different sets in the sample space.

At its simplest, a probability distribution just lists every element in the sample space and allots it a probability between 0 and 1, such that the total sum of probabilities is 1.

In the rugby example, we could use the following probability distribution:

P(Springboks win) = 0.3,  P(All Blacks win) = 0.7.

In general, we have the following definition for discrete sample spaces.

Discrete probability distributions

Definition: Let Ω = {s1, s2, . . .} be a discrete sample space. A discrete probability distribution on Ω is a set of real numbers {p1, p2, . . .} associated with the sample points {s1, s2, . . .} such that:

1. 0 ≤ pi ≤ 1 for all i;

2. Σi pi = 1.

pi is called the probability of the event that the outcome is si.

We write: pi = P(si).

The rule for measuring the probability of any set, or event, A ⊆ Ω, is to sum the probabilities of the elements of A:

P(A) = Σ_{i ∈ A} pi.

E.g. if A = {s3, s5, s14}, then P(A) = p3 + p5 + p14.
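A discrete probability distribution is easy to mirror in code as a mapping from sample points to probabilities. The sketch below (not from the notes; the values pi are invented) computes P(A) by summing over the elements of A.

    # Sketch only: a discrete probability distribution as a dict, and P(A) as a sum.
    p = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}   # invented values; they sum to 1
    assert abs(sum(p.values()) - 1.0) < 1e-12

    def prob(A, p):
        """P(A) = sum of p_i over the sample points in A."""
        return sum(p[s] for s in A)

    print(round(prob({"s2", "s4"}, p), 2))             # 0.6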


    Continuous probability distributions

On a continuous sample space Ω, e.g. Ω = [0, 1], we cannot list all the elements and give them an individual probability. We will need more sophisticated methods, detailed later in the course.

However, the same principle applies. A continuous probability distribution is a rule under which we can calculate a probability between 0 and 1 for any set, or event, A ⊆ Ω.

Probability Axioms

For any sample space, discrete or continuous, all of probability theory is based on the following three definitions, or axioms.

Axiom 1: P(Ω) = 1.

Axiom 2: 0 ≤ P(A) ≤ 1 for all events A.

Axiom 3: If A1, A2, . . . , An are mutually exclusive events (no overlap), then
P(A1 ∪ A2 ∪ . . . ∪ An) = P(A1) + P(A2) + . . . + P(An).

Note: The axioms can never be proved: they are definitions.

If our rule for measuring sets satisfies the three axioms, it is a valid probability distribution.

The idea of the axioms is that ALL possible properties of probability should be derivable using ONLY these three axioms. To see how this works, see Section 1.9 (non-examinable).

The definition of a discrete probability distribution given above clearly satisfies the axioms. The challenge of defining a probability distribution on a continuous sample space is left till later.

Note: P(∅) = 0.

Note: Remember that an EVENT is a SET: an event is a subset of the sample space.


    1.6 Probabilities of combined events

In Section 1.3 we discussed unions, intersections, and complements of events. We now look at the probabilities of these combinations. Everything below applies to events (sets) in either a discrete or a continuous sample space.

1. Probability of a union

Let A and B be events on a sample space Ω. There are two cases for the probability of the union A ∪ B:

1. A and B are mutually exclusive (no overlap): i.e. A ∩ B = ∅.
2. A and B are not mutually exclusive: A ∩ B ≠ ∅.

For Case 1, we get the probability of A ∪ B straight from Axiom 3:

If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B).

For Case 2, we have the following formula:

For ANY events A, B:  P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Note: The formula for Case 2 applies also to Case 1: just substitute P(A ∩ B) = P(∅) = 0.

For three or more events: e.g. for any A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C).
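The inclusion-exclusion formula can be checked by brute-force counting on an equally likely sample space. The sketch below (not from the notes; the two-dice experiment and the three events are invented for illustration) does exactly that.

    # Sketch only: verifying inclusion-exclusion by counting two-dice outcomes.
    from itertools import product

    omega = list(product(range(1, 7), repeat=2))          # 36 equally likely outcomes
    P = lambda E: len(E) / len(omega)

    A = {w for w in omega if w[0] == 6}                   # first die is a 6
    B = {w for w in omega if w[1] == 6}                   # second die is a 6
    C = {w for w in omega if sum(w) >= 10}                # total is 10 or more

    lhs = P(A | B | C)
    rhs = (P(A) + P(B) + P(C)
           - P(A & B) - P(A & C) - P(B & C)
           + P(A & B & C))
    print(abs(lhs - rhs) < 1e-12)                         # True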


    Explanation

For any events A and B,  P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

The formal proof of this formula is in Section 1.9 (non-examinable). To understand the formula, think of the Venn diagrams:

(Venn diagrams: A and B overlapping; and A together with B \ (A ∩ B).)

When we add P(A) + P(B), we add the intersection twice. So we have to subtract the intersection once to get P(A ∪ B):

P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Alternatively, think of A ∪ B as two disjoint sets: all of A, and the bits of B without the intersection. So P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

2. Probability of an intersection

There is no easy formula for P(A ∩ B). We might be able to use statistical independence (Section 1.13). If A and B are not statistically independent, we often use conditional probability (Section 1.10).

3. Probability of a complement

P(A̅) = 1 - P(A).

This is obvious, but a formal proof is given in Section 1.9.


    1.7 The Partition Theorem

The Partition Theorem is one of the most useful tools for probability calculations. It is based on the fact that probabilities are often easier to calculate if we break down a set into smaller parts.

Recall that a partition of Ω is a collection of non-overlapping sets B1, . . . , Bm which together cover everything in Ω.

Also, if B1, . . . , Bm form a partition of Ω, then (A ∩ B1), . . . , (A ∩ Bm) form a partition of the set or event A.

The probability of event A is therefore the sum of its parts:

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + P(A ∩ B4).

The Partition Theorem is a mathematical way of saying "the whole is the sum of its parts".

Theorem 1.7: The Partition Theorem. (Proof in Section 1.9.)

Let B1, . . . , Bm form a partition of Ω. Then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

Note: Recall the formal definition of a partition. Sets B1, B2, . . . , Bm form a partition of Ω if Bi ∩ Bj = ∅ for all i ≠ j, and ∪_{i=1}^{m} Bi = Ω.
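A small sketch (not from the notes) of the Partition Theorem in action: the sample space and the events below are invented, but the check that P(A) equals the sum of P(A ∩ Bi) over a partition is general.

    # Sketch only: Partition Theorem checked by counting equally likely outcomes.
    omega = set(range(100))                       # invented sample space {0, ..., 99}
    P = lambda E: len(E) / len(omega)

    # A partition of omega by remainder mod 4.
    B = [{w for w in omega if w % 4 == i} for i in range(4)]

    A = {w for w in omega if w < 30}              # an arbitrary event

    print(P(A))                                   # 0.3
    print(round(sum(P(A & Bi) for Bi in B), 10))  # 0.3 as well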


    1.8 Examples of basic probability calculations

An Australian survey asked people what sort of car they would like if they could choose any car at all. 13% of respondents had children and chose a large car. 12% of respondents did not have children and chose a large car. 33% of respondents had children.

Find the probability that a respondent:
(a) chose a large car;
(b) either had children or chose a large car (or both).

First formulate events:

Let C = "has children", C̅ = "no children", L = "chooses large car".

Next write down all the information given:

P(C) = 0.33,  P(C ∩ L) = 0.13,  P(C̅ ∩ L) = 0.12.

(a) Asked for P(L).

P(L) = P(L ∩ C) + P(L ∩ C̅)    (Partition Theorem)
     = 0.13 + 0.12
     = 0.25.

So P(chooses large car) = 0.25.

(b) Asked for P(L ∪ C).

P(L ∪ C) = P(L) + P(C) - P(L ∩ C)    (Section 1.6)
         = 0.25 + 0.33 - 0.13
         = 0.45.


Example 2: Facebook statistics for New Zealand university students aged between 18 and 24 suggest that 22% are interested in music, while 34% are interested in sport.

Formulate events: M = "interested in music", S = "interested in sport".

(a) What is P(M̅)?
(b) What is P(M ∩ S)?

Information given: P(M) = 0.22, P(S) = 0.34.

(a) P(M̅) = 1 - P(M) = 1 - 0.22 = 0.78.

(b) We cannot calculate P(M ∩ S) from the information given.

(c) Given the further information that 48% of the students are interested in neither music nor sport, find P(M ∪ S) and P(M ∩ S).

Information given: P(M̅ ∩ S̅) = 0.48.

Thus P(M ∪ S) = 1 - P(M̅ ∩ S̅) = 1 - 0.48 = 0.52.

This is the probability that a student is interested in music, or sport, or both.

P(M ∩ S) = P(M) + P(S) - P(M ∪ S)    (Section 1.6)
         = 0.22 + 0.34 - 0.52
         = 0.04.

Only 4% of students are interested in both music and sport.


    (d) Find the probability that a student is interested in music, but not sport.

P(M ∩ S̅) = P(M) - P(M ∩ S)    (Partition Theorem)
          = 0.22 - 0.04
          = 0.18.

    1.9 Formal probability proofs: non-examinable

If you are a mathematician, you will be interested to see how properties of probability are proved formally. Only the Axioms, together with standard set-theoretic results, may be used.

Theorem: The probability measure P has the following properties.

(i) P(∅) = 0.

(ii) P(A̅) = 1 - P(A) for any event A.

(iii) (Partition Theorem.) If B1, B2, . . . , Bm form a partition of Ω, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi).

(iv) P(A ∪ B) = P(A) + P(B) - P(A ∩ B) for any events A, B.

Proof:

(i) For any A, we have A = A ∪ ∅, and A ∩ ∅ = ∅ (mutually exclusive).
So P(A) = P(A ∪ ∅) = P(A) + P(∅)  (Axiom 3), giving P(∅) = 0.

(ii) Ω = A ∪ A̅, and A ∩ A̅ = ∅ (mutually exclusive).
So 1 = P(Ω)  (Axiom 1)  = P(A ∪ A̅) = P(A) + P(A̅)  (Axiom 3).


(iii) Suppose B1, . . . , Bm are a partition of Ω: then Bi ∩ Bj = ∅ if i ≠ j, and ∪_{i=1}^{m} Bi = Ω.

Thus (A ∩ Bi) ∩ (A ∩ Bj) = A ∩ (Bi ∩ Bj) = A ∩ ∅ = ∅ for i ≠ j, i.e. (A ∩ B1), . . . , (A ∩ Bm) are mutually exclusive also.

So,  Σ_{i=1}^{m} P(A ∩ Bi) = P( ∪_{i=1}^{m} (A ∩ Bi) )    (Axiom 3)
                           = P( A ∩ ∪_{i=1}^{m} Bi )    (Distributive laws)
                           = P(A ∩ Ω)
                           = P(A).

(iv)
A ∪ B = (A ∩ Ω) ∪ (B ∩ Ω)    (Set theory)
      = (A ∩ (B ∪ B̅)) ∪ (B ∩ (A ∪ A̅))    (Set theory)
      = (A ∩ B) ∪ (A ∩ B̅) ∪ (B ∩ A) ∪ (B ∩ A̅)    (Distributive laws)
      = (A ∩ B̅) ∪ (A̅ ∩ B) ∪ (A ∩ B).

These 3 events are mutually exclusive:
e.g. (A ∩ B̅) ∩ (A ∩ B) = A ∩ (B̅ ∩ B) = A ∩ ∅ = ∅, etc.

So, P(A ∪ B) = P(A ∩ B̅) + P(A̅ ∩ B) + P(A ∩ B)    (Axiom 3)
             = [P(A) - P(A ∩ B)]      (from (iii), using B and B̅)
               + [P(B) - P(A ∩ B)]    (from (iii), using A and A̅)
               + P(A ∩ B)
             = P(A) + P(B) - P(A ∩ B).


    1.10 Conditional Probability

Conditioning is another of the fundamental tools of probability: probably the most fundamental tool. It is especially helpful for calculating the probabilities of intersections, such as P(A ∩ B), which themselves are critical for the useful Partition Theorem.

Additionally, the whole field of stochastic processes (Stats 320 and 325) is based on the idea of conditional probability. What happens next in a process depends, or is conditional, on what has happened beforehand.

Dependent events

Suppose A and B are two events on the same sample space. There will often be dependence between A and B. This means that if we know that B has occurred, it changes our knowledge of the chance that A will occur.

Example: Toss a die once.

Let event A = "get a 6", and event B = "get 4 or better".

If the die is fair, then P(A) = 1/6 and P(B) = 1/2.

However, if we know that B has occurred, then there is an increased chance that A has occurred:

P(A occurs given that B has occurred) = 1/3.

We write  P(A given B) = P(A | B) = 1/3.

Question: what would be P(B | A)?

P(B | A) = P(B occurs, given that A has occurred)
         = P(get 4 or better, given that we know we got a 6)
         = 1.


    Conditioning as reducing the sample space

Sally wants to use Facebook to find a boyfriend at Uni. Her friend Kate tells her not to bother, because there are more women than men on Facebook. Here are the 2012 figures for Facebook users at the University of Auckland:

Relationship status    Male    Female    Total
Single                 700     560       1260
In a relationship      460     660       1120
Total                  1160    1220      2380

Before we go any further . . . do you agree with Kate?

No, because out of the SINGLE people on Facebook, there are a lot more men than women!

Conditioning is all about the sample space of interest. The table above shows the following sample space: Ω = {Facebook users at UoA}.

But the sample space that should interest Sally is different: it is

S = {members of Ω who are SINGLE}.

Suppose we pick a person from those in the table.

Define event M to be: M = {person is male}.

Kate is referring to the following probability:

P(M) = (# Ms) / (total # in table) = 1160/2380 = 0.49.

Kate is correct that there are more women than men on Facebook, but she is using the wrong sample space so her answer is not relevant.


Now suppose we reduce our sample space from

Ω = {everyone in the table}

to

S = {single people in the table}.

Then P(person is male, given that the person is single)

= (# single males) / (# singles)

= (# who are M and S) / (# who are S)

= 700/1260

= 0.56.

We write: P(M | S) = 0.56.

This is the probability that Sally is interested in, and she can rest assured that there are more single men than single women out there.

Example: Define event R to be that a person is in a relationship. What is the proportion of males among people in a relationship, P(M | R)?

P(M | R) = (# males in a relationship) / (# in a relationship)
         = (# who are M and R) / (# who are R)
         = 460/1120
         = 0.41.
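The counting argument above is easy to reproduce in code. The sketch below (not from the notes; the helper names are mine) stores the Facebook table as counts and computes P(M), P(M | S) and P(M | R) as ratios of counts.

    # Sketch only: conditional probabilities from a table of counts.
    counts = {
        ("single", "male"): 700, ("single", "female"): 560,
        ("relationship", "male"): 460, ("relationship", "female"): 660,
    }
    total = sum(counts.values())                       # 2380

    def P(pred):
        """Probability that a randomly chosen person satisfies pred."""
        return sum(n for key, n in counts.items() if pred(key)) / total

    def P_given(pred, cond):
        """P(pred | cond) = P(pred and cond) / P(cond)."""
        return P(lambda k: pred(k) and cond(k)) / P(cond)

    male = lambda k: k[1] == "male"
    single = lambda k: k[0] == "single"
    rel = lambda k: k[0] == "relationship"

    print(round(P(male), 2))                 # 0.49
    print(round(P_given(male, single), 2))   # 0.56
    print(round(P_given(male, rel), 2))      # 0.41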


We could follow the same working for any pair of events, A and B:

P(A | B) = (# who are A and B) / (# who are B)

         = [(# who are A and B) / (# in Ω)] / [(# who are B) / (# in Ω)]

         = P(A ∩ B) / P(B).

This is our definition of conditional probability:

Definition: Let A and B be two events. The conditional probability that event A occurs, given that event B has occurred, is written P(A | B), and is given by

P(A | B) = P(A ∩ B) / P(B).

Read P(A | B) as "probability of A, given B".

Note: P(A | B) gives P(A and B, from within the set of Bs only).
P(A ∩ B) gives P(A and B, from the whole sample space Ω).

Note: Follow the reasoning above carefully. It is important to understand why the conditional probability is the probability of the intersection within the new sample space.

Conditioning on event B means changing the sample space to B.

Think of P(A | B) as the chance of getting an A, from the set of Bs only.


Note: In the Facebook example, we found that P(M | S) = 0.56, and P(M | R) = 0.41. This means that a single person on UoA Facebook is more likely to be male than female, but a person in a relationship is more likely to be female than male! Why the difference? Your guess is as good as mine, but I think it's because men in a relationship are too busy buying flowers for their girlfriends to have time to spend on Facebook.

The symbol P belongs to the sample space Ω

Recall the first of our probability axioms: P(Ω) = 1. This shows that the symbol P is defined with respect to Ω. That is, P BELONGS to the sample space Ω.

If we change the sample space, we need to change the symbol P. This is what we do in conditional probability: to change the sample space from Ω to B, say, we change from the symbol P to the symbol P( · | B).

The symbol P( · | B) should behave exactly like the symbol P. For example:

P(C ∪ D) = P(C) + P(D) - P(C ∩ D),

so

P(C ∪ D | B) = P(C | B) + P(D | B) - P(C ∩ D | B).

Trick for checking conditional probability calculations:

A useful trick for checking a conditional probability expression is to replace the conditioning set by Ω, and see whether the expression is still true.

For example, is P(A | B) + P(A̅ | B) = 1?

Answer: Replace B by Ω: this gives

P(A | Ω) + P(A̅ | Ω) = P(A) + P(A̅) = 1.

So, yes, P(A | B) + P(A̅ | B) = 1 for any other sample space B.


Is P(A | B) + P(A̅ | B̅) = 1?

Try to replace the conditioning set by Ω: we can't! There are two conditioning sets: B and B̅.

The expression is NOT true. It doesn't make sense to try to add together probabilities from two different sample spaces.

The Multiplication Rule

For any events A and B,  P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Proof: Immediate from the definitions:

P(A | B) = P(A ∩ B) / P(B)   ⇒   P(A ∩ B) = P(A | B)P(B),

and

P(B | A) = P(B ∩ A) / P(A)   ⇒   P(B ∩ A) = P(A ∩ B) = P(B | A)P(A).

New statement of the Partition Theorem

The Multiplication Rule gives us a new statement of the Partition Theorem: if B1, . . . , Bm partition S, then for any event A,

P(A) = Σ_{i=1}^{m} P(A ∩ Bi) = Σ_{i=1}^{m} P(A | Bi)P(Bi).

Both formulations of the Partition Theorem are very widely used, but especially the conditional formulation Σ_{i=1}^{m} P(A | Bi)P(Bi).

Warning: Be sure to use this new version of the Partition Theorem correctly:

it is  P(A) = P(A | B1)P(B1) + . . . + P(A | Bm)P(Bm),
NOT    P(A) = P(A | B1) + . . . + P(A | Bm).


    Conditional probability and Peter Pan

When Peter Pan was hungry but had nothing to eat, he would pretend to eat. (An excellent strategy, I have always found.)

Conditional probability is the Peter Pan of Stats 210. When you don't know something that you need to know, pretend you know it.

Conditioning on an event is like pretending that you know that the event has happened.

For example, if you know the probability of getting to work on time in different weather conditions, but you don't know what the weather will be like today, pretend you do and add up the different possibilities.

P(work on time) = P(work on time | fine) × P(fine) + P(work on time | wet) × P(wet).

1.11 Examples of conditional probability and partitions

Tom gets the bus to campus every day. The bus is on time with probability 0.6, and late with probability 0.4.

The sample space can be written as Ω = {bus journeys}. We can formulate events as follows:

T = "on time";  L = "late".

From the information given, the events have probabilities:

P(T) = 0.6;  P(L) = 0.4.

(a) Do the events T and L form a partition of the sample space Ω? Explain why or why not.

Yes: they cover all possible journeys (probabilities sum to 1), and there is no overlap in the events by definition.


The buses are sometimes crowded and sometimes noisy, both of which are problems for Tom as he likes to use the bus journeys to do his Stats assignments. When the bus is on time, it is crowded with probability 0.5. When it is late, it is crowded with probability 0.7. The bus is noisy with probability 0.8 when it is crowded, and with probability 0.4 when it is not crowded.

(b) Formulate events C and N corresponding to the bus being crowded and noisy. Do the events C and N form a partition of the sample space? Explain why or why not.

Let C = "crowded", N = "noisy". C and N do NOT form a partition of Ω. It is possible for the bus to be noisy when it is crowded, so there must be some overlap between C and N.

(c) Write down probability statements corresponding to the information given above. Your answer should involve two statements linking C with T and L, and two statements linking N with C.

P(C | T) = 0.5;  P(C | L) = 0.7.
P(N | C) = 0.8;  P(N | C̅) = 0.4.

(d) Find the probability that the bus is crowded.

P(C) = P(C | T)P(T) + P(C | L)P(L)    (Partition Theorem)
     = 0.5 × 0.6 + 0.7 × 0.4
     = 0.58.

(e) Find the probability that the bus is noisy.

P(N) = P(N | C)P(C) + P(N | C̅)P(C̅)    (Partition Theorem)
     = 0.8 × 0.58 + 0.4 × (1 - 0.58)
     = 0.632.
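A minimal sketch (not from the notes) of parts (d) and (e), chaining the conditional form of the Partition Theorem:

    # Sketch only: the bus example, using the Partition Theorem twice.
    P_T, P_L = 0.6, 0.4                     # P(T), P(L): on time / late
    P_C_given_T, P_C_given_L = 0.5, 0.7     # P(C | T), P(C | L)
    P_N_given_C, P_N_given_notC = 0.8, 0.4  # P(N | C), P(N | not C)

    P_C = P_C_given_T * P_T + P_C_given_L * P_L           # partition over {T, L}
    P_N = P_N_given_C * P_C + P_N_given_notC * (1 - P_C)  # partition over {C, not C}

    print(round(P_C, 2), round(P_N, 3))     # 0.58 0.632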


1.12 Bayes' Theorem: inverting conditional probabilities

Consider P(B ∩ A) = P(A ∩ B). Apply the multiplication rule to each side:

P(B | A)P(A) = P(A | B)P(B).

Thus  P(B | A) = P(A | B)P(B) / P(A).    (*)

This is the simplest form of Bayes' Theorem, named after Thomas Bayes (1702-61), English clergyman and founder of Bayesian Statistics.

Bayes' Theorem allows us to invert the conditioning, i.e. to express P(B | A) in terms of P(A | B).

This is very useful. For example, it might be easy to calculate

P(later event | earlier event),

but we might only observe the later event and wish to deduce the probability that the earlier event occurred,

P(earlier event | later event).

Full statement of Bayes' Theorem:

Theorem 1.12: Let B1, B2, . . . , Bm form a partition of Ω. Then for any event A, and for any j = 1, . . . , m,

P(Bj | A) = P(A | Bj)P(Bj) / Σ_{i=1}^{m} P(A | Bi)P(Bi).    (Bayes' Theorem)

Proof: Immediate from (*) (put B = Bj), and the Partition Rule, which gives P(A) = Σ_{i=1}^{m} P(A | Bi)P(Bi).


Special case of Bayes' Theorem when m = 2: use B and B̅ as the partition of Ω:

then  P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B̅)P(B̅)].

Example: The case of the Perfidious Gardener.

Mr Smith owns a hysterical rosebush. It will die with probability 1/2 if watered, and with probability 3/4 if not watered. Worse still, Smith employs a perfidious gardener who will fail to water the rosebush with probability 2/3.

Smith returns from holiday to find the rosebush . . . DEAD!!! What is the probability that the gardener did not water it?

Solution:

First step: formulate events

Let: D = "rosebush dies"
     W = "gardener waters rosebush"
     W̅ = "gardener fails to water rosebush"

Second step: write down all information given

P(D | W) = 1/2,  P(D | W̅) = 3/4,  P(W̅) = 2/3  (so P(W) = 1/3).

Third step: write down what we're looking for

P(W̅ | D).

Fourth step: compare this to what we know

Need to invert the conditioning, so use Bayes' Theorem:

P(W̅ | D) = P(D | W̅)P(W̅) / [P(D | W̅)P(W̅) + P(D | W)P(W)]
          = (3/4 × 2/3) / (3/4 × 2/3 + 1/2 × 1/3)
          = 3/4.

So the gardener failed to water the rosebush with probability 3/4.


    Example: The case of the Defective Ketchup Bottle.

Ketchup bottles are produced in 3 different factories, accounting for 50%, 30%, and 20% of the total output respectively. The percentage of bottles from the 3 factories that are defective is respectively 0.4%, 0.6%, and 1.2%. A statistics lecturer who eats only ketchup finds a defective bottle in her lunchbox. What is the probability that it came from Factory 1?

Solution:

1. Events:

let Fi = "bottle comes from Factory i" (i = 1, 2, 3); let D = "bottle is defective".

2. Information given:

P(F1) = 0.5,  P(F2) = 0.3,  P(F3) = 0.2;
P(D | F1) = 0.004,  P(D | F2) = 0.006,  P(D | F3) = 0.012.

3. Looking for:

P(F1 | D)  (so need to invert conditioning).

4. Bayes' Theorem:

P(F1 | D) = P(D | F1)P(F1) / [P(D | F1)P(F1) + P(D | F2)P(F2) + P(D | F3)P(F3)]

          = (0.004 × 0.5) / (0.004 × 0.5 + 0.006 × 0.3 + 0.012 × 0.2)

          = 0.002 / 0.0062

          = 0.322.
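Bayes' Theorem over a partition is a one-line computation once the priors P(Bi) and the likelihoods P(A | Bi) are stored in lists. The sketch below (not from the notes; the function name bayes is mine) reproduces the ketchup example.

    # Sketch only: Bayes' Theorem over a partition.
    def bayes(prior, likelihood, j):
        """P(B_j | A), where prior[i] = P(B_i) and likelihood[i] = P(A | B_i)."""
        total = sum(p * l for p, l in zip(prior, likelihood))   # P(A), by the Partition Theorem
        return prior[j] * likelihood[j] / total

    prior = [0.5, 0.3, 0.2]            # P(F1), P(F2), P(F3)
    likelihood = [0.004, 0.006, 0.012] # P(D | F1), P(D | F2), P(D | F3)

    print(bayes(prior, likelihood, 0)) # 0.3225806..., i.e. about 0.32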


    1.13 Statistical Independence

Two events A and B are statistically independent if the occurrence of one does not affect the occurrence of the other.

This means P(A | B) = P(A) and P(B | A) = P(B).

Now P(A | B) = P(A ∩ B) / P(B),

so if P(A | B) = P(A) then P(A ∩ B) = P(A) × P(B).

We use this as our definition of statistical independence.

Definition: Events A and B are statistically independent if

P(A ∩ B) = P(A)P(B).

For more than two events, we say:

Definition: Events A1, A2, . . . , An are mutually independent if

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2) . . . P(An), AND

the same multiplication rule holds for every subcollection of the events too.

E.g. events A1, A2, A3, A4 are mutually independent if

i) P(Ai ∩ Aj) = P(Ai)P(Aj) for all i, j with i ≠ j; AND

ii) P(Ai ∩ Aj ∩ Ak) = P(Ai)P(Aj)P(Ak) for all i, j, k that are all different; AND

iii) P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1)P(A2)P(A3)P(A4).

Note: If events are physically independent, then they will also be statistically independent.


    Statistical independence for calculating the probability of an intersection

In Section 1.6 we said that it is often hard to calculate P(A ∩ B). We usually have two choices.

1. IF A and B are statistically independent, then

P(A ∩ B) = P(A) × P(B).

2. If A and B are not known to be statistically independent, we usually have to use conditional probability and the multiplication rule:

P(A ∩ B) = P(A | B)P(B).

This still requires us to be able to calculate P(A | B).

Example: Toss a fair coin and a fair die together. The coin and die are physically independent.

Sample space: Ω = {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}: all 12 items are equally likely.

Let A = "heads" and B = "six".

Then P(A) = P({H1, H2, H3, H4, H5, H6}) = 6/12 = 1/2,
and P(B) = P({H6, T6}) = 2/12 = 1/6.

Now P(A ∩ B) = P(Heads and 6) = P({H6}) = 1/12.
But P(A) × P(B) = 1/2 × 1/6 = 1/12 also,

so P(A ∩ B) = P(A)P(B), and thus A and B are statistically independent.


    Pairwise independence does not imply mutual independence

    Example: A jar contains 4 balls: one red, one white, one blue, and one red, white

    & blue. Draw one ball at random.

Let A = "ball has red on it", B = "ball has white on it", C = "ball has blue on it".

Two balls satisfy A, so P(A) = 2/4 = 1/2. Likewise, P(B) = P(C) = 1/2.

Pairwise independent:

Consider P(A ∩ B) = 1/4  (one of the 4 balls has both red and white on it).

But P(A) × P(B) = 1/2 × 1/2 = 1/4, so P(A ∩ B) = P(A)P(B).

Likewise, P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C).

So A, B and C are pairwise independent.

Mutually independent?

Consider P(A ∩ B ∩ C) = 1/4 (one of the 4 balls), while P(A)P(B)P(C) = 1/2 × 1/2 × 1/2 = 1/8 ≠ P(A ∩ B ∩ C).

So A, B and C are NOT mutually independent, despite being pairwise independent.
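The pairwise-but-not-mutual behaviour can be confirmed by enumerating the four balls directly; a small sketch (not from the notes) follows.

    # Sketch only: pairwise vs mutual independence for the four-ball example.
    omega = [{"red"}, {"white"}, {"blue"}, {"red", "white", "blue"}]  # 4 equally likely balls
    P = lambda E: sum(1 for ball in omega if E(ball)) / len(omega)

    A = lambda ball: "red" in ball
    B = lambda ball: "white" in ball
    C = lambda ball: "blue" in ball
    both = lambda X, Y: (lambda ball: X(ball) and Y(ball))

    print(P(both(A, B)) == P(A) * P(B))   # True: pairwise independent
    print(P(both(A, C)) == P(A) * P(C))   # True
    print(P(both(B, C)) == P(B) * P(C))   # True

    ABC = lambda ball: A(ball) and B(ball) and C(ball)
    print(P(ABC) == P(A) * P(B) * P(C))   # False: not mutually independent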

    1.14 Random Variables

We have one more job to do in laying the foundations of our science of randomness. So far we have come up with the following ideas:

1. Things that happen are sets, also called events.

2. We measure chance by measuring sets, using a measure called probability.

Finally, what are the sets that we are measuring? It is a nuisance to have lots of different sample spaces:

Ω = {head, tail};  Ω = {same, different};  Ω = {Springboks, All Blacks}.


All of these sample spaces could be represented more concisely in terms of numbers:

Ω = {0, 1}.

On the other hand, there are many random experiments that genuinely produce random numbers as their outcomes.

For example, the number of girls in a three-child family; the number of heads from 10 tosses of a coin; and so on.

When the outcome of a random experiment is a number, it enables us to quantify many new things of interest:

1. quantify the average value (e.g. the average number of heads we would get if we made 10 coin-tosses again and again);

2. quantify how much the outcomes tend to diverge from the average value;

3. quantify relationships between different random quantities (e.g. is the number of girls related to the hormone levels of the fathers?)

The list is endless. To give us a framework in which these investigations can take place, we give a special name to random experiments that produce numbers as their outcomes.

A random experiment whose possible outcomes are real numbers is called a random variable.

In fact, any random experiment can be made to have outcomes that are real numbers, simply by mapping the sample space onto a set of real numbers using a function.

For example: define the function X : Ω → R by X(Springboks win) = 0; X(All Blacks win) = 1.

This gives us our formal definition of a random variable:

Definition: A random variable (r.v.) is a function from a sample space Ω to the real numbers R. We write X : Ω → R.


Although this is the formal definition, the intuitive definition of a random variable is probably more useful. Intuitively, remember that a random variable equates to a random experiment whose outcomes are numbers.

A random variable produces random real numbers as the outcome of a random experiment.

Defining random variables serves the dual purposes of:

1. Describing many different sample spaces in the same terms:

e.g. Ω = {0, 1} with P(1) = p and P(0) = 1 - p describes EVERY possible experiment with two outcomes.

2. Giving a name to a large class of random experiments that genuinely produce random numbers, and for which we want to develop general rules for finding averages, variances, relationships, and so on.

Example: Toss a coin 3 times. The sample space is

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

One example of a random variable is X : Ω → R such that, for sample point si, we have X(si) = # heads in outcome si.

So X(HHH) = 3, X(THT) = 1, etc.

Another example is Y : Ω → R such that

Y(si) = 1 if the 2nd toss is a head, and Y(si) = 0 otherwise.

Then Y(HTH) = 0, Y(THH) = 1, Y(HHH) = 1, etc.

Probabilities for random variables

By convention, we use CAPITAL LETTERS for random variables (e.g. X), and lower case letters to represent the values that the random variable takes (e.g. x).

For a sample space Ω and random variable X : Ω → R, and for a real number x,

P(X = x) = P(outcome s is such that X(s) = x) = P({s : X(s) = x}).


Example: toss a fair coin 3 times. All outcomes are equally likely: P(HHH) = P(HHT) = . . . = P(TTT) = 1/8.

Let X : Ω → R, such that X(s) = # heads in s. Then

P(X = 0) = P({TTT}) = 1/8.
P(X = 1) = P({HTT, THT, TTH}) = 3/8.
P(X = 2) = P({HHT, HTH, THH}) = 3/8.
P(X = 3) = P({HHH}) = 1/8.

    Note that P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1.
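A minimal sketch (not from the notes): computing the distribution of X = number of heads in 3 tosses by enumerating Ω, which reproduces the probabilities 1/8, 3/8, 3/8, 1/8.

    # Sketch only: P(X = x) by enumerating the 8 equally likely outcomes.
    from itertools import product
    from collections import Counter

    omega = ["".join(s) for s in product("HT", repeat=3)]   # 8 equally likely outcomes
    X = lambda s: s.count("H")

    dist = Counter(X(s) for s in omega)
    for x in sorted(dist):
        print(x, dist[x] / len(omega))    # 0 0.125, 1 0.375, 2 0.375, 3 0.125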

    Independent random variables

    Random variables X and Y are independent if each does not affect the other.

Recall that two events A and B are independent if P(A ∩ B) = P(A)P(B). Similarly, random variables X and Y are defined to be independent if

P({X = x} ∩ {Y = y}) = P(X = x)P(Y = y)  for all possible values x and y.

We usually replace the cumbersome notation P({X = x} ∩ {Y = y}) by the simpler notation P(X = x, Y = y).

From now on, we will use the following notations interchangeably:

P({X = x} ∩ {Y = y}) = P(X = x AND Y = y) = P(X = x, Y = y).

Thus X and Y are independent if and only if

P(X = x, Y = y) = P(X = x)P(Y = y)  for ALL possible values x, y.


    1.15 Key Probability Results for Chapter 1

1. If A and B are mutually exclusive (i.e. A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

2. Conditional probability: P(A | B) = P(A ∩ B) / P(B) for any A, B.

Or: P(A ∩ B) = P(A | B)P(B).

3. For any A, B, we can write

P(A | B) = P(B | A)P(A) / P(B).

This is a simplified version of Bayes' Theorem. It shows how to invert the conditioning, i.e. how to find P(A | B) when you know P(B | A).

4. Bayes' Theorem slightly more generalized: for any A, B,

P(A | B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | A̅)P(A̅)].

This works because A and A̅ form a partition of the sample space.

5. Complete version of Bayes' Theorem:

If sets A1, . . . , Am form a partition of the sample space, i.e. they do not overlap (mutually exclusive) and collectively cover all possible outcomes (their union is the sample space), then

P(Aj | B) = P(B | Aj)P(Aj) / [P(B | A1)P(A1) + . . . + P(B | Am)P(Am)]

          = P(B | Aj)P(Aj) / Σ_{i=1}^{m} P(B | Ai)P(Ai).


6. Partition Theorem: if A1, . . . , Am form a partition of the sample space, then

P(B) = P(B ∩ A1) + P(B ∩ A2) + . . . + P(B ∩ Am).

This can also be written as:

P(B) = P(B | A1)P(A1) + P(B | A2)P(A2) + . . . + P(B | Am)P(Am).

These are both very useful formulations.

7. Chains of events:

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).

8. Statistical independence: if A and B are independent, then

P(A ∩ B) = P(A)P(B),  and  P(A | B) = P(A),  and  P(B | A) = P(B).

9. Conditional probability:

If P(B) > 0, then we can treat P( · | B) just like P:

e.g. if A1 and A2 are mutually exclusive, then P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B)  (compare with P(A1 ∪ A2) = P(A1) + P(A2));

if A1, . . . , Am partition the sample space, then P(A1 | B) + P(A2 | B) + . . . + P(Am | B) = 1;

and P(A̅ | B) = 1 - P(A | B) for any A.

(Note: it is not generally true that P(A̅ | B̅) = 1 - P(A | B).)

The fact that P( · | B) is a valid probability measure is easily verified by checking that it satisfies Axioms 1, 2, and 3.

10. Unions: For any A, B, C,

P(A ∪ B) = P(A) + P(B) - P(A ∩ B);

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C).

The second expression is obtained by writing P(A ∪ B ∪ C) = P(A ∪ (B ∪ C)) and applying the first expression to A and (B ∪ C), then applying it again to expand P(B ∪ C).


    1.16 Chains of events and probability trees: non-examinable

The multiplication rule is very helpful for calculating probabilities when events happen in sequence.

Example: Two balls are drawn at random without replacement from a box containing 4 white and 2 red balls. Find the probability that:
(a) they are both white;
(b) the second ball is red.

Solution

Let event Wi = "ith ball is white" and Ri = "ith ball is red".

(a) P(W1 ∩ W2) = P(W2 ∩ W1) = P(W2 | W1)P(W1).

Now P(W1) = 4/6 and P(W2 | W1) = 3/5.

So P(both white) = P(W1 ∩ W2) = 3/5 × 4/6 = 2/5.

(b) Looking for P(2nd ball is red). To find this, we have to condition on what happened in the first draw.

The event "2nd ball is red" is actually the event {W1R2, R1R2} = (W1 ∩ R2) ∪ (R1 ∩ R2).

So P(2nd ball is red) = P(W1 ∩ R2) + P(R1 ∩ R2)    (mutually exclusive)
                      = P(R2 | W1)P(W1) + P(R2 | R1)P(R1)
                      = 2/5 × 4/6 + 1/5 × 2/6
                      = 1/3.


    Probability trees

    Probability trees are a graphical way of representing the multiplication rule.

(Probability tree for the two draws. First draw: W1 with P(W1) = 4/6, or R1 with P(R1) = 2/6. Second draw: from W1, P(W2 | W1) = 3/5 and P(R2 | W1) = 2/5; from R1, P(W2 | R1) = 4/5 and P(R2 | R1) = 1/5.)

Write conditional probabilities on the branches, and multiply to get the probability of an intersection: e.g. P(W1 ∩ W2) = 4/6 × 3/5, or P(R1 ∩ W2) = 2/6 × 4/5.

More than two events

To find P(A1 ∩ A2 ∩ A3) we can apply the multiplication rule successively:

P(A1 ∩ A2 ∩ A3) = P(A3 ∩ (A1 ∩ A2))
                = P(A3 | A1 ∩ A2) P(A1 ∩ A2)    (multiplication rule)
                = P(A3 | A1 ∩ A2) P(A2 | A1) P(A1)    (multiplication rule)

Remember as: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1).


On the probability tree, the branch probabilities P(A1), P(A2 | A1), P(A3 | A2 ∩ A1) multiply along a path to give P(A1 ∩ A2 ∩ A3).

In general, for n events A1, A2, . . . , An, we have

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1) P(A2 | A1) P(A3 | A2 ∩ A1) . . . P(An | An-1 ∩ . . . ∩ A1).

Example: A box contains w white balls and r red balls. Draw 3 balls without replacement. What is the probability of getting the sequence white, red, white?

Answer:

P(W1 ∩ R2 ∩ W3) = P(W1) P(R2 | W1) P(W3 | R2 ∩ W1)

                = [w / (w + r)] × [r / (w + r - 1)] × [(w - 1) / (w + r - 2)].
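The chain-rule calculation can be cross-checked by brute-force enumeration of ordered draws. The sketch below (not from the notes; the helper names are mine) does this with exact fractions for w = 4 white and r = 2 red balls.

    # Sketch only: chain rule vs brute-force enumeration for the sequence W, R, W.
    from itertools import permutations
    from fractions import Fraction

    def p_white_red_white(w, r):
        """Chain rule: P(W1) P(R2 | W1) P(W3 | R2, W1)."""
        return (Fraction(w, w + r) *
                Fraction(r, w + r - 1) *
                Fraction(w - 1, w + r - 2))

    def brute_force(w, r):
        balls = ["W"] * w + ["R"] * r
        seqs = list(permutations(range(w + r), 3))       # ordered draws of 3 distinct balls
        hits = sum(1 for s in seqs
                   if [balls[i] for i in s] == ["W", "R", "W"])
        return Fraction(hits, len(seqs))

    print(p_white_red_white(4, 2), brute_force(4, 2))    # both 1/5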


    1.17 Equally likely outcomes and combinatorics: non-examinable

Sometimes, all the outcomes in a discrete finite sample space are equally likely. This makes it easy to calculate probabilities. If:

i) Ω = {s1, . . . , sk};

ii) each outcome si is equally likely, so p1 = p2 = . . . = pk = 1/k;

iii) event A = {s1, s2, . . . , sr} contains r possible outcomes,

then

P(A) = r/k = (# outcomes in A) / (# outcomes in Ω).

Example: For a 3-child family, possible outcomes from oldest to youngest are:

Ω = {GGG, GGB, GBG, GBB, BGG, BGB, BBG, BBB} = {s1, s2, s3, s4, s5, s6, s7, s8}.

Let {p1, p2, . . . , p8} be a probability distribution on Ω. If every baby is equally likely to be a boy or a girl, then all of the 8 outcomes in Ω are equally likely, so

p1 = p2 = . . . = p8 = 1/8.

Let event A be A = "oldest child is a girl".

Then A = {GGG, GGB, GBG, GBB}.

Event A contains 4 of the 8 equally likely outcomes, so event A occurs with probability P(A) = 4/8 = 1/2.

Counting equally likely outcomes

To count the number of equally likely outcomes in an event, we often need to use permutations or combinations. These give the number of ways of choosing r objects from n distinct objects.

For example, if we wish to select 3 objects from n = 5 objects (a, b, c, d, e), we have choices abc, abd, abe, acd, ace, . . . .


    1. Number of Permutations, nPr

The number of permutations, nPr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute different choices.

That is, choice (a, b, c) counts separately from choice (b, a, c).

Then

# permutations = nPr = n(n - 1)(n - 2) . . . (n - r + 1) = n! / (n - r)!.

(n choices for the first object, (n - 1) choices for the second, etc.)

2. Number of Combinations, nCr = (n choose r)

The number of combinations, nCr, is the number of ways of selecting r objects from n distinct objects when different orderings constitute the same choice.

That is, choice (a, b, c) and choice (b, a, c) are the same.

Then

# combinations = nCr = (n choose r) = nPr / r! = n! / [(n - r)! r!].

(Because nPr counts each permutation r! times, and we only want to count it once, divide nPr by r!.)

Use the same rule on the numerator and the denominator

When P(A) = (# outcomes in A) / (# outcomes in Ω), we can often think about the problem either with different orderings constituting different choices, or with different orderings constituting the same choice. The critical thing is to use the same rule for both numerator and denominator.


    Example: (a) Tom has five elderly great-aunts who live together in a tiny bunga-low. They insist on each receiving separate Christmas cards, and threaten todisinherit Tom if he sends two of them the same picture. Tom has Christmas

    cards with 12 different designs. In how many different ways can he select 5different designs from the 12 designs available?

    Order of cards is not important, so use combinations. Number of ways of select-

    ing 5 distinct designs from 12 is

    12C5 =

    12

    5

    =

    12 !

    (12 5)! 5! = 792.

(b) The next year, Tom buys a pack of 40 Christmas cards, featuring 10 different pictures with 4 cards of each picture. He selects 5 cards at random to send to his great-aunts. What is the probability that at least two of the great-aunts receive the same picture?

Looking for P(at least 2 cards the same) = P(A) (say).

Easiest to find P(all 5 cards are different) = P(Ā).

The number of outcomes in Ā is

(# ways of selecting 5 different designs) = 40 × 36 × 32 × 28 × 24.

(40 choices for the first card; 36 for the second, because the 4 cards with the first design are excluded; etc. Note that order matters: e.g. we are counting choice 12345 separately from 23154.)

The total number of outcomes is

(total # ways of selecting 5 cards from 40) = 40 × 39 × 38 × 37 × 36.

(Note: order mattered above, so we need order to matter here too.)

So

P(Ā) = (40 × 36 × 32 × 28 × 24) / (40 × 39 × 38 × 37 × 36) = 0.392.

Thus

P(A) = P(at least 2 cards are the same design) = 1 − P(Ā) = 1 − 0.392 = 0.608.
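These numbers are easy to verify in R; the following illustrative check uses ordered counting on both the numerator and the denominator:

# P(all 5 cards have different designs), counting ordered selections
p_all_diff <- prod(c(40, 36, 32, 28, 24)) / prod(40:36)
p_all_diff        # approximately 0.392

# P(at least two cards share a design)
1 - p_all_diff    # approximately 0.608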


Alternative solution if order does not matter on numerator and denominator (a much harder method):

P(Ā) = (10 choose 5) × 4^5 / (40 choose 5).

This works because there are (10 choose 5) ways of choosing 5 different designs from 10, and there are 4 choices of card within each of the 5 chosen groups. So the total number of ways of choosing 5 cards of different designs is (10 choose 5) × 4^5. The total number of ways of choosing 5 cards from 40 is (40 choose 5).

Exercise: Check that this gives the same answer for P(Ā) as before.
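For completeness, the exercise can be checked in R, confirming that the ordered and unordered counting methods agree (illustrative only):

# Unordered counting: choose 5 designs from 10, then one of the 4 cards per design
choose(10, 5) * 4^5 / choose(40, 5)           # approximately 0.392

# Ordered counting, as in the first solution
prod(c(40, 36, 32, 28, 24)) / prod(40:36)     # same value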

Note: Problems like these belong to the branch of mathematics called Combinatorics: the science of counting.


Chapter 2: Discrete Probability Distributions

    2.1 Introduction

    In the next two chapters we meet several important concepts:

1. Probability distributions, and the probability function fX(x):

the probability function of a random variable lists the values the random variable can take, and their probabilities.

2. Hypothesis testing:

I toss a coin ten times and get nine heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces nine heads out of ten tosses?

3. Likelihood and estimation:

what if we know that our random variable is (say) Binomial(5, p), for some p, but we don't know the value of p? We will see how to estimate the value of p using maximum likelihood estimation.

4. Expectation and variance of a random variable:

the expectation of a random variable is the value it takes on average; the variance of a random variable measures how much the random variable varies about its average.

5. Change of variable procedures:

calculating probabilities and expectations of g(X), where X is a random variable and g(X) is a function, e.g. g(X) = √X or g(X) = X².

6. Modelling:

we have a situation in real life that we know is random. But what does the randomness look like? Is it highly variable, or does it show little variability? Does it sometimes give results much higher than average, but never much lower (a long-tailed distribution)? We will see how different probability distributions are suitable for different circumstances. Choosing a probability distribution to fit a situation is called modelling.


    2.2 The probability function, fX(x)

    The probability function fX(x) lists all possible values of X,

    and gives a probability to each value.

Recall that a random variable, X, assigns a real number to every possible outcome of a random experiment. The random variable is discrete if the set of real values it can take is finite or countable, e.g. {0, 1, 2, . . .}.

Random experiment: which car does he choose? (Ferrari, Porsche, MG, . . .)

Random variable: X. X gives numbers to the possible outcomes:

If he chooses . . . Ferrari, then X = 1; Porsche, then X = 2; MG, then X = 3.

Definition: The probability function, fX(x), for a discrete random variable X, is given by

fX(x) = P(X = x), for all possible outcomes x of X.

    Example: Which car?

Outcome:              Ferrari   Porsche   MG
x                        1         2       3
fX(x) = P(X = x)        1/6       1/6     4/6

We write: P(X = 1) = fX(1) = 1/6: the probability he makes choice 1 (a Ferrari) is 1/6.


We can also write the probability function as:

fX(x) = 1/6 if x = 1,
        1/6 if x = 2,
        4/6 if x = 3,
        0   otherwise.

Example: Toss a fair coin once, and let X = number of heads. Then

X = 0 with probability 0.5,
X = 1 with probability 0.5.

The probability function of X is given by:

x                     0     1
fX(x) = P(X = x)     0.5   0.5

or equivalently

fX(x) = 0.5 if x = 0,
        0.5 if x = 1,
        0   otherwise.

We write (e.g.) fX(0) = 0.5, fX(1) = 0.5, fX(7.5) = 0, etc.

    fX(x) is just a list of probabilities.

    Properties of the probability function

i) 0 ≤ fX(x) ≤ 1 for all x: probabilities are always between 0 and 1.

ii) ∑x fX(x) = 1: probabilities add to 1 overall.

iii) P(X ∈ A) = ∑_{x ∈ A} fX(x);

e.g. in the car example,

P(X ∈ {1, 2}) = P(X = 1 or 2) = P(X = 1) + P(X = 2) = 1/6 + 1/6 = 2/6.

This is the probability of choosing either a Ferrari or a Porsche.
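Since a discrete probability function is just a list of probabilities, it is easy to represent and check in R. A small sketch for the car example (the vector names are arbitrary):

# Probability function for the car example: values 1, 2, 3 and their probabilities
x  <- c(1, 2, 3)
fx <- c(1/6, 1/6, 4/6)

sum(fx)                    # property (ii): the probabilities add to 1
sum(fx[x %in% c(1, 2)])    # property (iii): P(X in {1,2}) = 2/6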


    2.3 Bernoulli trials

    Many of the discrete random variables that we meet

    are based on counting the outcomes of a series oftrials called Bernoulli trials. Jacques Bernoulli wasa Swiss mathematician in the late 1600s. He andhis brother Jean, who were bitter rivals, both stud-ied mathematics secretly against their fathers will.Their father wanted Jacques to be a theologist andJean to be a merchant.

Definition: A random experiment is called a set of Bernoulli trials if it consists of several trials such that:

i) Each trial has only 2 possible outcomes, usually called "Success" and "Failure";

ii) The probability of success, p, remains constant for all trials;

iii) The trials are independent, i.e. the event 'success in trial i' does not depend on the outcome of any other trials.

Examples: 1) Repeated tossing of a fair coin: each toss is a Bernoulli trial with P(success) = P(head) = 1/2.

2) Repeated tossing of a fair die: success = '6', failure = 'not 6'. Each toss is a Bernoulli trial with P(success) = 1/6.

    Definition: The random variable Y is called a Bernoulli random variable if it

    takes only 2 values, 0 and 1.

The probability function is

fY(y) = p        if y = 1,
        1 − p    if y = 0.

That is,

P(Y = 1) = P(success) = p,
P(Y = 0) = P(failure) = 1 − p.
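Bernoulli trials can be simulated in R as Binomial trials of size 1. A small sketch (the success probability 0.3 is an arbitrary choice for illustration):

set.seed(1)
p <- 0.3                              # illustrative success probability
y <- rbinom(20, size = 1, prob = p)   # 20 independent Bernoulli(p) trials
y                                     # a sequence of 0s (failures) and 1s (successes)
mean(y)                               # proportion of successes, roughly p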


    2.4 Example of the probability function: the Binomial Distribution

The Binomial distribution counts the number of successes in a fixed number of Bernoulli trials.

Definition: Let X be the number of successes in n independent Bernoulli trials, each with probability of success = p. Then X has the Binomial distribution with parameters n and p. We write X ~ Bin(n, p), or X ~ Binomial(n, p).

Thus X ~ Bin(n, p) if X is the number of successes out of n independent trials, each of which has probability p of success.

    Probability function

If X ~ Binomial(n, p), then the probability function for X is

fX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x)   for x = 0, 1, . . ., n.

    Explanation:

For X = x, we need an outcome with x successes and (n − x) failures. A single outcome with x successes and (n − x) failures has probability

p^x × (1 − p)^(n−x),

where the factor p^x comes from succeeding x times, each with probability p, and the factor (1 − p)^(n−x) comes from failing (n − x) times, each with probability (1 − p).


There are (n choose x) possible outcomes with x successes and (n − x) failures, because we must select x trials to be our successes, out of n trials in total.

Thus,

P(#successes = x) = (#outcomes with x successes) × (prob. of each such outcome)
                  = (n choose x) p^x (1 − p)^(n−x).

    Notes:

1. fX(x) = 0 if x ∉ {0, 1, 2, . . ., n}.

2. Check that ∑_{x=0}^{n} fX(x) = 1:

∑_{x=0}^{n} fX(x) = ∑_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x)
                  = [p + (1 − p)]^n    (Binomial Theorem)
                  = 1^n
                  = 1.

It is this connection with the Binomial Theorem that gives the Binomial Distribution its name.


Example 1: Let X ~ Binomial(n = 4, p = 0.2). Write down the probability function of X.

x                     0        1        2        3        4
fX(x) = P(X = x)   0.4096   0.4096   0.1536   0.0256   0.0016
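These probabilities can be reproduced with R's dbinom function, which evaluates the Binomial probability function directly:

dbinom(0:4, size = 4, prob = 0.2)        # 0.4096 0.4096 0.1536 0.0256 0.0016
sum(dbinom(0:4, size = 4, prob = 0.2))   # 1, as required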

Example 2: Let X be the number of times I get a '6' out of 10 rolls of a fair die.

1. What is the distribution of X?
2. What is the probability that X ≥ 2?

1. X ~ Binomial(n = 10, p = 1/6).

2. P(X ≥ 2) = 1 − P(X < 2)
            = 1 − P(X = 0) − P(X = 1)
            = 1 − (10 choose 0)(1/6)^0(1 − 1/6)^(10−0) − (10 choose 1)(1/6)^1(1 − 1/6)^(10−1)
            = 0.515.
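The same answer can be obtained in R, either by summing the probability function with dbinom or by using the cumulative probability function pbinom (which appears again in Section 2.7):

1 - sum(dbinom(0:1, size = 10, prob = 1/6))   # 0.515
1 - pbinom(1, size = 10, prob = 1/6)          # same: P(X >= 2) = 1 - P(X <= 1)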

Example 3: Let X be the number of girls in a three-child family. What is the distribution of X?

Assume:

(i) each child is equally likely to be a boy or a girl;

(ii) all children are independent of each other.

Then X ~ Binomial(n = 3, p = 0.5).


    Shape of the Binomial distribution

The shape of the Binomial distribution depends upon the values of n and p. For small n, the distribution is almost symmetrical for values of p close to 0.5, but highly skewed for values of p close to 0 or 1. As n increases, the distribution becomes more and more symmetrical, and there is noticeable skew only if p is very close to 0 or 1.

    The probability functions for various values of n and p are shown below.

[Figure: probability functions of the Binomial distribution for n = 10, p = 0.5; n = 10, p = 0.9; and n = 100, p = 0.9.]

    Sum of independent Binomial random variables:

If X and Y are independent, and X ~ Binomial(n, p), Y ~ Binomial(m, p), then

X + Y ~ Bin(n + m, p).

This is because X counts the number of successes out of n trials, and Y counts the number of successes out of m trials: so overall, X + Y counts the total number of successes out of n + m trials.

Note: X and Y must both share the same value of p.
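A quick simulation illustrates this result; the parameter values below are arbitrary choices for the sketch:

set.seed(2)
n <- 10; m <- 15; p <- 0.3                 # illustrative parameter values
x <- rbinom(1e5, size = n, prob = p)
y <- rbinom(1e5, size = m, prob = p)

mean(x + y)                                # close to (n + m) * p = 7.5
mean((x + y) == 7)                         # empirical P(X + Y = 7)
dbinom(7, size = n + m, prob = p)          # Binomial(n + m, p) value, approximately equal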


    2.5 The cumulative distribution function, FX(x)

    We have defined the probability function, fX(x), as fX(x) = P(X = x).

    The probability function tells us everything there is to know about X.

The cumulative distribution function, or just distribution function, written as FX(x), is an alternative function that also tells us everything there is to know about X.

Definition: The (cumulative) distribution function (c.d.f.) is

FX(x) = P(X ≤ x) for −∞ < x < ∞.

If you are asked to give the distribution of X, you could answer by giving either the distribution function, FX(x), or the probability function, fX(x). Each of these functions encapsulates all possible information about X.

    The distribution function FX(x) as a probability sweeper

The cumulative distribution function, FX(x), sweeps up all the probability up to and including the point x.

[Figure: the probability swept up to each point x, illustrated for X ~ Bin(10, 0.5) and X ~ Bin(10, 0.9).]


Example: Let X ~ Binomial(2, 1/2).

x                     0     1     2
fX(x) = P(X = x)     1/4   1/2   1/4

Then

FX(x) = P(X ≤ x) = 0                       if x < 0,
                   0.25                    if 0 ≤ x < 1,
                   0.25 + 0.5 = 0.75       if 1 ≤ x < 2,
                   0.25 + 0.5 + 0.25 = 1   if x ≥ 2.

[Figure: the probability function f(x) and the step-function c.d.f. F(x) for X ~ Binomial(2, 1/2).]

FX(x) gives the cumulative probability up to and including point x. So

FX(x) = ∑_{y ≤ x} fX(y).

Note that FX(x) is a step function: it jumps by amount fX(y) at every point y with positive probability.
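In R, the c.d.f. of this example can be obtained either by accumulating the probability function or directly with pbinom (an illustrative check):

x  <- 0:2
fx <- dbinom(x, size = 2, prob = 0.5)    # 0.25 0.50 0.25
cumsum(fx)                               # 0.25 0.75 1.00: F(0), F(1), F(2)
pbinom(x, size = 2, prob = 0.5)          # same values, computed directly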


    Reading off probabilities from the distribution function

As well as using the probability function to find the distribution function, we can also use the distribution function to find probabilities.

fX(x) = P(X = x) = P(X ≤ x) − P(X ≤ x − 1)   (if X takes integer values)
                 = FX(x) − FX(x − 1).

This is why the distribution function FX(x) contains as much information as the probability function, fX(x): we can use either one to find the other.

In general:

P(a < X ≤ b) = FX(b) − FX(a)   if b > a.

Proof: P(X ≤ b) = P(X ≤ a) + P(a < X ≤ b), since for a < b the event {X ≤ b} is the disjoint union of the events {X ≤ a} and {a < X ≤ b}.

So

FX(b) = FX(a) + P(a < X ≤ b), and therefore FX(b) − FX(a) = P(a < X ≤ b).
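As a numerical illustration of this identity, take X ~ Binomial(10, 0.5) (an arbitrary choice) and compute P(3 < X ≤ 6) both ways in R:

sum(dbinom(4:6, size = 10, prob = 0.5))                               # f(4) + f(5) + f(6)
pbinom(6, size = 10, prob = 0.5) - pbinom(3, size = 10, prob = 0.5)   # F(6) - F(3), same value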


Warning: endpoints

Be careful of endpoints and the difference between < and ≤. For example:

P(X > 42)? This is 1 − P(X ≤ 42) = 1 − FX(42).

P(50 ≤ X ≤ 60)? This is P(X ≤ 60) − P(X ≤ 49) = FX(60) − FX(49).

    Properties of the distribution function

1) F(−∞) = P(X ≤ −∞) = 0.
   F(+∞) = P(X ≤ +∞) = 1.
   (These are true because X takes values strictly between −∞ and +∞.)

2) FX(x) is a non-decreasing function of x: that is,
   if x1 < x2, then FX(x1) ≤ FX(x2).

3) P(a < X ≤ b) = FX(b) − FX(a) if b > a.

4) F is right-continuous: that is, lim_{h→0⁺} F(x + h) = F(x).


    2.6 Hypothesis testing

You have probably come across the idea of hypothesis tests, p-values, and significance in other courses. Common hypothesis tests include t-tests and chi-squared tests. However, hypothesis tests can be conducted in much simpler circumstances than these. The concept of the hypothesis test is at its easiest to understand with the Binomial distribution in the following example. All other hypothesis tests throughout statistics are based on the same idea.

    Example: Weird Coin?


    I toss a coin 10 times and get 9 heads. How weird is that?

    What is weird?

• Getting 9 heads out of 10 tosses: we'll call this weird.
• Getting 10 heads out of 10 tosses: even more weird!
• Getting 8 heads out of 10 tosses: less weird.
• Getting 1 head out of 10 tosses: same as getting 9 tails out of 10 tosses: just as weird as 9 heads if the coin is fair.
• Getting 0 heads out of 10 tosses: same as getting 10 tails: more weird than 9 heads if the coin is fair.

    Set of weird outcomes

If our coin is fair, the outcomes that are as weird or weirder than 9 heads are:

9 heads, 10 heads, 1 head, 0 heads.

So how weird is 9 heads or worse, if the coin is fair?

Define X = # heads out of 10 tosses.

Distribution of X, if the coin is fair: X ~ Binomial(n = 10, p = 0.5).


    Probability of observing something at least as weird as 9 heads,

    if the coin is fair:

We can add the probabilities of all the outcomes that are at least as weird as 9 heads out of 10 tosses, assuming that the coin is fair:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0), where X ~ Binomial(10, 0.5).

[Figure: probabilities P(X = x), x = 0, . . ., 10, for X ~ Binomial(n = 10, p = 0.5).]

For X ~ Binomial(10, 0.5), we have:

P(X = 9) + P(X = 10) + P(X = 1) + P(X = 0)

= (10 choose 9)(0.5)^9(0.5)^1 + (10 choose 10)(0.5)^10(0.5)^0 + (10 choose 1)(0.5)^1(0.5)^9 + (10 choose 0)(0.5)^0(0.5)^10

= 0.00977 + 0.00098 + 0.00977 + 0.00098

= 0.021.
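This tail probability can be checked in one line of R; by the symmetry of Binomial(10, 0.5), it is twice the lower-tail probability P(X ≤ 1):

sum(dbinom(c(0, 1, 9, 10), size = 10, prob = 0.5))   # 0.021
2 * pbinom(1, size = 10, prob = 0.5)                 # same value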

    Is this weird?

Yes, it is quite weird. If we had a fair coin and tossed it 10 times, we would only expect to see something as extreme as 9 heads on about 2.1% of occasions.


    Is the coin fair?

Obviously, we can't say. It might be: after all, on 2.1% of occasions that you toss a fair coin 10 times, you do get something as weird as 9 heads or more.

However, 2.1% is a small probability, so it is still very unusual for a fair coin to produce something as weird as what we've seen. If the coin really was fair, it would be very unusual to get 9 heads or more.

We can deduce that, EITHER we have observed a very unusual event with a fair coin, OR the coin is not fair.

In fact, this gives us some evidence that the coin is not fair.

The value 2.1% measures the strength of our evidence. The smaller this probability, the more evidence we have.

    Formal hypothesis test

    We now formalize the procedure above. Think of the steps:

• We have a question that we want to answer: Is the coin fair?

• There are two alternatives:
  1. The coin is fair.
  2. The coin is not fair.

• Our observed information is X, the number of heads out of 10 tosses. We write down the distribution of X if the coin is fair: X ~ Binomial(10, 0.5).

• We calculate the probability of observing something AT LEAST AS EXTREME as our observation, X = 9, if the coin is fair: prob = 0.021.

• The probability is small (2.1%). We conclude that this is unlikely with a fair coin, so we have observed some evidence that the coin is NOT fair.


    Null hypothesis and alternative hypothesis

    We express the steps above as two competing hypotheses.

    Null hypothesis: the first alternative, that the coin IS fair.

    We expect to believe the null hypothesis unless we see convincing evidence that

    it is wrong.

    Alternative hypothesis: the second alternative, that the coin is NOT fair.

    In hypothesis testing, we often use this same formulation.

• The null hypothesis is specific. It specifies an exact distribution for our observation: X ~ Binomial(10, 0.5).

• The alternative hypothesis is general. It simply states that the null hypothesis is wrong. It does not say what the right answer is.

We use H0 and H1 to denote the null and alternative hypotheses respectively.

The null hypothesis is H0: the coin is fair.
The alternative hypothesis is H1: the coin is NOT fair.

More precisely, we write:

Number of heads, X ~ Binomial(10, p), and

H0: p = 0.5
H1: p ≠ 0.5.

Think of 'null hypothesis' as meaning 'the default': the hypothesis we will accept unless we have a good reason not to.


    Interpreting the hypothesis test

There are different schools of thought about how a p-value should be interpreted.

• Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against H0.

• Some people go further and use an accept/reject framework. Under this framework, the null hypothesis H0 should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05.

• In this course we use the strength-of-evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by H0. We do not talk about accepting or rejecting H0. This decision should usually be taken in the context of other scientific information.

However, as a rule of thumb, consider that p-values of 0.05 and less start to suggest that the null hypothesis is doubtful.

    Statistical significance

You have probably encountered the idea of statistical significance in other courses.

Statistical significance refers to the p-value.

The result of a hypothesis test is significant at the 5% level if the p-value is less than 0.05.

This means that the chance of seeing what we did see (9 heads), or more, is less than 5% if the null hypothesis is true.

Saying the test is significant is a quick way of saying that there is evidence against the null hypothesis, usually at the 5% level.


In the coin example, we can say that our test of H0: p = 0.5 against H1: p ≠ 0.5 is significant at the 5% level, because the p-value is 0.021, which is < 0.05.

This means: we have some evidence that p ≠ 0.5.

It does not mean:

• the difference between p and 0.5 is large, or
• the difference between p and 0.5 is important in practical terms.

Statistically significant means that we have evidence that there IS a difference. It says NOTHING about the SIZE, or the IMPORTANCE, of the difference.

"Substantial evidence of a difference", not "evidence of a substantial difference".

    Beware!

The p-value gives the probability of seeing something as weird as what we did see, if H0 is true.

This means that 5% of the time, we will get a p-value < 0.05 WHEN H0 IS TRUE!

Similarly, about once in every thousand tests, we will get a p-value < 0.001 when H0 is true!

A small p-value does NOT mean that H0 is definitely wrong.

    One-sided and two-sided tests

The test above is a two-sided test. This means that we considered it just as weird to get 9 tails as 9 heads.

If we had a good reason, before tossing the coin, to believe that the binomial probability could only be = 0.5 or > 0.5, i.e. that it would be impossible to have p < 0.5, then we could conduct a one-sided test: H0: p = 0.5 versus H1: p > 0.5.

    This would have the effect of halving the resultant p-value.


    2.7 Example: Presidents and deep-sea divers

Men in the class: would you like to have daughters? Then become a deep-sea diver, a fighter pilot, or a heavy smoker.

Would you prefer sons? Easy! Just become a US president.

Numbers suggest that men in different professions tend to have more sons than daughters, or the reverse. Presidents have sons, fighter pilots have daughters. But is it real, or just chance? We can use hypothesis tests to decide.

    The facts

• The 44 US presidents from George Washington to Barack Obama have had a total of 153 children, comprising 88 sons and only 65 daughters: a sex ratio of 1.4 sons for every daughter.

• Two studies of deep-sea divers revealed that the men had a total of 190 children, comprising 65 sons and 125 daughters: a sex ratio of 1.9 daughters for every son.

    Could this happen by chance?

Is it possible that the men in each group really had a 50-50 chance of producing sons and daughters?

This is the same as the question in Section 2.6.

• For the presidents: If I tossed a coin 153 times and got only 65 heads, could I continue to believe that the coin was fair?

• For the divers: If I tossed a coin 190 times and got only 65 heads, could I continue to believe that the coin was fair?


    Calculating the p-value

    The p-value for the president problem is given by

P(X ≤ 65) + P(X ≥ 88), where X ~ Binomial(153, 0.5).

    In principle, we could calculate this as

P(X = 0) + P(X = 1) + . . . + P(X = 65) + P(X = 88) + . . . + P(X = 153)

= (153 choose 0)(0.5)^0(0.5)^153 + (153 choose 1)(0.5)^1(0.5)^152 + . . .

This would take a lot of calculator time! Instead, we use a computer with a package such as R.

    R command for the p-value

The R command for calculating the lower-tail p-value for the Binomial(n = 153, p = 0.5) distribution is

    pbinom(65, 153, 0.5).

    Typing this in R gives:

    > pbinom(65, 153, 0.5)

    [1] 0.03748079

This gives us the lower-tail p-value only: P(X ≤ 65) = 0.0375.

[Figure: probability function of X ~ Binomial(153, 0.5).]

    To get the overall p-value, we have two choices:

1. Multiply the lower-tail p-value by 2:

2 × 0.0375 = 0.0750.

    In R:

    > 2 * pbinom(65, 153, 0.5)

    [1] 0.07496158


This works because the upper-tail p-value, by definition, is always going to be the same as the lower-tail p-value. The upper tail gives us the probability of finding something equally surprising at the opposite end of the distribution.

2. Calculate the upper-tail p-value explicitly (only works for H0 : p = 0.5):

The upper-tail p-value is

P(X ≥ 88) = 1 − P(X < 88) = 1 − P(X ≤ 87) = 1 − pbinom(87, 153, 0.5).

In R:

> 1-pbinom(87, 153, 0.5)
[1] 0.03748079

The overall p-value is the sum of the lower-tail and the upper-tail p-values:

pbinom(65, 153, 0.5) + 1 - pbinom(87, 153, 0.5) = 0.0375 + 0.0375 = 0.0750. (Same as before.)

Note: The R command pbinom is equivalent to the cumulative distribution function for the Binomial distribution:

pbinom(65, 153, 0.5) = P(X ≤ 65) where X ~ Binomial(153, 0.5)
                     = FX(65) for X ~ Binomial(153, 0.5).

The overall p-value in this example is 2 × FX(65).

Note: In the R command pbinom(65, 153, 0.5), the order in which you enter the numbers 65, 153, and 0.5 matters: entering them in a different order will generally give a wrong answer or an error. An alternative is to use the longhand command pbinom(q=65, size=153, prob=0.5), in which case you can enter the named arguments in any order.


    Summary: are presidents more likely to have sons?

Back to our hypothesis test. Recall that X was the number of daughters out of 153 presidential children, and X ~ Binomial(153, p), where p is the probability that each child is a daughter.

Null hypothesis: H0: p = 0.5.
Alternative hypothesis: H1: p ≠ 0.5.
p-value: 2 × FX(65) = 0.075.

What does this mean?

The p-value of 0.075 means that, if the presidents really were as likely to have daughters as sons, there would be only a 7.5% chance of observing something as unusual as only 65 daughters out of the total 153 children.

This is slightly unusual, but not very unusual.

We conclude that there is no real evidence that presidents are more likely to have sons than daughters. The observations are compatible with the possibility that there is no difference.

Does this mean presidents are equally likely to have sons and daughters? No: the observations are also compatible with the possibility that there is a difference. We just don't have enough evidence either way.

    Hypothesis test for the deep-sea divers

    For the deep-sea divers, there were 190 children: 65 sons, and 125 daughters.

    Let X be the number of sons out of 190 diver children.

Then X ~ Binomial(190, p), where p is the probability that each child is a son.

Note: We could just as easily formulate our hypotheses in terms of daughters instead of sons. Because pbinom is defined as a lower-tail probability, however, it is usually easiest to formulate them in terms of the low result (sons).


    Null hypothesis: H0 : p = 0.5.

Alternative hypothesis: H1: p ≠ 0.5.

p-value: Probability of getting a result AT LEAST AS EXTREME as X = 65 sons, if H0 is true and p really is 0.5.

Results at least as extreme as X = 65 are:

X = 0, 1, 2, . . ., 65, for even fewer sons;
X = (190 − 65), . . ., 190, for the equally surprising result in the opposite direction (too many sons).

[Figure: probabilities P(X = x) for X ~ Binomial(n = 190, p = 0.5).]

    R command for the p-value

p-value = 2 × pbinom(65, 190, 0.5). Typing this in R gives:

    > 2*pbinom(65, 190, 0.5)

    [1] 1.603136e-05

    This is 0.000016, or a little more than one chance in 100 thousand.


We conclude that it is extremely unlikely that this observation could have occurred by chance, if the deep-sea divers had equal probabilities of having sons and daughters.

We have very strong evidence that deep-sea divers are more likely to have daughters than sons. The data are not really compatible with H0.

    What next?

p-values are often badly used in science and business. They are regularly treated as the end point of an analysis, after which no more work is needed. Many scientific journals insist that scientists quote a p-value with every set of results, and often only p-values less than 0.05 are regarded as interesting. The outcome is that some scientists do every analysis they can think of until they finally come up with a p-value of 0.05 or less.

    A good statistician will recommend a different attitude. It is very rare in science

    for numbers and statistics to tell us the full story.

Results like the p-value should be regarded as an investigative starting point, rather than the final conclusion. Why is the p-value small? What possible mechanism could there be for producing this result?

    If you were a medical statistician and you gave me a p-value, I

    would ask you for a mechanism.

Don't accept that Drug A is better than Drug B only because the p-value says so: find a biochemist who can explain what Drug A does that Drug B doesn't. Don't accept that sun exposure is a cause of skin cancer on the basis of a p-value alone: find a mechanism by which skin is damaged by the sun.

    Why might divers have daughters and presidents have sons?

Deep-sea divers are thought to have more daughters than sons because the underwater work at high atmospheric pressure lowers the level of the hormone testosterone in the men's blood, which is thought to make them more likely to conceive daughters. For the presidents, your guess is as good as mine . . .


    2.8 Example: Birthdays and sports professionals

    Have you ever wondered what makes a professional

    sports player? Talent? Dedication? Good coaching?

Or is it just that they happen to have the right birthday . . . ?