Department of Zoology at UBC, mfscott/lectures/08_Fitting.pdf

Assignment #3

Chapter 5: 28, 36, 37
Chapter 6: 16, 18, 19
Due tomorrow, Oct. 9th, by 2pm in your TA’s homework box

Assignment #4

Chapter 7: 21, 22, 28
Due next Friday, Oct. 16th, by 2pm in your TA’s homework box

Reading

For Today: Chapter 8
For Tuesday: Chapter 9

Labs

Lab 5 is no longer required
Tuesday, Thursday and Friday labs: No labs the week of Oct. 19th (Midterm week)
Monday labs: No lab on Oct. 12th (Thanksgiving)
Wednesday labs: No lab on Nov. 11th (Remembrance Day)

Second part of Chapter 7 Review

Binomial distribution

Probability of obtaining X left-handed flowers out of n = 27 randomly sampled, if the proportion of left-handed flowers in the population is 0.25

Binomial test

The binomial test uses data to test whether a population proportion p matches a null expectation for the proportion.

H0: The relative frequency of successes in the population is p0 .

HA: The relative frequency of successes in the population is not p0 .

Binomial test

The null distribution can be calculated using the binomial formula for each possible value of X, where n = your sample size, p = p0 and X = number of successes.

$$\Pr[X] = \binom{n}{X} p^{X} (1 - p)^{n - X}$$

P = the probability of the number of successes in your sample (X) plus the probabilities of all equally or more extreme values of X. If P < 0.05, we reject H0.
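As a rough illustration of the binomial formula, the sketch below tabulates the null distribution for the flower example above (n = 27, p0 = 0.25). The function name `binomial_pmf` is a hypothetical helper chosen for this example, not something from the lecture.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr[X = x] for n independent trials, each with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p0 = 27, 0.25  # values from the left-handed flower example
null_distribution = [binomial_pmf(x, n, p0) for x in range(n + 1)]

# Print the first few probabilities of the null distribution.
for x, pr in enumerate(null_distribution[:11]):
    print(f"Pr[{x:2d}] = {pr:.4f}")
```

The probabilities over all possible values of X (0 through 27) add up to 1, which is a quick sanity check on any null distribution built this way.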

Estimating Proportions: Proportion of successes in a sample

$$\hat{p} = \frac{X}{n}$$

The hat (^) shows that this is an estimate of p.

p is the true population proportion.

Standard error of the estimate of a proportion is the standard deviation of the sampling distribution

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$$

We usually don’t know p, so we estimate the standard error with

$$SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
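A minimal sketch of the estimate and its standard error follows. The numbers (16 successes in 20 trials) are borrowed from the red/blue shirt example later in this lecture, and `proportion_and_se` is a hypothetical helper name.

```python
from math import sqrt

def proportion_and_se(successes, n):
    """Return the sample proportion p-hat and its estimated standard error."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

p_hat, se = proportion_and_se(16, 20)
print(f"p-hat = {p_hat:.2f}, SE = {se:.4f}")
```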

95% confidence interval for a proportion

$$p' = \frac{X + 2}{n + 4}$$

$$p' - 1.96\sqrt{\frac{p'(1 - p')}{n + 4}} \;\le\; p \;\le\; p' + 1.96\sqrt{\frac{p'(1 - p')}{n + 4}}$$

This is the Agresti-Coull confidence interval.
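The interval can be sketched in a few lines. The function name `agresti_coull_ci` is hypothetical, and the input (16 successes out of 20) is taken from the red/blue shirt example used later in this lecture.

```python
from math import sqrt

def agresti_coull_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Agresti-Coull)."""
    p_prime = (successes + 2) / (n + 4)
    half_width = z * sqrt(p_prime * (1 - p_prime) / (n + 4))
    return p_prime - half_width, p_prime + half_width

low, high = agresti_coull_ci(16, 20)
print(f"{low:.3f} <= p <= {high:.3f}")
```

Note that the interval is centred on p' = 18/24 = 0.75, not on the raw sample proportion 16/20 = 0.8.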

Fitting probability models to frequency data

Probability Model

A probability distribution that represents how we think a natural process works

Example of a probability model: The proportional model

Simple probability model in which the frequency of occurrence of events is proportional to the number of opportunities.

Example of expectations from the proportional model

Day of the week of 350 births Observed Frequencies

Day of the week of 350 births: Expected Frequencies

Day     Number of days in 1999   Proportion of days in 1999   Expected frequency of births
Sun     52                       52/365                       49.863
Mon     52                       52/365                       49.863
Tues    52                       52/365                       49.863
Wed     52                       52/365                       49.863
Thurs   52                       52/365                       49.863
Fri     53                       53/365                       50.822
Sat     52                       52/365                       49.863
Sum     365                      1                            350
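The expected frequencies under the proportional model are just the total number of births multiplied by each day's share of the year. A small sketch, with variable names chosen for this example:

```python
# Number of times each weekday occurred in 1999 (from the table above).
days_in_1999 = {"Sun": 52, "Mon": 52, "Tues": 52, "Wed": 52,
                "Thurs": 52, "Fri": 53, "Sat": 52}
total_births = 350
total_days = sum(days_in_1999.values())

# Expected births = total births * proportion of days of that type.
expected = {day: total_births * n_days / total_days
            for day, n_days in days_in_1999.items()}

for day, value in expected.items():
    print(f"{day:6s} {value:.3f}")
```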

Goodness-of-fit tests

Compare an observed frequency distribution with the frequency distribution expected under a simple probability model

Binomial test: Limited to categorical variables with only two possible outcomes
χ2 test: Can handle categorical and discrete numerical variables having more than two outcomes

χ2 Goodness-of-fit test

Uses a test statistic called χ2 to measure the discrepancy between an observed discrete frequency distribution and the frequencies expected under a simple probability model serving as the null hypothesis.

Hypotheses for χ2 test

H0: The data come from a particular discrete probability distribution.
HA: The data do not come from that distribution.

Test statistic for χ2 test

$$\chi^2 = \sum_{\text{all classes}} \frac{(Observed_i - Expected_i)^2}{Expected_i}$$

The month of birth for 1245 NHL players

Month       Number of players
January     133
February    125
March       114
April       119
May         119
June        123
July        96
August      91
September   83
October     84
November    73
December    85

Data from http://www.nhl.com/players/search/all.html in 2006

Hypotheses for birth month example

H0: The probability of an NHL player being born in any given month is equal to the national proportion for that month.
HA: The probability of an NHL player being born in any given month is not equal to the national proportion for that month.

Month       Number of players   Expected (%)
January     133                 7.94
February    125                 7.63
March       114                 8.72
April       119                 8.63
May         119                 8.95
June        123                 8.57
July        96                  8.76
August      91                  8.5
September   83                  8.54
October     84                  8.19
November    73                  7.70
December    85                  7.86
Total       1245                100

NHL compared to all Canadians

Computing Expected values

Month       Number of players   Expected (%)   Expected (of 1245)
January     133                 7.94           99
February    125                 7.63           95
March       114                 8.72           109
April       119                 8.63           107
May         119                 8.95           111
June        123                 8.57           107
July        96                  8.76           109
August      91                  8.5            106
September   83                  8.54           106
October     84                  8.19           102
November    73                  7.70           96
December    85                  7.86           98
Total       1245                100%           1245

Note: For simplicity, we have rounded the expected column to integers. In any real calculation, we would keep a couple of decimal places.
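The expected column can be reproduced by multiplying each national percentage by the sample size. The sketch below keeps the unrounded values and also shows the integer rounding used on the slide; the variable names are chosen for this example.

```python
# National birth percentages by month, January through December (from the table).
expected_percent = [7.94, 7.63, 8.72, 8.63, 8.95, 8.57,
                    8.76, 8.5, 8.54, 8.19, 7.70, 7.86]
n_players = 1245

# Expected count = percentage of births in that month * number of players.
expected_counts = [pct / 100 * n_players for pct in expected_percent]
rounded_counts = [round(value) for value in expected_counts]

for pct, exact, rounded in zip(expected_percent, expected_counts, rounded_counts):
    print(f"{pct:5.2f}%  {exact:7.2f}  ->  {rounded}")
```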

The calculation for January

$$\frac{(Observed - Expected)^2}{Expected} = \frac{(133 - 99)^2}{99} = \frac{1156}{99}$$

Calculating χ2

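$$\chi^2 = \frac{1156}{99} + \frac{900}{95} + \frac{25}{109} + \frac{144}{107} + \frac{64}{111} + \frac{256}{107} + \frac{169}{109} + \frac{225}{106} + \frac{529}{106} + \frac{324}{102} + \frac{529}{96} + \frac{169}{98} = 44.77$$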
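The same sum can be checked with a short script. This sketch uses the integer-rounded expected counts from the slide, so it reproduces the slide's value; `chi_square_statistic` is a hypothetical helper name.

```python
def chi_square_statistic(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# NHL players born in each month, January through December.
observed = [133, 125, 114, 119, 119, 123, 96, 91, 83, 84, 73, 85]
# Expected counts, rounded to integers as on the slide.
expected = [99, 95, 109, 107, 111, 107, 109, 106, 106, 102, 96, 98]

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 0 - 1  # 12 categories, no parameters estimated from the data

print(f"chi-square = {chi2:.2f}, df = {df}")
```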

The sampling distribution of χ2 (null distribution) by simulation

Sampling distribution of χ2 (null distribution) by the χ2 distribution

Degrees of freedom

The number of degrees of freedom of a test specifies which of a family of distributions to use.

Degrees of freedom for χ2 test

df = (Number of categories) – (Number of parameters estimated from the data) – 1

Degrees of freedom for NHL month of birth

df = 12 - 0 - 1 = 11

Finding the P-value

Critical value

The value of the test statistic where P = α.

Table A - χ2 distribution

The 5% critical value

The observed χ2 = 44.77 is far above the 5% critical value for df = 11 (19.68), so P < 0.05 and we can reject the null hypothesis. NHL players are not born in the same proportions per month as the population at large.
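The lecture reads the P-value range from a statistical table. As an alternative sketch, the upper-tail probability of the χ2 distribution can be computed directly for whole-number degrees of freedom, using the standard recurrence for the regularized upper incomplete gamma function. The helper `chi2_sf` is hypothetical and written only for this illustration; in practice a statistics package would normally be used.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability Pr[chi-square with df degrees of freedom >= x].

    Works for positive integer df only.
    """
    h = x / 2
    if df % 2 == 0:
        a, q = 1.0, exp(-h)          # Q(1, h) = exp(-h)
    else:
        a, q = 0.5, erfc(sqrt(h))    # Q(1/2, h) = erfc(sqrt(h))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2:
        q += h**a * exp(-h) / gamma(a + 1)
        a += 1
    return q

p_value = chi2_sf(44.77, 11)
print(f"P = {p_value:.2e}")
print(f"Check against the table: Pr[chi2_11 >= 19.68] = {chi2_sf(19.68, 11):.3f}")
```

The check line should print a value close to 0.05, matching the 5% critical value from the table.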

χ2 test as approximation of binomial test

•  χ2 goodness-of-fit test works even when there are only two categories, so it can be used as a substitute for the binomial test.

•  Very useful if the number of data points is large.

–  Imagine if, in our red/blue wrestler example, rather than 16/20 wins by red, we had 1600/2000 wins by red. Imagine calculating:

$$\Pr[1600] = \frac{2000!}{1600!\,400!}\,0.5^{1600}\,0.5^{400}$$

–  And then imagine calculating:

$$P = 2\,(\Pr[1600] + \Pr[1601] + \dots + \Pr[2000])$$
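The calculation is painful by hand but manageable in software with exact integer arithmetic. The sketch below assumes p0 = 0.5 and uses the doubling rule shown above; `two_tailed_binomial_p` is a hypothetical helper name.

```python
from math import comb

def two_tailed_binomial_p(successes, n):
    """P = 2 * (Pr[successes] + ... + Pr[n]) under p0 = 0.5.

    Assumes the observed count lies in the upper tail (successes > n / 2).
    """
    # With p0 = 0.5 every outcome has probability C(n, x) / 2**n,
    # so the tail can be summed exactly with integers.
    upper_tail = sum(comb(n, x) for x in range(successes, n + 1))
    return min(1.0, 2 * upper_tail / 2**n)

p_small = two_tailed_binomial_p(16, 20)
p_large = two_tailed_binomial_p(1600, 2000)

print(f"16/20:     P = {p_small:.4f}")
print(f"1600/2000: P = {p_large:.3e}")
```

Python integers have unlimited precision, so the huge factorials never overflow; only the final division is converted to a floating-point number.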

The experiment and the results

•  Animals use red as a sign of aggression

•  Does red influence the outcome of wrestling, taekwondo, and boxing?

–  16 of 20 rounds had more red-shirted than blue-shirted winners in these sports in the 2004 Olympics

–  Shirt color was randomly assigned

Hill, RA, and RA Barton. 2005. Red enhances human performance in contests. Nature 435:293.

Stating the hypotheses

H0: Red- and blue-shirted athletes are equally likely to win (proportion = 0.5).

HA: Red- and blue-shirted athletes are not equally likely to win (proportion ≠ 0.5).

χ2 test as approximation of binomial test

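Shirt color of winners   Observed   Expected
Red (success)            16         10
Blue (failure)           4          10
Sum                      20         20

$$\chi^2 = \sum_{\text{all classes}} \frac{(Observed_i - Expected_i)^2}{Expected_i} = \frac{36}{10} + \frac{36}{10} = 7.2$$

$$\chi^2_{0.05,1} = 3.84, \qquad \chi^2_{0.01,1} = 6.63, \qquad \chi^2_{0.005,1} = 7.88$$

$$3.84 < 6.63 < 7.2 < 7.88, \quad \text{so } 0.01 > P > 0.005$$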
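For one degree of freedom the upper-tail probability of the χ2 distribution has a simple closed form, erfc(sqrt(χ2 / 2)), so the bracketing from the table can be checked directly. This is a sketch for the df = 1 case only, with variable names chosen for this example.

```python
from math import erfc, sqrt

observed = {"red": 16, "blue": 4}
expected = {"red": 10, "blue": 10}

chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

# Upper-tail probability of a chi-square distribution with 1 degree of freedom.
p_value = erfc(sqrt(chi2 / 2))

print(f"chi-square = {chi2:.1f}, approximate P = {p_value:.4f}")
```

With only 20 rounds, this approximate P-value can differ noticeably from the exact binomial P-value, which is one reason the approximation is most useful when the sample is large.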

Fitting the binomial distribution is different from the binomial test

Binomial test: uses data to test whether the proportion of successes in one set of trials matches a null expectation of the proportion of successes

Fitting the binomial distribution: uses data to test whether the observed distribution of the proportion of successes in multiple sets of trials matches a null expectation of the binomial distribution

Assumptions of χ2 test

•  No more than 20% of categories have Expected<5

•  No category with Expected ≤ 1

Fitting other distributions: the Poisson distribution

The Poisson distribution describes the probability that a certain number of events occur in a block of time or space, when those events happen independently of each other and occur with equal probability at every point in time or space.

Poisson distribution

$$\Pr[X] = \frac{e^{-\mu}\,\mu^{X}}{X!}$$

Example: Number of goals per side in World Cup Soccer

Q: Is the outcome of a soccer game (at this level) random?

In other words, is the number of goals per team distributed as expected by pure chance?

Hypotheses

•  H0: Number of goals per side follows a Poisson distribution.

•  HA: Number of goals per side does not follow a Poisson distribution.

World Cup 2002 scores

Number of goals for a team (World Cup 2002)

What’s the mean, µ?

$$\bar{x} = \frac{37(0) + 47(1) + 27(2) + 13(3) + 2(4) + 1(5) + 1(8)}{128} = \frac{161}{128} = 1.26$$
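The mean is a weighted average of the number of goals, weighted by how many teams scored that many. A minimal sketch using the frequencies from the calculation above:

```python
# goals scored by a team -> number of teams with that score (World Cup 2002)
goal_counts = {0: 37, 1: 47, 2: 27, 3: 13, 4: 2, 5: 1, 8: 1}

n_teams = sum(goal_counts.values())
total_goals = sum(goals * count for goals, count in goal_counts.items())
mean_goals = total_goals / n_teams

print(f"{total_goals} goals / {n_teams} team scores = {mean_goals:.2f}")
```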

Poisson with µ = 1.26

Example:

$$\Pr[2] = \frac{e^{-\mu}\,\mu^{X}}{X!} = \frac{e^{-1.26}\,(1.26)^{2}}{2!} = \frac{(0.284)(1.59)}{2} = 0.225$$

Poisson with µ = 1.26

X     Pr[X]
0     0.284
1     0.357
2     0.225
3     0.095
4     0.030
5     0.008
6     0.002
7     0
≥8    0
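The table of probabilities can be generated from the Poisson formula with a few lines of code. `poisson_pmf` is a hypothetical helper name used only in this sketch.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Pr[X = x] for a Poisson distribution with mean mu."""
    return exp(-mu) * mu**x / factorial(x)

mu = 1.26  # mean number of goals per side, estimated from the data
probabilities = [poisson_pmf(x, mu) for x in range(8)]

for x, pr in enumerate(probabilities):
    print(f"{x}  {pr:.3f}")
```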

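Finding the Expected

X     Pr[X]    Expected
0     0.284    36.3
1     0.357    45.7
2     0.225    28.8
3     0.095    12.1
4     0.030    3.8
5     0.008    1.0
6     0.002    0.2
7     0        0.04
≥8    0        0.007

Too small! The expected frequencies in the upper tail are too small for the χ2 test, so the classes with X ≥ 4 are combined into a single class.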
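The two assumptions of the χ2 test can be checked mechanically before and after pooling. This is a sketch using the rounded expected values from the table; `meets_chi2_assumptions` is a hypothetical helper name.

```python
def meets_chi2_assumptions(expected):
    """True if no more than 20% of classes have Expected < 5 and none has Expected <= 1."""
    n_small = sum(1 for e in expected if e < 5)
    return n_small <= 0.2 * len(expected) and all(e > 1 for e in expected)

# Expected frequencies for X = 0, 1, ..., 7 and >= 8 (from the table).
expected = [36.3, 45.7, 28.8, 12.1, 3.8, 1.0, 0.2, 0.04, 0.007]

# Pool every class with X >= 4 into one class.
pooled = expected[:4] + [sum(expected[4:])]

print(meets_chi2_assumptions(expected))  # before pooling
print(meets_chi2_assumptions(pooled))    # after pooling
print(f"pooled expected for X >= 4: {pooled[-1]:.1f}")
```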

Calculating χ2

X      Expected   Observed   (Observed − Expected)² / Expected
0      36.3       37         0.013
1      45.7       47         0.037
2      28.8       27         0.113
3      12.1       13         0.067
≥ 4    5.0        4          0.200

$$\chi^2 = \sum_{\text{all classes}} \frac{(Observed_i - Expected_i)^2}{Expected_i} = 0.429$$
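The sum can be checked from the table. The sketch below uses the expected values rounded to one decimal place, so its result (about 0.43) may differ in the third decimal from the 0.429 reported on the slide.

```python
# Classes: X = 0, 1, 2, 3, and >= 4 goals.
observed = [37, 47, 27, 13, 4]
expected = [36.3, 45.7, 28.8, 12.1, 5.0]  # rounded values from the table

terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)

# One parameter (the mean) was estimated from the data.
df = len(observed) - 1 - 1

for o, e, t in zip(observed, expected, terms):
    print(f"{e:5.1f} {o:3d} {t:.3f}")
print(f"chi-square = {chi2:.2f}, df = {df}")
```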

Degrees of freedom

df = (Number of categories) – (Number of parameters estimated from the data) – 1 = 5 – 1 – 1 = 3

Critical value

Comparing χ2 to the critical value

$$\chi^2 = 0.429, \qquad \chi^2_{0.05,3} = 7.81$$

$$0.429 < 7.81$$

So we cannot reject the null hypothesis. There is no evidence that the score of a World Cup Soccer game is not Poisson distributed.

World Cup 2002 scores

Poisson distribution