20 Advances ljh - | Department of Zoology at UBC

16
1 Advances in Statistics Or, what you might find if you picked up a current issue of a Biological Journal 2 Advances in Statistics Extensions to the ANOVA Computer-intensive methods Maximum likelihood 3 Extensions to ANOVA One-way ANOVA – This works for a single explanatory variable – Simplest possible design Two-way ANOVA – Two categorical explanatory variables – Factorial design 4 ANOVA Tables P N-1 Total N-k Error k-1 Treatment F ratio Mean Squares df Sum of squares Source of variation SS error = s i 2 (n i "1) # SS group = n i (Y i " Y ) 2 # SS group + SS error MS error = SS error df error MS group = SS group df group F = MS group MS error *

Transcript of 20 Advances ljh - | Department of Zoology at UBC

Page 1: 20 Advances ljh - | Department of Zoology at UBC

1

Advances in Statistics

Or, what you might find if you picked

up a current issue of a Biological

Journal

2

Advances in Statistics

• Extensions to the ANOVA

• Computer-intensive methods

• Maximum likelihood

3

Extensions to ANOVA

• One-way ANOVA

– This works for a single explanatory variable

– Simplest possible design

• Two-way ANOVA

– Two categorical explanatory variables

– Factorial design

4

ANOVA Tables

P

N-1Total

N-kError

k-1Treatment

F ratioMean

Squares

dfSum of squaresSource of

variation

!

SSerror

= si

2(n

i"1)#!

SSgroup = ni(Y i "Y )2#

!

SSgroup + SSerror!

MSerror =SSerror

dferror!

MSgroup =SSgroup

dfgroup

!

F =MSgroup

MSerror*

Page 2: 20 Advances ljh - | Department of Zoology at UBC

5

Two-factor ANOVA Table

N-1SStotalTotal

SSerror

XXX

XXXSSerrorError

MS1*2

MSE

SS1*2

(k1 - 1)*(k2 - 1)

(k1 - 1)*(k2 - 1)SS1*2Treatment 1 *

Treatment 2

MS2

MSE

SS2

k2 - 1

k2 - 1SS2Treatment 2

MS1

MSE

SS1

k1 - 1

k1 - 1SS1Treatment 1

PF ratioMean SquaredfSum of

Squares

Source of

variation

6

Two-factor ANOVA Table

N-1SStotalTotal

SSerror

XXX

XXXSSerrorError

MS1*2

MSE

SS1*2

(k1 - 1)*(k2 - 1)

(k1 - 1)*(k2 - 1)SS1*2Treatment 1 *

Treatment 2

MS2

MSE

SS2

k2 - 1

k2 - 1SS2Treatment 2

MS1

MSE

SS1

k1 - 1

k1 - 1SS1Treatment 1

PF ratioMean SquaredfSum of

Squares

Source of

variation

Two categorical

explanatory

variables

7

General Linear Models

• Used to analyze variation in Y when there is

more than one explanatory variable

• Explanatory variables can be categorical or

numerical

8

General Linear Models

• First step: formulate a model statement

• Example:

!

Y = µ + TREATMENT

Page 3: 20 Advances ljh - | Department of Zoology at UBC

9

General Linear Models

• First step: formulate a model statement

• Example:

!

Y = µ + TREATMENT

Overall

mean

Treatment

effect10

General Linear Models

• Second step: Make an ANOVA table

• Example:

P

N-1Total

N-kError

k-1Treatment

F ratioMean

Squares

dfSum of

squares

Source of

variation

!

SSerror

= si

2(n

i"1)#!

SSgroup = ni(Y i "Y )2#

!

SSgroup + SSerror

!

MSerror =SSerror

dferror!

MSgroup =SSgroup

dfgroup

!

F =MSgroup

MSerror*

11

General Linear Models

• Second step: Make an ANOVA table

• Example:

P

N-1Total

N-kError

k-1Treatment

F ratioMean

Squares

dfSum of

squares

Source of

variation

!

SSerror

= si

2(n

i"1)#!

SSgroup = ni(Y i "Y )2#

!

SSgroup + SSerror

!

MSerror =SSerror

dferror!

MSgroup =SSgroup

dfgroup

!

F =MSgroup

MSerror*

This is the same as a

one-way ANOVA!

12

General Linear Models

• If there is only one explanatory variable,

these are exactly equivalent to things we’ve

already done

– One categorical variable: ANOVA

– One numerical variable: regression

• Great for more complicated situations

Page 4: 20 Advances ljh - | Department of Zoology at UBC

13

Example 1: Experiment with

blocking

• Fish experiment: sensitivity of goldfish to

light

• Fish are randomly selected from the

population

• Four different light treatments are applied to

each fish

14

Randomized Block Design

Blocks (fish)

Treatments

(light wavelengths)

15

Randomized Block Design

16

Step 1: Make a model statement

!

Y = µ + BLOCK + TREATMENT

Page 5: 20 Advances ljh - | Department of Zoology at UBC

17

Step 2: Make an ANOVA table

18

Another Example: Mole Rats

• Are there lazy mole rats?

• Two variables:

– Worker type: categorical

• “frequent workers” and “infrequent workers”

– Body mass (ln-transformed): numerical

19 20

Step 1: Make a model statement

!

Y = µ + CASTE + LNMASS + CASTE *LNMASS

Page 6: 20 Advances ljh - | Department of Zoology at UBC

21

Step 2: Make an ANOVA table

22

Step 2: Make an ANOVA table

23

Step 1: Make a model statement

!

Y = µ + CASTE + LNMASS

24

Step 2: Make an ANOVA table

Page 7: 20 Advances ljh - | Department of Zoology at UBC

25

Step 2: Make an ANOVA table

Also called ANCOVA-

Analysis of Covariance

26

General Linear Models

• Can handle any number of predictor

variables

• Each can be categorical or numerical

• Tables have the same basic structure

• Same assumptions as ANOVA

27

General Linear Models

• Don’t run out of degrees of freedom!

• Sometimes, the F-statistics will have

DIFFERENT denominators - see book for

an example

28

Computer-intensive methods

• Hypothesis testing:

– Simulation

– Randomization

• Confidence intervals

– Bootstrap

Page 8: 20 Advances ljh - | Department of Zoology at UBC

29

Simulation

• Simulates the sampling process on a

computer many times: generates the null

distribution from estimates done on the

simulated data

• Computer assumes the null hypothesis is

true

30

Example: Social spider sex ratios

Social spiders live in groups

31

Example: Social spider sex ratios

• Groups are mostly females

• Hypothesis: Groups have just enough malesto allow reproduction

• Test: Whether distribution of number ofmales is as predicted by chance

• Problem: Groups are of many different sizes

• Binomial distribution therefore doesn’t apply

32

Simulation:

• For each group, the number of spiders is known.

The overall proportion of males, pm, is known.

• For each group, the computer draws the real

number of spiders, and each has pm probability of

being male.

• This is done for all groups, and the variance in

proportion of males is calculated.

• This is repeated a large number of times.

Page 9: 20 Advances ljh - | Department of Zoology at UBC

33

0.5 1. 1.5 2. 2.5 3. 3.5

200

400

600

800

1000

Variance in proportion of males(Pseudo-values)

Fre

quency

Actual Observed Value (0.44)

The observed value (0.44), or something more extreme,

is observed in only 4.9% of the simulations.

Therefore P = 0.049.34

Randomization

• Used for hypothesis testing

• Mixes the real data randomly

• Variable 1 from an individual is paired withvariable 2 data from a randomly chosenindividual. This is done for all individuals.

• The estimate is made on the randomized data.

• The whole process is repeated numerous times.The distribution of the randomized estimates is thenull distribution.

35

Without replacement

• Randomization is done without

replacement.

• In other words, all data points are used

exactly once in each randomized data set.

36

Randomization can be done for

any test of association between

two variables

Page 10: 20 Advances ljh - | Department of Zoology at UBC

37

Example: Sage crickets

Sage cricket males sometimes

offer their hind-wings to females

to eat during mating.

Do females who eat hind-wings

wait longer to re-mate?

38

Table 12.3A Waiting time to remating in sage cricket females after

initial mating with either a wingless or winged male (presented in

ln(days))

Male wingless Male winged

0 1.4

0.7 1.6

0.7 1.9

1.4 2.3

1.6 2.6

1.8 2.8

1.9 2.8

1.9 2.8

1.9 3.1

2.2 3.8

2.1 3.9

2.1 4.5

4.7

39

ln(Time to remating): First mate had no wings

ln(Time to remating): First mate had intact wings

Problems:

Unequal variance,

non-normal distributions

404.7

4.52.1

3.92.1

3.82.2

3.11.9

2.81.9

2.81.9

2.81.8

2.61.6

2.31.4

1.90.7

1.60.7

1.40

Malewinged

Malewingless

Real data: Randomized data:

!

Y 1"Y

2= "1.41

3.1

0.72.8

2.81.9

4.52.6

1.64.7

2.13.9

2.21.9

1.41.4

03.8

1.61.8

2.11.9

1.92.3

2.80.7

Malewinged

Malewingless

!

Y 1"Y

2= 0.41

Page 11: 20 Advances ljh - | Department of Zoology at UBC

41

Note that each data point was

only used once

42

1000 randomizations

P < 0.001

43

Randomization: Other questions

Q: Is this periodic?

(yes)

44

Bootstrap

• Method for estimation (and confidence

intervals)

• Often used for hypothesis testing too

• "Picking yourself up by your own

bootstraps"

Page 12: 20 Advances ljh - | Department of Zoology at UBC

45

Bootstrap

• For each group, randomly pick with

replacement an equal number of data points,

from the data of that group

• With this bootstrap dataset, calculate the

estimate -- bootstrap replicate estimate

464.7

4.52.1

3.92.1

3.82.2

3.11.9

2.81.9

2.81.9

2.81.8

2.61.6

2.31.4

1.90.7

1.60.7

1.40

Malewinged

Malewingless

Real data: Bootstrap data:

!

Y 1"Y

2= "1.41

4.7

4.72.1

4.72.1

4.72.1

4.51.9

3.91.9

3.11.8

3.11.8

2.81.8

2.81.4

2.81.4

1.40.7

1.40.7

Malewinged

Malewingless

!

Y 1"Y

2= "1.78

47 48

Bootstraps are often used in

evolutionary trees

Page 13: 20 Advances ljh - | Department of Zoology at UBC

49

Likelihood

!

L hypothesis A | data( ) = P data | hypothesis A[ ]

Likelihood considers many possible hypotheses,

not just one

50

Law of likelihood

A particular data set supports one hypothesis

better than another if the likelihood of that

hypothesis is higher than the likelihood of the

other hypothesis.

Therefore we try to find the hypothesis

with the maximum likelihood.

51

All estimates we have learned so

far are also maximum likelihood

estimates.

52

"Simple" example

• Using likelihood to estimate a proportion

• Data: 3 out of 8 individuals are male.

• Question: What is the maximum likelihood

estimate of the proportion of males?

Page 14: 20 Advances ljh - | Department of Zoology at UBC

53

Likelihood

!

L p = x( ) = P 3 males out of 8 | p = x[ ]

where x is a hypothesized value of the proportion of males.

e.g., L(p=0.5) is the likelihood of the hypothesis that

the proportion of males is 0.5.

54

For this example only...

The probability of getting 3 males out of 8

independent trials is given by the binomial

distribution.

!

L p = x( ) = Pr data | p = x[ ]

= Pr 3out of 8 | p = x[ ]

=8

3

"

# $ %

& ' x

31( x( )

8(3

55

How to find maximum likelihood

hypothesis

1. Calculus

or

2. Computer calculations

56

By calculus...

• Maximum value of L(p=x) is found when x

= 3/8.

• Note that this is the same value we would

have gotten by methods we already learned.

Page 15: 20 Advances ljh - | Department of Zoology at UBC

57

By computer calculation...

0.2 0.4 0.6 0.8 1

0.001

0.002

0.003

0.004

0.005

x

L(p=x)

x = 3/8

Input likelihood formula to computer, plot the

value of L for each value of x, and find the

largest L. 58

Finding genes for corn yield:

Corn Chromosome 5

59

Hypothesis testing by likelihood

• Compares the likelihood of maximum

likelihood estimate to a null hypothesis

Log-likelihood ratio =

!

lnLikelihood[Maximum likelihood hypothesis]

Likelihood[Null hypothesis]

"

# $

%

& '

60

Test statistic

!

" 2= 2 log likelihood ratio

With df equal to the number of variables

fixed to make null hypothesis

Page 16: 20 Advances ljh - | Department of Zoology at UBC

61

Example:3 males out of 8 individuals

• H0: 50% are male

• Maximum likelihood estimate

!

ˆ p =3

8

!

L p = 3/8[ ] =8

3

"

# $ %

& ' 3/8( )

3

1( 3/8( )5

= 0.2816

62

Likelihood of null hypothesis

!

L p = 0.5[ ] =8

3

"

# $ %

& ' 0.5( )

3

1( 0.5( )5

= 0.21875

63

Log likelihood ratio

!

lnL p = 3/8[ ]L p = 0.5[ ]

"

# $

%

& ' = ln

0.2816

0.21875

"

# $ %

& ' = 0.2526

( 2 = 2 0.2526( ) = 0.5051

We fixed one variable in the null hypothesis (p),

So the test has df = 1.

!

"0.05,1

2= 3.84, so we do not reject H0.