UBC Technical Guidelines Division 26 UBC Vancouver and UBC ...
20 Advances ljh - | Department of Zoology at UBC
Transcript of 20 Advances ljh - | Department of Zoology at UBC
1
Advances in Statistics
Or, what you might find if you picked
up a current issue of a Biological
Journal
2
Advances in Statistics
• Extensions to the ANOVA
• Computer-intensive methods
• Maximum likelihood
3
Extensions to ANOVA
• One-way ANOVA
– This works for a single explanatory variable
– Simplest possible design
• Two-way ANOVA
– Two categorical explanatory variables
– Factorial design
4
ANOVA Tables
P
N-1Total
N-kError
k-1Treatment
F ratioMean
Squares
dfSum of squaresSource of
variation
!
SSerror
= si
2(n
i"1)#!
SSgroup = ni(Y i "Y )2#
!
SSgroup + SSerror!
MSerror =SSerror
dferror!
MSgroup =SSgroup
dfgroup
!
F =MSgroup
MSerror*
5
Two-factor ANOVA Table
N-1SStotalTotal
SSerror
XXX
XXXSSerrorError
MS1*2
MSE
SS1*2
(k1 - 1)*(k2 - 1)
(k1 - 1)*(k2 - 1)SS1*2Treatment 1 *
Treatment 2
MS2
MSE
SS2
k2 - 1
k2 - 1SS2Treatment 2
MS1
MSE
SS1
k1 - 1
k1 - 1SS1Treatment 1
PF ratioMean SquaredfSum of
Squares
Source of
variation
6
Two-factor ANOVA Table
N-1SStotalTotal
SSerror
XXX
XXXSSerrorError
MS1*2
MSE
SS1*2
(k1 - 1)*(k2 - 1)
(k1 - 1)*(k2 - 1)SS1*2Treatment 1 *
Treatment 2
MS2
MSE
SS2
k2 - 1
k2 - 1SS2Treatment 2
MS1
MSE
SS1
k1 - 1
k1 - 1SS1Treatment 1
PF ratioMean SquaredfSum of
Squares
Source of
variation
Two categorical
explanatory
variables
7
General Linear Models
• Used to analyze variation in Y when there is
more than one explanatory variable
• Explanatory variables can be categorical or
numerical
8
General Linear Models
• First step: formulate a model statement
• Example:
!
Y = µ + TREATMENT
9
General Linear Models
• First step: formulate a model statement
• Example:
!
Y = µ + TREATMENT
Overall
mean
Treatment
effect10
General Linear Models
• Second step: Make an ANOVA table
• Example:
P
N-1Total
N-kError
k-1Treatment
F ratioMean
Squares
dfSum of
squares
Source of
variation
!
SSerror
= si
2(n
i"1)#!
SSgroup = ni(Y i "Y )2#
!
SSgroup + SSerror
!
MSerror =SSerror
dferror!
MSgroup =SSgroup
dfgroup
!
F =MSgroup
MSerror*
11
General Linear Models
• Second step: Make an ANOVA table
• Example:
P
N-1Total
N-kError
k-1Treatment
F ratioMean
Squares
dfSum of
squares
Source of
variation
!
SSerror
= si
2(n
i"1)#!
SSgroup = ni(Y i "Y )2#
!
SSgroup + SSerror
!
MSerror =SSerror
dferror!
MSgroup =SSgroup
dfgroup
!
F =MSgroup
MSerror*
This is the same as a
one-way ANOVA!
12
General Linear Models
• If there is only one explanatory variable,
these are exactly equivalent to things we’ve
already done
– One categorical variable: ANOVA
– One numerical variable: regression
• Great for more complicated situations
13
Example 1: Experiment with
blocking
• Fish experiment: sensitivity of goldfish to
light
• Fish are randomly selected from the
population
• Four different light treatments are applied to
each fish
14
Randomized Block Design
Blocks (fish)
Treatments
(light wavelengths)
15
Randomized Block Design
16
Step 1: Make a model statement
!
Y = µ + BLOCK + TREATMENT
17
Step 2: Make an ANOVA table
18
Another Example: Mole Rats
• Are there lazy mole rats?
• Two variables:
– Worker type: categorical
• “frequent workers” and “infrequent workers”
– Body mass (ln-transformed): numerical
19 20
Step 1: Make a model statement
!
Y = µ + CASTE + LNMASS + CASTE *LNMASS
21
Step 2: Make an ANOVA table
22
Step 2: Make an ANOVA table
23
Step 1: Make a model statement
!
Y = µ + CASTE + LNMASS
24
Step 2: Make an ANOVA table
25
Step 2: Make an ANOVA table
Also called ANCOVA-
Analysis of Covariance
26
General Linear Models
• Can handle any number of predictor
variables
• Each can be categorical or numerical
• Tables have the same basic structure
• Same assumptions as ANOVA
27
General Linear Models
• Don’t run out of degrees of freedom!
• Sometimes, the F-statistics will have
DIFFERENT denominators - see book for
an example
28
Computer-intensive methods
• Hypothesis testing:
– Simulation
– Randomization
• Confidence intervals
– Bootstrap
29
Simulation
• Simulates the sampling process on a
computer many times: generates the null
distribution from estimates done on the
simulated data
• Computer assumes the null hypothesis is
true
30
Example: Social spider sex ratios
Social spiders live in groups
31
Example: Social spider sex ratios
• Groups are mostly females
• Hypothesis: Groups have just enough malesto allow reproduction
• Test: Whether distribution of number ofmales is as predicted by chance
• Problem: Groups are of many different sizes
• Binomial distribution therefore doesn’t apply
32
Simulation:
• For each group, the number of spiders is known.
The overall proportion of males, pm, is known.
• For each group, the computer draws the real
number of spiders, and each has pm probability of
being male.
• This is done for all groups, and the variance in
proportion of males is calculated.
• This is repeated a large number of times.
33
0.5 1. 1.5 2. 2.5 3. 3.5
200
400
600
800
1000
Variance in proportion of males(Pseudo-values)
Fre
quency
Actual Observed Value (0.44)
The observed value (0.44), or something more extreme,
is observed in only 4.9% of the simulations.
Therefore P = 0.049.34
Randomization
• Used for hypothesis testing
• Mixes the real data randomly
• Variable 1 from an individual is paired withvariable 2 data from a randomly chosenindividual. This is done for all individuals.
• The estimate is made on the randomized data.
• The whole process is repeated numerous times.The distribution of the randomized estimates is thenull distribution.
35
Without replacement
• Randomization is done without
replacement.
• In other words, all data points are used
exactly once in each randomized data set.
36
Randomization can be done for
any test of association between
two variables
37
Example: Sage crickets
Sage cricket males sometimes
offer their hind-wings to females
to eat during mating.
Do females who eat hind-wings
wait longer to re-mate?
38
Table 12.3A Waiting time to remating in sage cricket females after
initial mating with either a wingless or winged male (presented in
ln(days))
Male wingless Male winged
0 1.4
0.7 1.6
0.7 1.9
1.4 2.3
1.6 2.6
1.8 2.8
1.9 2.8
1.9 2.8
1.9 3.1
2.2 3.8
2.1 3.9
2.1 4.5
4.7
39
ln(Time to remating): First mate had no wings
ln(Time to remating): First mate had intact wings
Problems:
Unequal variance,
non-normal distributions
404.7
4.52.1
3.92.1
3.82.2
3.11.9
2.81.9
2.81.9
2.81.8
2.61.6
2.31.4
1.90.7
1.60.7
1.40
Malewinged
Malewingless
Real data: Randomized data:
!
Y 1"Y
2= "1.41
3.1
0.72.8
2.81.9
4.52.6
1.64.7
2.13.9
2.21.9
1.41.4
03.8
1.61.8
2.11.9
1.92.3
2.80.7
Malewinged
Malewingless
!
Y 1"Y
2= 0.41
41
Note that each data point was
only used once
42
1000 randomizations
P < 0.001
43
Randomization: Other questions
Q: Is this periodic?
(yes)
44
Bootstrap
• Method for estimation (and confidence
intervals)
• Often used for hypothesis testing too
• "Picking yourself up by your own
bootstraps"
45
Bootstrap
• For each group, randomly pick with
replacement an equal number of data points,
from the data of that group
• With this bootstrap dataset, calculate the
estimate -- bootstrap replicate estimate
464.7
4.52.1
3.92.1
3.82.2
3.11.9
2.81.9
2.81.9
2.81.8
2.61.6
2.31.4
1.90.7
1.60.7
1.40
Malewinged
Malewingless
Real data: Bootstrap data:
!
Y 1"Y
2= "1.41
4.7
4.72.1
4.72.1
4.72.1
4.51.9
3.91.9
3.11.8
3.11.8
2.81.8
2.81.4
2.81.4
1.40.7
1.40.7
Malewinged
Malewingless
!
Y 1"Y
2= "1.78
47 48
Bootstraps are often used in
evolutionary trees
49
Likelihood
!
L hypothesis A | data( ) = P data | hypothesis A[ ]
Likelihood considers many possible hypotheses,
not just one
50
Law of likelihood
A particular data set supports one hypothesis
better than another if the likelihood of that
hypothesis is higher than the likelihood of the
other hypothesis.
Therefore we try to find the hypothesis
with the maximum likelihood.
51
All estimates we have learned so
far are also maximum likelihood
estimates.
52
"Simple" example
• Using likelihood to estimate a proportion
• Data: 3 out of 8 individuals are male.
• Question: What is the maximum likelihood
estimate of the proportion of males?
53
Likelihood
!
L p = x( ) = P 3 males out of 8 | p = x[ ]
where x is a hypothesized value of the proportion of males.
e.g., L(p=0.5) is the likelihood of the hypothesis that
the proportion of males is 0.5.
54
For this example only...
The probability of getting 3 males out of 8
independent trials is given by the binomial
distribution.
!
L p = x( ) = Pr data | p = x[ ]
= Pr 3out of 8 | p = x[ ]
=8
3
"
# $ %
& ' x
31( x( )
8(3
55
How to find maximum likelihood
hypothesis
1. Calculus
or
2. Computer calculations
56
By calculus...
• Maximum value of L(p=x) is found when x
= 3/8.
• Note that this is the same value we would
have gotten by methods we already learned.
57
By computer calculation...
0.2 0.4 0.6 0.8 1
0.001
0.002
0.003
0.004
0.005
x
L(p=x)
x = 3/8
Input likelihood formula to computer, plot the
value of L for each value of x, and find the
largest L. 58
Finding genes for corn yield:
Corn Chromosome 5
59
Hypothesis testing by likelihood
• Compares the likelihood of maximum
likelihood estimate to a null hypothesis
Log-likelihood ratio =
!
lnLikelihood[Maximum likelihood hypothesis]
Likelihood[Null hypothesis]
"
# $
%
& '
60
Test statistic
!
" 2= 2 log likelihood ratio
With df equal to the number of variables
fixed to make null hypothesis
61
Example:3 males out of 8 individuals
• H0: 50% are male
• Maximum likelihood estimate
!
ˆ p =3
8
!
L p = 3/8[ ] =8
3
"
# $ %
& ' 3/8( )
3
1( 3/8( )5
= 0.2816
62
Likelihood of null hypothesis
!
L p = 0.5[ ] =8
3
"
# $ %
& ' 0.5( )
3
1( 0.5( )5
= 0.21875
63
Log likelihood ratio
!
lnL p = 3/8[ ]L p = 0.5[ ]
"
# $
%
& ' = ln
0.2816
0.21875
"
# $ %
& ' = 0.2526
( 2 = 2 0.2526( ) = 0.5051
We fixed one variable in the null hypothesis (p),
So the test has df = 1.
!
"0.05,1
2= 3.84, so we do not reject H0.