Hypothesis Testing
Statistical Inference – dealing with parameter and model uncertainty
Confidence Intervals (credible intervals)
Hypothesis Tests
Goodness-of-fit
Model Selection (AIC)
Model averaging
Bayesian Model Updating
Statistical Testing of Hypotheses
Objective: determine whether parameters differ from hypothesized values.
Testing procedure framed in terms of a comparison of null and alternative hypotheses:
Null hypothesis H0: θ = θ0
Alternative hypothesis Ha: θ ≠ θ0
Compound (1-sided) alternatives: Ha: θ > θ0 or Ha: θ < θ0
Procedure for Null Hypothesis Testing
Specify null and alternative hypotheses
Compute test statistic: a random variable that summarizes the expected sampling distribution given that the null hypothesis is true (e.g., the distribution of the difference between sample means for 2 groups if the true means are the same)
Compare the statistic to the sampled value
Test is a binary decision
Significance level of the test: α
Two types of incorrect decisions:
Rejecting H0 when it is true (Type I error), Pr = α
Not rejecting H0 when it is false (Type II error), Pr = β
Power of test = 1 − β
P-values
Probability of obtaining a test statistic at least as extreme as the observed one, given that the null hypothesis is true
Not Pr(null hypothesis is true)
Degree of consistency of the data with the null, not strength of evidence for the alternative
Dependent on the null hypothesis (if the null is that the groups differ by 1 rather than 0, the p-value will be different)
Dependent on sample size
Does not provide information on the size or precision of the estimated effect (i.e., not a measure of biological relevance or a confidence interval)
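The "at least as extreme" definition can be made concrete with a small simulation-based sketch, here a permutation test for a difference in group means (the data are invented for illustration):

```python
import random

random.seed(1)

# Invented measurements for two groups
group_a = [4.1, 3.8, 4.5, 4.0, 4.3, 3.9, 4.2, 4.4]
group_b = [3.6, 3.9, 3.5, 3.8, 3.7, 3.4, 4.0, 3.6]

def mean(xs):
    return sum(xs) / len(xs)

observed = abs(mean(group_a) - mean(group_b))

# Permutation null: if the true means are equal, group labels are
# exchangeable, so shuffle the labels and recompute the difference.
pooled = group_a + group_b
n_a = len(group_a)
n_perm = 10000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    if diff >= observed:        # "at least as extreme" as observed
        count += 1

p_value = count / n_perm
print(p_value)
```

Note the p-value shrinks as sample size grows for a fixed true difference, which is exactly the sample-size dependence listed above.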
Reality vs. Conclusion
If we don’t reject H0 (null hypothesis):
H0 true, Ha false: 1 − α (e.g., 0.95). Probability of correctly saying there is no difference when there is none: 95/100 times when there is no effect, we’ll correctly say there is no effect.
H0 false, Ha true: β (e.g., 0.20), Type II Error. Probability of saying there is no difference when there really is one: 20/100 times when there is an effect, we’ll say there is no effect.
If we reject H0, accept Ha (alternative hypothesis):
H0 true, Ha false: α (e.g., 0.05), Type I Error. Probability of saying there is a difference when there is no difference: 5/100 times when there is no effect, we’ll say there is one.
H0 false, Ha true: 1 − β (e.g., 0.80), POWER. Probability of saying there is a difference when there is one: 80/100 times when there is an effect, we’ll say there is one.
Comments: Lower α, lower power; higher α, higher power
Lower α is conservative in terms of rejecting the null when it’s true (i.e., less likely to say there’s an effect when there really isn’t)
Higher α increases the chance of a Type I error, decreases the chance of a Type II error, and decreases the rigor of the test.
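The error probabilities above can be checked by simulation; a minimal sketch, assuming a two-sample z-test with known σ = 1 and a true null (so every rejection is a Type I error):

```python
import random, math

random.seed(42)

# Two groups drawn from the SAME normal distribution: H0 is true,
# so the rejection rate should be close to alpha = 0.05.
n, sims = 30, 20000
cutoff = 1.96                   # two-sided 5% cutoff for a standard normal
rejections = 0
for _ in range(sims):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # z-statistic for the difference in means, sigma = 1 known
    z = (sum(a) / n - sum(b) / n) / math.sqrt(1 / n + 1 / n)
    if abs(z) > cutoff:
        rejections += 1

print(rejections / sims)        # close to 0.05
```

Raising the cutoff (lowering α) drops this rate, at the cost of power, matching the comments above.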
Sample Design: Choosing a Sample Size
Can choose based on a target precision level (e.g., confidence interval width) or power (hypothesis testing)
Requires assumptions and tentative parameter values (e.g., effect size); therefore it is an exercise in approximation
Might identify cases where the minimal sufficient sample size would bust the budget or is logistically impractical to achieve.
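One standard power-based calculation (the normal-theory formula for a two-sample comparison; not given in the slides, added here as an illustration) shows how the tentative effect size drives the answer:

```python
import math

# n per group = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2
# delta (effect size) and sigma are tentative assumptions, which is
# why the result is an approximation, not a guarantee.
def n_per_group(delta, sigma, z_alpha=1.96, z_beta=0.84):
    # defaults: two-sided alpha = 0.05, power = 0.80
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

print(n_per_group(delta=0.5, sigma=1.0))   # 63 per group
```

Halving the assumed effect size roughly quadruples the required n, which is how a design can "bust the budget."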
Likelihood Ratio Tests
Compare the fit of a hypothesized model to another model (generally containing more parameters): a null model versus an alternative model with additional parameters
Based on maximum likelihood estimation theory: evaluate the MLE under the restricted and the more general parameterizations
Calculate the likelihood ratio statistic
Distributed as chi-square (χ²), with degrees of freedom equal to the difference in the number of parameters between the models
−2 logₑ [ L(θ̂₀ | x) / L(θ̂ₐ | x) ] ~ χ²
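The recipe can be sketched for a simple case: a Poisson model with one common rate (null) versus group-specific rates (alternative), using invented counts. The MLE of a Poisson rate is the sample mean, so both fits are closed-form:

```python
import math

# Invented counts for two groups
group1 = [3, 5, 4, 6, 2, 4]
group2 = [7, 8, 6, 9, 7, 8]

def poisson_loglik(counts, lam):
    # log L = sum( x*log(lam) - lam - log(x!) )
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in counts)

# MLEs: restricted model (one common rate) vs general model (rate per group)
lam0 = (sum(group1) + sum(group2)) / (len(group1) + len(group2))
lam1 = sum(group1) / len(group1)
lam2 = sum(group2) / len(group2)

ll_null = poisson_loglik(group1 + group2, lam0)
ll_alt = poisson_loglik(group1, lam1) + poisson_loglik(group2, lam2)

# -2 log likelihood ratio; df = 1 (one extra parameter in the alternative)
lrt = -2 * (ll_null - ll_alt)
print(round(lrt, 2))    # compare to 3.84, the 0.95 chi-square quantile, df = 1
```

Here the statistic exceeds 3.84, so the restricted (equal-rates) model would be rejected at α = 0.05.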
Goodness of Fit (GOF)
“Absolute” fit of a model
Goal is to determine whether the data are reflective of the statistical model
A test statistic is generated from the probability model using the estimated parameters
Is there variation in the data that is out of the ordinary and not reflected in our statistical model?
Pearson’s χ² GOF Test
Logic: if the model is ‘correct’, expected and observed cell frequencies for each multinomial cell should be similar.
Imagine we roll a die 1000 times and want to determine if the model P(x=1)=P(x=2)=…=P(x=6) is a good model
If the sample size is adequate (expect at least 2 per cell),
Σᵢ (observedᵢ − expectedᵢ)² / expectedᵢ ~ χ² (df = # cells − 1)
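The die example computed directly, simulating the 1000 rolls from a fair die so the model being tested is actually true:

```python
import random

random.seed(7)

# 1000 rolls of a fair die; H0: P(x=1) = ... = P(x=6) = 1/6
rolls = [random.randint(1, 6) for _ in range(1000)]
observed = [rolls.count(face) for face in range(1, 7)]
expected = 1000 / 6     # same expected count for every cell under H0

# Pearson's chi-square statistic over the 6 multinomial cells
chi_sq = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_sq, 2))  # compare to 11.07, the 0.95 chi-square quantile, df = 5
```

A statistic above 11.07 would suggest the fair-die model does not fit; with truly fair rolls that happens only about 5% of the time.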
General GOF for Large Samples
Pearson’s χ²
Direct use of the deviance
Deviance = −2 logₑ [ L(θ̂₀ | x) / L(θ̂ₛₐₜᵤᵣₐₜₑ𝒹 | x) ]
Bootstrap GOF Test
Compute ML estimates for parameters
Produce an empirical distribution of estimates:
Simulate capture histories for each released animal: assume parameter = MLE, ‘flip coins’ to determine survival and capture for each period
Repeat for {Ri} animals, estimate parameters, compute the deviance
Compare the original deviance with the empirical distribution (i.e., what percentile?)
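The steps above can be sketched in simplified form. Here the capture-history simulation is reduced to a one-period binomial survival model with several release groups and invented data; the deviance-percentile logic is the same:

```python
import math, random

random.seed(11)

released = [50, 60, 40, 55]     # R_i animals per release group (invented)
survived = [31, 40, 22, 38]     # observed survivors per group (invented)

def deviance(surv, rel):
    # -2 log[ L(common p) / L(saturated, group-specific p) ]
    p_hat = sum(surv) / sum(rel)
    dev = 0.0
    for s, r in zip(surv, rel):
        if s > 0:
            dev += 2 * s * math.log(s / (r * p_hat))
        if s < r:
            dev += 2 * (r - s) * math.log((r - s) / (r * (1 - p_hat)))
    return dev

obs_dev = deviance(survived, released)

# Parametric bootstrap: 'flip coins' with parameter = MLE, recompute deviance
p_mle = sum(survived) / sum(released)
boot = []
for _ in range(2000):
    sim = [sum(random.random() < p_mle for _ in range(r)) for r in released]
    boot.append(deviance(sim, released))

# Percentile of the observed deviance in the empirical distribution
percentile = sum(d <= obs_dev for d in boot) / len(boot)
print(round(percentile, 3))
```

A very high percentile (e.g., above 0.95) means the observed deviance is extreme relative to what the fitted model generates, i.e., evidence of lack of fit.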
What indicates lack of fit?
With a GOF test, the hope and purpose is to fail to reject the null hypothesis
This is counter to the usual use of statistical hypothesis testing
What is a ‘significant’ P-value?
What might cause lack of fit?
Inadequate model structure for detection or survival, e.g., age dependence, size dependence, etc.
Trap dependence
Those released earlier survive at a different rate
Non-random temporary emigration
Lack of independence among animals
Solutions
Inadequate model structure? Improve it.
Goal: subdivide animals sufficiently that there is equal p and S within a group
Warning: inadequate model structure doesn’t always result in lack of fit, e.g.:
Permanent emigration (confounded with S)
Random temporary emigration (confounded with p)
Random ring loss (confounded with S)
Lack of independence? Correct for overdispersion: inflate variances using quasi-likelihood.
Adjusting Variances for Overdispersion
Based on quasi-likelihood theory:
c-hat = deviance / df
adjusted variance = c-hat × (ML variance)
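A minimal numeric sketch of the adjustment (all numbers invented):

```python
# c-hat > 1 indicates overdispersion; variances are inflated accordingly.
deviance, df = 58.4, 40
ml_variance = 0.0025            # e.g., ML variance of a survival estimate

c_hat = deviance / df           # 1.46: modest overdispersion
adj_variance = c_hat * ml_variance
adj_se = adj_variance ** 0.5    # standard errors scale by sqrt(c-hat)

print(round(c_hat, 2), round(adj_se, 4))
```

Since standard errors scale by √c-hat, even a c-hat of 1.5 widens confidence intervals by about 22%.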
Bootstrap Adjustment for Overdispersion
For each simulated sample:
compute the deviance
compute c-hat = deviance / df
Bootstrap c-hat = (observed deviance) / (mean simulated deviance), or (observed c-hat) / (mean simulated c-hat)
Note: could replace the deviance with Pearson’s χ², or the mean with the median.
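The bootstrap c-hat calculation, sketched with invented deviances (in practice the simulated values come from a parametric bootstrap like the GOF procedure above):

```python
# Observed deviance and deviances from simulated samples (all invented)
observed_deviance = 58.4
simulated_deviances = [32.1, 41.0, 38.6, 35.9, 44.2, 39.7, 36.5, 40.4]

mean_dev = sum(simulated_deviances) / len(simulated_deviances)
c_hat_boot = observed_deviance / mean_dev   # bootstrap c-hat
print(round(c_hat_boot, 2))
```

The median could be substituted for the mean, as the note above says, to reduce the influence of a few extreme simulated deviances.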