Chapter 4: Introduction to Hypothesis Testing
Stanford University
Statistical Hypothesis Testing
a dramatically incomplete primer
Are you just lucky?
You live in one world, in which the results came out the way they did. If we tried it in one hundred parallel worlds, in how many would it have come out the same way? 1? 80? 100?
[Figure: the distribution of all possible samples, with “You are here” marking our one sample]
Enter statistics
Hypothesis testing formalizes our intuition on this question. It quantifies: in what % of parallel worlds would the results have come out this way?
This is what we call a p-value. p < .05 intuitively means “a result like this would have come up by chance in fewer than 5% of parallel worlds” (parallel world = sample)
Enter statistics
P-values help us to make claims about populations:
“Students have better recall after a full night’s sleep!”
...when we only tested a small sample: “...because these students had better recall after a full night’s sleep!”
Science depends upon this capacity for statistical inference
Why does this work?
[Figure: population distribution with its population mean; the sampling distribution of the sample mean is wide at n=1 and tightly clustered around the population mean at n=100]
Central Limit Theorem: as sample size grows, the distribution of the means of samples approximates a normal distribution centered about the population mean (when samples are independent and identically distributed)
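The parallel-worlds intuition above can be simulated directly. This is an illustrative sketch (not from the slides): we draw many samples from a deliberately skewed (exponential) population and watch the distribution of sample means tighten around the population mean as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Exponential(scale=1.0) has population mean 1.0 but is far from normal.
for n in (1, 10, 100):
    # 10,000 "parallel worlds": each row is one sample of size n
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of sample means={means.mean():.3f}  "
          f"std of sample means={means.std():.3f}")
```

The mean of the sample means stays near 1.0 while their spread shrinks roughly as 1/sqrt(n), and a histogram of `means` at n=100 looks close to a normal curve.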
Hypothesis Testing
Step 1
• Before running any analysis, VISUALIZE!
Mean(x) 9
Mean(y) 7.5
Variance(x) 11
Variance(y) 4.125
Correlation (r) .816
Regression line y = 3 + .5x
Visualize!
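The summary statistics above are those of Anscombe's quartet: four datasets with (nearly) identical means, variances, correlation, and regression line that look completely different when plotted. As a sketch, this reproduces the slide's numbers from Anscombe's dataset I using numpy:

```python
import numpy as np

# Anscombe's quartet, dataset I
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

print(round(x.mean(), 2))                  # mean(x) = 9.0
print(round(y.mean(), 2))                  # mean(y) ≈ 7.5
print(round(x.var(ddof=1), 2))             # variance(x) = 11.0
print(round(np.corrcoef(x, y)[0, 1], 3))   # r ≈ 0.816
slope, intercept = np.polyfit(x, y, 1)     # regression line y ≈ 3 + 0.5x
print(round(intercept, 1), round(slope, 1))
```

The other three quartet datasets (a curve, an outlier-driven line, a vertical cluster) yield the same statistics, which is exactly why you visualize first.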
Anatomy of a statistical test
• If your change had no effect, what would the world look like?
• This is known as the null hypothesis (e.g., no difference in means; no slope in the relationship)
Anatomy of a statistical test
• Given the difference you observed, how likely is it to have occurred by chance?
• If that likelihood is sufficiently small, we reject the null hypothesis
Probability of seeing a mean difference at least this large, by chance: 0.012
Probability of seeing a slope at least this large, by chance: 0.012
Errors

                       Difference exists?
Difference detected?   Yes               No
Yes                    True positive     False positive
No                     False negative    True negative
p-value
• The probability of seeing a difference at least as large as the one observed, by chance alone
• In other words, P(Type I error)
• Typically accepted levels: 0.05, 0.01, 0.001
Comparing two populations:
counts
Count or occurrence data
• “Five people completed the trial with the control interface, and twenty-two completed it with the augmented interface.”

          Control   Augmented
Success   5         22
Failure   35        18
Pearson’s Chi-Squared Test
See: http://yatani.jp/HCIstats/ChiSquare
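The linked page shows this test in R; as a sketch, the same contingency table can be tested in Python with `scipy.stats.chi2_contingency` (which applies Yates' continuity correction by default for 2x2 tables):

```python
from scipy.stats import chi2_contingency

# The 2x2 table from the slide: success/failure counts per interface
table = [[5, 22],    # Success: control, augmented
         [35, 18]]   # Failure: control, augmented

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.5f}")
```

A 2x2 table has (2−1)(2−1) = 1 degree of freedom, and here the completion rates differ far more than chance would predict (p < .001).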
Degrees of freedom??
• Sparing the details, DoF can often be considered the number of independent scores (observations) going into a statistic, minus the number of parameters we estimated (typically 1)
• Higher degrees of freedom mean we had more observations relative to the estimate (less uncertainty!)
• What if two tests explain the same variance, but one has higher DoF?
Comparing two populations:
means
Normally distributed data
[Figure: a normal curve annotated with its mean and standard deviation]
t-test: do the means differ?
• Distributions far apart: likely have different means (reject the null hypothesis)
• Distributions overlapping: likely have the same mean (fail to reject the null hypothesis)
Numbers that matter:
• Difference in means: larger means more significant
• Variance in each group: larger means less significant
• Number of samples: larger means more significant
How many degrees of freedom?
• If we know the mean of N numbers, then only N−1 of those numbers can vary, while the last will always be constrained: (observations – estimations)
• We estimate two means, so a t-test has N−2 degrees of freedom.
Reporting the result
• “Experts rated the designs of those in the augmented condition (μ = 3.4, SD = 0.4) significantly higher than the designs of those in the control condition (μ = 2.0, SD = 0.5), according to an independent-samples t-test (t(18) = −2.2, p < .05).”
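As a sketch of how such a result is computed, here is an independent-samples t-test in Python via `scipy.stats.ttest_ind`. The ratings are invented for illustration (ten per condition, so 18 degrees of freedom as in the report above; they are not the study's actual data):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical expert ratings, invented for illustration
control   = np.array([1.5, 2.0, 2.2, 1.8, 2.5, 1.9, 2.1, 1.6, 2.4, 2.0])
augmented = np.array([3.1, 3.6, 3.0, 3.8, 3.3, 3.5, 2.9, 3.7, 3.4, 3.7])

t, p = ttest_ind(control, augmented)
# DoF = N - 2 (two estimated means)
print(f"t({len(control) + len(augmented) - 2}) = {t:.2f}, p = {p:.4f}")
```

Note the sign of t depends only on the order the groups are passed in; here control − augmented is negative because the augmented mean is higher.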
Within-subjects study designs
• It can be easier to statistically detect a difference if the participants try both alternatives. Why?
• What are the potential issues?
[Figure: between-subjects (each participant tries only Condition 1 or Condition 2) vs. within-subjects (each participant tries both conditions)]
Paired-samples t-test
• If we consider each data point to be independent, then we find no significance (p = .491), because the between-group variance is small relative to the within-group variance
• A paired-samples t-test accounts for differences between individuals, revealing the effect of condition on each (p < .001)
Running a paired t-test in R
See: http://yatani.jp/teaching/doku.php?id=hcistats:ttest#a_paired_t_test
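The linked page shows the R version; the Python analogue uses `scipy.stats.ttest_rel`. This sketch (with invented within-subjects scores) reproduces the contrast on the previous slide: participants vary a lot from each other, but each one improves by about the same amount.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical scores: each participant completes both conditions,
# and every participant improves by roughly 2 points.
cond1 = np.array([12.0, 18.0, 9.0, 25.0, 14.0, 30.0, 11.0, 21.0])
cond2 = cond1 + np.array([2.1, 1.8, 2.4, 1.9, 2.2, 1.7, 2.3, 2.0])

# Treating the points as independent hides the effect...
t_ind, p_ind = ttest_ind(cond1, cond2)
# ...while pairing by participant reveals it.
t_rel, p_rel = ttest_rel(cond1, cond2)
print(f"independent: p = {p_ind:.3f}   paired: p = {p_rel:.6f}")
```

The paired test works on the per-participant differences, whose variance is tiny even though the raw scores span a wide range.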
ANOVA
t-test: compare two means
• “Do people fix more bugs with our IDE bug suggestion callouts?”
ANOVA: compare N means
• “Do people fix more bugs with our IDE bug suggestion callouts, with warnings, or with nothing?”
total deviation from grand mean = deviation of factor mean from grand mean + deviation of response from factor mean
(response − grand mean) = (factor mean − grand mean) + (response − factor mean)
Rough intuition for ANOVA test
How much of the total variation can be accounted for by looking at the means of each condition?
Reporting an ANOVA
• “A one-way ANOVA revealed a significant difference in the effect of news feed source on number of likes (F(2, 21) = 12.1, p < .001).”
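As a sketch, a one-way ANOVA like this can be run with `scipy.stats.f_oneway`. The like counts are invented for illustration (eight posts per source, giving the F(2, 21) degrees of freedom in the report above):

```python
from scipy.stats import f_oneway

# Hypothetical like counts per feed source, invented for illustration
friend   = [80, 75, 90, 85, 70, 88, 79, 83]
stranger = [40, 35, 50, 45, 30, 48, 39, 43]
michael  = [60, 55, 70, 65, 50, 68, 59, 63]

f, p = f_oneway(friend, stranger, michael)
# DoF: between = k - 1 = 2, within = N - k = 24 - 3 = 21
print(f"F(2, 21) = {f:.1f}, p = {p:.6f}")
```

A significant F only says "at least one mean differs": identifying which pairs differ requires the post-hoc tests discussed next.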
Repeated measures
• Note: when your analysis includes any within-subjects factors, use a repeated measures ANOVA.
Post-hoc tests
Omnibus Prime
• ANOVA is an omnibus test. It compares all levels of all factors.
• When ANOVA is significant, it means “At least one of the means is different.” Which one(s)? By how much?
Pairwise (post-hoc) tests
[Figure: bar chart of mean likes (0–90) for the Friend, Stranger, and Michael feeds]
The problem with many tests
• α = .05 implies a .95 probability of being correct in rejecting the null hypothesis
• If we do m tests, the actual probability of being correct on all of them is now: (1 − α)^m = .95^m
• This is called the family-wise error rate
Bonferroni correction
• Correct for family-wise error by adjusting α to be more conservative
• Divide α by the number of comparisons you make
• 4 tests at α = .05 implies using α = .0125
• Conservative but safe method of compensating for multiple tests
• Note: you lose power when conducting lots of tests – so be judicious and plan comparisons via hypotheses!
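The arithmetic on these two slides can be checked directly. This sketch computes the family-wise error rate for m independent tests at α = .05, and the Bonferroni-corrected per-test threshold:

```python
alpha = 0.05
for m in (1, 4, 10):
    # P(at least one false positive across m independent tests)
    fwer = 1 - (1 - alpha) ** m
    # Bonferroni: run each individual test at alpha / m instead
    bonferroni = alpha / m
    print(f"m={m:2d}  FWER={fwer:.3f}  corrected alpha={bonferroni:.4f}")
```

At m = 4 the family-wise error rate has already grown to about .185, and the corrected threshold is .05 / 4 = .0125, matching the slide.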
Tukey test
• Less conservative than Bonferroni
• Compares all pairs of factor level means
Factorial ANOVA
Crossed study designs
• Suppose you wanted to measure whether a drug works for two types of headaches. You have two factors:
• Treatment vs. placebo
• Migraine vs. tension headache
• This is a 2 x 2 study: each factor has two levels
• A factorial ANOVA can serve as an omnibus in this case
Interaction effects
• The study reveals two main effects:
• Those in the treatment condition tend to have less pain than those in the placebo condition
• Those with tension headaches tend to have less pain than those with migraines
• There is also an interaction effect:
• Those who had tension headaches had a larger effect from the treatment than those with migraines
[Figure: interaction plot of pain (0–10) for treatment vs. placebo, with separate lines for migraine and tension headaches]
Two-factor ANOVA test
Both main effects and interaction significant!
Nonparametric tests
Parametric assumptions
• To use the tests we’ve seen thus far, three criteria must be met:
• Independence – each subject is sampled independently of every other subject, and measures from one subject are independent of measures on any other subject
• Normality – data is normally distributed (technically, error terms are normally distributed)
• Homoskedasticity – the variance is similar across all levels of factors
Parametric assumptions
• Non-parametric tests do not make these assumptions. Use them for cases like:
• Rankings or other ordinal data
• Non-uniform variance
Equivalent nonparametric tests

Parametric                Nonparametric
Unpaired t-test           Mann-Whitney U
Paired t-test             Wilcoxon matched pairs
ANOVA                     Kruskal-Wallis
Repeated measures ANOVA   Friedman test
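As a sketch of the first row of this table, `scipy.stats.mannwhitneyu` compares two independent groups of ordinal data (here, invented 5-point Likert ratings, where a t-test's normality assumption would be dubious):

```python
from scipy.stats import mannwhitneyu

# Hypothetical 5-point Likert ratings, invented for illustration
group_a = [2, 3, 2, 4, 3, 2, 3, 2]
group_b = [4, 5, 4, 3, 5, 4, 5, 4]

# Rank-based test: no normality assumption (two-sided by default
# in recent SciPy versions)
u, p = mannwhitneyu(group_a, group_b)
print(f"U = {u}, p = {p:.4f}")
```

The test works on ranks rather than raw values, so it only assumes the ratings can be ordered.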
Effect size
• Significance tests inform us about the likelihood of a meaningful difference between groups, but they don’t always tell us the magnitude of that difference.
• Because any difference will become “significant” with an arbitrarily large sample, it’s important to quantify the effect size that you observe
• We report either standardized or unstandardized effect sizes. When would you use each?
Effect size
• Some common measures of effect size:
• Unstandardized:
• The raw difference between means
• The raw regression coefficient
• Standardized:
• Pearson’s r (for correlations)
• Cohen’s d (for differences between means)
• η² (the proportion of variance explained by one factor, controlling for all others)
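As a sketch, Cohen's d for two independent groups is the raw mean difference divided by the pooled standard deviation (the data below is invented for illustration):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1)
                  + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# The raw (unstandardized) difference here is 0.5 units;
# d expresses it relative to the spread of the data.
a = [5.0, 5.5, 6.0, 6.5, 7.0]
b = [4.5, 5.0, 5.5, 6.0, 6.5]
print(round(cohens_d(a, b), 2))
```

A common rule of thumb reads d ≈ 0.2 as small, 0.5 as medium, and 0.8 as large, though those thresholds depend on the domain.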
A reference for analyses
• DO NOT blindly apply these methods – know what your analysis is considering, and when in doubt ask for assistance!

Tests of Proportions (e.g. counts)

Samples   Response Categories   Tests
1         2                     One-sample χ² test, binomial test
1         >2                    One-sample χ² test, multinomial test
>1        ≥2                    N-sample χ² test, G-test, Fisher's exact test

Analyses of Variance (comparing means)

Factors   Levels   (B)etween or (W)ithin   Parametric Tests                                                    Nonparametric Tests
1         2        B                       Independent-samples t-test; Welch's t-test (if non-homoscedastic)   Mann-Whitney U test
1         >2       B                       One-way ANOVA                                                       Kruskal-Wallis test
1         2        W                       Paired-samples t-test                                               Wilcoxon signed-rank test
1         >2       W                       One-way repeated measures ANOVA                                     Friedman test
>1        ≥2       B                       Factorial ANOVA; linear models                                      Aligned Rank Transform (ART); generalized linear models (GLM)
>1        ≥2       W                       Factorial repeated measures ANOVA; linear mixed models (LMM)        Aligned Rank Transform (ART); generalized linear mixed models (GLMM)
Summary
• Our goal is to make inferences about population characteristics from a sample
• Plan your study with your methods in mind
• Always visualize data first
• Check parametric assumptions
• Run omnibus, and if called-for, post-hoc tests
• Correct your family-wise error rate as necessary
• Report test statistic, DoF, p-value, and effect size
Resources
• Statistics drop-in office hours: https://statistics.stanford.edu/resources/consulting
• Our office hours
• MOOCs: https://www.coursera.org/learn/designexperiments