Chapter 4: Introduction to Hypothesis Testing
Stanford University
Statistical Hypothesis Testing
a dramatically incomplete primer
Are you just lucky?
You live in one world, in which the results came out the way they did. If we tried it in one hundred parallel worlds, in how many would it have come out the same way? 1? 80? 100?
[Figure: the distribution of all possible samples, with “You are here” marking our one sample]
Enter statistics
Hypothesis testing formalizes our intuition on this question. It quantifies: in what % of parallel worlds would the results have come out this way?
This is what we call a p-value. p < .05 intuitively means “a result like this would have come up by chance in fewer than 5% of parallel worlds” (parallel world = sample)
Enter statistics
P-values help us to make claims about populations:
“Students have better recall after a full night’s sleep!”
...when we only tested a small sample: “...because these students had better recall after a full night’s sleep!”
Science depends upon this capacity for statistical inference
Why does this work?
[Figure: population distribution with its population mean; the sampling distribution of the sample mean is wide at n=1 and tightly clustered around the population mean at n=100]
Central Limit Theorem: as sample size grows, the distribution of the means of samples approximates a normal distribution centered about the population mean (when samples are independent and identically distributed)
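The parallel-worlds intuition above can be simulated directly. This is an illustrative sketch (not from the slides): we draw many samples from a deliberately skewed (exponential) population and watch the distribution of sample means tighten around the population mean as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Exponential(scale=1.0) has population mean 1.0 but is far from normal.
for n in (1, 10, 100):
    # 10,000 "parallel worlds": each row is one sample of size n
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  mean of sample means={means.mean():.3f}  "
          f"std of sample means={means.std():.3f}")
```

The mean of the sample means stays near 1.0 while their spread shrinks roughly as 1/sqrt(n), and a histogram of `means` at n=100 looks close to a normal curve.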
Hypothesis Testing
Step 1
• Before running any analysis, VISUALIZE!
Mean(x) 9
Mean(y) 7.5
Variance(x) 11
Variance(y) 4.125
Correlation (r) .816
Regression line y = 3 + .5x
Visualize!
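The summary statistics above are those of Anscombe's quartet: four datasets with (nearly) identical means, variances, correlation, and regression line that look completely different when plotted. As a sketch, this reproduces the slide's numbers from Anscombe's dataset I using numpy:

```python
import numpy as np

# Anscombe's quartet, dataset I
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
              7.24, 4.26, 10.84, 4.82, 5.68])

print(round(x.mean(), 2))                  # mean(x) = 9.0
print(round(y.mean(), 2))                  # mean(y) ≈ 7.5
print(round(x.var(ddof=1), 2))             # variance(x) = 11.0
print(round(np.corrcoef(x, y)[0, 1], 3))   # r ≈ 0.816
slope, intercept = np.polyfit(x, y, 1)     # regression line y ≈ 3 + 0.5x
print(round(intercept, 1), round(slope, 1))
```

The other three quartet datasets (a curve, an outlier-driven line, a vertical cluster) yield the same statistics, which is exactly why you visualize first.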
Anatomy of a statistical test
• If your change had no effect, what would the world look like?
• This is known as the null hypothesis (e.g., no difference in means; no slope in the relationship)
Anatomy of a statistical test
• Given the difference you observed, how likely is it to have occurred by chance?
• If that likelihood is sufficiently small, we reject the null hypothesis
Probability of seeing a mean difference at least this large, by chance: 0.012
Probability of seeing a slope at least this large, by chance: 0.012
Errors

                       Difference exists?
Difference detected?   Yes               No
Yes                    True positive     False positive
No                     False negative    True negative
p-value
• The probability of seeing a difference at least as large as the one observed, by chance alone
• In other words, P(Type I error)
• Typically accepted levels: 0.05, 0.01, 0.001
Comparing two populations:
counts
Count or occurrence data
• “Five people completed the trial with the control interface, and twenty-two completed it with the augmented interface.”

          Control   Augmented
Success   5         22
Failure   35        18
Pearson’s Chi-Squared Test
See: http://yatani.jp/HCIstats/ChiSquare
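The linked page shows this test in R; as a sketch, the same contingency table can be tested in Python with `scipy.stats.chi2_contingency` (which applies Yates' continuity correction by default for 2x2 tables):

```python
from scipy.stats import chi2_contingency

# The 2x2 table from the slide: success/failure counts per interface
table = [[5, 22],    # Success: control, augmented
         [35, 18]]   # Failure: control, augmented

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.5f}")
```

A 2x2 table has (2−1)(2−1) = 1 degree of freedom, and here the completion rates differ far more than chance would predict (p < .001).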
Degrees of freedom??
• Sparing the details, DoF can often be considered the number of independent scores (observations) going into a statistic, minus the number of parameters we estimated (typically 1)
• Higher degrees of freedom mean we had more observations relative to the estimate (less uncertainty!)
• What if two tests explain the same variance, but one has higher DoF?
Comparing two populations:
means
Normally distributed data
[Figure: a normal curve annotated with its mean and standard deviation]
t-test: do the means differ?
• Distributions far apart: likely have different means (reject the null hypothesis)
• Distributions overlapping: likely have the same mean (fail to reject the null hypothesis)
Numbers that matter:
• Difference in means: larger means more significant
• Variance in each group: larger means less significant
• Number of samples: larger means more significant
How many degrees of freedom?
• If we know the mean of N numbers, then only N−1 of those numbers can vary, while the last will always be constrained: (observations – estimations)
• We estimate two means, so a t-test has N−2 degrees of freedom.
Reporting the result
• “Experts rated the designs of those in the augmented condition (μ = 3.4, SD = 0.4) significantly higher than the designs of those in the control condition (μ = 2.0, SD = 0.5), according to an independent-samples t-test (t(18) = −2.2, p < .05).”
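As a sketch of how such a result is computed, here is an independent-samples t-test in Python via `scipy.stats.ttest_ind`. The ratings are invented for illustration (ten per condition, so 18 degrees of freedom as in the report above; they are not the study's actual data):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical expert ratings, invented for illustration
control   = np.array([1.5, 2.0, 2.2, 1.8, 2.5, 1.9, 2.1, 1.6, 2.4, 2.0])
augmented = np.array([3.1, 3.6, 3.0, 3.8, 3.3, 3.5, 2.9, 3.7, 3.4, 3.7])

t, p = ttest_ind(control, augmented)
# DoF = N - 2 (two estimated means)
print(f"t({len(control) + len(augmented) - 2}) = {t:.2f}, p = {p:.4f}")
```

Note the sign of t depends only on the order the groups are passed in; here control − augmented is negative because the augmented mean is higher.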
Within-subjects study designs
• It can be easier to statistically detect a difference if the participants try both alternatives. Why?
• What are the potential issues?
[Figure: between-subjects (each participant tries only Condition 1 or Condition 2) vs. within-subjects (each participant tries both conditions)]
Paired-samples t-test
• If we consider each data point to be independent, then we find no significance (p = .491), because the between-group variance is small relative to the within-group variance
• A paired-samples t-test accounts for differences between individuals, revealing the effect of condition on each (p < .001)
Running a paired t-test in R
See: http://yatani.jp/teaching/doku.php?id=hcistats:ttest#a_paired_t_test
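The linked page shows the R version; the Python analogue uses `scipy.stats.ttest_rel`. This sketch (with invented within-subjects scores) reproduces the contrast on the previous slide: participants vary a lot from each other, but each one improves by about the same amount.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Hypothetical scores: each participant completes both conditions,
# and every participant improves by roughly 2 points.
cond1 = np.array([12.0, 18.0, 9.0, 25.0, 14.0, 30.0, 11.0, 21.0])
cond2 = cond1 + np.array([2.1, 1.8, 2.4, 1.9, 2.2, 1.7, 2.3, 2.0])

# Treating the points as independent hides the effect...
t_ind, p_ind = ttest_ind(cond1, cond2)
# ...while pairing by participant reveals it.
t_rel, p_rel = ttest_rel(cond1, cond2)
print(f"independent: p = {p_ind:.3f}   paired: p = {p_rel:.6f}")
```

The paired test works on the per-participant differences, whose variance is tiny even though the raw scores span a wide range.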
ANOVA
t-test: compare two means
• “Do people fix more bugs with our IDE bug suggestion callouts?”
ANOVA: compare N means
• “Do people fix more bugs with our IDE bug suggestion callouts, with warnings, or with nothing?”
total deviation from grand mean = deviation of factor mean from grand mean + deviation of response from factor mean
(response − grand mean) = (factor mean − grand mean) + (response − factor mean)
Rough intuition for ANOVA test
How much of the total variation can be accounted for by looking at the means of each condition?
Reporting an ANOVA
• “A one-way ANOVA revealed a significant difference in the effect of news feed source on number of likes (F(2, 21) = 12.1, p < .001).”
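As a sketch, a one-way ANOVA like this can be run with `scipy.stats.f_oneway`. The like counts are invented for illustration (eight posts per source, giving the F(2, 21) degrees of freedom in the report above):

```python
from scipy.stats import f_oneway

# Hypothetical like counts per feed source, invented for illustration
friend   = [80, 75, 90, 85, 70, 88, 79, 83]
stranger = [40, 35, 50, 45, 30, 48, 39, 43]
michael  = [60, 55, 70, 65, 50, 68, 59, 63]

f, p = f_oneway(friend, stranger, michael)
# DoF: between = k - 1 = 2, within = N - k = 24 - 3 = 21
print(f"F(2, 21) = {f:.1f}, p = {p:.6f}")
```

A significant F only says "at least one mean differs": identifying which pairs differ requires the post-hoc tests discussed next.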
Repeated measures
• Note: when your analysis includes any within-subjects factors, use a repeated measures ANOVA.
Post-hoc tests
Omnibus Prime
• ANOVA is an omnibus test. It compares all levels of all factors.
• When ANOVA is significant, it means “At least one of the means is different.” Which one(s)? By how much?
Pairwise (post-hoc) tests
[Figure: bar chart of mean likes (0–90) for the Friend, Stranger, and Michael feeds]
The problem with many tests
• α = .05 implies a .95 probability of being correct in rejecting the null hypothesis
• If we do m tests, the actual probability of being correct on all of them is now: (1 − α)^m = .95^m
• This is called the family-wise error rate
Bonferroni correction
• Correct for family-wise error by adjusting α to be more conservative
• Divide α by the number of comparisons you make
• 4 tests at α = .05 implies using α = .0125
• Conservative but safe method of compensating for multiple tests
• Note: you lose power when conducting lots of tests – so be judicious and plan comparisons via hypotheses!
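The arithmetic on these two slides can be checked directly. This sketch computes the family-wise error rate for m independent tests at α = .05, and the Bonferroni-corrected per-test threshold:

```python
alpha = 0.05
for m in (1, 4, 10):
    # P(at least one false positive across m independent tests)
    fwer = 1 - (1 - alpha) ** m
    # Bonferroni: run each individual test at alpha / m instead
    bonferroni = alpha / m
    print(f"m={m:2d}  FWER={fwer:.3f}  corrected alpha={bonferroni:.4f}")
```

At m = 4 the family-wise error rate has already grown to about .185, and the corrected threshold is .05 / 4 = .0125, matching the slide.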
Tukey test
• Less conservative than Bonferroni
• Compares all pairs of factor level means
Factorial ANOVA
Crossed study designs
• Suppose you wanted to measure whether a drug works for two types of headaches. You have two factors:
• Treatment vs. placebo
• Migraine vs. tension headache
• This is a 2 x 2 study: each factor has two levels
• A factorial ANOVA can serve as an omnibus in this case
Interaction effects
• The study reveals two main effects:
• Those in the treatment condition tend to have less pain than those in the placebo condition
• Those with tension headaches tend to have less pain than those with migraines
• There is also an interaction effect:
• Those who had tension headaches had a larger effect from the treatment than those with migraines
[Figure: interaction plot of pain (0–10) for treatment vs. placebo, with separate lines for migraine and tension headaches]
Two-factor ANOVA test
Both main effects and interaction significant!
Nonparametric tests
Parametric assumptions
• To use the tests we’ve seen thus far, three criteria must be met:
• Independence – each subject is sampled independently of every other subject, and measures from one subject are independent of measures on any other subject
• Normality – data is normally distributed (technically, error terms are normally distributed)
• Homoskedasticity – the variance is similar across all levels of factors
Parametric assumptions
• Non-parametric tests do not make these assumptions. Use them for cases like:
• Rankings or other ordinal data
• Non-uniform variance
Equivalent nonparametric tests

Parametric                Nonparametric
Unpaired t-test           Mann-Whitney U
Paired t-test             Wilcoxon matched pairs
ANOVA                     Kruskal-Wallis
Repeated measures ANOVA   Friedman test
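As a sketch of the first row of this table, `scipy.stats.mannwhitneyu` compares two independent groups of ordinal data (here, invented 5-point Likert ratings, where a t-test's normality assumption would be dubious):

```python
from scipy.stats import mannwhitneyu

# Hypothetical 5-point Likert ratings, invented for illustration
group_a = [2, 3, 2, 4, 3, 2, 3, 2]
group_b = [4, 5, 4, 3, 5, 4, 5, 4]

# Rank-based test: no normality assumption (two-sided by default
# in recent SciPy versions)
u, p = mannwhitneyu(group_a, group_b)
print(f"U = {u}, p = {p:.4f}")
```

The test works on ranks rather than raw values, so it only assumes the ratings can be ordered.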
Effect size
• Significance tests inform us about the likelihood of a meaningful difference between groups, but they don’t always tell us the magnitude of that difference.
• Because any difference will become “significant” with an arbitrarily large sample, it’s important to quantify the effect size that you observe
• We report either standardized or unstandardized effect sizes. When would you use each?
Effect size
• Some common measures of effect size:
• Unstandardized:
• The raw difference between means
• The raw regression coefficient
• Standardized:
• Pearson’s r (for correlations)
• Cohen’s d (for differences between means)
• η² (the proportion of variance explained by one factor, controlling for all others)
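As a sketch, Cohen's d for two independent groups is the raw mean difference divided by the pooled standard deviation (the data below is invented for illustration):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1)
                  + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# The raw (unstandardized) difference here is 0.5 units;
# d expresses it relative to the spread of the data.
a = [5.0, 5.5, 6.0, 6.5, 7.0]
b = [4.5, 5.0, 5.5, 6.0, 6.5]
print(round(cohens_d(a, b), 2))
```

A common rule of thumb reads d ≈ 0.2 as small, 0.5 as medium, and 0.8 as large, though those thresholds depend on the domain.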
A reference for analyses
• DO NOT blindly apply these methods – know what your analysis is considering, and when in doubt ask for assistance!

Tests of Proportions (e.g. counts)

Samples   Response Categories   Tests
1         2                     One-sample χ² test, binomial test
1         >2                    One-sample χ² test, multinomial test
>1        ≥2                    N-sample χ² test, G-test, Fisher's exact test

Analyses of Variance (comparing means)

Factors   Levels   (B)etween or (W)ithin   Parametric Tests                                                    Nonparametric Tests
1         2        B                       Independent-samples t-test; Welch's t-test (if non-homoscedastic)   Mann-Whitney U test
1         >2       B                       One-way ANOVA                                                       Kruskal-Wallis test
1         2        W                       Paired-samples t-test                                               Wilcoxon signed-rank test
1         >2       W                       One-way repeated measures ANOVA                                     Friedman test
>1        ≥2       B                       Factorial ANOVA; linear models                                      Aligned Rank Transform (ART); generalized linear models (GLM)
>1        ≥2       W                       Factorial repeated measures ANOVA; linear mixed models (LMM)        Aligned Rank Transform (ART); generalized linear mixed models (GLMM)
Summary
• Our goal is to make inferences about population characteristics from a sample
• Plan your study with your methods in mind
• Always visualize data first
• Check parametric assumptions
• Run omnibus, and if called-for, post-hoc tests
• Correct your family-wise error rate as necessary
• Report test statistic, DoF, p-value, and effect size
Resources
• Statistics drop-in office hours: https://statistics.stanford.edu/resources/consulting
• Our office hours
• MOOCs: https://www.coursera.org/learn/designexperiments