Quantitative Research Methods
Hypothesis Testing
Conditional Probabilities
Thomas Salzberger, PD Dr., [email protected]
• Null hypothesis H0: no difference, no relationship in the total population
• could, in principle, also be any specific value
• but in the vast majority of cases no justifiable value other than 0
• Alternative hypothesis HA (H1): difference/relationship in the total population
• E.g., mean comparison, two groups (t-test)
• H0: µ1 = µ2 or µ1 - µ2 = 0
• HA: µ1 ≠ µ2 or µ1 - µ2 ≠ 0
• Special case: one-tailed hypothesis testing
• H0: µ1 ≤ µ2 or µ1 - µ2 ≤ 0
• HA: µ1 > µ2 or µ1 - µ2 > 0
Hypothesis Testing
[Figure: distribution of t under H0; empirical t = 2.048 (example). Two groups (male, female), measurement of satisfaction.]
Decision: Null or Alternative Hypothesis (two-tailed; α = 5%)
[Figure: t distribution under H0 with critical t values (e.g. −1.97 and +1.97 at α = 0.05); H0 is retained between them, HA is accepted outside. If H0 is true, accepting HA is a wrong decision (type-one error, α).]
Decision: Null or Alternative Hypothesis (one-tailed; α = 5%)
[Figure: t distribution under H0 with a single critical t value (e.g. +1.65); H0 is retained below it, HA is accepted above it.]
• Scenario: HA is true
Decision: Null or Alternative Hypothesis (two-tailed; α = 1%)
[Figure: distribution of t under HA* with critical t values (e.g. −2.60 and +2.60 at α = 0.01); the area below the critical value, where H0 is retained although HA is true, is a wrong decision (type-two error, β = 0.73).]
* For a specific difference (here half a scale point); the expected t value of 1.78 is the difference of 0.5 divided by the standard error of the difference (0.281 in this case).
• The decision to reject H0 and accept HA depends on:
• ∆: true mean difference (or true strength of relationship) – the bigger, the more likely
• σ: standard deviation (incl. measurement error) – the smaller, the more likely
• n: sample size (n per group · 2) – the bigger, the more likely
• α: type-one error rate – the bigger, the more likely
• →/↔: one-tailed hypothesis (versus two-tailed) – one-tailed, the more likely
• β: type-two error rate (β) depends on ∆, σ, α and n (and →/↔)
• Assume ∆, σ, α and β (and →/↔) → optimal sample size n
• Practically meaningful difference? Relevance versus significance. Effect size! (Cohen’s d)
Hypothesis Testing
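The dependence of the optimal sample size on ∆, σ, α and β can be sketched with the usual normal-approximation formula for a two-sample mean comparison (a sketch, not the exact t-based calculation; the function name is illustrative):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, beta=0.20, two_tailed=True):
    """Approximate sample size per group for a two-sample mean comparison."""
    d = delta / sigma                       # effect size (Cohen's d)
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - beta)            # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Medium effect (d = 0.5), alpha = 5% two-tailed, power = 80%:
print(n_per_group(delta=0.5, sigma=1.0))    # 63 per group
```

Note how the formula reflects the bullets above: a bigger ∆ or a smaller σ (i.e. a bigger d) shrinks n, while a stricter α or β inflates it.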
Hypothesis Testing

Decision        “Reality”: H0 is true         “Reality”: HA is true
Retaining H0    Correct decision, P = 1−α     Type-2 error, P = β
Accepting HA    Type-1 error, P = α           Correct decision, P = 1−β

Conditional probabilities add up to 1 column by column:
P = (1−α) + α = 100%; P = (1−β) + β = 100%
• p-value proposed by Sir Ronald Fisher as a measure of congruence of data with hypothesis of no difference/no relationship
• Originally, no decision based on some (arbitrary) threshold (implied by type 1 error rate) intended
• A p-value of, say, 0.05 (or smaller) means the data are relatively unlikely under the null hypothesis
• But does this mean that the null hypothesis is wrong and the alternative hypothesis is true?
• What is the probability of the alternative hypothesis being true?
• Arguably, there is no real difference between a p-value of 0.04 and 0.06.
Worth noting ...
Sir Ronald Aylmer Fisher (1890, London – 1962, Adelaide)
How Likely is the Hypothesis True?
• COVID-19 Antibody testing
• Sensitivity: true positive rate (how many actually positive cases are tested positive)
• Specificity: true negative rate (how many truly negative cases are tested negative)
• Say, a test has a sensitivity of 90% and a specificity of 95%.
• So, if you are tested positive, how likely are you truly positive (have antibodies)?
• If you are tested negative, how likely are you negative (have no antibodies)?
• The quick answer is perhaps 90% and 95%, respectively.
• In fact, it depends.
• Say, we have a town with 2000 people (for illustration only; the exact number does not really matter).
• Assume 5% have had the disease, i.e. they have antibodies (or have the disease, if we use a test for the actual virus): a prevalence of 5%.
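The posterior probabilities asked for here (the positive and negative predictive values) follow from Bayes’ theorem; a minimal sketch (the function name is illustrative):

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value via Bayes' theorem."""
    tp = sensitivity * prevalence                  # truly positive, test positive
    fp = (1 - specificity) * (1 - prevalence)      # truly negative, test positive
    fn = (1 - sensitivity) * prevalence            # truly positive, test negative
    tn = specificity * (1 - prevalence)            # truly negative, test negative
    ppv = tp / (tp + fp)   # P(truly positive | positive test)
    npv = tn / (tn + fn)   # P(truly negative | negative test)
    return ppv, npv

ppv, npv = predictive_values(0.90, 0.95, 0.05)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")   # PPV = 49%, NPV = 99%
```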
Epidemiology
• 2000 people
• Prevalence: 5%: 100 people
• Sensitivity of 90% and specificity of 95%
• That is, 5% have had the disease, i.e. they have antibodies (or have the disease, if we use a test for the actual virus).
Epidemiology

                       Truly positive   Truly negative   Total
Positive test result         90               95
Negative test result         10             1805
Total                       100             1900         2000

Sensitivity: 90/100 = 90%; Specificity: 1805/1900 = 95%; Prevalence: 5%
• If you are tested positive, how likely are you truly positive (have antibodies)?
• If you are tested negative, how likely are you negative (have no antibodies)?
Epidemiology

                       Truly positive   Truly negative   Total   Positive/negative predictive value
Positive test result         90               95          185    PPV: 90/185 = 49% (higher specificity needed)
Negative test result         10             1805         1815    NPV: 1805/1815 = 99%
Total                       100             1900         2000

Sensitivity: 90/100 = 90%; Specificity: 1805/1900 = 95%; Prevalence: 5%
Rule out when prevalence is low and specificity is high.
• What if a lot of people have antibodies, say 50%?
Epidemiology

                       Truly positive   Truly negative   Total   Positive/negative predictive value
Positive test result        900               50          950    PPV: 900/950 = 95%
Negative test result        100              950         1050    NPV: 950/1050 = 90%
Total                      1000             1000         2000

Sensitivity: 900/1000 = 90%; Specificity: 950/1000 = 95%; Prevalence: 50%
Rule in when prevalence is high and sensitivity is high.
• What if most people have antibodies, say 80%?
Epidemiology

                       Truly positive   Truly negative   Total   Positive/negative predictive value
Positive test result       1440               20         1460    PPV: 1440/1460 = 99%
Negative test result        160              380          540    NPV: 380/540 = 70%
Total                      1600              400         2000

Sensitivity: 1440/1600 = 90%; Specificity: 380/400 = 95%; Prevalence: 80%
The probability of antibodies given a positive test
and the probability of no antibodies given a negative test
depend on
• the sensitivity of the test
• the specificity of the test
• and, most importantly, on the prevalence!
• A low prevalence means we need very high specificity.
• A high prevalence means we need very high sensitivity (to decrease false negatives).
• The same is true for hypothesis testing.
Conclusions
Null hypothesis = truly negative
Alternative hypothesis = truly positive
non significant = negative test result
significant = positive test result
The probability of an alternative hypothesis being true given a significant test result
depends on
• the type-one error rate
• the type-two error rate
• and, most importantly, on the a priori probability of the hypothesis!
• Note that here the notion of probability is not frequentist. A theory (or its truth) is not a random variable that is sometimes true, sometimes not.
• Rather, it is a subjective probability: what we believe based on previous knowledge.
• Bayes?
Conclusions for hypothesis testing
• P(A | B) = P(HA is true | Accepting HA) – this is what we want to know
• P(B | A) = P(Accepting HA | HA is true) – that is 1−β (power)
• P(A) – a priori probability, subjective (~ prevalence; before the test)
• P(B) = P(Accepting HA) – can be computed based on P(A):
  P(B) = P(B | A) · P(A) + P(B | ¬A) · P(¬A)
       = (1−β) · P(A) + α · (1 − P(A)) – deciding for HA for the right and for the wrong reason
Bayes’ Theorem (Thomas Bayes, 1701–1761)

P(hypothesis is true (A) | decision (B))
  = P(decision given the hypothesis is true) · P(hypothesis is true)
    / P(decision in favour of the hypothesis, regardless of its being true or false)
  = P(hypothesis is true and we decide “true”) / P(we decide “true”)
• P(A | B) = P(HA is true | Accepting HA) = ?
• P(B | A) = P(Accepting HA | HA is true) = 1−β = 0.80
• P(A) = 0.10
• P(B) = P(Accepting HA) = (1−β) · P(A) + α · (1 − P(A))
       = 0.80 · 0.10 + 0.05 · 0.90 = 0.08 + 0.045 = 0.125
• P(A | B) = P(HA is true | Accepting HA) = 0.80 · 0.10 / 0.125 = 0.64
Bayes’ Theorem: if the a priori probability is 0.10 (~ low prevalence)

α = 5%, β = 20%, 1−β = 80%
“Reality”, conditional on the truth (α = 5%, β = 20%):

Decision       Theory wrong [H0]   Theory correct [HA]
Retaining H0         95%                  20%
Accepting HA          5%                  80%

With an a priori P(HA) = 10%, the joint probabilities become:

Decision       Theory wrong [H0]   Theory correct [HA]     Σ
Retaining H0        85.5%                  2%            87.5%
Accepting HA         4.5%                  8%            12.5%
Σ                     90%                 10%             100%

Bayes’ Theorem: P(HA is true | Accepting HA) = 8 / 12.5 = 0.64
(P(B | A) · P(A) = P(Accepting HA | HA is true) · P(A) = 0.80 · 0.10; P(A) = 0.10; P(B) = P(Accepting HA) = 0.125)
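The same posterior can be computed directly; a small sketch (the function name is illustrative):

```python
def p_ha_given_significant(prior, alpha=0.05, beta=0.20):
    """P(HA true | test significant) via Bayes' theorem.

    P(significant) = power * prior + alpha * (1 - prior):
    deciding for HA for the right reason and for the wrong reason.
    """
    power = 1 - beta
    p_significant = power * prior + alpha * (1 - prior)
    return power * prior / p_significant

print(p_ha_given_significant(prior=0.10))   # ≈ 0.64
```

Raising the power to 90% at the same prior only lifts the posterior to about 67%, which is the point made later about daring hypotheses.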
The p-value is just one part of the decision making.
It determines whether we retain the null hypothesis or reject it.
It is obvious that error rates are important.
The quality of the test is also important. For example, the t-test has the highest power possible if its requirements are met.
That’s why we should use the best possible tests.
(Note: if the prerequisites are not met, this no longer applies. Fancy statistics with poor data! Structural equation modelling???)
It is (perhaps) less obvious that the a priori probability also matters.
With very unlikely theories, specificity should be high (very small type-one error alpha).
With very likely theories, sensitivity should be high (power, low type-two error beta).
(Consequences of wrong decisions also matter, of course.)
Conclusions for hypothesis testing
p-Value:
– Associated with the assumption that the H0 is true
– Probability that the result (relationship, difference, etc.) is due to chance alone
– Idea: we need to do (much) better than chance ...
A small p-value implies that the data are very unlikely given that the H0 is true.
– By implication, the smaller the p-value, the less likely the H0 is.
The p value
• Personal view: there is nothing wrong with hypothesis testing, but there is a lot wrong when it comes to (explicitly or implicitly) interpreting the results of hypothesis testing.
– Some problems should be addressed early (theory, design, data).
– The most sophisticated statistics are hardly conclusive if the data are problematic (e.g., too small a dataset, poor sampling).
– Also, the idea that a theory is either true or false is a simplification.
– Some parts of the theory might be true, others might be wrong.
Pitfalls of Hypothesis Testing
• Problem I: A very undetermined HA ...
– The H0 is an extreme case. Rarely is there no difference at all or no relationship at all. The HA comprises all other cases!
– So, is the HA almost always true? Maybe.
– But is it interesting? Maybe not.
– What you can do: think about a more concrete HA.
• Direction (one-tailed testing)
• Specific HA: an amount that is meaningful and relevant (in the context of the use of the result)
• Type-two error beta (the chance of retaining the H0 even though it is wrong) can be defined a priori for the relevant amount (difference, relationship)
• What you can do: the optimal sample size can be determined
Problems with Hypothesis Testing
• Problem II: A very common misinterpretation
– Hypothesis testing invites a black-or-white decision: H0 or HA, suggesting that either the H0 or the HA is true with a very high probability (e.g. 1−alpha, 1−p or 1−beta)
– Refrain from this tempting but false interpretation. See also problem VII.
– Think of extreme cases:
– A: you check 100 alternative hypotheses that are not true (think of nonsense hypotheses that no one would believe could be true).
• Type-1 error alpha is the false alarm rate (significant result given the H0 is true).
• For alpha=0.05, approximately 5% of your tests will show a p-value ≤ 0.05.
• What is the probability that any of these hypotheses with a significant result is actually true? 95%?
• Rather close to 0%. Actually, we know that none is correct.
• But maybe one of the hypotheses is not such a nonsense. The data analysis could very well point at something we just did not expect.
• What you can do: Cross-validate! If it is significant again, you could be on to something! But beware. The chance that the same thing can be significant at alpha=0.05 twice even though H0 is true is only 0.25%.
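Scenario A can be checked by simulation: draw both groups from the same population many times and count how often the t-test comes out significant (a sketch; the number of simulations and the group sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n_per_group = 10_000, 30
false_pos = 0
for _ in range(n_sim):
    # H0 is true by construction: both groups come from the same distribution
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    if stats.ttest_ind(a, b).pvalue <= 0.05:
        false_pos += 1
print(false_pos / n_sim)   # close to 0.05, the false-alarm rate alpha
```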
Problems with Hypothesis Testing
• Problem II: A very common misinterpretation (cont.)
– B: you check 100 alternative hypotheses that are almost certainly true, but you have a small sample and therefore power is small.
• Type-2 error beta is rather high, say 0.60 for a meaningful difference.
• Since all alternative hypotheses are true, type-1 error alpha plays no role.
• 60 out of the 100 tests will be (wrongly) insignificant. Thus, you miss 60% of the correct alternative hypotheses!
• What is the probability that in these cases the H0 is true? 40%?
• Rather close to 0%.
• But maybe one of the hypotheses we thought were true is actually not true.
• Could be. But which of the 60?
• What you can do: Run another study. One with more power.
Problems with Hypothesis Testing
• Problem III: Not really thinking (theorizing) about it before
– In practice, neither scenario (problem II A or B) is realistic.
– If we knew what’s true and what’s wrong, we would not need to conduct statistical tests.
– There will be a mix of true and wrong hypotheses (HA).
– Avoid mass testing of hypotheses you have not really thought about.
• A lot of them will be wrong, resulting in a lot of false-positive findings.
• Theorize!
• (If you do run a lot of tests without theory, treat them as exploratory and present the results accordingly. Never interpret them as theory-testing/proving. There was no theory, was there?)
Problems with Hypothesis Testing
• Problem IV: Still not enough theorizing
– Okay, you have a theory.
– If it is obvious, interest will be limited.
– If it is daring, it is more interesting (but also more likely to be false).
– What can be done: Develop and argue the theory in a way that the theory appears at first surprising and interesting, but the empirical result should then be quite expected after grasping the theory (in other words, the subjective probability that the theory is right, is high).
– (Easier said than done. But at least, you should not be surprised when seeing a significant result and where you see it.)
– If the theory remains daring, assume a rather low probability (maybe 10%) and compute the probability that HA is actually true given the significant result using Bayes’ theorem.
Problems with Hypothesis Testing
• Problem V: Wrong sampling
– The hypothesis might only be true for a specific population.
– What can be done: Think carefully of the population of interest.
– Do the sampling accordingly (external validity).
Problems with Hypothesis Testing
• Problem VI: Too much noise
– Random error increases the variation within the sample.
– Variation within the sample increases the standard error.
– The standard error impacts the power (1 − type-two error).
– What can be done: Ensure measures are reliable (sufficiently precise).
– Ensure measures are valid (avoiding systematic error).
– Choose the data collection method accordingly.

SE(x̄) = s / √n
t_emp = (x̄1 − x̄2) / SE(x̄1 − x̄2)
SE(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)
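The standard-error and t formulas above can be checked numerically; a sketch comparing the hand-computed t (with the unpooled standard error, i.e. Welch’s version) against scipy (the data are made up for illustration):

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.normal(5.0, 1.0, 40)   # e.g. satisfaction scores, group 1
x2 = rng.normal(4.6, 1.2, 35)   # group 2

# SE of the difference: sqrt(s1^2/n1 + s2^2/n2)
se_diff = math.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_manual = (x1.mean() - x2.mean()) / se_diff

t_scipy = stats.ttest_ind(x1, x2, equal_var=False).statistic
print(abs(t_manual - t_scipy) < 1e-9)   # the two agree
```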
Problems with Hypothesis Testing
• Problem VII: It’s all black or white, no shades of grey*
– Under no circumstances does a significance test allow a judgement with certainty.
– A significant result implies that HA is more likely than before (but it could still be wrong).
– A non-significant result implies that H0 is more likely than before (but it could still be wrong).
– A yes-versus-no decision is a pretty crude way of summarising a probability – especially when based on the p-value rather than the actual [but subjective] probability of the hypothesis.
– What can be done: What about having, at least, three outcomes? Insufficient evidence (H0), weak evidence (maybe HA), and strong evidence (very likely HA)?
* Remember that we said science is also about admitting that one does not know
Problems with Hypothesis Testing
Rather daring hypotheses (a priori probability of 10%)
• With α = 0.05 and β = 0.20, only 80 out of 125 significant HA are true, i.e. the probability that a significant HA is true is 64% (per 1000 tests: 80 correct positives, 45 false positives).
• Reducing false negatives by 50% (power of 90% rather than 80%) increases the 80 to 90 (at the cost of a larger sample size).
– Still only 90/135 = 67% of all significant HA are true.
• False negatives can possibly also be reduced by reducing the noise.
• Should we look at the exact p-value?
– Remember, its inventor, Sir Ronald Fisher, suggested p as a measure of how likely the result is based on chance alone.
α = 0.05, β = 0.20
[Figure: densities of the test statistic under H0 (peak at 0) and under HA (peak at 2.5), with the one-tailed threshold for α = 0.05 marked]
α = 0.05, β = 0.20, H0 true: [Figure: the area below the threshold under H0 = true negatives]
α = 0.05, β = 0.20, HA true: [Figure: the area above the threshold under HA = true positives]
α = 0.05, β = 0.20, H0 true: [Figure: the area above the threshold under H0 = false positives]
α = 0.05, β = 0.20, HA true: [Figure: the area below the threshold under HA = false negatives]
α = 0.05, β = 0.20, “significant”
[Figure: the significant region under H0 (false positives) and under HA (true positives)]
• False positives: the expected value of the test statistic (peak of the density function) is at 0 (no difference). 80% of these p-values will lie between 0.05 and 0.01 (four fifths of all p-values between 0 and 0.05).
• True positives: the expected value of the test statistic is at 2.5. 57% of these p-values will be smaller than 0.01 (23% will be between 0.05 and 0.01; 20% will be bigger than 0.05 = type-2 error).
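These percentages can be reproduced with a one-tailed normal approximation, shifting the critical value by the expected test statistic of 2.5 (a sketch):

```python
from statistics import NormalDist

z = NormalDist()
delta = 2.5                      # expected value of the test statistic under HA

# P(p <= 0.05) and P(p <= 0.01) under HA: shift the critical value by delta
p_sig_05 = z.cdf(delta - z.inv_cdf(0.95))   # power at alpha = 0.05
p_sig_01 = z.cdf(delta - z.inv_cdf(0.99))   # power at alpha = 0.01

print(round(p_sig_01, 2))             # 0.57 -> p < 0.01
print(round(p_sig_05 - p_sig_01, 2))  # 0.23 -> 0.01 < p <= 0.05
print(round(1 - p_sig_05, 2))         # 0.20 -> type-2 error
```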
α = 0.05, β = 0.20
Decision: p > 0.05 → H0 and p ≤ 0.05 → HA

Decision   Reality        Cases out of 1000 (100 HA true)   Probability (subjective, Bayes)
HA         HA true            80 (80% of 100)               HA true: 80/125 = 64%
HA         HA not true        45 (5% of 900)
H0         H0 true           855 (95% of 900)               H0 true: 855/875 = 97.7%*
H0         H0 not true        20 (20% of 100)

* Looks impressive, but keep in mind the probability of the H0 being correct was already 90% before we tested.
α = 0.05, β = 0.20
Decision: p > 0.05 → H0, p ≤ 0.01 → HA, 0.01 < p ≤ 0.05 → “maybe”

Decision     Reality        Cases out of 1000 (100 HA true)   Probability (subjective, Bayes)
HA           HA true            57 (57% of 100)               HA true: 57/66 = 86.4%
HA           HA not true         9 (1% of 900)
“maybe”      HA true            23 (23% of 100)               HA true: 23/59 = 39%
“maybe”      HA not true        36 (4% of 900)
H0           H0 true           855 (95% of 900)               H0 true: 855/875 = 97.7%
H0           H0 not true        20 (20% of 100)
It depends ...
• Clear answers are preferred.
– Are we willing to accept a “maybe” as the outcome?
• Significant results are more exciting.
– Researchers “hack” the data to achieve significance at 0.05 (p-hacking, selective reporting).
• A “maybe” (0.01–0.05) is less exciting than a “very likely” (≤ 0.01).
– Researchers might just hack at 0.01 ...
Problem solved?
Alternative
• Report the subjective probability of the HA (H0) being true, at a particular level, or at various levels, of the a priori probability of the HA (H0) being true before the study was undertaken.
Problem solved?
And yet another problem (still at α = 0.05, β = 0.20)
In some cases, we are interested in retaining H0 (think of all the tests of fit, e.g. in structural equation modelling).

Decision       Reality        Cases out of 1000 (200 H0 true)   Probability (subjective, Bayes)
H0 (FIT)       H0 true           190 (95% of 200)               H0 true: 190/350 = 54.3%
H0 (FIT)       H0 not true       160 (20% of 800)
HA (MISFIT)    HA true           640 (80% of 800)               HA true: 640/650 = 98.5%
HA (MISFIT)    HA not true        10 (5% of 200)

p-hacking (e.g. changing the model until it fits)
And yet another problem (at α = 0.001, β = 0.05)
In practice, no one assesses model fit at α = 0.05, but rather at α = 0.001 (or even lower) – with high power we get:

Decision       Reality        Cases out of 1000 (200 H0 true)   Probability (subjective, Bayes)
H0 (FIT)       H0 true           200 (99.9% of 200)             H0 true: 200/240 = 83.3%
H0 (FIT)       H0 not true        40 (5% of 800)
HA (MISFIT)    HA true           760 (95% of 800)               HA true: 760/760 = ~100%
HA (MISFIT)    HA not true         0 (0.1% of 200)
And yet another problem (at α = 0.001, β = 0.73*)
But with alpha = 0.001, power will be much smaller (also think of sample size, which is often relatively small).
* Corresponds to α = 0.05, β = 0.20 in a simple t-test, when changing α to 0.001 and keeping all other things equal.

Decision       Reality        Cases out of 1000 (200 H0 true)   Probability (subjective, Bayes)
H0 (FIT)       H0 true           200 (99.9% of 200)             H0 true: 200/784 = 25.5%
H0 (FIT)       H0 not true       584 (73% of 800)
HA (MISFIT)    HA true           216 (27% of 800)               HA true: 216/216 = ~100%
HA (MISFIT)    HA not true         0 (0.1% of 200)
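The footnoted β = 0.73 can be reproduced with a normal approximation: keep the noncentrality that gives 80% power at one-tailed α = 0.05, then tighten α to 0.001 with everything else unchanged (a sketch):

```python
from statistics import NormalDist

z = NormalDist()

# Noncentrality implied by power = 0.80 at one-tailed alpha = 0.05
delta = z.inv_cdf(0.95) + z.inv_cdf(0.80)

# Same effect and sample size, but alpha tightened to 0.001 (one-tailed)
power = z.cdf(delta - z.inv_cdf(0.999))
beta = 1 - power
print(round(beta, 2))   # 0.73
```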