Quantitative Research Methods
Hypothesis Testing
Conditional Probabilities
Thomas Salzberger, PD Dr., [email protected]
• Null hypothesis H0: no difference, no relationship in the total population
• could, in principle, also be any specific value
• but in the vast majority of cases no justifiable value other than 0
• Alternative hypothesis HA (H1): difference/relationship in the total population
• E.g., mean comparison, two groups (t-test)
• H0: µ1 = µ2 or µ1 - µ2 = 0
• HA: µ1 ≠ µ2 or µ1 - µ2 ≠ 0
• Special case: one-tailed hypothesis testing
• H0: µ1 ≤ µ2 or µ1 - µ2 ≤ 0
• HA: µ1 > µ2 or µ1 - µ2 > 0
Hypothesis Testing
[Figure: distribution of t under H0; empirical t = 2.048 (example). Two groups (male, female), measurement of satisfaction.]
Decision: Null or Alternative Hypothesis (two-tailed; α = 5%)
[Figure: t distribution under H0 with critical t values (e.g. −1.97 and +1.97 at α = 0.05); H0 is retained between them, HA is accepted outside. If H0 is true, accepting HA is a wrong decision (type-one error, α).]
Decision: Null or Alternative Hypothesis (one-tailed; α = 5%)
[Figure: t distribution under H0 with a single critical t value (e.g. +1.65); H0 is retained below it, HA is accepted above it.]
• Scenario: HA is true
Decision: Null or Alternative Hypothesis (two-tailed; α = 1%)
[Figure: distribution of t under HA* with critical t values (e.g. −2.60 and +2.60 at α = 0.01); the area below the critical value, where H0 is retained although HA is true, is a wrong decision (type-two error, β = 0.73).]
* For a specific difference (here half a scale point); the expected t value of 1.78 is the difference of 0.5 divided by the standard error of the difference (0.281 in this case).
• The decision to reject H0 and accept HA depends on:
• ∆: true mean difference (or true strength of relationship) – the bigger, the more likely
• σ: standard deviation (incl. measurement error) – the smaller, the more likely
• n: sample size (n per group · 2) – the bigger, the more likely
• α: type-one error rate – the bigger, the more likely
• →/↔: one-tailed hypothesis (versus two-tailed) – one-tailed, the more likely
• β: type-two error rate (β) depends on ∆, σ, α and n (and →/↔)
• Assume ∆, σ, α and β (and →/↔) → optimal sample size n
• Practically meaningful difference? Relevance versus significance. Effect size! (Cohen’s d)
Hypothesis Testing
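The dependence of the optimal sample size on ∆, σ, α and β can be sketched with the usual normal-approximation formula for a two-sample mean comparison (a sketch, not the exact t-based calculation; the function name is illustrative):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, beta=0.20, two_tailed=True):
    """Approximate sample size per group for a two-sample mean comparison."""
    d = delta / sigma                       # effect size (Cohen's d)
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - beta)            # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Medium effect (d = 0.5), alpha = 5% two-tailed, power = 80%:
print(n_per_group(delta=0.5, sigma=1.0))    # 63 per group
```

Note how the formula reflects the bullets above: a bigger ∆ or a smaller σ (i.e. a bigger d) shrinks n, while a stricter α or β inflates it.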
Hypothesis Testing

Decision        “Reality”: H0 is true         “Reality”: HA is true
Retaining H0    Correct decision, P = 1−α     Type-2 error, P = β
Accepting HA    Type-1 error, P = α           Correct decision, P = 1−β

Conditional probabilities add up to 1 column by column:
P = (1−α) + α = 100%; P = (1−β) + β = 100%
• p-value proposed by Sir Ronald Fisher as a measure of congruence of data with hypothesis of no difference/no relationship
• Originally, no decision based on some (arbitrary) threshold (implied by type 1 error rate) intended
• A p-value of, say, 0.05 (or smaller) means the data are relatively unlikely under the null hypothesis
• But does this mean that the null hypothesis is wrong and the alternative hypothesis is true?
• What is the probability of the alternative hypothesis being true?
• Arguably, there is no real difference between a p-value of 0.04 and 0.06.
Worth noting ...
Sir Ronald Aylmer Fisher (1890, London – 1962, Adelaide)
How Likely is the Hypothesis True?
• COVID-19 Antibody testing
• Sensitivity: true positive rate (how many actually positive cases are tested positive)
• Specificity: true negative rate (how many truly negative cases are tested negative)
• Say, a test has a sensitivity of 90% and a specificity of 95%.
• So, if you are tested positive, how likely are you truly positive (have antibodies)?
• If you are tested negative, how likely are you negative (have no antibodies)?
• The quick answer is perhaps 90% and 95%, respectively.
• In fact, it depends.
• Say, we have a town with 2000 people (for illustration only; the exact number does not really matter).
• Assume 5% have had the disease, i.e. they have antibodies (or have the disease, if we use a test for the actual virus): a prevalence of 5%.
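The posterior probabilities asked for here (the positive and negative predictive values) follow from Bayes’ theorem; a minimal sketch (the function name is illustrative):

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value via Bayes' theorem."""
    tp = sensitivity * prevalence                  # truly positive, test positive
    fp = (1 - specificity) * (1 - prevalence)      # truly negative, test positive
    fn = (1 - sensitivity) * prevalence            # truly positive, test negative
    tn = specificity * (1 - prevalence)            # truly negative, test negative
    ppv = tp / (tp + fp)   # P(truly positive | positive test)
    npv = tn / (tn + fn)   # P(truly negative | negative test)
    return ppv, npv

ppv, npv = predictive_values(0.90, 0.95, 0.05)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")   # PPV = 49%, NPV = 99%
```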
Epidemiology
• 2000 people
• Prevalence: 5%: 100 people
• Sensitivity of 90% and specificity of 95%
• That is, 5% have had the disease, i.e. they have antibodies (or have the disease, if we use a test for the actual virus).
Epidemiology

                       Truly positive   Truly negative   Total
Positive test result         90               95
Negative test result         10             1805
Total                       100             1900         2000

Sensitivity: 90/100 = 90%; Specificity: 1805/1900 = 95%; Prevalence: 5%
• If you are tested positive, how likely are you truly positive (have antibodies)?
• If you are tested negative, how likely are you negative (have no antibodies)?
Epidemiology

                       Truly positive   Truly negative   Total   Positive/negative predictive value
Positive test result         90               95          185    PPV: 90/185 = 49% (higher specificity needed)
Negative test result         10             1805         1815    NPV: 1805/1815 = 99%
Total                       100             1900         2000

Sensitivity: 90/100 = 90%; Specificity: 1805/1900 = 95%; Prevalence: 5%
Rule out when prevalence is low and specificity is high.
• What if a lot of people have antibodies, say 50%?
Epidemiology

                       Truly positive   Truly negative   Total   Positive/negative predictive value
Positive test result        900               50          950    PPV: 900/950 = 95%
Negative test result        100              950         1050    NPV: 950/1050 = 90%
Total                      1000             1000         2000

Sensitivity: 900/1000 = 90%; Specificity: 950/1000 = 95%; Prevalence: 50%
Rule in when prevalence is high and sensitivity is high.
• What if most people have antibodies, say 80%?
Epidemiology

                       Truly positive   Truly negative   Total   Positive/negative predictive value
Positive test result       1440               20         1460    PPV: 1440/1460 = 99%
Negative test result        160              380          540    NPV: 380/540 = 70%
Total                      1600              400         2000

Sensitivity: 1440/1600 = 90%; Specificity: 380/400 = 95%; Prevalence: 80%
The probability of antibodies given a positive test
and the probability of no antibodies given a negative test
depend on
• the sensitivity of the test
• the specificity of the test
• and, most importantly, on the prevalence!
• A low prevalence means we need very high specificity.
• A high prevalence means we need very high sensitivity (to decrease false negatives).
• The same is true for hypothesis testing.
Conclusions
Null hypothesis = truly negative
Alternative hypothesis = truly positive
non significant = negative test result
significant = positive test result
The probability of an alternative hypothesis being true given a significant test result
depends on
• the type-one error rate
• the type-two error rate
• and, most importantly, on the a priori probability of the hypothesis!
• Note that here the notion of probability is not frequentist. A theory (or its truth) is not a random variable that is sometimes true, sometimes not.
• Rather, it is a subjective probability: what we believe based on previous knowledge.
• Bayes?
Conclusions for hypothesis testing
• P(A | B) = P(HA is true | Accepting HA) – this is what we want to know
• P(B | A) = P(Accepting HA | HA is true) – that is 1−β (power)
• P(A) – a priori probability, subjective (~ prevalence; before the test)
• P(B) = P(Accepting HA) – can be computed based on P(A):
  P(B) = P(B | A) · P(A) + P(B | ¬A) · P(¬A)
       = (1−β) · P(A) + α · (1 − P(A)) – deciding for HA for the right and for the wrong reason
Bayes’ Theorem (Thomas Bayes, 1701–1761)

P(hypothesis is true (A) | decision (B))
  = P(decision given the hypothesis is true) · P(hypothesis is true)
    / P(decision in favour of the hypothesis, regardless of its being true or false)
  = P(hypothesis is true and we decide “true”) / P(we decide “true”)
• P(A | B) = P(HA is true | Accepting HA) = ?
• P(B | A) = P(Accepting HA | HA is true) = 1−β = 0.80
• P(A) = 0.10
• P(B) = P(Accepting HA) = (1−β) · P(A) + α · (1 − P(A))
       = 0.80 · 0.10 + 0.05 · 0.90 = 0.08 + 0.045 = 0.125
• P(A | B) = P(HA is true | Accepting HA) = 0.80 · 0.10 / 0.125 = 0.64
Bayes’ Theorem: if the a priori probability is 0.10 (~ low prevalence)

α = 5%, β = 20%, 1−β = 80%
“Reality”, conditional on the truth (α = 5%, β = 20%):

Decision       Theory wrong [H0]   Theory correct [HA]
Retaining H0         95%                  20%
Accepting HA          5%                  80%

With an a priori P(HA) = 10%, the joint probabilities become:

Decision       Theory wrong [H0]   Theory correct [HA]     Σ
Retaining H0        85.5%                  2%            87.5%
Accepting HA         4.5%                  8%            12.5%
Σ                     90%                 10%             100%

Bayes’ Theorem: P(HA is true | Accepting HA) = 8 / 12.5 = 0.64
(P(B | A) · P(A) = P(Accepting HA | HA is true) · P(A) = 0.80 · 0.10; P(A) = 0.10; P(B) = P(Accepting HA) = 0.125)
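The same posterior can be computed directly; a small sketch (the function name is illustrative):

```python
def p_ha_given_significant(prior, alpha=0.05, beta=0.20):
    """P(HA true | test significant) via Bayes' theorem.

    P(significant) = power * prior + alpha * (1 - prior):
    deciding for HA for the right reason and for the wrong reason.
    """
    power = 1 - beta
    p_significant = power * prior + alpha * (1 - prior)
    return power * prior / p_significant

print(p_ha_given_significant(prior=0.10))   # ≈ 0.64
```

Raising the power to 90% at the same prior only lifts the posterior to about 67%, which is the point made later about daring hypotheses.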
The p-value is just one part of the decision making.
It determines whether we retain the null hypothesis or reject it.
It is obvious that error rates are important.
The quality of the test is also important. For example, the t-test has the highest power possible if its requirements are met.
That’s why we should use the best possible tests.
(Note: if the prerequisites are not met, this no longer applies. Fancy statistics with poor data! Structural equation modelling???)
It is (perhaps) less obvious that the a priori probability also matters.
With very unlikely theories, specificity should be high (very small type-one error alpha).
With very likely theories, sensitivity should be high (power, low type-two error beta).
(Consequences of wrong decisions also matter, of course.)
Conclusions for hypothesis testing
p-Value:
– Associated with the assumption that the H0 is true
– Probability that the result (relationship, difference, etc.) is due to chance alone
– Idea: we need to do (much) better than chance ...
A small p-value implies that the data are very unlikely given that the H0 is true.
– By implication, the smaller the p-value, the less likely the H0 is.
The p value
• Personal view: there is nothing wrong with hypothesis testing, but there is a lot wrong when it comes to (explicitly or implicitly) interpreting the results of hypothesis testing.
– Some problems should be addressed early (theory, design, data).
– The most sophisticated statistics are hardly conclusive if the data are problematic (e.g., too small a dataset, poor sampling).
– Also, the idea that a theory is either true or false is a simplification.
– Some parts of the theory might be true, others might be wrong.
Pitfalls of Hypothesis Testing
• Problem I: A very undetermined HA ...
– The H0 is an extreme case. Rarely is there no difference at all or no relationship at all. The HA comprises all other cases!
– So, is the HA almost always true? Maybe.
– But is it interesting? Maybe not.
– What you can do: think about a more concrete HA.
• Direction (one-tailed testing)
• Specific HA: an amount that is meaningful and relevant (in the context of the use of the result)
• Type-two error beta (the chance of retaining the H0 even though it is wrong) can be defined a priori for the relevant amount (difference, relationship)
• What you can do: the optimal sample size can be determined
Problems with Hypothesis Testing
• Problem II: A very common misinterpretation
– Hypothesis testing invites a black-or-white decision: H0 or HA, suggesting that either the H0 or the HA is true with a very high probability (e.g. 1−alpha, 1−p or 1−beta)
– Refrain from this tempting but false interpretation. See also problem VII.
– Think of extreme cases:
– A: you check 100 alternative hypotheses that are not true (think of nonsense hypotheses that no one would believe could be true).
• Type-1 error alpha is the false alarm rate (significant result given the H0 is true).
• For alpha=0.05, approximately 5% of your tests will show a p-value ≤ 0.05.
• What is the probability that any of these hypotheses with a significant result is actually true? 95%?
• Rather close to 0%. Actually, we know that none is correct.
• But maybe one of the hypotheses is not such a nonsense. The data analysis could very well point at something we just did not expect.
• What you can do: Cross-validate! If it is significant again, you could be on to something! But beware. The chance that the same thing can be significant at alpha=0.05 twice even though H0 is true is only 0.25%.
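Scenario A can be checked by simulation: draw both groups from the same population many times and count how often the t-test comes out significant (a sketch; the number of simulations and the group sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n_per_group = 10_000, 30
false_pos = 0
for _ in range(n_sim):
    # H0 is true by construction: both groups come from the same distribution
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    if stats.ttest_ind(a, b).pvalue <= 0.05:
        false_pos += 1
print(false_pos / n_sim)   # close to 0.05, the false-alarm rate alpha
```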
Problems with Hypothesis Testing
• Problem II: A very common misinterpretation (cont.)
– B: you check 100 alternative hypotheses that are almost certainly true, but you have a small sample and therefore power is small.
• Type-2 error beta is rather high, say 0.60 for a meaningful difference.
• Since all alternative hypotheses are true, type-1 error alpha plays no role.
• 60 out of the 100 tests will be (wrongly) insignificant. Thus, you miss 60% of the correct alternative hypotheses!
• What is the probability that in these cases the H0 is true? 40%?
• Rather close to 0%.
• But maybe one of the hypotheses we thought were true is actually not true.
• Could be. But which of the 60?
• What you can do: Run another study. One with more power.
Problems with Hypothesis Testing
• Problem III: Not really thinking (theorizing) about it before
– In practice, neither scenario (problem II A or B) is realistic.
– If we knew what’s true and what’s wrong, we would not need to conduct statistical tests.
– There will be a mix of true and wrong hypotheses (HA).
– Avoid mass testing of hypotheses you have not really thought about.
• A lot of them will be wrong, resulting in a lot of false-positive findings.
• Theorize!
• (If you do run a lot of tests without theory, treat them as exploratory and present the results accordingly. Never interpret them as theory-testing/proving. There was no theory, was there?)
Problems with Hypothesis Testing
• Problem IV: Still not enough theorizing
– Okay, you have a theory.
– If it is obvious, interest will be limited.
– If it is daring, it is more interesting (but also more likely to be false).
– What can be done: Develop and argue the theory in a way that the theory appears at first surprising and interesting, but the empirical result should then be quite expected after grasping the theory (in other words, the subjective probability that the theory is right, is high).
– (Easier said than done. But at least, you should not be surprised when seeing a significant result and where you see it.)
– If the theory remains daring, assume a rather low probability (maybe 10%) and compute the probability that HA is actually true given the significant result using Bayes’ theorem.
Problems with Hypothesis Testing
• Problem V: Wrong sampling
– The hypothesis might only be true for a specific population.
– What can be done: Think carefully of the population of interest.
– Do the sampling accordingly (external validity).
Problems with Hypothesis Testing
• Problem VI: Too much noise
– Random error increases the variation within the sample.
– Variation within the sample increases the standard error.
– The standard error impacts the power (1 − type-two error).
– What can be done: Ensure measures are reliable (sufficiently precise).
– Ensure measures are valid (avoiding systematic error).
– Choose the data collection method accordingly.

SE(x̄) = s / √n
t_emp = (x̄1 − x̄2) / SE(x̄1 − x̄2)
SE(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)
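The standard-error and t formulas above can be checked numerically; a sketch comparing the hand-computed t (with the unpooled standard error, i.e. Welch’s version) against scipy (the data are made up for illustration):

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.normal(5.0, 1.0, 40)   # e.g. satisfaction scores, group 1
x2 = rng.normal(4.6, 1.2, 35)   # group 2

# SE of the difference: sqrt(s1^2/n1 + s2^2/n2)
se_diff = math.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_manual = (x1.mean() - x2.mean()) / se_diff

t_scipy = stats.ttest_ind(x1, x2, equal_var=False).statistic
print(abs(t_manual - t_scipy) < 1e-9)   # the two agree
```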
Problems with Hypothesis Testing
• Problem VII: It’s all black or white, no shades of grey*
– Under no circumstances does a significance test allow a judgement with certainty.
– A significant result implies that HA is more likely than before (but it could still be wrong).
– A non-significant result implies that H0 is more likely than before (but it could still be wrong).
– A yes-versus-no decision is a pretty crude way of summarising a probability – especially when based on the p-value rather than the actual [but subjective] probability of the hypothesis.
– What can be done: What about having, at least, three outcomes? Insufficient evidence (H0), weak evidence (maybe HA), and strong evidence (very likely HA)?
* Remember that we said science is also about admitting that one does not know
Problems with Hypothesis Testing
Rather daring hypotheses (a priori probability of 10%)
• With α = 0.05 and β = 0.20, only 80 out of 125 significant HA are true, i.e. the probability that a significant HA is true is 64% (per 1000 tests: 80 correct positives, 45 false positives).
• Reducing false negatives by 50% (power of 90% rather than 80%) increases the 80 to 90 (at the cost of a larger sample size).
– Still only 90/135 = 67% of all significant HA are true.
• False negatives can possibly also be reduced by reducing the noise.
• Should we look at the exact p-value?
– Remember, its inventor, Sir Ronald Fisher, suggested p as a measure of how likely the result is based on chance alone.
α = 0.05, β = 0.20
[Figure: densities of the test statistic under H0 (peak at 0) and under HA (peak at 2.5), with the one-tailed threshold for α = 0.05 marked]
α = 0.05, β = 0.20, H0 true: [Figure: the area below the threshold under H0 = true negatives]
α = 0.05, β = 0.20, HA true: [Figure: the area above the threshold under HA = true positives]
α = 0.05, β = 0.20, H0 true: [Figure: the area above the threshold under H0 = false positives]
α = 0.05, β = 0.20, HA true: [Figure: the area below the threshold under HA = false negatives]
α = 0.05, β = 0.20, “significant”
[Figure: the significant region under H0 (false positives) and under HA (true positives)]
• False positives: the expected value of the test statistic (peak of the density function) is at 0 (no difference). 80% of these p-values will lie between 0.05 and 0.01 (four fifths of all p-values between 0 and 0.05).
• True positives: the expected value of the test statistic is at 2.5. 57% of these p-values will be smaller than 0.01 (23% will be between 0.05 and 0.01; 20% will be bigger than 0.05 = type-2 error).
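These percentages can be reproduced with a one-tailed normal approximation, shifting the critical value by the expected test statistic of 2.5 (a sketch):

```python
from statistics import NormalDist

z = NormalDist()
delta = 2.5                      # expected value of the test statistic under HA

# P(p <= 0.05) and P(p <= 0.01) under HA: shift the critical value by delta
p_sig_05 = z.cdf(delta - z.inv_cdf(0.95))   # power at alpha = 0.05
p_sig_01 = z.cdf(delta - z.inv_cdf(0.99))   # power at alpha = 0.01

print(round(p_sig_01, 2))             # 0.57 -> p < 0.01
print(round(p_sig_05 - p_sig_01, 2))  # 0.23 -> 0.01 < p <= 0.05
print(round(1 - p_sig_05, 2))         # 0.20 -> type-2 error
```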
α = 0.05, β = 0.20
Decision: p > 0.05 → H0 and p ≤ 0.05 → HA

Decision   Reality        Cases out of 1000 (100 HA true)   Probability (subjective, Bayes)
HA         HA true            80 (80% of 100)               HA true: 80/125 = 64%
HA         HA not true        45 (5% of 900)
H0         H0 true           855 (95% of 900)               H0 true: 855/875 = 97.7%*
H0         H0 not true        20 (20% of 100)

* Looks impressive, but keep in mind the probability of the H0 being correct was already 90% before we tested.
α = 0.05, β = 0.20
Decision: p > 0.05 → H0, p ≤ 0.01 → HA, 0.01 < p ≤ 0.05 → “maybe”

Decision     Reality        Cases out of 1000 (100 HA true)   Probability (subjective, Bayes)
HA           HA true            57 (57% of 100)               HA true: 57/66 = 86.4%
HA           HA not true         9 (1% of 900)
“maybe”      HA true            23 (23% of 100)               HA true: 23/59 = 39%
“maybe”      HA not true        36 (4% of 900)
H0           H0 true           855 (95% of 900)               H0 true: 855/875 = 97.7%
H0           H0 not true        20 (20% of 100)
It depends ...
• Clear answers are preferred.
– Are we willing to accept a “maybe” as the outcome?
• Significant results are more exciting.
– Researchers “hack” the data to achieve significance at 0.05 (p-hacking, selective reporting).
• A “maybe” (0.01–0.05) is less exciting than a “very likely” (≤ 0.01).
– Researchers might just hack at 0.01 ...
Problem solved?
Alternative
• Report the subjective probability of the HA (H0) being true, at a particular level, or at various levels, of the a priori probability of the HA (H0) being true before the study was undertaken.
Problem solved?
And yet another problem (still at α = 0.05, β = 0.20)
In some cases, we are interested in retaining H0 (think of all the tests of fit, e.g. in structural equation modelling).

Decision       Reality        Cases out of 1000 (200 H0 true)   Probability (subjective, Bayes)
H0 (FIT)       H0 true           190 (95% of 200)               H0 true: 190/350 = 54.3%
H0 (FIT)       H0 not true       160 (20% of 800)
HA (MISFIT)    HA true           640 (80% of 800)               HA true: 640/650 = 98.5%
HA (MISFIT)    HA not true        10 (5% of 200)

p-hacking (e.g. changing the model until it fits)
And yet another problem (at α = 0.001, β = 0.05)
In practice, no one assesses model fit at α = 0.05, but rather at α = 0.001 (or even lower) – with high power we get:

Decision       Reality        Cases out of 1000 (200 H0 true)   Probability (subjective, Bayes)
H0 (FIT)       H0 true           200 (99.9% of 200)             H0 true: 200/240 = 83.3%
H0 (FIT)       H0 not true        40 (5% of 800)
HA (MISFIT)    HA true           760 (95% of 800)               HA true: 760/760 = ~100%
HA (MISFIT)    HA not true         0 (0.1% of 200)
And yet another problem (at α = 0.001, β = 0.73*)
But with alpha = 0.001, power will be much smaller (also think of sample size, which is often relatively small).
* Corresponds to α = 0.05, β = 0.20 in a simple t-test, when changing α to 0.001 and keeping all other things equal.

Decision       Reality        Cases out of 1000 (200 H0 true)   Probability (subjective, Bayes)
H0 (FIT)       H0 true           200 (99.9% of 200)             H0 true: 200/784 = 25.5%
H0 (FIT)       H0 not true       584 (73% of 800)
HA (MISFIT)    HA true           216 (27% of 800)               HA true: 216/216 = ~100%
HA (MISFIT)    HA not true         0 (0.1% of 200)
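The footnoted β = 0.73 can be reproduced with a normal approximation: keep the noncentrality that gives 80% power at one-tailed α = 0.05, then tighten α to 0.001 with everything else unchanged (a sketch):

```python
from statistics import NormalDist

z = NormalDist()

# Noncentrality implied by power = 0.80 at one-tailed alpha = 0.05
delta = z.inv_cdf(0.95) + z.inv_cdf(0.80)

# Same effect and sample size, but alpha tightened to 0.001 (one-tailed)
power = z.cdf(delta - z.inv_cdf(0.999))
beta = 1 - power
print(round(beta, 2))   # 0.73
```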