Issues concerning the interpretation of statistical significance tests.



Page 1

Issues concerning the interpretation of statistical significance tests

Page 2

[Figure: 20 hypothetical studies, 18 showing an effect (RR above 1.0), none "statistically significant"; each study's confidence interval crosses the reference line at RR = 1.0. Y-axis: RR.]

Page 3

Failing to reject the null hypothesis

In the previous slide, none of the studies were statistically significant, since all of the confidence intervals included an RR of 1.0.

Therefore, none of the studies would reject the null hypothesis.

However, failure to reject the null hypothesis does not prove that the null hypothesis is true! (In fact, most of the studies indicated an increased RR).

Page 4

Failing to reject the null hypothesis

Sometimes a study will fail to reject the null hypothesis even though the null hypothesis is false (i.e., there really is an effect but the result is not statistically significant).

When a study fails to reject the null hypothesis when the null hypothesis is false, a “type II error” has occurred.

Page 5

Four Components of statistical significance testing

• Sample size
• Minimum meaningful effect size
  – The minimum meaningful effect size is based on:
    • Public health considerations
    • Scientific considerations
    • Clinical meaningfulness
    • Results of previous studies
• Specification of type I error (α error)
• Specification of type II error (β error)

Page 6

Four Components of statistical significance testing

• As part of designing the study, values for three of the four components are chosen and are used to compute the value of the fourth component.

• Usually, values for the type I error, the type II error, and the minimum meaningful effect size are chosen, and the value for the sample size is computed.
  – The computed sample size is the number of study subjects necessary to detect an effect equal to (or larger than) the chosen minimum meaningful effect size with the chosen type I and type II error rates.
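The computation described above can be sketched with the usual normal-approximation sample-size formula for comparing two proportions. This is a hypothetical illustration (the slides do not specify a particular formula or study design); the baseline risk and effect size below are invented for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p0, p1, alpha=0.05, beta=0.20, two_sided=True):
    """Per-group sample size needed to detect a difference between
    proportions p0 and p1 with the chosen type I (alpha) and
    type II (beta) error rates, using the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - beta)  # power = 1 - beta
    var = p0 * (1 - p0) + p1 * (1 - p1)
    n = (z_alpha + z_beta) ** 2 * var / (p1 - p0) ** 2
    return ceil(n)


# Hypothetical design: baseline risk 10%, minimum meaningful risk 15%,
# alpha = .05 (two-sided), beta = .20 (i.e., 80% power)
n = sample_size_two_proportions(0.10, 0.15, alpha=0.05, beta=0.20)  # → 683 per group
```

As the slides note, three components are chosen and the fourth (here, the sample size) falls out of the computation; shrinking the minimum meaningful effect size, or tightening either error rate, drives the required sample size up.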

Page 7

Statistical Power

• What is “power”?

– Probability of not making a type II error

– Probability of correctly rejecting the null hypothesis

– Power = 1 − β error = 1 − type II error
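As a concrete illustration of power = 1 − β, a small simulation can estimate how often a simple test correctly rejects a false null hypothesis. The study design (a two-sided z-test comparing two proportions, with invented risks and sample size) is hypothetical, chosen only to make the definition tangible.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulated_power(p0, p1, n, alpha=0.05, reps=1000, seed=1):
    """Estimate power: the fraction of simulated studies in which a
    two-sided pooled z-test for a difference in proportions rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        x0 = sum(rng.random() < p0 for _ in range(n))  # events, unexposed group
        x1 = sum(rng.random() < p1 for _ in range(n))  # events, exposed group
        phat0, phat1 = x0 / n, x1 / n
        pbar = (x0 + x1) / (2 * n)                     # pooled proportion
        se = sqrt(pbar * (1 - pbar) * 2 / n)
        if se > 0 and abs(phat1 - phat0) / se > z_crit:
            rejections += 1
    return rejections / reps


# Here the null is false (10% vs 15% risk), so each rejection is correct:
power = simulated_power(0.10, 0.15, n=500)
# the estimated type II error is simply 1 - power
```

The fraction of simulated studies that reject is the power; the remaining fraction are exactly the type II errors described on the previous slide.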

Page 8

Statistical Power

• The level of statistical power in a study depends on the choices for the three other components:
  – Type I error (α error)
    • Related consideration: whether the alternative hypothesis is one-sided or two-sided
  – Sample size
  – Minimum meaningful effect size

• Power is also affected by the presence of biases

Page 9

Ways to increase statistical power of a study

• Choose a higher value for type I error (α error)
  – e.g., choose .10 rather than .05, and a one-sided rather than a two-sided alternative
• Increase the sample size
  – increases the precision of the study by reducing the variance
  – may not be possible if the number of subjects exposed and/or willing to participate is not large
  – resource limitations may prevent increasing the sample size
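The first two levers above can be made concrete with the usual normal-approximation power formula for comparing two proportions. The function and all numbers below are an illustrative sketch, not a computation taken from the slides.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0, p1, n, alpha=0.05, two_sided=True):
    """Approximate power of a z-test comparing proportions p0 and p1
    with n subjects per group (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    se = sqrt((p0 * (1 - p0) + p1 * (1 - p1)) / n)
    return z.cdf(abs(p1 - p0) / se - z_crit)


base = approx_power(0.10, 0.15, n=300)                        # alpha = .05, two-sided
looser_alpha = approx_power(0.10, 0.15, n=300, alpha=0.10)    # raise the type I error
one_sided = approx_power(0.10, 0.15, n=300, two_sided=False)  # one-sided alternative
bigger_n = approx_power(0.10, 0.15, n=600)                    # double the sample size
# each of the three changes increases power relative to `base`
```

Raising α, switching to a one-sided alternative, or enlarging the sample each buys power, exactly as the bullets above describe, and each purchase has the cost the slides go on to discuss.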

Page 10

Ways to increase statistical power of a study

• Choose a larger minimum meaningful effect size
  – Problem: important, although small, effect sizes may be ignored
• Minimize bias
  – e.g., bias towards the null due to non-differential exposure misclassification makes it harder for a study to achieve statistically significant results

Page 11

Relationship between Type I error and Type II error (and power)

Page 12

Summary of Statistical Power Considerations

• Many studies are under-powered, especially given the likely presence of biases towards the null (e.g., exposure misclassification, healthy worker effect biases)
• Routine selection of .05 as the value for type I error (α error) can lead to under-powered studies
  – The choice of values for type I and type II errors should be based on the public health or clinical costs of each error

Page 13

Summary of Statistical Power Considerations

• The level of statistical power is a function of the values chosen for the four components of statistical significance testing as part of the study design process.

• However, once the study is conducted and a result is obtained, a post-study power calculation is relatively unimportant for interpreting the study findings.
  – The level of statistical power, like the levels of type I and type II error and the sample size, is a consideration that goes into the design of a study, not a consideration useful in interpreting its results.

• To interpret a study finding, calculate a confidence interval to indicate the range of effect values (e.g., range of RRs) that are compatible with the effect estimate obtained in the study.

Page 14

Interpretation of study findings: confidence intervals and p-values

Page 15

Limitations of p-values and confidence intervals

• Both assume no systematic bias is present– A certainly false assumption!

• The confidence interval indicates the precision of the effect estimate but may tell us nothing about the true value of the effect:
  – the confidence interval may not contain the true effect value, because of chance or because of systematic bias

• A 2-sided p-value does not provide a clear indication of the direction, magnitude, or precision of the association

Page 16

Other limitations of p-values

• The p-value is mostly a function of sample size
  – if the sample size is very large, even trivial departures from the null will be statistically significant
  – a tiny effect in a very large study can have the same p-value as a huge effect in a small study
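The large-study/small-study point can be shown numerically with a z-test on the log risk ratio. The cohort counts below are hypothetical, chosen so that a trivial effect in a very large cohort and a large effect in a small cohort land on essentially the same p-value.

```python
from math import log, sqrt
from statistics import NormalDist


def rr_and_p(a, n1, b, n0):
    """Risk ratio and two-sided p-value for H0: RR = 1,
    via a z-test on the log risk ratio."""
    rr = (a / n1) / (b / n0)
    se = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)  # se of ln(RR)
    z = abs(log(rr)) / se
    p = 2 * (1 - NormalDist().cdf(z))
    return rr, p


# Very large cohort, trivial effect: 880/80,000 exposed vs 800/80,000 unexposed
rr_big, p_big = rr_and_p(880, 80_000, 800, 80_000)   # RR = 1.1
# Small cohort, huge effect: 12/100 exposed vs 4/100 unexposed
rr_small, p_small = rr_and_p(12, 100, 4, 100)        # RR = 3.0
# both p-values come out near .05 despite very different effect sizes
```

An RR of 1.1 and an RR of 3.0 carry very different public health implications, yet the p-value alone cannot distinguish them.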

• Focuses attention on an arbitrary cutoff (e.g., .05)
  – The interpretation of a p-value of .049 is different from the interpretation of a p-value of .051
  – Focuses attention on the lower boundary of a confidence interval and ignores the rest of the interval
  – It is a qualitative assessment (statistically significant: yes/no) when a quantitative assessment would be more appropriate

Page 17

Other limitations of p-values

• The value of the p-value depends on the statistical test:
  – exact vs. asymptotic tests
  – a trend test vs. a test for any differences that disregards the ordering of exposure
  – one-tailed vs. two-tailed tests: the same data, but two different interpretations
• The alternative hypothesis may be rejected even when the data do not support the null hypothesis very much (e.g., p = .06)
  – the data may be more likely under the alternative hypothesis

• The null hypothesis may be rejected even when the data are much less likely under alternative hypotheses.

Page 18

Other limitations of p-values

P-value is simply a measure of rarity with respect to one hypothesis – the null hypothesis (“chance”)

But life is full of rare events which we often ignore. We take notice of a rare event if there is a plausible competing hypothesis under which the probability of the event is relatively higher.

A proper assessment of the plausibility of the null hypothesis requires the simultaneous consideration of the relative plausibility of alternative hypotheses.

Page 19

Advantages of the confidence interval

• A good indicator of the precision of the effect estimate

– the p-value is affected by both the precision and the magnitude of the effect estimate, whereas the width of a specific (1 - α) confidence interval is affected solely by the precision of the effect estimate

– the width of the confidence interval is a much better indicator of the impact of random error than the p-value

Page 20

Advantages of the confidence interval

• A good indicator of the likely magnitude of the effect

– provides a range of values for the association, under the assumption that the difference between the true value and the observed value for the association is due only to random variation

– provides a range of values for the effect that are compatible with the data obtained

– values located centrally in the interval, i.e., near the point estimate, are more compatible with the data than values near the boundary of the interval

• focus should therefore be on the entire interval, especially the values around the center of the interval, as well as the upper and lower boundaries.

Page 21

Advantages of the confidence interval

• A confidence interval and the point estimate provide sufficient information to construct a graph of all possible p-values and confidence intervals

• This graph is called the P-value function

• The P-value function gives the p-values for the null hypothesis, as well as every alternative hypothesis, for the parameter of interest

Page 22

[Figure: P-value function for an odds or risk ratio; point estimate RR = 1.5, 95% CI: 0.5, 4.5. X-axis: odds or risk ratio (0 to 7); y-axis: one-sided p-value (0 to 0.6). At the null (RR = 1.0), the one-sided p = 0.23.]

Page 23

P-value function

• The red line is the 95% confidence interval
• The green line marks off the null hypothesis
  – the one-sided p-value for the null hypothesis is where the green line, vertical from RR = 1.0, hits the graph
• The 95% confidence interval represents only one possible horizontal slice through the P-value function.
  – e.g., the 80% confidence interval can be read off the graph by following where the horizontal line through the one-sided p of 0.1 hits the graph: 0.7, 3.1

• The p-value for every possible hypothesis for the true RR can be read from this graph.

Page 24

P-value function

• By comparing p-values for different hypotheses (e.g., RR = 1.0 vs. RR = 2.0 vs. RR = 3.0), one can get an indication of the relative strength of the evidence that the obtained data provide for each hypothesis
  – In this example, the one-sided p-value for RR = 2 is larger than for RR = 1.0 (the null hypothesis) and for RR = 3.
  – This indicates that the obtained data are more probable under the hypothesis that RR = 2 than under the null (RR = 1) hypothesis or the hypothesis that RR = 3.
  – In particular, this means that the hypothesis that RR = 2 is better supported by the observed data than the null hypothesis.
  – The RR most supported by the observed data is always the point estimate: RR = 1.5, with a one-sided p-value of 0.50

Hypothesis       One-sided p-value
RR = 1 (null)    0.24
RR = 2           0.30
RR = 3           0.11
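The one-sided p-values in this table can be reproduced (to within rounding) from just the point estimate and the 95% confidence interval, assuming ln(RR) is normally distributed. This is a sketch of the standard approximation, not necessarily the computation used to make the slide.

```python
from math import log
from statistics import NormalDist


def pvalue_function(rr_hat, ci_low, ci_high):
    """Return f(rr0) = one-sided p-value for the hypothesis RR = rr0,
    with the standard error of ln(RR) recovered from the 95% CI."""
    se = (log(ci_high) - log(ci_low)) / (2 * 1.96)
    z = NormalDist()

    def p(rr0):
        return 1 - z.cdf(abs(log(rr_hat) - log(rr0)) / se)

    return p


p = pvalue_function(1.5, 0.5, 4.5)   # the slides' example: RR = 1.5, 95% CI 0.5-4.5
# p(1.0) ≈ 0.23, p(2.0) ≈ 0.30, p(3.0) ≈ 0.11, and p(1.5) = 0.50 at the peak
```

Evaluating `p` over a grid of RR values traces out the whole P-value function curve shown two slides earlier, with its peak at the point estimate.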

Page 25

P-value function

• The P-value function provides all the information about the data
  – The precision of the study is indicated by the width of the graph, and the magnitude of the association corresponds to the peak of the graph
• The P-value function:
  – avoids the arbitrariness of choosing a specific (1 − α) confidence interval (e.g., a 95% confidence interval)
  – avoids the arbitrariness of choosing a specific p-value cutoff (e.g., p = .05)
  – focuses attention on the weight of the evidence for other possible hypotheses besides the null hypothesis

Page 26

P-value function

• Since it is cumbersome to provide a P-value function for every estimate of RR, a confidence interval along with the point estimate can provide enough information so that an approximate P-value function graph can be drawn.
• More information on the P-value function curve can be provided if additional confidence intervals are presented
  – e.g., "nested" confidence intervals can be provided: a 50% interval, an 80% interval, and a 95% interval
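Under the same log-normal approximation, nested intervals at any level can be recovered from the point estimate and one reported interval. The function below is an illustrative sketch using the slides' running example (RR = 1.5, 95% CI 0.5 to 4.5).

```python
from math import exp, log
from statistics import NormalDist


def nested_ci(rr_hat, ci_low, ci_high, level):
    """Confidence interval at any `level` (e.g., 0.80), with the
    standard error of ln(RR) recovered from the 95% interval."""
    z = NormalDist()
    se = (log(ci_high) - log(ci_low)) / (2 * z.inv_cdf(0.975))
    half = z.inv_cdf((1 + level) / 2) * se
    return exp(log(rr_hat) - half), exp(log(rr_hat) + half)


# Nested 50%, 80%, and 95% intervals for RR = 1.5, 95% CI 0.5-4.5:
intervals = {level: nested_ci(1.5, 0.5, 4.5, level) for level in (0.50, 0.80, 0.95)}
# the 80% interval comes out near (0.73, 3.1), consistent with the value
# read off the P-value function graph on the earlier slide
```

Each nested interval is one horizontal slice through the P-value function, so reporting several of them conveys the curve's shape without drawing it.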

Page 27

Summary

• Many studies are under-powered. The issue of statistical power must be addressed in the design of the study.
• Statistical power is affected by the sample size and by the choice of values for the type I error and the minimum meaningful effect size. It is also affected by biases.
• To maximize power:
  – Increase the sample size
  – Increase the type I error
  – Increase the minimum meaningful effect size
  – Minimize biases

Page 28

Summary

• The p-value, the confidence interval, and the P-value function all assume that only random variation is present. They do not address systematic biases (e.g., non-differential exposure misclassification, the healthy worker effect biases, selection bias, confounding).

• The confidence interval should not be interpreted as if it were simply another way to determine statistical significance (i.e., it should not be interpreted in the same way as the p-value for the null hypothesis).

Page 29

Summary

• Properly interpreting a confidence interval requires taking into account the values in the center of the range as well as the values at both boundaries.

• A confidence interval and the point estimate together give an approximate indication of the P-value function curve.

• It is not true that a 95% confidence interval computed from a study contains the true parameter with 95% probability. The probability that it does contain the true parameter is undefined.

– the "95%" refers to the frequency with which a very large number of intervals constructed in this manner would contain the true parameter, assuming that only random variability is present
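This frequency interpretation can be checked directly by simulation. The setup below is a hypothetical sketch (repeated sampling from a normal population with a known mean, and a z-interval for the mean), not an example from the slides.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def coverage(true_mean=0.0, sigma=1.0, n=100, reps=3000, seed=1):
    """Fraction of nominal 95% confidence intervals for the mean
    that contain the true mean, over many repeated samples."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m, s = mean(sample), stdev(sample)
        half = z * s / sqrt(n)
        if m - half <= true_mean <= m + half:
            hits += 1
    return hits / reps


c = coverage()
# over many repetitions, roughly 95% of the intervals contain the true mean;
# any single interval either contains it or it does not
```

The "95%" is a property of the interval-generating procedure across repetitions, which is exactly why no probability statement attaches to the one interval a single study produces.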

Page 30

Summary

One should be aware that the choice of a particular confidence interval (e.g., 95% interval), and choice of a specific p-value cutoff for statistical significance, are arbitrary decisions, with no scientific or public health justification.

Page 31

Berkson (1942)

"If an event has occurred, the definitive question is not, 'Is this an event which would be rare if the null is true?' but 'Is there an alternative hypothesis under which the event would be relatively frequent?' If there is no plausible alternative at all, the rarity is quite irrelevant to a decision…."