Making Inferences about Effects of Hypoxia/Altitude
Will Hopkins ([email protected], sportsci.org/will)
Victoria University, Melbourne, Australia

· Making inferences (decisions or conclusions)
  · Sampling variation; true effects
  · Confidence limits; significance tests; magnitude-based inference
  · Individual differences
· Important magnitudes of effect statistics
  · Differences and changes in means
  · Slopes and correlations
  · Differences and ratios for proportions and counts
· Inferences when monitoring individual athletes

Adapted from a lecture for the IOC’s post-graduate diploma in sports medicine, available at Sportscience (Analysis and Interpretation: Introduction)

Making Inferences (Decisions or Conclusions)

· Every sample gives a different value for a statistic, owing to sampling variation.

· So, the value of a sample statistic is only an estimate of the true (right, real, actual, very large sample, or population) value.

· But people want to make an inference about the true value.
· The best inferential statistic for this purpose is the confidence interval: the range within which the true value is likely to fall.
· "Likely" is usually 95%, so there is a 95% chance the true value is included in the confidence interval (and a 5% chance it is not).
· Confidence limits are the lower and upper ends of the interval.
· The limits represent how small and how large the effect "could" be.
· All effects should be shown with a confidence interval or limits.
· Example: the altitude camp produced an average increase in peak power of 2.2% (95% confidence interval -0.6 to 5.0%).
· But confidence limits alone don't provide an inference.

· Statistical significance is the traditional way to make inferences.
· It’s also known as the null-hypothesis significance test.
· The inference is all about whether the effect could be zero or "null".
· If the 95% confidence interval includes zero, the effect "could be zero". The effect is "statistically non-significant (at the 5% level)".
· If the confidence interval does not include zero, the effect "couldn't be zero". The effect is "statistically significant (at the 5% level)".

· A p value is used to decide whether an effect is significant (p<0.05).

[Figure: 95% confidence intervals for three effects, plotted against the value of the effect statistic (e.g., change in peak power) with negative values on one side and positive values on the other side of zero (null). One interval is statistically non-significant (p=0.12); two are statistically significant (p=0.02 and p=0.003).]
· Researchers using p values should show exact values.

· The exact definition of the p value is hard to understand.
• Useful interpretation: if you observe a positive effect, and the p value is 0.08, the probability is 0.04 that the effect is actually negative.
· People usually interpret significant as "a real (important) effect" and non-significant as "no real effect".
· These interpretations apply only if the study was done with the right sample size.
· Even then they are misleading: they don't convey the uncertainty or the magnitude.
· And you hardly ever know if the sample size is right.
· And I think the approach is too conservative.
· Attempts to address this problem with post-hoc power calculations are rare, generally wrong, and too hard to understand.
· So the only safe interpretation is whether the effect could be zero (p>0.05) or not (p<0.05).
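The link between a confidence interval, its p value, and the "probability the effect is actually negative" reading can be sketched as follows. This is a minimal illustration, not the author's spreadsheet: it assumes a normal sampling distribution (small samples would need a t distribution), and the function names are hypothetical.

```python
from statistics import NormalDist

def p_from_ci(estimate, lower, upper, level=0.95):
    """Two-sided p value (null = 0) implied by an estimate and its confidence limits.

    Assumption: the sampling distribution is normal.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    se = (upper - lower) / (2 * z_crit)              # standard error implied by the interval width
    z = abs(estimate) / se
    return 2 * (1 - NormalDist().cdf(z))

def chance_of_opposite_sign(p_two_sided):
    """Chance that the true effect has the opposite sign to the observed effect."""
    return p_two_sided / 2

# Altitude-camp example from the slides: 2.2% (95% CI -0.6 to 5.0%).
p = p_from_ci(2.2, -0.6, 5.0)
print(f"p = {p:.2f}")                                 # the interval includes zero, so p > 0.05
print(f"chance the true effect is negative = {chance_of_opposite_sign(p):.2f}")
```

Under these assumptions the altitude-camp interval corresponds to p of about 0.12, which is non-significant even though most of the interval is positive.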

· But the issue for the practitioner is not whether the effect could be zero, but whether the effect could be important.
• Important has two meanings for clinical or practical effects: beneficial and harmful.
· One approach to this issue with significance testing is to assess the magnitude of an effect only if the effect is significant.
· So, if it’s significant, look at the value and decide whether it is beneficial, harmful or even trivial.
• Yes, a statistically significant trivial effect is trivial indeed!
· But if it’s non-significant, you can’t conclude anything.
• But some non-significant effects are likely to be trivial.
• Other non-significant effects have a good chance of being beneficial or harmful.
• And you often get non-significant effects, so you would like some way to deal with them.
· The confidence interval solves the problem, when important values for benefit and harm are taken into account.

· Clinical inferences with the confidence interval
· The smallest clinically or practically important values define ranges of values of the effect that are beneficial, harmful and trivial.
• Smallest effects for benefit and harm are equal and opposite.

· Infer (decide) the outcome from the confidence interval, as follows:

[Figure: confidence intervals plotted against the value of the effect statistic (e.g., change in peak power). The axis is divided into harmful, trivial and beneficial ranges by the smallest clinically harmful effect and the smallest clinically beneficial effect. Each interval is labeled with a clinical decision: "Clear: use it.", "Clear: maybe use it.", "Clear: don't use it." or "Unclear: more data needed." Annotations: "But p>0.05!", "But p<0.05!", "P values fail here."]

· This approach eliminates statistical significance.
• And you should eliminate all mention of hypotheses, hypothesis testing, p values and statistical significance from your manuscripts.
• If you show whether the effect is significant or not, readers will ignore what the confidence interval tells them about the effect.
• Yes, but reviewers and editors sometimes insist on p values.
• That’s changing. Some reviewers and editors are now insisting on interpretation of confidence intervals, regardless of the p value.
· The only issue is what level to make the confidence interval.
· To be careful about avoiding harm, you can make a conservative 99% confidence interval on the harm side.
· And to use effects only when there is a reasonable chance of benefit, you can make a 50% interval on the benefit side.
· But that's hard to understand. Consider this equivalent approach…

· Clinical inferences with chances of benefit and harm
· The uncertainty in an effect can be expressed as chances that the true effect is beneficial and the risk that it is harmful.
· You would decide to use an effect with a reasonable chance of benefit, provided it had a sufficiently low risk of harm.
· I have opted for possibly beneficial (>25% chance of benefit) and most unlikely harmful (<0.5% chance of harm).
· An effect with >25% chance of benefit and >0.5% risk of harm is therefore unclear. You'd like to use it, but you daren't.
• Everything else is either clearly useful or clearly not worth using.
· However, if the chance of benefit is high (e.g., 80%), you could accept a higher risk of harm (e.g., 5%).
• I have formalized this less conservative approach by comparing the chance of benefit and risk of harm with an odds ratio (odds of benefit to odds of harm).

• If the odds ratio is >66, you may declare the effect beneficial.
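The decision rules above can be sketched in code. This is an illustration only: it assumes a normal sampling distribution, treats positive values as beneficial, and uses a hypothetical smallest important change of 1.0% for the altitude-camp example (the slides do not state one). The function names are also hypothetical.

```python
from statistics import NormalDist

def chances(estimate, lower, upper, smallest, level=0.95):
    """Chances (as proportions) that the true effect is beneficial, trivial or harmful.

    Assumptions: positive values are beneficial; the sampling distribution is normal.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    se = (upper - lower) / (2 * z_crit)          # standard error implied by the interval
    dist = NormalDist(mu=estimate, sigma=se)
    benefit = 1 - dist.cdf(smallest)             # true effect above the smallest beneficial value
    harm = dist.cdf(-smallest)                   # true effect below the smallest harmful value
    return benefit, 1 - benefit - harm, harm

def odds(p):
    return p / (1 - p) if p < 1 else float("inf")

def clinical_decision(benefit, harm):
    if benefit <= 0.25:
        return "don't use"                       # not even possibly beneficial
    if harm < 0.005:
        return "use"                             # possibly beneficial and most unlikely harmful
    if odds(benefit) / odds(harm) > 66:
        return "use (odds ratio > 66)"           # the less conservative rule
    return "unclear"

# Altitude camp: 2.2% (95% CI -0.6 to 5.0%); smallest important change of 1.0% is assumed.
benefit, trivial, harm = chances(2.2, -0.6, 5.0, smallest=1.0)
print(f"benefit {benefit:.0%}, trivial {trivial:.0%}, harm {harm:.1%}")
print(clinical_decision(benefit, harm))
```

With these assumed numbers the risk of harm exceeds 0.5%, so the conservative rule would call the effect unclear, but the odds ratio of benefit to harm is well above 66.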

· When an effect has no obvious benefit or harm (e.g., a comparison of males and females), the inference is only about whether the effect could be substantially positive or negative.
• For such non-clinical inferences, use a symmetrical confidence interval, usually 90%, to decide whether the effect is clear.
• You can also use the chances that the effect is substantially positive or negative. One of these has to be <5% for the effect to be clear.
· Examples of ways to report magnitude-based inferences:
• possibly small benefit
• likely moderately harmful
• a trivial difference, clear at the 99% level
• a trivial-moderate increase [the lower and upper confidence limits]
• a small but unclear effect; more data are needed.

· Whatever, you should show confidence limits and interpret the uncertainty in a way readers (especially coaches and athletes) can understand.
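The non-clinical rule can be written as a check on the 90% confidence limits. This is a minimal sketch with a hypothetical function name and made-up limits; an effect is treated as unclear only when the interval reaches both the substantially positive and the substantially negative range.

```python
def nonclinical_inference(lower90, upper90, smallest):
    """Return "clear" or "unclear" for a non-clinical effect.

    lower90, upper90: limits of a symmetrical 90% confidence interval.
    smallest: smallest important magnitude (same units, positive number).
    """
    could_be_positive = upper90 > smallest       # more than a 5% chance of substantially positive
    could_be_negative = lower90 < -smallest      # more than a 5% chance of substantially negative
    return "unclear" if (could_be_positive and could_be_negative) else "clear"

# Hypothetical intervals, with a smallest important difference of 1.0 unit:
print(nonclinical_inference(-0.2, 4.6, smallest=1.0))   # cannot be substantially negative
print(nonclinical_inference(-1.5, 2.0, smallest=1.0))   # spans both substantial ranges
```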

· Magnitude-based inference (MBI) has been criticized recently in an article in Medicine and Science in Sports and Exercise.
· The authors claimed that it is wrong to interpret the confidence interval as the range of possible values of the true effect.
· But Bayesian statisticians claim that this interpretation is correct.
• And other statisticians had already shown that magnitude-based inference is a legitimate form of Bayesian inference.
· More importantly, the authors claimed that magnitude-based inference has an unacceptably high rate of Type-I errors when the sample size is small.
• That is, they said that when the true effect is zero, with MBI you often claim incorrectly that the true effect is substantial.
· But the authors misunderstood inference in MBI.
• If the true effect is zero, it is obviously not wrong to claim that the effect could be trivial or substantial.
• And if you are worried about that, with many statistically significant effects, the true effect could be trivial or substantial.

· Alan Batterham (my co-conspirator) and I have recently submitted an article to MSSE, in which we show that the Type-I error rate in non-clinical MBI is always ≤5%.
· With significance testing, the Type-I rate is always ≥5%.
· Significance testing is all “black and white”. MBI properly allows for “shades of grey”, which are not wrong.
· Furthermore, MBI has higher rates of publication-worthy outcomes with small sample sizes.
· With significance testing, if you get p>0.05, it’s hard to publish.
· And the resulting effects, if published, have trivial publication bias.
· Publication bias is a major problem with significance testing.

· But if our article is rejected, I will have to retire!

· A caution about making an inference…
· Whatever method you use, the inference is about the one and only mean effect in the population.
· The confidence interval represents the uncertainty in the true effect, not a range of individual differences or individual responses.
· With a large-enough sample size, a treatment could be clearly beneficial (a mean beneficial effect with a narrow confidence interval, or p<0.0001, say), yet the treatment could be harmful for a substantial proportion of the population.
· Individual differences between groups and individual responses to a treatment are best summarized with a standard deviation to go with the mean effect.
• The mean effect and this SD both need confidence limits.

· Individual differences and responses may be accounted for by including subject characteristics as modifying covariates in the analysis.

· Researchers generally neglect this important issue.

Important Magnitudes of Effect Statistics

· Researchers need the smallest important magnitude of an effect statistic to estimate sample size for a study.

· Practitioners need to know about important magnitudes to monitor their athletes or patients.

· Researchers and practitioners all need to know about important magnitudes to interpret research findings.

· So the remaining slides are all about values for various magnitudes of various effect statistics.

Differences or Changes in the Mean
· The most common effect statistic, for numbers with decimals (continuous variables).
· Difference when comparing different groups, e.g., patients vs healthy.
· Change when tracking the same subjects.
· Difference in the changes in controlled trials.
· The between-subject standard deviation provides default thresholds for important differences and changes.
· You think about the effect (mean) in terms of a fraction or multiple of the SD (mean/SD).
· The effect is said to be standardized.
· The smallest and other important effects are…?
[Figures: strength in patients vs healthy subjects; strength across trials (pre, post1, post2). Data are means & SD.]

· Example: the effect of a treatment on strength
[Figure: strength pre and post treatment, illustrating a trivial effect (0.1x SD) and a very large effect (3.0x SD).]

· Interpretation of standardized difference or change in means:

Magnitude         Cohen      Hopkins
trivial           <0.2       <0.2
small             0.2-0.5    0.2-0.6
moderate          0.5-0.8    0.6-1.2
large             >0.8       1.2-2.0
very large        ?          2.0-4.0
extremely large   ?          >4.0

Complete scale:
trivial | 0.2 | small | 0.6 | moderate | 1.2 | large | 2.0 | very large | 4.0 | ext. large
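Standardizing an effect and reading its magnitude off the complete scale can be sketched as follows. The function names and the strength numbers are hypothetical; the thresholds are those of the scale above.

```python
from bisect import bisect_right

THRESHOLDS = [0.2, 0.6, 1.2, 2.0, 4.0]
LABELS = ["trivial", "small", "moderate", "large", "very large", "extremely large"]

def standardized_effect(mean_difference, between_subject_sd):
    """Difference or change in means as a fraction or multiple of the between-subject SD."""
    return mean_difference / between_subject_sd

def magnitude(standardized):
    """Qualitative magnitude; the sign is ignored, and a value on a threshold takes the larger label."""
    return LABELS[bisect_right(THRESHOLDS, abs(standardized))]

# Hypothetical example: strength rises by 4 kg in a group whose baseline SD is 10 kg.
effect = standardized_effect(4.0, 10.0)
print(effect, magnitude(effect))
print(magnitude(0.1), "|", magnitude(3.0))   # the two illustrative effects from the slides
```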

· Cautions with standardizing
· Standardizing works only when the SD comes from a sample that is representative of a well-defined population.
• The resulting magnitude applies only to that population.
· In a controlled trial, use the baseline (pre) SD, never the SD of change scores.
· Beware of authors who show standard errors of the mean (SEM) rather than standard deviations (SD).
• SEM = SD/√(sample size), so SEMs on graphs make effects look a lot bigger than they really are.
· Standardization may not be best for effects on means of some special variables: visual-analog scales, Likert scales, solo athletic performance…
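When a paper reports only SEMs, the SD needed for standardizing can be recovered, because SEM = SD/√(sample size). A minimal sketch with a hypothetical function name and made-up numbers:

```python
from math import sqrt

def sd_from_sem(sem, sample_size):
    """Recover the standard deviation from a standard error of the mean."""
    return sem * sqrt(sample_size)

# Hypothetical example: error bars of ±2.0 units (SEM) with 25 subjects.
print(sd_from_sem(2.0, 25))   # the SD is five times the SEM when n = 25
```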

· Visual-analog scales
· The respondents indicate a perception on a line like this:
Rate your muscle pain by placing a mark on this scale:
none ______________________________ unbearable
· Score the response as percent of the length of the line.
· Magnitude thresholds: ±10%, 30%, 50%, 70%, 90% for small, moderate, large, very large, extremely large differences/changes.
· Likert scales
· Example: How has the training camp affected your performance?
much worse | worse | no effect | better | much better
· Most Likert-type questions have four to seven choices.
· Code them as integers (1, 2, 3, 4, 5…) and analyze as numerics.
· Magnitude thresholds are debatable.
• If you use the thresholds of the visual-analog scale as a guide, the threshold for a 6-pt scale would be ~0.5, 1.5, 2.5, 3.5 and 4.5.
· Both these scales could also be assessed by standardization.
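The thresholds for both kinds of scale follow from the same rule: 10%, 30%, 50%, 70% and 90% of the range of the scale. A minimal sketch, with a hypothetical function name, that reproduces the values for a 6-pt scale:

```python
def scale_thresholds(minimum, maximum):
    """Thresholds for small, moderate, large, very large and extremely large
    differences or changes, as fractions of the range of the scale."""
    span = maximum - minimum
    return [round(fraction * span, 2) for fraction in (0.10, 0.30, 0.50, 0.70, 0.90)]

print(scale_thresholds(0, 100))   # visual-analog scale scored as percent of line length
print(scale_thresholds(1, 6))     # 6-pt Likert scale coded 1 to 6
```

Note that the range of a 6-pt scale coded 1 to 6 is 5 units, which is why its smallest threshold is ~0.5 rather than 0.6.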

· Solo athletic performance
· For fitness tests and performance indicators of team-sport athletes, use standardization.
· But for top solo athletes, an enhancement that results in one extra medal per 10 competitions is the smallest important effect.
• The within-athlete variability that athletes show from one competition to the next determines this effect. Here's why…
• Because of this variability, each of the top athletes has a good chance of winning at each competition:
[Figure panels: Race 1, Race 2, Race 3.]

· Your athlete needs an enhancement that overcomes this variability to give her or him a bigger chance of a medal.

· Simulations show an enhancement of 0.3 of an athlete's typical variability from competition to competition gives one extra win every 10 competitions.
• Example: if the variability is an SD (coefficient of variation) of 1%, the smallest important enhancement is 0.3%.
• In some early publications I have mistakenly referred to 0.5 of the variability as the smallest effect.
· Small, moderate, large, very large and extremely large effects result in an extra 1, 3, 5, 7 and 9 medals in every 10 competitions.
· The corresponding enhancements as factors of the variability are:
trivial | 0.3 | small | 0.9 | moderate | 1.6 | large | 2.5 | very large | 4.0 | ext. large
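These factors convert an athlete's competition-to-competition variability into performance thresholds. A minimal sketch; the function name is hypothetical, and the 1% coefficient of variation is the example given in the slides.

```python
FACTORS = {
    "small": 0.3,
    "moderate": 0.9,
    "large": 1.6,
    "very large": 2.5,
    "extremely large": 4.0,
}

def performance_thresholds(cv_percent):
    """Threshold enhancements (%) for a given within-athlete variability (CV, %)."""
    return {label: round(factor * cv_percent, 2) for label, factor in FACTORS.items()}

for label, threshold in performance_thresholds(1.0).items():
    print(f"{label}: {threshold}%")
```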

· Beware: smallest effect on athletic performance in performance tests depends on method of measurement, because…
· A percent change in an athlete's ability to output power results in different percent changes in performance in different tests.
· These differences are due to the power-duration relationship for performance and the power-speed relationship for different modes of exercise.
· Example: a 1% change in endurance power output produces the following changes…
• 1% in running time-trial speed or time;
• ~0.4% in road-cycling time-trial time;
• 0.3% in rowing-ergometer time-trial time;
• ~15% in time to exhaustion in a constant-power test.
• A hard-to-interpret change in any test following a fatiguing pre-load. (But such tests can be interpreted for cycling road races: see Bonetti and Hopkins, Sportscience 14, 63-70, 2010.)

Slope (or Gradient)
· Used when the predictor and dependent are both numeric and a straight line fits the trend.
· The unit of the predictor is arbitrary.
· Example: a 2% per year decline in activity seems trivial… yet 20% per decade seems large.
· So it's best to express a slope as the difference in the dependent per two SDs of predictor.
• It gives the difference in the dependent (physical activity) between subjects with a typically low and high value of the predictor (age).
• The SD for standardizing the resulting effect is the standard error of the estimate (the scatter about the line).
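Expressing a slope per two SDs of the predictor, and standardizing it with the standard error of the estimate, can be sketched with ordinary least squares. The data and the function name are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev

def slope_effect(x, y):
    """Return (slope, difference per 2 SD of x, that difference standardized by the SEE)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    see = sqrt(residual_ss / (len(x) - 2))       # standard error of the estimate
    diff_2sd = slope * 2 * stdev(x)              # typically low vs typically high predictor
    return slope, diff_2sd, diff_2sd / see

# Made-up data: age (years) and physical activity (arbitrary units).
age = [20, 30, 40, 50, 60]
activity = [100, 92, 85, 70, 63]
slope, diff_2sd, standardized = slope_effect(age, activity)
print(f"slope = {slope:.2f} units per year")
print(f"difference per 2 SD of age = {diff_2sd:.1f} units")
print(f"standardized difference = {standardized:.1f}")
```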

[Figure: scatterplot of physical activity against age with a fitted line; a span of 2 SD of age is marked.]

Correlation Coefficient
· This effect statistic is closely related to the slope. It represents the overall linearity in a scatterplot. Examples:
[Figure: scatterplots for r = 0.00, 0.10, 0.30, 0.50, 0.70, 0.90 and 1.00.]
· Negative values represent negative slopes.
· The value is unaffected by the scaling of the two variables.
· And it's much easier to calculate than a slope.
· But a properly calculated slope is easier to interpret clinically.
· Smallest important correlation is ±0.1. Here’s the complete scale:
trivial | 0.1 | low | 0.3 | moderate | 0.5 | high | 0.7 | very high | 0.9 | ext. high

Differences and Ratios of Proportions, Risks, Hazards, Odds
· Example: percent of male and female players injured in a season of touch rugby.
[Figure: bar chart of proportion injured (%) by sex: male a = 75%, female b = 36%.]
· Risk difference or proportion difference
· A common measure. Example: a - b = 75% - 36% = 39%.
· Problem: the sense of magnitude depends on how big the proportions are.
• Example: for the same 10% difference, 90% vs 80% doesn't seem big, but 11% vs 1% can be interpreted as huge (11x the risk).
· So there is no scale of magnitudes for a risk or proportion difference.
· Exception: effects on winning a close match can be expressed as a proportion difference: 55% vs 45% is a 10% difference or 1 extra match in every 10 matches; 65% vs 35% is 3 extra, and so on.
· Hence this scale for extra matches won or lost per 10 matches:
trivial | 1 | small | 3 | moderate | 5 | large | 7 | very large | 9 | ext. large


· But the analyses don't work properly with proportions.
• We have to use hazards or odds instead of proportions. I will explain shortly.
· Risk ratio (relative risk) or proportion ratio
· Another common measure. Example: a/b = 75/36 = 2.1, which means males are "2.1 times more likely" to be injured, or "a 110% increase in risk" of injury for males.
· Problem: if it's a time-dependent measure, the risk ratio changes.
• If you wait long enough, everyone gets injured, so the risk ratio = 100/100 = 1.00.
· But it works for rare time-dependent risks and for time-independent classifications (e.g., proportion playing a sport).
· Hence we need values for the smallest and other important ratios for such risks and proportions.

[Figure: bar chart of proportion injured (%) by sex: male a = 75%, female b = 36%.]

· The smallest ratio is when one event or case in every 10 is due to the effect.
• Example: one in 10 injuries is due to being male.
• That is, for every 10 injured males, there are 9 injured females.
• If there are N males and N females (injured and uninjured), the injury risks are 10/N and 9/N, and the risk ratio = (10/N)/(9/N) = 10/9.
· For moderate, large, very large and extremely large ratios, for every 10 injured males, there are 7, 5, 3 and 1 injured females.
• Corresponding risk ratios are 10/7, 10/5, 10/3 and 10/1.
· Hence this scale for proportion ratio and low-risk ratio:
trivial | 1.11 | small | 1.43 | moderate | 2.0 | large | 3.3 | very large | 10 | ext. large
• and the inverses for reductions in proportions: 0.9, 0.7, 0.5, 0.3, 0.1.
· But there is still the problem of analyzing proportions properly.
• Two solutions: hazards instead of risks; odds instead of proportions.
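The ratio thresholds follow directly from the "cases out of 10" argument, and ratios below 1 can be judged by their inverses. A minimal sketch with hypothetical names:

```python
from bisect import bisect_right

# For every 10 cases in one group there are 9, 7, 5, 3 or 1 in the other,
# i.e. 1, 3, 5, 7 or 9 cases in every 10 are due to the effect.
THRESHOLDS = [10 / (10 - due_to_effect) for due_to_effect in (1, 3, 5, 7, 9)]
LABELS = ["trivial", "small", "moderate", "large", "very large", "extremely large"]

def ratio_magnitude(ratio):
    """Magnitude of a ratio of proportions, hazards or counts (ratio must be > 0)."""
    r = ratio if ratio >= 1 else 1 / ratio       # reductions are judged by the inverse ratio
    return LABELS[bisect_right(THRESHOLDS, r)]

print([round(t, 2) for t in THRESHOLDS])
print(ratio_magnitude(75 / 36))   # injury risk ratio for males vs females in the example
print(ratio_magnitude(0.8))       # a 20% reduction
```

The fourth threshold is 10/3, which the slides round to 3.3.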


· Hazard ratio or incidence-rate ratio
· This ratio is used for risks or proportions that change with time, especially for risks or proportions that get large.
· The proportion is converted to an incidence rate, which is known as a hazard.
· Example:
The incidence rate for male injuries was 2.0% per week.
The incidence rate for female injuries was 0.8% per week.
Therefore the incidence-rate ratio or hazard ratio = 2.0/0.8 = 2.5.
· You can think of the hazard ratio as the right-now risk ratio: right now, males have 2.5 times greater risk of injury than females.
· The scale of magnitudes is the same as for ratio of proportions:
trivial | 1.11 | small | 1.43 | moderate | 2.0 | large | 3.3 | very large | 10 | ext. large
and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.
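The contrast between a stable hazard ratio and a risk ratio that drifts toward 1.00 over time can be illustrated numerically. This sketch assumes constant hazards, so that cumulative risk is 1 - exp(-hazard × time); that model is an assumption made here for illustration, not something stated in the slides, and the function names are hypothetical.

```python
from math import exp

def cumulative_risk(hazard_per_week, weeks):
    """Proportion injured by a given time, assuming a constant hazard."""
    return 1 - exp(-hazard_per_week * weeks)

def risk_ratio(hazard_a, hazard_b, weeks):
    return cumulative_risk(hazard_a, weeks) / cumulative_risk(hazard_b, weeks)

male_hazard, female_hazard = 0.020, 0.008        # 2.0% and 0.8% per week, from the example
print(f"hazard ratio = {male_hazard / female_hazard:.2f}")
for weeks in (1, 10, 50, 200, 1000):
    print(f"{weeks:>5} weeks: risk ratio = {risk_ratio(male_hazard, female_hazard, weeks):.2f}")
```

While the risks are still small the risk ratio is close to the hazard ratio; as nearly everyone becomes injured it approaches 1.00.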

· Odds ratio for time-independent classifications.
· Odds are the awkward but only way to model classifications.
· Example: proportions of boys and girls playing a sport.
Odds of a boy playing = a/c = 75/25.
Odds of a girl playing = b/d = 36/64.
Therefore the odds ratio = (75/25)/(36/64) = 5.3.
· Interpret the ratio as "…times more likely" only when the proportions in both groups are small (<10%).
• The odds ratio is then approximately equal to the proportion ratio.
· To assess magnitude, authors should convert the odds ratio and its confidence limits to the proportion ratio and its confidence limits.
• Unfortunately they often just leave effects as odds ratios.
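Converting an odds ratio back to a proportion ratio needs the proportion in the reference group. A minimal sketch with hypothetical function names, using the boys and girls example; the same conversion could be applied to each confidence limit of the odds ratio.

```python
def odds(p):
    return p / (1 - p)

def odds_ratio(p_a, p_b):
    return odds(p_a) / odds(p_b)

def proportion_ratio_from_odds_ratio(ratio_of_odds, reference_proportion):
    """Proportion ratio implied by an odds ratio and the reference-group proportion."""
    odds_a = ratio_of_odds * odds(reference_proportion)
    p_a = odds_a / (1 + odds_a)                  # back from odds to a proportion
    return p_a / reference_proportion

ratio = odds_ratio(0.75, 0.36)                   # boys 75% playing, girls 36% playing
print(f"odds ratio = {ratio:.1f}")
print(f"proportion ratio = {proportion_ratio_from_odds_ratio(ratio, 0.36):.1f}")

# With small proportions the two ratios nearly coincide:
print(f"{odds_ratio(0.02, 0.01):.2f} vs {0.02 / 0.01:.2f}")
```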

[Figure: bar chart of proportion playing (%) by sex: boys a = 75% playing and c = 25% not playing; girls b = 36% playing and d = 64% not playing.]

Ratio of Counts
· Example:
93 injuries per 1000 player-hours of match play in sport A.
69 injuries per 1000 player-hours of match play in sport B.
· The effect is expressed as a count ratio: 93/69 = 1.35 times more injuries in sport A than in sport B.
· It can also be expressed as 35% more injuries.
· The scale of magnitudes for the count ratio is the same as for ratio of proportions or hazards:
trivial | 1.11 | small | 1.43 | moderate | 2.0 | large | 3.3 | very large | 10 | ext. large
and the inverses 0.9, 0.7, 0.5, 0.3, 0.1.

Inferences When Monitoring Individual Athletes

· Subjective assessments of the coach, athlete, and support personnel provide important evidence.
· An experienced coach notices important changes.
· Objective assessments of change with an instrument or test are contaminated with error or "noise".
· The noise is represented by the standard deviation of repeated measurements: the standard (or typical) error of measurement.
· Think of ± twice the error as the equivalent of confidence limits for the athlete's true change.
· Take into account clinically or practically important changes.
• Wow, you've made a moderate improvement!
• No real change since last time. [High reliability needed here.]
• It looks like you have made a small improvement, but we can’t be sure, because the test is quite noisy. [This is the usual scenario.]
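One way to combine the "± twice the error" limits with a smallest important change is sketched below. The decision wording, the function name and the numbers are illustrative assumptions; positive changes are taken to be improvements.

```python
def assess_change(observed_change, typical_error, smallest_important):
    """Interpret an individual's change using limits of ± twice the typical error."""
    lower = observed_change - 2 * typical_error
    upper = observed_change + 2 * typical_error
    if lower >= smallest_important:
        return "clear improvement"
    if upper <= -smallest_important:
        return "clear decline"
    if lower > -smallest_important and upper < smallest_important:
        return "no real change"                  # needs a highly reliable test
    return "unclear: the test is too noisy to be sure"

# Hypothetical changes in a test score, with a smallest important change of 1.0 unit:
print(assess_change(3.0, typical_error=0.5, smallest_important=1.0))
print(assess_change(0.1, typical_error=0.2, smallest_important=1.0))
print(assess_change(1.5, typical_error=1.0, smallest_important=1.0))
```

The third case mirrors the usual scenario: the observed change exceeds the smallest important change, but the limits are too wide to rule out a trivial or even negative true change.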

Summary
· Inferential statistics are used to make conclusions about the true value of a simple or effect statistic derived from a sample.
· The inference from a null-hypothesis significance test is about whether the true value of an effect statistic could be null (zero).
· Magnitude-based inference addresses the issue of whether the true value could be important (beneficial and harmful, or substantial).
· Effect magnitudes have key roles in research and practice.
· Small, moderate, large, very large and extremely large magnitude thresholds for various effects…
· Mean differences: 0.20, 0.60, 1.2, 2.0, 4.0 of between-athlete SD.
· Solo performance: 0.3, 0.9, 1.6, 2.5, 4.0 of race-to-race CV.
· Correlations: 0.10, 0.30, 0.50, 0.70, 0.90.
· Team performance: 1, 3, 5, 7, 9 games won/lost in every 10 games.
· Ratios of proportions, hazards, counts: 1.11, 1.43, 2.0, 3.3, 10 and their inverses 0.9, 0.7, 0.5, 0.3, 0.1.
· Take noise and thresholds into account when monitoring athletes.
