02.12.2014 - Sample Size Survival Analysis Common Issues in Data Analysis
Research Methodology
Statistics Lecture 6
Sample Size, Survival Analysis and Common Issues in Data Analysis
Rifat Hamoudi
Senior Lecturer
Outline
• Importance of Sample Size calculations - Precision and Power
• Sample size calculation for the difference of 2 means
• Sample size calculation for the difference of 2 proportions
• Survival Analysis
• Common Issues to be aware of in Analysis
Intensive Rehabilitation Post Knee Replacement
• Over 79,000 knee replacement operations were undertaken in the NHS in England in 2011-12
• Following the operation, patients currently receive a short course of inpatient and community rehabilitation and advice to conduct simple exercises at home
• 20% of patients are not satisfied with the outcome of their knee replacement surgery which may in part be a result of inadequate rehabilitation
• Can the activity, independence and quality of life of patients undergoing knee replacement at high risk of a poor outcome be improved using an intensive rehabilitation program?
• Standard care versus intense rehabilitation intervention consisting of:
- Multi-disciplinary team
- Intensive physiotherapy
- Technology-assisted information (e.g. tablet based app, DVD)
Intensive Rehabilitation Post Knee Replacement
• DESIGN: A multi-centre randomised controlled trial of intensive rehabilitation program versus usual rehabilitation following knee replacement
• PRIMARY OUTCOME : The change in the Western Ontario and McMaster Universities Arthritis Index (WOMAC) function domain score between pre and 2 years post-op (range 0-100)
• How many people do we need to recruit into this study? And why does this matter?
Why is it Important to Consider Sample Size?
• To ensure your study will provide useful information, specifically:
1. Estimates which are precise
2. Hypothesis tests that can detect important effects
Why is it Important to Consider Sample Size?
1. Estimates which are precise
- Sample size calculation ensures an estimate has adequate precision
- Prevalence of 10%, sample size of 20 → 95% CI: 1% to 31% (NOT VERY PRECISE)
- Prevalence of 10%, sample size of 400 → 95% CI: 7% to 13% (FAIRLY PRECISE)
Precise Estimates
• Recall: a 95% confidence interval gives a range of values that we are fairly sure includes the true population parameter; across repeated samples, 95% of such intervals would include it
• A 95% confidence interval for the mean is calculated as follows:
$\bar{x} \pm 1.96 \times SE(\bar{x})$, where $SE(\bar{x}) = SD / \sqrt{n}$
• Larger sample size = smaller standard error → higher precision and narrower confidence intervals
• The more people in our sample, the more precise our estimate
• Example:
We want a precise estimate of the mean difference in the 2 year post-op change in WOMAC score between the intensive rehabilitation and control group
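The link between sample size and precision can be sketched numerically. The snippet below is a minimal illustration of the interval mean ± 1.96 × SD/√n; the mean of 50 and SD of 22 are hypothetical values chosen only for the demonstration, and `ci_for_mean` is an illustrative helper name.

```python
import math

def ci_for_mean(mean, sd, n, z=1.96):
    """Approximate 95% CI for a mean: mean ± z * SD / sqrt(n)."""
    se = sd / math.sqrt(n)  # standard error shrinks as n grows
    return mean - z * se, mean + z * se

# Hypothetical values: same mean and SD, two different sample sizes.
for n in (20, 400):
    low, high = ci_for_mean(mean=50.0, sd=22.0, n=n)
    print(f"n={n}: 95% CI {low:.1f} to {high:.1f} (width {high - low:.1f})")
```

With the SD held fixed, multiplying the sample size by 20 narrows the interval by a factor of √20, roughly 4.5.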
Intensive Rehabilitation Post Knee Replacement
Why is it Important to Consider Sample Size?
1. Estimates which are precise
2. Hypothesis tests that can detect important effects
Recap of Hypothesis Tests
• Define null and alternative hypothesis under study (H0 and HA)
• Collect relevant data from a sample of individuals
• Calculate the appropriate test statistic specific to the null hypothesis
• Compute the probability of obtaining your observed results, or something more extreme, when the null hypothesis is true, i.e. the P-value
• Interpret the P-value and present results. The smaller the P-value the stronger the evidence against the null hypothesis.
Hypothesis Tests
• Make one of two decisions:
- Reject the null hypothesis (usually if p<0.05)
- Do not reject the null hypothesis
Errors in Hypothesis Tests
• Two sources of error:
- Type I, α: Incorrect rejection of the null hypothesis (REJECT THE NULL HYPOTHESIS WHEN IT IS TRUE)
- Type II, β: Incorrect non-rejection of the null hypothesis (DO NOT REJECT THE NULL HYPOTHESIS WHEN IT IS FALSE)
• Specifying values such as ‘p=0.05 is significant’ means you are willing to have a type I error rate of 5%. This is the significance level of the test.
Power of the Test
• ‘Power’ is the probability of detecting a true effect, given that it exists
• The probability of rejecting the null hypothesis when it is false
• Power = 1 - P(type II error) = 1 - β = 1 - P(incorrect non-rejection of the null hypothesis)
Power of the Test
• Helpful to think of this as a contingency table:
 | Truth: Difference | Truth: No difference
Test: Difference | True positive (power) | False positive (type I error)
Test: No difference | False negative (type II error) | True negative
Power
• Ideally we would like power = 100%, i.e. the probability of rejecting the null hypothesis when it is false to be 100%
• But this is impossible! There is always a chance of making a type II error → not rejecting the null hypothesis when it is false.
Power and Sample Size
• Power increases with increasing sample size
• A larger sample has greater ability than a small sample to detect a clinically important effect if it exists
• When a sample size is very small, the test may have inadequate power to detect a particular effect = wasted resources!
• Sample size calculations ensure enough power to detect important effects as statistically significant
Example: We want a high probability of finding a meaningful difference in the change in WOMAC score between the intensive rehabilitation and control group, given it actually exists
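How power grows with sample size can be sketched with a normal approximation for a two-sided comparison of two means. This is a rough illustration, not the calculation used in the lecture: `approx_power` is a hypothetical helper, it ignores the negligible chance of rejecting in the wrong direction, and the difference of 8 and SD of 22 are used here purely as illustrative inputs.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_group, d, sd, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    z = NormalDist()
    se_diff = sd * sqrt(2 / n_per_group)   # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)      # e.g. 1.96 for alpha = 0.05
    return z.cdf(d / se_diff - z_crit)

# Illustrative inputs: difference of 8, SD of 22, several group sizes.
for n in (25, 50, 100, 159, 300):
    print(f"n per group = {n}: power = {approx_power(n, d=8, sd=22):.2f}")
```

Under these assumptions, small groups leave the study with less than an even chance of detecting the difference, while larger groups push the power towards (but never to) 100%.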
Intensive Rehabilitation Post Knee Replacement
Sample Size
• On the other hand if the sample size is unduly large the study may be:
- Unnecessarily time consuming
- Expensive
- Unethical
• How do we ensure we have an appropriate sample size then?
• We use a sample size calculation to ensure we have enough, but not too many, patients. It gives the number of patients required.
What values are required to determine sample size?
Suppose you are planning a two group trial with a proposed hypothesis test. Sample size will depend on 4 things that are required for the calculation:
1. Power: assuming there is a true underlying difference, how certain do you want to be of detecting it? E.g. 90%.
2. Significance level: the cut-off below which we will reject the null hypothesis. E.g. p=0.05.
3. The clinically important effect size you wish to detect in the test.
4. Variability of the outcome of interest, i.e. the standard deviation if the outcome is numerical.
CANNOT DO A SAMPLE SIZE CALCULATION WITHOUT ALL 4!!!
Sample Size - 3. Clinically Important Effect Size
• Smallest effect which would be considered clinically or biologically important - The magnitude of the effect which we do not want to overlook.
• Most often clinically important difference in means we wish to detect (or difference in proportions we wish to detect)
• Consider: What would the effect need to be before you and your colleagues would adopt the new treatment?
E.g. a difference in mean pain improvement of 20 out of 100 on the VAS score between group 1 and group 2 would be considered clinically important.
Sample Size - 4. Variability
• How variable is the outcome you are investigating?
• Providing an estimate of variability (SD) before you have collected the data gives the greatest difficulty.
• Use information from published studies with similar outcomes
• Use data from Pilot Study
• Note: Sample size calculation is just an approximation. Use best information available to provide estimates.
Sample size for comparing two independent group means - Methodology
• Use a general formula for calculating sample size
• Using:
- Desired power, 1 - β
- Desired significance level, α
- Smallest clinically important difference you wish to detect
- Standard deviation (sd) of outcome
Sample size for comparing two independent group means
d – smallest difference we wish to detect as significant between the standard treatment and new treatment response
sd – standard deviation of response
α – significance level
β – type II error (1 - power)
patients per group = $f(\alpha,\beta) \times \dfrac{2\,sd^2}{d^2}$
Sample size for comparing two independent group means
• f(α,β) is a function of significance level (α) and power
• f(α,β) = 7.85 or 10.5 for 80% or 90% power respectively, with significance (risk of type I error) set at 5%
• Other often used values of f(α,β):
α (type I error) | β = 0.05 | β = 0.10 | β = 0.20 | β = 0.50
0.10 | 10.82 | 8.56 | 6.18 | 2.71
0.05 | 12.99 | 10.5 | 7.85 | 3.84
0.02 | 15.77 | 13.02 | 10.04 | 5.41
0.01 | 17.81 | 14.88 | 11.68 | 6.64
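The slides give f(α,β) only as a table. The tabulated values agree with the standard expression f(α,β) = (z₁₋α/₂ + z₁₋β)², where z is a standard normal quantile. The sketch below assumes that expression and a two-sided test. The function name `f` simply mirrors the slide notation.

```python
from statistics import NormalDist

def f(alpha, beta):
    """f(alpha, beta) = (z_{1-alpha/2} + z_{1-beta})^2 for a two-sided test."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return (z(1 - alpha / 2) + z(1 - beta)) ** 2

# Rebuild the table: rows are alpha, columns are beta.
betas = (0.05, 0.10, 0.20, 0.50)
for alpha in (0.10, 0.05, 0.02, 0.01):
    row = "  ".join(f"{f(alpha, beta):6.2f}" for beta in betas)
    print(f"alpha={alpha:.2f}: {row}")
```

This makes it possible to obtain f for combinations of α and power that are not in the table.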
Intensive Rehabilitation Post Knee Replacement
To determine how many patients are required in the study we must specify:
1. Desired power
2. Desired significance level, α
3. Smallest clinically important difference in the 2 year change in WOMAC function score to detect
4. Standard deviation of the 2 year change in WOMAC function score
Intensive Rehabilitation Post Knee Replacement
To determine how many patients are required in the study we must specify:
1. Desired power = 90%
2. Desired significance level, α = 0.05 or 5%
3. Smallest clinically important difference in the 2 year change in WOMAC function score to detect
4. Standard deviation of the 2 year change in WOMAC function score
• Angst F, Aeschlimann A, Michel BA, Stucki G. Minimal clinically important rehabilitation effects in patients with osteoarthritis of the lower extremities. J Rheumatol. 2002 Jan;29(1):131-8.
• MCID (minimal clinically important difference) = 8
• SD = 22
Intensive Rehabilitation Post Knee Replacement
- 90% power
- Significance = 0.05
- Clinically meaningful difference we wish to detect is 8
- sd = 22
patients per group = $f(\alpha,\beta) \times \dfrac{2\,sd^2}{d^2} = 10.5 \times \dfrac{2 \times 22^2}{8^2} = 158.8 \approx 159$
→ 318 patients overall need to be recruited (comparing 2 groups)
f(α,β) = 10.5 for 90% power and type I error of 0.05
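The arithmetic above can be reproduced in a few lines. This is a minimal sketch of the slide formula, patients per group = f(α,β) × 2sd²/d²; `patients_per_group` is an illustrative helper name, and the result is rounded up to a whole patient.

```python
import math

def patients_per_group(f_alpha_beta, sd, d):
    """Sample size per group for comparing two independent means."""
    return math.ceil(f_alpha_beta * 2 * sd ** 2 / d ** 2)

# Values from the knee replacement example: 90% power, alpha = 0.05.
n = patients_per_group(10.5, sd=22, d=8)
print(n, "per group,", 2 * n, "overall")

# Halving the detectable difference roughly quadruples the sample size.
print(patients_per_group(10.5, sd=22, d=4), "per group for d = 4")
```

Because d is squared in the denominator, the required sample size is very sensitive to the choice of clinically important difference.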
Intensive Rehabilitation Post Knee Replacement
Required number of patients increases as:
• Clinically important difference, d decreases
• Standard deviation, sd, increases
• Significance level, α, decreases
• Power increases
Further Considerations
• Inflation of sample size to account for possible losses to follow-up
• If the drop out rate is believed to be r% then the adjusted sample size is obtained by multiplying the unadjusted sample size, N, by 100/(100-r)
• Example: Based on a similar study which had a drop out rate of r=15%, for our example (previously estimated N=318) the adjusted sample size is obtained by:
318 × 100/(100-15) = 374.1, i.e. 187.1 per group → 188 per group (376 overall)
Note: Always round up, and round per group so that both groups stay the same size
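The drop-out adjustment is easy to script. The sketch below applies the factor 100/(100-r) to the per-group size and rounds up. Rounding per group is an assumption made here so that the result matches the 188 per group quoted in the example, and `inflate_for_dropout` is an illustrative name.

```python
import math

def inflate_for_dropout(n, dropout_percent):
    """Inflate a sample size n for an expected drop-out rate (in percent)."""
    return math.ceil(n * 100 / (100 - dropout_percent))

per_group = inflate_for_dropout(159, 15)  # 159 per group before adjustment
print(per_group, "per group,", 2 * per_group, "overall")
```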
Studies Which Are Too Small
• Unlikely to produce a conclusive result
• Will not detect realistic, moderate treatment effects which would be clinically important
• Estimate treatment effect imprecisely
• Far too common - Misleading for medicine and further research
• More likely to lead to publication bias: A large trial should publish whatever the result, but a small one may only do so if the result is sensational
Power Statement
• You should include a power statement in a study protocol or in the methods section of a paper
• Shows careful thought has been given to the sample size at the design stage of the investigation and that the study has sufficient power
• Typical statement: To detect a difference of 8 in the two year change in WOMAC functional domain score (SD=22) with 90% power and 5% significance, taking into account a 15% drop out rate, 188 patients in each group are required.
Survival Analysis
• In medical research survival analysis is concerned with the analysis of data in the form of time until some end-point, or event
• Historically, the end-point was often death, but now survival analysis more broadly encompasses more general events
• Survival analysis vs. time to event analysis
Survival Analysis
Examples of survival or time to event data:
• Time to death after entry to a clinical trial
• Time to death following a Myocardial Infarction
• Time to diagnosis of cancer following the acquisition of genetic mutation
• Time to relief of pain after taking an analgesic
Survival Analysis
Special Features of Survival Data
• Data often non-normal and highly skewed
• Observations may be censored
* End-point not observed for some individuals
* Actual survival time is greater than censored survival time, leading to right censored survival times
* This is the most common form of censoring, and the lecture will focus on this
Survival Analysis
AML chemotherapy survival data
SPSS: Analyze -> Survival -> Kaplan-Meier
Survival Analysis
Right censoring complicates the analysis, but the empirical survivor function is easily generalised via the Kaplan-Meier (or product-limit) estimate
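To make the product-limit idea concrete: at each distinct event time t, the running survival estimate is multiplied by (1 - d/n), where d is the number of events at t and n is the number still at risk just before t. Censored individuals count towards n until their censoring time but never towards d. The sketch below is a minimal pure-Python illustration on made-up data, not the SPSS procedure; `kaplan_meier` and the toy times are hypothetical.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (event time, S(t)) pairs.

    times  -- observed follow-up time for each individual
    events -- 1 if the event was observed, 0 if right censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)      # everyone leaves the risk set at their time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:                     # the curve only steps down at event times
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]     # drop events and censored cases at t
    return curve

# Toy data (hypothetical): 6 patients, two of them right censored.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t)={s:.3f}")
```

In practice a statistics package would also provide confidence intervals and a log-rank test for comparing curves between groups.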
Dong et al, 2011
Survival Analysis
Dong et al, 2011
Survival Analysis
Poulogiannis et al, 2010
Comparing different grades of colorectal cancer (Dukes' stage) with genetic markers
Common Issues in Data Analysis
Missing Data
• Missing data is common to almost every dataset
• A common strategy for dealing with missing data is to ignore it, by only analysing observations with complete data (called a complete case analysis)
• This gives valid results if the data is missing completely at random (MCAR), i.e. the missingness does not depend on the missing values or on any observed values
Missing Data
• If this is not the case a complete case analysis can give biased and incorrect results!
• Example: want to know whether a new drug reduces the odds of mortality compared to the standard treatment
• Randomised 200 patients to each arm (N = 400)
• 50 patients dropped out of the study (lost to follow-up) in the new treatment arm, 10 patients dropped out in the standard treatment arm
Missing Data
• Ignoring drop-out, our results are shown in the table (N = 340)
• OR for mortality, comparing the new treatment to the standard treatment: 0.80
• Conclusion: the new treatment reduces the odds of mortality by 20% compared to the standard treatment
 | Standard | New
Survived | 97 | 85
Died | 93 | 65
Missing Data
• What assumptions have we made by ignoring the 60 patients who were lost to follow-up?
• MCAR - We have assumed that the patients who dropped out randomly dropped out and the reason they dropped out was not related to their outcome (i.e. mortality)
• Is this likely to be true?
• No – patients often drop out of trials when they are not improving or begin to do worse
Missing Data
• What if there was 70% mortality among patients who dropped out?
• Standard treatment: 7/10 dropouts died
• New treatment: 35/50 dropouts died
• Result: OR = 1.00. No difference between treatment groups in terms of mortality
 | Standard | New
Survived | 100 | 100
Died | 100 | 100
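Both odds ratios can be checked directly from the two tables. The sketch below computes the odds of death in the new arm divided by the odds of death in the standard arm, first for the complete cases and then after adding back the drop-outs under the 70% mortality scenario. `odds_ratio` is an illustrative helper.

```python
def odds_ratio(died_new, survived_new, died_std, survived_std):
    """Odds of death on the new treatment relative to standard treatment."""
    return (died_new / survived_new) / (died_std / survived_std)

# Complete case analysis: drop-outs ignored (N = 340).
complete_case = odds_ratio(65, 85, 93, 97)

# Scenario: 70% of drop-outs died (35 of 50 new, 7 of 10 standard).
with_dropouts = odds_ratio(65 + 35, 85 + 15, 93 + 7, 97 + 3)

print(f"complete case OR = {complete_case:.2f}")
print(f"OR including drop-outs = {with_dropouts:.2f}")
```

The apparent benefit in the complete case analysis comes entirely from the unequal drop-out between the arms.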
Missing Data
• Ignoring missing data can lead to incorrect results!
• This is true for missing data in both outcome and explanatory variables
• If less than 5% of the data is missing, or if the data is missing completely at random, then a complete case analysis (where we ignore observations with missing data) will probably be ok
• It is therefore important to ensure as little missing data as possible → Investigate and chase up missing data entries
Clustered Data
• Most statistical analyses assume independence between observations
• In many situations data is clustered
• With clustered data, the assumption of independence is violated
Clustered Data
• Examples of clustered data:
- Multiple measurements on the same people
- Patients at the same centre in a multi-centre study
- Children in the same classroom
- Children in the same school
• These data cannot be analysed using the methods we have discussed so far
Clustered Data
• Simple analysis option:
• Aggregate Level Analysis - Base analysis on a suitable summary measure for each unit at the cluster level
• Typical summary measures:
- Mean (e.g. average left and right measures)
- Maximum value
- Minimum value
• The choice of summary measure depends on the purpose of the study
• Point to Remember: If conducting an analysis at the aggregate level, you can only draw conclusions at the aggregate level!!
Clustered Data
• Example: A trial of exercises in people with Parkinson Disease (PD) was carried out. 70 patients were randomised to exercises and 72 to control. Quality of life was measured using the SF-36 (Short Form health survey) at 8 weeks, 16 weeks and 6 months.
The analysis cannot compare all the exercise SF-36 values to all the control SF-36 values. Repeat measurements on the same participant (8 weeks, 16 weeks & 6 months) will be correlated.
Simple Aggregate Analysis Options:
- Compare mean SF-36 for each person: (SF-36 8 wk + SF-36 16 wk + SF-36 6 mo)/3
- Compare minimum SF-36 for each person
- Choose 6 months as the primary end point and compare SF-36 6 month measures for each person
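The aggregate approach can be sketched as follows: collapse each participant's repeated measurements to a single summary value, so that every person contributes one independent observation to the group comparison. The participants and scores below are made up for illustration, and `summarise` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical SF-36 scores: participant -> (group, [8 wk, 16 wk, 6 mo])
scores = {
    "p1": ("exercise", [55, 60, 62]),
    "p2": ("exercise", [48, 50, 55]),
    "p3": ("control", [50, 49, 51]),
    "p4": ("control", [45, 47, 43]),
}

def summarise(scores, summary=mean):
    """Reduce each participant's repeated measures to one value per group."""
    by_group = {}
    for group, values in scores.values():
        by_group.setdefault(group, []).append(summary(values))
    return by_group

print(summarise(scores))               # per-person means
print(summarise(scores, summary=min))  # per-person minimum instead
```

The per-person summaries can then be compared between groups with a standard two-sample method, since each value now comes from a different person.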
Clustered Data
• If we ignore clustering, results will be incorrect because assumptions are violated:
- Our analysis thinks we have more information than we really do
- Estimates will appear too precise, confidence intervals will be too narrow, and p-values will be too small
• You may find significant associations between variables where none exists!
• Carefully consider the structure of your data
Variables on the Causal Pathway
• In many analyses we want to adjust for potential confounders
• When adjusting for explanatory variables, we need to be careful to ensure they are not on the causal pathway
Variables on the Causal Pathway
• Example: does a new drug reduce the risk of stroke in patients with high blood pressure?
• The drug works by reducing blood pressure, which in turn reduces the risk of stroke
• In this example, blood pressure is on the causal pathway between the new drug (explanatory variable) and risk of stroke (outcome variable)
Variables on the Causal Pathway
• What happens if we adjust for patients’ blood pressure in our analysis?
• Y = constant + b1*treatment + b2*blood pressure
• b1 is the effect of treatment after adjustment for blood pressure
• The interpretation of b1 is the change in risk/odds for a stroke for patients with the same blood pressure
• BUT this drug works by reducing blood pressure
Variables on the Causal Pathway
• When we fit this model, the question we are asking becomes “does our new drug reduce the risk of stroke when it doesn’t reduce blood pressure?”
• This is not the question we are interested in!
• Results may suggest the new drug doesn’t work, even if it does
• Carefully consider variables before you adjust for them. Do not adjust for a variable if it is on the causal pathway.
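A small simulation illustrates the problem. Everything in the sketch below is hypothetical: the drug lowers blood pressure by 15 units, and a continuous "risk score" depends on blood pressure alone, so the drug's whole effect runs through blood pressure. The coefficient b1 is then estimated from the model Y = constant + b1*treatment, and again from Y = constant + b1*treatment + b2*blood pressure, using the closed-form least-squares solution for two predictors.

```python
import random

def cov(u, v):
    """Population covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

random.seed(0)
n = 5000
treat = [i % 2 for i in range(n)]                        # 0 = standard, 1 = new drug
bp = [150 - 15 * t + random.gauss(0, 10) for t in treat]  # drug lowers blood pressure
risk = [0.02 * b + random.gauss(0, 0.5) for b in bp]      # risk depends on BP only

# Unadjusted: Y = constant + b1*treatment
b1_unadjusted = cov(treat, risk) / cov(treat, treat)

# Adjusted: Y = constant + b1*treatment + b2*blood pressure
den = cov(treat, treat) * cov(bp, bp) - cov(treat, bp) ** 2
b1_adjusted = (cov(treat, risk) * cov(bp, bp)
               - cov(bp, risk) * cov(treat, bp)) / den

print(f"b1 unadjusted: {b1_unadjusted:.3f}")
print(f"b1 adjusted for blood pressure: {b1_adjusted:.3f}")
```

In this setup the unadjusted b1 recovers the drug's benefit (about 15 × 0.02 = 0.3 lower risk score), while the adjusted b1 is close to zero, because the model has been asked what the drug does once blood pressure is held fixed.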
Assumptions
• Every method of analysis discussed so far makes assumptions
- E.g. the paired t-test assumes normally distributed differences in the outcome
- One-way ANOVA assumes normally distributed data and constant variance
When Assumptions Are not Met
• It is important to check that assumptions hold
• Always check appropriate assumptions
• Minor departures from assumptions are not a big problem but major departures will invalidate results
• Consider an alternative non-parametric test (a test that makes fewer distributional assumptions)
• Sensitivity analysis (ROC, Bland-Altman)
Key Points
• Sample Size is vital to the strength of conclusions
• Ensures precision and adequate power
• To determine sample size for a two group trial you need to know 4 things:
1. Required power
2. Required significance level
3. The minimum clinically important difference you wish to detect
4. The variability of your outcome
• Real datasets come with many problems
• It is important that you’re aware of these problems -> if you ignore them you may end up with incorrect results!