R workshop xiv--Survival Analysis with R
Uploaded by vivian-s-zhang
Survival Analysis in R
Yuan Huang, Project Manager Intern @ SupStat Inc
Kai Xiao, Data Scientist @ SupStat Inc
Survival Analysis in R http://nycdatascience.com/slides/survivial_analysis/index.html#1
1 of 43 6/13/14, 9:49 PM
Outline
· Introduction to survival analysis
  - Data types
  - Statistics of interest (Survival function, Hazard function, Relative risk)
· Method and Implementation in R
  1. Create survival objects
  2. Estimate survival functions
  3. Test for equality of survival functions
  4. Cox proportional hazards model
· Case study: ADDICTS data
Introduction to survival analysis
What is survival analysis?
Survival analysis is a collection of statistical procedures that seek to answer questions such as how long a population can survive past a certain time or event, and which variables can explain this duration. Data often come in the form of the time until an event of interest occurs.
Convention:
· time: years/months/weeks/days from the beginning of follow-up of an individual until an event occurs
· event: death, heart attack, disease incidence, or any other event of interest

· time --> survival time
· event --> failure
Examples:
· Clinical trial
  - Test the effect of a medicine; study the time until disease/death. (event: disease/death)
  - Assess the risk of organ transplant; study the survival time after transplant. (event: death)
· Finance
  - Credit model; study the time to default of a client. (event: default)
· Economics
  - Study the unemployment duration. (event: employment)
· Industrial engineering
  - Study the lifetime of a product: a light bulb fails, a computer crashes.
Data Types
Complete data
· True survival time = Observed survival time (follow-up time)
Truncation
· Truncation may occur when a subject enters the study: observation of a subject depends on the event.
· Subjects may not be observed. If the subjects are observed, the event time is precisely known.
· e.g., instruments with limits of detection
Censoring
· Censoring may occur when a subject leaves the study: the time of the event is not known precisely.
· All subjects are observed, but the event time may not be precisely known.
· Three types of censored data: right censoring, left censoring, and interval censoring.
More on censoring
Right censoring: True survival time > Observed survival time (most common)
· e.g., patients are alive at the end of the follow-up time.
· Note: In this talk, we focus on events with right-censored data.
Left censoring: True survival time < Observed survival time
· e.g., consider following persons until they become HIV positive. We may record a failure when a subject first tests positive for the virus at time t. In this case, we only know that the failure occurred before t, instead of knowing the exact failure time.
Interval censoring: Observed survival time 1 < True survival time < Observed survival time 2
· e.g., consider following persons until they become HIV positive. A subject may have had two HIV tests, where he/she was HIV negative at the time (say, t1) of the first test and HIV positive at the time (t2) of the second test. In such a case, the subject's true survival time lies after time t1 and before time t2, i.e., the subject is interval censored in the time interval (t1, t2).
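As a rough illustration of how these censoring types can be encoded in R, the sketch below uses Surv() from the survival package. The numbers are made up for illustration and are not from the talk; the variable names right and mixed are likewise hypothetical.

```r
library(survival)

# Right censoring: the second subject was still event-free at time 13 (event = 0).
right <- Surv(time = c(9, 13, 18), event = c(1, 0, 1))

# Left and interval censoring via type = "interval2":
#   NA in `time`  -> event happened before `time2` (left-censored)
#   NA in `time2` -> event happened after `time`   (right-censored)
#   both present  -> event happened inside (time, time2)
mixed <- Surv(time  = c(NA, 3, 5),
              time2 = c(4, 7, NA),
              type  = "interval2")

print(right)
print(mixed)
```

The rest of the talk only needs the first, right-censored form.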
Data layout
Event time data are usually represented by the pair (t, d), together with covariates x, where
· t: time
· d: censoring indicator. d=1 if failure and d=0 if censored.
· x: covariates of interest
Data layout: example
Example: Acute Myelogenous Leukemia (AML) data. It is included in the R package "survival".
Description: Survival in patients with AML. The experiment was designed to investigate whether the standard course of chemotherapy should be extended ('maintenance') for additional cycles.
library("survival")
head(aml)
  time status          x
1    9      1 Maintained
2   13      1 Maintained
3   13      0 Maintained
4   18      1 Maintained
5   23      1 Maintained
6   28      0 Maintained
Statistics of interest
Survival function: S(t)
Definition
The survival function gives the proportion of the population still without the event by time t:
$$S(t) = \Pr(T > t)$$
Graph
S(t) is graphed as a decreasing smooth curve, which begins at S(t)=1 at t=0 and heads downward toward zero as t increases toward infinity.
Estimated/Empirical survival curves: $\hat{S}(t)$
Estimator
The survival curve is estimated by the Kaplan-Meier (KM) estimator $\hat{S}(t)$, also known as the "product-limit estimator".
Graph
· $\hat{S}(t)$ is a step function, rather than a smooth curve.
· The estimated survival curve jumps only at observed failure times, and the information from the censored observations contributes to the sizes of the steps.
Hazard function: h(t)
Alternative names
Hazard function, incidence rate, instantaneous risk, and force of mortality
Definition
The hazard function h(t) gives the instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$
· The hazard is an event rate at t for those at risk, rather than a probability. Thus, the values of the hazard function range between zero and infinity.
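The hazard and survival functions determine each other. The standard relations below are a supplement to the slide, assuming T is continuous with density f(t); the symbols f(t) and H(t) (the cumulative hazard) are not used elsewhere in the talk.

```latex
h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\,\log S(t),
\qquad
H(t) = \int_0^t h(u)\,du,
\qquad
S(t) = \exp\bigl(-H(t)\bigr)
```

A constant hazard h(t) = c, for example, gives S(t) = exp(-ct), the exponential survival curve.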
Relative risks
Alternative names
Relative risk, risk ratio, and hazard ratio (RR/HR)
Definition
RR is a measure of the strength of the effect on survival.
· Let $h_1(t)$ denote the hazard rate of the treatment group at time t,
· Let $h_0(t)$ denote the hazard rate of the control group at time t.
The risk ratio is defined by
$$\frac{h_1(t)}{h_0(t)}$$
Goals of survival analysis
Step 1: Estimate and interpret survival and hazard functions from survival data. (Descriptive statistics)
Step 2: Compare survival and/or hazard functions. (Two-sample mean test)
Step 3: Assess the relationship of explanatory variables to survival time. (Regression analysis)
Methods and implementation in R
Data: aml
str(aml) # check variables defined in aml dataset.
'data.frame':	23 obs. of  3 variables:
 $ time  : num  9 13 13 18 23 28 31 34 45 48 ...
 $ status: num  1 1 0 1 1 0 1 1 0 1 ...
 $ x     : Factor w/ 2 levels "Maintained","Nonmaintained": 1 1 1 1 1 1 1 1 1 1 ...
Step 0: Create survival objects
Purpose: Create survival objects
Usage: A survival object is usually used as the response variable in a model formula.
Syntax
Surv(time, event, type=c('right', 'left', 'interval', 'counting', 'interval2', 'mstate'))
· time: for right-censored data, this is the follow-up time.
· event: the status indicator, normally 0=alive, 1=dead.
· type: character string specifying the type of censoring. The default is "right".
Example
surv.aml <- with(aml,Surv(time, status))
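To see what the resulting object looks like, a minimal sketch is to print its first few entries. It only uses the aml data and the Surv() call shown above.

```r
library(survival)

surv.aml <- with(aml, Surv(time, status))

# Censored follow-up times are printed with a trailing "+",
# observed event times are printed as plain numbers.
print(head(surv.aml))

class(surv.aml)        # the object carries class "Surv"
sum(aml$status == 1)   # number of observed events in the data
```

The same Surv(time, status) expression is what goes on the left-hand side of the model formulas in the following steps.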
Step 1: Estimate the survival curves
Estimate the survival curves
· Method: Kaplan-Meier estimator
· Implementation in R: survfit( )
· Visualization: Plot of survival curves
Example: Data
aml[with(aml,x=="Maintained"),]
   time status          x
1     9      1 Maintained
2    13      1 Maintained
3    13      0 Maintained
4    18      1 Maintained
5    23      1 Maintained
6    28      0 Maintained
7    31      1 Maintained
8    34      1 Maintained
9    45      0 Maintained
10   48      1 Maintained
11  161      0 Maintained
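Procedures
Construct a table as follows:
· Each row represents a time point at which an event happens.
· For each row, calculate
  1. Number of people at risk at time t: $n_{risk,t}$;
  2. Number of people who die at time t: $n_{death,t}$;
  3. Survival rate at time t:
$$S(t) = S(t-1) \times \left(1 - \frac{n_{death,t}}{n_{risk,t}}\right)$$
Here S(t − 1) denotes the value in the previous row of the table, with S = 1 before the first event.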
Each row represents a time point at which an event happens.
TIME   # AT RISK   # OF DEATHS   S(T)
9
13
18
23
31
34
48
For each row, calculate the three quantities.
TIME   # AT RISK   # OF DEATHS   S(T)
9      11          1             1 × (1 − 1/11) = 0.909
13     10          1             0.909 × (1 − 1/10) = 0.818
18     8           1             0.818 × (1 − 1/8) = 0.716
23
31
34
48
Finished table:
TIME   # AT RISK   # OF DEATHS   S(T)
9      11          1             1 × (1 − 1/11) = 0.909
13     10          1             0.909 × (1 − 1/10) = 0.818
18     8           1             0.818 × (1 − 1/8) = 0.716
23     7           1             0.716 × (1 − 1/7) = 0.614
31     5           1             0.614 × (1 − 1/5) = 0.491
34     4           1             0.491 × (1 − 1/4) = 0.368
48     2           1             0.368 × (1 − 1/2) = 0.184
Plotting the pairs (t, S(t)) as a step function gives the estimated survival curve.
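The hand calculation above can be reproduced in a few lines of R. This is a sketch of the same procedure, not code from the talk; the names m, event.times, n.risk, n.death, S, km.table and fit.m are chosen here for illustration.

```r
library(survival)

# Maintained arm of the aml data
m <- subset(aml, x == "Maintained")

# Distinct time points at which at least one event (status == 1) happens
event.times <- sort(unique(m$time[m$status == 1]))

# Number at risk at each event time, and number of deaths at it
n.risk  <- sapply(event.times, function(tt) sum(m$time >= tt))
n.death <- sapply(event.times, function(tt) sum(m$time == tt & m$status == 1))

# S(t) = S(t - 1) * (1 - n.death / n.risk), starting from S = 1
S <- cumprod(1 - n.death / n.risk)

km.table <- data.frame(time = event.times, n.risk, n.death, S = round(S, 3))
print(km.table)

# Cross-check against the Kaplan-Meier fit from survfit()
fit.m <- survfit(Surv(time, status) ~ 1, data = m)
print(all.equal(S, summary(fit.m)$surv))
```

Censored subjects never add a row of their own. They only shrink the number at risk in later rows, which is how they affect the step sizes.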
Implementation in R: survfit( )
fit <- survfit(Surv(time, status) ~ x, data=aml)
summary(fit)
Call: survfit(formula = Surv(time, status) ~ x, data = aml)

                x=Maintained
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    9     11       1    0.909  0.0867       0.7541        1.000
   13     10       1    0.818  0.1163       0.6192        1.000
   18      8       1    0.716  0.1397       0.4884        1.000
   23      7       1    0.614  0.1526       0.3769        0.999
   31      5       1    0.491  0.1642       0.2549        0.946
   34      4       1    0.368  0.1627       0.1549        0.875
   48      2       1    0.184  0.1535       0.0359        0.944

                x=Nonmaintained
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5     12       2   0.8333  0.1076       0.6470        1.000
    8     10       2   0.6667  0.1361       0.4468        0.995
   12      8       1   0.5833  0.1423       0.3616        0.941
   23      6       1   0.4861  0.1481       0.2675        0.883
   27      5       1   0.3889  0.1470       0.1854        0.816
Visualization: Plot survival function.
plot(fit,lty = 1:2) # basic plot: plot( ) function.
legend("topright",lty=1:2,legend= levels(aml$x))
· Better-looking survival curves: 1. KM plot with at-risk table 2. Good-looking KM curves
Step 2: Test for equality of survival functions
Question: Is there a statistically significant difference between the two survival curves?
Method:
· 1. Log-rank test: tests equality of two survival curves.
· 2. Stratified log-rank test: tests equality of two survival curves in every stratum of a categorical explanatory variable.
Implementation in R: survdiff( )
Log-rank test
The log-rank test is the most popular test for testing
$$S_1(\cdot) = S_2(\cdot)$$
· By testing the survival curves, we are testing at infinitely many time points.
· What the log-rank test does: if $\frac{h_1(t)}{h_2(t)} = c$ for all t, it tests $H_0: c = 1$ versus $H_1: c \ne 1$.
· The test is based on a statistic constructed over a series of tables.
Calculate log-rank statistic
· Step 1: For each time $t_j$, $j = 1, \dots, J$, with events, construct the table

                                  GROUP 1    GROUP 2    ROW TOTAL
No. of deaths at $t_j$            $d_{1j}$   $d_{2j}$   $d_j$
No. of survivors beyond $t_j$     $s_{1j}$   $s_{2j}$   $s_j$
Column total                      $n_{1j}$   $n_{2j}$   $n_j$

· Step 2: Compute three quantities:
$$O_j = d_{1j}, \qquad E_j = \frac{n_{1j} d_j}{n_j}, \qquad V_j = \frac{n_{1j} n_{2j} d_j s_j}{n_j^2 (n_j - 1)}$$
· Step 3: The log-rank statistic is
$$\chi^2_L = \frac{\left[\sum_{j=1}^{J} (O_j - E_j)\right]^2}{\sum_{j=1}^{J} V_j}$$
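The three steps can be carried out by hand on the aml data. This is a sketch for illustration, not code from the talk. Taking "Maintained" as group 1 is an arbitrary choice made here, and the names tj, in.g1, parts, chisq.hand and p.hand are hypothetical.

```r
library(survival)

# Step 1: distinct event times t_j pooled over both groups
tj    <- sort(unique(aml$time[aml$status == 1]))
in.g1 <- aml$x == "Maintained"   # group 1 (arbitrary choice)

# Step 2: one row (O_j, E_j, V_j) per event time
parts <- t(sapply(tj, function(tt) {
  at.risk <- aml$time >= tt
  died    <- aml$time == tt & aml$status == 1
  n1 <- sum(at.risk & in.g1)     # column total, group 1
  n2 <- sum(at.risk & !in.g1)    # column total, group 2
  n  <- n1 + n2
  d  <- sum(died)                # deaths at t_j, both groups
  s  <- n - d                    # survivors beyond t_j
  V  <- if (n > 1) n1 * n2 * d * s / (n^2 * (n - 1)) else 0
  c(O = sum(died & in.g1), E = n1 * d / n, V = V)
}))

# Step 3: the log-rank statistic and its p-value on 1 degree of freedom
chisq.hand <- sum(parts[, "O"] - parts[, "E"])^2 / sum(parts[, "V"])
p.hand     <- pchisq(chisq.hand, df = 1, lower.tail = FALSE)

# Cross-check against survdiff()
lr <- survdiff(Surv(time, status) ~ x, data = aml)
print(c(by.hand = chisq.hand, survdiff = lr$chisq, p.value = p.hand))
```

The guard on n > 1 only avoids a 0/0 when a single subject remains at risk; it does not come into play for this data set.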
Implementation in R: survdiff( )
Syntax
survdiff(formula, data, subset, na.action, rho=0)
Example
logrank <- survdiff(Surv(time, status) ~ x, data=aml)
# Stratified log-rank test: survdiff(Surv(time,status)~x+strata(sex))
logrank
Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11        7    10.69      1.27       3.4
x=Nonmaintained 12       11     7.31      1.86       3.4

 Chisq= 3.4  on 1 degrees of freedom, p= 0.0653
The p-value is 0.065, which is greater than 0.05. Therefore, there is not sufficient evidence to conclude that the two survival curves differ.
Step 3: Cox proportional hazards model
Cox proportional hazards model
· Model setup
· Model assumption
· Model interpretation
· Implementation in R: coxph( )
Model setup
The Cox PH model specifies the hazard for individual i as
$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 x_{i1} + \dots + \beta_p x_{ip})$$
· $x_{ij}$ is the value of variable j for subject i. x and β do not depend on the time t.
· $\lambda_0(t)$ is the baseline hazard. It depends on the time t and is the same for all individuals.
· Note: There is no intercept term in the Cox model.
The effects of covariates are additive and linear on the log-risk scale:
$$\log(\lambda_i(t)) = \log(\lambda_0(t)) + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$
· $\beta_1 x_{i1} + \dots + \beta_p x_{ip}$ is called the linear predictor or risk score.
Model interpretation
For the model
$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 x_{i1} + \dots + \beta_p x_{ip})$$
· $\beta_k$ is the log risk ratio associated with a one-unit change in $x_k$, given that the other X's are held constant.
Intuition: Look at the model with only one treatment indicator as an example. In this case, the model is $\lambda_i(t) = \lambda_0(t) \exp(\beta X_i)$, where
$$X_i = \begin{cases} 1, & \text{if } i \text{ is treated} \\ 0, & \text{if } i \text{ is control} \end{cases}$$
· For a subject from the control group, $\lambda_i(t) = \lambda_0(t)$
· For a subject from the treatment group, $\lambda_i(t) = \lambda_0(t) \exp(\beta)$
· Hence the hazard ratio between the treatment group and the control group is
$$\frac{\lambda_i(t)}{\lambda_0(t)} = \exp(\beta)$$
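A one-indicator model of this kind can be fitted to the aml data with coxph(). The sketch below is an illustration rather than the talk's own code; the names cox.aml and beta are chosen here. Note that R takes the first factor level, "Maintained", as the reference group, so the indicator equals 1 for the "Nonmaintained" group.

```r
library(survival)

# Cox PH model with the single group indicator x
cox.aml <- coxph(Surv(time, status) ~ x, data = aml)
print(summary(cox.aml))

beta <- coef(cox.aml)   # log hazard ratio, Nonmaintained vs. Maintained
print(exp(beta))        # hazard ratio exp(beta)
```

The exp(coef) column of summary(cox.aml) reports the same hazard ratio, together with a confidence interval.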
Model assumption
Proportional hazards (PH) assumption
· Let $\lambda_i(t)$ denote the hazard function for person i.
· Let $\lambda_j(t)$ denote the hazard function for person j.
· PH requires: $\frac{\lambda_i(t)}{\lambda_j(t)}$ is a constant over time.
· PH means that the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time.
If the model setup is correct, then we can see from the formula that
$$\frac{\lambda_i(t)}{\lambda_j(t)} = \exp(\beta_1 (x_{i1} - x_{j1}) + \dots + \beta_p (x_{ip} - x_{jp}))$$
indeed is a constant over time.
Modeling when the PH assumption is violated
1. Stratified Cox model
It is applied when the PH assumption is not fulfilled across strata, but is satisfied within each stratum.
$$\lambda_i(t) = \lambda_{0k}(t) \exp(\beta_1 X_1 + \dots + \beta_p X_p)$$
where $\lambda_{0k}(t)$ stands for the baseline hazard for the k-th group.
2. Accelerated failure-time (AFT) models
(further inquiry, email [email protected])
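In R, a stratified Cox model can be requested with strata() inside the coxph() formula. The sketch below uses the lung data shipped with the survival package, with sex as the stratifying variable and age as the covariate. The data set and the choice of variables are assumptions made for illustration only; they are not part of the talk.

```r
library(survival)

# Separate baseline hazard lambda_0k(t) for each sex,
# one common coefficient for age across both strata.
# (In lung, status is coded 1 = censored, 2 = dead; Surv() handles this coding.)
fit.strat <- coxph(Surv(time, status) ~ age + strata(sex), data = lung)
print(fit.strat)
```

No coefficient is estimated for the stratifying variable itself; its effect is absorbed into the stratum-specific baseline hazards.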
Check assumptions
1. Model with only one indicator variable
· Graphical approach: Plot the log-log Kaplan-Meier survival estimates against time and evaluate whether the curves are reasonably parallel.
2. General cases
In general cases, we apply a statistical test for assessing the PH assumption. We will skip the theory here. In R, it is implemented by the cox.zph( ) function.
· This statistical test is a test of correlation between the Schoenfeld residuals and survival time (or ranked survival time).
· A correlation of zero supports the proportional hazards assumption (the null hypothesis).
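Both checks can be sketched on the aml data as follows. This is an illustration, not the talk's own code; the names cox.aml, zph and km are chosen here.

```r
library(survival)

# Statistical test of the PH assumption on a fitted Cox model
cox.aml <- coxph(Surv(time, status) ~ x, data = aml)
zph <- cox.zph(cox.aml)
print(zph)   # a small p-value would speak against proportional hazards

# Graphical check: complementary log-log plot of the KM curves per group.
# Roughly parallel curves are in line with the PH assumption.
km <- survfit(Surv(time, status) ~ x, data = aml)
plot(km, fun = "cloglog", lty = 1:2,
     xlab = "time (log scale)", ylab = "log(-log(S(t)))")
legend("topleft", lty = 1:2, legend = levels(aml$x))
```

With only 23 subjects, both the test and the plot have limited power to detect departures from proportionality, so they are best read together.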
Case study: ADDICTS data
Further topics
· Parametric model - Accelerated failure time (AFT) model
· Modeling with time-dependent covariates
· Competing risk model
· Model for interval-censored data
· Model for recurrent events