STT520-420: BIOSTATISTICS ANALYSIS
Dr. Cuixian Chen
Chapter 5: Censoring and Lifetables
STT520-420 1
An observation of a survival r.v. is censored if we don’t know the survival time exactly. Usually there are 3 possible reasons for censoring:
1. the study ends before the event occurs;
2. the subject is lost to follow-up during the study (e.g., they could have moved out of town);
3. the subject withdraws from the study because of death (assuming death is not the event of interest, e.g., death in a car accident) or for some other reason.
These are all “right-censored”, since the event occurs to the right of (later than) the time we last observe the subject.
STT520-420
2
Censoring
STT520-420
3
Right censored data
Examples of censored data
STT520-420
4
Right censored data
STT520-420
5
STT520-420
6
Right censored data representation
Representation #1: (6, 1), (6, 0), (8, 1), (10, 0), (14, 1);
Representation #2: 6, 6+, 8, 10+, 14.
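The two representations carry the same information, and it can help to see one converted into the other. The following is a minimal base-R sketch using the five observations from the slide; the variable names (`obs`, `rep1`, `back`) are illustrative choices, not part of the course material.

```r
# Representation #2 from the slide, stored as strings
obs <- c("6", "6+", "8", "10+", "14")

# A trailing "+" marks a right-censored time
censored <- grepl("\\+$", obs)
time     <- as.numeric(sub("\\+$", "", obs))
status   <- as.integer(!censored)   # 1 = exact, 0 = right-censored

# Representation #1: one (time, status) pair per subject
rep1 <- data.frame(time = time, status = status)
print(rep1)

# Converting back to the "+" notation recovers the original strings
back <- paste0(time, ifelse(status == 1, "", "+"))
print(back)
```

The (time, status) layout is the one most software expects, which is why the datasets later in this chapter carry a separate censoring column.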
STT520-420
7
Left censored data examples
Example 2: if you are studying menarche and you begin following girls at age 12, you may find that some of them have already begun menstruating. Unless you can obtain information about the start date for those girls, the age of menarche is left-censored at age 12.
* From: Allison, Paul. Survival Analysis. SAS Institute. 1995.
Double censoring
Def: A dataset that contains right-censored (RC) observations, left-censored (LC) observations and exact/uncensored observations, but no strictly interval-censored observations, is called double-censored data (DC data).
Case-I interval-censored data occur when each study subject is observed only once and the only observed info for the survival event of interest is whether the event has occurred no later than the observation time.
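A small simulation may make Case-I interval censoring (often called current status data) more concrete. This is only a sketch with made-up distributions: the exponential event times, the uniform visit times, and all variable names are assumptions chosen for illustration.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
n <- 8
event_time <- rexp(n, rate = 0.2)    # true event times (never seen directly)
visit_time <- runif(n, 0, 10)        # the single observation time per subject

# All we record: has the event already happened by the visit?
occurred <- as.integer(event_time <= visit_time)
current_status <- data.frame(visit_time = round(visit_time, 2), occurred)
print(current_status)

# occurred = 1: event time lies somewhere in (0, visit_time]    (left-censored)
# occurred = 0: event time lies somewhere in (visit_time, Inf)  (right-censored)
lower <- ifelse(occurred == 1, 0, visit_time)
upper <- ifelse(occurred == 1, visit_time, Inf)
```

No subject contributes an exact time here; every observation is an interval with one endpoint at 0 or at infinity.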
STT520-420
8
Interval censored data
STT520-420
9
Example 2: if you’re screening subjects for HIV infection yearly, you may not be able to determine the exact date of infection.
** From: Allison, Paul. Survival Analysis. SAS Institute. 1995.
Type I/II censoring
In type I censoring, the # of uncensored/exact observations is a r.v. E.g.: left-censored data; right-censored data; double-censored data; interval-censored data.
In type II censoring, on the other hand, the # of uncensored/exact observations is fixed in advance: only the first r<n lifetimes are observed (e.g., in reliability engineering).
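The contrast between the two types can be sketched in R. The example below assumes one common instance of type I censoring (a study that stops at a fixed time) and uses simulated exponential lifetimes; `stop_time`, `r`, and the other names are hypothetical.

```r
set.seed(2)
n <- 10
lifetime <- sort(rexp(n, rate = 1))   # hypothetical lifetimes, sorted for display

# Type I: stop at a fixed time; the number of exact observations is random
stop_time <- 1
z1     <- pmin(lifetime, stop_time)
delta1 <- as.integer(lifetime <= stop_time)
cat("Type I:  exact observations =", sum(delta1), "\n")

# Type II: stop at the r-th failure; the number of exact observations is fixed at r
r <- 4
cutoff <- lifetime[r]                 # time of the r-th smallest lifetime
z2     <- pmin(lifetime, cutoff)
delta2 <- as.integer(lifetime <= cutoff)
cat("Type II: exact observations =", sum(delta2), "\n")
```

Rerunning with a different seed changes the type I count but never the type II count, which is the distinction the slide draws.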
STT520-420
10
Definitions of Censoring and truncation
STT520-420
11
Left truncated data
STT520-420
12
Right truncated data
STT520-420
13
Truncated data example
With a given telescope, we can only detect a very distant stellar object which is brighter than some limiting flux:
– the object is left-truncated if it lies beyond detection by our telescope
– we can’t tell if the object is even there if we can’t see it.
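The difference between truncation and censoring shows up in the number of rows in the dataset. Here is a rough R sketch of the telescope example; the lognormal fluxes and the detection limit of 1 are invented for illustration.

```r
set.seed(3)
flux  <- rlnorm(1000, meanlog = 0, sdlog = 1)  # hypothetical fluxes of all objects
limit <- 1                                     # hypothetical detection limit

# Left truncation: objects below the limit never enter the dataset at all
detected <- flux[flux > limit]
cat("objects that exist:", length(flux), "\n")
cat("objects recorded:  ", length(detected), "\n")

# Contrast with left censoring: every object would still have a row,
# and the faint ones would be recorded only as "below the limit"
censored_view <- data.frame(value = pmax(flux, limit),
                            below_limit = flux <= limit)
```

With truncation the analyst does not even know how many objects were missed, whereas with censoring the count of incomplete observations is known.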
STT520-420
14
Example: Types of censoring and truncation
STT520-420
15
About HIV and AIDS
HIV is a lot like other viruses, including those that cause the "flu" or the common cold. HIV can hide for long periods of time in the cells of your body.
Over time, HIV can destroy so many of your CD4 cells that your body can't fight infections and diseases anymore. When that happens, HIV infection can lead to AIDS.
AIDS is the final stage of HIV infection. People at this stage of HIV disease have badly damaged immune systems, which put them at risk for opportunistic infections (OIs).
STT520-420
16
STT520-420
17
Example: Types of censoring and truncation
STT520-420
18
Example: Types of censoring and truncation
STT520-420
19
Review--Right censored data representation
Representation #1: (6, 1), (6, 0), (8, 1), (10, 0), (14, 1);
Representation #2: 6, 6+, 8, 10+, 14.
We represent censored data as ordered pairs (Def. 5.2): Y1, Y2, …, Yn are right-censored by t1, t2, …, tn if the sample consists of the pairs (Zi, δi), where Zi = min(Yi, ti) and δi indicates whether the i-th observation is exact (δi = 1) or censored (δi = 0).
Note that Zi equals ti when the observation is censored and equals Yi when it is uncensored.
Assume the Y’s are independent of the t’s.
STT520-420
20
Right Censored Model
$$\delta_i = \begin{cases} 1, & \text{if } Y_i \le t_i \ \text{(uncensored/exact)} \\ 0, & \text{if } Y_i > t_i \ \text{(censored)} \end{cases}$$
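The right-censored model can be mimicked in a few lines of R. This is a sketch under assumed distributions (exponential lifetimes, uniform censoring times drawn independently); the names `Y`, `t_cens`, `Z`, and `delta` simply mirror the notation of the definition.

```r
set.seed(4)
n <- 6
Y      <- rexp(n, rate = 0.1)     # lifetimes
t_cens <- runif(n, 5, 15)         # censoring times, generated independently of Y

Z     <- pmin(Y, t_cens)              # what we actually observe
delta <- as.integer(Y <= t_cens)      # 1 = exact, 0 = censored

print(data.frame(Z = round(Z, 2), delta))
```

Rows with delta = 0 show the censoring time in the Z column, and rows with delta = 1 show the true lifetime.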
Example 5.2 (Stanford Heart Transplant Data) - note the form of the dataset with the Days=Z, Cens=delta, other explanatory variables are Age, and T5.
Note in Example 5.3 the notation of using a “+” sign to represent a right-censored observation; the survival variable is the time until death from the tumor for astrocytoma patients.
STT520-420
21
Examples of censoring
Motivating example: leukemia
STT520-420
22
Go back to Exercise 4.5 on page 68. Note the censoring in the treatment group (with “+”) but not in the placebo group - what is the meaning of these censored observations?
1. Now let’s write the data in Exercise 4.5 in a form that it can be analyzed with various computer programs…
2. How many variables are there of interest? (We need a column for each variable…). How many observations are there? [ I’ll propose ID, remission time, censor indicator, treatment group]
3. Use Excel to organize the data and then we’ll read it into R (or SAS later) for analysis
4. Use read.csv(file=file.choose()) to get the data into R…
STT520-420
23
Example 4.5, page 68
## Example 4.5, page 68
## To read data from a *.csv file from online:
data=read.csv("http://people.uncw.edu/chenc/STT520_420/dataset/EX4.5.csv", header = TRUE);
## Or use local directory:
data=read.csv("E:/EX4.5.csv",header = TRUE);
## To write NEW_data into a *.csv file
## (write.csv always writes the column names; it has no header argument)
write.csv(NEW_data, "Z:/Mydata.csv");
## Z: is your local timmy drive
STT520-420
24
Section 5.4: Lifetable estimates
1. Divide the lifetime axis into fixed disjoint intervals.
2. Estimate the conditional probability of survival across each interval.
3. Estimate S (the survival function) at the endpoints of the intervals.
The intervals of time are represented as
$$I_j = [a_{j-1}, a_j), \quad j = 1, 2, \ldots, k+1; \quad a_0 = 0; \quad a_{k+1} = \infty$$
The choice of the endpoints is up to the data analyst.
In a lifetable, the number at risk in any interval is the number alive and under consideration (not censored) at the start of the interval. For any interval Ij, we write:
Nj = number at risk in Ij;
Dj = number of deaths (or observed failures) in Ij;
Wj = number of observations censored in Ij.
STT520-420
25
Lifetable estimates: nonparametric estimate survival function with right censored data
Note: N1=n, the total sample size is initially at risk
N_j = N_{j-1} - D_{j-1} - W_{j-1}; this shows how those at risk in interval j-1 carry over to interval j.
Write: pj=P(surviving thru Ij | alive at start of Ij)
= P(Y > aj | Y > aj-1) = S(aj)/S(aj-1)
Note that p1=S(a1) since S(a0)=S(0)=1
Then p2=S(a2)/S(a1)=S(a2)/p1 ; so S(a2)=p2*p1 ;
Continuing p3=S(a3)/S(a2)=S(a3)/(p2*p1); so S(a3)=p3*p2*p1;
and so forth till we get Theorem 5.1 (p. 82)
which states that for every j, S(aj)=pj*…*p3*p2*p1 , where pj = the conditional probability of surviving across Ij given alive at the start of Ij . Use this theorem to estimate the survival at the endpoints of the intervals in the lifetable.
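Theorem 5.1 amounts to a running product, which R computes with `cumprod`. The sketch below uses the rounded conditional survival probabilities that appear in the Example 5.7 arithmetic later in these slides, and it assumes they belong to intervals 1 through 5 in the order shown.

```r
# Conditional survival probabilities p_1, ..., p_5 (rounded values from Example 5.7;
# the ordering by interval is an assumption)
p <- c(0.639, 0.795, 0.852, 0.864, 0.953)

# Theorem 5.1: S(a_j) = p_1 * p_2 * ... * p_j
S <- cumprod(p)
print(round(S, 3))

# The same thing written as an explicit recursion, S(a_j) = p_j * S(a_{j-1})
S_loop <- numeric(length(p))
S_prev <- 1                      # S(a_0) = S(0) = 1
for (j in seq_along(p)) {
  S_loop[j] <- p[j] * S_prev
  S_prev <- S_loop[j]
}
```

The last entry of `S` rounds to 0.356, the 5-year survival estimate used in Example 5.7.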
STT520-420
26
Lifetable estimates
In order to get the S’s, we need to estimate the p’s. The usual estimate of a proportion works here (5.3):
$$\hat{p}_j = 1 - \frac{\#\ \text{dying in } I_j}{\text{number w/ potential to die in } I_j}$$
Note that when estimating 1-pj, we’re estimating the conditional probability of dying in the interval, given they were alive at the start of the interval.
We define the effective number at risk as
$$N_j' = N_j - 0.5\,W_j$$
which essentially assumes the censoring occurs uniformly across the interval. So we apply this to our estimator above and get the actuarial estimate
$$\tilde{p}_j = 1 - \frac{D_j}{N_j'}, \quad j = 1, 2, \ldots, k+1$$
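To see what the half-withdrawal adjustment does, here is a tiny R sketch with hypothetical counts for three intervals (the numbers are made up, chosen only so that the at-risk counts propagate correctly).

```r
# Hypothetical lifetable counts for three intervals
N <- c(100, 70, 45)   # number at risk at the start of each interval
D <- c(20, 15, 10)    # deaths in each interval
W <- c(10, 10, 5)     # censored in each interval

# Effective number at risk: censored subjects count as half a person
N.eff <- N - 0.5 * W

# Actuarial estimate versus the estimate that ignores censoring in the denominator
p.tilde <- 1 - D / N.eff
p.naive <- 1 - D / N
print(round(cbind(N, D, W, N.eff, p.naive, p.tilde), 4))
```

Because N.eff is never larger than N, the actuarial estimate of surviving the interval is never larger than the naive one; censored subjects are treated as exposed for only half the interval.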
STT520-420
27
Lifetable estimates
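$$\hat{h}(t) = \frac{\#\ \text{of patients dying in } (t,\, t+\Delta t]}{(\#\ \text{of patients surviving at time } t) \cdot (\Delta t)}$$
That is, the number of patients dying per unit time in the interval, divided by the number of patients surviving at time t.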
If for a given j, Nj’=0, then take the estimate to be 0.
So to estimate S(aj), use
$$\tilde{S}(a_j) = \tilde{p}_1 \tilde{p}_2 \cdots \tilde{p}_j$$
Two basic assumptions for the construction of lifetables are:
1. censor times are independent of lifetimes; this assures the pj is the same for each individual;
2. failure times and censor times in a given interval are uniformly distributed across the interval.
Think of a lifetable as a generalization of a frequency histogram that accounts for right censoring. See Example 5.6 on pages 83-84 of melanoma survival (defined as time from first treatment for melanoma to death, in years). Let’s go over this data carefully to understand the computations: try in Excel, and later in SAS!
STT520-420
28
Lifetable estimates
Example 5.6 on page 83-84 of melanoma survival
The following data are on 913 male and female patients with malignant melanoma, treated in the M.D. Anderson Tumor Clinic between 1944 and 1960. Here the survival time is defined as the time from first treatment for melanoma to death, in years.
Use Lifetable method to find the survivor function.
STT520-420
29
$$N_j' = N_j - 0.5\,W_j$$
$$\tilde{p}_j = 1 - \frac{D_j}{N_j'}, \quad j = 1, 2, \ldots, k+1$$
Example 5.6 on page 83-84
## Third method to read in the dataset from online ##
data=read.csv("http://people.uncw.edu/chenc/STT520_420/dataset/Eg5_6.csv", header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE);
d=data[,1]; ## Deaths
w=data[,2]; ## Censored/Withdraws/Losses of followups
n=data[,3]; ## # of at risk
## Then we start to work on the lifetable estimates…
STT520-420
30
## Example 5.6, page 83-84, Recursive calculation in a lifetable
data=read.csv("http://people.uncw.edu/chenc/STT520_420/dataset/Eg5_6.csv", header = TRUE, sep = ",", quote="\"", dec=".", fill = TRUE);
(D=data[,1]); ## deaths
(W=data[,2]); ## censored/withdraws/loss of follow-ups
(N=data[,3]); ## risk
print(cbind(N,D,W));
## since W10=NA, we reassign its value as 0.
W[10]=0;
## Effective number at risk in j-th interval; use of () here is to print output directly.
(N.eff=N-0.5*W);
## Actuarial estimate of P_j ##
(P=1-D/N.eff);
(Q=1-P);
n=length(N);
S=rep(0, n); ## rep(x, times): a vector of n zeros to hold the survival estimates
for (j in 1:n)
{
S[j]=prod(P[1:j]);
}
print(cbind(N, N.eff, D, W, Q, P, S));
STT520-420
31
Example 5.6 on page 83-84
## Plot the estimated survival function for each interval ##
x=1:10;
S=c(1, S); x=c(0, x); ## Add the starting point of S(0)=1.
plot(x, S, type="s");
Greenwood’s formula gives an error bound around the lifetable estimates. I won’t go through the derivation, but if you’re interested, see pages 85-86.
Theorem 5.3 (Greenwood’s Formula). The standard error of the lifetable estimate \tilde{S}(a_j) is given by
$$SE(\tilde{S}(a_j)) = \tilde{S}(a_j) \sqrt{\sum_{i=1}^{j} \frac{\tilde{q}_i}{\tilde{p}_i N_i'}}, \quad j = 1, 2, \ldots, k+1$$
where \tilde{q}_i = 1 - \tilde{p}_i.
This formula is usable as long as the effective number at risk is not too small in the intervals.
See Example 5.7 on page 87 for a use of this formula. Go over Example 5.8 - use SAS
STT520-420
32
Greenwood’s formula for Lifetable method
Example 5.7: Find the standard error: 5-year survival prospects for melanoma patients by Greenwood formula.
STT520-420
33
Example 5.7, page 87
SE=(0.356)*sqrt(0.047/((0.953)*(149))+0.136/((213)*(0.864))+0.148/((304)*(0.852))+0.205/((468)*(0.795))+0.361/((865)*(0.639)))= 0.01899.
$$SE(\tilde{S}(a_j)) = \tilde{S}(a_j) \sqrt{\sum_{i=1}^{j} \frac{\tilde{q}_i}{\tilde{p}_i N_i'}}, \quad j = 1, 2, \ldots, k+1$$
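The hand calculation in Example 5.7 can be checked in R. The helper `greenwood_se` below is a hypothetical name, not a function from any package, and the sketch assumes the rounded p's and effective numbers at risk from the slide are ordered from interval 1 (N' = 865) to interval 5 (N' = 149).

```r
# Greenwood standard errors for all intervals at once (hypothetical helper)
greenwood_se <- function(p, N_eff) {
  q <- 1 - p
  S <- cumprod(p)                      # lifetable estimates S(a_1), ..., S(a_j)
  S * sqrt(cumsum(q / (p * N_eff)))    # S(a_j) * sqrt(sum_{i<=j} q_i / (p_i N'_i))
}

# Rounded values taken from the Example 5.7 arithmetic (ordering is an assumption)
p     <- c(0.639, 0.795, 0.852, 0.864, 0.953)
N_eff <- c(865, 468, 304, 213, 149)

se <- greenwood_se(p, N_eff)
print(round(se, 4))
```

The fifth entry comes out at about 0.019, matching the slide's 0.01899 up to the rounding of the 5-year survival estimate to 0.356.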
Example 5.8 - use SAS
STT520-420
34
Ij, Dj, Wj, Nj’, qj, SE(qj), est of S(aj), 1- est of S(aj), SE(est of S(aj)) … pdf, SE(pdf), hazard, SE(hazard)
Example 5.8 - use SAS
STT520-420
35
Note: Let’s compare PPT #33 and #34. When we estimate the survival function by hand (in #33), the estimates start from 0.6393 instead of 1, because each estimate is taken at the right endpoint of its interval: \tilde{S}(1) = 0.6393, and so on. We add \tilde{S}(0) = 1 ourselves. The Greenwood formula follows the same convention.
With the SAS output we need to be CAREFUL: it starts from 1 rather than 0.6393, because SAS reports the estimate at the left endpoint of each interval. So read the SAS output as \tilde{S}(0) = 1, \tilde{S}(1) = 0.6393, and so on.
More about lifetable
Background: (1) the # of observations is large; and (2) event times are measured crudely. Also called the actuarial method.
Advantages:
(1) lifetimes are grouped into intervals of time (can be as long or as short as you like);
(2) can produce estimates and plots of the hazard function in SAS/R.
Disadvantages:
(1) the choice of intervals is somewhat arbitrary (uncertainty about how to choose intervals);
(2) inevitable loss of information.
STT520-420
36
Review: How to decide bandwidth for histogram
STT520-420
37
Rule of thumb: start with 5 to 10 bins. Look at the distribution and refine your bins. (There isn’t a unique or “perfect” solution.)
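The rule of thumb is easy to try in R. This sketch bins the same simulated data (hypothetical exponential lifetimes) into 5 and then 10 equal-width bins; `plot = FALSE` returns the counts without drawing, so drop it to see the histograms.

```r
set.seed(5)
x <- rexp(200, rate = 0.5)            # hypothetical lifetimes

# Equal-width bins from 0 up to the next whole number above the largest value
top <- ceiling(max(x))
h5  <- hist(x, breaks = seq(0, top, length.out = 6),  plot = FALSE)   # 5 bins
h10 <- hist(x, breaks = seq(0, top, length.out = 11), plot = FALSE)   # 10 bins

print(h5$counts)
print(h10$counts)
```

The same trade-off applies to the intervals of a lifetable: fewer, wider intervals are more stable but lose more detail.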