Survival Analysis of Cardiovascular Diseases
Washington University in St. Louis
Washington University Open Scholarship
All Theses and Dissertations (ETDs)
Winter 12-1-2013
Survival Analysis of Cardiovascular Diseases
Yuanxin Hu
Washington University in St. Louis
Follow this and additional works at: https://openscholarship.wustl.edu/etd
Part of the Mathematics Commons, and the Statistics and Probability Commons
This Thesis is brought to you for free and open access by Washington University Open Scholarship. It has been accepted for inclusion in All Theses and Dissertations (ETDs) by an authorized administrator of Washington University Open Scholarship. For more information, please contact [email protected].
Recommended Citation
Hu, Yuanxin, "Survival Analysis of Cardiovascular Diseases" (2013). All Theses and Dissertations (ETDs). 1209.
https://openscholarship.wustl.edu/etd/1209
WASHINGTON UNIVERSITY IN ST. LOUIS
Department of Mathematics
Program in Statistics
Survival Analysis of Cardiovascular Diseases
by
Yuanxin Hu
A thesis presented to the
Graduate School of Arts & Sciences
of Washington University in
partial fulfillment of the
requirements for the degree
of Master of Arts
December 2013
Saint Louis, Missouri
Table of Contents
Acknowledgements
Abstract
Introduction
Cardiovascular Diseases
Risk Factors
Risk Scoring Systems and Models
Survival Data and Survival Analysis
Data
Survival Analysis
1) Kaplan-Meier Method
2) Cox Proportional Hazards Model
3) Accelerated Failure Time Model
Tests for Significance
Methods and Procedures
SAS Procedures
Results
Data exploration
Model Selection
Validation of Assumption of Proportionality
Model Optimization
Model Diagnostics
Comparison of Survival Functions
Parametric model
Discussion
References
Acknowledgements
I am very grateful to my supervisor, Prof. Ed Spitznagel, who contributed a great deal of effort and many ideas to the project. He also gave many valuable suggestions and comments on the draft of my thesis. Without his supervision, the thesis could not have been completed.
I also appreciate the suggestions and ideas kindly offered by Prof. Jimin Ding and Prof. Laura Dumitrescu during the revision of my thesis. Its completion owes a great deal to them.
In addition, I want to thank my other teachers and classmates during my studies at the university. They helped me successfully complete the transition from one field to another.
Finally, I want to thank my wife. Without her support, it would have been impossible for me to go back to school.
Therefore, the thesis is not only for me, but also for my supervisor, my teachers, my classmates and my family.
ABSTRACT OF THE THESIS
Survival Analysis of Cardiovascular Diseases
by
Yuanxin Hu
Master of Arts in Statistics
Washington University in St. Louis, 2013
Professor Ed Spitznagel, Chair
Cardiovascular disease (CVD) is a class of diseases of the heart or blood vessels. It is the leading cause of death worldwide. Many risk factors are associated with it, such as family history, gender, and age. Much effort has been devoted to identifying these factors and evaluating their effects on the onset or progression of the diseases, so that onset or development can be predicted and preventive measures can be applied.
Thus far, much of the survival analysis of cardiovascular diseases comes from relatively selected populations followed over short periods, and the findings may not generalize easily to other patients. The current data set is a subset of the data collected since 1948 by the Framingham Heart Study, which originally enrolled 5209 subjects. The current data include 4434 randomly selected subjects living in the community of Framingham.
The main goal of the current study is to build a Cox proportional hazards model by survival analysis using the SAS statistical package. In the analysis, the proportional hazards assumption (time dependence) is tested for individual factors, variables are selected, and their interactions are considered to optimize the model. Because gender has a striking impact on the prediction, the model is stratified by gender, so that a different baseline hazard is applied to the set of variables within each gender group.
In the model, the parameters are estimated by maximum likelihood using the Newton-Raphson algorithm. The results show that gender, diabetes status, age, body mass index, cholesterol and blood pressure affect the onset or development of the diseases. Interestingly, education level has an influence as well.
Introduction
Cardiovascular Diseases
The heart is the organ that pumps blood through the blood vessels to all parts of the body. In the process, it delivers oxygen and nutrients throughout the body and collects waste products so that they can be removed [1]. The process is vital for life. The cardiovascular system includes the heart and the whole network of blood vessels. Diseases or disorders of this system are termed cardiovascular diseases (CVD). They include coronary heart disease, congenital heart disease, stroke or cerebrovascular accident (CVA), congestive heart failure, peripheral arterial disease or peripheral vascular disease, deep venous thrombosis (DVT) and pulmonary embolism, rheumatic heart disease, and other cardiovascular diseases caused by tumors of the heart [2, 3, 4, 5]. The major CVD events, however, are heart attacks and strokes.
Cardiovascular diseases are the leading cause of death worldwide and killed over 17 million people in 2008. They have long been prevalent in developed countries, and over recent decades they have become a severe problem in developing countries as well [6, 7].
Risk Factors
CVD is very common in the population, especially among adults [7]. Several factors contribute substantially to the onset or progression of the disease, such as age, gender, genetics, lifestyle, and some biochemical parameters of the body [8, 9]. They are categorized as uncontrollable factors, such as age, sex and family history [10, 11, 12], and modifiable factors, such as lifestyle and biochemical parameters. For example, people can delay the onset or development of the diseases by changing their lifestyle through exercise, not smoking, and a healthy diet [13, 14, 15, 16]. A healthy lifestyle also improves the biochemical parameters, which decreases the risk of suffering from the diseases [17, 18].
Blood pressure is a well-known risk factor for cardiovascular diseases. Once blood pressure reaches and stays at a high level, the person will sooner or later develop cardiovascular disease, and after a period the damage is irreversible. In theory, higher blood pressure makes the heart work harder to pump blood out. As a result, the wall of the heart thickens at first and then gradually becomes thinner. Eventually, the wall is too thin to contract powerfully, so the pressure in the heart rises higher and higher until it is out of control [20, 21, 22].
Cholesterol is a fatty, wax-like substance found in the blood. A reasonable concentration of cholesterol in the blood is crucial for good health. Cholesterol is carried by two common lipoproteins, high-density lipoprotein (HDL) and low-density lipoprotein (LDL). HDL carries extra fat away from the arteries so that it does not accumulate in them, whereas LDL helps fat build up on the artery wall. The deposited fat narrows the artery and eventually leads to high blood pressure. The normal level of total cholesterol for adults is below 200 mg/dL. The risk of heart attack increases dramatically as the concentration rises [18, 23].
Smoking is another well-known risk factor. In principle, nicotine narrows the blood vessels and raises blood pressure and heart rate. In addition, carbon monoxide, a byproduct of smoking, competes with oxygen in the red blood cells, so less oxygen is delivered to the heart. As a consequence, the artery wall is damaged. Smoking also harms the heart in other ways, such as depositing cholesterol in the arteries and reducing the blood HDL level. Smokers are reported to have twice the risk of early death from heart attack compared with nonsmokers [24, 25].
Diabetes is caused by an impaired response to insulin, which results in a high blood sugar level. As a side effect, it also promotes the accumulation of lipoprotein and increases atherosclerosis. Diabetes usually occurs together with high blood pressure, high cholesterol, a low level of HDL, and overweight (obesity). Each of these factors contributes to the onset or progression of heart failure, and in combination they have a cumulative effect on the diseases [21, 22].
Stress is a part of our lives and is another controllable risk factor. When dealing with a tough situation, the body releases chemicals, such as adrenaline, which can narrow blood vessels and cause high blood pressure [26]. People who experience stress over long periods are prone to coronary artery disease.
Age, sex and family history (genetic heredity) are uncontrollable risk factors. Aging is associated with a progressive decline in numerous physiological processes and with fundamental morphological and structural changes of the heart. Specifically, aging results in dysfunction of the endothelium, which leads to the death and decline of heart endothelial cells [11, 12]. Clinically, the dysfunction results in high blood pressure and increases the risk of developing atherosclerosis, hypertension, stroke and atrial fibrillation.
Certain genetic mutations can raise the likelihood of having heart and vascular diseases, and this risk can be assessed from family history. It is reported that subjects most likely carry risk mutations if a close relative has had heart disease by a certain age, for example, 55 years for males and 65 years for females [27, 28]. They inherit a greater tendency toward heart disease than people who have no family history of CVD.
Interestingly, sex is found to be a critical risk factor for CVD: men are more likely than women to have heart and blood vessel problems at an earlier age, especially compared with women before menopause. It is reported that estrogen may play a critical protective role. Estrogen is known to up-regulate HDL, which is good for the health of the heart [11].
The above introduces the risk factors one by one, but they do not act separately. Their impacts on the development of the diseases can be cumulative or mutually reinforcing, so all the risk factors should be considered interactively, in an integrated way. It is therefore meaningful to build a system or model that considers all potential risks. Effective prediction or evaluation of risk can then be made, and suggestions and treatments can be applied to prevent or delay the onset or progression of the disease.
Risk Scoring Systems and Models
Because of the high incidence and cost of CVD, predicting the risk of cardiovascular events is important. It helps clinical practice guidelines target primary prevention and recommend that providers evaluate patients for cardiac risk factors that warrant medical treatment.
Thus far, several scoring systems for evaluating the risk of CVD have been developed, such as the Framingham Scoring System (FSS) [30], Joint British Societies (JBS) [31] and ASSIGN [32, 33]. In general, the systems assign scores to individual risk factors, and the risk of cardiovascular disease is evaluated from the sum of the scores: the higher the score, the greater the risk. Among them, the FSS is the most popular. The others follow similar ideas and scales for individual risks but introduce specific criteria or scores for specific populations. The Framingham system can be used to evaluate the risk of several CVD outcomes over periods ranging from 4 to 12 years, such as stroke, coronary disease, myocardial infarction, and death from either coronary or cardiovascular disease [30, 31, 32, 33, 34, 35].
The JBS has developed its own guideline, called the British National Formulary (BNF), for evaluating cardiovascular risk, mainly the risk of coronary disease and stroke. It inherits the scoring rules and scales used by the FSS [34, 35]. Both the FSS and the JBS are limited in how well they generalize to all populations. The ASSIGN scoring system was developed in conjunction with the Scottish Intercollegiate Guidelines Network by introducing more risk factors, such as family history and social deprivation [32, 33]. It serves the Scottish population well.
In general, all the systems treat age as a dominant factor in the evaluation [34]. This can be a limitation, because the risk the systems assign to healthy older people may be higher than the risk they assign to younger CVD patients. Since preventive treatments and benefits are designed for the high-risk population, young patients who are actually at high risk may be ignored. Therefore, the existing scoring systems can be improved [37, 38, 39, 40].
In addition, the accuracy of the risk evaluation is not always satisfactory, because the impacts of risk factors vary greatly between populations and between individuals within the same population. They can also differ among races, types of disease, and so on. Thus far, the systems can only evaluate risks for particular heart diseases. For example, the FSS can only predict the risk of coronary disease events; other diseases are beyond its capability [41, 42, 43, 44, 45, 46]. Moreover, the predictions can only be applied to subjects of certain ages.
To date, over a hundred other models for predicting cardiovascular risk have been developed, such as the Framingham (FRS) model for CVD, the Prospective Cardiovascular Munster (PROCAM) model for CHD, and the Systematic Coronary Risk Evaluation (SCORE) model for CVD mortality. Unfortunately, only a few of them have been validated repeatedly and in different cohorts, such as the Scottish Heart Health Extended Cohort (SHHEC), Diabetes Audit and Research in Tayside, Scotland (DARTS), FINRISK, Framingham Study (FRS), Framingham Offspring Study (FRS-O), Prospective Cardiovascular Munster Study (PROCAM), QRESEARCH Database, Systematic Coronary Risk Evaluation (SCORE) and United Kingdom Prospective Diabetes Study (UKPDS) [46, 47, 48, 49]. These models improve on some points, but the above limitations remain.
Survival Data and Survival Analysis
Survival data are collected to investigate the time to an event. The event can be death, occurrence of a disease, machine failure, the end of a marriage, and so on [50]. A common characteristic of survival data is censoring, truncation, or a combination of both, caused by the end of the investigation, dropout of subjects, or an experimental design with a threshold. The information for censored or truncated subjects is therefore not completely known. Censoring or truncation requires special treatment during data analysis, which must consider the time to event and the censoring indicator together, because the classical methods are not appropriate. For example, simply comparing the mean time to event among groups with a t-test or linear regression ignores censoring. Likewise, comparing the proportion of events among groups with a risk/odds ratio test or logistic regression does not take time into account. The analysis should therefore consider both. For example, the likelihood function must combine the terms for observed and censored data: an observed event at time t contributes the density function f(t), while an observation censored at time t contributes the survival function S(t), the probability of surviving beyond t. This construction relies on the assumption that censoring is unrelated to the future course of the subject: for subjects with the same set of explanatory variables, the distribution of the residual lifetime does not differ between censored and uncensored observations [53, 54].
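The way observed and censored records enter the likelihood can be illustrated with a minimal sketch. It assumes an exponential model with constant hazard `rate`, which is chosen only for simplicity and is not the model used in this thesis; the follow-up times and the function name `exponential_loglik` are hypothetical.

```python
import math

def exponential_loglik(rate, times, events):
    """Log-likelihood of an exponential model under right censoring.

    An observed event (events[i] == 1) contributes log f(t) = log(rate) - rate * t.
    A censored observation (events[i] == 0) contributes log S(t) = -rate * t.
    """
    total = 0.0
    for t, d in zip(times, events):
        if d == 1:
            total += math.log(rate) - rate * t
        else:
            total += -rate * t
    return total

# Hypothetical follow-up times in days; 1 = event observed, 0 = censored.
times = [120.0, 340.0, 500.0, 500.0, 75.0]
events = [1, 1, 0, 0, 1]

# For the exponential model the maximizer has a closed form:
# number of observed events divided by total time at risk.
rate_hat = sum(events) / sum(times)
print(rate_hat, exponential_loglik(rate_hat, times, events))
```

Dropping the censored records, or treating their times as event times, would change both the numerator and the denominator of `rate_hat`, which is why the censoring indicator has to be carried along with the time.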
Data
The current investigation uses a subset of the data from the Framingham Heart Study, a long-term prospective study of the etiology of cardiovascular disease among the population living in the community of Framingham, Massachusetts. The study began in 1948 with 5209 subjects enrolled initially. The participants have been examined biennially and followed continuously through regular surveillance for cardiovascular outcomes. Clinic examinations have included cardiovascular disease risk factors and markers of disease, such as blood pressure, blood chemistry, lung function, smoking history, health behaviors, ECG tracings, echocardiography, and use of medications. Through regular surveillance, the study reviews and adjudicates events for the occurrence of angina pectoris, myocardial infarction, heart failure, and cerebrovascular disease.
The current data set includes 4434 participants with three examination periods, approximately 6 years apart, from 1956 to 1968. It contains laboratory results, clinical diagnoses, health questionnaires, and adjudicated events. Each participant was followed for a total of 24 years for the outcome events: angina pectoris, myocardial infarction, atherothrombotic infarction or cerebral hemorrhage (stroke), or death.
The data set is provided in longitudinal form. Each participant has 1 to 3 observations, depending on the number of exams attended. Overall, there are 11,627 observations on the 4434 participants. For example, the participant with RANDID 95148 attended three exams (period = 1, 2, 3). She was a female (SEX = 2) aged 52 years at the first visit, and 58 and 64 years at the following two visits. The baseline time was set to 0 (time = 0), and the following two exams took place on days 2128 and 4192 (time = 2128, 4192). The column timemifc gives the time to event (occurrence of MI), 3607 days counted from the baseline visit. Participant 95148 therefore entered the study (time = 0, period = 1) free of prevalent coronary heart disease (prevchd = 0), and an MI event occurred on day 3607 after the baseline examination, so her time to MI is 3607 days. At her second exam on day 2128 (period = 2) she was still free of disease (prevchd = 0), but the event occurred before her third visit (prevchd = 1, period = 3). The variable mi_fchd shows that the participant had the event during follow-up (mi_fchd = 1 on all three records), but it does not tell when the event occurred. Some other participants had no event, so they are censored cases (prevchd = 0, mi_fchd = 0) for all three exams.
Example
RANDID  age  SEX  time  period  prevchd  mi_fchd  timemifc
95148   52   2    0     1       0        1        3607
95148   58   2    2128  2       0        1        3607
95148   64   2    4192  3       1        1        3607
The variables in the data set are described in Table 1 and Table 2.
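A survival analysis needs one (time, event) pair per participant, whereas the longitudinal file repeats the outcome columns on every exam record. The sketch below shows one possible reduction, keeping only the baseline exam and excluding participants with prevalent coronary heart disease. This rule and the function name `baseline_survival_rows` are assumptions made for illustration; the thesis text does not state the exact rule that was used.

```python
# Each record: (randid, period, time, prevchd, mi_fchd, timemifc),
# copied from the example rows for participant 95148.
records = [
    (95148, 1, 0, 0, 1, 3607),
    (95148, 2, 2128, 0, 1, 3607),
    (95148, 3, 4192, 1, 1, 3607),
]

def baseline_survival_rows(records):
    """Keep one row per participant as (randid, time_to_event, event)."""
    rows = {}
    for randid, period, time, prevchd, mi_fchd, timemifc in records:
        # Use the baseline exam only, and skip subjects with prevalent disease.
        if period == 1 and prevchd == 0:
            rows[randid] = (randid, timemifc, mi_fchd)
    return list(rows.values())

print(baseline_survival_rows(records))
```

For participant 95148 this yields a single row with time 3607 and event indicator 1.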
Table 1 Description of data (part I)
Variable  Description                                   Range, or coding (count)
RANDID    Unique identification ID                      2448-9999312
SEX       Sex                                           1 = male (5022); 2 = female (6605)
PERIOD    Examination cycle                             1 = period 1 (4434); 2 = period 2 (3930); 3 = period 3 (3263)
TIME      Number of days since baseline exam            0-4854
AGE       Age at exam (years)                           32-81
SYSBP     Systolic blood pressure (mmHg)                83.5-295
DIABP     Diastolic blood pressure (mmHg)               30-150
BPMEDS    Use of anti-hypertensive medicine             0 = not used (10090); 1 = using (944)
CURSMOKE  Current smoking                               0 = nonsmoker (6598); 1 = smoker (5029)
CIGPDAY   Number of cigarettes smoked per day           0 = nonsmoker; 1-90 cigarettes per day
EDUC      Attained education                            1 = 0-11 years; 2 = high school diploma; 3 = vocational school; 4 = college (BA, BS) or above
TOTCHOL   Total cholesterol (mg/dL)                     107-696
HDLC      High density lipoprotein cholesterol (mg/dL)  10-189 (period 3 only)
LDLC      Low density lipoprotein cholesterol (mg/dL)   20-565 (period 3 only)
BMI       Body mass index                               14.4-56.8
GLUCOSE   Serum glucose (mg/dL)                         39-478
DIABETES  Diabetic (casual glucose >= 200 mg/dL)        0 = not diabetic (11097); 1 = diabetic (530)
HEARTRTE  Heart rate (beats/min)                        37-220
PREVAP    Prevalent angina pectoris at exam             0 = free of disease (11000); 1 = prevalent disease (627)
PREVCHD   Prevalent coronary heart disease at exam      0 = free of disease (10785); 1 = prevalent disease (842)
PREVMI    Prevalent myocardial infarction               0 = free of disease (11253); 1 = prevalent disease (374)
PREVSTRK  Prevalent stroke                              0 = free of disease (11475); 1 = prevalent disease (152)
PREVHYP   Prevalent hypertension                        0 = free of disease (6283); 1 = prevalent disease (5344)
Table 2 Description of data (part II)
Variable  Description
ANGINA    Angina pectoris
HOSPMI    Hospitalized myocardial infarction
MI_FCHD   Hospitalized myocardial infarction or fatal coronary heart disease
ANYCHD    Angina pectoris, myocardial infarction, or fatal coronary heart disease
STROKE    Atherothrombotic infarction, cerebral embolism, intracerebral hemorrhage, subarachnoid hemorrhage, or fatal cerebrovascular disease
CVD       Myocardial infarction, atherothrombotic infarction, cerebral embolism, intracerebral hemorrhage, subarachnoid hemorrhage, or fatal cerebrovascular disease
HYPERTEN  Hypertension: either systolic >= 140 mmHg or diastolic >= 90 mmHg at the first or second exam
DEATH     Death from any cause
TIMEAP    Number of days from baseline exam to the first ANGINA event, or to the censor date
TIMEMI    Number of days from baseline exam to the first HOSPMI event, or to the censor date
TIMEMIFC  Number of days from baseline exam to the first MI_FCHD event, or to the censor date
TIMECHD   Number of days from baseline exam to the first ANYCHD event, or to the censor date
TIMESTRK  Number of days from baseline exam to the first STROKE event, or to the censor date
TIMECVD   Number of days from baseline exam to the first CVD event, or to the censor date
TIMEHYP   Number of days from baseline exam to the first HYPERTEN event, or to the censor date
TIMEDTH   Number of days from baseline exam to death, or to the censor date
Survival Analysis
The purpose of the current investigation is to build a predictive model to estimate the time to event for an individual. Because censored observations are present, the response variable is a combination of the time to event and the censoring indicator (time*censor).
The censoring status of the ith subject can be defined with an indicator,
\[ \delta_i = \begin{cases} 1, & \text{if the event is observed} \\ 0, & \text{if the observation is censored} \end{cases} \]
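To predict the time to event from survival data, the most convenient approach is to identify the probability density function f(t) of the time to event, from which the survival function S(t) and the hazard function h(t) are easily obtained. For a continuous time to event T, the density function is
\[ f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}. \]
The survival function of the time to event can be derived by
\[ S(t) = P(T > t) = \int_t^{\infty} f(u)\,du. \]
The hazard function is, by definition,
\[ h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}. \]
For various survival data, different probability distributions can be selected, such as the exponential, Weibull or lognormal distribution.
For example, the probability density function of the Weibull distribution with scale parameter \lambda > 0 and shape parameter \alpha > 0 is
\[ f(t) = \alpha \lambda t^{\alpha - 1} \exp(-\lambda t^{\alpha}), \quad t \ge 0, \]
and its survival function and hazard function are
\[ S(t) = \exp(-\lambda t^{\alpha}), \qquad h(t) = \alpha \lambda t^{\alpha - 1}. \]
Once one of the three functions is known, the others are easily obtained through the following relations, where H(t) is the cumulative hazard function:
\[ H(t) = \int_0^t h(u)\,du = -\ln S(t), \qquad S(t) = \exp[-H(t)], \qquad f(t) = h(t)\,S(t) = -\frac{dS(t)}{dt}. \]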
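The relations among the density, survival and hazard functions can be checked numerically. The sketch below assumes the Weibull parametrization f(t) = alpha * lam * t^(alpha - 1) * exp(-lam * t^alpha), which is one common convention and may differ from the one SAS uses; the parameter values are hypothetical.

```python
import math

def weibull_pdf(t, lam, alpha):
    return alpha * lam * t ** (alpha - 1) * math.exp(-lam * t ** alpha)

def weibull_survival(t, lam, alpha):
    return math.exp(-lam * t ** alpha)

def weibull_hazard(t, lam, alpha):
    return alpha * lam * t ** (alpha - 1)

def weibull_cum_hazard(t, lam, alpha):
    return lam * t ** alpha

# Hypothetical parameter values and time point.
lam, alpha, t = 0.02, 1.5, 10.0

f = weibull_pdf(t, lam, alpha)
S = weibull_survival(t, lam, alpha)
h = weibull_hazard(t, lam, alpha)
H = weibull_cum_hazard(t, lam, alpha)

print(h, f / S)          # hazard equals density divided by survival
print(S, math.exp(-H))   # survival equals exp(-cumulative hazard)
```

With alpha = 1 the hazard reduces to the constant lam, which is the exponential distribution.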
In practice, however, the distribution may not be pinned down exactly. Currently, three methods are popular for survival data analysis: the Kaplan-Meier nonparametric method, the Cox proportional hazards semiparametric method, and the parametric method, which is also called the Accelerated Failure Time (AFT) model.
1) Kaplan-Meier Method
It is also called the one-sample nonparametric method. The survival function can be estimated with the Kaplan-Meier plot, the life table, or the cumulative hazard estimator [53, 54, 56]. The method is very popular because it requires no mathematical assumption about the form of the hazard function or about proportional hazards. It simply uses the empirical probability of surviving beyond a given time, and it assumes that the survival function is constant between successive distinct observed event times. It is widely applied to study the survival function of a population and is very commonly used to compare survival functions among groups. If the sample size is large enough, the estimated survival function represents that of the population.
The method does not take covariates into account, so it is mainly descriptive. The estimated survival function is usually presented as a step plot that decreases at each event time. The method therefore mainly deals with categorical predictors or grouped continuous variables. It cannot handle continuous variables directly, and it cannot accommodate time-dependent variables. Even with these limitations, it is still very popular, because it handles censored data well, particularly right-censored data.
There are two ways of estimating the survival function. The first, proposed by Kaplan and Meier, is called the Product-Limit estimator. It offers an efficient way of estimating the survival function for all values of time t in the observed range when right-censored observations are present:
\[ \hat{S}(t) = \begin{cases} 1, & t < t_1 \\ \prod_{t_i \le t} \left[ 1 - \frac{d_i}{Y_i} \right], & t_1 \le t \end{cases} \]
Here t_1 < t_2 < ... are the distinct event times, d_i is the number of events at time t_i, and Y_i is the number of subjects at risk at t_i, that is, the subjects who have neither experienced the event nor been censored before t_i. Before the first event time t_1 no event has occurred, so the survival function is 1; each time an event happens, the survival function decreases. The same notation applies to all the following equations.
The variance of the estimator is estimated by Greenwood's formula,
\[ \hat{V}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \le t} \frac{d_i}{Y_i (Y_i - d_i)}. \]
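The Product-Limit estimator and its Greenwood variance can be sketched in a few lines. This is an illustration of the standard formulas on a small hypothetical sample, not the SAS procedure used in the thesis; the function name `kaplan_meier` and the data are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t_i, d_i, Y_i, S_hat, var_hat)] at each distinct event time."""
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    surv, greenwood_sum, out = 1.0, 0.0, []
    for t_i in event_times:
        d_i = sum(1 for t, d in zip(times, events) if t == t_i and d == 1)
        y_i = sum(1 for t in times if t >= t_i)  # still at risk at t_i
        surv *= 1.0 - d_i / y_i
        if y_i > d_i:
            # The Greenwood term is undefined when everyone at risk fails.
            greenwood_sum += d_i / (y_i * (y_i - d_i))
        out.append((t_i, d_i, y_i, surv, surv ** 2 * greenwood_sum))
    return out

# Hypothetical sample: 1 = event observed, 0 = right-censored.
times = [2, 3, 3, 5, 6, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]

km = kaplan_meier(times, events)
for row in km:
    print(row)
```

Censored subjects never trigger a step in the curve, but they stay in the risk set Y_i until their censoring time, which is how the estimator uses their partial information.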
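Once the survival function has been estimated, the cumulative hazard function can be calculated by the formula
\[ \hat{H}(t) = -\ln \hat{S}(t). \]
For small samples, however, the cumulative hazard is better estimated by the Nelson-Aalen estimator. With d_i events and Y_i subjects at risk at event time t_i, it follows the formula
\[ \tilde{H}(t) = \begin{cases} 0, & t < t_1 \\ \sum_{t_i \le t} \frac{d_i}{Y_i}, & t_1 \le t \end{cases} \]
The corresponding Nelson-Aalen estimator of the survival function is
\[ \tilde{S}(t) = \exp[-\tilde{H}(t)], \]
and the estimated variance of the cumulative hazard estimator is
\[ \sigma_H^2(t) = \sum_{t_i \le t} \frac{d_i}{Y_i^2}. \]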
The Nelson-Aalen estimator has two primary uses. The first is in selecting between parametric models for the time to event. For example, for an exponential distribution with hazard rate λ, the plot of the Nelson-Aalen cumulative hazard estimate \tilde{H}(t) versus time t is approximately linear with slope λ. The second is in providing crude estimates of the hazard rate h(t).
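A minimal sketch of the Nelson-Aalen estimator, its variance, and the survival estimate derived from it is given below. The sample is hypothetical and the function name `nelson_aalen` is an assumption made for illustration.

```python
import math

def nelson_aalen(times, events):
    """Return [(t_i, H_tilde, var_H, S_tilde)] at each distinct event time."""
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cum_hazard, variance, out = 0.0, 0.0, []
    for t_i in event_times:
        d_i = sum(1 for t, d in zip(times, events) if t == t_i and d == 1)
        y_i = sum(1 for t in times if t >= t_i)  # still at risk at t_i
        cum_hazard += d_i / y_i
        variance += d_i / y_i ** 2
        out.append((t_i, cum_hazard, variance, math.exp(-cum_hazard)))
    return out

# Hypothetical sample: 1 = event observed, 0 = right-censored.
times = [2, 3, 3, 5, 6, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]

na = nelson_aalen(times, events)
for row in na:
    print(row)
```

Plotting the second column against the first gives the cumulative hazard plot described above; a roughly straight line would be compatible with an exponential model.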
Usually, the time intervals are taken as small as possible, for example, so that only one event occurs at each time point or in each interval. The number of events at a particular time point or interval is therefore very small. As a result, the estimated probability of an event within any single interval is not very accurate, but the overall estimated probability of surviving to each time point is.
The plot of the survival function versus time can be obtained from the Product-Limit estimate \hat{S}(t) or from the Nelson-Aalen estimate \tilde{S}(t) = \exp[-\tilde{H}(t)].
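Since the estimate of the survival function and its variance are easily computed, confidence intervals are not hard to obtain. Three kinds of confidence interval can be calculated: the linear confidence interval, the log-transformed confidence interval, and the arcsine-square root transformed confidence interval. Among them, the linear confidence interval is the most popular. For a fixed time t_0 and confidence level 100(1 - \alpha)%, they follow the formulas below, where Z_{1-\alpha/2} is the 1 - \alpha/2 quantile of the standard normal distribution and \hat{V}[\hat{S}(t)] is the estimated variance of \hat{S}(t).
Linear confidence interval:
\[ \hat{S}(t_0) \pm Z_{1-\alpha/2}\, \sigma_S(t_0)\, \hat{S}(t_0), \quad \text{where } \sigma_S^2(t) = \frac{\hat{V}[\hat{S}(t)]}{\hat{S}^2(t)}. \]
Log-transformed confidence interval:
\[ \left[ \hat{S}(t_0)^{1/\theta},\ \hat{S}(t_0)^{\theta} \right], \quad \text{where } \theta = \exp\left\{ \frac{Z_{1-\alpha/2}\, \sigma_S(t_0)}{\ln \hat{S}(t_0)} \right\}. \]
Arcsine-square root transformed confidence interval:
\[ \sin^2\left\{ \max\left[ 0,\ \arcsin \hat{S}(t_0)^{1/2} - 0.5\, Z_{1-\alpha/2}\, \sigma_S(t_0) \left( \frac{\hat{S}(t_0)}{1 - \hat{S}(t_0)} \right)^{1/2} \right] \right\} \le S(t_0) \le \sin^2\left\{ \min\left[ \frac{\pi}{2},\ \arcsin \hat{S}(t_0)^{1/2} + 0.5\, Z_{1-\alpha/2}\, \sigma_S(t_0) \left( \frac{\hat{S}(t_0)}{1 - \hat{S}(t_0)} \right)^{1/2} \right] \right\}. \]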
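The three pointwise intervals can be compared on a single estimate. The sketch below assumes the standard textbook forms of the linear, log-transformed and arcsine-square root intervals, with sigma defined as the standard error of the survival estimate divided by the estimate itself. The inputs `s_hat` and `var_hat` are hypothetical, and the estimate must lie strictly between 0 and 1.

```python
import math

Z_975 = 1.959964  # standard normal quantile for a 95% interval

def linear_ci(s, var, z=Z_975):
    sigma = math.sqrt(var) / s
    return s - z * sigma * s, s + z * sigma * s

def log_ci(s, var, z=Z_975):
    sigma = math.sqrt(var) / s
    theta = math.exp(z * sigma / math.log(s))  # log(s) < 0, so theta < 1
    return s ** (1.0 / theta), s ** theta

def arcsine_ci(s, var, z=Z_975):
    sigma = math.sqrt(var) / s
    half = 0.5 * z * sigma * math.sqrt(s / (1.0 - s))
    centre = math.asin(math.sqrt(s))
    lower = math.sin(max(0.0, centre - half)) ** 2
    upper = math.sin(min(math.pi / 2, centre + half)) ** 2
    return lower, upper

# Hypothetical estimate: S_hat(t0) = 0.6 with standard error 0.055.
s_hat, var_hat = 0.6, 0.055 ** 2
print(linear_ci(s_hat, var_hat))
print(log_ci(s_hat, var_hat))
print(arcsine_ci(s_hat, var_hat))
```

Unlike the linear interval, the two transformed intervals always stay inside [0, 1], which matters when the estimate is close to 0 or 1.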
2) Cox Proportional Hazards Model
The Cox proportional hazards model is also called the semiparametric regression model. As discussed before, one of the interests of survival analysis is to compare survival functions between groups. The nonparametric method does this well if the groups are identical except for the one characteristic being compared, such as treatment or drug use. In most cases, however, the individuals in a data set differ in more than one characteristic. For example, the subjects in the current data set have demographic variables recorded, such as age, sex and education. These characteristics can affect the survival function. For example, patients with high blood pressure are more prone to develop heart failure than subjects with normal blood pressure. A comparison made without considering these effects might therefore be severely biased. In addition, another goal of survival analysis is to predict the potential survival time of a particular subject, and ignoring this information leads to inaccurate predictions.
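The Cox proportional hazards model considers both the underlying distribution of the time to event and the effects of other characteristics. Each observation therefore consists of the triple (T_i, \delta_i, Z_i(t)), i = 1, ..., n, where T_i is the time on study, \delta_i is the event indicator, and Z_i(t) is the vector of covariates for the ith subject. The covariates can be time-dependent, and
\[ Z_i(t) = (Z_{i1}(t), \ldots, Z_{ip}(t))^t. \]
In general, the hazard function at time t for a subject with covariate vector Z is defined as
\[ h(t \mid Z) = h_0(t)\, c(\beta^t Z), \]
where h_0(t) is the baseline hazard function,
\[ c(\beta^t Z) = \exp(\beta^t Z) = \exp\left( \sum_{k=1}^{p} \beta_k Z_k \right), \]
and Z is a time-fixed covariate vector.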
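The model is called the proportional hazards model because the hazards of two subjects with covariates Z and Z* are proportional: their ratio does not depend on time,
\[ \frac{h(t \mid Z)}{h(t \mid Z^*)} = \frac{h_0(t) \exp\left( \sum_{k=1}^{p} \beta_k Z_k \right)}{h_0(t) \exp\left( \sum_{k=1}^{p} \beta_k Z_k^* \right)} = \exp\left[ \sum_{k=1}^{p} \beta_k (Z_k - Z_k^*) \right]. \]
Consider, for example, a single covariate, educational level, with three levels. The relative risks among the three levels can be calculated from the above formula. Take level 1 as the reference and code the other levels with indicator variables, Z_1 = 1 for level 2 and Z_2 = 1 for level 3 (0 otherwise), with coefficients \beta_1 and \beta_2. Then the relative risk of level 2 to level 1 is \exp(\beta_1), that of level 3 to level 1 is \exp(\beta_2), and that of level 3 to level 2 is \exp(\beta_2 - \beta_1).
The hazard functions of the three levels are
\[ h(t \mid \text{level 1}) = h_0(t), \quad h(t \mid \text{level 2}) = h_0(t) \exp(\beta_1), \quad h(t \mid \text{level 3}) = h_0(t) \exp(\beta_2). \]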
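The relative risks between covariate levels follow directly from the coefficients, because the baseline hazard cancels in the ratio. The sketch below assumes a three-level factor coded with two indicator variables and level 1 as the reference; the coefficient values are hypothetical and are not estimates from the Framingham data.

```python
import math

def hazard_ratio(beta, z, z_star):
    """Hazard ratio h(t|z) / h(t|z_star) under the proportional hazards model."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, z, z_star)))

# Hypothetical coefficients for the indicators of level 2 and level 3.
beta = [0.25, -0.40]

level1, level2, level3 = [0, 0], [1, 0], [0, 1]

rr_2_vs_1 = hazard_ratio(beta, level2, level1)  # exp(beta_1)
rr_3_vs_1 = hazard_ratio(beta, level3, level1)  # exp(beta_2)
rr_3_vs_2 = hazard_ratio(beta, level3, level2)  # exp(beta_2 - beta_1)
print(rr_2_vs_1, rr_3_vs_1, rr_3_vs_2)
```

The third ratio can also be obtained by dividing the second by the first, which is a quick consistency check when reading a table of hazard ratios.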
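Three tests are popular for testing the significance of the covariates' effects on the hazard in the regression: the Wald test, the Likelihood Ratio test, and the Score test. Let b denote the partial maximum likelihood estimate of \beta. Under the null hypothesis H_0: \beta = \beta_0, each statistic has approximately a chi-squared distribution with p degrees of freedom in large samples.
The Wald test is based on the asymptotic normality of the partial maximum likelihood estimate. It follows the formula
\[ X_W^2 = (b - \beta_0)^t\, I(b)\, (b - \beta_0), \]
where \beta_0 is the value under the null hypothesis,
I(b) is the information matrix evaluated at b, and
p is the dimension of the covariate vector.
The Likelihood Ratio test follows the formula
\[ X_{LR}^2 = 2\,[LL(b) - LL(\beta_0)], \]
where LL(\beta) is the log partial likelihood, which in the absence of tied event times is
\[ LL(\beta) = \sum_{i:\, \delta_i = 1} \left[ \beta^t Z_i - \ln \sum_{j \in R(T_i)} \exp(\beta^t Z_j) \right], \]
with R(T_i) the set of subjects still at risk at time T_i.
The Score test follows the formula
\[ X_{SC}^2 = U(\beta_0)^t\, I^{-1}(\beta_0)\, U(\beta_0), \]
where U(\beta) = (U_1(\beta), \ldots, U_p(\beta))^t is the vector of scores, U_h(\beta) = \partial LL(\beta) / \partial \beta_h.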
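The three statistics can be computed by hand for a model with a single binary covariate. The sketch below assumes no tied event times, uses a small hypothetical data set, and fits the coefficient with plain Newton-Raphson iterations on the log partial likelihood. It is meant to show how the Wald, Likelihood Ratio and Score statistics relate to the score and information, not to reproduce the SAS output.

```python
import math

# Hypothetical data: (time, event, z); event 1 = observed, 0 = censored.
data = [(1, 1, 1), (2, 1, 0), (3, 1, 1), (4, 0, 1),
        (5, 1, 0), (6, 1, 1), (7, 0, 0), (8, 1, 0)]

def partial_likelihood_terms(beta):
    """Return (LL, U, I): log partial likelihood, score and information."""
    ll = score = info = 0.0
    for t_i, d_i, z_i in data:
        if d_i == 0:
            continue  # censored subjects enter only through the risk sets
        risk = [z for t, _, z in data if t >= t_i]
        s0 = sum(math.exp(beta * z) for z in risk)
        s1 = sum(z * math.exp(beta * z) for z in risk)
        s2 = sum(z * z * math.exp(beta * z) for z in risk)
        ll += beta * z_i - math.log(s0)
        score += z_i - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return ll, score, info

# Newton-Raphson iterations starting from beta = 0.
b = 0.0
for _ in range(25):
    _, u, info = partial_likelihood_terms(b)
    b += u / info

ll_b, u_b, info_b = partial_likelihood_terms(b)
ll_0, u_0, info_0 = partial_likelihood_terms(0.0)

wald = b * info_b * b                   # (b - 0) I(b) (b - 0)
likelihood_ratio = 2.0 * (ll_b - ll_0)  # 2 [LL(b) - LL(0)]
score_test = u_0 * u_0 / info_0         # U(0) I(0)^-1 U(0)
print(b, wald, likelihood_ratio, score_test)
```

Each statistic would be compared with a chi-squared distribution with one degree of freedom. The Score test needs no fitted coefficient, since it only evaluates the score and information at the null value.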
The semiparametric model is established under the assumption of proportional hazards. The assumption can be tested by checking the time dependence of a covariate, or by plots. To check the time dependence of a covariate, an interaction of the covariate with time (usually log-transformed time) is added to the Cox proportional hazards regression and tested for significance.
In addition, the assumption can be checked graphically. The first method is to plot the cumulative hazard rate, or a transformation of it, against time, stratified on the covariate of interest. For example, let Z_1 be the covariate of interest and \mathbf{Z}_2 = (Z_2, \ldots, Z_p)^t the remaining set of covariates. To test whether Z_1 satisfies the proportional hazards assumption, Z_1 is stratified into K strata and a Cox model with the covariates \mathbf{Z}_2 is fitted within the strata. If the assumption holds, the baseline cumulative hazard functions of the strata are constant multiples of each other, so their logarithms plotted against time are approximately parallel.
Besides the proportional hazards assumption, other aspects of the model need to be checked, such as the appropriate functional form of each covariate in the regression, the accuracy of prediction with the fitted model, and the overall fit of the final model.
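Finding the best form of a covariate means finding the functional form, for example a log or quadratic transformation, that best explains its effect on survival through the proportional hazards model. To find the best form of a covariate, the martingale residual plot is applied. The martingale residual of the ith subject has the following form,
\[ \hat{M}_i = \delta_i - \hat{H}_0(T_i) \exp\left( \sum_{k=1}^{p} b_k Z_{ik} \right) = \delta_i - r_i, \]
where \hat{H}_0(t) is Breslow's estimator of the baseline cumulative hazard, b_k are the estimated coefficients, and r_i is the Cox-Snell residual.
To build the plot, the Cox proportional hazards regression is fitted with the covariates (\mathbf{Z}^*) except the one being tested (Z_1), whose functional form f(Z_1) is to be determined:
\[ H(t \mid \mathbf{Z}^*, Z_1) = H_0(t) \exp(\beta^{*t} \mathbf{Z}^*) \exp[f(Z_1)]. \]
The martingale residuals are then computed and plotted against the tested covariate. The form of f(Z_1) can be determined from the shape of the plot; for example, a linear shape means no transformation is needed.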
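To test the overall fit of the final regression model, the Cox-Snell residual plot is applied. In principle,
suppose the proportional hazards model $h(t \mid Z_j) = h_0(t) \exp(\sum_{k=1}^{p} \beta_k Z_{jk})$ is correctly fitted. If the probability
integral transformation is applied to the true death time $X$, the resulting random variable $U = H(X \mid Z)$ has an
exponential distribution with hazard rate 1. Therefore, if the model is correct and $\hat{\beta}$ is close to the true
$\beta$, the residuals $r_j$ will look like a censored sample from a unit exponential distribution. The residual
$r_j$ is defined as
$$r_j = \hat{H}_0(T_j) \exp\left(\sum_{k=1}^{p} \hat{\beta}_k Z_{jk}\right),$$
where $\hat{H}_0(t)$ is the estimated baseline cumulative hazard rate and $T_j$ is the observed time of subject $j$.
To check whether the $r_j$ behave like a sample from a unit exponential distribution, the Nelson-Aalen estimator of
their cumulative hazard rate,
$$\hat{H}_r(r) = \sum_{r_{(i)} \le r} \frac{d_i}{Y_i},$$
is calculated, where $d_i$ is the number of events at the ordered residual value $r_{(i)}$ and $Y_i$ is the number of residuals still at risk there. Then $\hat{H}_r(r_j)$ is plotted against $r_j$. If
the model fits well, the plot should be a straight line through the origin with slope 1.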
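The Nelson-Aalen step used for the Cox-Snell plot can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the residuals and their event indicators are already available as plain lists; the function name and the example values are hypothetical.

```python
def nelson_aalen(values, events):
    """Nelson-Aalen estimate of the cumulative hazard.

    values : observed values (for the Cox-Snell plot, the residuals r_j)
    events : 1 if the value belongs to an event, 0 if it is censored
    Returns a list of (value, cumulative_hazard) pairs at the distinct
    values where at least one event occurred.
    """
    data = sorted(zip(values, events))
    n = len(data)
    curve = []
    cum_hazard = 0.0
    i = 0
    while i < n:
        current = data[i][0]
        at_risk = n - i  # observations with value >= current
        deaths = 0
        # Collect all observations tied at the current value.
        while i < n and data[i][0] == current:
            deaths += data[i][1]
            i += 1
        if deaths > 0:
            cum_hazard += deaths / at_risk
            curve.append((current, cum_hazard))
    return curve


# Hypothetical residuals, for illustration only.
curve = nelson_aalen([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Plotting the second element of each pair against the first gives the Cox-Snell plot; for a well-fitting model the points lie close to the line through the origin with slope 1.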
3) Accelerated Failure Time Model
The accelerated failure time model (AFT model) is a parametric model for estimating univariate survival and
for censored-data regression problems. A nonparametric model estimates the survival function from
the observations only and does not require any distributional assumption. However, the problem may
become much simpler if the data fit a distribution well, since fewer parameters have to be dealt
with. The AFT model states that the survival function of an individual with covariate vector $Z$ at time $t$ is the
same as the baseline survival function at time $t \exp(\theta^T Z)$, where $\theta$ is a
vector of regression coefficients. That is,
$$S(t \mid Z) = S_0[t \exp(\theta^T Z)],$$
where $\exp(\theta^T Z)$ is called the acceleration factor. It describes how a change in the covariates alters the time scale
relative to the baseline time scale. It implies that the median time to event with covariate $Z$ is the baseline
median time to event divided by the acceleration factor. Similarly, the hazard function can be
calculated, and it is related to the baseline hazard rate by
$$h(t \mid Z) = \exp(\theta^T Z) \, h_0[t \exp(\theta^T Z)].$$
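The relation between the acceleration factor and the median can be verified numerically. The Python sketch below assumes a Weibull baseline with hypothetical parameters `lam` and `alpha` and a hypothetical coefficient; none of these values come from the fitted model of this study.

```python
import math


def baseline_survival(t, lam=0.01, alpha=1.5):
    """Weibull baseline S0(t) = exp(-lam * t**alpha); lam and alpha are hypothetical."""
    return math.exp(-lam * t ** alpha)


def baseline_median(lam=0.01, alpha=1.5):
    """Time at which the Weibull baseline survival equals 0.5."""
    return (math.log(2.0) / lam) ** (1.0 / alpha)


def acceleration_factor(z, theta):
    """exp(theta'Z) for covariate values z and coefficients theta."""
    return math.exp(sum(th * zk for th, zk in zip(theta, z)))


def aft_survival(t, z, theta):
    """S(t | Z) = S0(t * exp(theta'Z))."""
    return baseline_survival(t * acceleration_factor(z, theta))


# Hypothetical covariate value and coefficient, for illustration only.
z, theta = [1.0], [0.4]
median_z = baseline_median() / acceleration_factor(z, theta)
survival_at_median = aft_survival(median_z, z, theta)  # close to 0.5
```

A positive coefficient gives an acceleration factor above 1, so the median time to event for that subject is shorter than the baseline median.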
Furthermore, it follows from the above that the log-transformed time has a linear
relationship with the covariate values:
$$Y = \log T = \mu + \gamma^T Z + \sigma W,$$
where $\gamma = -\theta$ is a vector of regression coefficients, $\mu$ and $\sigma$ are the intercept and scale parameters, and $W$ is the error term with a specified distribution. The survival
function of $T$ then follows as
$$S_T(t \mid Z) = P\left(W > \frac{\log t - \mu - \gamma^T Z}{\sigma}\right) = S_W\left(\frac{\log t - \mu - \gamma^T Z}{\sigma}\right).$$
Since $W$ follows a particular distribution, its survival function $S_W$ is known and is evaluated at
$(\log t - \mu - \gamma^T Z)/\sigma$.
Therefore, the mean time or mean residual life time can be estimated from the model. Maximum
likelihood estimation is applied to estimate the parameters of these models, while their variance-covariance
matrix may be obtained by the delta method.
To select a model, the transformed Nelson-Aalen estimate of the cumulative hazard function is plotted against
some function of time. For example, for the Weibull distribution the cumulative hazard rate can be expressed as
$H(t) = \lambda t^{\alpha}$, so a plot of $\log \hat{H}(t)$ versus $\log t$ should be linear. For the exponential distribution, the
cumulative hazard rate is $H(t) = \lambda t$, so a plot of $\hat{H}(t)$ versus $t$ should be a straight line through the origin. Its slope will be the
parameter $\lambda$. If the linearity does not hold, the parametric model does not fit the
data. When two groups are compared, the Q-Q plot is usually applied to check whether the model fits the data
well. For example, if $S_0(t)$ and $S_1(t)$ are the survival functions of groups 0 and 1, then under the AFT model
$$S_1(t) = S_0(\phi t),$$
where $\phi$ is the acceleration factor between the two groups.
The following will hold as well, since
$$S_0(t_{0p}) = 1 - p = S_1(t_{1p}) = S_0(\phi t_{1p}),$$
so that
$$t_{0p} = \phi t_{1p},$$
where $t_{0p}$, $t_{1p}$ are the p-th quantiles of groups 0 and 1 respectively. A plot of $t_{0p}$ against $t_{1p}$ should therefore be a straight line through the origin.
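The Weibull check described above, linearity of the log cumulative hazard in log time, can be sketched as a least-squares fit in Python. The function name is an assumption, and the input in the usage lines is generated from a hypothetical Weibull hazard rather than from the study data.

```python
import math


def weibull_plot_fit(times, cum_hazards):
    """Least-squares line through the points (log t, log H(t)).

    Returns (alpha, lam): the slope estimates the Weibull shape alpha and
    exp(intercept) estimates lam in H(t) = lam * t**alpha.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(h) for h in cum_hazards]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)


# Hypothetical cumulative hazards from H(t) = 0.02 * t**1.3, for illustration only.
times = [1.0, 2.0, 5.0, 10.0, 20.0]
hazards = [0.02 * t ** 1.3 for t in times]
alpha_hat, lam_hat = weibull_plot_fit(times, hazards)
```

In practice the cumulative hazards would be Nelson-Aalen estimates, and visible curvature in the plotted points, rather than the fitted slope itself, is what signals a poor Weibull fit.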
Tests for Significance
For hypothesis tests, the threshold for significance is a p-value of 0.05. Thresholds used for other purposes
are stated in their respective sections.
Methods and Procedures
SAS Procedures
The SAS 9.1.3 statistical package is used for the current data analysis. In SAS, Proc LIFETEST, Proc LIFEREG
and Proc PHREG are the three major procedures for survival analysis. In general, Proc LIFETEST is used to
compute nonparametric estimates of the survival function by the Kaplan-Meier method. Since we are
interested in estimating the survival distribution function (SDF), the Kaplan-Meier plot can represent the
survival function if the sample size is large enough. In addition, Proc LIFETEST is applied to
plot and compare survival curves among groups. The procedure offers the log-rank test to evaluate
the significance of the differences.
Proc PHREG is applied to perform regression analysis of survival data based on the Cox
proportional hazards model. To build the predictive model for the hazard function, Proc PHREG is
first applied for model selection with the stepwise option. With the same strategy, the model is then extended
by taking interaction terms into account. Finally, the model is finalized by backward elimination,
removing step by step the predictor with the least significance, that is, the highest p-value. The procedure is
also applied to test the time dependence of variables, to generate residual plots for model diagnostics, and to produce
survival function plots for comparison between groups.
Proc LIFEREG is applied to fit parametric models for the failure time, also called
accelerated failure time models. In the current study, it is used to assess the effectiveness of the variables
selected by the Cox proportional hazards model. To fit the model, the failure time is log-transformed,
and the Weibull distribution is specified for the failure time, which corresponds to an extreme value distribution for the disturbance term.
Results
Data exploration
The current data set is summarized in Table 3. It has 11627 observations. There are
continuous variables, such as totchol, age, sysbp, diabp, cigpday, bmi, heartrte and glucose, and variables for
time to event or censoring (timeap, timemi, timemifc, timechd, timestrk, timecvd, timedth, timehyp). The
discrete variables are sex, cursmoke, diabetes, bpmeds and educ, together with the event indicators death,
angina, hospmi, mi_fchd, anychd, stroke, cvd and hyperten. In addition, there are a number of missing values
for the variables totchol, cigpday, bmi, bpmeds, glucose and educ. In the current study, observations with missing values
are removed, giving a complete case analysis.
Table 3 Data summary
Before the analysis, some definitions need to be clarified. First, the events recorded in the current data
include death, angina, hospmi, mi_fchd, anychd, stroke, cvd and hyperten. A death may be related or unrelated to
cardiovascular disease, and a disease-unrelated death is treated as censoring. Disease-related
death and the other events are taken as indications of the onset of cardiovascular disease, since the disease will be
realized sooner or later once these intermediate symptoms appear. Therefore, if any of the above events
happens to a subject, the observation is recorded as an event. Otherwise,
it is treated as a censored observation. Second, the time to event is the period from
baseline to the end of the investigation, the time of dropout, death, or the minimum time to any of the events.
Third, to better investigate the impact of the variables on disease onset, subjects with pre-existing conditions are
removed before the analysis. Finally, since this is a longitudinal data set, only the observations at period 1 are
included in the final data set, in order to reduce the impact of repeated measurements over time. In other words, the study
shows how the survival function is influenced by conditions such as age and biochemical measurements
at that time.
Overall, there are 949 censored observations (censor=0), and 1937 observations with events (censor=1).
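The rule used above to derive the analysis time and the censoring indicator can be sketched in Python. This is a minimal illustration: the function name, the dictionary layout and the example values are assumptions and do not reproduce the SAS data step used in the study.

```python
def time_and_censor(event_flags, event_times, last_follow_up):
    """Return (time, censor) for one subject.

    event_flags    : dict mapping an event name to 1 (occurred) or 0
    event_times    : dict mapping the same names to the recorded time in days
    last_follow_up : time of last contact (end of study, dropout, or a
                     disease-unrelated death), used when no event occurred
    censor is 1 for an observed event and 0 for a censored observation.
    """
    observed = [event_times[name] for name, flag in event_flags.items() if flag == 1]
    if observed:
        # Time to event is the minimum time to any of the events.
        return min(observed), 1
    return last_follow_up, 0


# Hypothetical subject with two events, for illustration only.
flags = {"angina": 1, "stroke": 1, "hospmi": 0}
times = {"angina": 2000, "stroke": 1500, "hospmi": 8766}
subject_time, subject_censor = time_and_censor(flags, times, last_follow_up=8766)
```

The coding follows the convention of the text, where censor=1 marks an event and censor=0 marks a censored observation.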
Model Selection
The goal of the current investigation is to identify risk factors for cardiovascular diseases and their
contribution to the onset and progression of the diseases. To determine the variables to include in the
final model, univariate analysis is applied first to identify the impact of each individual variable on time to
event before proceeding to more complicated model selection. For the categorical variables, namely sex,
smoking status (cursmoke), education level (educ) and diabetes status (diabetes), nonparametric
Kaplan-Meier curves are generated (Figure 1). The significance of each variable is tested individually
by the log-rank test and summarized in Table 4. The log-rank test is preferred to the Wilcoxon test,
since it is more sensitive to differences that occur at a late stage. In addition, the log-rank test is less
affected when the pattern of censoring differs between groups. The results indicate that
sex, diabetes and educ are potential risk factors for the model, since their p-values are less
than 0.0001, far below the significance threshold of 0.05. However, the p-value for
cursmoke is 0.6245, so it will not be included in the final model. In addition, the survival curves for
sex, diabetes and cursmoke do not cross between groups. Level 1 of educ behaves
differently from the other three levels, so those three levels can be treated as a single group when fitting the model,
in order to satisfy the proportional hazards assumption. Therefore, these variables are candidates for the
proportional hazards model.
For the continuous variables, the univariate analysis is performed by Cox proportional hazards regression.
A variable is kept for model selection if its p-value is at most 0.2 to 0.25, in case it
contributes to disease onset or progression jointly with other predictors.
In the current study, the variables totchol, age, sysbp, diabp, bmi and glucose each have a significant impact
on disease onset, with individual p-values less than 0.0001. Although cigpday has a p-value of
0.106, it is kept for model selection since its p-value is less than 0.25. The p-value for heartrte
is 0.2876, which is above the threshold for model selection, but heart rate is a very important predictor, so it is
included for model selection as well. As reported, coronary mortality rates increase progressively
with antecedent heart rate, especially for male patients 35 to 64 years of age [76]. The outcomes
of the Cox proportional hazards analysis for the individual variables are summarized in Table 4
(top). Therefore, the variables sex, educ, diabetes, totchol, age, sysbp, diabp, cigpday, bmi, glucose and heartrte
are kept for model building.
Table 4 Univariate analysis
In the following model building process, the above variables with p-values less than 0.25 are included, which
excludes heartrte. Since educ has four levels, it is included in model selection with the
subgroup educ = 1 as the reference group. To further confirm its impact on disease onset or
progression, it is tested by proportional hazards regression. Its
p-value is 0.0116, which is below the significance threshold of 0.05 (Table 5). This indicates
that the variable should be included in the predictive model. As shown above, education level is
recoded into two levels, level 1 and the combined levels above 1, so that the assumption of
proportional hazards can be satisfied.
Table 5 Significance test for variable: educ
To select the best subset of variables for the model, the stepwise approach in the SAS
procedure PHREG is applied. The stepwise selection process consists of a series of alternating forward selection
and backward elimination steps. The former adds variables to the model, and the latter
removes variables from it. The threshold for entering the model is set at a
p-value of 0.25 (SLENTRY = 0.25), while the threshold for staying in the model is set at a
p-value of 0.15 (SLSTAY = 0.15). That is, only variables with p-values less than 0.25 are tried in the
model, and a variable is kept only if its p-value is less than 0.15. The results of the stepwise
proportional hazards regression are displayed in Table 6. With these settings, the variables
sex, age, totchol, sysbp, diabp, cigpday, bmi, diabetes, glucose and educ are included in the raw
model. Their significance in the model can be found in the column Pr > ChiSq of the
table; the values are obtained from the analysis of maximum likelihood estimates. The effects of the
individual predictors are shown in the column Parameter Estimate. The results show that
sex and educ have negative effects on the hazard rate, indicating that females have a
lower hazard than males and that more highly educated subjects have a lower hazard than less educated
ones. Since some variables have non-significant p-values, the model can be improved by
eliminating the variables with the largest p-values one by one.
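The one-at-a-time elimination rule can be sketched as a short loop in Python. This is only an illustration of the mechanical rule: `fit_pvalues` is a hypothetical caller-supplied function that refits the model and returns p-values, and the numbers in the usage lines are invented. In the study itself the choice of the variable to drop was sometimes overridden by subject-matter considerations.

```python
def backward_eliminate(variables, fit_pvalues, threshold=0.05):
    """Drop the least significant variable until every p-value is <= threshold.

    variables   : list of variable names in the starting model
    fit_pvalues : hypothetical function that refits the model on a list of
                  variables and returns a dict {variable: p_value}
    """
    kept = list(variables)
    while kept:
        pvalues = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= threshold:
            break  # all remaining variables are significant
        kept.remove(worst)
    return kept


# Invented p-values that do not change between refits, for illustration only.
fixed = {"sex": 0.01, "age": 0.0001, "diabp": 0.2, "glucose": 0.6}


def stub_fit(kept):
    return {name: fixed[name] for name in kept}


final_variables = backward_eliminate(["sex", "age", "diabp", "glucose"], stub_fit)
```

In a real analysis the p-values change after each refit, which is why the model is refitted inside the loop rather than once at the start.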
Table 6 Raw variables selected by Stepwise
To further optimize the Cox model, variables whose p-values exceed the significance threshold
are removed from the predictive model one at a time, starting with the highest p-value, until all remaining variables have a significant impact
on the prediction of the hazard rate. In Table 6, glucose is the variable with the highest p-value,
so it is removed first in the following regression. The result is shown in
Table 7.
The same strategy is then applied in the following analysis. In the next step diabp is removed,
although the p-value of totchol is higher, because the related variable
sysbp is already included. After this step most of the predictors are significant in the model, with
p-values less than or close to 0.05 (Table 8).
Table 7 Step 2 elimination variable with high p-value by Stepwise
The predictors might also interact with each other in the model. To test for interactions
among variables, all raw variables and all possible interaction terms are included in a
proportional hazards regression analysis, and the interaction terms are selected with the same strategy. The
result is shown in Table 9. According to the table, the interactions sex_age, totchol_cigpday,
totchol_bmi, age_educ, sysbp_diabp and diabetes_glucose might affect the prediction; their effects and
p-values are shown in the columns Parameter Estimate and Pr > ChiSq respectively.
Table 8 Step 3 elimination variable with high p-value by Stepwise
To test how well a model with all originally selected variables and the potential interaction terms fits, all of
them are included in the proportional hazards regression analysis. The full list of variables
is sex, totchol, age, sysbp, cigpday, bmi, diabetes, educ, diabp, glucose, sex_age, totchol_cigpday,
totchol_bmi, age_educ, sysbp_diabp and diabetes_glucose. Some significant
interaction terms involve variables that were not significant and were removed in the initial
model optimization, such as diabp and glucose, so these variables are reintroduced into the model selection.
The results show that some terms are not significant when all possible predictors are included, such as
educ and diabetes_glucose, whose p-values are 0.8571 and 0.8205 respectively. The effect of sex
is negative, which makes sense, since females have a lower risk of
cardiovascular diseases than males. However, the model needs to be optimized further, because some
variables in the raw model are not significant. The details can be found in Table 10.
Table 9 Interaction terms selected by stepwise
As before, to further optimize the predictive model, the least significant variable is removed from
the regression analysis. In the current case, the interaction term diabetes_glucose is removed first,
although its p-value is lower than that of educ, because glucose was not found to have a significant impact on the
prediction in the first place. The result of the regression is displayed in Table 11.
Table 10 Raw Final Model with Interaction Terms
With the same strategy, the interaction term educ_age is removed in the following
regression analysis, since it has the highest p-value, 0.9147 (Table 12).
Table 11 Optimize Final Model with Interaction Terms_1
Then the predictors diabp and sex_diabp are removed in the following regression
(Table 13).
Table 12 Optimize Final Model with Interaction Terms_2
Table 13 Optimize Final Model with Interaction Terms_3
Similarly, the variable sex_sysbp is removed in the following regression (Table 14). Eventually, the final
model includes the variables sex, totchol, age, sysbp, cigpday, bmi, diabetes, educ and sex_age.
However, the estimated effect of sex in this model is positive. This is opposite to the well-known
fact that females have a lower risk of cardiovascular diseases than males. The effect of sex might be
distorted by the interaction terms that involve sex, such as sex_age and sex_sysbp.
Validation of Assumption of Proportionality
The final model is built on the major assumption that the hazards of different groups are proportional.
Two methods are commonly applied to test this assumption: plotting
transformed Kaplan-Meier curves ($\log(-\log S(t))$ versus $\log t$) with Proc LIFETEST, or testing
the time dependence of the variables. If the assumption of proportionality holds, the
transformed Kaplan-Meier curves of the groups are parallel, and the effects of the variables on the prediction
do not vary with time. The Kaplan-Meier curves are usually applied to categorical variables, or to
continuous variables that can easily be grouped. They can be produced in Proc LIFETEST
by plotting the survival function versus time, $-\log S(t)$ versus $t$, or $\log(-\log S(t))$ versus $\log t$. In the current
study, the plot with log-transformed time and log-transformed survival function is applied. For continuous variables, the
model can instead include time-dependent variables, which are interactions between log-transformed
time and the variables. These can be tested in Proc PHREG directly.
Table 14 Optimize Final Model with Interaction Terms_4
Our data show that the categorical variables sex, diabetes and education level satisfy proportionality in the
plot of $\log(-\log S(t))$ versus $\log t$. For the variable educ, the proportionality is
more obvious when level 1 is compared with the combination of levels 2, 3 and 4 (Figure
2).
However, overall proportionality of the final model is not observed for the continuous variables, since
the overall p-value is less than 0.0001 (Table 15). In particular, the time dependence comes from the variables
age, sex_age and cigpday, which have p-values of < 0.0001, 0.0083 and 0.0005 respectively (Table 15).
To improve the performance of the model, these three variables are transformed into time-dependent
form (x*log(t), where x is the variable and t is the time to event) and used in the model instead of the original ones.
The new model is then tested again for proportionality by proportional hazards regression. The results
show that the assumption of proportionality holds for the continuous variables sysbp and bmi, since the
p-value of the overall test is 0.4068, and the p-values for sysbp and bmi are 0.7787 and 0.254 respectively (Table 16).
Table 15 Test for proportional assumption
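The x*log(t) transformation can be illustrated with a small Python sketch. In the partial likelihood, a covariate of this form is evaluated at each event time for every subject still at risk at that time, rather than once per subject. The function names and example values below are assumptions for illustration and are not the SAS statements used in the study.

```python
import math


def time_dependent_value(x, event_time):
    """Value of the covariate x * log(t) at a given event time t (t > 0)."""
    return x * math.log(event_time)


def expand_over_event_times(x, follow_up, event_times):
    """Covariate values of one subject at every event time while still at risk.

    x           : baseline value of the covariate (for example, age)
    follow_up   : the subject's own observed time
    event_times : event times observed in the whole sample
    Returns a list of (t, x * log(t)) pairs for distinct event times t <= follow_up.
    """
    return [
        (t, time_dependent_value(x, t))
        for t in sorted(set(event_times))
        if t <= follow_up
    ]


# Hypothetical subject aged 50 with 300 days of follow-up, for illustration only.
rows = expand_over_event_times(50.0, 300.0, [100.0, 200.0, 400.0, 100.0])
```

A significant coefficient for such a term indicates that the effect of the covariate on the log hazard changes linearly with log time.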
Model Optimization
To further optimize the final model, the three transformed time-dependent variables are fitted in place of the
original variables: the model including sex, totchol, sysbp, bmi, diabetes, educ, ageT, sex_ageT and
cigpdayT is fitted by proportional hazards regression analysis. The result shows that
sex_ageT is no longer significant, with a p-value of 0.1546 (Table 17).
Table 16 Test for proportional assumption-2
Table 17 Raw final model with time-dependent variable
To solve this problem, the model is stratified by sex and the regression is redone. Now
all variables in the model are significant (Table 18). The results indicate that
sex should be considered a dominant predictor in the model and can therefore be handled by
stratification. This means that the observations from the two sexes are analyzed with
their own baseline hazard rates. Overall, the model fits the observations in the data.
Table 18 Final model with time-dependent variable and stratified variable: sex
In summary, the final proportional hazards model has the stratified form
$$h_s(t \mid Z) = h_{0s}(t) \exp(\beta^T Z(t)), \quad s \in \{\text{female}, \text{male}\},$$
where $Z(t)$ is the vector of covariates in the final model and females and males have their own baseline hazard functions $h_{0s}(t)$. The model satisfies the assumption of
proportionality for all categorical variables (diabetes, sex and educ) and for some of the continuous
variables, such as bmi and sysbp.
Table 19 Stratification with variable: Sex
However, age, cigpday and sex_age behave in a time-dependent
manner. Age changes with time in a known way, so it should be treated as a fixed covariate. Likewise,
sex_age should be treated as a fixed covariate, since sex never changes over time. In other
words, the only truly time-dependent variable is cigpday. Therefore, the model is suitable for predicting the time to
event for subjects with fixed measurements or characteristics, but more care
is needed to make precise predictions when time-dependent covariates are present.
Model Diagnostics
To test how well the model with these predictors fits overall, diagnostics are performed with
deviance residual plots. In SAS, the option RESDEV is applied to obtain the residuals. The plots against the
linear predictor and against the individual predictors show that the majority of the residuals cluster around
zero and are distributed almost evenly over their respective ranges. There are outliers
in the plots, but overall the fit is good (Figure 3).
Figure 3
Comparison of Survival Functions
Since the model fits well with this set of covariates, the survival function is also examined for each covariate individually,
with the values of all the other variables held fixed. The results are shown in Figure 4. According to the figure, the survival
function decreases more quickly for subjects who are male, are older, have a higher body mass index, smoke more, have diabetes,
have a lower education level, have a higher value of the interaction between sex and age, have a higher level of total
cholesterol, or have higher blood pressure than for subjects with the opposite
characteristics and measurements. The effects observed in the survival functions
match the parameter estimates from the proportional hazards regression analysis,
which supports the model.
Parametric model
As discussed, a well-fitting parametric model makes the estimation of the survival function much simpler, since
it deals with far fewer parameters. To test this, the set of covariates selected above is used to
fit a parametric regression. In this model, the log-transformed time to event ($\log T$) is determined by the set of covariates plus an error term
($W$) that follows a certain distribution. Its formula is
$$\log T = \mu + \gamma^T Z + \sigma W.$$
Therefore, its survival function is
$$S_T(t \mid Z) = S_W\left(\frac{\log t - \mu - \gamma^T Z}{\sigma}\right).$$
The output is shown in Table 20, which lists the estimated coefficients of the covariates that
significantly influence the log-transformed time to event.
To test the overall model fit, the Cox-Snell residual plot is generated. If the model is correct and the
parameter estimates are close to the true values, the Cox-Snell residuals $r_j$ should look like a censored sample from a unit exponential distribution. The
plot of the Nelson-Aalen estimate $\hat{H}_r(r_j)$ versus $r_j$ should then be approximately a straight line through the origin with slope 1.
In the current plot, the line is close to a straight one,
but it is not perfect, which indicates that the model can be improved further. The martingale
residual plot is produced with the residuals versus time. Overall, it is smooth over time. The Schoenfeld
residual is computed for each covariate and each individual. If the proportional hazards assumption
holds, the probability that individual $l$ in the risk set is the one who fails at time $t_i$, given that there is one death at $t_i$, is
$$p_l(t_i) = \frac{\exp(\beta^T Z_l)}{\sum_{m \in R(t_i)} \exp(\beta^T Z_m)},$$
where $R(t_i)$ is the set of individuals at risk at time $t_i$. The Schoenfeld residual is defined as
$$s_{ij} = Z_{ij} - \sum_{l \in R(t_i)} Z_{lj} \, \hat{p}_l(t_i),$$
that is, the observed value of covariate $j$ for the individual who fails at $t_i$ minus its expected value over the risk set, with $\hat{p}_l$ computed from $\hat{\beta}$. Therefore, for
each individual $i$ who experiences an event, there is a set of values, one for each covariate $j$ included in the PH model. If the PH
assumption is correct, the Schoenfeld residuals are uncorrelated with mean 0. The Schoenfeld residual
plot is very smooth, and the residuals cluster around zero. The deviance residual plot is used to check
the model fit for individual observations, and it is good for finding outliers. The deviance residual is
defined as
$$D_j = \mathrm{sign}(\hat{M}_j) \sqrt{-2\left[\hat{M}_j + \delta_j \log(\delta_j - \hat{M}_j)\right]},$$
where $\hat{M}_j$ is the martingale residual and $\delta_j$ is the event indicator.
A considerable number of observations appear as outliers in the model. Therefore, the model needs
further improvement.
Table 20 AFT: Analysis of Parameter Estimates
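The deviance residual is a transformation of the martingale residual that makes the residuals more symmetric around zero. A minimal Python sketch is given below, assuming the standard definition in terms of the martingale residual M and the event indicator delta; the function name and example values are illustrative only.

```python
import math


def deviance_residual(martingale, delta):
    """Deviance residual from a martingale residual and an event indicator.

    martingale : martingale residual M (M <= 0 if censored, M < 1 if an event)
    delta      : 1 for an event, 0 for a censored observation
    Computes sign(M) * sqrt(-2 * (M + delta * log(delta - M))).
    """
    inner = martingale
    if delta == 1:
        # delta - M is the Cox-Snell residual, which is positive.
        inner += math.log(delta - martingale)
    # Guard against tiny negative values caused by rounding.
    value = math.sqrt(max(0.0, -2.0 * inner))
    return math.copysign(value, martingale)


# Hypothetical martingale residuals, for illustration only.
censored_example = deviance_residual(-0.5, 0)  # -1.0
event_example = deviance_residual(0.5, 1)
```

Observations with large absolute deviance residuals are the ones that appear as outliers in the residual plot.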
Two ways can be tried to improve the model fit. The first is to add more data and/or include more
covariates, but this may be too costly and is sometimes not feasible. The second is to optimize the model by
fitting different transformed forms of the covariates. To test whether the forms of the covariates in the current
model are the best ones, the martingale residuals are plotted against the respective covariates
(Figure 5). The lines appear smooth, but the forms used in the model are not ideal for the majority of the
covariates, such as totchol, sysbp, cigpday, bmi and sex_age.
Figure 4
The model fit is further checked by Q-Q plots (Figure 6). A true fit
means that the plot is a straight line through the origin with slope 1. According to the plots, sex fits the
model very closely, whereas the covariate diabetes does not fit the model well at all. The variable educ fits the
model well only before a survival time of 6000 days. Therefore, the form of the diabetes covariate should be improved in
some way. The variable sysbp is continuous; in the figure, it is dichotomized at
140 mmHg to distinguish high from normal blood pressure. Its form is not ideal either.
Figure 5
To test which model fits the data best, the log likelihoods of four commonly used distributions for
estimating the survival curve are obtained from SAS, and their respective AIC values are calculated from
the log likelihoods (Table 21). The result indicates that the Weibull distribution
gives the best fit to the data, since its AIC value is the smallest.
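The AIC comparison can be sketched in Python as follows, using the usual definition AIC = -2 log L + 2k, where k is the number of estimated parameters. The log likelihoods and parameter counts in the usage lines are invented for illustration and are not the values in Table 21.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: -2 * log L + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


def best_by_aic(models):
    """Return the name of the model with the smallest AIC.

    models maps a model name to a tuple (log_likelihood, n_parameters).
    """
    return min(models, key=lambda name: aic(*models[name]))


# Invented log likelihoods and parameter counts, for illustration only.
candidates = {
    "exponential": (-1200.0, 9),
    "weibull": (-1180.0, 10),
    "lognormal": (-1185.0, 10),
}
best = best_by_aic(candidates)
```

Because the penalty term grows with the number of parameters, a distribution with an extra shape parameter is preferred only if it raises the log likelihood by more than one unit per added parameter.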
Figure 6
Discussion
Cardiovascular diseases are the leading cause of death worldwide. Much effort has been devoted to finding out how the diseases
arise. Thus far, many factors have been found to be associated with the
onset or progression of the diseases [11]. They are called risk factors and include age, sex, smoking
status, blood pressure, total blood cholesterol, diabetes, and so on [11, 12]. To better understand the
impact of individual risk factors on the onset or progression of the diseases, many scoring systems or
models have been built to measure their effects, such as the Framingham scoring system, the European
Systematic Coronary Risk Evaluation algorithm (SCORE), and the Prospective Cardiovascular Munster
(PROCAM) model. In general, the systems assign values to individual risk factors and then add them up
for each subject [30, 31, 32, 34]. The higher the score of a subject, the greater the risk of suffering from
cardiovascular diseases. These scores are assigned arbitrarily. In addition, the scoring systems are
built on specific populations. Therefore, great variation is usually observed when the same system or model
is applied across populations. Even within the same population, much variation is observed between
cohorts [31, 36, 39]. Thus far, most models have not been cross-validated with different cohorts.
Therefore, no system or model is yet applicable to a general population.
Table 21 Comparisons among models with different distributions
Our current investigation tries to build a predictive model by survival analysis on a subset of the Framingham Heart
Study dataset. Survival analysis has been widely used in various fields, such as biomedicine,
engineering and social science. It is specially designed for investigating
survival data, whose features of censoring and non-normality make the data very difficult to analyze by
traditional statistical models or methods. The non-normality of the data violates the normality
assumption of the most commonly used statistical models, such as regression or ANOVA.
Censoring is very common in survival data. It means that the information on an observation is incomplete. In
general, five kinds of incomplete observation occur in survival data: right truncation, left truncation, left
censoring, right censoring and interval censoring. In the current dataset from the Framingham Heart Study, only
right censoring is observed. It can arise for several reasons, i.e. dropout from the investigation, termination of
the investigation, or death from reasons unrelated to the event [52, 53, 54]. Therefore, any subject either
experiences an event at some time point or survives beyond that time point (censoring).
Another feature of survival analysis is the hazard rate. For discrete time, the hazard rate is the probability
that an individual experiences an event at time point t, given survival up to that time. The hazard rate is therefore not observed directly
but estimated. It can be constant or vary with time [54].
To build the model, univariate analysis is conducted to test the association between each variable and
the time to event [61]. Kaplan-Meier plots are generated for each categorical variable for this purpose,
while the continuous variables are tested by Cox proportional hazards regression. The threshold for keeping a
variable for further model selection is a p-value below 0.25. Interestingly, education seems to be
involved in the onset or development of cardiovascular diseases. This is plausible,
since people with different levels of education have different lifestyles, such as physical activity, social circles,
diet, and so on [13, 16].
Sex is a significant covariate as well. This agrees with common knowledge, since females usually have a much lower
chance of suffering from cardiovascular diseases than males. When the Cox proportional hazards model is built,
sex behaves strikingly differently from the other covariates, so it is used as a stratification variable.
Other factors, such as high blood pressure, total blood cholesterol, body mass index and age, are well-known
disease-related risk factors. It is not surprising to see that they promote disease onset
and development. In other words, the higher their values, the shorter the survival time and
the higher the hazard rate. These results can be seen in the parameter estimates of both the semiparametric
model and the parametric model.
The current model includes two time-dependent variables, age and the number of cigarettes smoked per day
(cigpday). This indicates that the risk of suffering from cardiovascular diseases increases with aging. It is
well known that the heart undergoes subtle physiologic changes as a person gets older, even without
disease. This means that the impact of these two variables on the diseases can be additive to the influence of aging.
Interestingly, in the current hazard model, the interaction between sex and age has a positive
impact on lifetime. This is understandable for sex, but is harder to explain for age. The
impact of the interaction may therefore come mainly from sex, since a one-unit change in sex (from
male to female: 0 to 1) corresponds to a large change in age. Another interesting explanation is that subjects who do
not have cardiovascular diseases when they are old must have good genetics [14, 19].
Overall, the model fits the dataset well, although not perfectly, as can be seen from the deviance
residual plots. It can still be improved by adding more data or by changing the form of
the covariates.
References
1. "cardiovascular system" at Dorland's Medical Dictionary
2. Maton A, Hopkins J, McLaughlin CW, Johnson S, Warner MQ, LaHart D, Wright JD. (1993).
Human Biology and Health. Englewood Cliffs, New Jersey: Prentice Hall.
3. Fuster V, Kelly BB (editors); Institute of Medicine (2010). Promoting Cardiovascular
Health in the Developing World: A Critical Challenge to Achieve Global Health. Washington,
D.C.: National Academies Press.
4. Dantas AP, Jimenez-Altayo F, Vila E (2012). "Vascular aging: facts and factors". Frontiers in
Vascular Physiology 3 (325): 1–2.
5. Committee on Preventing the Global Epidemic of Cardiovascular Disease: Meeting the
Challenges in Developing Countries; Board on Global Health; Institute of Medicine of the
National Academies; Fuster V, Kelly BB (editors) (2010). Promoting
Cardiovascular Health in the Developing World: A Critical Challenge to Achieve Global Health.
Washington, D.C.: National Academies Press.
6. Mendis S, Puska P, Norrving B (editors) (2011). Global Atlas on Cardiovascular Disease
Prevention and Control.
7. McGill HC, McMahan CA, Gidding SS (2008). "Preventing heart disease in the 21st century:
implications of the Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study".
Circulation 117 (9): 1216–27.
8. Howard BV, Wylie-Rosett J (2002). "Sugar and cardiovascular disease: A statement for
healthcare professionals from the Committee on Nutrition of the Council on Nutrition, Physical
Activity, and Metabolism of the American Heart Association". Circulation 106 (4): 523–7.
9. Finks SW, Airee A, Chow SL, Macaulay TE, Moranville MP, Rogers KC, Trujillo TC (2012).
"Key articles of dietary interventions that influence cardiovascular mortality".
Pharmacotherapy 32 (4): e54–87.
10. Mackay, Mensah, Mendis, et al. The Atlas of Heart Disease and Stroke. World Health
Organization. January 2004.
11. Vartiainen J, Puska T (1999). "Sex, Age, Cardiovascular Risk Factors, and coronary heart
disease". Circulation 99: 1165–1172.
12. Jani B, Rajkumar C (2006). "Ageing and vascular ageing". Postgrad Med J 82: 357–362.
13. Ignarro LJ, Balestrieri ML, Napoli C (2007). "Nutrition, physical activity, and cardiovascular
disease: an update". Cardiovascular Research 73 (2): 326–40.
14. World Heart Federation (2011). "World Heart Federation: Cardiovascular disease risk
factors".
15. The National Heart, Lung, and Blood Institute (NHLBI) (2011). "How To Prevent and Control
Coronary Heart Disease Risk Factors - NHLBI, NIH".
16. Baska T, Straka S. (1997) Epidemiology of modifiable risk factors in cardiovascular diseases.
Review of present knowledge Epidemiol Mikrobiol Imunol. 46(3):108-14.
17. Munroe PB, Barnes MR, Caulfield MJ. (2013) Advances in blood pressure genomics. Circ
Res. 112(10):1365-79.
18. Thurston RC, Rewak M, Kubzansky LD. (2013) An anxious heart: anxiety and the onset of
cardiovascular diseases. Prog Cardiovasc Dis. 55(6):524-37.
19. Shah RU, Winkleby MA, Van Horn L, Phillips LS, Eaton CB, Martin LW, Rosal MC,
Manson JE, Ning H, Lloyd-Jones DM, Klein L. (2011) Education, income, and incident heart
failure in post-menopausal women: the Women's Health Initiative Hormone Therapy Trials. J
Am Coll Cardiol. 58(14):1457-64.
20. Jernigan VB, Duran B, Ahn D, Winkleby M. (2010) Changing patterns in health behaviors and
risk factors related to cardiovascular disease among American Indians and Alaska Natives.
Am J Public Health. 100(4):677-83.
21. Sowers JR. (2013) Diabetes mellitus and vascular disease. Hypertension. 61(5):943-7.
22. Kassab E, McFarlane SI, Sowers JR. (2001) Vascular complications in diabetes and their
prevention. Vasc Med. 6(4):249-55.
23. Rubenfire M, Brook RD. (2013) HDL cholesterol and cardiovascular outcomes: what is the
evidence? Curr Cardiol Rep. 15(4):349.
24. Sarwar MS, Islam MS, Al Baker SM, Hasnat A. (2013) Resistant hypertension: underlying
causes and treatment. Drug Res 63(5):217-23.
25. Khan Z, Almeida DR, Rahim K, Belliveau MJ, Bona M, Gale J. (2013) 10-Year Framingham
risk in patients with retinal vein occlusion: a systematic review and meta-analysis. Can J
Ophthalmol. 48(1):40-45.
26. Burg MM, Edmondson D, Shimbo D, Shaffer J, Kronish IM, Whang W, Alcántara C,
Schwartz JE, Muntner P, Davidson KW. (2013) The 'perfect storm' and acute coronary
syndrome onset: do psychosocial factors play a role? Prog Cardiovasc Dis. 55(6):601-10.
27. Johnson JA, Cavallari LH. (2013) Pharmacogenetics and cardiovascular disease--
implications for personalized medicine. Pharmacol Rev. 65(3):987-1009.
28. Munroe PB, Barnes MR, Caulfield MJ. (2013) Advances in blood pressure genomics. Circ
Res. 112(10):1365-79.
29. Munroe PB, Barnes MR, Caulfield MJ. (2013) Advances in blood pressure genomics. Circ
Res. 112(10):1365-79.
30. Nyborn JA, Himali JJ, Beiser AS, Devine SA, Du Y, Kaplan E, O'Connor MK, Rinn WE,
Denison HS, Seshadri S, Wolf PA, Au R. (2013) The Framingham Heart Study clock drawing
performance: normative data from the offspring cohort. Exp Aging Res. 39(1):80-108.
31. Artac M, Dalton AR, Majeed A, Car J, Millett C. (2013) Effectiveness of a national
cardiovascular disease risk assessment program (NHS Health Check): Results after one
year. Prev Med. 57(2):129-34.
32. Zhang G, Parikh PB, Zabihi S, Brown DL. (2013) Rating the preferences for potential harms
of treatments for cardiovascular disease: a survey of community-dwelling adults. Med Decis
Making. 33(4):502-9.
33. Fu BB, Fei XL, Sekar BD, Dong MC. (2013) Research and application of heart sound
alignment and descriptor. Comput Biol Med. 43(3):211-8.
34. Gensini GG. (1983) A more meaningful scoring system for determining the severity of
coronary heart disease. Am J Cardiol. 51(3):606.
35. http://hp2010.nhlbihin.net/atpiii/calculator.asp
36. Basow DS (Ed). (2010) Estimation of cardiovascular risk in an individual patient without
known cardiovascular disease. Massachusetts Medical Society, and Wolters Kluwer
publishers, The Netherlands. 2010.
37. D'Agostino RB Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, Kannel WB.
(2008) General cardiovascular risk profile for use in primary care: the Framingham Heart
Study. Circulation. 117(6):743-53.
38. Brindle P, Emberson J, Lampe F, Walker M, Whincup P, Fahey T, Ebrahim S. (2003)
Predictive accuracy of the Framingham coronary risk score in British men: prospective cohort
study. BMJ. 327(7426):1267.
39. Liu J, Hong Y, D'Agostino RB Sr, Wu Z, Wang W, Sun J, Wilson PW, Kannel WB, Zhao D.
(2004) Predictive value for the Chinese population of the Framingham CHD risk assessment
tool compared with the Chinese Multi-Provincial Cohort Study. JAMA. 291(21):2591-9.
40. Sacco RL, Khatri M, Rundek T, Xu Q, Gardener H, Boden-Albala B, Di Tullio MR, Homma S,
Elkind MS, Paik MC. (2009) Improving global vascular risk prediction with behavioral and
anthropometric factors. The multiethnic NOMAS (Northern Manhattan Cohort Study). J Am
Coll Cardiol. 54(24):2303-11.
41. Conroy RM, Pyorala K, Fitzgerald AP, Sans S, Menotti A, De Backer G, De Bacquer D,
Ducimetiere P, Jousilahti P, Keil U, Njolstad I, Oganov RG, Thomsen T, Tunstall-Pedoe H,
Tverdal A, Wedel H, Whincup P, Wilhelmsen L, Graham IM. (2003) Estimation of ten-year
risk of fatal cardiovascular disease in Europe: the SCORE project. Eur Heart J.
24(11):987-1003.
42. 2002 Third Report of the National Cholesterol Education Program (NCEP) Expert Panel on
Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment
Panel III) final report. Circulation. 106(25):3143-421.
43. Ford ES, Giles WH, Mokdad AH. (2004) The distribution of 10-Year risk for coronary heart
disease among US adults: findings from the National Health and Nutrition Examination
Survey III. J Am Coll Cardiol. 43(10):1791-6.
44. Schnabel RB, Sullivan LM, Levy D, Pencina MJ, Massaro JM, D'Agostino RB Sr,
Newton-Cheh C, Yamamoto JF, Magnani JW, Tadros TM, Kannel WB, Wang TJ, Ellinor PT, Wolf PA,
Vasan RS, Benjamin EJ. (2009) Development of a risk score for atrial fibrillation
(Framingham Heart Study): a community-based cohort study. Lancet. 373(9665):739-45.
45. Lloyd-Jones DM, Wang TJ, Leip EP, Larson MG, Levy D, Vasan RS, D'Agostino RB,
Massaro JM, Beiser A, Wolf PA, Benjamin EJ. (2004) Lifetime risk for development of atrial
fibrillation: the Framingham Heart Study. Circulation. 110(9):1042-6.
46. http://cvrisk.mvm.ed.ac.uk/help.htm
47. http://en.wikipedia.org/wiki/Framingham_Heart_Study
48. Siontis GC, Tzoulaki I, Siontis KC, Ioannidis JP. (2012) Comparisons of established risk
prediction models for cardiovascular disease: systematic review. BMJ. 344
49. See R, de Lemos JA. (2006) Current status of risk stratification methods in acute coronary
syndromes. Curr Cardiol Rep. 8(4):282-8.
50. Bewick V, Cheek L, Ball J. (2004) Statistics review 12: Survival analysis,
Critical Care 8:389-394.
51. Abbring JH and van den Berg GJ. (2003), "The identifiability of the mixed
proportional hazards competing risks model", Journal of the Royal Statistical
Society (B), 65, 701-710.
52. Allison P. (1995), Survival Analysis Using the SAS System: A Practical
Guide, SAS Institute, Cary, NC.
53. Cox D. and Oakes D. (1984), Analysis of Survival Data, Chapman and Hall,
London.
54. Hosmer DW and Lemeshow S. (1999), Applied Survival Analysis, Wiley,
New York.
55. Kalbfleisch JD and Prentice RL. (1980), The Statistical Analysis of Failure Time Data, John
Wiley, New York.
56. Narendranathan W. and Stewart M. (1991), "Simple methods for testing the proportionality of
cause-specific hazards in competing risks models", Oxford
Bulletin of Economics and Statistics 53:331-340.
57. Prentice RL and Gloeckler LA. (1978), "Regression analysis of grouped
survival data with application to breast cancer data", Biometrics 34, 57-67.
58. Singer JD and Willett JB. (1993), "It's about time: using discrete-time
survival analysis to study duration and the timing of events", Journal of
Educational Statistics 18, 155-195.
59. Singer JD and Willett JB. (2003), Applied Longitudinal Data Analysis:
Modeling Change and Event Occurrence, Oxford University Press, Oxford.
60. Therneau TM and Grambsch PM. (2000), Modeling Survival Data: Extending the Cox Model,
Springer, New York.
61. http://www.ats.ucla.edu/stat/sas/seminars/sas_survival/#Background
62. Klein JP, Moeschberger ML. (1997) Survival Analysis: Techniques for Censored and Truncated
Data, Springer-Verlag, New York.
63. Nelson W, et al. (1972) Theory and applications of hazard plotting for censored failure data,
Technometrics.
64. Klein JP. (1991) Small sample moments of some estimators of the variance of the
Kaplan-Meier and Nelson-Aalen estimators, Scandinavian Journal of Statistics 18:333-340.
65. Kaplan EL, Meier P. (1958) Nonparametric estimation from incomplete observations, Journal
of the American Statistical Association 53:457-481.
66. Breslow NE. (1975) Analysis of survival data under the proportional hazards model,
International Statistical Review 43:45-58.
67. Aalen OO. (1978) Nonparametric estimation of partial transition probabilities in multiple
decrement models, Annals of Statistics 6:534-545.
68. Aalen OO. (1980) A model for non-parametric regression analysis of counting processes, in
Lecture Notes on Mathematical Statistics and Probability, Springer-Verlag, 1-25.
69. Akaike H. Information theory and an extension of the maximum likelihood principle, in 2nd
International Symposium of Information Theory and Control, 267-281.
70. Cox DR. (1972) Regression models and life tables (with discussion), Journal of the Royal
Statistical Society 187-220.
71. Cox DR, Snell EJ. (1968) A general definition of residuals (with discussion), Journal of the
Royal Statistical Society B30:248-275.
72. Efron B. (1967) The two sample problem with censored data, in Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, 4:831-853.
73. Fleming TR, Harrington DP, O'Sullivan M. (1987) Supremum versions of the log-rank and
generalized Wilcoxon statistics, Journal of the American Statistical Association,
82:312-320.
74. Fleming TR et al. (1980) Modified Kolmogorov-Smirnov test procedures with application to
arbitrarily right censored data, Biometrics 36:607-626.
75. Gehan EA. (1965) A generalized Wilcoxon test for comparing arbitrarily singly censored
samples, Biometrika 203-223.
76. Kannel WB, et al, (1987) Heart rate and cardiovascular mortality: The Framingham study,
American Heart Journal, 114(6):1489-1494.