Survival Analysis of Cardiovascular Diseases

Washington University in St. Louis

Washington University Open Scholarship

All Theses and Dissertations (ETDs)

Winter 12-1-2013

Survival Analysis of Cardiovascular Diseases

Yuanxin Hu, Washington University in St. Louis

Follow this and additional works at: https://openscholarship.wustl.edu/etd

Part of the Mathematics Commons, and the Statistics and Probability Commons

This Thesis is brought to you for free and open access by Washington University Open Scholarship. It has been accepted for inclusion in All Theses and Dissertations (ETDs) by an authorized administrator of Washington University Open Scholarship. For more information, please contact [email protected].

Recommended Citation

Hu, Yuanxin, "Survival Analysis of Cardiovascular Diseases" (2013). All Theses and Dissertations (ETDs). 1209. https://openscholarship.wustl.edu/etd/1209

WASHINGTON UNIVERSITY IN ST. LOUIS

Department of Mathematics

Program in Statistics

Survival Analysis of Cardiovascular Diseases

by

Yuanxin Hu

A thesis presented to the

Graduate School of Arts & Sciences

of Washington University in

partial fulfillment of the

requirements for the degree

of Master of Arts

December 2013

Saint Louis, Missouri

Table of Contents

Acknowledgements

Abstract

Introduction

Cardiovascular Diseases

Risk Factors

Risk Scoring Systems and Models

Survival Data and Survival Analysis

Data

Survival Analysis

1) Kaplan-Meier Method

2) Cox Proportional Hazards Model

3) Accelerated Failure Time Model

Tests for Significance

Methods and Procedures

SAS Procedures

Results

Data Exploration

Model Selection

Validation of Assumption of Proportionality

Model Optimization

Model Diagnostics

Comparison of Survival Functions

Parametric Model

Discussion

References

Acknowledgements

I am very thankful to my supervisor, Prof. Ed Spitznagel, who contributed a great deal of effort and many ideas to the project. He also gave many valuable suggestions and comments on the draft of my thesis. Without his supervision, the thesis could not have been completed.

I also appreciate the suggestions and ideas kindly offered by Prof. Jimin Ding and Prof. Laura Dumitrescu during the revision of my thesis. Its completion owes a great deal to them.

In addition, I want to thank my other teachers and my classmates during my studies at the university. They helped me complete the transition from one field to another.

Finally, I want to thank my wife. Without her support, it would have been impossible for me to go back to school.

Therefore, the thesis is not only for me, but also for my supervisor, my teachers, my classmates and my family.

ABSTRACT OF THE THESIS

Survival Analysis of Cardiovascular Diseases

by

Yuanxin Hu

Master of Arts in Statistics

Washington University in St. Louis, 2013

Professor Ed Spitznagel, Chair

Cardiovascular disease (CVD) is a class of diseases related to the heart or blood vessels. It is the leading cause of death worldwide. Many risk factors have been found to be associated with it, such as family history, gender, and age. Much effort has been devoted to identifying these factors and evaluating their effects on the onset or progression of the diseases, so that onset or development can be predicted and preventive measures can be applied.

Thus far, much of the survival analysis of cardiovascular diseases comes from relatively selected populations followed over a short time, and the findings may not generalize easily to patients at large. The current data set is a subset of the data collected by the Framingham Heart Study since 1948, which originally enrolled 5209 subjects. The current data include 4434 randomly selected subjects living in the community of Framingham.

The main goal of the current study is to build a Cox proportional hazards model by survival analysis using the SAS statistical package. In the analysis, the proportional hazards assumption (or time dependence) is tested for individual factors, variables are selected, and their interactions are considered in order to optimize the model. Because gender has a striking impact on the prediction, the model is stratified by gender, so that a different baseline hazard applies to the set of variables within each group.

The model parameters are estimated by maximum likelihood using the Newton-Raphson algorithm. The results show that gender, diabetes status, age, body mass index, cholesterol and blood pressure affect the onset or development of the diseases. Interestingly, education level has an influence as well.

Introduction

Cardiovascular Diseases

The heart is the organ that pumps blood through the blood vessels to all parts of the body. In the process, it delivers oxygen and nutrients throughout the body, and it also collects waste products and carries them away [1]. This process is vital for life. The cardiovascular system comprises the heart and the entire network of blood vessels. Diseases or disorders of this system are termed cardiovascular diseases (CVD). They include coronary heart disease, congenital heart disease, stroke or cerebrovascular accident (CVA), congestive heart failure, peripheral arterial disease or peripheral vascular disease, deep venous thrombosis (DVT) and pulmonary embolism, rheumatic heart disease, and other cardiovascular diseases caused by tumors of the heart [2, 3, 4, 5]. The major CVD events, however, are heart attacks and strokes.

Cardiovascular diseases are the leading cause of death worldwide and killed over 17 million people in 2008. They have long been prevalent in developed countries, and for decades they have also been becoming a severe problem for people living in developing countries [6, 7].

Risk Factors

CVD is very common in the population, especially among adults [7]. Several risk factors have been found to contribute strongly to the onset or progression of the disease, such as age, gender, genetics, lifestyle, and some biochemical parameters of the body [8, 9]. They are categorized as uncontrollable factors, such as age, sex and family history [10, 11, 12], and modifiable factors, such as lifestyle and biochemical parameters. For example, people can delay the onset or development of the diseases by changing their lifestyle through exercise, not smoking, and a healthy diet [13, 14, 15, 16]. A healthy lifestyle in turn improves the biochemical parameters, which decreases the risk of suffering from the diseases [17, 18].

Blood pressure is a well-known risk factor for cardiovascular diseases. Once blood pressure reaches a high level and stays there, people will sooner or later develop cardiovascular disease, and after a period the damage is irreversible. In theory, higher blood pressure makes the heart work harder to pump blood out. As a result, the heart wall thickens at first and then gradually becomes thinner and thinner. Eventually, the wall is too thin to contract powerfully, so the pressure in the heart rises further until it is out of control [20, 21, 22].

Cholesterol is a fatty, wax-like substance found in the blood. A reasonable concentration of cholesterol in the blood is crucial for good health. There are two common lipoproteins, high-density lipoproteins (HDL) and low-density lipoproteins (LDL). HDL carries extra fat away from the arteries, so that it does not accumulate in them. LDL, however, helps fat build up on the artery wall. The deposited fat narrows the artery and eventually leads to high blood pressure. The normal level of total cholesterol for adults is below 200 mg/dL. The risk of heart attack increases dramatically as the concentration rises [18, 23].

Smoking is another well-known risk factor. In principle, nicotine narrows the blood vessels and causes high blood pressure and a high heart rate. In addition, carbon monoxide, a byproduct of smoking, competes with oxygen in the red blood cells, so less oxygen is delivered to the heart; as a consequence, the artery wall is damaged. Smoking harms the heart in other ways as well, such as by depositing cholesterol in the arteries and reducing the blood HDL level. As reported, smokers have twice the risk of early death from heart attack compared with nonsmokers [24, 25].

Diabetes is due to a dysfunctional response to insulin, which results in a high blood sugar level. As a side effect, it also promotes the accumulation of lipoproteins and increases atherosclerosis. Diabetes usually occurs together with high blood pressure, high cholesterol, a low level of HDL, and overweight (obesity). Each of these factors individually contributes to the onset or progression of heart failure, and their combination has a cumulative effect on the diseases [21, 22].

Stress is a part of our lives, and it is another controllable risk factor. When dealing with a tough situation, the body releases chemicals such as adrenalin, which can narrow the blood vessels and cause high blood pressure [26]. People who experience stress over a long period are prone to coronary artery disease.

Age, sex and family history (genetic heredity) are uncontrollable risk factors. Aging is associated with a progressive decline in numerous physiological processes and with fundamental morphological and structural changes of the heart. Specifically, aging results in dysfunction of the endothelium, which leads to the death and decline of heart endothelial cells [11, 12]. Clinically, the dysfunction results in high blood pressure and increases the risk of developing atherosclerosis, hypertension, stroke and atrial fibrillation.

Certain genetic mutations can raise the likelihood of heart and vascular diseases, and this risk can be inferred from family history. As reported, subjects most likely carry such mutations if a close relative has had heart disease before a certain age, for example, 55 years for males and 65 years for females [27, 28]. They inherit a higher tendency toward heart disease than people who have no family history of CVD.

Interestingly, sex is found to be a critical risk factor for CVD: men are more likely to have heart and blood vessel problems at an earlier age than women, and women are relatively protected, especially before menopause. As reported, estrogen may play a critical role in this protection, since it can up-regulate HDL, which is good for the health of the heart [11].

The above introduces the risk factors one by one, but they do not act in isolation. Their impacts on the development of the diseases can be cumulative or mutually reinforcing, so all the risk factors should be considered jointly, including their interactions. It is therefore meaningful to build a system or model that takes all potential risks into account. Risks can then be predicted or evaluated effectively, and suggestions and treatments can be applied to prevent or delay the onset or progression of the disease.

Risk Scoring Systems and Models

Because of the high incidence and cost of CVD, predicting the risk of cardiovascular events is important. Risk prediction helps clinical practice guidelines target primary prevention; such guidelines recommend that providers evaluate patients for cardiac risk factors that warrant medical treatment.

Thus far, several scoring systems have been developed to evaluate the risk of suffering from CVD, such as the Framingham Scoring System (FSS) [30], Joint British Societies (JBS) [31] and ASSIGN [32, 33]. In general, the systems assign scores to individual risk factors, and the risk of suffering from cardiovascular disease is evaluated from the sum of the scores: the higher the score, the greater the risk. Among them, the FSS is the most popular. The others follow similar ideas and scales for individual risks, but introduce specific criteria or scores for specific populations. The Framingham system can be applied to evaluate the risk of several CVD outcomes over periods ranging from 4 to 12 years, such as stroke, coronary disease, myocardial infarction, and death from either coronary or cardiovascular disease [30, 31, 32, 33, 34, 35].

The JBS has developed its own guideline, called the British National Formulary (BNF), for evaluating cardiovascular risk, mainly the risk of coronary disease and stroke. It inherits the scoring rules and scales used by the FSS [34, 35]. Both FSS and JBS are limited in how well they generalize to all populations. The ASSIGN scoring system was developed in conjunction with the Scottish Intercollegiate Guidelines Network by introducing more risk factors, such as family history and social deprivation [32, 33]. It serves the Scottish population well.

In general, all the systems treat age as a dominant factor in the evaluation [34]. This can be a limitation, because the systems may assign healthy older people a higher risk than younger CVD patients. Since preventive treatments and benefits are designed for the high-risk population, young patients who are actually at high risk may be overlooked. Therefore, the existing scoring systems can be improved [37, 38, 39, 40].

In addition, the accuracy of the risk evaluation is not always satisfactory, because the impact of the risk factors varies greatly between populations and between individuals within the same population. It can also differ among races, types of disease, and so on. Thus far, the systems can only evaluate risks for particular heart diseases. For example, the FSS can only predict the risk of coronary disease events; other outcomes are beyond its capability [41, 42, 43, 44, 45, 46]. Moreover, the predictions can only be applied to subjects within certain age ranges.

To date, over a hundred other risk models for predicting cardiovascular risk have been developed, such as the Framingham (FRS) model for CVD, the Prospective Cardiovascular Munster (PROCAM) model for CHD, and the Systematic Coronary Risk Evaluation (SCORE) model for CVD mortality. Unfortunately, only a few of them have been validated repeatedly and in different cohorts, such as the Scottish Heart Health Extended Cohort (SHHEC), Diabetes Audit and Research in Tayside, Scotland (DARTS), FINRISK, the Framingham Study (FRS), the Framingham Offspring Study (FRS-O), the Prospective Cardiovascular Munster Study (PROCAM), the QRESEARCH Database, Systematic Coronary Risk Evaluation (SCORE) and the United Kingdom Prospective Diabetes Study (UKPDS) [46, 47, 48, 49]. These systems improve on some points, but the limitations described above remain.

Survival Data and Survival Analysis

Survival data are collected to investigate the time to an event. The event can be death, occurrence of disease, machine failure, the end of a marriage, and so on [50]. A common characteristic of survival data is censoring, truncation, or a combination of the two, caused by the end of the investigation, the dropout of subjects, or an experimental design with a threshold. The information for censored or truncated subjects is therefore not completely known. The presence of censoring or truncation requires special treatment during data analysis, namely considering the time to event together with the censoring indicator, because classical methods do not handle such data properly. For example, simply comparing the mean time to event among groups with a t-test or linear regression ignores censoring, that is, the fact that some event times are only partly known. Likewise, comparing the proportion of events among groups with a risk or odds ratio test or logistic regression does not take time into account.

The analysis should therefore consider both. For example, in constructing the likelihood function, the terms for the observed events and the terms for the censored observations are combined. A subject whose event is observed at time t contributes the density function f(t), while a subject censored at time t contributes the survival probability S(t), the probability that the event occurs after t. The analysis assumes that the occurrence of censoring is unrelated to the future time to event: for subjects with the same set of explanatory variables, the probability distribution of the residual lifetime is the same for censored and uncensored cases [53, 54].
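The way events and censored observations enter the likelihood can be made concrete with a small sketch. It assumes an exponential time-to-event model, which is chosen here only because its likelihood has a closed-form maximum; the follow-up times and the function names are made up for illustration and are not taken from the Framingham data.

```python
import math

def exponential_loglik(rate, times, events):
    """Log-likelihood of an exponential model under right censoring.

    An observed event at time t contributes log f(t) = log(rate) - rate * t.
    A censored observation at time t contributes log S(t) = -rate * t.
    """
    total = 0.0
    for t, d in zip(times, events):
        total += d * math.log(rate) - rate * t
    return total

def exponential_mle(times, events):
    """Closed-form maximizer: number of events divided by total follow-up time."""
    return sum(events) / sum(times)

# Hypothetical follow-up times in days; 1 = event observed, 0 = censored.
times = [3607, 8766, 1250, 8766, 5400]
events = [1, 0, 1, 0, 1]

rate_hat = exponential_mle(times, events)
print(rate_hat, exponential_loglik(rate_hat, times, events))
```

Dropping the censored subjects, or treating their censoring times as event times, would both change the estimate, which is the bias the classical methods mentioned above would introduce.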

Data

The current investigation uses a subset of the data from the Framingham Heart Study, a long-term prospective study of the etiology of cardiovascular disease among the population living in the community of Framingham, Massachusetts. The study began in 1948 with 5209 subjects enrolled initially. The participants have been examined biennially and followed continuously through regular surveillance for cardiovascular outcomes. The clinic examinations have covered cardiovascular disease risk factors and markers of disease, such as blood pressure, blood chemistry, lung function, smoking history, health behaviors, ECG tracings, echocardiography, and use of medications. Through regular surveillance, the study reviews and adjudicates events for the occurrence of angina pectoris, myocardial infarction, heart failure, and cerebrovascular disease.

The current data set includes 4434 participants with three examination periods, approximately 6 years apart, from 1956 to 1968. It contains laboratory results, clinical diagnoses, health questionnaires, and adjudicated events. Each participant was followed for a total of 24 years for the outcome of events such as angina pectoris, myocardial infarction, atherothrombotic infarction or cerebral hemorrhage (stroke), or death.

The data set is provided in longitudinal form. Each participant has 1 to 3 observations, depending on the exams the subject attended. Overall, there are 11,627 observations on the 4434 participants. For example, the participant with RANDID 95148 attended three exams (period 1, 2, 3). She was 52 years old at the first visit and 58 and 64 years old at the following two visits. The participant was female (SEX = 2). The baseline time was set to 0 (time = 0), and the following two exams took place on days 2128 and 4192 (time = 2128, 4192). The column timemifc is the time to event, counted in days from the baseline visit; here an MI occurred on day 3607. Participant 95148 thus entered the study (time = 0, period = 1) free of prevalent coronary heart disease (prevchd = 0), and her time to the MI event is 3607 days. At her second exam on day 2128 (period = 2) she was still free of disease (prevchd = 0), but the event occurred before her third visit (prevchd = 1, period = 3). The variable mi_fchd shows that the participant had the event during follow-up (mi_fchd = 1), but by itself it does not tell when the event occurred. Some other participants had no event, so they are censored cases (prevchd = 0, mi_fchd = 0) at all three exams.

Example

RANDID  age  SEX  time  period  prevchd  mi_fchd  timemifc
95148   52   2    0     1       0        1        3607
95148   58   2    2128  2       0        1        3607
95148   64   2    4192  3       1        1        3607
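For a model with time-fixed covariates, the longitudinal records need to be reduced to one record per subject. The sketch below shows one possible way to do this, keeping the period-1 covariates together with the outcome columns. It is an illustration in plain Python, not the SAS data step used in the analysis; the function name `baseline_records` and the dictionary layout are assumptions.

```python
def baseline_records(rows):
    """Keep one record per subject: period-1 covariates plus the outcome columns."""
    subjects = {}
    for row in rows:
        if row["period"] == 1:
            subjects[row["randid"]] = {
                "age": row["age"],
                "sex": row["sex"],
                "time": row["timemifc"],   # days from baseline to event or censoring
                "event": row["mi_fchd"],   # 1 = MI or fatal CHD, 0 = censored
            }
    return subjects

# The three records of participant 95148 from the example above.
rows = [
    {"randid": 95148, "age": 52, "sex": 2, "time": 0, "period": 1,
     "prevchd": 0, "mi_fchd": 1, "timemifc": 3607},
    {"randid": 95148, "age": 58, "sex": 2, "time": 2128, "period": 2,
     "prevchd": 0, "mi_fchd": 1, "timemifc": 3607},
    {"randid": 95148, "age": 64, "sex": 2, "time": 4192, "period": 3,
     "prevchd": 1, "mi_fchd": 1, "timemifc": 3607},
]

subjects = baseline_records(rows)
print(subjects[95148])
```

Because the outcome columns repeat on every row of a subject, any single row carries the full (time, event) pair; only the covariates differ between periods.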

The variables of the real data are described in Table 1 and Table 2.

Table 1 Description of data (part I)

Variable   Description                                        Units or coding               Range or count
RANDID     Unique identification ID                                                         2448-9999312
SEX        Sex                                                1 = male                      5022
                                                              2 = female                    6605
PERIOD     Examination cycle                                  1 = period 1                  4434
                                                              2 = period 2                  3930
                                                              3 = period 3                  3263
TIME       Number of days since baseline exam                 days                          0-4854
AGE        Age at exam                                        years                         32-81
SYSBP      Systolic blood pressure                            mmHg                          83.5-295
DIABP      Diastolic blood pressure                           mmHg                          30-150
BPMEDS     Use of anti-hypertensive medicine                  0 = not used                  10090
                                                              1 = using                     944
CURSMOKE   Current smoking                                    0 = nonsmoker                 6598
                                                              1 = smoker                    5029
CIGPDAY    Number of cigarettes smoked per day                0 = nonsmoker
                                                              1-90 = cigarettes per day
EDUC       Attained education                                 1 = 0-11 years
                                                              2 = high school diploma
                                                              3 = vocational school
                                                              4 = college (BA, BS) or above
TOTCHOL    Total cholesterol                                  mg/dL                         107-696
HDLC       High density lipoprotein cholesterol               mg/dL (period 3 only)         10-189
LDLC       Low density lipoprotein cholesterol                mg/dL (period 3 only)         20-565
BMI        Body mass index                                                                  14.4-56.8
GLUCOSE    Serum glucose                                      mg/dL                         39-478
DIABETES   Diabetic (casual glucose >= 200 mg/dL)             0 = not diabetic              11097
                                                              1 = diabetic                  530
HEARTRTE   Heart rate                                         beats/min                     37-220
PREVAP     Prevalent angina pectoris at exam                  0 = free of disease           11000
                                                              1 = prevalent disease         627
PREVCHD    Prevalent coronary heart disease at exam           0 = free of disease           10785
                                                              1 = prevalent disease         842
PREVMI     Prevalent myocardial infarction                    0 = free of disease           11253
                                                              1 = prevalent disease         374
PREVSTRK   Prevalent stroke                                   0 = free of disease           11475
                                                              1 = prevalent disease         152
PREVHYP    Prevalent hypertension                             0 = free of disease           6283
                                                              1 = prevalent disease         5344

Table 2 Description of data (part II)

Variable name  Description
ANGINA         Angina pectoris
HOSPMI         Hospitalized myocardial infarction
MI_FCHD        Hospitalized myocardial infarction or fatal coronary heart disease
ANYCHD         Angina pectoris, myocardial infarction, or fatal coronary heart disease
STROKE         Atherothrombotic infarction, cerebral embolism, intracerebral hemorrhage, or subarachnoid hemorrhage, or fatal cerebrovascular disease
CVD            Myocardial infarction, atherothrombotic infarction, cerebral embolism, intracerebral hemorrhage, or subarachnoid hemorrhage, or fatal cerebrovascular disease
HYPERTEN       Hypertension: either systolic >= 140 mmHg or diastolic >= 90 mmHg at the first or second exam
DEATH          Death from any cause
TIMEAP         Number of days from baseline exam to the first ANGINA event, or number of days from baseline to censor date
TIMEMI         Number of days from baseline exam to the first HOSPMI event, or number of days from baseline to censor date
TIMEMIFC       Number of days from baseline exam to the first MI_FCHD event, or number of days from baseline to censor date
TIMECHD        Number of days from baseline exam to the first ANYCHD event, or number of days from baseline to censor date
TIMESTRK       Number of days from baseline exam to the first STROKE event, or number of days from baseline to censor date
TIMECVD        Number of days from baseline exam to the first CVD event, or number of days from baseline to censor date
TIMEHYP        Number of days from baseline exam to the first HYPERTEN event, or number of days from baseline to censor date
TIMEDTH        Number of days from baseline exam to death, or number of days from baseline to censor date

Survival Analysis

The purpose of the current investigation is to build a predictive model to estimate the time to event for an individual. Because of the presence of censored observations, the response variable is a combination of the time to event and an indicator of censorship (time*censor).

The censorship is defined by the indicator

$$\delta_i = \begin{cases} 1, & \text{if the event is observed for subject } i, \\ 0, & \text{if the observation is censored.} \end{cases}$$

To predict the time to event from survival data, the most convenient way is to identify the probability density function f(t) of the time to event; its survival function S(t) and hazard function h(t) then follow easily. For a continuous time to event T, the density function is

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t} = -\frac{dS(t)}{dt}.$$

The survival function of the time to event is derived from it by

$$S(t) = P(T > t) = \int_t^{\infty} f(u)\,du.$$

The hazard function is, by definition,

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}.$$

For different survival data, different probability distributions can be selected, such as the exponential, Weibull or lognormal distribution.

For example, in one common parameterization with scale $\lambda > 0$ and shape $\alpha > 0$, the probability density function of the Weibull distribution is

$$f(t) = \alpha \lambda t^{\alpha - 1} \exp(-\lambda t^{\alpha}), \quad t \ge 0,$$

and its survival function and hazard function are

$$S(t) = \exp(-\lambda t^{\alpha}), \qquad h(t) = \alpha \lambda t^{\alpha - 1}.$$

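In fact, once one of the three functions is known, the others are easily obtained, since they are linked by the relations

$$h(t) = \frac{f(t)}{S(t)} = -\frac{d \ln S(t)}{dt}, \qquad H(t) = \int_0^t h(u)\,du = -\ln S(t), \qquad S(t) = \exp[-H(t)],$$

where H(t) is the cumulative hazard function.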
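The relations among the density, survival and hazard functions can be checked numerically. The sketch below does so for a Weibull distribution, assuming the parameterization S(t) = exp(-λ t^α); the parameter values are arbitrary and the function names are illustrative.

```python
import math

def weibull_pdf(t, lam, alpha):
    """Density f(t) = alpha * lam * t^(alpha - 1) * exp(-lam * t^alpha)."""
    return alpha * lam * t ** (alpha - 1) * math.exp(-lam * t ** alpha)

def weibull_survival(t, lam, alpha):
    """Survival function S(t) = exp(-lam * t^alpha)."""
    return math.exp(-lam * t ** alpha)

def weibull_hazard(t, lam, alpha):
    """Hazard h(t) = alpha * lam * t^(alpha - 1)."""
    return alpha * lam * t ** (alpha - 1)

def weibull_cumhaz(t, lam, alpha):
    """Cumulative hazard H(t) = lam * t^alpha."""
    return lam * t ** alpha

lam, alpha, t = 0.5, 1.5, 2.0  # arbitrary illustrative values

# h(t) = f(t) / S(t)
ratio = weibull_pdf(t, lam, alpha) / weibull_survival(t, lam, alpha)
# S(t) = exp(-H(t))
surv_from_cumhaz = math.exp(-weibull_cumhaz(t, lam, alpha))
# f(t) = -dS/dt, approximated by a central difference
step = 1e-5
neg_slope = (weibull_survival(t - step, lam, alpha)
             - weibull_survival(t + step, lam, alpha)) / (2 * step)

print(ratio, weibull_hazard(t, lam, alpha))
print(surv_from_cumhaz, weibull_survival(t, lam, alpha))
print(neg_slope, weibull_pdf(t, lam, alpha))
```

With α = 1 the hazard reduces to the constant λ, which is the exponential distribution.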

However, the distribution often cannot be pinned down exactly in mathematical form. Currently, three methods are popular for the analysis of survival data: the nonparametric Kaplan-Meier method, the semiparametric Cox proportional hazards method, and the parametric method, which is also called the Accelerated Failure Time (AFT) model.

1) Kaplan-Meier Method

The Kaplan-Meier method is also called the one-sample nonparametric method. The survival function can be estimated with the Kaplan-Meier plot, the life table, or the cumulative hazard estimator [53, 54, 56]. The method is very popular because it requires no mathematical assumption about the form of the hazard function or about proportional hazards. It simply uses the empirical probability of surviving beyond a certain time, and it assumes that the value of the survival function is constant between successive distinct observations. It is widely applied to study the survival function of a population, and it is very commonly used to compare survival functions among groups. If the sample size is large enough, the estimated survival function represents that of the population.

The method does not take covariates into account, so it is mainly descriptive. The survival function is usually presented as a step plot that decreases at each event time. The method mainly deals with categorical predictors or grouped continuous variables. It cannot handle continuous variables directly, and it cannot accommodate time-dependent variables. Even with these limitations it remains very popular, because it treats censored data well, particularly right-censored data.

There are two ways to estimate the survival function. The first, proposed by Kaplan and Meier, is called the Product-Limit estimator. It offers an efficient estimate of the survival function for all values of time t in the observed range when the data contain right-censored observations:

$$\hat{S}(t) = \begin{cases} 1, & t < t_1, \\ \prod_{t_i \le t} \left[ 1 - \dfrac{d_i}{y_i} \right], & t_1 \le t. \end{cases}$$

Before the first event time $t_1$ no event has occurred, so the survival function is 1. Each time an event happens, the survival function decreases, so the event times act as checkpoints. Here $t_1 < t_2 < \dots$ are the distinct event times, $d_i$ is the number of events that occur at $t_i$ (or in the corresponding interval), and $y_i$ is the number of subjects at risk at $t_i$. The risk set contains all subjects still under observation just before $t_i$, that is, those who have neither experienced the event nor been censored earlier; subjects whose event occurs at $t_i$ are included. This notation applies to all the following equations.

The variance of the estimator is given by Greenwood's formula,

$$\hat{V}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \le t} \frac{d_i}{y_i (y_i - d_i)}.$$
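The Product-Limit estimator and its variance can be sketched in a few lines. The code below is a minimal illustration that assumes right-censored data given as parallel lists of times and event indicators, and it uses Greenwood's formula for the variance. The toy data are made up, and the function name `kaplan_meier` is an assumption; the thesis itself relies on SAS procedures.

```python
def kaplan_meier(times, events):
    """Product-limit estimate with Greenwood variance.

    Returns a list of tuples (t_i, d_i, y_i, S_hat, var_hat),
    one per distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    greenwood_sum = 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0  # events at time t
        m = 0  # subjects leaving the risk set at t (events plus censored)
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            m += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            # The Greenwood term is undefined when every subject at risk
            # has the event; the estimate is then 0 and the term is skipped.
            if n_at_risk > d:
                greenwood_sum += d / (n_at_risk * (n_at_risk - d))
            out.append((t, d, n_at_risk, surv, surv ** 2 * greenwood_sum))
        n_at_risk -= m
    return out

# Made-up data: 1 = event observed, 0 = right-censored.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]

km = kaplan_meier(times, events)
for t, d, y, s, v in km:
    print(t, d, y, round(s, 4), round(v, 4))
```

Censored subjects never produce a step in the estimate, but they stay in the risk set until their censoring time, which is how the estimator uses their partial information.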

Once the survival function has been estimated, the cumulative hazard function can be calculated by the formula $\hat{H}(t) = -\ln \hat{S}(t)$.

However, for data with a small sample size, the cumulative hazard is better estimated by the Nelson-Aalen estimator,

$$\tilde{H}(t) = \begin{cases} 0, & t < t_1, \\ \sum_{t_i \le t} \dfrac{d_i}{y_i}, & t_1 \le t. \end{cases}$$

The Nelson-Aalen estimator of the survival function is then $\tilde{S}(t) = \exp[-\tilde{H}(t)]$. The estimated variance of the cumulative hazard function is

$$\sigma_H^2(t) = \sum_{t_i \le t} \frac{d_i}{y_i^2}.$$

The Nelson-Aalen estimator has two primary uses. The first is in selecting between parametric models for the time to event. For example, the plot of $\tilde{H}(t)$ versus time t is linear for an exponential distribution with hazard rate λ. The second is in providing a crude estimate of the hazard rate h(t).
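A sketch of the Nelson-Aalen estimator follows, under the same assumptions as before: right-censored data given as parallel lists, made-up toy data, and an illustrative function name. The variance used is the sum of d_i / y_i^2 over the event times up to t.

```python
import math

def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard at each distinct event time.

    Returns a list of tuples (t_i, H_tilde, var_H, S_tilde),
    where S_tilde = exp(-H_tilde).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    cumhaz = 0.0
    var = 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0  # events at time t
        m = 0  # subjects leaving the risk set at t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            m += 1
            i += 1
        if d > 0:
            cumhaz += d / n_at_risk
            var += d / n_at_risk ** 2
            out.append((t, cumhaz, var, math.exp(-cumhaz)))
        n_at_risk -= m
    return out

# Made-up data: 1 = event observed, 0 = right-censored.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]

na = nelson_aalen(times, events)
for t, h, v, s in na:
    print(t, round(h, 4), round(v, 4), round(s, 4))
```

Plotting the second column against the first gives the diagnostic mentioned above: a roughly straight line through the origin is consistent with a constant hazard.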

Usually, the time intervals are chosen as small as possible, so that, for example, only one event occurs at each time point or in each interval. The number of events at a particular time point or interval is therefore very small. As a result, the survival probability that the above methods calculate for any single interval is not very accurate, because it rests on a small number of events, but the overall probability of surviving to each time point is estimated more reliably.

The plot of the survival function versus time can be drawn from either estimator, $\hat{S}(t)$ or $\tilde{S}(t) = \exp[-\tilde{H}(t)]$.

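Since the estimate of the survival function and its variance are easy to compute, confidence intervals are not hard to obtain either. Three different confidence intervals can be calculated: the linear confidence interval, the log-transformed confidence interval, and the arcsine-square root transformed confidence interval. Among them, the linear confidence interval is the most popular. In the formulas below, $Z_{1-\alpha/2}$ is the $1-\alpha/2$ percentile of the standard normal distribution.

Linear confidence interval:

$$\hat{S}(t) \pm Z_{1-\alpha/2}\, \sigma_S(t)\, \hat{S}(t), \quad \text{where } \sigma_S^2(t) = \frac{\hat{V}[\hat{S}(t)]}{\hat{S}(t)^2}.$$

Log-transformed confidence interval:

$$\left[ \hat{S}(t)^{1/\theta},\ \hat{S}(t)^{\theta} \right], \quad \text{where } \theta = \exp\left\{ \frac{Z_{1-\alpha/2}\, \sigma_S(t)}{\ln \hat{S}(t)} \right\}.$$

Arcsine-square root transformed confidence interval:

$$\sin^2\left\{ \max\left[ 0,\ \arcsin \hat{S}(t)^{1/2} - 0.5\, Z_{1-\alpha/2}\, \sigma_S(t) \left( \frac{\hat{S}(t)}{1 - \hat{S}(t)} \right)^{1/2} \right] \right\} \le S(t) \le \sin^2\left\{ \min\left[ \frac{\pi}{2},\ \arcsin \hat{S}(t)^{1/2} + 0.5\, Z_{1-\alpha/2}\, \sigma_S(t) \left( \frac{\hat{S}(t)}{1 - \hat{S}(t)} \right)^{1/2} \right] \right\}.$$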
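The three pointwise intervals can be compared on a single estimate. The sketch below assumes the standard textbook forms of the linear, log-transformed and arcsine-square root intervals, with σ_S(t) denoting the standard error of the survival estimate divided by the estimate itself. The input values are made up, and the function names are illustrative.

```python
import math

Z_95 = 1.959964  # 97.5th percentile of the standard normal distribution

def linear_ci(s, sigma, z=Z_95):
    """Linear interval: S_hat +/- z * sigma * S_hat (may leave [0, 1])."""
    return (s - z * sigma * s, s + z * sigma * s)

def log_ci(s, sigma, z=Z_95):
    """Log-transformed interval [S_hat^(1/theta), S_hat^theta], for 0 < S_hat < 1."""
    theta = math.exp(z * sigma / math.log(s))
    return (s ** (1.0 / theta), s ** theta)

def arcsine_ci(s, sigma, z=Z_95):
    """Arcsine-square root transformed interval, for 0 < S_hat < 1."""
    center = math.asin(math.sqrt(s))
    half = 0.5 * z * sigma * math.sqrt(s / (1.0 - s))
    lower = math.sin(max(0.0, center - half)) ** 2
    upper = math.sin(min(math.pi / 2, center + half)) ** 2
    return (lower, upper)

# Made-up inputs: a survival estimate of 4/9 with sigma_S = 0.5.
s_hat, sigma_s = 4 / 9, 0.5

lin = linear_ci(s_hat, sigma_s)
log_int = log_ci(s_hat, sigma_s)
arc = arcsine_ci(s_hat, sigma_s)
print(lin, log_int, arc)
```

The linear interval is symmetric about the estimate and can extend below 0 or above 1 when the standard error is large, whereas the two transformed intervals always stay within the unit interval.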

2) Cox Proportional Hazards Model

The Cox proportional hazards model is also called the semiparametric regression model. As discussed before, one of the interests of survival analysis is to compare the survival function between groups. The nonparametric method does this job well if the groups are identical except for the one characteristic being compared, such as treatment or drug use. In most cases, however, the individuals in the data set differ in more than one characteristic. For example, the subjects in the current data set have demographic variables recorded, such as age, sex, and education. These characteristics can affect the survival function. For example, patients with high blood pressure are more prone to develop heart failure than subjects with normal blood pressure. A comparison made without considering these effects may therefore be severely biased. In addition, another goal of survival analysis is to predict the potential survival time of a particular subject, and ignoring this information will lead to inaccurate predictions.

The Cox proportional hazards model considers both the underlying distribution of the time to event and the effects of other characteristics. Each observation therefore consists of the triple $(T_i, \delta_i, Z_i(t))$, where $T_i$ is the time on study, $\delta_i$ is the event indicator, and $Z_i(t)$ is the set of covariates for the ith subject. The covariates can be time-dependent, with $Z_i(t) = (Z_{i1}(t), \dots, Z_{ip}(t))^T$.

In general, the hazard function at time t is defined as

$$h(t \mid Z) = h_0(t) \exp(\beta^T Z),$$

where $h_0(t)$ is an arbitrary baseline hazard, $\beta = (\beta_1, \dots, \beta_p)^T$ is the parameter vector, and $\beta^T Z = \sum_{k=1}^{p} \beta_k Z_k$. Here Z is a time-fixed covariate vector.

The model is called proportional hazard model, because the hazard is proportional for two subjects with

covariates Z and Z*. It can get from the formula:

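Consider a single covariate, for example educational level with three levels, coded by two indicator variables with coefficients $\beta_1$ (level 2) and $\beta_2$ (level 3). The relative risks among the three levels follow from the formula above. With level 1 as the reference, the relative risk of level 2 to level 1 is $\exp(\beta_1)$, of level 3 to level 1 is $\exp(\beta_2)$, and of level 3 to level 2 is $\exp(\beta_2 - \beta_1)$.

Their hazard functions are:

$h(t \mid \text{level 1}) = h_0(t)$, $h(t \mid \text{level 2}) = h_0(t)\exp(\beta_1)$, $h(t \mid \text{level 3}) = h_0(t)\exp(\beta_2)$.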
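The relative risks for a three-level covariate can be illustrated with a minimal Python sketch. The coefficients and the baseline hazard below are hypothetical values chosen only for illustration; they are not estimates from this study. The sketch assumes level 1 is the reference and that levels 2 and 3 each have their own coefficient.

```python
import math

# Hypothetical coefficients for education levels 2 and 3 (level 1 is the reference).
# These numbers are illustrative only and are not estimates from the thesis.
beta_level2 = -0.20
beta_level3 = -0.35

def baseline_hazard(t):
    """Hypothetical baseline hazard h0(t); any positive function works here."""
    return 0.01 * t

def hazard(t, beta):
    """Hazard h(t | Z) = h0(t) * exp(beta) for a subject in the given level."""
    return baseline_hazard(t) * math.exp(beta)

rr_2_vs_1 = math.exp(beta_level2)
rr_3_vs_1 = math.exp(beta_level3)
rr_3_vs_2 = math.exp(beta_level3 - beta_level2)

# The ratio of two hazards is the same at every time point.
ratios = [hazard(t, beta_level3) / hazard(t, beta_level2) for t in (1.0, 5.0, 20.0)]
```

Because the baseline hazard cancels in the ratio, every entry of `ratios` equals `rr_3_vs_2`, which is what "proportional" means here.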

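Three tests are commonly used to test the significance of covariate effects on the hazard in the regression: the Wald test, the likelihood ratio test, and the score test.

The Wald test is based on the asymptotic normality of the partial maximum likelihood estimate $b$. It follows the formula:

$X_W^2 = (b - \beta_0)'\, I(b)\, (b - \beta_0)$,

where $\beta_0$ is the value under the null hypothesis, $I(b)$ is the information matrix evaluated at $b$, and $p$ is the dimension of the covariate vector. Under the null hypothesis, the statistic has an asymptotic chi-squared distribution with $p$ degrees of freedom.

The likelihood ratio test follows the formula:

$X_{LR}^2 = 2[LL(b) - LL(\beta_0)]$,

where $LL(\beta)$ is the log partial likelihood.

The score test follows the formula:

$X_{SC}^2 = U(\beta_0)'\, I^{-1}(\beta_0)\, U(\beta_0)$,

where $U(\beta) = \partial LL(\beta) / \partial \beta$ is the score vector.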
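The three tests can be sketched for the simplest case of a single covariate and no tied event times. The Python sketch below is an illustration under those assumptions, with hypothetical toy data; the function names are invented for this example, and it is not the SAS implementation used in the study. It computes the log partial likelihood, its first derivative (the score) and its negative second derivative (the information), and then the Wald, likelihood ratio and score statistics for the null hypothesis that the coefficient is 0.

```python
import math

def partial_likelihood(beta, times, events, z):
    """Log partial likelihood, score and information for one covariate (no tied times)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    loglik = score = info = 0.0
    for pos, i in enumerate(order):
        if events[i] == 0:
            continue
        risk_set = order[pos:]                      # subjects still under observation
        weights = [math.exp(beta * z[j]) for j in risk_set]
        s0 = sum(weights)
        s1 = sum(w * z[j] for w, j in zip(weights, risk_set))
        s2 = sum(w * z[j] ** 2 for w, j in zip(weights, risk_set))
        loglik += beta * z[i] - math.log(s0)
        score += z[i] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return loglik, score, info

def fit(times, events, z, iterations=25):
    """Newton-Raphson maximisation of the partial likelihood, starting at beta = 0."""
    beta = 0.0
    for _ in range(iterations):
        _, score, info = partial_likelihood(beta, times, events, z)
        beta += score / info
    return beta

# Hypothetical toy data: distinct times, event indicator, one binary covariate.
times = [2, 3, 5, 7, 8, 11, 13, 14]
events = [1, 1, 0, 1, 1, 1, 0, 1]
z = [1, 0, 1, 1, 0, 0, 1, 0]

b = fit(times, events, z)
ll_b, score_b, info_b = partial_likelihood(b, times, events, z)
ll_0, score_0, info_0 = partial_likelihood(0.0, times, events, z)

wald = (b - 0.0) ** 2 * info_b             # (b - beta0)' I(b) (b - beta0)
likelihood_ratio = 2.0 * (ll_b - ll_0)     # 2 [LL(b) - LL(beta0)]
score_test = score_0 ** 2 / info_0         # U(beta0)' I^{-1}(beta0) U(beta0)
```

With one covariate, each statistic would be compared with a chi-squared distribution with 1 degree of freedom. The three statistics are asymptotically equivalent, so they are usually close to each other.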

Because the semiparametric model is built on the assumption of proportional hazards, that assumption should be tested, either by checking covariates for time dependence or graphically. To check the time dependence of a covariate, an interaction of the covariate with time (usually log-transformed time) is added to the Cox regression and tested for significance.
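As a worked form of this check, the time-interaction model for a single covariate can be written as follows. The symbols $\beta_1$ and $\beta_2$ are notation introduced here for illustration, and the second expression assumes $Z_1$ is a binary covariate.

```latex
h(t \mid Z_1) = h_0(t)\,\exp\bigl[\beta_1 Z_1 + \beta_2\, Z_1 \ln t\bigr],
\qquad
\frac{h(t \mid Z_1 = 1)}{h(t \mid Z_1 = 0)} = \exp(\beta_1)\, t^{\beta_2}
```

The hazard ratio is constant in $t$ only when $\beta_2 = 0$, so a significant test of $H_0: \beta_2 = 0$ is evidence against proportional hazards for $Z_1$.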

The assumption can also be checked graphically. The first method is to plot the cumulative hazard rate, or a transformation of it, against time for each stratum of a covariate. For example, let $Z_1$ be the covariate of interest and $Z_2$ the remaining set of covariates. To check whether $Z_1$ satisfies the proportional hazards assumption, it is stratified into $K$ strata and a Cox model with the covariates $Z_2$ is fitted within the strata. If the assumption holds, the baseline cumulative hazard functions of the strata should be constant multiples of each other, so the curves of their logarithms against time should be roughly parallel.

Besides the proportional hazards assumption, other aspects of the model should also be checked, such as the appropriate functional form of each covariate in the regression, the accuracy of predictions from the fitted model, and the overall fit of the final model.

Finding the best form of a covariate means finding the functional form, for example logarithmic or quadratic, that best explains its effect on survival in the proportional hazards model. The martingale residual plot is used for this purpose. The martingale residual has the following form,

$\hat{M}_j = \delta_j - r_j$,

where $r_j = \hat{H}_0(T_j)\exp(\sum_k b_k Z_{jk})$ is the estimated cumulative hazard for subject j at time $T_j$.

To build the plot, a Cox proportional hazards regression is fitted with all covariates ($Z^*$) except the one being tested ($Z_1$), whose functional form $f(Z_1)$ is to be determined.

The martingale residuals are then computed and plotted against the tested covariate. The form of $f$ can be read from the shape of the plot. For example, a linear shape means that no transformation is needed.

The Cox-Snell residual plot is used to assess the overall fit of the final regression model. Suppose the proportional hazards model $h(t \mid Z_j) = h_0(t)\exp(\sum_k \beta_k Z_{jk})$ is correctly specified. If the probability integral transformation is applied to the true death time $X$, the resulting random variable $U = H(X \mid Z_j)$ has an exponential distribution with hazard rate 1. Therefore, if the model is correct and the estimates $b$ are close to the true $\beta$, the residuals $r_j$ will look like a censored sample from a unit exponential distribution. The residual $r_j$ is defined as:

$r_j = \hat{H}_0(T_j)\exp(\sum_k b_k Z_{jk})$,

where $\hat{H}_0(t)$ is the estimated baseline cumulative hazard.

To check whether the $r_j$ behave like a sample from a unit exponential distribution, the Nelson-Aalen estimator of their cumulative hazard rate, $\hat{H}_r(r_j)$, is computed and plotted against $r_j$. If the model fits well, the plot should be a straight line through the origin with slope 1.
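The Nelson-Aalen step of this diagnostic can be sketched in a few lines of Python. This is a minimal illustration, assuming the Cox-Snell residuals have already been computed from a fitted model; the residual values below are hypothetical and the function name is invented for this example.

```python
def nelson_aalen(values, events):
    """Nelson-Aalen cumulative hazard at each distinct event value.

    values: observed values (here, Cox-Snell residuals)
    events: 1 if the event was observed, 0 if censored
    Returns a list of (value, cumulative_hazard) pairs.
    """
    pairs = sorted(zip(values, events))
    n = len(pairs)
    estimate = []
    cum_hazard = 0.0
    i = 0
    while i < n:
        value = pairs[i][0]
        at_risk = n - i            # subjects with value >= current value
        deaths = 0
        j = i
        while j < n and pairs[j][0] == value:
            deaths += pairs[j][1]
            j += 1
        if deaths > 0:
            cum_hazard += deaths / at_risk
            estimate.append((value, cum_hazard))
        i = j
    return estimate

# Hypothetical residuals, not taken from the thesis.
residuals = [0.1, 0.3, 0.3, 0.7, 1.2]
observed = [1, 1, 0, 1, 1]
curve = nelson_aalen(residuals, observed)
```

Plotting the second element of each pair in `curve` against the first gives the diagnostic plot. Points lying near the 45-degree line through the origin suggest an adequate fit.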

3) Accelerated Failure Time Model

The accelerated failure time model (AFT model) is a parametric model for estimating univariate survival and for censored-data regression problems. A nonparametric model estimates the survival function from the observations only and does not require any distributional assumption. However, the problem can become much simpler if the data fit a known distribution well, since fewer parameters have to be estimated. The AFT model states that the survival function of an individual with covariate Z at time t equals the baseline survival function evaluated at time $t\exp(\theta' Z)$, where $\theta$ is a vector of regression coefficients. That is,

$S(t \mid Z) = S_0[t\exp(\theta' Z)]$,

where $\exp(\theta' Z)$ is called the acceleration factor. It describes how a change in the covariates rescales time relative to the baseline time scale. It implies that the median time to event with covariate Z is the baseline median time to event divided by the acceleration factor. The hazard function can be derived similarly and is also related to the baseline hazard rate:

$h(t \mid Z) = \exp(\theta' Z)\, h_0[t\exp(\theta' Z)]$.
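The effect of the acceleration factor on the median can be checked numerically. The sketch below assumes a Weibull baseline survival function and a single covariate; the parameter values `lam`, `alpha` and `theta` are hypothetical and serve only to illustrate the relationship.

```python
import math

# Hypothetical Weibull baseline S0(t) = exp(-lam * t**alpha); values are illustrative.
lam, alpha = 0.001, 1.5
theta = 0.4      # hypothetical regression coefficient
z = 1.0          # covariate value

def baseline_survival(t):
    """Baseline survival function S0(t)."""
    return math.exp(-lam * t ** alpha)

def survival(t, z):
    """AFT survival function S(t | Z) = S0(t * exp(theta * Z))."""
    return baseline_survival(t * math.exp(theta * z))

acceleration = math.exp(theta * z)
baseline_median = (math.log(2) / lam) ** (1 / alpha)
median_with_z = baseline_median / acceleration
```

Evaluating `survival(median_with_z, z)` returns 0.5 up to rounding, which confirms that the median with covariate Z is the baseline median divided by the acceleration factor.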

Furthermore, it follows that the log-transformed time has a linear relationship with the covariate values:

$Y = \ln T = \mu + \gamma' Z + \sigma W$,

where $\gamma$ is a vector of regression coefficients and $W$ is the error distribution. The survival function of T can then be found from:

$S(t \mid Z) = P(T > t \mid Z) = P\left(W > \frac{\ln t - \mu - \gamma' Z}{\sigma}\right)$.

Since $W$ follows a particular distribution, its survival function is known and is evaluated at $(\ln t - \mu - \gamma' Z)/\sigma$.

Therefore, the mean time or mean residual life can be estimated from the model. The parameters of these models are estimated by maximum likelihood, and their variance-covariance matrix may be obtained by the delta method.

To select a model, the transformed Nelson-Aalen estimate of the cumulative hazard function is plotted against some function of time. For the Weibull distribution, for example, the cumulative hazard rate is $H(t) = \lambda t^{\alpha}$, so $\ln H(t) = \ln\lambda + \alpha \ln t$ and a plot of $\ln \hat{H}(t)$ against $\ln t$ is linear. For the exponential distribution, the cumulative hazard is $H(t) = \lambda t$, so a plot of $\hat{H}(t)$ against $t$ is a straight line through the origin with slope $\lambda$. If linearity does not hold, the parametric model does not fit the data. When two groups are compared, a Q-Q plot is usually used to check whether the model fits the data well. For example, if $S_0(t)$ and $S_1(t)$ are the survival functions of group 0 and group 1, then

$S_1(t) = S_0(\theta t)$,

where $\theta$ is the acceleration factor.

The following also holds, since $S_0(t_{0p}) = 1 - p = S_1(t_{1p}) = S_0(\theta t_{1p})$:

$t_{0p} = \theta\, t_{1p}$,

where $t_{0p}$ and $t_{1p}$ are the p-th quantiles of groups 0 and 1 respectively.

Tests for Significance

For hypothesis tests, the significance threshold is a p-value of 0.05. Other thresholds are stated in the sections where they are used.

Methods and Procedures

SAS Procedures

The SAS 9.1.3 statistical package is used for the current data analysis. In SAS, Proc LIFETEST, LIFEREG and PHREG are the three major procedures for survival analysis. Proc LIFETEST computes nonparametric estimates of the survival function by the Kaplan-Meier method. Since we are interested in estimating the survival distribution function (SDF), the Kaplan-Meier curve can represent it if the sample size is large enough. Proc LIFETEST is also used to plot and compare survival curves among groups, and it offers the log-rank test to evaluate the significance of differences.

Proc PHREG performs regression analysis of survival data based on the Cox proportional hazards model. To build the predictive model for the hazard function, Proc PHREG is used for model selection with the stepwise option. Following the same strategy, the model is then extended with interaction terms. Finally, the model is refined by backward elimination, removing the least significant predictor (the one with the highest p-value) step by step. The procedure is also used to test the time dependence of variables, to generate residual plots for model diagnostics, and to plot survival functions for comparison between groups.

Proc LIFEREG fits parametric models to failure time data, also known as accelerated failure time models. In the current study, it is used to assess the effectiveness of the variables selected by the Cox proportional hazards model. To fit the model, the failure time is log-transformed and the Weibull distribution is specified for the model.

Results

Data exploration

Information on the current data set is shown in Table 3. It has 11627 observations. The continuous variables are totchol, age, sysbp, diabp, cigpday, bmi, heartrte and glucose, together with the variables for time to event or censoring (timeap, timemi, timemifc, timechd, timestrk, timecvd, timedth, timehyp). The discrete variables are sex, cursmoke, diabetes, bpmeds and educ, together with the event indicators death, angina, hospmi, mi_fchd, anychd, stroke, cvd and hyperten. The variables totchol, cigpday, bmi, bpmeds, glucose and educ have missing values. In the current study, observations with missing values are removed for a complete case analysis.

Table 3 Data summary

Before the analysis, some definitions need to be clarified. First, the events recorded in the current data include death, angina, hospmi, mi_fchd, anychd, stroke, cvd and hyperten. Death may or may not be related to cardiovascular disease, and death unrelated to the disease is treated as censoring. Disease-related death and the other events are taken as indications of the onset of cardiovascular disease, since the disease will sooner or later follow the onset of these intermediate symptoms. Therefore, if any of the above events happens to a subject, the observation is recorded as an event. Otherwise, it is treated as censored. Second, the time to event is the period from baseline to the end of the investigation, dropout, death, or the earliest of the events, whichever comes first. Third, to better investigate the effects of the variables on disease onset, subjects with preexisting conditions are removed before the analysis. Finally, since this is a longitudinal data set, only observations at period 1 are included in the final data set, to reduce the impact of the repeated measurements over time. In other words, the study shows how the survival function is influenced by conditions such as age and biochemical measurements at that time.

Overall, there are 949 censored observations (censor=0) and 1937 observations with events (censor=1).
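The event and time definitions above can be sketched as a small Python function. This is only an illustration of the rule "event if any indicator is 1, time is the earliest event time, otherwise the end of follow-up"; the exact data step used in the study is not shown in the text, and the function name, record layout and values below are hypothetical.

```python
def first_event(record, event_names, followup_end):
    """Return (time, censor) for one subject.

    record maps each event name to (indicator, time). censor = 1 if any event
    was observed, 0 otherwise, matching the coding used in the text.
    """
    event_times = [record[name][1] for name in event_names if record[name][0] == 1]
    if event_times:
        return min(event_times), 1
    return followup_end, 0

# Hypothetical subjects; the values are illustrative, not from the data set.
events = ["angina", "stroke", "hyperten"]
subject_a = {"angina": (0, 8766), "stroke": (1, 6000), "hyperten": (1, 2100)}
subject_b = {"angina": (0, 8766), "stroke": (0, 8766), "hyperten": (0, 8766)}

result_a = first_event(subject_a, events, followup_end=8766)
result_b = first_event(subject_b, events, followup_end=8766)
```

Subject A has two observed events and is assigned the earlier one, while subject B has none and is censored at the end of follow-up.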

Model Selection

The goal of the current investigation is to identify risk factors for cardiovascular diseases and their contributions to the onset and progression of the diseases. To determine which variables to include in the final model, a univariate analysis is first applied to identify the effect of each individual variable on time to event before proceeding to more complicated model selection. For the categorical variables, sex, smoking status (cursmoke), education level (educ) and diabetes status (diabetes), nonparametric Kaplan-Meier curves are generated (Figure 1). The significance of each effect is tested individually by the log-rank test and summarized in Table 4. The log-rank test is preferred to the Wilcoxon test here because it is more sensitive to differences that occur late in follow-up. In addition, the log-rank test will not generate bias when the pattern of censoring differs between groups. The data indicate that sex, diabetes and educ could be risk factors for the model, since their p-values are less than 0.0001, far below the significance threshold of 0.05. The p-value for cursmoke, however, is 0.6245, so it will not be included in the final model. In addition, the survival curves for sex, diabetes and cursmoke do not cross between groups. Level 1 of educ behaves differently from the other three levels, so those three can be treated as a single group when fitting the model, in order to satisfy the proportional hazards assumption. These variables are therefore suitable candidates for a proportional hazards model.

For the continuous variables, the univariate analysis is performed by Cox proportional hazards regression. Variables with a p-value at or below a threshold of 0.2 to 0.25 are kept for model selection, in case they contribute to disease onset or progression in combination with other predictors.

In the current study, the variables totchol, age, sysbp, diabp, bmi and glucose each have a significant effect on disease onset, with p-values less than 0.0001. Although cigpday has a p-value of 0.106, it is kept for model selection since this is below 0.25. The p-value for heartrte is 0.2876, which is over the threshold for model selection, but heart rate is a very important predictor, so it is included for model selection as well. As reported, coronary mortality rates increase progressively with antecedent heart rate, especially for male patients 35 to 64 years of age [76]. The results of the Cox proportional hazards analysis for the individual variables are summarized at the top of Table 4. Therefore, the variables sex, educ, diabetes, totchol, age, sysbp, diabp, cigpday, bmi, glucose and heartrte are kept for model building.

Table 4 Univariate analysis

For the following model-building process, the variables above with p-values less than 0.25, except heartrte, are included. Since educ has four levels, it enters model selection with the subgroup educ = 1 as the reference group. To further confirm its effect on disease onset or progression, it is tested by proportional hazards regression. Its p-value is 0.0116, which is below the significance threshold of 0.05 (Table 5), indicating that the variable should be included in the predictive model. As shown above, when education level is recoded into two levels, level 1 and the combined levels above 1, the proportional hazards assumption can be satisfied.

Table 5 Significance test for variable: educ

To select the best subset of variables for the model, stepwise selection was applied in the SAS procedure PHREG. The stepwise selection process consists of a series of alternating forward selection and backward elimination steps. The former adds variables to the model, and the latter removes variables from it. The threshold for entering the model is set at a p-value of 0.25 (SLENTRY = 0.25), while the threshold for staying in the model is set at a p-value of 0.15 (SLSTAY = 0.15). This means that only variables with a p-value less than 0.25 are entered into the model, and a variable is kept only if its p-value is less than 0.15. The results of the stepwise proportional hazards regression are displayed below (Table 6). With these settings, the variables sex, age, totchol, sysbp, diabp, cigpday, bmi, diabetes, glucose and educ are included in the raw model. Their significance in the model can be found in the column named Pr > ChiSq in the following table, and the effects of the individual predictors, estimated by maximum likelihood, are shown in the column named Parameter Estimate. The predictors sex and educ have negative effects on the hazard rate, indicating that females have a lower hazard than males and that more educated subjects have a lower hazard than less educated ones. Since some variables have non-significant p-values, the model can be improved by eliminating the variables with the largest p-values one by one.
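The alternation between forward and backward steps can be sketched in Python as below. This is a simplified illustration of the idea, not a reimplementation of Proc PHREG: the `pvalue` callable is a hypothetical stand-in for refitting the model, and the p-values in `toy_table` are invented and do not depend on which other variables are in the model, which a real fit would.

```python
def stepwise(candidates, pvalue, slentry=0.25, slstay=0.15):
    """Simplified stepwise selection.

    pvalue(var, others) returns the p-value of `var` given the other
    variables in the model (a hypothetical stand-in for a model fit).
    """
    model = []
    while True:
        # Forward step: enter the most significant remaining candidate.
        remaining = [v for v in candidates if v not in model]
        if not remaining:
            break
        best = min(remaining, key=lambda v: pvalue(v, model))
        if pvalue(best, model) >= slentry:
            break
        model.append(best)
        # Backward step: drop the least significant variable if it fails SLSTAY.
        worst = max(model, key=lambda v: pvalue(v, [m for m in model if m != v]))
        if pvalue(worst, [m for m in model if m != worst]) >= slstay:
            model.remove(worst)
            if worst == best:
                break  # the variable just entered was removed again: stop
    return model

# Hypothetical p-values for illustration only (not results from the thesis).
toy_table = {"x1": 0.001, "x2": 0.03, "x3": 0.12, "x4": 0.20, "x5": 0.60}
selected = stepwise(list(toy_table), lambda var, others: toy_table[var])
```

In this toy run, `x4` passes the entry threshold of 0.25 but fails the stay threshold of 0.15, so it is removed immediately and the procedure stops with `x1`, `x2` and `x3` in the model.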

Table 6 Raw variables selected by Stepwise

To further optimize the Cox model, the variable with the highest p-value above the significance threshold is removed from the predictive model, one at a time, until all remaining variables have a significant effect on the hazard rate. In Table 6, glucose has the highest p-value, so it is removed in the next regression. The result is shown below (Table 7).

The same strategy is applied in the following step. Although the p-value of totchol is higher, diabp is removed next, since the model already includes another blood pressure measure, sysbp. After this step, most of the predictors are significant, with p-values less than or close to 0.05 (Table 8).

Table 7 Step 2 elimination of variable with high p-value by Stepwise

The predictors might interact with each other in the model. To test for interactions among the variables, all raw variables and all possible interaction terms are included in a proportional hazards regression, and the interaction terms are selected with the same strategy. The result is shown below (Table 9). According to the table, the interactions sex_age, totchol_cigpday, totchol_bmi, age_educ, sysbp_diabp and diabetes_glucose might affect the prediction. Their estimates and p-values are shown in the columns Parameter Estimate and Pr > ChiSq respectively.

Table 8 Step 3 elimination of variable with high p-value by Stepwise

To assess the fit with all originally selected variables and the potential interaction terms, all of them are included in the proportional hazards regression. The full list of variables is sex, totchol, age, sysbp, cigpday, bmi, diabetes, educ, diabp, glucose, sex_age, totchol_cigpday, totchol_bmi, age_educ, sysbp_diabp and diabetes_glucose. Because some significant interaction terms involve variables that were not significant and were removed in the initial optimization, such as diabp and glucose, these variables are reintroduced into model selection. Some terms are not significant when all possible predictors are included, such as educ and diabetes_glucose, whose p-values are 0.8571 and 0.8205 respectively. The effect of sex is negative, which makes sense, since females have a lower risk of cardiovascular disease than males. However, the model needs further optimization, because some variables in the raw model are not significant. The details are shown below (Table 10).

Table 9 Interaction terms selected by stepwise

As before, to further optimize the predictive model, the variable with the highest p-value is removed from the regression. In this case, the interaction term diabetes_glucose is removed first, even though its p-value is lower than that of educ, because glucose did not have a significant effect on the prediction in the first place. The regression result is displayed below (Table 11).

Table 10 Raw Final Model with Interaction Terms

Following the same strategy, the interaction term age_educ is removed in the next regression, since it has the highest p-value, 0.9147 (Table 12).

Table 11 Optimize Final Model with Interaction Terms_1

Next, the predictors diabp and sex_diabp are removed from the regression (Table 13).

Table 12 Optimize Final Model with Interaction Terms_2

Table 13 Optimize Final Model with Interaction Terms_3

Similarly, the variable sex_sysbp is removed in the next regression (Table 14). The final model then includes the variables sex, totchol, age, sysbp, cigpday, bmi, diabetes, educ and sex_age. However, in this model the effect of sex on cardiovascular disease is positive, contrary to the well-known fact that females have a lower risk of cardiovascular disease than males. The estimated effect of sex might be influenced by the interaction terms involving sex, such as sex_age and sex_sysbp.

Validation of Assumption of Proportionality

The final model is built on the major assumption that the hazards of different groups are proportional. Two methods are widely used to test this assumption: plotting transformed Kaplan-Meier curves ($\log[-\log S(t)]$ versus $\log t$) using Proc LIFETEST, or testing the time dependence of the covariates. If the assumption of proportionality holds, the transformed Kaplan-Meier curves of the groups are parallel, and the effects of the covariates on the prediction do not vary with time. The Kaplan-Meier curves are usually used for categorical variables, or for continuous variables that can be easily grouped. They can be produced in Proc LIFETEST as plots of the survival function against time, $S(t)$ versus $t$, or of $\log[-\log S(t)]$ versus $\log t$. In the current study, the plot with log-transformed time and survival function is used. For continuous variables, time-dependent variables can be included in the model instead, defined as interactions between log-transformed time and the variables. These can be tested directly in Proc PHREG.

Table 14 Optimize Final Model with Interaction Terms_4
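The reason parallel curves indicate proportional hazards can be verified numerically. The Python sketch below assumes two groups whose hazards differ by a constant factor exp(beta); the baseline survival function and the value of `beta` are hypothetical. Under proportional hazards, the survival function of the second group is the baseline survival raised to the power exp(beta), so the log(-log) curves differ by the constant beta at every time.

```python
import math

beta = 0.7  # hypothetical log hazard ratio between two groups

def survival_group0(t):
    """Hypothetical baseline survival function (Weibull-shaped)."""
    return math.exp(-0.002 * t ** 1.3)

def survival_group1(t):
    """Under proportional hazards, S1(t) = S0(t) ** exp(beta)."""
    return survival_group0(t) ** math.exp(beta)

def log_minus_log(s):
    """Transformation used for the diagnostic plot: log(-log S)."""
    return math.log(-math.log(s))

# Vertical distance between the two transformed curves at several times.
gaps = [
    log_minus_log(survival_group1(t)) - log_minus_log(survival_group0(t))
    for t in (10.0, 100.0, 1000.0)
]
```

Every entry of `gaps` equals `beta` up to rounding, so the two transformed curves are parallel. Curves that converge, diverge or cross in a real plot suggest that the hazard ratio changes over time.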

Our data show that the categorical variables sex, diabetes and education level satisfy proportionality in plots of $\log[-\log S(t)]$ versus $\log t$. For educ, the proportionality is more evident when level 1 is compared with the combination of levels 2, 3 and 4 (Figure 2).

However, proportionality does not hold overall for the continuous variables in the final model, since the overall p-value is less than 0.0001 (Table 15). In particular, the time dependence comes from the variables age, sex_age and cigpday, which have p-values of < 0.0001, 0.0083 and 0.0005 respectively (Table 15).

To improve the performance of the model, these three variables are transformed into time-dependent form (x*log(t), where x is the variable and t is the time to event) and used to fit the model in place of the original ones. The new model is then tested again for proportionality by proportional hazards regression. The results show that the assumption of proportionality holds for the continuous variables sysbp and bmi, since the p-value of the overall test is 0.4068, and the p-values for sysbp and bmi are 0.7787 and 0.254 respectively (Table 16).

Table 15 Test for proportional assumption

Model Optimization

To further optimize the final model, the three transformed time-dependent variables are fitted in place of the original variables: the model including sex, totchol, sysbp, bmi, diabetes, educ, ageT, sex_ageT and cigpdayT is fitted by proportional hazards regression. The result shows that sex_ageT is no longer significant, with a p-value of 0.1546 (Table 17).

Table 16 Test for proportional assumption-2

To solve this problem, the model is stratified by sex and the regression is rerun. All variables are then significant in the model (Table 18). The results indicate that sex is a dominant predictor and is better handled by stratification, which means that observations from the two sexes are analyzed with their own baseline hazard rates. Overall, the model fits the observations in the data.

Table 17 Raw final model with time-dependent variable

Table 18 Final model with time-dependent variable and stratified variable: sex

In summary, the final proportional hazards model is given by the formula:

where females and males have their own baseline hazard functions. The model satisfies the assumption of proportionality for all categorical variables (diabetes, sex and educ) and for some of the continuous variables (bmi and sysbp). However, age, cigpday and sex_age behave in a time-dependent manner. Age changes deterministically with time, so age at baseline should be treated as a fixed covariate. The same applies to sex_age, since sex never changes over time. In other words, cigpday is the only truly time-dependent variable. Therefore, the model is suitable for predicting the time to event for a subject with fixed measurements or characteristics, but more care is needed for precise prediction when time-dependent covariates are present.

Table 19 Stratification with variable: Sex

Model Diagnostics

To assess the overall fit of the model with these predictors, diagnostics are performed using deviance residual plots. In SAS, the option RESDEV is used to obtain the residuals. The plots against the linear predictor and against the individual predictors show that the majority of the residuals are concentrated around the baseline and distributed almost evenly over their respective ranges. Some outliers are observed in the plots, but overall the fit is good (Figure 3).

Figure 3

Comparison of Survival Functions

Since the model fits well with this set of covariates, the survival function is also examined for each covariate individually, with the values of all other variables held fixed. The results are shown below (Figure 4). The survival function decreases more quickly for subjects who are male, are older, have a higher body mass index, smoke more, have diabetes, have a lower education level, have a higher value of the sex and age interaction, have a higher level of total cholesterol, or have higher blood pressure than for subjects with the opposite characteristics and measurements. The effects observed in the survival functions match the parameter estimates from the proportional hazards regression, which supports the model.

Parametric model

As discussed, a well-fitting parametric model makes the estimation of the survival function much simpler, since it involves far fewer parameters. To test this, the same set of selected covariates is used to fit the parametric regression. In this model, the log-transformed time ($Y = \ln T$) is determined by the set of covariates plus an error term ($W$) that follows a certain distribution. Its formula is

$Y = \mu + \gamma' Z + \sigma W$.

Therefore, its survival function is

$S(t \mid Z) = P\left(W > \frac{\ln t - \mu - \gamma' Z}{\sigma}\right)$.

The output is shown in Table 20. The estimated log-transformed time to event is therefore significantly influenced by the covariates, as the formula shows:

Table 20 AFT: Analysis of Parameter Estimates

To test the overall model fit, the Cox-Snell residual plot is generated. If the model is correct, meaning that the estimates $b$ are close to the true $\beta$, the Cox-Snell residuals $r_j$ should look like a censored sample from a unit exponential distribution, and the plot of $\hat{H}_r(r_j)$ versus $r_j$ should be approximately a straight line through the origin with slope 1. In the current study, the line is close to straight but not perfectly so, which indicates that the model could be improved further. The martingale residual plot is produced with the residuals against time, and overall it is smooth over time. The Schoenfeld residual is computed for each covariate and each individual. If the proportional hazards assumption holds, the probability that individual $l$ fails at time $t_i$, given that there is one death at $t_i$, is:

$p_l(t_i) = \frac{\exp(b' Z_l)}{\sum_{m \in R(t_i)} \exp(b' Z_m)}$,

where $R(t_i)$ is the set of individuals at risk at time $t_i$. The Schoenfeld residual is defined as:

$\hat{r}_{ij} = Z_{ij} - \sum_{l \in R(t_i)} Z_{lj}\, p_l(t_i)$.

Therefore, for each individual i there is a set of values, one for each covariate j included in the PH model. If the PH assumption is correct, the Schoenfeld residuals are uncorrelated with mean 0. The Schoenfeld residual plot is very smooth and is concentrated around the baseline. The deviance residual plot is used to check the fit of individual observations and is useful for finding outliers. The deviance residual is defined as:

$D_j = \mathrm{sign}(\hat{M}_j)\left\{-2\left[\hat{M}_j + \delta_j \ln(\delta_j - \hat{M}_j)\right]\right\}^{1/2}$.

A considerable number of observations appear as outliers in the model. Therefore, the model needs further improvement.
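The relationship between the martingale and deviance residuals can be sketched in Python. The sketch uses the standard definitions, with the martingale residual equal to the event indicator minus the Cox-Snell residual; the input pairs are hypothetical and the function names are invented for this example.

```python
import math

def martingale_residual(delta, cox_snell):
    """Martingale residual M = delta - r, with r the Cox-Snell residual."""
    return delta - cox_snell

def deviance_residual(delta, cox_snell):
    """Deviance residual D = sign(M) * sqrt(-2 * [M + delta * ln(delta - M)])."""
    m = martingale_residual(delta, cox_snell)
    # delta - M equals the Cox-Snell residual; the log term vanishes when delta = 0.
    log_term = delta * math.log(cox_snell) if delta == 1 else 0.0
    inside = max(0.0, -2.0 * (m + log_term))  # guard against tiny negative round-off
    return math.copysign(math.sqrt(inside), m)

# Hypothetical (event indicator, Cox-Snell residual) pairs, for illustration only.
examples = [(1, 0.2), (1, 1.0), (1, 2.5), (0, 0.8)]
residuals = [deviance_residual(d, r) for d, r in examples]
```

Martingale residuals are bounded above by 1 and are strongly skewed. The deviance transformation makes them more symmetric around 0, which is why large absolute deviance residuals are a convenient flag for outliers.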

Figure 4

There are two ways to improve the model fit. The first is to add more data or include more covariates, but this may be too costly and is sometimes not feasible. The second is to optimize the model by fitting different transformed forms of the covariates. To test whether the forms of the covariates in the current model are the best, martingale residual plots against the respective covariates are generated (Figure 5). The lines appear smooth, but the forms used in the model are not ideal for the majority of the covariates, such as totchol, sysbp, cigpday, bmi and sex_age.

Figure 5

The model fit is further checked by Q-Q plots (Figure 6). A good fit means that the plot is a straight line through the origin with slope 1. According to the plots, sex fits the model very well, whereas diabetes does not fit the model well at all, and educ fits the model well only for survival times below 6000 days. Therefore, the diabetes covariate needs to be optimized in some way. Since sysbp is a continuous variable, it is dichotomized in the figure at a cut point of 140 mmHg to distinguish high from normal blood pressure. Its form is not ideal either.

To determine which model fits the data best, the log likelihoods of four popular distributions for estimating the survival curve are obtained from SAS, and their respective AIC values are calculated from the log likelihoods (Table 21). The results indicate that the Weibull distribution fits the data best, since its AIC value is the smallest.
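The AIC comparison can be sketched as follows, using the standard definition AIC = -2 log L + 2k. The distribution names, log likelihoods and parameter counts below are hypothetical placeholders, not the values reported in Table 21.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: AIC = -2 * logL + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * n_parameters

# Hypothetical fits: (log likelihood, number of estimated parameters).
# These values are placeholders and are not the results in Table 21.
fits = {
    "exponential": (-1520.0, 9),
    "weibull": (-1490.0, 10),
    "lognormal": (-1498.0, 10),
    "loglogistic": (-1495.0, 10),
}

aic_values = {name: aic(logl, k) for name, (logl, k) in fits.items()}
best = min(aic_values, key=aic_values.get)
```

The penalty term 2k is what allows models with different numbers of parameters, such as the exponential and the Weibull, to be compared on the same scale.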

Figure 6

Discussion

Cardiovascular diseases are the leading causes of death worldwide. Much effort has gone into understanding how these diseases arise. Thus far, many factors have been found to be associated with the onset or progression of the diseases [11]. They are called risk factors and include sex, smoking status, blood pressure, blood total cholesterol and diabetes [11, 12]. To better understand the effects of individual risk factors on the onset or progression of the diseases, many scoring systems or models have been built, such as the Framingham Scoring System, the European Systematic Coronary Risk Evaluation algorithm (SCORE), and the Prospective Cardiovascular Munster (PROCAM) model. In general, these systems assign values to individual risk factors and add them up for each subject [30, 31, 32, 34]. The higher a subject's score, the greater the risk of cardiovascular disease. These scores are assigned arbitrarily, and the scoring systems are built on specific populations. Therefore, large variations are usually observed when the same system or model is applied across populations. Even within the same population, much variation is observed between cohorts [31, 36, 39]. Thus far, most models have not been cross-validated with different cohorts. Therefore, no system or model is yet applicable to a general population.

Table 21 Comparisons among models with different distributions

Our current investigation attempts to build a predictive model from a subset of the Framingham Heart Study dataset using survival analysis. Survival analysis has been widely used in various fields, such as biomedicine, engineering, and social science, and it is specifically designed for investigating survival data. Two features of such data, censoring and non-normality, make them very difficult to analyze with traditional statistical models or methods. Non-normality violates the normality assumption of the most commonly used statistical models, such as regression and ANOVA. Censoring is very common in survival data; it means that the information on an observation is incomplete. In general, five types of censoring and truncation are observed in survival data: right truncation, left truncation, left censoring, right censoring, and interval censoring. In the current Framingham Heart Study dataset, only right censoring is observed. It can arise for several reasons, e.g. dropping out of the study, termination of the study, or death from causes unrelated to the event [52, 53, 54]. Therefore, each subject either experiences the event at some time point or is known to have survived beyond that time point (censoring).
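A minimal sketch of how right-censored observations are usually encoded, as an observed time plus an event indicator. The function name, the subject records, and the follow-up lengths are hypothetical and are not taken from the Framingham data.

```python
from typing import List, Optional, Tuple


def right_censor(event_day: Optional[float], last_followup_day: float) -> Tuple[float, int]:
    """Return (observed_time, event_indicator) for one subject.

    event_day is None when no event was ever recorded for the subject.
    event_indicator is 1 for an observed event and 0 for right censoring.
    """
    if event_day is not None and event_day <= last_followup_day:
        return event_day, 1
    # Dropout, end of study, or unrelated death: we only know T > last_followup_day.
    return last_followup_day, 0


# Hypothetical subjects: (day of event or None, last day of follow-up).
subjects: List[Tuple[Optional[float], float]] = [
    (1200.0, 8000.0),  # event observed during follow-up
    (None, 8000.0),    # event-free at the end of the study
    (None, 3500.0),    # dropped out early
    (9000.0, 8000.0),  # event after follow-up ended, so it is not observed
]

observations = [right_censor(event, last) for event, last in subjects]
print(observations)
```

A censored record still carries information: the subject is known to have been event-free up to the recorded time, which is why it stays in the risk set until then.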

Another feature of survival analysis is the hazard rate. In discrete time, the hazard rate is the probability that an individual experiences the event at time point t, given that the individual is still at risk at that time. The hazard rate is therefore an unobserved rate that must be estimated. It can be constant or vary with time [54].
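In symbols, using standard notation that is assumed here rather than taken from the thesis (T is the event time and t_j the j-th discrete time point), the discrete-time hazard and its link to the survival function can be written as:

```latex
h(t_j) = P(T = t_j \mid T \ge t_j), \qquad
S(t) = P(T > t) = \prod_{j:\, t_j \le t} \bigl(1 - h(t_j)\bigr).
```

Surviving past t requires not failing at any earlier time point, which is why the survival function is a product of the complements of the hazards.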

To build the model, univariate analysis was first conducted to test the association between each variable and the time to event [61]. Kaplan-Meier plots were generated for each categorical variable for this purpose, while the univariate tests themselves used Cox proportional hazards regression. A variable was kept for further model selection if its p-value was below 0.25. Interestingly, education appears to influence the onset or development of cardiovascular diseases. This is plausible, since people with different levels of education tend to have different lifestyles, for example in physical activity, social circles, and diet [13, 16].
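The Kaplan-Meier (product-limit) estimate behind such plots can be sketched in a few lines. This is an illustrative pure-Python version, not the SAS procedure used in the thesis, and the follow-up times and event indicators are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return [(event_time, S(event_time))] using the product-limit estimator.

    events[i] is 1 if subject i had the event at times[i], 0 if right-censored.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    removed = Counter(times)  # every subject leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(removed):
        d = deaths.get(t, 0)
        if d > 0:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Events and censored subjects at time t both leave the risk set.
        at_risk -= removed[t]
    return curve


# Hypothetical follow-up times in days and event indicators.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(f"day {t}: S = {s:.3f}")
```

Computing one such curve per level of a categorical variable and overlaying them gives the kind of plot described above. A subject censored at the same time as an event is counted as at risk for that event, which is the usual convention.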

Sex is a significant covariate as well. This agrees with common knowledge, since females are usually much less likely than males to suffer from cardiovascular diseases. When the Cox proportional hazards model was built, sex behaved strikingly differently from the other covariates, so it was treated as a stratification variable.
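One common way to write a Cox model stratified by sex is shown below. The notation is assumed for illustration (s indexes the stratum, x the remaining covariates, and β the shared coefficients); this section of the thesis does not state its exact specification.

```latex
h_s(t \mid x) = h_{0s}(t)\,\exp\!\bigl(\beta^{\top} x\bigr), \qquad s \in \{\text{male}, \text{female}\}.
```

Each stratum has its own baseline hazard h_{0s}(t), so sex itself need not satisfy the proportional hazards assumption, while the coefficients β are shared across strata.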

Other factors, such as high blood pressure, total blood cholesterol, body mass index, and age, are well-known disease-related risk factors. It is not surprising to see their positive association with disease onset and development. In other words, the higher their values, the shorter the survival time and the higher the hazard rate. These results can be seen in the parameter estimates of the semiparametric and parametric models.
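In a Cox model, a positive coefficient translates into a hazard ratio above 1, which is the sense in which higher covariate values imply a higher hazard. The sketch below shows the conversion. The coefficient value is hypothetical and is not one of the thesis's estimates.

```python
import math


def hazard_ratio(beta: float, delta: float = 1.0) -> float:
    """Hazard ratio for a change of `delta` units in a covariate with Cox coefficient `beta`."""
    return math.exp(beta * delta)


# Hypothetical coefficient for systolic blood pressure (per mmHg); not from the thesis.
beta_sysbp = 0.015

print(f"HR per 1 mmHg:  {hazard_ratio(beta_sysbp):.3f}")
print(f"HR per 10 mmHg: {hazard_ratio(beta_sysbp, 10):.3f}")
```

The ratio compounds multiplicatively, so the hazard ratio for a 10-unit change is the 1-unit ratio raised to the tenth power.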

The current model includes two time-dependent variables: age and the number of cigarettes smoked per day (cigpday). This indicates that the risk of suffering from cardiovascular diseases increases with aging. It is well known that the heart undergoes subtle physiologic changes as a person gets older, even in the absence of disease. The effects of these two variables on the diseases may therefore add to the influence of aging. Interestingly, in our current hazard model, an interaction between sex and age is observed that has a positive effect on survival time. This is understandable for sex but harder to explain for age. The effect of the interaction may therefore come mainly from sex, since a one-unit change in sex (from male to female: 0 to 1) is equivalent to a large change in age. Another possible explanation is that subjects who are still free of cardiovascular diseases at an old age are likely to have favorable genetics [14, 19].

Overall, the model fits the dataset well, although not perfectly, as can be seen from the deviance residual plots. It could still be improved by adding more data or by changing the functional form of the covariates.

References

1. "cardiovascular system" at Dorland's Medical Dictionary

2. Maton A, Hopkins J, McLaughlin CW, Johnson S, Warner MQ, LaHart D, Wright JD. (1993).

Human Biology and Health. Englewood Cliffs, New Jersey: Prentice Hall.

3. Bridget B. Kelly; Institute of Medicine; Fuster, Valentin (2010). Promoting Cardiovascular

Health in the Developing World: A Critical Challenge to Achieve Global Health. Washington,

D.C: National Academies Press.

4. Dantas AP, Jimenez-Altayo F, Vila E (2012). "Vascular aging: facts and factors". Frontiers in

Vascular Physiology 3 (325): 1–2.

5. Committee on Preventing the Global Epidemic of Cardiovascular Disease: Meeting the Challenges in Developing Countries; Board on Global Health; Institute of Medicine of the National Academies; Fuster V, Kelly BB, editors (2010). Promoting Cardiovascular Health in the Developing World: A Critical Challenge to Achieve Global Health. Washington, D.C.: National Academies Press.

6. Mendis S, Puska, P, Norrving B. (editors) (2011), Global Atlas on cardiovascular disease

prevention and control.

7. McGill HC, McMahan CA, Gidding SS (2008). "Preventing heart disease in the 21st century:

implications of the Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study".

Circulation 117 (9): 1216–27.

8. Howard BV, Wylie-Rosett J (2002). "Sugar and cardiovascular disease: A statement for healthcare professionals from the Committee on Nutrition of the Council on Nutrition, Physical Activity, and Metabolism of the American Heart Association". Circulation 106 (4): 523–7.

9. Finks SW, Airee A, Chow SL, Macaulay TE, Moranville MP, Rogers KC, Trujillo TC (2012). "Key articles of dietary interventions that influence cardiovascular mortality". Pharmacotherapy 32 (4): e54–87.

10. Mackay, Mensah, Mendis, et al. The Atlas of Heart Disease and Stroke. World Health

Organization. January 2004.

11. Vartiainen J, Puska T (1999). "Sex, age, cardiovascular risk factors, and coronary heart disease". Circulation 99: 1165–1172.

12. Jani B, Rajkumar C (2006). "Ageing and vascular ageing". Postgrad Med J 82: 357–362.

13. Ignarro LJ, Balestrieri ML, Napoli C (2007). "Nutrition, physical activity, and cardiovascular disease: an update". Cardiovascular Research 73 (2): 326–40.

14. World Heart Federation (2011). "World Heart Federation: Cardiovascular disease risk

factors".

15. The National Heart, Lung, and Blood Institute (NHLBI) (2011). "How To Prevent and Control

Coronary Heart Disease Risk Factors - NHLBI, NIH".

16. Baska T, Straka S. (1997) Epidemiology of modifiable risk factors in cardiovascular diseases. Review of present knowledge. Epidemiol Mikrobiol Imunol. 46(3):108-14.

17. Munroe PB, Barnes MR, Caulfield MJ. (2013) Advances in blood pressure genomics. Circ

Res. 112(10):1365-79.

18. Thurston RC, Rewak M, Kubzansky LD. (2013) An anxious heart: anxiety and the onset of

cardiovascular diseases. Prog Cardiovasc Dis. 55(6):524-37.

19. Shah RU, Winkleby MA, Van Horn L, Phillips LS, Eaton CB, Martin LW, Rosal MC, Manson JE, Ning H, Lloyd-Jones DM, Klein L. (2011) Education, income, and incident heart failure in post-menopausal women: the Women's Health Initiative Hormone Therapy Trials. J Am Coll Cardiol. 58(14):1457-64.

20. Jernigan VB, Duran B, Ahn D, Winkleby M. (2010) Changing patterns in health behaviors and

risk factors related to cardiovascular disease among American Indians and Alaska Natives.

Am J Public Health. 100(4):677-83.

21. Sowers JR. (2013) Diabetes mellitus and vascular disease. Hypertension. 61(5):943-7.

22. Kassab E, McFarlane SI, Sowers JR. (2001) Vascular complications in diabetes and their prevention. Vasc Med. 6(4):249-55.

23. Rubenfire M, Brook RD. (2013) HDL cholesterol and cardiovascular outcomes: what is the

evidence? Curr Cardiol Rep. 15(4):349.

24. Sarwar MS, Islam MS, Al Baker SM, Hasnat A. (2013) Resistant hypertension: underlying

causes and treatment. Drug Res 63(5):217-23.

25. Khan Z, Almeida DR, Rahim K, Belliveau MJ, Bona M, Gale J. (2013) 10-Year Framingham

risk in patients with retinal vein occlusion: a systematic review and meta-analysis. Can J

Ophthalmol. 48(1):40-45.

26. Burg MM, Edmondson D, Shimbo D, Shaffer J, Kronish IM, Whang W, Alcántara C,

Schwartz JE, Muntner P, Davidson KW. (2013) The 'perfect storm' and acute coronary

syndrome onset: do psychosocial factors play a role? Prog Cardiovasc Dis. 55(6):601-10.

27. Johnson JA, Cavallari LH. (2013) Pharmacogenetics and cardiovascular disease--

implications for personalized medicine. Pharmacol Rev. 65(3):987-1009.

28. Munroe PB, Barnes MR, Caulfield MJ. (2013) Advances in blood pressure genomics. Circ

Res. 112(10):1365-79.

29. Munroe PB, Barnes MR, Caulfield MJ. (2013) Advances in blood pressure genomics. Circ

Res. 112(10):1365-79.

30. Nyborn JA, Himali JJ, Beiser AS, Devine SA, Du Y, Kaplan E, O'Connor MK, Rinn WE,

Denison HS, Seshadri S, Wolf PA, Au R.(2013) The Framingham Heart Study clock drawing

performance: normative data from the offspring cohort. Exp Aging Res. 39(1):80-108.

31. Artac M, Dalton AR, Majeed A, Car J, Millett C. (2013) Effectiveness of a national

cardiovascular disease risk assessment program (NHS Health Check): Results after one

year. Prev Med. 57(2):129-34.

32. Zhang G, Parikh PB, Zabihi S, Brown DL. (2013) Rating the preferences for potential harms

of treatments for cardiovascular disease: a survey of community-dwelling adults. Med Decis

Making. 33(4):502-9.

33. Fu BB, Fei XL, Sekar BD, Dong MC. (2013) Research and application of heart sound

alignment and descriptor. Comput Biol Med. 43(3):211-8.

34. Gensini GG. (1983) A more meaningful scoring system for determining the severity of

coronary heart disease. Am J Cardiol. 51(3):606.

35. http://hp2010.nhlbihin.net/atpiii/calculator.asp

36. Basow DS (Ed). (2010) Estimation of cardiovascular risk in an individual patient without

known cardiovascular disease. Massachusetts Medical Society, and Wolters Kluwer

publishers, The Netherlands. 2010.

37. D'Agostino RB Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, Kannel WB.

(2008) General cardiovascular risk profile for use in primary care: the Framingham Heart

Study. Circulation. 117(6):743-53.

38. Brindle P, Emberson J, Lampe F, Walker M, Whincup P, Fahey T, Ebrahim S. (2003)

Predictive accuracy of the Framingham coronary risk score in British men: prospective cohort

study. BMJ. 327(7426):1267.

39. Liu J, Hong Y, D'Agostino RB Sr, Wu Z, Wang W, Sun J, Wilson PW, Kannel WB, Zhao D.

(2004) Predictive value for the Chinese population of the Framingham CHD risk assessment

tool compared with the Chinese Multi-Provincial Cohort Study. JAMA. 291(21):2591-9.

40. Sacco RL, Khatri M, Rundek T, Xu Q, Gardener H, Boden-Albala B, Di Tullio MR, Homma S,

Elkind MS, Paik MC. (2009) Improving global vascular risk prediction with behavioral and

anthropometric factors. The multiethnic NOMAS (Northern Manhattan Cohort Study). J Am

Coll Cardiol. 54(24):2303-11.

41. Conroy RM, Pyorala K, Fitzgerald AP, Sans S, Menotti A, De Backer G, De Bacquer D,

Ducimetiere P, Jousilahti P, Keil U, Njolstad I, Oganov RG, Thomsen T, Tunstall-Pedoe H,

Tverdal A, Wedel H, Whincup P, Wilhelmsen L, Graham IM. (2003) Estimation of ten-year

risk of fatal cardiovascular disease in Europe: the SCORE project. Eur Heart J. 24(11):987-

1003.

42. 2002 Third Report of the National Cholesterol Education Program (NCEP) Expert Panel on

Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment

Panel III) final report. Circulation. 106(25):3143-421.

43. Ford ES, Giles WH, Mokdad AH. (2004) The distribution of 10-Year risk for coronary heart

disease among US adults: findings from the National Health and Nutrition Examination

Survey III. J Am Coll Cardiol. 43(10):1791-6.

44. Schnabel RB, Sullivan LM, Levy D, Pencina MJ, Massaro JM, D'Agostino RB Sr, Newton-

Cheh C, Yamamoto JF, Magnani JW, Tadros TM, Kannel WB, Wang TJ, Ellinor PT, Wolf PA,

Vasan RS, Benjamin EJ. (2009) Development of a risk score for atrial fibrillation

(Framingham Heart Study): a community-based cohort study. Lancet. 373(9665):739-45.

45. Lloyd-Jones DM, Wang TJ, Leip EP, Larson MG, Levy D, Vasan RS, D'Agostino RB,

Massaro JM, Beiser A, Wolf PA, Benjamin EJ. (2004) Lifetime risk for development of atrial

fibrillation: the Framingham Heart Study. Circulation. 110(9):1042-6.

46. http://cvrisk.mvm.ed.ac.uk/help.htm

47. http://en.wikipedia.org/wiki/Framingham_Heart_Study

48. Siontis GC, Tzoulaki I, Siontis KC, Ioannidis JP. (2012) Comparisons of established risk

prediction models for cardiovascular disease: systematic review. BMJ. 344

49. See R, de Lemos JA. (2006) Current status of risk stratification methods in acute coronary

syndromes. Curr Cardiol Rep. 8(4):282-8.

50. Bewick V, Cheek L, Ball J. (2004) Statistics review 12: Survival analysis. Critical Care. 8:389-394.

51. Abbring JH and van den Berg GJ. (2003), "The identifiability of the mixed proportional hazards competing risks model", Journal of the Royal Statistical Society (B), 65, 701-710.

52. Allison P. (1995), Survival Analysis Using the SAS System: A Practical Guide, SAS Institute, Cary, NC.

53. Cox D. and Oakes D. (1984), Analysis of Survival Data, Chapman and Hall, London.

54. Hosmer DW and Lemeshow S. (1999), Applied Survival Analysis, Wiley, New York.

55. Kalbfleisch JD and Prentice RL. (1980), The Statistical Analysis of Failure Time Data, John Wiley, New York.

56. Narendranathan W and Stewart M. (1991), "Simple methods for testing the proportionality of cause-specific hazards in competing risks models", Oxford Bulletin of Economics and Statistics 53, 331-340.

57. Prentice RL and Gloeckler LA. (1978), "Regression analysis of grouped survival data with application to breast cancer data", Biometrics 34, 57-67.

58. Singer JD and Willett JB. (1993), "It's about time: using discrete-time survival analysis to study duration and the timing of events", Journal of Educational Statistics 18, 155-195.

59. Singer JD and Willett JB. (2003), Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence, Oxford University Press, Oxford.

60. Therneau TM and Grambsch PM. (2000), Modeling Survival Data: Extending the Cox Model,

Springer, New York.

61. http://www.ats.ucla.edu/stat/sas/seminars/sas_survival/#Background

62. Klein JP, Moeschberger ML. (1997) Survival Analysis: Techniques for Censored and Truncated Data. Springer-Verlag, New York.

63. Nelson W, et al. (1972) Theory and applications of hazard plotting for censored failure data. Technometrics.

64. Klein JP. (1991) Small sample moments of some estimators of the variance of the Kaplan-Meier and Nelson-Aalen estimators. Scandinavian Journal of Statistics 18:333-340.

65. Kaplan EL, Meier P. (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53:457-481.

66. Breslow NE. (1975) Analysis of survival data under the proportional hazards model. International Statistical Review 43:45-58.

67. Aalen OO. (1978) Nonparametric estimation of partial transition probabilities in multiple decrement models. Annals of Statistics 6:534-545.

68. Aalen OO. (1980) A model for non-parametric regression analysis of counting processes. In Lecture Notes on Mathematical Statistics and Probability, Springer-Verlag, 1-25.

69. Akaike H. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium of Information Theory and Control, 267-281.

70. Cox DR. (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society B34:187-220.

71. Cox DR, Snell EJ. (1968) A general definition of residuals (with discussion). Journal of the Royal Statistical Society B30:248-275.

72. Efron B. (1967) The two sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 4:831-853.

73. Fleming TR, Harrington DP, O'Sullivan M. (1987) Supremum versions of the log-rank and generalized Wilcoxon statistics. Journal of the American Statistical Association 82:312-320.

74. Fleming TR, et al. (1980) Modified Kolmogorov-Smirnov test procedures with application to arbitrarily right censored data. Biometrics 36:607-626.

75. Gehan EA. (1965) A generalized Wilcoxon test for comparing arbitrarily singly censored samples. Biometrika 52:203-223.

76. Kannel WB, et al. (1987) Heart rate and cardiovascular mortality: The Framingham study. American Heart Journal 114(6):1489-1494.