
Modeling the Progression Through Time of Longitudinal Medical Data

John Wu, Paul Chavy-Waddy, Mindy Case

August 9, 2013

Abstract

Predicting future medical conditions, including death, is an intriguing task. The UCLA REU Social Networks and Large Data Sets group proposes various methods to make such predictions. Each method produces new information based on previously recorded observations. These methods and strategies are discussed, as well as the results obtained throughout the project.

1 Introduction

As part of the UCLA REU social networks and large data sets group, we have been working with medical data and implementing various methods on this data to predict medical conditions of present and possible future patients. The data given to us is a longitudinal data set which records the status of numerous patients over time until the study is completed or the patient dies and drops out of the study. Throughout the beginning stages of our research, we noticed a few complications with our data set, but did what we could to work around them. The purpose of our research includes clustering patients and computing the probability of disease development. To do this we used various methods, including the following: Term Frequency-Inverse Document Frequency, Matrix Completion, Spectral Clustering, Principal Component Analysis, Classification and Regression Trees, Least Squares, and the Frailty Index and Biological Age.

1.1 Data

The data given to us is a longitudinal data set (data collected over time) from a Yale geriatric study, in the form of a matrix. Each row of the matrix represents one of the 754 patients at a particular time step, and the columns give information about the patient such as their study ID number, gender, age, visit return number, and 34 different symptoms, ranging from diseases such as cancer and diabetes to individual symptoms such as feeling lonely and grip strength. The patients range in age from 70 to 106 years old, and there are about eighteen months between each visit to the doctor, for a total study length of approximately 13 to 14 years.

In the matrix columns that correspond to the different symptoms and diseases, a patient who has that symptom at that visit receives a 1 in that entry, and otherwise receives a 0. Thus the data that is most useful to our research becomes a binary set, which can add some complications.

There are a few other difficulties with this data set which we hope to address in the future. Medical data is by nature very noisy, meaning it is subject to an extreme degree of randomness because the majority of relevant variables are not recorded. This makes it very difficult to predict individual outcomes. Due to death and possibly other unknown factors, not all patients finish the study at the same time; thus, the data is an unbalanced longitudinal set. There were also many missing data entries, which leaves the data set incomplete, and deleting the patients with missing entries may lead to statistical bias.

1.2 Purpose

The purpose of our research can be summarized by the following:

1. We want to cluster patients such that the clusters explain the largest variation possible between the mean outcomes.

2. Given the prognosis at a given time, we want to compute the probability of developing various diseases.

3. Determine the time at which the patient will develop those diseases and/or die.

4. Mathematically measure the degree of accuracy of our predictions.

Outline The remainder of this article is structured in the following way. Section 2 gives a breakdown of all the methods used during our research, along with the results obtained from each method. Possibilities for future work and improvement are presented in Section 3. Finally, Section 4 gives the conclusion.

2 Data Analysis

2.1 Matrix Completion

To address the issue of missing data entries, we used a technique called matrix completion. According to Emmanuel J. Candes and Benjamin Recht [4], given a matrix with missing entries, we can use the information from the known entries to estimate the values of the missing ones. What we obtain through matrix completion is a new matrix consisting of estimated values for the missing entries of the original matrix. To optimize the accuracy of the estimates, we wish to minimize the rank of this matrix; equivalently, we may minimize the nuclear norm of the matrix,

\|M\|_* = \sum_{i=1}^{n} \alpha_i(M).

The nuclear norm is the sum of all singular values of the matrix [4]. In the equation above, \alpha_i(M) is the ith largest singular value of M. Minimizing this norm reduces the rank of the new matrix, which means that our estimated data is obtained strictly through analysis of the known data. We hope that doing this will produce sufficiently accurate entries.
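As a concrete illustration, the following sketch implements matrix completion by singular value thresholding, one standard algorithm for the nuclear-norm minimization described above. It is a generic reimplementation rather than the code used in the project; the function name, the shrinkage parameter tau, the step size, and the toy data are our own choices for the example.

import numpy as np

def svt_complete(M, observed, tau=5.0, step=1.2, iters=200):
    """Estimate the missing entries of M; `observed` is a boolean mask."""
    Y = np.zeros_like(M, dtype=float)
    X = Y
    for _ in range(iters):
        # Shrink the singular values: the proximal step that drives the
        # nuclear norm ||X||_* down and so encourages low rank.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Enforce agreement with the known entries only.
        Y = Y + step * observed * (M - X)
    return X

# Toy usage: hide roughly 30 percent of a low-rank binary matrix, recover it.
rng = np.random.default_rng(0)
truth = np.outer(rng.integers(0, 2, 20), rng.integers(0, 2, 15)).astype(float)
mask = rng.random(truth.shape) > 0.3
completed = svt_complete(truth * mask, mask)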


Figure 1: Matrix Completion for 25 test runs

The above figure shows how we determined the accuracy of matrix completion on our dataset. We did this by running the algorithm on the binary entries of the 80-year-old patients that contain no missing entries. Given a certain number of missing entries, we wanted to determine the misclassification rate, defined as the number of entries labeled incorrectly by matrix completion divided by the total number of missing entries. We randomly selected n percent of the entries to be missing and ran the code on these entries. A single run does not necessarily give a stable misclassification rate; therefore, we ran the algorithm 25 times for each n and averaged the misclassification rate. It turns out that when between 0 and 50 percent of the entries are missing, the misclassification rate is about 20 percent; this percentage increases as the number of missing entries increases.
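A sketch of that experiment, using the svt_complete sketch above: hide a random fraction of entries of a fully observed binary matrix, complete it, round the recovered entries at 0.5 (the threshold is our assumption), and average the misclassification rate over 25 runs.

import numpy as np

def mean_misclassification(X, frac_missing, runs=25, seed=0):
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(runs):
        mask = rng.random(X.shape) > frac_missing   # True = still observed
        recovered = svt_complete(X * mask, mask)
        labels = (recovered >= 0.5).astype(float)   # round back to binary
        hidden = ~mask
        rates.append(np.mean(labels[hidden] != X[hidden]))
    return float(np.mean(rates))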

Beyond our use of it here, this technique has various applications. One example is the Netflix problem: Netflix surveyed its customers to find out how people rated the movies they rented. Since not every movie was rated by each customer, matrix completion can be used to determine which movies are most highly recommended by viewers [4].

2.2 Spectral Clustering

After obtaining our completed matrix, the first method we used was spectral clustering [7]. To implement spectral clustering, we used a metric called the Pearson product [8] to obtain our clusters.

The Pearson Product is:

S_\Phi = \frac{ad - bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}.


Here the variables are defined as follows:

a = proportion of positions where both variables have a 1
b = proportion of positions with a 1 in the first variable and a 0 in the second
c = proportion of positions with a 0 in the first variable and a 1 in the second
d = proportion of positions where both variables have a 0

Information about this metric can be found in Similarity Coefficients for Binary Data, written by Matthijs Joost Warrens [8].

Using spectral clustering, we separated patients into three different overlapping groups. One group contained patients who needed help with various activities such as eating and walking; the next group contained patients with mostly emotional defects; and the last group consisted mostly of patients with cardiovascular defects.
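A compact sketch of this pipeline: a Pearson-product affinity matrix on the binary symptom vectors, followed by the normalized spectral clustering of Ng, Jordan, and Weiss [7]. The three-cluster choice mirrors the text; the function names and the small numerical guards are illustrative, not details from the project.

import numpy as np
from sklearn.cluster import KMeans

def pearson_phi(u, v):
    a = np.mean((u == 1) & (v == 1))   # shared 1's
    b = np.mean((u == 1) & (v == 0))
    c = np.mean((u == 0) & (v == 1))
    d = np.mean((u == 0) & (v == 0))   # shared 0's
    denom = np.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    return (a * d - b * c) / denom if denom > 0 else 0.0

def spectral_cluster(X, k=3):
    n = len(X)
    W = np.array([[pearson_phi(X[i], X[j]) for j in range(n)] for i in range(n)])
    W = np.clip(W, 0.0, None)                 # keep nonnegative affinities
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1) + 1e-12
    Dm = np.diag(d ** -0.5)
    L = Dm @ W @ Dm                           # normalized affinity D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(L)
    V = vecs[:, -k:]                          # top-k eigenvectors
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)  # NJW row-normalization
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)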

2.3 Topic Modeling

After using spectral clustering to get an idea of the appropriate number of clusters, we wanted to gain a better understanding of what type of clusters we have and explore the characteristics that some patients share. To do this, we used topic modeling on our patients; background on topic modeling can be found in [1]. With topic modeling, we represented each patient as a document that lists the diseases the patient has. Then, we ran Term Frequency-Inverse Document Frequency (TFIDF) on our data to obtain a weighted matrix. Additionally, we used a technique called Non-negative Matrix Factorization (NMF) to obtain our groupings. Below are examples of disease groupings that topic modeling helped us achieve.

Figure 2: Trial 1, ranked list of terms for three topics.

We noticed that our three topics resembled our three clusters from spectral clustering. This verified that our clusters were in fact appropriate. We then considered running topic modeling to create 5 topics. The figure below shows our results:


Figure 3: Trial 2, ranked list of terms for five topics.

We observed that three of the topics still closely resembled our original topics, but the other two did not add much meaning to our research and merely picked up the random variation of our patients. Because of this, we concluded that three topics would be sufficient.
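The TFIDF-plus-NMF step can be sketched with standard library calls as follows; the stand-in binary matrix X (patients by symptoms) and the choice of NMF initialization are assumptions of the example, not details from the project.

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import NMF

X = np.random.default_rng(0).integers(0, 2, (200, 34))   # stand-in binary data
tfidf = TfidfTransformer().fit_transform(X)              # reweight the symptoms
nmf = NMF(n_components=3, init='nndsvd', random_state=0)
doc_topic = nmf.fit_transform(tfidf)     # topic weights for each patient
topic_term = nmf.components_             # per-topic symptom loadings
top5 = np.argsort(topic_term, axis=1)[:, ::-1][:, :5]    # top 5 symptoms per topic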

In addition to NMF, we wanted to see what type of clusters we could obtain by using spectral clustering on the Topics-by-Documents matrix, which is the matrix that shows the weights of the different topics on each patient. With NMF, we used 13 topics and created the Topics-by-Documents matrix. With this matrix, we ran symmetric spectral clustering, and the following diagram shows a plot of the first 3 eigenvectors:

Figure 4: Plot of the first three eigenvectors from spectral clustering on the Topics-by-Documents matrix (13 topics)

This figure shows patient data points that are clearly grouped together; the main reason we did this is that 13 topics give us the best set of groupings of the data. From these groups, we ran k-means clustering to create two clusters, because two clusters divide the groupings nicely. We wanted to explore the qualitative properties of these clusters. Below is a table that lists how many patients in each cluster have each disease.


Figure 5: Number of Patients With Disease in Each Group

As the table points out, all the patients who have cancer are contained in one cluster. A diagram of this is plotted below:


Figure 6: Number of Patients With Cancer

2.4 Generalized Linear Mixed Effect Models

One of the main problems that we faced during this project was dealing with random variation between different individuals. Factors that affect a patient but are not explicitly expressed as a combination of the patient's covariates can create statistical biases in a standard regression model; this can be addressed with mixed-effect models. In particular, we consider a logistic linear mixed model to compute the probability of a patient having a heart attack. Suppose that i represents the ith patient and j represents the jth visit time. According to Bruin [3], a generalized linear mixed-effect model can be written in the form

g(E[y_{ij}]) = \sum_{k} \beta_k x_{kij} + \alpha_i z_{1ij},

where x_{kij} represents the kth covariate, a fixed-effect variable, meaning it is non-random and has a systematic relationship with g(y_{ij}); z_{1ij} is a random-effect variable. For instance, since all individuals are different, the random effect can be interpreted as the intercept \alpha_i: an intercept that differs from every other intercept and captures the random variation between patients and across time.

When considering the mixed-effect model, we assumed the individual-specific intercept to be the only random effect. With the R command glmer in the package lme4 [2], we created a binomial linear mixed-effect model that uses all the covariates except HeartAttack to predict the probability of a patient having a heart attack. The following is information about the model:


Random effects:
 Groups  Name        Variance Std.Dev.
 StudyID (Intercept) 0.80687  0.89826

Number of obs: 4253, groups: StudyID, 681

Fixed effects:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)          -0.23624    1.53575  -0.154  0.877747
Gender               -0.72868    0.20735  -3.514  0.000441 ***
age                  -0.03793    0.01946  -1.949  0.051253 .
HELPeating            0.99738    0.43623   2.286  0.022234 *
HELPgrooming         -0.94745    0.51897  -1.826  0.067905 .
HELPtoilet            0.40041    0.55850   0.717  0.473414
HELPstairs           -0.12154    0.30915  -0.393  0.694212
HELPlifting           0.83989    0.25450   3.300  0.000966 ***
HELPshopping         -0.34800    0.28762  -1.210  0.226309
HELPhousework        -0.15937    0.23548  -0.677  0.498536
HELPmeal             -0.92280    0.29659  -3.111  0.001863 **
HELPmed               0.06574    0.32387   0.203  0.839144
HELPfinancing        -0.10312    0.23947  -0.431  0.666731
weightloss            0.43877    0.18938   2.317  0.020509 *
healthSELFrate        0.26944    0.37166   0.725  0.468486
HEALTHchangeLASTyear  0.07748    0.20073   0.386  0.699499
WLKoutside           -0.33251    0.20384  -1.631  0.102839
lotsOFeffort          0.12034    0.20300   0.593  0.553326
FEELlonely            0.01242    0.19839   0.063  0.950086
FEELdepressed        -0.14773    0.21104  -0.700  0.483925
FEELhappy             0.04748    0.19539   0.243  0.807994
TROUBLEgoing         -0.10195    0.19907  -0.512  0.608555
HIGHbloodPRESSURE     1.23412    0.19427   6.353  2.12e-10 ***
CHF                   1.98422    0.23160   8.567  < 2e-16 ***
stroke                0.66436    0.27782   2.391  0.016787 *
CANCER               -0.32259    0.29747  -1.084  0.278164
diabetes              0.70476    0.22317   3.158  0.001589 **
arthritis             0.09068    0.19704   0.460  0.645367
CLD                   0.33883    0.27764   1.220  0.222316
usualWLK              0.19979    0.26694   0.748  0.454182
rapidWLK              0.34079    0.23436   1.454  0.145912
PFLOWindex           -0.08298    0.20842  -0.398  0.690535
BMIindex             -0.26793    0.21473  -1.248  0.212108
MMSEIndex             0.38172    0.25064   1.523  0.127766
SHLDstrngINDEX       -0.35344    0.20707  -1.707  0.087845 .
GRIPstrength          0.13032    0.20443   0.637  0.523830

BIC = 1727
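The project fit this model with R's glmer; as a rough Python analogue (not the authors' code), a random-intercept logistic model can be fit with statsmodels' variational-Bayes mixed GLM. The covariate subset and the stand-in DataFrame are assumptions of the example.

import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Stand-in data: one row per visit, StudyID identifying the patient.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "StudyID": np.repeat(np.arange(100), 4),
    "HeartAttack": rng.integers(0, 2, 400),
    "Gender": np.repeat(rng.integers(0, 2, 100), 4),
    "age": np.repeat(rng.integers(70, 100, 100), 4),
    "CHF": rng.integers(0, 2, 400),
    "diabetes": rng.integers(0, 2, 400),
})

vc = {"patient": "0 + C(StudyID)"}   # one random intercept per patient
model = BinomialBayesMixedGLM.from_formula(
    "HeartAttack ~ Gender + age + CHF + diabetes", vc, df)
result = model.fit_vb()              # variational Bayes, not glmer's fit
print(result.summary())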

Since we want to reduce the dimensionality of the dataset, we used principal component analysis on the 80-year-old patients. With the first 3 loadings obtained from principal component analysis, we reduced the 35 covariates to 3 variables; we used 3 loadings because 3 to 4 loadings are typically enough to reduce the dimensionality of such a dataset. With these reduced-dimension variables, along with biological age, we obtained the following model:

Random effects:
 Groups  Name        Variance Std.Dev.
 StudyID (Intercept) 0.4278   0.65406

Number of obs: 4245, groups: StudyID, 681

Fixed effects:
             Estimate   Std. Error z value  Pr(>|z|)
(Intercept)  4.1302129  0.9813827    4.209  2.57e-05 ***
Dim1        13.4175412  0.8625661   15.555   < 2e-16 ***
Dim2        -7.4283381  0.7086750  -10.482   < 2e-16 ***
Dim3        -3.7221067  0.7532619   -4.941  7.76e-07 ***
Age         -0.3936393  0.0369501  -10.653   < 2e-16 ***
CHF          1.2539549  0.2381669    5.265  1.40e-07 ***
I(Age^2)     0.0056627  0.0004181   13.544   < 2e-16 ***

BIC = 1393

From these statistics, we can see that the second model is far superior to the first: all of its variables are significant, and its BIC is much smaller. Even so, we are uncertain about the effectiveness and accuracy of these models, and we would like to explore these techniques further in the future. For more information about generalized linear mixed models, please refer to Bruin [3].

2.5 Classification and Regression Tree

We wanted a nonparametric method that determines when a patient will stop returning to the hospital, based on their first visit. We thought this technique appropriate because it was difficult to model the time of death with a linear regression. We did this by creating a regression tree, which splits on a patient's covariates to predict when the patient will die. To create a regression tree, we used the MATLAB command tree = RegressionTree.fit(X,Y). Here is the result that we produced from classification and regression trees:


Figure 7: Example of Regression Tree

From these statistics, we believe that the tree model has a considerably high MSE, mainly because µ is about the same value as the MSE; in most well-fitted models, the MSE is significantly smaller than µ. We are considering a technique called RE-EM trees, a variant of regression trees designed for longitudinal data, to compute when a patient will die. We refer the reader to De'ath and Fabricius [5] for more about regression trees.
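The text fits the tree with MATLAB's RegressionTree.fit(X,Y); an equivalent sketch with scikit-learn follows. The stand-in X holds first-visit covariates and y the number of visits until the patient stops returning; the depth limit is our choice.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, (500, 34)).astype(float)   # stand-in covariates
y = rng.integers(1, 10, 500).astype(float)        # stand-in visits-to-dropout
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
pred = tree.predict(X[:5])                        # predicted visits-to-dropout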

2.6 Least Squares

Our next approach was to construct a transition matrix by applying the least squares method to information extracted from our original data matrix. Given data about our patients at visit x and visit x + 1, we can set up the following equation to obtain our transition matrix:

T_1 \cdot \mathrm{Visit}_x = \mathrm{Visit}_{x+1}.

Performing least squares on this equation for 3 different time steps gave us three different transition matrices T_1, T_2, and T_3. Applying these three transition matrices to a matrix of information about all the 80-year-olds in our study allows us to predict the probability of their medical conditions at roughly age 84 to 85. The equation looks like the following:

T_3 \, T_2 \, T_1 \cdot \mathrm{Visit}_x(\text{age } 80) = \mathrm{Visit}_{x+3}(\text{age } 84\text{ to }85).

After testing this approach on half of our 80-year-old patients, we created the probabilistic model shown in Figure 8.
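A sketch of the transition-matrix estimation: solve T V_x ≈ V_{x+1} in the least-squares sense for consecutive visits, where each column of V_x is one patient's binary state vector (34 symptoms plus death). The orientation and variable names are our choices for the example.

import numpy as np

def fit_transition(V_x, V_next):
    # Solve V_x.T @ T.T ≈ V_next.T, i.e. T minimizing ||T V_x - V_next||_F.
    T_t, *_ = np.linalg.lstsq(V_x.T, V_next.T, rcond=None)
    return T_t.T

rng = np.random.default_rng(0)
V1, V2, V3, V4 = (rng.integers(0, 2, (35, 300)).astype(float) for _ in range(4))
T1, T2, T3 = fit_transition(V1, V2), fit_transition(V2, V3), fit_transition(V3, V4)
pred_visit4 = T3 @ T2 @ T1 @ V1     # predicted states three visits ahead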


Figure 8: Ground Truth vs. Predicted Probabilistic Outcomes for the Training Set

The above graph compares the ground truth to the expected outcomes for the training set (compiled from a random half of the individuals with complete records at age 80) at their fourth visit thereafter; each node represents the population proportion presenting the given disorder, with the 35th node representing the proportion deceased. The red curve represents the ground truth, i.e., the true proportions of symptom occurrence at the 4th visit after being observed at age 80. The blue curve represents the anticipated outcome proportions at the fourth visit based on the distribution of disorders at age 80, according to the least squares method. The average percentage-wise error was 15 percent.

Figure 9: Ground Truth Versus Expected for Test Set


We perform cross-validation by using our fitted model to produce predicted results and contrasting these with real outcomes. The above graph compares the ground truth to the expected outcomes, cross-validated on the second half of the individuals aged 80 at some point in the study; the blue curve represents the ground truth, versus the red predicted population proportions. We observe a significantly larger average percentage-wise error of 102 percent.

Limitations There are many obvious oversights in the least squares model; notably, simple linear regression returns unbounded values on the real number line. This means that individuals or groups could return values for certain disorders which are less than 0 or greater than 1; these are difficult to interpret in the context of population proportions. Logistic binomial regression would probably be better suited for this purpose. Moreover, the large discrepancy in mean predictive error between the training fit and cross-validation suggests that the model has been subject to over-fitting; this should not come as a surprise given the large number of variables on which we have regressed. Low symptom counts at any given time step make it impossible to find the best-fitting coefficients for all 34 predictive variables with respect to that particular outcome.

The results of the least squares transition matrix method suggest that a necessary prerequisite to establishing any kind of predictive relationship is to find ways to reduce the dimensionality of the problem. Arnold Mitnitski does so in an extremely simple way, in a number of medical papers, by defining the concept of the frailty index [6].

2.7 Frailty Index and Biological Age

Considering some set of disorders and assigning a vector V with each entry either 1 if the symptom is present or 0 if it is not, Mitnitski [6] defines an individual's frailty index by the following relationship:

\mathrm{FrailtyIndex} = \frac{\sum_i V_i}{\mathrm{length}(V)}.
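A one-line sketch of this index for a binary deficit vector V:

import numpy as np

def frailty_index(V):
    return np.sum(V) / len(V)      # (sum V) / length(V)

print(frailty_index(np.array([1, 0, 1, 1, 0])))   # 0.6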

If an individual's frailty index exceeds its expected value given their age, then he is said to be frail; likewise, if his frailty index is less than its expected value given his age, then he is said to be fit [6]. While this may seem like an overly simplistic metric that ignores a large amount of information, Mitnitski has repeatedly shown that it holds a reasonable amount of predictive power. One way of establishing this, used by Mitnitski in 2002, is to transform the frailty index in a way that facilitates comparison with chronological age, i.e., defining the concept of biological age based on some frailty index [6]. Mitnitski used a set of 20 disorders, not necessarily present in our data set, that are correlated with chronological age. Here we will consider only the subset of disorders which are positively correlated with age, over all patients at every visit. There are 25 such symptoms, which we have listed below (correlation coefficients in ascending order):

• Feel Lonely

• Arthritis

• Feel Happy


• CHF

• Lots Of Effort

• Trouble Going

• Weightloss

• Help Grooming

• Help Eating

• Health Change Last Year

• Help Toilet

• MMSE Index

• Help Financing

• Help Stairs

• Help Lifting

• Help Med

• Usual Walk

• P-Flow Index

• Help Shopping

• Help Meal

• Rapid Walk

• Grip Strength

• Walk Outside

• Help House Work

• Shoulder Strength Index

Mitnitski was able to show that the frailty index is log-linearly correlated with age [6]. We are able to establish a similar relationship based on our own index. We sample our data set by randomly choosing three visits per patient with replacement, from the set of visits with no missing entries; we proceed by eliminating all duplicates in the sample. We then compute the natural logarithm of the mean frailty index at every age present in our study (70-106). The following is a scatter plot in which each point represents an age group plotted against the natural logarithm of its mean frailty index, distributed along the best-fit line as determined by simple linear regression:


Figure 10: Frailty Index and Age

\log(\mathrm{MFI}) = -4.5287 + 0.0410 \times \mathrm{Age}

R^2 = 0.9632, \quad F\text{-statistic} = 784.5138, \quad p\text{-value} < 0.0001.

Upon defining this relationship, Arnold Mitnitski suggests that we can define an individual's biological age by rearranging the regression equation to obtain age as a function of the log of the mean frailty index:

\mathrm{BiologicalAge} = \frac{\log(\mathrm{MFI}) + 4.5287}{0.0410}.

Intuitively, an individual's biological age is simply the age at which the expected value of the frailty index corresponds to the individual's frailty index. Biological age is most relevant in situations of accelerated decrepitude, when the positive relationship between time to death and chronological age breaks down (in fact, chronological age never seems to be a very good predictor of time to death among geriatric patients, as will be established later in this paper). For example, Arnold Mitnitski used biological age to regress on mean time to death for various groups of dementia and Alzheimer's patients [6], successfully showing that while there did not appear to be any clear correlation between chronological age and mean time to death, most of the variation between group means could be explained via differences in mean biological age. The data we have considered contains little information on Alzheimer's patients, but we sought to show that our own biological age could hold some predictive power in other situations of accelerated decrepitude.
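A sketch of the log-linear fit and the biological-age inversion, using the coefficients reported above; the per-age mean frailty indices here are stand-ins generated from the fitted model itself rather than the study data.

import numpy as np

ages = np.arange(70, 107)
mean_fi = np.exp(-4.5287 + 0.0410 * ages)    # stand-in: exact model values

slope, intercept = np.polyfit(ages, np.log(mean_fi), 1)   # log-linear fit

def biological_age(fi):
    # Invert log(MFI) = intercept + slope * Age.
    return (np.log(fi) - intercept) / slope

print(biological_age(0.25))   # age at which the expected frailty index is 0.25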

Consider the following subset of inter-correlated symptoms in our data set: high blood pressure, congestive heart failure, diabetes, CLD, and BMI. We randomly select one visit per patient and cluster the sample into 6 groups: those with zero symptoms from the list, through to all 5 symptoms. Dropping the last cluster as an outlier due to low counts, we can run a linear regression of mean biological age as a function of mean time to death.

Figure 11: Biological Age Regression

The blue crosses, distributed along the regression line, represent mean biological age per group versus mean number of visits until death. Likewise, the corresponding green bubbles indicate mean chronological age. If anything, the relationship between mean chronological age and time to death appears to be negative. This could be related to the fact that the accumulation of symptoms in this set can largely be attributed to lifestyle; we expect individuals with higher degrees of cardiovascular damage to die earlier on average and to have higher hazard functions. The result of this phenomenon would be for individuals from the unhealthiest groups to be both younger on average and at higher risk of imminent death.

The main action of the frailty index, and of the similar concept of biological age, is to greatly lower the dimensionality of hazard models; as such, it ignores much of the inherent noise we would expect to observe in such a data set. Presumably, there is a natural threshold on the amount of information that is useful to include in a model attempting to predict death, beyond which we begin to overfit. Whatever metric utilizes the optimal amount of information should perform best under cross-validation.

We will consider two methods to reduce the dimensionality of the data set: a variation on spectral clustering, as well as principal component analysis. A common pitfall of regression is to use too many correlated explanatory variables. Considering the correlation coefficient matrix for all symptoms over all entries in complete rows, running the k-means algorithm over the rows of its eigenvectors gives us the k groups of symptoms within which correlation is strongest. We suggest making k indices based on those clusters, defining each index from the binary vector V_k of clustered symptoms by the following equation:

\mathrm{Index}_k = \frac{\sum_i (V_k)_i}{\mathrm{length}(V_k)}.

This method can be viewed as a sort of generalization of the frailty index: note that k = 1 would correspond to Arnold Mitnitski's frailty index computed from the set of all 34 symptoms.

Given that we are attempting to reduce the dimensionality of the data matrix, a natural MATLAB command to use in this case is [A,B,C] = svd(data), i.e., principal component analysis. The command returns the diagonal matrix B and unitary matrices A and C such that multiplying A, B, and the transpose of C yields the original matrix.

The values of B are given in decreasing order; setting the last n - k columns to zero and plugging B back into the previous equation gives the best rank-k approximation of the data matrix. The following is a plot of the magnitudes of the diagonal entries of B, showing diminishing returns in information when including more than a small number of vectors.

Figure 12: Plot of Singular Values

In order to create k vectors of information, we will multiply A by the first k columns ofB. This gives us k variables per individual on which we can regress.
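A sketch of the rank-k reduction just described, mirroring the MATLAB [A,B,C] = svd(data) call: the first k columns of A*B give k score variables per individual. The data matrix here is a stand-in.

import numpy as np

data = np.random.default_rng(0).integers(0, 2, (400, 34)).astype(float)
A, b, Ct = np.linalg.svd(data, full_matrices=False)   # b holds singular values
k = 3
scores = A[:, :k] * b[:k]                  # k regression variables per individual
low_rank = scores @ Ct[:k, :]              # best rank-k approximation of data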

Our comparisons concentrate on the ability of each of these methods to anticipate time to death. To do so we must first address some issues in modeling time to death in our data set. To facilitate sampling, we eliminate all patients who have incomplete records. The information on time to death is interval-censored; we only know the number of a patient's last visit, giving us an 18-month window during which they must have died. However, if a patient survives all 10 visits, then his information is right-censored. Likewise, we have no information for any patient at the seventh visit; thus, if a patient drops out at the sixth visit, we have a 3-year window during which they could have died.

To deal with these challenges, we relax the problem by classifying the patients into two groups: those who die within the median number of visits (3), versus all others. We construct a binary vector, which we call the Death Classification vector, defining its entries as 1 if time-to-death is less than 4 and 0 otherwise. Seeing as we are dealing with binary classification, a presumably appropriate regression method is binomial regression with the logit link function. Whether or not this is the optimal binary classifier, we hope that by standardizing our method we can draw reasonable comparisons between the different predictor variables discussed above.

Samples are drawn as follows: for each patient with complete records, we sample a single visit at random and compute the number of visits until he drops out, with respect to the sampled visit. We define this vector as the time-to-death vector. If time-to-death is less than 4 but the patient survives the study, then we must cast him out. Likewise, if the time-to-death is 3 but the patient survives 6 visits in the full data, then we cast him out.

One may argue that the redundancy of the sampling method will lead to statistical bias. Given that we have already selected one visit per candidate, there are some 250-choose-125 ways to partition the sample into two equal parts for model fitting and cross-validation, and we cannot hope to obtain the true average purity scores for each method. We do not claim to be drawing any conclusions with regards to which model would perform best over any population outside our data set, but only hope to evaluate which model would perform best on average within the context of our data set.

We then select half of the patients in the sample at random; after fitting the binomial regression model, we check how each model compares to the others by cross-validating on the other half of the patients. An assigned value greater than 0.5 corresponds to that patient most likely dying prior to the 4th visit; we therefore assign him a 1 in that case and a 0 otherwise, effectively constructing an estimated death classification vector. We can thereafter assign purity scores to each model, which we define to be the total proportion of patients sorted into the correct group. This random sampling method is repeated 500 times per model, so that average purity scores can be used to draw comparisons between the different models. On average, fifty-seven percent of the sampled individuals die before the fourth visit. The expected mean purity score for a randomly generated binary vector, under the constraint that it preserves the correct population proportions, is:

0.57^2 + (1 - 0.57)^2 \approx 0.51.
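A sketch of the evaluation loop just described, with scikit-learn's logistic regression standing in for the binomial regression; X and death are placeholders for the sampled predictors and the Death Classification vector.

import numpy as np
from sklearn.linear_model import LogisticRegression

def mean_purity(X, death, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    n, scores = len(death), []
    for _ in range(reps):
        idx = rng.permutation(n)
        train, test = idx[: n // 2], idx[n // 2:]
        clf = LogisticRegression(max_iter=1000).fit(X[train], death[train])
        pred = clf.predict_proba(X[test])[:, 1] > 0.5   # 1 = dies before visit 4
        scores.append(np.mean(pred == (death[test] == 1)))
    return float(np.mean(scores))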

Results For the sake of comparison, we first calculate average purity for the model utilizing chronological age and sex as the sole predictor variables. This model yielded an average purity under cross-validation of 62.15 percent.

By comparison, Arnold Mitnitski's frailty index, computed over all 34 disorders, yields an average purity of 66.57 percent.

Matrix singular value decomposition provides an increase in model sophistication over the frailty index. Given that we use the model exactly as discussed above, we face the choice of how many vectors of information to use as predictors. To check what amount of information provides optimal purity scores for PCA, we compute average purity (n = 500 cross-validations) for model dimensionality 1 through 34. As described above, one visit per individual is selected at random, after which a binomial regression is fit to a random half of the data points; the model is cross-validated on the other half of the patients and the process repeated. Singular value decomposition is performed over a matrix of all sampled patients for all 34 symptoms, and includes information on sex and age. We observe a sharp increase in mean purity at low rank, followed by diminishing returns and then negative returns at high rank. It should be noted that there is no rank at which average purity exceeds that given by Arnold Mitnitski's frailty index; we have no evidence suggesting that a model based on singular value decomposition would perform any better than the much simpler model based on the overall frailty index.

Figure 13: Purity for PCA as a Function of Rank

Similarly, it is unclear what the optimal number of clusters is for a model regressing on multiple unweighted indices. Conventionally, spectral clustering algorithms run the k-means algorithm with k clusters over the rows of k eigenvectors of information, as in the model suggested by Ng, Jordan, and Weiss [7]; to find which number of clusters could be optimal, we first adhere to that convention and compute average purity for 1 through 10 indices. Information on chronological age and sex is included as separate predictor variables in our regression. Clearly, we observe a steep drop-off in average purity as we increase the number of clusters past 1. This appears to suggest that no number of clustered indices will provide us with any more relevant information than is already provided by Mitnitski's frailty index computed over all 34 symptoms.


Figure 14: Average purity scores for k clusters

Consider only the two eigenvectors corresponding to the two largest eigenvalues of the correlation coefficient matrix. These should contain the most relevant information with respect to linear correlation between symptoms. By visual inspection of the following scatterplot representing the rows of these vectors, it would appear that the k-means algorithm should yield relatively stable results for k = 4 clusters. Running the k-means algorithm with 4 clusters on the rows of the first two eigenvectors (each row corresponding to a single symptom) consistently produces the same clusters.

Figure 15: Scatterplot of the First Two Eigenvectors of the Correlation Coefficient Matrix

After obtaining four indices by performing k-means with k = 4 over the first 2 eigenvectors, we regress, including age and sex as separate predictor vectors. Averaging over 500 cross-validations yields a 0.6749 purity score. This corresponds to a less than one percent increase in purity over Arnold Mitnitski's frailty index. However, the difference in mean purity is statistically significant at the 0.05 level (Student's t-test), suggesting that the use of multiple frailty indices compiled from inter-correlated groups of disorders could be an improvement over the overall frailty index. Excluding information on age and sex yields an average purity score of 0.6682; this difference is also statistically significant at the 0.05 level according to Student's t-test.

The above model merits deeper analysis of its explanatory variables. After computing a single regression model over all individuals with complete histories at a random visit, we observe that all explanatory variables are significant at the 0.05 level in the presence of all other variables, with the exception of index 4 (corresponding to emotional defects) and chronological age (p-values of 0.4116 and 0.2314, respectively).

We can use an analysis of deviance to check whether the full model is significantly better than the nested model that excludes those terms: referring to the chi-squared distribution with 2 degrees of freedom, we obtain a p-value of 0.3506 and conclude that we have no such evidence.
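The deviance check can be reproduced mechanically, as in the sketch below; the deviance values are placeholders, since the fitted deviances are not reported in the text.

from scipy.stats import chi2

dev_nested, dev_full = 1250.0, 1247.9            # hypothetical deviances
p_value = chi2.sf(dev_nested - dev_full, df=2)   # survival function = 1 - cdf
print(p_value)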

3 Other Future Work and Improvements

We believe our work can still be improved upon. One suggested improvement is to create more graphics from the dataset. Traditionally, when analyzing longitudinal data, graphical plots are useful for making certain inferences. When continuing this project, we suggest making plots such as trellis plots to visualize relationships between certain variables, such as heart attack and return visits. The difficulty in making graphical plots for our dataset, however, is that the variables are binary, and it may therefore be difficult to detect meaningful relationships between variables.

Additionally, we did not use a ground truth when applying the different clustering techniques. We believe it would be interesting to cluster patients by the number of visits that they make to the hospital. To start, we could use the Hausdorff distance to cluster patients based on their past records.

Also, we would like to explore more techniques for dealing with missing data. Longitudinal data with missing entries is a problem widely discussed in the statistics literature. A technique used specifically to fill in missing entries in a longitudinal dataset is multiple partial imputation (Li et al.), which creates multiple filled-in copies of a data matrix using Markov chain Monte Carlo techniques. Currently, there has been no work comparing the effectiveness of matrix completion and multiple imputation, and we believe a possible publication could come from comparing the two techniques or variants of them.

We also want to use different models to compute the probability of death. In this project, we assume that when a patient drops out of the study, they die; thus we are determining the probability that a subject will drop out of the study. In the dataset, we can describe the missingness of the data as follows:

Here y_{ij} represents the response variables and r_{ij} indicates their missingness status. According to Li, we assume that r follows a Random-Effects Markov Transition Model, which is a Markov chain with random effects. The transition probabilities of r are the following:


\Pr(r_{ij} = k \mid r_{i,j-1} = 0, x_{ij}, \xi_i) =
\begin{cases}
\dfrac{1}{1 + \sum_{l=1}^{2} e^{x_{ij}\eta_l + \xi_i\gamma_l}} & \text{if } k = 0,\\[6pt]
\dfrac{e^{x_{ij}\eta_k + \xi_i\gamma_k}}{1 + \sum_{l=1}^{2} e^{x_{ij}\eta_l + \xi_i\gamma_l}} & \text{if } k = 1 \text{ or } 2,
\end{cases}

\Pr(r_{ij} = k \mid r_{i,j-1} = 1, x_{ij}, \xi_i) =
\begin{cases}
\dfrac{1}{1 + e^{x_{ij}\eta_1 + \xi_i\gamma_1}} & \text{if } k = 0,\\[6pt]
\dfrac{e^{x_{ij}\eta_1 + \xi_i\gamma_1}}{1 + e^{x_{ij}\eta_1 + \xi_i\gamma_1}} & \text{if } k = 1.
\end{cases}

To find the parameters of this multinomial model, we assume that y_{ij} also follows a Markov chain, governed by the equations

\operatorname{logit}(\Pr(y_{ij} = 0 \mid y_{i,j-1} = 1, x_{ij}, \xi_i)) = x_{ij}\beta_{01} + \xi_i,
\operatorname{logit}(\Pr(y_{ij} = 1 \mid y_{i,j-1} = 0, x_{ij}, \xi_i)) = x_{ij}\beta_{10} + \nu\xi_i.

With Bayes' rule, we find that the likelihood function for the parameters \theta = (\beta_{01}, \beta_{10}, \nu, \sigma) and \phi = (\eta_1, \eta_2, \gamma_1, \gamma_2) is expressed as

L(\theta, \phi) \propto \int \prod_{i=1}^{n} \Big( \prod_{j=1}^{T_i} \Pr(y_{ij} \mid y_{i,j-1}, x_{ij}, \xi_i, \theta) \prod_{j=1}^{J} \Pr(r_{ij} \mid r_{i,j-1}, x_{ij}, \xi_i, \phi) \, \Pr(\xi_i) \Big) \, d\xi_i,

where \Pr(\xi_i) is the pdf of N(0, \sigma^2). We maximize this likelihood to find appropriate values for the parameters. Ultimately, this will help us find the right parameters for our Markov transition model for death.

4 Conclusion

Although we appear to have made little concrete progress, we hope that future expansion on our ideas could lead to favorable results. Notably, there is a possibility that one can improve upon the concept of the frailty index. Here we considered one method, dividing it into smaller subsets, but finding a systematic and productive way of weighting its components may also be of interest. The frailty index assumes equal relevance of all of its components; this is a strong assumption, and dependent on what the author seeks to predict. Unfortunately, this data set is too small and incomplete for us to make truly relevant conclusions about our methodology.

5 Acknowledgements

• Dr. Blake Hunter and Dr. Theodore Kolokolnikov for useful advice and guidance

• Arnold Mitnitski

• UCLA Department of Mathematics

• BUGS and PIC Lab

• Dr. Bertozzi for organizing a research program

References

[1] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models: going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1-10. IEEE, 2012.


[2] Douglas Bates, Martin Maechler, and Ben Bolker. lme4: Linear mixed-effects models using S4 classes, 2012.

[3] J. Bruin. newtest: command to compute new test @ONLINE, February 2011.

[4] Emmanuel J. Candes and Benjamin Recht. Exact matrix completion via convex optimization, May 2008.

[5] Glenn De'ath and Katharina E. Fabricius. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81(11):3178-3192, 2000.

[6] Arnold B Mitnitski, Jane E Graham, Alexander J Mogilner, and Kenneth Rockwood.Frailty, fitness and late-life mortality in relation to chronological and biological age,February 2002.

[7] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.

[8] Matthijs Warrens. Similarity Coefficients for Binary Data. PhD thesis, Universiteit Leiden, 2008.
