Prediction and Informative Risk Factor Selection of Bone Diseases · 2014. 6. 18.


Prediction and Informative Risk Factor Selection of Bone Diseases

Hui Li, Xiaoyi Li, Murali Ramanathan, and Aidong Zhang, Fellow, IEEE

Abstract—With the booming healthcare industry and the overwhelming amount of electronic health records (EHRs) shared by healthcare institutions and practitioners, we take advantage of EHR data to develop an effective disease risk management model that not only models the progression of a disease but also predicts its risk for early disease control or prevention. Existing models for answering these questions usually fall into two categories: expert-knowledge-based models and handcrafted-feature-based models. To fully utilize the whole of the EHR data, we build a framework that constructs an integrated representation of features from all available risk factors in the EHR data and uses these integrated features to effectively predict osteoporosis and bone fractures. We also develop a framework for informative risk factor selection for bone diseases. A pair of models for two contrast cohorts (e.g., diseased patients vs. non-diseased patients) is established to discriminate their characteristics and find the most informative risk factors. Empirical results on a real bone disease data set show that the proposed framework can successfully predict bone diseases and select informative risk factors that are useful for guiding clinical decisions.

Index Terms—Electronic health records (EHRs); risk factor analysis; integrated feature extraction; risk factor selection; disease memory; osteoporosis; bone fracture.


1 INTRODUCTION

Risk factor (RF) analysis based on patients' electronic health records (EHRs) is a crucial task of epidemiology and public health. Usually, variables in EHR data are treated as numerous potential risk factors (RFs) that must be considered simultaneously to assess disease determinants and predict the progression of a disease, for the purpose of disease control or prevention. More importantly, some common diseases may be clinically silent but can cause significant mortality and morbidity after onset. Unless prevented or treated early, these diseases degrade quality of life and increase the burden of healthcare costs. With successful RF analysis and disease prediction based on an intelligent computational model, unnecessary tests can be avoided. The resulting information can assist in evaluating the risk of disease occurrence, monitoring disease progression, and facilitating early prevention measures. In this paper, we focus on the prediction of osteoporosis and bone fracture.

Over the past few decades, osteoporosis has been recognized as an established and well-defined disease that affects more than 75 million people in the United States, Europe and Japan, and it causes more than 8.9 million fractures annually worldwide [?]. It is reported

• Hui Li, Xiaoyi Li and Aidong Zhang are with the Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, 14260. E-mail: hli24, xiaoyili, azhang @buffalo.edu

• Murali Ramanathan is with the Department of Pharmaceutical Sciences, State University of New York at Buffalo, Buffalo, NY, 14260. E-mail: [email protected]

that 20-25% of people with a hip fracture are unable to return to independent living and 12-20% die within one year. In 2003, the World Health Organization (WHO) embarked on a project to integrate information on RFs and bone mineral density (BMD) to better predict the fracture risk in men and women worldwide [?]. Osteoporosis in the vertebrae can cause serious problems for women, such as bone fracture. The diagnosis of osteoporosis is usually based on the assessment of BMD measured by dual-energy X-ray absorptiometry (DXA). Unlike osteoporosis, which is measured by BMD, bone fracture risk is determined by the bone loss rate and various factors such as demographic attributes, family history, and lifestyle. Some studies have stratified their analysis of fracture risk into fast and slow bone losers; with a faster rate of bone loss, people have a higher risk of fracture [?].

Osteoporosis and bone fracture are complicated diseases. As shown in Fig. ??, they are associated with various potential RFs such as demographic attributes, patients' clinical records regarding disease diagnoses and treatments, family history, diet, and lifestyle. Different representations might entangle the different explanatory factors of variation behind various RFs and diseases. Several fundamental questions have been attracting researchers' interest in this area: for example, how can we perform feature extraction and select the integrated, significant features? Also, what are appropriate approaches for manifold feature extraction and for maintaining the real and intricate relationships between a disease and its potential RFs? A good representation has an advantage in capturing underlying factors with shared statistical strength for predicting bone diseases. A representation-learning model discovers explanatory factors behind a shared intermediate representation by combining knowledge from both the input data and the output-specific tasks. The rich interactions among numerous potential RFs, or between RFs and a disease, can complicate our final prediction tasks. Besides, the other type of question we aim to address in this paper is: what are the informative RFs within the whole list of RFs, and can patients change some modifiable RFs to delay the onset and progression of bone diseases? The proposed approach shows good properties for answering these questions.

Fig. 1: Risk factors for osteoporosis (diagram labels: demographics, diagnosis, diet, lifestyle; vertebral, hip, and wrist fracture)

Traditionally, the assessment of the relationship between a disease and a potential RF is achieved by finding statistically significant associations using regression models such as linear regression, logistic regression, Poisson regression, and Cox regression [?], [?], [?], [?]. Although these regression models are theoretically acceptable for analyzing the risk dependence of several variables, they pay little attention to the nature of the RFs and the disease itself. Sometimes fewer than ten significant RFs are fed into those models, which is not sufficient for predicting a complicated disease such as osteoporosis. Other data mining studies with this objective use association rules [?], decision trees [?] and Artificial Neural Networks (ANNs) [?]. With these methods, it is difficult to build a comprehensive model that can guide medical decision-making when a large number of potential RFs must be studied simultaneously. Usually a limited set of RFs is selected based on physicians' knowledge, since handling many features is computationally expensive. Feature selection techniques are commonly applied to choose a limited number of RFs before feeding them to a classifier. However, the feature selection problem is known to be NP-hard [?] and, more importantly, the discarded RFs might still contain valuable information. Furthermore, the performance of an ANN

Fig. 2: A Generalized Risk Factor Learning Framework (diagram components: Input; Expert Knowledge; Risk Factor Learning with RF Extraction, Prediction, and RF Selection; outputs Integrated RF, Informative RF, and Expert RF)

depends on a good setting of meta-parameters, so parameter tuning is an inevitable issue. Under these scenarios, most of these traditional data mining approaches may not be effective.

Mining the causal relationship between RFs and a specific disease has attracted considerable research attention in recent years. In [?], [?], [?], limited RFs are used to construct a Bayesian network, and those RFs are assumed conditionally independent of one another. It is worth noting that the random forest decision tree has been investigated for identifying osteoporosis cases [?]. The data in that work are processed using FRAX [?]. Although FRAX is a popular fracture risk assessment tool developed by the WHO, it may not be appropriate to directly adopt the results of this prediction tool for evaluating the validity of an algorithm, since FRAX sometimes overestimates or underestimates the fracture risk [?]. Its predictions need to be interpreted with caution and properly re-evaluated. Some hybrid data mining approaches combine classical classification methods with feature selection techniques to improve performance or reduce the computational expense on a large data set [?], but they are limited by the challenge of explaining the selected features.

The existing methods for predicting osteoporosis and bone fracture are all based on expert knowledge or handcrafted features; both approaches are time-consuming, brittle, and incomplete. To solve these problems, we propose an RF learning model that learns an abstract representation for predicting bone-related diseases and selects the most influential RFs driving disease progression, as shown in Fig. ??. In this generalized RF learning pipeline, we feed all variables of the EHR data as RFs into the Risk Factor Learning module, which includes three tasks: (1) RF extraction produces integrated features, combinations of multiple nonlinear RF transformations, with the goal of yielding more abstract and salient RF representations; (2) RF selection chooses a subset of RFs from a pool of candidates whose informativeness enables statistically significant improvements in disease prediction; and (3) Expert RFs are extracted based on domain expert knowledge to validate the performance


of both RF extraction and RF selection. For such a framework, we face three main challenges:

• The performance of the follow-up analysis is highly dependent on how well the integrated features capture the underlying causes of the disease, and on the predictive power of those integrated features. To obtain latent variables from large amounts of entangled RFs, we face the problem of learning an extraction and representation that can best disentangle the salient integrated features from the original complex data.

• It is difficult to discriminate the different roles of seemingly independent features for healthy individuals and diseased individuals. Selecting the informative RFs is beneficial for guiding clinical decisions; these informative RFs can also save physicians budget and time when predicting health conditions. Therefore, our model should answer questions such as: which RFs contribute most to the development of diseases, and how many RFs do we need to achieve good predictive performance?

• EHR data are diverse, multi-dimensional, large in size, contain missing and noisy values, and lack ground truth. These properties make existing methods inapplicable because they lack sufficient learning power and overall model expressiveness. A state-of-the-art model should be carefully designed to handle these issues simultaneously.

In this paper, we propose a novel approach for the study of bone diseases in two respects: bone disease prediction and disease RF selection according to significance. For clarity, we define a Disease Memory (DM) as a model trained on a specific group of samples, aiming to memorize the underlying characteristics of that group. In addition, we train a general model on all samples, which captures the characteristics of both diseased and non-diseased patients for predicting an unknown sample; we denote it the comprehensive disease memory (CDM) model. Our model is also trained separately on diseased samples and non-diseased samples to distinguish their different properties. The bone disease memory (BDM) is a DM model trained on diseased samples, so it memorizes only the characteristics of patients who suffer from bone diseases. Similarly, the non-disease memory (NDM) is trained on the non-diseased samples and memorizes their attributes. We train them individually because we want to find informative RFs that can distinguish diseased individuals from non-diseased ones. In other words, different DM models increase the flexibility for exploring different tasks. The DM serves as an important embedded module in our framework with the following properties. First, diseased

Fig. 3: Overview of our framework for bone health (diagram labels: Original Dataset; 672 RFs; CDM; Integrated Risk Features; 11 RFs; Phase 1, Phase 2; Task 1: bone disease prediction; Task 2: informative RF selection; BDM; NDM; Disease Samples; Non-Disease Samples; Training Process; Candidate Informative RFs; Medical Knowledge; Validate)

patients and healthy patients are modeled together to establish a CDM, which captures the salience of all RFs in a limited number of integrated features for predicting bone diseases. Second, diseased patients and healthy patients are modeled separately, based on their unique characteristics, to find the RFs that cause the disease. Third, our model is robust in the presence of missing and noisy data. Last but not least, the model does not require that all samples be labeled; instead, it can be trained in a semi-supervised fashion. These properties are achieved by our proposed model, a deep graphical model focused on bone disease. Recently, many efforts have been devoted to developing learning algorithms for deep graphical models, with impressive results obtained in various areas such as computer vision and natural language processing [?]. The intuition and extensive learning power of these models are suitable for our task. To the best of our knowledge, our method is the first work on risk factor analysis using a deep learning method that can handle high-dimensional, imbalanced data and interpret the hidden reasons behind bone diseases.

2 OVERVIEW OF OUR SYSTEM

In this section, we define our problem by showing a pipeline for the whole framework. Generally speaking, our proposed system is a two-task framework, as shown in Fig. ??. The upper component of Fig. ?? shows the roadmap for the first task: bone disease prediction based on integrated features. The bottom component of Fig. ?? shows the roadmap for the second task: informative RF selection. Given patients' information, our system can not only predict the risk of osteoporosis and bone fractures but also rank the informative RFs and explain the semantics of each RF. Each component is described as follows.

Task 1 – The Bone Disease Prediction Component. In this component, we feed the original data set to the comprehensive disease memory (CDM), a trained model of the intermediate representation of the original RFs. The training procedure of the CDM includes two steps: pre-training and fine-tuning. In


the pre-training step, we train the CDM in an unsupervised fashion; this step aims at capturing the characteristics among all RFs. In the fine-tuning step, we focus on training with two types of labeled information (osteoporosis and bone loss rate). We use a greedy layered learning algorithm to train a two-layer deep belief network (DBN), which is the underlying structure of the CDM. All RFs in the original data are projected onto a new space of lower dimensionality by restricting the number of units in the output layer of the DBN. The CDM module thus extracts the integrated risk features from the original data set. These lower-dimensional integrated risk features are a new representation of the original higher-dimensional RFs, which will be evaluated by a two-phase prediction module. In Phase 1, we predict the risk of osteoporosis for all test samples; osteoporotic bones are labeled as the positive output and normal bones as the negative output. Because osteoporotic patients tend to have more severe bone fractures, in Phase 2 we further predict the bone loss rate for all positive samples from Phase 1. A high bone loss rate, the positive output, reveals a higher possibility of bone fractures, and a low bone loss rate is defined as the negative output of Phase 2.
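The two-phase prediction can be sketched as a simple cascade in which Phase 2 runs only on Phase-1 positives. This is an illustrative sketch only: the feature names (`integrated_score`, `bone_loss_score`), thresholds, and stand-in classifiers are hypothetical, not the paper's DBN-based predictors.

```python
# Two-phase cascade sketch of Task 1 (hypothetical classifiers standing in
# for the DBN-based predictors; feature names and thresholds are invented).

def predict_osteoporosis(features):
    # Phase 1: positive = osteoporotic bone, negative = normal bone.
    return features["integrated_score"] > 0.5

def predict_high_bone_loss(features):
    # Phase 2: run only on Phase-1 positives; positive = high bone loss
    # rate (higher fracture risk), negative = low bone loss rate.
    return features["bone_loss_score"] > 0.5

def two_phase_prediction(features):
    if not predict_osteoporosis(features):
        return ("negative", None)            # normal bone: Phase 2 skipped
    phase2 = "high" if predict_high_bone_loss(features) else "low"
    return ("positive", phase2)
```

For example, `two_phase_prediction({"integrated_score": 0.8, "bone_loss_score": 0.2})` returns `("positive", "low")`: the sample is flagged as osteoporotic, then assigned a low bone-loss-rate (lower fracture-risk) label in Phase 2.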

Task 2 – The Informative RF Selection Component. Although the integrated features generated in the first component can effectively predict bone diseases, it is difficult to directly relate their semantics to individual patients. Thus, in this component, we propose to select the most meaningful and significant RFs. Instead of using all samples in the training procedure, we first split the original data set into two parts: diseased samples and non-diseased samples. We then separately train the bone disease memory (BDM) model on the diseased samples and the non-disease memory (NDM) model on the non-diseased samples, shown as dashed arrows in the bottom component of Fig. ??. Once training is complete, both memories are used to reconstruct data from the contrasting groups of samples. A two-layer DBN, the structure of the NDM and BDM, can reconstruct its input samples, but it yields large reconstruction errors if we use the BDM to reconstruct non-diseased samples, because of the mismatch between the input data and the memory module. These contrasts provide valuable information to explain why some individuals are prone to the disease. Similarly, the differences are obvious when reconstructing diseased samples with the NDM. All RFs cumulatively contribute to the reconstruction errors, and our ultimate goal is to find the top-N individual RFs that contribute most to them. The top-N selected RFs form a candidate informative RF list that is validated against medical knowledge in reports from the WHO and the National Osteoporosis Foundation (NOF), as well as the biomedical literature from PubMed.
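The reconstruction-error ranking described above can be sketched as follows. Here `reconstruct` is a hypothetical stand-in for a trained memory model's encode/decode pass, and the toy data at the end are purely illustrative.

```python
import numpy as np

# Sketch of the Task 2 ranking idea: reconstruct contrast samples through
# a trained memory model and rank RFs by cumulative per-feature error.
# `reconstruct` stands in for the two-layer DBN's reconstruction pass.

def top_n_informative_rfs(samples, reconstruct, n=10):
    """samples: (num_samples, num_rfs) array; reconstruct returns the
    same-shaped array of model reconstructions."""
    errors = (samples - reconstruct(samples)) ** 2
    per_rf_error = errors.sum(axis=0)            # cumulative error per RF
    return np.argsort(per_rf_error)[::-1][:n]    # indices of top-N RFs

# Toy check: a fake "memory" that corrupts feature 2 the most, so
# feature 2 should rank first.
X = np.ones((5, 4))
fake = lambda s: s - np.array([0.0, 0.1, 0.9, 0.2])
assert top_n_informative_rfs(X, fake, n=1)[0] == 2
```

In the paper's setting, `samples` would be the non-diseased cohort reconstructed through the BDM (and vice versa), so large per-RF errors mark the factors on which the two cohorts disagree most.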

3 METHODOLOGY

In this section, we first briefly describe the evolution of energy models as preliminaries to our proposed method. Then we introduce single-layer and multi-layer learning approaches to construct our different disease memories. Finally, we present our model for the prediction and informative RF selection of bone diseases.

3.1 Preliminaries

3.1.1 Hopfield Net
A Hopfield network is a form of recurrent artificial neural network invented by John Hopfield [?]. It serves as a content-addressable memory system with binary threshold nodes, where each unit (a node in the graph simulating an artificial neuron) is updated using the following rule:

S_i = 1 if Σ_j W_{ij} S_j > θ_i, and S_i = −1 otherwise, (1)

where W_{ij} is the strength of the connection weight from unit j to unit i, S_j is the state of unit j, and θ_i is the threshold of unit i. Based on Eq. (??), the energy of the Hopfield net is defined as

E = −(1/2) Σ_{i,j} W_{ij} S_i S_j + Σ_i θ_i S_i. (2)

The difference in the global energy that results from a single unit i being 0 (off) versus 1 (on), denoted ∆E_i, is given by

∆E_i = Σ_j W_{ij} S_j − θ_i. (3)

Eq. (??) ensures that when units are randomly chosen to update, the energy E will either decrease or stay the same. Furthermore, repeatedly updating the network will eventually converge to a state that is a local minimum of the energy function (which is considered to be a Lyapunov function [?]). Thus, if a state is a local minimum of the energy function, it is a stable state for the network. Note that this energy function belongs to a general class of models in physics, under the name of Ising models. This in turn is a special case of Markov networks, since the associated probability measure, the Gibbs measure, has the Markov property.
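The energy-descent property of Eqs. (1)-(2) can be checked numerically. A minimal sketch on a hypothetical random network (not part of the paper): asynchronous updates under the threshold rule never increase the energy.

```python
import numpy as np

# Toy Hopfield net: states are +/-1, weights symmetric with zero
# diagonal, thresholds theta (here a hypothetical random instance).
rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(n, n))
W = (W + W.T) / 2            # symmetric weights
np.fill_diagonal(W, 0.0)     # no self-connections
theta = np.zeros(n)
s = rng.choice([-1.0, 1.0], size=n)

def energy(s):
    # E = -1/2 * sum_ij W_ij S_i S_j + sum_i theta_i S_i   (Eq. 2)
    return -0.5 * s @ W @ s + theta @ s

def update_unit(s, i):
    # S_i <- 1 if sum_j W_ij S_j > theta_i else -1          (Eq. 1)
    s = s.copy()
    s[i] = 1.0 if W[i] @ s > theta[i] else -1.0
    return s

# Random asynchronous updates never increase the energy.
e = energy(s)
for _ in range(50):
    i = rng.integers(n)
    s = update_unit(s, i)
    e_new = energy(s)
    assert e_new <= e + 1e-9
    e = e_new
```

Since each update can only lower the energy or leave it unchanged, and the state space is finite, the dynamics must settle into a local minimum, which is exactly the stable-state argument above.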

3.1.2 Boltzmann Machines
Boltzmann machines (BMs) can be seen as the stochastic, generative counterpart of Hopfield nets [?]. They are one of the first examples of a neural network capable of learning internal representations, and they are able to represent and (given sufficient time) solve difficult combinatorial problems. The global energy in

Page 5: Prediction and Informative Risk Factor Selection of Bone Diseases · 2014. 6. 18. · 1 Prediction and Informative Risk Factor Selection of Bone Diseases Hui Li, Xiaoyi Li, Murali

5

a Boltzmann machine is identical in form to that of a Hopfield network, except that the partial derivative with respect to each unit (Eq. (??)) can be expressed as the difference of the energies of two states:

∆Ei = Ei=off − Ei=on. (4)

If we want to train the network so that it converges to a global state according to a data distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities get the lowest energies. The units in the BM are divided into "visible" units, V, and "hidden" units, h. The visible units are those which receive information from the data. The distribution over the data set is denoted P+(V). After the distribution over global states converges, marginalizing over the hidden units gives the estimated distribution P−(V), the distribution of our model. The difference between the two can be measured using the KL-divergence [?], and the partial gradient of this difference is used to update the network. However, the computation time grows exponentially with the machine's size and with the magnitude of the connection strengths.
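The mismatch between the data distribution P+(V) and the model distribution P−(V) is quantified by the KL-divergence. A toy numerical check (the example distributions are arbitrary, not derived from the paper's data):

```python
import numpy as np

# KL(P+ || P-) over a tiny discrete space of visible configurations.
p_data  = np.array([0.5, 0.3, 0.1, 0.1])   # P+(V): empirical distribution
p_model = np.array([0.4, 0.3, 0.2, 0.1])   # P-(V): model's marginal
kl = np.sum(p_data * np.log(p_data / p_model))
# KL is always non-negative and is zero iff the distributions match,
# so driving it toward zero pulls the model toward the data.
assert kl >= 0.0
```

BM learning follows the (negative) gradient of this quantity with respect to the weights; the exponential cost mentioned above comes from the expectations over all global states that this gradient requires.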

3.2 Single-Layer Learning for the Latent Reasons Underlying Observed RFs

To obtain a good RF representation of the latent reasons underlying the data, we propose to use the Restricted Boltzmann Machine (RBM) [?]. An RBM is a generative stochastic graphical model that can learn a probability distribution over its set of inputs, with the restriction that its visible units and hidden units must form a fully connected bipartite graph. Specifically, it has a single layer of hidden units that are not connected to each other and have undirected, symmetric connections to a layer of visible units. We show a shallow RBM in Fig. ??(a). The model defines the following energy function E : {0, 1}^{D+F} → R:

E(v, h; θ) = −Σ_{i=1}^{D} Σ_{j=1}^{F} v_i W_{ij} h_j − Σ_{i=1}^{D} b_i v_i − Σ_{j=1}^{F} a_j h_j, (5)

where θ = {a, b, W} are the model parameters, and D and F are the numbers of visible and hidden units. The joint distribution over the visible and hidden units is defined by

P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ)), (6)

where Z(θ) is the partition function that plays the role of a normalizing constant for the energy function.
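On a tiny RBM, Eqs. (5)-(6) can be verified by brute force: enumerating all binary (v, h) configurations yields the partition function, and the resulting joint probabilities sum to one. The sizes and random parameters below are made up for illustration.

```python
import itertools
import numpy as np

# Brute-force check of the RBM energy and joint distribution on a
# hypothetical tiny model (3 visible units, 2 hidden units).
rng = np.random.default_rng(0)
D, F = 3, 2
W = rng.normal(size=(D, F))
b = rng.normal(size=D)
a = rng.normal(size=F)

def energy(v, h):
    # E(v, h) = -v'Wh - b'v - a'h                           (Eq. 5)
    return -(v @ W @ h + b @ v + a @ h)

# Enumerate all 2^(D+F) joint configurations.
states = [np.array(s) for s in itertools.product([0, 1], repeat=D + F)]
Z = sum(np.exp(-energy(s[:D], s[D:])) for s in states)      # partition fn
probs = [np.exp(-energy(s[:D], s[D:])) / Z for s in states] # Eq. 6
assert abs(sum(probs) - 1.0) < 1e-12
```

This enumeration is exactly what becomes intractable at realistic sizes (2^(D+F) terms), which is why the approximate learning procedure below is needed.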

Exact maximum likelihood learning is intractable in an RBM. In practice, efficient learning is performed using Contrastive Divergence (CD) [?]. In particular, each hidden unit activation is penalized in the form Σ_{j=1}^{F} KL(ρ|h_j), where F is the total number of hidden units, h_j is the activation of unit j, and ρ is a predefined sparsity parameter, typically a small value close to zero (we use 0.05 in our model). So the overall cost of the sparse RBM used in our model is:

E(v, h; θ) = −Σ_{i=1}^{D} Σ_{j=1}^{F} v_i W_{ij} h_j − Σ_{i=1}^{D} b_i v_i − Σ_{j=1}^{F} a_j h_j + β Σ_{j=1}^{F} KL(ρ|h_j) + λ‖W‖, (7)

where ‖W‖ is the regularizer, and β and λ are hyper-parameters.1

Fig. 4: (a) Shallow Restricted Boltzmann Machine, which contains a layer of visible units v that represent the data and a layer of hidden units h that learn to represent features capturing higher-order correlations in the data. The two layers are connected by a matrix of symmetrically weighted connections, W, and there are no connections within a layer. (b) A 2-layer DBN in which the top two layers form an RBM and the bottom layer forms a multi-layer perceptron. It contains a layer of visible units v and two layers of hidden units h1 and h2.

The advantage of the RBM is that it provides an expressive representation of the input RFs. Each hidden unit in an RBM is able to encode at least one high-order interaction among the input variables. Given a specific number of latent reasons in the input, an RBM requires fewer hidden units to represent the problem's complexity. Under this scenario, RFs can be analyzed by an RBM with an efficient CD learning algorithm. In this paper, we use RBMs for unsupervised greedy layer-wise pre-training. Specifically, each sample describes a state of the visible units in the model. The goal of learning is to minimize the overall energy so that the data distribution can be better captured by the single-layer model.
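One CD step for a binary RBM can be sketched as follows. This is a schematic CD-1 sketch with made-up dimensions and learning rate, not the authors' implementation, and it omits the sparsity and weight-decay terms of Eq. (7).

```python
import numpy as np

# CD-1 sketch for a binary RBM (hypothetical sizes: D visible, F hidden).
rng = np.random.default_rng(0)
D, F = 6, 4
W = 0.01 * rng.normal(size=(D, F))
b = np.zeros(D)                  # visible biases
a = np.zeros(F)                  # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, lr=0.1):
    """One CD-1 parameter update from a single data vector v0."""
    global W, b, a
    ph0 = sigmoid(v0 @ W + a)                    # P(h=1 | v0): positive phase
    h0 = (rng.random(F) < ph0).astype(float)     # sample hidden state
    pv1 = sigmoid(h0 @ W.T + b)                  # reconstruction P(v=1 | h0)
    v1 = (rng.random(D) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)                    # negative phase
    # Approximate gradient: data statistics minus reconstruction statistics.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    a += lr * (ph0 - ph1)

v = (rng.random(D) < 0.5).astype(float)          # one binary training sample
for _ in range(10):
    cd1_update(v)
```

The positive phase pushes energy down on observed data while the negative phase pushes it up on the model's own reconstructions, which is the cheap surrogate CD uses for the intractable maximum-likelihood gradient.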

3.3 Multi-Layer Learning for Mining Abstractive Reasons

The new representations learned by a shallow RBM (a one-layer RBM) can model some directed hidden causalities behind the RFs. But there are more abstractive reasons behind them (i.e., the reasons of the reasons). To sufficiently model reasons at different levels of abstraction, we can stack more layers onto the shallow RBM to form a deep graphical model, namely a DBN [?]. A DBN is a probabilistic generative model composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units. The

1. We tried different settings for both β and λ and found that our model is not very sensitive to these parameters. We fixed β to 0.1 and λ to 0.0001.


top two layers form an RBM, which can be viewed as an associative memory. The lower layer forms a multi-layer perceptron (MLP) [?], which receives top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.

There is an efficient, layer-by-layer procedure for learning the top-down, generative weights that determine how the variables in one layer depend on the variables in the layers above. The bottom-up inference from the observed variables v through the hidden layers h^k (k = 1, ..., l) follows a chain rule:

p(h^l, h^{l−1}, ..., h^1 | v) = p(h^l | h^{l−1}) p(h^{l−1} | h^{l−2}) ⋯ p(h^1 | v), (8)

where if we denote bias for the layer k as bk and σ isa logistic sigmoid function, for m units in layer k andn units in layer k − 1,

p(hk|hk−1) = σ(bkj +∑m

j=1Wkjih

k−1i ). (9)

The top-down inference is a symmetric version of the bottom-up inference, which can be written as

p(h^{k-1}_i = 1 \mid h^k) = \sigma\Big(a^{k-1}_i + \sum_{j=1}^{m} W^k_{ji} h^k_j\Big),   (10)

where we denote the bias for layer $k-1$ as $a^{k-1}$.

We show a two-layer DBN in Fig. ??(b), in which the pre-training follows a greedy layer-wise training procedure. Specifically, one layer is added on top of the network at each step, and only that top layer is trained as an RBM using the CD strategy [?]. After each RBM has been trained, its weights are clamped, a new layer is added, and the above procedure is repeated. After pre-training, the values of the latent variables in every layer can be inferred by a single, bottom-up pass that starts with an observed data vector in the bottom layer and uses the generative weights in the reverse direction. The top layer of the DBN forms a compressed manifold of the input data, in which each unit in this layer has a distinct weighted non-linear relationship with all of the input factors.
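The single bottom-up pass can be sketched by applying Eq. (9) layer by layer. The helper below is an illustrative assumption on our part (the list-of-matrices layout and function name are ours, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bottom_up_pass(v, weights, biases):
    """Infer every layer of a trained DBN in one bottom-up sweep.

    v: (n_0,) observed data vector;
    weights[k]: (n_{k+1}, n_k) weight matrix; biases[k]: (n_{k+1},) hidden bias.
    Returns the top-layer activation probabilities.
    """
    h = v
    for W, b in zip(weights, biases):
        h = sigmoid(b + W @ h)  # Eq. (9): p(h^k | h^{k-1})
    return h
```

The returned top-layer vector is the compressed-manifold representation of the input described above.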

3.4 Integrated Risk Features for Bone Disease Prediction

Our goal is to disentangle the salient integrated features from the complex EHR data for bone disease prediction. We propose to define a learning model based on the given data set for two types of bone disease prediction: osteoporosis and bone loss rate. Our general idea is shown in Fig. ??, where a good RF representation for predicting osteoporosis and bone loss rate is achieved by learning a set of intermediate representations using a DBN structure at the bottom, with a classifier appended on top. This multi-learning model can capture the characteristics from both the observed input (bottom-up learning) and the labeled information (top-down learning). The internal model, which memorizes the trained parameters using the whole training data and preserves the information for both normal and abnormal patients, is termed the comprehensive disease memory (CDM). That is, the learned representation model CDM discovers good intermediate representations that can be shared across the two prediction tasks by combining knowledge from the input layer, with the original training data, and the output layer, with two types of class labels.

Fig. 5: Osteoporosis and bone loss rate prediction using a two-layer DBN model

As shown in Algorithm ??, the training procedure for CDM concentrates on two specific prediction tasks (osteoporosis and bone loss rate), with all RFs as the input and model parameters as the output. It includes a pre-training stage and a fine-tuning stage. In the first stage, the unsupervised pre-training stage, we apply the layer-wise CD learning procedure to put the parameter values in an appropriate range for further supervised training. It guides the learning towards basins of attraction of minima that support good RF generalization from the training data set [?]. So the result of the pre-training procedure establishes an initialization point for the fine-tuning procedure inside a region of parameter space to which the parameters are henceforth restricted. In the second stage, the fine-tuning (FT) stage, we take advantage of label information to train our model in a supervised fashion. In this way, the prediction errors for both prediction tasks will be minimized. Specifically, we use the parameters from the pre-training stage to calculate the prediction results for each sample and then back-propagate the errors between the predicted result and the ground truth about osteoporosis from top to bottom to update the model parameters to a better state. Since we have another type of labeled information, we then repeat the fine-tuning stage by calculating the errors between the predicted result and the other ground truth, about bone loss rate. After the two-stage training procedure, our CDM is well trained and can be used to predict osteoporosis and bone loss rate simultaneously.

Algorithm 1 DBN Training Algorithm with 2-stage Fine-tuning for Bone Disease Prediction

Input: All risk factors, learning rate ε, Gibbs rounds z
Output: Model parameters M (W, a, b)

Pre-training Stage:
1: Randomly initialize all W, a, b
2: for t from layer V to h^{l-1} do
3:   clamp t and run CD_z to update M_t and t+1
4: end for
Fine-tuning Stage:
5: randomly drop out 30% of the hidden units in each layer
6: loop
7:   for each predicted result (r) do
8:     calculate the cost (c) between r and the ground truth g1
9:     calculate the partial gradient of c with respect to M
10:    update M
11:    calculate the cost (c′) on the holdout set
12:    if c′ is larger than c′−1 for 5 rounds then
13:      break
14:    end if
15:  end for
16: end loop
17: repeat the fine-tuning stage with the ground truth g2

In Algorithm 1, lines 2 to 4 reflect a layer-wise Contrastive Divergence (CD) learning procedure, where z is a predetermined hyper-parameter that controls how many Gibbs rounds each sampling step completes, and t+1 is the state of the upper layer. In our experiments, we choose z to be 1. The pre-training phase stops when all layers are exhausted. Lines 5 to 15 show a standard gradient update procedure (fine-tuning). We update the model parameters from top to bottom by a simple two-step procedure. First, we update the model parameters M by gradient descent on the cost c for the training set. Second, we use early stopping as a type of regularization to avoid over-fitting. We compare the cost of the current step c′ with that of the previous step c′−1 on the validation set (holdout set) and halt the procedure when the validation error stops decreasing or starts to increase within 5 mini-batches. Since we have ground truths g1 and g2, representing osteoporosis and bone loss rate, we run the second fine-tuning procedure using g2 after the stage using g1. Moreover, we randomly drop out 30% of the hidden units in each layer to alleviate the counter effect between the different label information during the fine-tuning stage.
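The early-stopping loop of the fine-tuning stage (lines 6 to 16) can be sketched as below. Here grad_step and val_cost are caller-supplied placeholders standing in for the gradient update and holdout-set cost, and the patience counter is our paraphrase of the "larger for 5 rounds" rule, not the paper's exact code:

```python
def fine_tune_stage(params, grad_step, val_cost, max_iters=200, patience=5):
    """One fine-tuning stage with early stopping (illustrative sketch).

    grad_step(params) -> params: one gradient-descent update on the training set.
    val_cost(params) -> float: cost on the holdout (validation) set.
    Stops when the validation cost fails to improve for `patience` rounds.
    """
    best_cost, bad_rounds = float("inf"), 0
    for _ in range(max_iters):
        params = grad_step(params)
        c = val_cost(params)
        if c < best_cost:
            best_cost, bad_rounds = c, 0  # validation error still decreasing
        else:
            bad_rounds += 1               # no improvement this round
            if bad_rounds >= patience:
                break                     # early stopping kicks in
    return params
```

Running this stage once with the osteoporosis labels (g1) and once with the bone loss rate labels (g2) gives the two-stage procedure of Algorithm 1.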

The main advantage of the DBN in the above training procedure is that it tends to produce more expressive and invariant results than a single-layer network and also reduces the size of the representation. This approach obtains a filter-like representation if we treat the unit weights as filters [?]. We want to filter out the insignificant RFs and thus find the robust, integrated features which are a fusion of both the observed RFs and the hidden reasons for predicting bone diseases.

Fig. 6: Informative RF selection model with DBN

Disease risk prediction requires building a predictive model, such as a classifier, for a specific disease condition using the integrated features in CDM. The new CDM representation of RFs extracted by a two-layer DBN can serve as the input to several traditional classifiers, as shown in Fig. ??. To incorporate labeled samples, we propose to add a regression layer on top of the DBN to obtain classification results, which can be used to update the overall model via back propagation. Based on the proposed model in Fig. ??, physicians and researchers can assess the risk of a patient developing osteoporosis or bone fracture. Then a proper intervention and care plan can be designed accordingly for the purpose of prevention or disease control.

3.5 Informative Risk Factor Selection

In the previous section, we proposed CDM to model both diseased patients and healthy patients together, establishing a comprehensive disease memory that captures the salience of all RFs through a limited number of integrated features for predicting osteoporosis and bone loss rate. However, informative RF selection aims to capture the differences between the diseased patients and the non-diseased patients. Therefore, our CDM model cannot be applied to this task since it models all patients together. In this section, we propose to model the diseased patients and healthy patients separately based on their unique characteristics and identify the RFs that cause the disease (osteoporosis). Two variants of disease memory will be introduced to conduct the informative RF selection for bone diseases.

Bone Disease Memory (BDM). We term the bone disease memory (BDM) model as a variant of DM that differs substantially from the CDM model. The difference mainly lies in the input data during the training and testing stages. The ultimate goal of BDM is to monitor those RFs that cause people to develop osteoporosis. Therefore, we have a crucial step for splitting the data set, as shown in the bottom block of Fig. ??. During the training stage, the top block of Fig. ?? shows a hierarchical latent DBN structure that is trained by applying diseased RFs, as shown by the dashed arrows


in this figure. An interesting property of DBNs is their capability of reconstructing RFs [?]. Therefore, RFs reconstructed using BDM are reflections of the diseased individuals. We try to minimize the errors between both sides for a well-trained BDM. Based on this property of DBNs, if there is a large error between the original RF and the reconstructed one, this RF is likely to be a noisy RF and should not be considered further. After such a noisy-RF selection process, we can measure the reconstruction error for each RF to find possible informative RFs in the testing stage. However, in the testing stage, we use non-diseased RFs as the input, as shown by the solid arrows in Fig. ??. This monitors the differences between the original RFs and the reconstructed RFs. Therefore, we look for a large error between both sides during the testing stage. The larger the error, the more likely the RF is informative. Under this scenario, we measure the reconstruction error for each RF to filter out noisy RFs in the training stage and to find possible informative RFs in the testing stage. We rank the total reconstruction error to select the top-N informative RFs using the following distance metrics:

Reconstructed Error in the Training Stage:

d^{(k)}_{train} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(RRF^{(k)}_i - ORF^{(k)}_i\big)^2},

where we use the Root Mean Square Error (RMSE) to calculate the kth RF distance between the reconstructed RF $RRF^{(k)}_i$ and the original RF $ORF^{(k)}_i$ over the training samples, and n is the total number of training samples.

Reconstructed Error in the Testing Stage:

d^{(k)}_{test} = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\big(RRF^{(k)}_j - ORF^{(k)}_j\big)^2},

where we use RMSE to calculate the kth RF distance between the reconstructed RF $RRF^{(k)}_j$ and the original RF $ORF^{(k)}_j$ over the test samples, and m is the total number of testing samples. For a new incoming sample, we still use the above formula but set m to 1.

Total Reconstructed Error:

d^{(k)}_{total} = \big|\, d^{(k)}_{test} - d^{(k)}_{train} \,\big|,

where $d^{(k)}_{total}$ represents the total error of both stages for the kth RF, calculated as an absolute distance. Note that only the RFs with a large reconstruction error in the testing stage and a small error in the training stage (i.e., not noisy RFs) are regarded as informative RFs. We rank $d^{(k)}_{total}$ in decreasing order and yield a top-N informative RF list by selecting the first N terms.
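A minimal sketch of this reconstruction-error ranking, assuming the original and reconstructed RFs are stored as samples-by-factors matrices (the layout and function names are our assumptions):

```python
import numpy as np

def rmse_per_rf(original, reconstructed):
    """Column-wise RMSE between original and reconstructed RF matrices.

    Both arguments are (n_samples, n_risk_factors) arrays; returns one
    distance d^(k) per risk factor, as in the training/testing formulas.
    """
    return np.sqrt(np.mean((reconstructed - original) ** 2, axis=0))

def top_n_informative(d_train, d_test, n=20):
    """Rank RFs by d_total = |d_test - d_train| in decreasing order."""
    d_total = np.abs(d_test - d_train)
    return np.argsort(-d_total)[:n]  # indices of the top-N informative RFs
```

Note that a large d_total here already presumes the noisy-RF filtering step, i.e., the training-stage error is small for the RFs being ranked.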

Non-Disease Memory (NDM). Similarly, we term the non-disease memory (NDM) model as a model trained on the non-diseased individuals so as to focus on the characteristics of those patients who have healthy bone. The structure of NDM is similar to BDM. However, the input data for training and testing the NDM model are swapped. The training procedure for NDM uses all non-diseased RFs as input data, instead of the diseased RFs in Fig. ??. During the testing stage, we replace the non-diseased RFs in Fig. ?? with diseased RFs and aim to observe whether an osteoporotic diseased individual can get back to normal. This procedure can serve as a cross-validation to evaluate the informative RFs provided by BDM. Since only the informative RFs produce a large total reconstruction error if we successfully remove the unreliable data, the informative RFs predicted by BDM and NDM should be consistent. We apply the same distance metrics as for BDM when calculating the total reconstruction error.

4 EXPERIMENTS

4.1 Data Set

The Study of Osteoporotic Fractures (SOF) is the largest and most comprehensive study of RFs for bone diseases, which includes 9704 Caucasian women aged 65 years and older. It contains 20 years of prospective data about osteoporosis, bone fractures, breast cancer, and so on. Potential RFs and confounders were classified into 20 categories such as demographics, family history, lifestyle, and medical history [?]. As shown in Fig. ??, there are missing values in both the RF space and the label space, denoted as empty shapes.

A number of potential RFs are grouped and organized at the first and second visits, which include 672 variables scattered across 20 categories as the input of our model. The rest of the visits contain time-series dual-energy x-ray absorptiometry (DXA) scan results on bone mineral density (BMD) variation, which are extracted and processed as the labels for our data set. Based on the WHO standard, a T-score of less than -1 indicates the osteopenia condition that is the precursor to osteoporosis, which is used as the first type of label. The second type of label is the annual rate of BMD variation. We use at least two BMD values in the data set to calculate the bone loss rate and define a high bone loss rate as greater than 0.84% bone loss per year [?]. Notice that this is a partially labeled data set, since some patients only came for the first and second visits and never took a DXA scan in the following visits, like the example Patient3 shown in Fig. ??.

4.2 Evaluation Metrics

The error rate on a test data set is commonly used to evaluate classification performance. Nevertheless, for most skewed medical data sets, the error rate can still be low when the entire minority class is misclassified into the majority class. Thus, two alternative measurements are used in this paper. First, Receiver Operating Characteristic (ROC) curves are plotted to capture how the number of correctly classified abnormal cases varies with the number of normal cases incorrectly classified as abnormal. Second, since in most medical problems we usually care about the fraction of examples classified as abnormal that are truly abnormal, Precision-Recall (PR) curves are also plotted to show this property. We present the confusion matrix in Table ?? and several derived quality measures in Table ??.

2. A T-score of -1 corresponds to a BMD of 0.82, if the reference BMD is 0.942 and the reference standard deviation is 0.122.

Fig. 7: Illustration of missing values for the SOF data set, shown in non-shaded shapes for both the RF space and the label space. Two types of label information, L1 and L2, with binary values are shown.

TABLE 1: Confusion matrix.

                           Actual Class
                           Positive   Negative
Predicted Class  Positive  TP         FP
                 Negative  FN         TN

TABLE 2: Metrics definition.

True Positive Rate = TP / (TP + FN)
False Positive Rate = FP / (FP + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Error Rate = (FP + FN) / (TP + TN + FP + FN)
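The measures in Table 2 follow directly from the confusion-matrix counts; a small helper (the function and key names are our own):

```python
def metrics(tp, fp, fn, tn):
    """Derived quality measures from confusion-matrix counts (Table 2)."""
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # identical to the true positive rate
        "error_rate": (fp + fn) / (tp + tn + fp + fn),
    }
```

Sweeping a decision threshold over the classifier's belief scores and recomputing these counts at each threshold yields the ROC curve (TPR vs. FPR) and the PR curve (precision vs. recall).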

4.3 Experiments and Results for Integrated Risk Features Extraction

4.3.1 Experiment Setup

To show the predictive power of the integrated features extracted by our CDM model, we manually choose RFs based on expert opinion [?], [?], [?] as the baseline approach, shown in Table ??. For a fair comparison, we fix the number of output dimensions to be equal to the number of expert-selected RFs. Specifically, we fix the number of units in the output layer to 11, where each unit in this layer represents a new integrated feature describing complex relationships among all 672 input factors, rather than one of the typical RFs selected by experts shown in Table ??.

TABLE 3: Typical risk factors from the expert opinion

Variable              Type     Description
Age                   Numeric  Between 65 and 84
Weight                Numeric
Height                Numeric
BMI                   Numeric  BMI = weight/height^2
Parent fall           Boolean  Hip fracture in the patient's mother or father
Smoke                 Boolean
Excess alcohol        Boolean  3 or more units of alcohol daily
Rheumatoid arthritis  Boolean
Physical activity     Boolean  Use of arms to stand up from chair
Physical exercise     Boolean  Takes walks for exercise
BMD                   Numeric  Normal: T-score > -1; Abnormal: T-score <= -1

TABLE 4: AUC of ROC and PR curves for the expert knowledge model and our model with four different structures

Risk Factors From:      LR-ROC  SVM-ROC  LR-PR  SVM-PR
Expert knowledge        0.729   0.601    0.458  0.343
Shallow RBM without FT  0.638   0.591    0.379  0.358
Shallow RBM with FT     0.795   0.785    0.594  0.581
DBN without FT          0.662   0.631    0.393  0.386
DBN with FT             0.878   0.879    0.718  0.720

To test the predictive power of either our integrated risk features or the risk features given by expert opinion, we put a regression layer (i.e., a classifier) on top of the DBN to obtain classification results, as shown in Fig. ??. Since no single classifier is considered to perform best on all classification tasks, we use two classical classifiers to validate the results generated by CDM against the expert opinion. Logistic Regression (LR) is widely used among experts to assess clinical RFs and predict fracture risk. Support Vector Machines (SVMs) have also been applied to various real-world problems.

We conduct cross-validation throughout our experiments. It is noteworthy that holding out portions of the data set is a manner similar to cross-validation. In Algorithm 1, the unsupervised pre-training of the CDM model employs both labeled and unlabeled training samples, while the supervised fine-tuning phase is conducted by 5-fold cross-validation on the labeled training examples. Specifically, we divide the whole data set into 5 parts, in which 3 parts are used to train the model, the fourth part is used as the holdout set for mitigating the impact of over-fitting, and the fifth part is used to run a classification test. In the next run, the parts used for training, holding out, and testing are rotated. Each run on the testing samples outputs a vector in the range [0, 1] indicating the belief score for a class, yielding 5 independent vectors in total after 5-fold cross-validation. When plotting ROC and PR curves, the 5 vectors are concatenated into one vector, along with their equal-sized label vectors. In this way, the AUC score is indeed the averaged score over the 5-fold cross-validation runs.

4.3.2 Performance study for osteoporosis prediction

The overall results for the SOF data after Phase 1 are shown in Table ??. The area under the curve (AUC) of the ROC curve for each classifier (denoted "LR-ROC", "SVM-ROC") and the AUC of the PR curve (denoted "LR-PR", "SVM-PR") are shown in Table ??. AUC indicates the performance of a classifier: the larger the better (an AUC of 1.0 indicates perfect performance). The classification results using expert knowledge are also shown as the baseline for performance comparison. From Table ??, we observe that the "shallow RBM without FT" method gets a sense of how the data are distributed, which represents the basic characteristics of the data itself. Although its performance is not always higher than the expert model, this is a completely unsupervised process that does not borrow knowledge from any type of labeled information. Achieving such comparable performance is not easy, since the expert model is trained in a supervised way. Further improvements are possible with the help of labels, via the two-stage fine-tuning designed to better fit our prediction tasks. Next we move from an unsupervised task to a semi-supervised task. Table ?? also shows the classification results that boost the performance of all classifiers because of the two-stage fine-tuning, shown as "shallow RBM with FT". In particular, the AUC of PR of our model significantly outperforms the expert system. Since the capacity of an RBM model with one hidden layer is usually small, this indicates the need for a more expressive model over the complex data. To satisfy this need, we add a new layer of non-linear perceptrons at the bottom of the RBM, which forms a DBN as shown in Fig. ??(b). This newly added layer greatly enlarges the overall model expressiveness. More importantly, the deeper structure is able to extract more abstractive reasons. As we expected, unsupervised pre-training of the deeper structure yields better performance than the shallow RBM model (denoted "DBN without FT"), and the model further improves after the two-stage fine-tuning, shown as "DBN with FT" in Table ??.

4.3.3 Performance study for bone loss rate prediction

In this section, we show the bone loss rate prediction using the abnormal cases after Phase 1. A high bone loss rate is an important predictor of higher fracture risk. Moreover, it is reported that the RFs that account for high and low bone loss rates are different [?]. Our integrated risk features are good at detecting this property, since they integrate the characteristics of the data itself and are nicely tuned with the help of two kinds of labels. We compare the results of the expert knowledge based model with our DBN-with-fine-tuning model, which yields the best performance for Phase 1. The classification error rate is defined in Table ??.

Since our model is also fine-tuned by the bone loss rate, we can directly feed the new integrated features into Phase 2. Table ?? shows that our model outperforms the expert model when predicting bone loss rate.

TABLE 5: Classification error rates of the expert knowledge model and our model

              LR-Error  SVM-Error
Expert        0.383     0.326
DBN with FT   0.107     0.094

In this case, the expert model fails because the limited features are not sufficient to forecast the bone loss rate, which may interact with many other RFs. This highlights the need for a more complex model to extract precise attributes from the large pool of potential RFs. Moreover, our CDM module takes into account the whole data set, not only keeping all 672 risk factors but also utilizing the two types of labels. The integrated risk features preserve the characteristics of the bone loss rate after the second round of fine-tuning, which assists in bone loss rate prediction.

4.4 Experiments and Results for Informative Risk Factor Selection

In this section, we show experiments and results on informative RF selection. Based on the proposed method shown in Fig. ??, we show a case study that lists the top 20 informative RFs selected using BDM and NDM in Table ??. A description of each variable can be found on the data provider's website [?].

In this study, osteoporosis appears to be associated with several known RFs that are well described in the literature. Based on the universal rules used by FRAX [?], a popular fracture risk assessment tool developed by the WHO, some of the selected RFs have already been used to evaluate patients' fracture risk, such as age, fracture history, family history, BMD, and excess alcohol intake. Besides, most of the informative RFs we report in Table ?? have been reviewed and endorsed by bone health institutions and medical researchers. For example, some physical and lifestyle risk factors, such as dizziness (DIZZY), vital status (CSHAVG), inability to rise from a chair (STDARM), and daily exercise (50TMWT), have been examined as important risk factors for osteoporotic fractures [?], [?], [?]. Blood pressure is a secondary risk factor in that blood pressure pills may increase the risk [?]. Breast cancer has been examined as a risk factor by the National Institutes of Health (NIH), which reports that women with breast cancer are at increased risk of developing osteoporosis due to a drop in estrogen levels, chemotherapy, and the production of osteoclasts [?]. Of greatest interest is that some physical performance measures, such as steadiness/steps in turning (TURNUM, STEADY, STEPUP) and aids used for pace tests (GAID), are readily identifiable informative risk factors and can be easily incorporated into routine clinical practice. Based on these results, some environmental/behavioral risk factors are modifiable, and preventions and therapeutic interventions are needed to reduce osteoporosis and fracture risks.


TABLE 6: Informative risk factors generated by BDM and NDM

Category              Variable  Description
Demographics          AGE       The patient's age at this visit
Fracture history      IFX14     Vertebral fractures
                      INTX      Intertrochanteric fractures
                      FACEF     Face fracture
                      ANYF      Follow-up time to 1st any fracture since current visit
Family history        MHIP80    Mom hip fracture after age 80
Exam                  DSTBMC    Distal radius bone mass content (gm/cm)
                      PRXBMD    Proximal radius bone mass density (gm/cm^2)
Physical performance  TURNUM    Number of steps in turn
                      STEADY    Steadiness of turn
                      STEPUP    Ability to step up one step
                      STDARM    Does participant use arms to stand up?
                      GAID      Aid used for pace tests (i.e., crutch, cane, walker)
Exercise              50TMWT    Total number of times of activity/year at age 50
Life style            DR30      During the past 30 days, how often did you have 5 or more drinks during one day
Breast cancer         BRSTCA    Breast cancer status such as tumor behavior, staging of tumor, and so on
Blood pressure        LISYS     Systolic blood pressure lying down (mmHg)
                      DIZZY     Dizziness upon standing up
Vision                CSHAVG    Average contrast sensitivity

On the other hand, a learner may need to purchase data during the training stage and recruit people to answer hundreds or thousands of questions. With a fixed budget and limited time, it may be impossible to acquire every possible feature for all participants. So what are the most important questions the physicians need to ask? How many features do they need to achieve good predictive performance? Using the proposed approach, we select at most the top 50 informative RFs, instead of using all of them, and feed them directly to a logistic regression classifier for osteoporosis prediction. Fig. ?? shows the osteoporosis prediction AUC results for both the ROC and PR curves as the number of informative RFs increases. As we can see, the proposed informative RF selection method exhibits great power in predicting osteoporosis, in that the selected RFs are more significant than the rest of the RFs. Moreover, the best prediction performance is achieved when selecting the top 20 to top 25 informative RFs, and the AUC is even better than the expert knowledge model (AUC of ROC: 0.729; AUC of PR: 0.458). The prediction performance of the top-N RFs selected by BDM and NDM is inferior to that of the integrated RFs extracted by CDM (AUC of ROC: 0.878; AUC of PR: 0.718), in that some information is discarded, and that information might still contribute to enhancing predictive behavior.

5 SENSITIVITY ANALYSIS AND PARAMETER SELECTION

5.1 Sensitivity to Skewed Class

We provide the data set statistics in Fig. ??. We collected patients' BMD at the baseline visit and at the visit 10 years later. The whole data set can be split into five parts: normal BMD to normal BMD, normal BMD to osteoporotic BMD, osteoporotic BMD to normal BMD, osteoporotic BMD to osteoporotic BMD, and missing BMD.

Fig. 8: Osteoporosis prediction based on informative RFs. (a) AUC of ROC curve; (b) AUC of PR curve. Each panel compares Informative RF, Integrated RF, and Expert RF as the number of informative RFs increases.

Fig. 9: SOF data set statistics (NormalToNormal: 1696; NormalToOsteo: 689; OsteoToNormal: 2653; OsteoToOsteo: 3036; Missing Label: 1630).

We used 8074 labeled patients for osteoporosis prediction in Section ??, in which 4349 patients (NormalToNormal and OsteoToNormal) have normal BMD and 3725 patients (NormalToOsteo and OsteoToOsteo) have osteoporotic BMD 10 years later. Although the class distribution is not far from uniform, it is still necessary to discuss how well our model overcomes the skewed class problem. We conduct experiments on a partial data set that is highly imbalanced to examine our model, since the imbalanced class problem may seriously degrade a classifier's performance on the minority class. We manually remove the OsteoToOsteo data, shown as the fourth bar in Fig. ??. As a result, the data set has a ratio of roughly 6:1 between the two classes (4349 normal patients and 689 osteoporotic patients). Algorithm 1 includes two steps: pre-training and fine-tuning. During the pre-training phase, we do not rely on any label information; that is, it is an unsupervised learning procedure. If the training set is imbalanced, this pre-training will likely initialize the structure of the model close to the majority class. The common solutions for balancing the data are under-sampling (ignoring data from the majority class), over-sampling (replicating data from the minority class), and informed under-sampling (selecting data according to some set of principles) [?], [?], [?]. One straightforward way is to independently sample several subsets from the majority class, with each subset having approximately the same number of examples as the minority class. In this way, we can create six roughly balanced data sets by replicating the minority class and partitioning the majority class into six subsets. Then we can independently train six classifiers and count votes for the final decision, as shown in Fig. ??.

Fig. 10: Training each DBN with the balanced data and voting for the final prediction

To test the effect of the class distributions on our proposed model, we compare the performance of the proposed model shown in Fig. ?? with a DBN structure with a logistic regression model appended on top. We still use the AUC of the ROC and PR curves as performance evaluation measures, shown in Fig. ??. Experimental results show that class imbalance is harmful for osteoporosis prediction, and dealing with this problem improves the AUC scores of both the ROC and PR curves. We also observed that for models that add fine-tuning, imbalance is especially problematic, since imbalanced label information leads our model to a local minimum that ignores the minority cases.
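The balanced-subset construction and vote counting can be sketched as follows; the generator and vote function are illustrative assumptions on our part, not the paper's code:

```python
import numpy as np

def balanced_subsets(X_maj, X_min, n_subsets=6, seed=0):
    """Yield n_subsets roughly balanced (X, y) training sets.

    The majority class is partitioned into n_subsets chunks; the full
    minority class is paired (i.e., replicated) with each chunk, as in
    the informed under-sampling scheme described above.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_maj))
    for chunk in np.array_split(idx, n_subsets):
        X = np.vstack([X_maj[chunk], X_min])
        y = np.concatenate([np.zeros(len(chunk)), np.ones(len(X_min))])
        yield X, y

def majority_vote(predictions):
    """Final decision by counting votes over per-classifier binary
    predictions, shaped (n_classifiers, n_samples)."""
    predictions = np.asarray(predictions)
    return (predictions.mean(axis=0) >= 0.5).astype(int)
```

One classifier (here, a DBN with a logistic regression layer) would be trained per subset, and majority_vote combines their test-time predictions.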

5.2 Sensitivity to Noisy DataFig. ?? shows that there are missing values or noisyvalues in the risk factor space of the data set. This is acommon problem in most clinical datasets. To handlewith missing/noisy data, we follow up two steps: (1)manually removing those risk factor columns withmore than 70% missing values during the data pre-processing procedure (249 of 672 risk factor columnswith more than 70% missing values are deleted)and (2) using a column-wise mean to fill out theblank for the surviving columns. Then we rely onthe good de-noising properties coming from the basicstructure “RBM” of our model. The de-noising ca-pability of RBM model has been widely examinedby some computer vision tasks [?], [?]. ContrastiveDivergence training is actually a stochastic sampling

[Figure: ROC curve (False Positive Rate vs. True Positive Rate) and PR curve (Recall vs. Precision). ROC AUC: DBN with imbalanced data 0.61, DBN with balanced data 0.655, DBN-FT with imbalanced data 0.742, DBN-FT with balanced data 0.872. PR AUC: 0.342, 0.379, 0.517 and 0.705, respectively.]

Fig. 11: Both ROC and PR curves show the effect of the class distributions on the CDM model, indicated by the AUC of the ROC and PR curves.

process, which randomly turns on the hidden units based on their activation probabilities. This randomness cancels out the data noise to a certain extent. Moreover, the data distribution is consistent across all training samples, whereas the noise distribution differs from sample to sample. When we feed the model enough samples, the sampling process drives the model toward the data distribution, because this distribution occurs more frequently. In future studies, we may apply a more sophisticated method, such as matrix-factorization-based collaborative filtering, to maximize the use of the original data.
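The two preprocessing steps described earlier (dropping risk-factor columns with more than 70% missing values, then mean-imputing the survivors) can be sketched as below. The 70% threshold matches the text; the helper name and data-frame layout are illustrative assumptions:

```python
import pandas as pd

def preprocess_risk_factors(df, max_missing=0.70):
    """Drop risk-factor columns whose fraction of missing values exceeds
    `max_missing`, then fill the remaining blanks with column-wise means."""
    # fraction of missing entries per column
    keep = df.columns[df.isna().mean() <= max_missing]
    out = df[keep].copy()
    # mean imputation on the surviving numeric columns
    return out.fillna(out.mean(numeric_only=True))
```

On the paper's data this procedure would reduce 672 risk-factor columns to 423 before imputation; the RBM's stochastic sampling then absorbs the residual noise, as discussed above.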

5.3 Parameter Selection

The number of hidden units is closely related to the representational power of our model. Ideally, we can represent any discrete distribution exactly when the number of hidden units is very large [?]. In our experiment, we examine the power of our model as we increase the number of binary units in the first hidden layer. Fig. ?? shows the performance of our CDM model under different numbers of hidden units. When the number of hidden units is small, the model lacks the capacity to capture the complexity of the data, which lowers the AUC of both the ROC and PR curves. As we increase the number of hidden units, the model shows strictly improved modeling power. However, when the number is too large, we do not have sufficient samples to train the network, which reduces both performance and stability. In our experiment, we choose 400 hidden units.
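A capacity sweep of this kind can be sketched with scikit-learn's `BernoulliRBM` standing in for one layer of the proposed model; the synthetic data, label rule, and trimmed sweep range are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
# synthetic binary risk-factor matrix and a label weakly tied to one column
X = (rng.random((150, 30)) > 0.5).astype(float)
y = (X[:, 0] + rng.random(150) > 1.0).astype(int)

results = {}
for n_hidden in [10, 50, 100]:  # the paper sweeps 10 to 800; trimmed here
    model = Pipeline([
        ("rbm", BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                             batch_size=20, n_iter=10, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # cross-validated AUC of ROC, as in the figure
    results[n_hidden] = cross_val_score(model, X, y, cv=3,
                                        scoring="roc_auc").mean()
```

Plotting `results` against the hidden-unit counts reproduces the shape of the capacity curve: rising AUC with capacity, then degradation once the network outgrows the sample size.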


[Figure: AUC of ROC and AUC of PR as a function of the number of hidden units (10, 50, 100, 400, 600, 800).]

Fig. 12: The performance of models with different numbers of hidden units.

Despite the model parameters changing between updates, these changes should be small enough that only a few Gibbs steps (in practice, often one step) are required to maintain samples from the equilibrium distribution of the Gibbs chain, i.e., the model distribution. The learning rate used to update the weights is fixed at 0.05, a value chosen on the validation set. The number of iterations is set to 10 for efficiency, since we observed that the model cost reaches a relatively stable state within 5 to 10 iterations. We use mini-batch gradient updates for the parameters, with a batch size of 20. After the model is trained, we simply feed it the whole dataset to obtain the new integrated RFs and then run the same classification module to get the results.
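The training loop with the hyper-parameters quoted above (one-step Contrastive Divergence, learning rate 0.05, 10 iterations, mini-batches of 20) can be sketched as follows; this is a minimal single-layer sketch, not the full CDM model:

```python
import numpy as np

def train_rbm_cd1(X, n_hidden, lr=0.05, n_epochs=10, batch_size=20, seed=0):
    """Train a binary RBM with one-step Contrastive Divergence (CD-1)
    and mini-batch gradient updates."""
    rng = np.random.default_rng(seed)
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)   # visible biases
    b_h = np.zeros(n_hidden)    # hidden biases
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(n_epochs):
        for start in range(0, len(X), batch_size):
            v0 = X[start:start + batch_size]
            # positive phase: hidden activations driven by the data
            p_h0 = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
            # negative phase: one Gibbs step back to visible, then hidden
            p_v1 = sigmoid(h0 @ W.T + b_v)
            p_h1 = sigmoid(p_v1 @ W + b_h)
            # CD-1 gradient: data statistics minus reconstruction statistics
            n = len(v0)
            W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
            b_v += lr * (v0 - p_v1).mean(axis=0)
            b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```

After training, feeding the whole dataset through `sigmoid(X @ W + b_h)` yields the integrated-RF representation passed to the downstream classification module.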

6 CONCLUSIONS

We developed a multi-tasking framework for osteoporosis that not only extracts integrated features for progressive bone loss and bone fracture prediction but also selects the individual informative RFs that are valuable to both patients and medical researchers. Our framework finds a representation of RFs such that salient integrated features can be disentangled from ill-organized EHR data. These integrated features constructed from the original RFs become the most effective features for bone disease prediction. We developed the disease memory (DM), which categorizes and stores the underlying characteristics of a specific cohort. In essence, we trained an independent model on each specific group of patients. For example, the comprehensive disease memory (CDM) captures the characteristics of all patients to predict the disease. The bone disease memory (BDM) memorizes the characteristics of individuals who suffer from bone diseases. Similarly, the non-disease memory (NDM) memorizes the attributes of non-diseased individuals. The variety of DM models increases the flexibility of monitoring the disease for different groups of patients. Our extensive experimental results showed that the proposed method improves prediction performance and has great potential for selecting the informative RFs for bone diseases. As a long-term impact, a bone disease analytic system will ultimately be deployed

in bone disease monitoring and prevention settings, which will offer much greater flexibility in tailoring the scheduling, intensity, duration and cost of the rehabilitation regimen.

Hui Li received her Master degree

in the Department of Computer Science and Engineering at the State University of New York at Buffalo, New York, US in 2011. She is currently pursuing a PhD degree at the Robin Li Data Mining & Machine Learning Laboratory under the supervision of Prof. Aidong Zhang. Her research interests are in the areas of data mining, machine learning and health informatics. Her current work focuses on 3D bone microstructure modeling and risk factor

analysis for bone disease prediction.

Xiaoyi Li received the Master degree in computer science from the State University of New York at Buffalo, New York, US in 2011. He is currently pursuing a PhD degree at the Robin Li Data Mining & Machine Learning Laboratory led by Prof. Aidong Zhang. His primary interests lie in data mining, machine learning and deep learning. His current work focuses on learning representations from data with multiple views – how to maximize the fusion of different views to

better represent an object.

Dr. Murali Ramanathan is Professor of Pharmaceutical Sciences and Neurology at the State University of New York at Buffalo, NY. Dr. Ramanathan received his Ph.D. in Bioengineering from the University of California, San Francisco, CA in 1994. He received a B.Tech. (Honors) in Chemical Engineering from the Indian Institute of Technology, Kharagpur, India in 1983 and an M.S. in Chemical Engineering from Iowa State University, Ames, IA in 1989. Dr. Ramanathan's research interests are in the area of treatment of multiple sclerosis (MS), an inflammatory-demyelinating disease of the central nervous system that affects over 1 million patients worldwide. MS is a complex, variable disease that causes physical and cognitive disability, and nearly 50% of patients diagnosed with MS are unable to walk after 15 years. The etiology and pathogenesis of MS remain poorly understood. The focus of the research is to identify the molecular mechanisms by which the autoimmunity of MS is translated into neurological damage in the CNS. A second area of research emphasis in the laboratory is pharmacogenomic modeling. The large-scale genome-wide association studies in MS have had limited success and have explained only a small proportion of the risk of developing MS. These data have lent further support to the possible importance of environmental factors, gene-gene interactions and gene-environment interactions in MS. Dr. Ramanathan's pharmacogenomic modeling research has focused on identifying key interactions between genetic and environmental factors in disease progression in MS. His group has developed novel information theory-based algorithms for gene-environment interaction analysis.

Dr. Aidong Zhang is UB Distinguished Professor and Chair of the Department of Computer Science and Engineering at the State University of New York at Buffalo. Her research interests include bioinformatics, data mining, multimedia and database systems, and content-based image retrieval. She is an author of over 250 research publications in these areas. She has chaired or served on over 100 program committees of international conferences and workshops,

and currently serves on several journal editorial boards. She has published two books, Protein Interaction Networks: Computational Analysis (Cambridge University Press, 2009) and Advanced Analysis of Gene Expression Microarray Data (World Scientific Publishing Co., Inc., 2006). Dr. Zhang is a recipient of the National Science Foundation CAREER award and the State University of New York (SUNY) Chancellor's Research Recognition award. Dr. Zhang is an IEEE Fellow.