NeuroImage xxx (2013) xxx–xxx
Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study
Rashmi Dubey a,b, Jiayu Zhou a,b, Yalin Wang a, Paul M. Thompson c, Jieping Ye a,b,⁎,
For the Alzheimer's Disease Neuroimaging Initiative 1
a School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA
b Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA
c Imaging Genetics Center, Laboratory of Neuro Imaging, UCLA School of Medicine, Los Angeles, CA, USA
⁎ Corresponding author at: Department of Computer Science and Engineering, Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, 699 S. Mill Ave, Tempe, AZ 85287, USA.
E-mail address: [email protected] (J. Ye).
1 The data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
1053-8119/$ – see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.neuroimage.2013.10.005
Please cite this article as: Dubey, R., et al., Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study, NeuroImage (2013), http://dx.doi.org/10.1016/j.neuroimage.2013.10.005
Article info
Article history: Accepted 7 October 2013. Available online xxxx.
Keywords: Alzheimer's disease; Classification; Imbalanced data; Undersampling; Oversampling; Feature selection

Abstract
Many neuroimaging applications deal with imbalanced imaging data. For example, in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly two times the Alzheimer's disease (AD) patients for the structural magnetic resonance imaging (MRI) modality and six times the control cases for the proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and a combination of over- and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two different classifiers, Random Forest and Support Vector Machines, based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity measures. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids-based undersampling gives the best overall performance among the different data sampling techniques and the no-sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among the various feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results.
© 2013 Elsevier Inc. All rights reserved.
Introduction
Alzheimer's disease (AD) is the most frequent form of dementia in elderly patients; it is a neurodegenerative disease which causes irreversible damage to motor neurons and their connectivity, resulting in cognitive failure and several other behavioral disorders which severely impact the day-to-day functioning of patients (Alzheimer's Association, 2012). As the population is aging, it is projected that by the year 2050 there will be 13.5 million clinical AD individuals, accounting for a total care cost of $1.1 trillion (Alzheimer's Association, 2012). It
is estimated that by the time the typical patient is diagnosed with AD, the disease has been progressing for nearly a decade. Preclinical AD patients may not show debilitating AD symptoms, but the toxic changes in the brain and blood proteins have been developing since the inception of the disease (Bartzokis, 2004; Vlkolinskỳ et al., 2001). Early diagnosis of AD is critical to prevent or delay the progression of the disease. Future treatments could then target the disease in its earliest stages, before irreversible brain damage or mental decline has occurred.
There are many studies which aim to capture the elusive biomarkers of AD for preclinical AD research (Sperling et al., 2011). Several genetic, imaging and biochemical markers are being studied to monitor progression of AD and explore treatment and detection options (Frisoni et al., 2010; Jack et al., 2008; Mueller et al., 2005; Reiman and Jagust, 2011; Shaw et al., 2009). For example, a genetic risk factor, the Apolipoprotein E (APOE) gene, has been shown to be associated with the late onset of AD. The APOE gene comes in different forms or alleles; people with an APOE ε-4 allele have a 20% to 90% higher risk of developing Alzheimer's disease than those who do not have an APOE ε-4 (Corder et al., 1993; Mayeux et al., 1998). Magnetic resonance imaging (MRI) and fluorodeoxyglucose positron emission tomography (FDG-PET)
scans are powerful neuroimaging modalities which have been shown by various cross-sectional and longitudinal studies to have the highest diagnostic and prognostic power in identifying preclinical and clinical AD patients from control cases (Devanand et al., 2007; Dickerson et al., 2001). MRI is a medical imaging technique utilizing a magnetic field to produce very clear 3-dimensional images, enabling detailed study of structural and functional changes in the body. MRI has become an essential tool in AD research due to its non-invasive nature, widespread availability, and great potential in predicting disease progression. Since the brain controls most functions of the body, it is hypothesized that any changes in the brain are reflected in the proteins produced. Proteomics, the study of proteins found in blood, is gaining momentum as an AD modality due to its cost effectiveness, ease of availability, and ability to detect probable/positive AD cases in simplistic initial screenings which could be followed up by other advanced clinical modalities (O'Bryant et al., 2011; Ray et al., 2007).
The Alzheimer's Disease Neuroimaging Initiative (ADNI), a multi-pronged, longitudinal study started as a 5-year project, is a collaborative effort by multiple research groups from both the public and private sectors, including the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), 13 pharmaceutical companies, and 2 foundations that provided support through the Foundation for the National Institutes of Health (NIH). It was launched in 2003 as a $60 million, 5-year public–private partnership to help identify the combination of biomarkers with the highest diagnostic and prognostic power. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials. This initiative has helped develop optimized methods and uniform standards for acquiring biomarker data, which includes MRI, PET, proteomics and genetics data on patients with AD, mild cognitive impairment (MCI) and healthy controls (NC), and creating an accessible data repository for the scientific community (Mueller et al., 2005). The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California – San Francisco.

One of the key challenges in designing good prediction models on ADNI data lies in the class imbalance problem. A dataset is said to be imbalanced if there are significantly more data points of one class and fewer occurrences of the other class. For example, the number of control cases in the ADNI dataset is half the number of AD cases for the proteomics measurement, whereas for the MRI modality there are 40% more control cases than AD cases. Data imbalance is also ubiquitous in worldwide ADNI-type initiatives from Europe, Japan and Australia, etc. (Weiner et al., 2012). In addition, much medical research involves dealing with rare but important medical conditions/events or subject dropouts in longitudinal studies (Bernal-Rusiel et al., 2012; Duchesnay et al., 2011; Fitzmaurice et al., 2011; Jiang et al., 2011; Johnstone et al., 2012). It is commonly agreed that imbalanced datasets adversely impact the performance of classifiers, as the learned model is biased towards the majority class to minimize the overall error rate (Estabrooks, 2000; Japkowicz, 2000a). For example, in Cuingnet et al. (2011), due to the imbalance in the number of subjects in the NC and MCIc (MCI converter) groups, they achieved a much lower sensitivity than specificity. Similarly, in our prior work (Yuan et al., 2012), due to the imbalance in the number of subjects in the NC, MCI and AD groups, we obtained imbalanced sensitivity and specificity on AD/MCI and MCI/NC classification experiments. Recently, Johnstone et al. (2012) studied pre-clinical AD prediction using proteomics features in the ADNI dataset. They experimented with imbalanced and balanced datasets and observed
that the sensitivity and specificity gap significantly reduces when the training set is balanced.
In the machine learning field, many approaches have been developed in the past to deal with imbalanced data (Chan and Stolfo, 1998; Chawla et al., 2003, 2004; Ertekin et al., 2007; He and Garcia, 2009; Japkowicz and Stephen, 2002; Jo and Japkowicz, 2004; Kołcz et al., 2003; Lee et al., 2004; Liu et al., 2009c; Maloof, 2003; Provost, 2000; Van Hulse et al., 2007; Visa and Ralescu, 2005; Yang and Wu, 2006). They can be broadly classified as internal or algorithmic level and external or data level. The algorithmic level approaches involve either designing new classification algorithms or modifying existing ones to handle the bias introduced due to the class imbalance. Many researchers studied the class imbalance problem in relation to the cost-sensitive learning problem, wherein the penalty of misclassification is different for different class instances, and proposed solutions to the class imbalance problem by increasing the misclassification cost of the minority class and/or by adjusting the estimate at leaf nodes in the case of decision trees such as Random Forest (RF) (Bradford et al., 1998; Chen et al., 2004; Elkan, 2001; Knoll et al., 1994; Pazzani et al., 1994). Akbani et al. proposed an algorithm for learning from imbalanced data in the case of Support Vector Machines (SVM) by updating the kernel function (Akbani et al., 2004). Recognition-based (one-class) learning was identified as a better solution for certain imbalanced datasets than two-class learning approaches (Japkowicz, 2001). The external or data level solutions include different types of data resampling techniques such as undersampling and oversampling. Random resampling techniques randomly select data points to be replicated (oversampling with or without replacement) or removed (undersampling). These approaches incur the cost of overfitting or losing important information, respectively. Directed or focused sampling techniques select specific data points to replicate or remove.
Japkowicz proposed to resample minority class instances lying close to the class boundary (Japkowicz, 2000b), whereas Kubat and Matwin (1997) proposed resampling the majority class such that borderline and noisy data points are eliminated from the selection. Yen and Lee (2006) proposed cluster-based undersampling approaches for selecting representative data as training data to improve the classification accuracy. Liu et al. (2009c) developed two ensemble learning systems to overcome the deficiency of information loss introduced in the traditional random undersampling method. Chawla et al. (2002) designed a sophisticated algorithm based on nearest neighbors to generate synthetic data for oversampling (SMOTE), combined it with undersampling approaches, and achieved significant improvements over random sampling techniques. Padmaja et al. (2008) proposed an algorithm, called Majority filter-based minority prediction (MFMP), and achieved better performance than random resampling approaches. Estabrooks et al. (2004) dealt with the rate of resampling required and proposed a combination scheme heavily biased towards the under-represented class to mitigate the classifier's bias towards the majority class. Joshi et al. (2001) combined results from several weak classifiers and concluded that boosting algorithms such as RareBoost and AdaBoost effectively handle rare cases. Zheng and Srihari (2003) proposed a novel feature-level solution based on selecting and optimally combining positive and negative features. This approach was specifically devised to solve the imbalanced data problem in text categorization.
Apart from the internal and external solutions, evaluation of the classifier for imbalanced datasets has always remained a big challenge (Elkan, 2003). Provost and Fawcett (2001) proposed the ROC convex hull method for estimating classifier performance. Ling and Li (1998) used lift analysis, a customized version of the ROC curve, as the performance measure for a marketing analysis problem. Kubat and Matwin (1997) used the geometric mean to assess classifier performance. The internal approaches are quite effective; for example, Zadrozny et al. (2003) proposed a cost-sensitive ensemble classifier, Costing, which yielded better results than random sampling methods. However, the greatest disadvantage of internal level solutions is that they are very specific to the classification algorithm. On the other
Table 1
Summary of ADNI data used in the study.

ADNI baseline data details    Proteomics    MRI
Feature count                 147           305
Control cases (NC)            58            191
MCI stable cases              233           177
MCI converter cases           163           142
AD cases                      112           138
hand, the external or data level solutions are classifier independent, portable, and therefore more adaptable. In this work, we focus on developing and evaluating ensemble models based on data level methods.
While ubiquitous and important, imbalanced data analysis has not received enough attention in the neuroimaging field, at least for the ADNI dataset. This paper aims to fill this gap by studying an ensemble technique to tackle the class imbalance problem in the ADNI dataset. The resampling approaches that we studied include random undersampling and oversampling (He and Garcia, 2009; Jo and Japkowicz, 2004; Liu et al., 2009c; Van Hulse et al., 2007; Yen and Lee, 2006), SMOTE oversampling (Chawla et al., 2002), and K-Medoids based undersampling. We extended our study by varying rates of undersampling and oversampling independently, and a combination of different rates of oversampling and undersampling to generate balanced training sets. In AD research, it is crucial to determine a few significant biomarkers that can help develop therapeutic treatment. In this paper, we examine six state-of-the-art feature selection algorithms including Student's t-test, Relief-F, Gini Index, Information Gain, Chi-Square, and sparse logistic regression with stability selection. The classifiers studied are the decision tree based Random Forest (RF) classifier and the decision boundary based Support Vector Machine (SVM) classifier. The classification evaluation criterion is a combination of test accuracy, AUC, sensitivity, and specificity. As an illustration, we study clinical group (diagnostic) classification problems using the ADNI baseline MR imaging and proteomics data. The multitude of experiments conducted corroborated the efficacy of the ensemble system, which includes an ensemble of multiple completely undersampled datasets (majority class is reduced to match the minority class count) using K-Medoids, together with feature selection based on sparse logistic regression and stability selection.
Subjects and methods
Subjects
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research, including approximately 200 cognitively normal older individuals, 400 people with MCI, and 200 people with early AD. For up-to-date information, see www.adni-info.org.
In our experiments, we used the baseline MRI and proteomics data as the input features because of their wide availability. The MRI image features in this study were based on the imaging data from the ADNI database processed by the UCSF team, who performed cortical reconstruction and volumetric segmentations with the FreeSurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/). The processed MRI features come from a total of 648 subjects (138 AD, 319 MCI and 191 NC), and can be grouped into 5 categories: average cortical thickness, standard deviation in cortical thickness, the volumes of cortical parcellations (based on regions of interest automatically segmented in the cortex), the volumes of specific white matter parcellations, and the total surface area of the cortex. There were 305 MRI features in total. Details of the analysis procedure are available at http://adni.loni.ucla.edu/research/mri-post-processing/. More details on ADNI MRI imaging instrumentation and procedures (Jack et al., 2008) may be found at the ADNI web site (http://adni.loni.ucla.edu). The proteomics data set (112 AD, 396 MCI, and 58 NC) was produced by the Biomarkers Consortium Project "Use of Targeted Multiplex Proteomic Strategies to Identify Plasma-Based Biomarkers in Alzheimer's Disease"2 (see URL in the footnote).
2 http://adni.loni.ucla.edu/wp-content/uploads/2010/11/BC_Plasma_Proteomics_Data_Primer.pdf
We use 147 measures from the proteomic data downloaded from the ADNI web site.
The subjects of interest in AD research are divided into three broad categories: control or normal cases (NC), mild cognitive impairment (MCI) cases, and AD cases. The MCI cases, based on their status when followed up over the course of a 4-year period, are further divided into MCI stable or non-converter cases (MCI NC), i.e., those MCI individuals who remain at MCI status, and MCI converter cases (MCI C), i.e., those MCI patients who subsequently progress to AD. The summary of the number of samples for the MRI and proteomics modalities which passed the quality control and were available for the current study, together with their baseline features and disease status, is listed in Table 1. The data imbalance problem is clearly shown in Table 1. For example, AD cases are nearly double the number of control cases for the proteomics modality.
We examined both negative and positive class imbalances depending upon the prediction task and the feature set used. In proteomics measurements, there are 58 control cases (treated as the negative class) versus 396 MCI cases (including both stable cases and converters; treated as the positive class). For the MRI modality, there are 191 control cases (treated as the negative class) and 138 AD cases (treated as the positive class). Disease prognosis is a critical task, as the penalty attached to incorrect prediction is more than monetary. AD studies are targeted to provide early treatment to probable AD cases and to prevent or delay AD progression in AD cases. Incorrectly predicting an AD case as normal will prevent the patient from getting the required (or timely) medical treatment, thereby reducing the patient's life expectancy. On the other hand, incorrect prediction of a control case as AD might cause distress to the patient and the family. Hence, it is challenging to determine the optimal costs for positive or negative class instances. Given the subtle and critical nature of the domain, in this study we thoroughly examined different data resampling approaches and propose a simple and versatile ensemble model approach to effectively handle the class imbalance situation in the ADNI dataset.
Ensemble framework
The ensemble system proposed in this study is a combination of a data resampling technique, a feature selection algorithm, and a binary prediction model. The proposed ensemble system belongs to the class of external approaches with algorithmic level solutions. As noted earlier, external approaches for class imbalance problems are easily adaptable and are independent of the feature selection or classification algorithms. Furthermore, based on the domain requirements, algorithmic level solutions can be integrated with the proposed model to generate a customized, sophisticated learning model. This demonstrates the simplicity and versatility of our ensemble system. Within the proposed ensemble system, we analyze four basic data sampling approaches in addition to the no-sampling approach, six feature selection algorithms, and two classification algorithms. The following are the data sampling approaches studied in this paper:
1. No sampling: All of the data points from the majority and minority training sets are used.
2. Random undersampling: All of the training data points from the minority class are used. Instances are randomly removed from the
majority training set till the desired balance is achieved. One disadvantage of this approach is that some useful information might be lost from the majority class due to the undersampling. This will be referred to as "Random US" in the following tables and figures.
3. Random oversampling: All data points from the majority and minority training sets are used. Additionally, instances are randomly picked, with replacement, from the minority training set till the desired balance is achieved. Adding the same minority samples might result in overfitting, thereby reducing the generalization ability of the classifier. This will be referred to as "Random OS" in the following tables and figures.
4. K-Medoids undersampling: This is based on an unsupervised clustering algorithm in which the cluster centers are actual data points. The majority training set is clustered such that the number of clusters equals the number of minority training examples. Since the initial cluster centers are chosen randomly, the process is repeated and the best result (the one with the minimum cost) is selected. The final training set is a combination of all data from the minority training set and the cluster centers from the majority training set. This approach is used only for undersampling; hence it will be referred to as "K-Medoids" for the rest of this paper.
5. SMOTE oversampling: SMOTE is the acronym for "Synthetic Minority Over-sampling Technique", which generates new synthetic data by randomly interpolating pairs of nearest neighbors. Details of the SMOTE algorithm can be found in the work by Chawla et al. (2002). This study used SMOTE to generate new synthetic data for the minority training set. The final training set is a combination of all data from the majority and minority training sets and, additionally, the new synthetic minority data, such that the final training set is balanced. In this paper we use SMOTE only for oversampling, and it will be referred to as "SMOTE" in the following figures and tables.
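The four resampling strategies above can be sketched as follows. This is an illustrative NumPy sketch rather than the code used in the study; `X_maj` and `X_min` are assumed names for the majority- and minority-class training matrices, and the K-Medoids routine is a simplified PAM-style alternation with random restarts.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, n_min):
    # Randomly drop majority points until the classes are balanced.
    idx = rng.choice(len(X_maj), size=n_min, replace=False)
    return X_maj[idx]

def random_oversample(X_min, n_maj):
    # Randomly replicate minority points (with replacement).
    idx = rng.choice(len(X_min), size=n_maj, replace=True)
    return X_min[idx]

def kmedoids_undersample(X_maj, n_min, restarts=10, iters=20):
    # Cluster the majority class into n_min clusters whose centers are
    # actual data points; keep the medoids of the lowest-cost restart.
    n = len(X_maj)
    D = np.linalg.norm(X_maj[:, None, :] - X_maj[None, :, :], axis=2)
    best, best_cost = None, np.inf
    for _ in range(restarts):
        medoids = rng.choice(n, size=n_min, replace=False)
        for _ in range(iters):
            assign = np.argmin(D[:, medoids], axis=1)
            new_medoids = medoids.copy()
            for c in range(n_min):
                members = np.flatnonzero(assign == c)
                if members.size:
                    # the point minimizing total distance within its cluster
                    new_medoids[c] = members[
                        np.argmin(D[np.ix_(members, members)].sum(axis=0))]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        cost = D[:, medoids].min(axis=1).sum()
        if cost < best_cost:
            best, best_cost = medoids, cost
    return X_maj[best]

def smote_oversample(X_min, n_new, k=5):
    # SMOTE: interpolate between a minority point and one of its
    # k nearest minority-class neighbors, at a random position.
    D = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        nn = np.argsort(D[i])[1:k + 1]  # skip the point itself
        j = rng.choice(nn)
        gap = rng.random()
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synth)
```

Combining the minority set with the output of `kmedoids_undersample(X_maj, len(X_min))` would then yield the balanced training set for the K-Medoids variant, and analogously for the other three.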
As noted earlier, an important goal of AD research is to identify key bio-signatures. The bio-signature discovery is done through feature selection, which is defined as the process of finding a subset of relevant features (biomarkers) to develop efficient and robust learning models. Feature selection is an active research topic in the machine learning field. Based on prior work involving analysis of feature selection algorithms for bio-signature discovery in ADNI data (Dubey, 2012), this work explored the following six top-performing state-of-the-art feature selection algorithms: (1) two-tailed Student's t-test3 (referred to as T-test); (2) Relief-F4, based on relevance of features using k-nearest neighbors; (3) Gini Index3, based on a measure of inequality in the frequency distribution values; (4) Information Gain3, which measures the reduction in uncertainty in predicting the class label; (5) Chi-Square3
test for independence, to determine whether the outcome is dependent on a feature; and (6) sparse logistic regression with stability selection (Meinshausen and Bühlmann, 2010) (referred to as SLR + SS) to select relevant features. A detailed description of the feature selection algorithms can be found in Appendix A.
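The mechanics of stability selection can be sketched as follows: resample the data many times, fit a sparse base selector on each draw, and keep the features whose selection frequency exceeds a threshold. In this illustrative sketch, a top-q correlation filter stands in for the sparse logistic regression fit used in the study, and all names and default values (other than the 1000 bootstrap runs reported later in the paper) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stability_selection(X, y, n_boot=1000, subsample=0.5, q=10, threshold=0.6):
    """Return indices of features selected in >= threshold of the runs.

    On each run, a fraction `subsample` of the samples is drawn and a
    sparse base selector is fit; here the base selector keeps the q
    features most correlated with the label (a stand-in for an
    L1-regularized logistic regression fit).
    """
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_boot):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        Xs, ys = X[idx], y[idx]
        # Base sparse selector: the q features with largest |correlation|.
        Xc = Xs - Xs.mean(axis=0)
        yc = ys - ys.mean()
        denom = Xc.std(axis=0) * ys.std() + 1e-12
        corr = np.abs((Xc * yc[:, None]).mean(axis=0) / denom)
        counts[np.argsort(corr)[-q:]] += 1
    return np.flatnonzero(counts / n_boot >= threshold)
```

The selection-frequency threshold, not a single fitted model, decides which biomarkers survive, which is what makes the procedure robust to sampling noise.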
In addition, two classifiers, Random Forest (RF) and Support Vector Machine (SVM), were used for classification using the top features selected. The framework for the ensemble system is illustrated in Fig. 1. A graphical illustration of the basic data resampling techniques discussed above is shown in Figs. 2 and 3. Intuitively, one of the advantages of undersampling over the oversampling approach is that it reduces the overall training data size, thereby saving memory and speeding up the classification
3 Matlab's ttest2 function was used.
4 We used the Feature Selection package in Zhao et al. (2011).
process. In many empirical studies, undersampling has outperformed oversampling (Drummond and Holte, 2003; Japkowicz, 2000a). In addition to these basic resampling approaches, different rates of resampling and combination resampling approaches were also explored in our study.
Detailed ensemble procedure
The mathematical formulation of the problem statement and the solution is defined as follows:
Set of feature selection algorithms:
F = {T-test, Relief-F, Gini Index, Information Gain, Chi-Square, SLR + SS}
Set of class-imbalance handling approaches:
S = {different types and rates of data resampling techniques}
Set of classification algorithms:
C = {Random Forest, Support Vector Machine}
An ensemble system is defined as follows:
E = {f, s, c}, where f ∈ F, s ∈ S, and c ∈ C
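Concretely, the ensemble configurations are the elements of the Cartesian product F × S × C. A quick sketch, using the five basic sampling approaches as S (the study additionally varied sampling rates, so its |S| is larger):

```python
from itertools import product

F = ["T-test", "Relief-F", "Gini Index", "Information Gain", "Chi-Square", "SLR+SS"]
S = ["No sampling", "Random US", "Random OS", "K-Medoids", "SMOTE"]
C = ["Random Forest", "SVM"]

# Every ensemble system E = {f, s, c} is one element of F x S x C.
systems = list(product(F, S, C))
print(len(systems))  # |F| * |S| * |C| = 6 * 5 * 2 = 60
```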
For any set X, |X| is defined as the cardinality of the set. Hence there were |F| × |S| × |C| ensemble systems studied in this paper for a given prediction task, as illustrated in Fig. 1. In this work, the experiments were designed such that we evaluated each ensemble system using k-fold cross validation. The training set in each cross fold was sampled multiple times to reduce the bias due to random dataset generation, thus producing multiple learning models. These models were combined using majority voting, where the final label of an instance is decided based on the majority votes received from all the models. In case of a tie, the probability of the estimation given by the models is taken into consideration. For example, if 30 models (using the same resampling technique on the training set) are trained to estimate the labels of a test set and 20 models assign a test data point to class 1 whereas the remaining 10 models assign it to class 2, then the final label of this particular test data point is taken as class 1. We also reported the averaged performance of all models and used it as the baseline for comparison.
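The majority-voting rule with its probability-based tie-break can be sketched as follows. This is an illustrative NumPy version; the function name and the 1/0 label encoding (standing in for class 1/class 2) are assumptions, not the study's code.

```python
import numpy as np

def ensemble_vote(prob_matrix, threshold=0.5):
    """Combine per-model positive-class probabilities by majority vote.

    prob_matrix has shape (n_models, n_samples). Each model votes positive
    when its probability exceeds `threshold`; ties (possible with an even
    model count) are broken by the mean predicted probability.
    """
    votes = (prob_matrix > threshold).astype(int)   # (n_models, n_samples)
    pos = votes.sum(axis=0)                         # positive votes per sample
    n_models = prob_matrix.shape[0]
    labels = np.where(pos > n_models - pos, 1, 0)
    tie = pos * 2 == n_models
    labels[tie] = (prob_matrix.mean(axis=0)[tie] > threshold).astype(int)
    return labels
```

With 30 models, 20 votes for the positive class beat 10, matching the example in the text; only a 15–15 split would fall back to the recorded probabilities.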
Experimental setup
The experiments conducted in this study were designed to maximally reduce the bias introduced due to randomness and to generate empirically comparable ensemble models. The pre-processed data was then divided into majority and minority sub-datasets. 10-fold cross validation was used such that each sub-dataset was partitioned into a fixed 9:1 train–test ratio. The train and test sets from the respective classes were combined to generate a training dataset and a testing dataset. Data resampling techniques were applied to the training dataset, whereas for a given prediction task the testing dataset was kept constant between different resampling techniques for a fair comparison. For example, for the task of discriminating control from AD cases, random undersampling and SMOTE oversampling techniques used the same test set for a given cross fold. This approach facilitates accurate comparison of the efficacy of different models (refer to Fig. 1). Each cross fold had multiple training sets for the various resampling techniques (except for the no-sampling approach, where each cross fold had just one dataset), wherein the test set remained the same and the training set varied based on the type of data resampling technique employed. In the case of K-Medoids undersampling, the process of choosing the cluster centers is repeated 10 times and the set of cluster centers
s for imbalanced data: An n= 648 ADNI study, NeuroImage (2013),
which gives the minimum cost is selected. The SMOTE oversampling algorithm can have many variations in the choice of the new data point (synthetic data) lying on the line segment joining two nearest neighbors. In this paper, we used the basic approach, which randomly chooses the synthetic data point on the line segment. The stability selection procedure used 1000 bootstrap runs to select the most prominent features. The classifiers with default settings were used for all experiments in this study. The predictions obtained from the ensemble model were compared with the clinical diagnosis to evaluate the efficacy of the model. The probability of the prediction, obtained from the classifier for each test instance, was recorded for later use. The efficacy of different ensemble systems was compared using various performance metrics including accuracy, sensitivity, specificity,
Fig. 1. Illustrating the proposed ensemble system for imbalanced data classification. In this proposed model, a training and a testing set are derived from the given data using data points from both majority and minority classes, as illustrated in the top rectangle (solid line) of the figure. Different data resampling techniques are applied to the training set to generate a "resampled training set", on which a feature selection algorithm is applied to select relevant features, resulting in a reduced-dimension training set. Subsequently, a classification algorithm is applied to generate a prediction model, which is tested on the test set to evaluate its efficacy. The steps shown in the double blue bordered rectangle are repeated for each feature selection algorithm and prediction model. The steps in the dotted black bordered rectangle are repeated for each data resampling technique.
and area under the ROC curve (AUC). These metrics are defined as follows:
Accuracy (%) = (TP + TN) / (TP + TN + FP + FN) × 100

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)
where TP refers to the number of samples correctly identified as positive (True Positive), FP refers to the number of samples incorrectly
s for imbalanced data: An n= 648 ADNI study, NeuroImage (2013),
Fig. 2. This example illustrates the class imbalance problem and the basic data resampling techniques used in the ADNI dataset for predicting MCI from Control cases on proteomics features (refer to Table 1). The bar labeled "Complete" represents the data available for analysis. The "Train" bar represents training data taken from both classes for different resampling approaches and the "Test" bar represents the test data. A dataset is formed by combining a training set and a test set (the test set is kept fixed between different sampling approaches, and it need not be balanced).
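The basic resampling operations compared in this study (random undersampling, random oversampling, and SMOTE's synthetic interpolation) can be sketched as follows. This is a minimal numpy illustration under our own naming; the nearest-neighbor search and class bookkeeping of a full SMOTE implementation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X_maj, n_min):
    # Randomly keep a minority-sized subset of the majority class (no replacement).
    idx = rng.choice(len(X_maj), size=n_min, replace=False)
    return X_maj[idx]

def random_oversample(X_min, n_maj):
    # Duplicate minority points (sampling with replacement) up to the majority count.
    idx = rng.choice(len(X_min), size=n_maj, replace=True)
    return X_min[idx]

def smote_synthetic(x, neighbor):
    # Basic SMOTE step: place a synthetic point at a random position on the
    # line segment joining a minority point and one of its nearest neighbors.
    gap = rng.uniform()
    return x + gap * (neighbor - x)
```

In the paper's setup these operations are applied only to the training split of each cross fold, so the fixed test set stays untouched.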
identified as positive (False Positive), TN refers to the number of samples correctly identified as negative (True Negative), and FN refers to the number of samples incorrectly identified as negative (False Negative).
Fig. 3. Illustrating three different sampling approaches used in an ensemble system for an experimental setup for predicting control cases (marked by blue, for training, and red, for testing, asterisk symbols) from AD cases (marked by orange, for training, and green, for testing, asterisk symbols) using the proteomics modality (refer to Table 1). Each class is divided into a training and test set in a ratio of 9:1. The X-axis represents 10 cross folds and the Y-axis represents samples. (a) Depicts the actual (no sampling) scenario, where the training data is unbalanced with respect to the two classes. (b) Depicts the undersampling scenario, where the training set is balanced by removing data points from the majority class, as shown by the sparse orange columns for each cross fold compared to the other two cases. (c) Depicts the oversampling scenario, where the minority class is duplicated, as shown by the extra length of the blue columns for each cross fold. Note that only one dataset is shown for each cross fold, but 30 datasets were used in training, except in the no-sampling case.
Accuracy measures the percentage of correct classifications by the model. Sensitivity, also known as recall or the True Positive Rate (TPR), is the proportion of positive samples that are correctly identified as positive. Specificity is the proportion of negative samples that are correctly identified as negative; its complement, 1 − specificity, is the False Positive Rate (FPR). AUC is computed by averaging the trapezoidal approximations for the curve created by TPR and FPR. Multiple classification models were generated for every cross fold, each of which provides a prediction, positive or negative, for a given class instance. Accuracy, sensitivity, specificity, and AUC are computed by utilizing the majority labels as discussed in Section 2.3.
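The metrics above follow directly from the confusion counts; a small sketch (function names are ours), including the trapezoidal AUC approximation mentioned in the text:

```python
def confusion_metrics(tp, fp, tn, fn):
    # Metrics as defined in the text.
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # TPR / recall
    specificity = tn / (tn + fp)   # TNR; FPR = 1 - specificity
    return accuracy, sensitivity, specificity

def trapezoidal_auc(fpr, tpr):
    # Area under the ROC curve by the trapezoidal rule over (FPR, TPR)
    # points, assumed sorted by increasing FPR.
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

print(confusion_metrics(tp=40, fp=5, tn=45, fn=10))       # (85.0, 0.8, 0.9)
print(trapezoidal_auc([0.0, 0.5, 1.0], [0.0, 1.0, 1.0]))  # 0.75
```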
Results
This section provides the details of the comprehensive experiments performed and the results obtained to compare the efficacy of different ensemble systems. This study focused on the binary classification problems of identifying control, MCI, and AD cases from one another. Only the MRI and proteomics modalities were studied, as these are among the most easily available features in the AD domain. This section is divided into four subsections, where each subsection compares the proposed ensemble framework with traditional and/or sophisticated solutions for the class imbalance problem. In Section 3.1, feature selection algorithms and basic data resampling approaches (refer to Section 2.2) were compared for different prediction tasks and modalities. Some researchers examined the use of combination approaches where different resampling techniques were combined to achieve a balanced training set (Chawla et al., 2002). In Section 3.2 we studied such an approach and compared it with
s for imbalanced data: An n= 648 ADNI study, NeuroImage (2013),
Table 2
NC versus MCI prediction task using 147 proteomics features: Summary of data used in the train–test set in each cross fold for different data resampling techniques. MCI includes both MCI Convertor (163) and MCI Stable (233) subjects.

                    No sampling     K-Medoids/random US   SMOTE/random OS
Target    Sample #  Train   Test    Train   Test          Train   Test
NC (−)    58        52      6       52      6             351     6
MCI (+)   391       351     40      52      40            403     40
Total     449       403     46      104     46            754     46
Fig. 4. NC/MCI prediction task: Comparison of feature selection algorithms for different performance metrics, classifiers, and sampling approaches. The results were averaged across 10 cross folds for top 20 features. Panels: (a) Random US: averaged accuracy using RF; (b) SMOTE: majority voting accuracy using SVM; (c) Random OS: averaged AUC using SVM; (d) K-Medoids: majority voting AUC using RF; (e) SMOTE: averaged sensitivity using RF; (f) K-Medoids: majority voting specificity using SVM.
our proposed model. On the other hand, some researchers have questioned the need for a balanced training set and tested imbalanced training sets obtained by different rates of data sampling (Estabrooks et al., 2004); we examined the effect of the rate of resampling in Section 3.3. Finally, in Section 3.4 we compared the proposed approach with the multi-classifier multi-learner approach (Chan and Stolfo, 1998).
In the following tables and figures, "(−)" is used to represent the negative class, whereas "(+)" is used to represent the positive class. "RF Avg" and "SVM Avg" represent averaged performance measures,
and "RF MajVote" and "SVM MajVote" represent majority voting performance measures using the RF and SVM classifiers, respectively.
Comparing basic data resampling techniques
For the task of predicting NC from MCI cases using proteomics measurements, we used 5 basic data resampling techniques (refer to Section 2.2), and each approach used 6 feature selection algorithms and 2 different classifiers, thus generating 60 (= 5 × 6 × 2) ensemble systems. Each ensemble system used 10-fold cross-validation and 30 random datasets in each cross fold except the no-sampling approach, yielding 300 (= 10 × 30) classification models. The data distribution for no sampling, undersampling (random and K-Medoids), and
Fig. 5. NC/MCI majority voting classification performance comparison of the SVM classifier, averaged across 10 cross folds, using top 10 features from six feature selection algorithms for different data sampling approaches. Panels: (a) No Sampling; (b) Random Under Sampling; (c) K-Medoids Under Sampling; (d) Random Over Sampling; (e) SMOTE Over Sampling. Metrics shown: accuracy, sensitivity, specificity, AUC, and the sensitivity–specificity gap.
oversampling (random and SMOTE) techniques is summarized in Table 2. To evaluate the six feature selection algorithms, we compared the performance of the top features obtained from each of these algorithms. A few selected comparison graphs are illustrated in Fig. 4. All other data resampling techniques produced similar results (Dubey, 2012). As seen from this figure, each performance metric increases smoothly and stabilizes after selecting the top 10–12 features; hence the results reported in this study are for the top 10 features. A comparison of the 6 feature selection algorithms for the top 10 features using the SVM classifier (since SVM gave better classification measures than RF in most cases) is illustrated in Fig. 5. The absolute difference between sensitivity and specificity (referred to as the sensitivity–specificity gap) is displayed for each feature selection algorithm, which
Table 3
NC/MCI: Comparison of different sampling approaches using top 10 proteomics features, averaged across 10 cross folds, in terms of accuracy, sensitivity and specificity, and AUC. The best value in each column for each performance metric is underlined to compare different sampling approaches and the highest value in each row is highlighted in bold to compare feature selection algorithms and classifiers.

                              SLR + SS                                       T-test
Sampling type        RF Avg   RF MajVote  SVM Avg  SVM MajVote    RF Avg   RF MajVote  SVM Avg  SVM MajVote
Accuracy (%)
  None               90.152   90.152      93.261   93.261         90.620   90.620      89.717   89.717
  Random US          80.146   84.772      80.965   86.326         78.607   83.685      78.344   82.630
  K-Medoids          80.596   85.359      81.384   87.630         78.958   83.217      78.576   81.696
  Random OS          90.748   91.424      90.500   92.293         89.130   89.685      88.093   88.815
  SMOTE              89.971   89.902      90.761   91.054         87.816   88.348      88.517   89.652
Sensitivity
  None               0.9850   0.9850      0.9700   0.9700         0.9850   0.9850      0.9725   0.9725
  Random US          0.8017   0.8456      0.8083   0.8608         0.7864   0.8356      0.7818   0.8258
  K-Medoids          0.8062   0.8475      0.8127   0.8758         0.7910   0.8353      0.7869   0.8228
  Random OS          0.9815   0.9897      0.9531   0.9697         0.9663   0.9722      0.9273   0.9322
  SMOTE              0.9523   0.9522      0.9553   0.9572         0.9306   0.9342      0.9390   0.9492
Specificity
  None               0.3333   0.3333      0.6833   0.6833         0.3750   0.3750      0.3833   0.3833
  Random US          0.8033   0.8667      0.8236   0.8833         0.7872   0.8500      0.7983   0.8333
  K-Medoids          0.8081   0.9000      0.8258   0.8833         0.7828   0.8167      0.7825   0.7833
  Random OS          0.3992   0.4000      0.5717   0.6000         0.3775   0.3833      0.5600   0.5833
  SMOTE              0.5394   0.5333      0.5881   0.6000         0.5250   0.5417      0.5231   0.5417
AUC
  None               0.3279   0.7997      0.6621   0.9392         0.3688   0.8107      0.3671   0.7924
  Random US          0.7981   0.9138      0.8114   0.9267         0.7816   0.9108      0.7839   0.8989
  K-Medoids          0.8007   0.9335      0.8129   0.9319         0.7822   0.9018      0.7808   0.8731
  Random OS          0.6629   0.7600      0.7465   0.8317         0.6414   0.7438      0.7253   0.8071
  SMOTE              0.7376   0.8360      0.7663   0.8788         0.7213   0.8313      0.7206   0.8400
illustrates the classifier's effectiveness in handling the class imbalance. A smaller gap between sensitivity and specificity is desirable. Clearly, SLR + SS outperformed the other feature selection algorithms in all experiments; the overall performance of T-test and GiniIndex was better than the remaining ones. Since the T-test is very popular in the neuroimaging domain, this work reports its performance along with SLR + SS for all following experiments. The results are summarized in Table 3. Note that for the sake of brevity, we only report the most significant and illustrative results here.
From Fig. 5 and Table 3, undersampling approaches, specifically K-Medoids, obtained better classification performance for imbalanced ADNI data. SLR + SS performed better with K-Medoids than with random undersampling, whereas the other feature selection algorithms showed similar or slightly better performance for random undersampling. These results corroborate the efficacy of the ensemble system composed of the SLR + SS feature selection algorithm, the K-Medoids data resampling method, and the SVM classifier. Also, the majority voting results were better than the respective averaged performance measures.
Similar observations were made for the NC/MCI prediction task using MRI features. The summary of the datasets used is provided in Table 4 and the classification results are given in Table 5. Tables 6 and 7 present the data distribution and prediction performance, respectively, of the classical NC/AD prediction task using proteomics features. The data and the performance measures of the NC/AD task using MRI features are summarized in Tables 8 and 9, respectively. In this case, we encountered a negative-class majority. The task of predicting NC from MCI Converters & AD cases experiences a significant class-imbalance situation. Tables 10 and 11 summarize the data details and performance measures for this task using proteomics features. The MRI counterparts of this task are given in Tables 12 and 13. From these six classification tasks, we conclude that the K-Medoids undersampling approach dominated the overall efficacy of the ensemble system more than any other factor.
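The K-Medoids undersampling step, restarted several times with the lowest-cost set of cluster centers retained (as described in the experimental setup), can be sketched as follows. This is a minimal alternating-refinement variant, not full PAM, and all names are our own; the paper only specifies 10 restarts and selection of the minimum-cost medoid set.

```python
import numpy as np

def kmedoids_undersample(X_maj, k, restarts=10, iters=20, seed=0):
    """Undersample the majority class down to its k medoids."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all majority-class points.
    D = np.linalg.norm(X_maj[:, None, :] - X_maj[None, :, :], axis=2)
    best_cost, best_medoids = np.inf, None
    for _ in range(restarts):
        medoids = rng.choice(len(X_maj), size=k, replace=False)
        for _ in range(iters):
            assign = np.argmin(D[:, medoids], axis=1)
            new_medoids = medoids.copy()
            for j in range(k):
                members = np.where(assign == j)[0]
                if len(members):
                    # Medoid = the member minimizing total distance to its cluster.
                    within = D[np.ix_(members, members)].sum(axis=1)
                    new_medoids[j] = members[np.argmin(within)]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        # Cost = sum of each point's distance to its nearest medoid.
        nearest = medoids[np.argmin(D[:, medoids], axis=1)]
        cost = D[np.arange(len(X_maj)), nearest].sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return X_maj[best_medoids]
```

The returned medoids serve as the balanced majority-class training sample, sized to match the minority class.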
Comparison with a combination scheme
Chawla et al. (2002) proposed a combination scheme by mixing different rates of oversampling (using SMOTE) and random undersampling
to reverse the initial bias of the learner towards the majority class in favor of the minority class. The training set was not always balanced with respect to the two classes; the approach forced the learner to experience varying degrees of undersampling such that at some higher degree of undersampling the minority class had a larger presence in the training set. We examined their combination scheme for the NC/MCI prediction task using proteomics data. The training set was resampled (undersampled/oversampled) at 0%, 10%, 20%, …, 100%. 0% resampling is equivalent to "No Sampling" and 100% resampling is known as complete or full sampling. Hence, in 100% undersampling, the majority class is reduced to match the minority class count, and 100% oversampling increases the minority samples in the training set to match the majority class count. The computation of the resampling rate is a slightly modified version of the resampling rate calculation proposed by Estabrooks et al. (2004). Mathematically, the gap between the majority and minority counts is divided by the desired number of resamplings and is referred to as diffCount in this study. We started resampling the data at 10%, in increments of 10%, until 100% resampling was achieved; hence the difference between the majority and minority counts was divided by 10. In the case of undersampling, the majority class count is reduced by a multiple of diffCount. Similarly, a multiple of diffCount is used to increment the minority count in the oversampling case. For example, if there are 52 negative samples and 356 positive samples available for training, and we are resampling at 10% as explained earlier, then diffCount = (356 − 52)/10 = 30.4. Therefore, 40% undersampling gives 234 (≈ 356 − 4 × 30.4) majority class samples and 30% oversampling gives 143 (≈ 52 + 3 × 30.4) minority samples in the training set. In our experimental setup, the training set was always balanced using different rates of K-Medoids undersampling and SMOTE oversampling.
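The diffCount arithmetic above can be reproduced directly; a small sketch (function names and the final rounding to whole samples are our own choices) using the worked numbers from the text, 52 negative and 356 positive training samples:

```python
def diff_count(n_maj, n_min, steps=10):
    # Gap between the class counts divided by the number of resampling steps.
    return (n_maj - n_min) / steps

def undersampled_majority(n_maj, n_min, pct):
    # pct% undersampling removes (pct/10) multiples of diffCount from the majority.
    return round(n_maj - (pct // 10) * diff_count(n_maj, n_min))

def oversampled_minority(n_maj, n_min, pct):
    # pct% oversampling adds (pct/10) multiples of diffCount to the minority.
    return round(n_min + (pct // 10) * diff_count(n_maj, n_min))

print(diff_count(356, 52))                 # 30.4
print(undersampled_majority(356, 52, 40))  # 234
print(oversampled_minority(356, 52, 30))   # 143
```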
Hence, if the majority class was 20% undersampled, then the minority class was 80% oversampled. The data used at different sampling rates is summarized in Table 14 and the data distribution is illustrated in Fig. 6. As before, 144 (= 6 × 12 × 2) ensemble systems were generated using six feature selection algorithms, 12 resampling techniques, and the RF and SVM classifiers. From the classification results, summarized in Table 15, it is evident that complete K-Medoids undersampling (referred to as S0_K100) performs better than the other resampling rates. Also, SLR + SS and SVM gave superior learning
Table 4
NC versus MCI prediction task using 305 MRI features: Summary of data used in the train–test set in each cross fold for different data resampling techniques. MCI includes both MCI Convertor (142) and MCI Stable (177) subjects.

                    No sampling     K-Medoids/random US   SMOTE/random OS
Target    Sample #  Train   Test    Train   Test          Train   Test
NL (−)    191       171     20      171     20            287     20
MCI (+)   319       287     32      171     32            287     32
Total     510       458     52      342     52            574     52
Table 6
NC versus AD prediction task using 147 proteomics features: Summary of data used in the train–test set in each cross fold for different data resampling techniques.

                    No sampling     K-Medoids/random US   SMOTE/random OS
Target    Sample #  Train   Test    Train   Test          Train   Test
NL (−)    58        52      6       52      6             100     6
AD (+)    112       100     12      52      12            100     12
Total     170       152     18      104     18            200     18
models, and majority voting was more effective than simple averaging. These results are compared in Fig. 7.
Comparing different rates of data resampling
Estabrooks et al. (2004) proposed a multiple resampling method to efficiently learn from imbalanced data. They experimented with independently varying rates of oversampling and undersampling. They generated 20 datasets, 10 each for oversampling and undersampling, by increasing the resampling rate in increments of 10% until 100% resampling was achieved. From the experiments conducted on various domains, they concluded that the optimal resampling rate depends upon the resampling strategy and varies from domain to domain. In this paper, we studied the effects of varying rates of oversampling and undersampling on the NC/MCI prediction task for proteomics features. The experimental setup consisted of 10 cross folds, each having 10 datasets and a 9:1 train–test ratio in each dataset. Only one of the two resampling approaches is utilized for a particular rate of resampling. Hence, the training set was not balanced except in the event of complete oversampling or undersampling. We used the diffCount measure, as explained in the previous experiments, to achieve varying rates of resampling and examined 20 resampling techniques. Tables 16, 17 and Fig. 8 summarize the data distribution used in this experiment. The results of the comparison of classification efficacy for independently varying
Table 5
NC/MCI: Comparison of different sampling approaches using top 10 MRI features, averaged across 10 cross folds, in terms of accuracy, sensitivity and specificity, and AUC. The best value in each column for each performance metric is underlined to compare different sampling approaches and the highest value in each row is highlighted in bold to compare feature selection algorithms and classifiers.

                              SLR + SS                                       T-test
Sampling type        RF Avg   RF MajVote  SVM Avg  SVM MajVote    RF Avg   RF MajVote  SVM Avg  SVM MajVote
Accuracy (%)
  None               67.436   67.436      67.720   67.720         66.044   66.044      67.482   67.482
  Random US          65.863   67.289      66.517   69.020         66.282   67.729      66.545   67.582
  K-Medoids          66.988   68.059      67.158   69.496         66.466   67.390      66.826   66.960
  Random OS          66.112   67.866      66.136   65.559         66.972   66.905      67.156   67.143
  SMOTE              66.001   65.128      64.871   65.321         66.357   67.051      66.808   67.674
Sensitivity
  None               0.7740   0.7740      0.7928   0.7928         0.7644   0.7644      0.8085   0.8085
  Random US          0.6312   0.6331      0.6419   0.6518         0.6269   0.6297      0.6129   0.6204
  K-Medoids          0.6398   0.6362      0.6396   0.6395         0.6203   0.6172      0.6117   0.6140
  Random OS          0.7173   0.7178      0.6838   0.6740         0.6996   0.6927      0.6388   0.6271
  SMOTE              0.7004   0.6925      0.6845   0.6893         0.6964   0.6987      0.6473   0.6549
Specificity
  None               0.5136   0.5136      0.4927   0.4927         0.4977   0.4977      0.4586   0.4586
  Random US          0.7134   0.7500      0.7134   0.7650         0.7323   0.7659      0.7643   0.7800
  K-Medoids          0.7292   0.7650      0.7343   0.7950         0.7495   0.7800      0.7739   0.7750
  Random OS          0.5671   0.6136      0.6273   0.6277         0.6236   0.6327      0.7278   0.7468
  SMOTE              0.6010   0.5918      0.5984   0.6059         0.6196   0.6359      0.7127   0.7250
AUC
  None               0.3953   0.6984      0.3876   0.7048         0.3744   0.6878      0.3678   0.6873
  Random US          0.6657   0.7438      0.6708   0.7615         0.6729   0.7459      0.6817   0.7514
  K-Medoids          0.6769   0.7494      0.6802   0.7715         0.6772   0.7486      0.6856   0.7490
  Random OS          0.6184   0.6853      0.6344   0.6738         0.6391   0.6810      0.6615   0.7103
  SMOTE              0.6435   0.7009      0.6339   0.7028         0.6506   0.7157      0.6727   0.7380
rates of under- and oversampling approaches are provided in Table 18. This dataset was dominated by positive class samples; hence high sensitivity and low specificity were expected. As noted earlier, the effectiveness of a classification model is inversely related to the sensitivity–specificity gap. We used this criterion and observed that in the ADNI dataset, the gap decreased with increasing levels of oversampling (SMOTE) until 40% SMOTE and then started increasing again, whereas the gap gradually decreased with increasing degrees of undersampling (K-Medoids), and the best results were achieved at 100% K-Medoids, with high sensitivity (0.89), good specificity (0.812), high AUC (0.97), and accuracy (88%). Only the complete K-Medoids undersampling approach increased the specificity by more than 51%. The results for the majority performance metrics are illustrated in Fig. 9.
Comparison with a multi-classifier learning approach
Chan and Stolfo (1998) proposed a multi-classifier meta-learning approach and concluded that the training class distribution affects the performance of the learned classifiers, and that the natural distribution can be different from the desired training distribution that maximizes performance. Their model ensured that none of the data points were discarded. They split the majority class into non-overlapping subsets such that each subset is roughly the size of the minority class. A classifier was trained on each of these subsets together with the minority training set.
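The subset construction in Chan and Stolfo's scheme can be sketched as follows; a minimal illustration in which the function name and the rounding rule for the number of subsets are our own assumptions.

```python
import numpy as np

def majority_partitions(X_maj, X_min, seed=0):
    """Split the majority class into non-overlapping, roughly minority-sized
    subsets. Each returned subset is meant to be paired with the full
    minority set to train one classifier of the ensemble."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_maj))
    # Number of subsets ~ majority/minority ratio, e.g. round(356/52) = 7.
    n_parts = max(1, round(len(X_maj) / len(X_min)))
    return [X_maj[p] for p in np.array_split(idx, n_parts)]
```

With the 356:52 split used below, this yields 7 non-overlapping majority subsets of 50–51 samples each, matching the roughly 1:7 minority–majority ratio described in the text.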
Table 7
NC/AD: Comparison of different sampling approaches using top 10 proteomics features, averaged across 10 cross folds, in terms of accuracy, sensitivity and specificity, and AUC. The best value in each column for each performance metric is underlined to compare different sampling approaches and the highest value in each row is highlighted in bold to compare feature selection algorithms and classifiers.

                              SLR + SS                                       T-test
Sampling type        RF Avg   RF MajVote  SVM Avg  SVM MajVote    RF Avg   RF MajVote  SVM Avg  SVM MajVote
Accuracy (%)
  None               80.694   80.694      84.861   84.861         81.806   81.806      83.056   83.056
  Random US          78.718   83.056      80.444   84.167         78.995   81.250      78.500   80.833
  K-Medoids          79.037   81.806      80.690   83.611         77.653   80.000      78.579   80.833
  Random OS          78.861   80.278      81.639   81.389         81.000   82.778      80.056   81.111
  SMOTE              79.366   81.944      80.532   80.278         80.125   81.944      79.236   79.583
Sensitivity
  None               0.9250   0.9250      0.9167   0.9167         0.9250   0.9250      0.8917   0.8917
  Random US          0.7861   0.8167      0.8003   0.8333         0.7889   0.8167      0.7803   0.8083
  K-Medoids          0.7950   0.8000      0.8147   0.8500         0.7833   0.8000      0.7869   0.8167
  Random OS          0.8708   0.8667      0.8767   0.8750         0.8883   0.9083      0.8633   0.8667
  SMOTE              0.8644   0.9000      0.8492   0.8417         0.8772   0.8833      0.8425   0.8500
Specificity
  None               0.6083   0.6083      0.7250   0.7250         0.6417   0.6417      0.7333   0.7333
  Random US          0.7964   0.8583      0.8144   0.8583         0.7983   0.8167      0.7961   0.8083
  K-Medoids          0.7811   0.8417      0.7908   0.8083         0.7667   0.8000      0.7847   0.7917
  Random OS          0.6317   0.6750      0.7008   0.6917         0.6583   0.6667      0.6800   0.7000
  SMOTE              0.6700   0.6833      0.7181   0.7250         0.6797   0.7167      0.7042   0.7000
AUC
  None               0.5569   0.8431      0.6611   0.9125         0.5903   0.8569      0.6528   0.8917
  Random US          0.7853   0.9056      0.8022   0.9194         0.7866   0.8896      0.7831   0.8847
  K-Medoids          0.7817   0.8938      0.7968   0.9056         0.7666   0.8715      0.7791   0.8819
  Random OS          0.7310   0.8535      0.7718   0.9035         0.7543   0.8778      0.7493   0.8778
  SMOTE              0.7590   0.8819      0.7782   0.8889         0.7733   0.8708      0.7677   0.8563
Later, these classifiers were stacked together to build a final ensemble classifier. In our study on ADNI data for the NC/MCI prediction task using the proteomics modality, we examined Chan and Stolfo's approach. We used 52 (−) minority training samples and 356 (+) majority training samples, which gives roughly a 1:7 minority–majority class ratio. We generated 7 datasets utilizing 7 non-overlapping subsets from the majority training set for a given minority training set. The data distribution is graphically depicted in Fig. 10. Three data resampling techniques were examined, namely random undersampling, K-Medoids, and Chan and Stolfo's approach. The 7 datasets created in each cross fold utilized the respective resampling approach, keeping the testing set fixed between all three techniques for a given cross fold. We used a simple combination scheme where the classifier performance from all 7 classification models for a cross fold was either averaged or taken as a majority vote. The results displayed here are averaged over all 10 cross folds and are summarized in Table 19 and Fig. 11. We can observe from these results that Chan and Stolfo's approach gave better accuracy but did not remove the bias towards the majority class, resulting in a comparatively poor AUC value and sensitivity–specificity gap. K-Medoids and random undersampling were able to bridge the gap between sensitivity and specificity with 88% accuracy and 0.93 AUC. This further demonstrates the effectiveness of our simple ensemble system for handling the imbalanced data.
Discussion
This paper makes two major contributions. First, we introduced a robust yet simple framework to address the imbalance problem in
Table 8
NC versus AD prediction task using 305 MRI features: Summary of data used in the train–test set in each cross fold for different data resampling techniques.

                    No sampling     K-Medoids/random US   SMOTE/random OS
Target    Sample #  Train   Test    Train   Test          Train   Test
NL (−)    191       171     20      124     20            171     20
AD (+)    138       124     14      124     14            171     14
Total     329       295     34      248     34            342     34
classification studies. Second, through a comprehensive set of experiments, we demonstrated the superiority of the K-Medoids undersampling approach over other basic data resampling techniques in the ADNI dataset. We used the approach of completely balancing the training set with respect to the two classes by utilizing only one type of data resampling technique. To the best of our knowledge, this is the first study to systematically investigate the data imbalance issue in the ADNI dataset. In this pilot work, we used the MRI and proteomics modalities in ADNI to assess whether one can still achieve reasonably balanced classification results on an imbalanced dataset. We also implemented several state-of-the-art imbalanced data processing methods, applied them to the ADNI dataset, and compared their performance with our proposed ensemble framework. Our findings may provide guidance for future experimental design and statistical integration on large-scale neuroimaging datasets. ADNI provides an ideal testbed for the developed algorithms and tools, as its data are diverse, complex, and universally available. Moreover, it is also becoming a model for other large data collection projects and clinical trials, so there will be a flood of data with similar complexities. We hope our work will increase interest in this ubiquitous and important problem, and other groups may consider using this approach to deal with imbalance in the training dataset when performing future classification studies on imbalanced datasets.
In this study, six feature selection algorithms and five basic data resampling techniques were compared for different prediction tasks and modalities. It was concluded that undersampling, in particular K-Medoids, yields better learning models than other resampling approaches. The "no sampling" approach gave the highest test accuracy, but the results were biased towards the majority class, as the classifiers tend to minimize the misclassification costs by classifying all samples into the majority class. This results in a huge gap between the sensitivity and specificity measures. Data resampling approaches performed better in the class imbalance scenario. Random oversampling tends to overfit the training data, as data points are duplicated, whereas random undersampling may lead to loss of vital information, as data points are randomly removed. The SMOTE and K-Medoids sampling methods use heuristics to select or eliminate data points; hence their performance was superior to that of the corresponding random resampling techniques. Undersampling
Table 9
NC/AD: Comparison of different sampling approaches using top 10 MRI features, averaged across 10 cross folds, in terms of accuracy, sensitivity and specificity, and AUC. The best value in each column for each performance metric is underlined to compare different sampling approaches and the highest value in each row is highlighted in bold to compare feature selection algorithms and classifiers.

                              SLR + SS                                       T-test
Sampling type        RF Avg   RF MajVote  SVM Avg  SVM MajVote    RF Avg   RF MajVote  SVM Avg  SVM MajVote
Accuracy (%)
  None               87.225   87.225      85.908   85.908         85.460   85.460      86.343   86.343
  Random US          85.930   85.460      85.312   87.225         84.999   84.872      85.287   85.908
  K-Medoids          86.054   86.637      85.935   87.379         85.278   85.614      84.864   84.885
  Random OS          86.573   86.650      86.107   87.097         86.265   86.061      86.691   86.650
  SMOTE              86.306   87.225      86.140   87.379         85.732   86.049      85.682   86.343
Sensitivity
  None               0.8262   0.8262      0.8119   0.8119         0.7905   0.7905      0.7976   0.7976
  Random US          0.8413   0.8262      0.8463   0.8476         0.8307   0.8262      0.8306   0.8333
  K-Medoids          0.8264   0.8262      0.8377   0.8333         0.8321   0.8405      0.8245   0.8262
  Random OS          0.8424   0.8536      0.8405   0.8452         0.8329   0.8321      0.8393   0.8393
  SMOTE              0.8148   0.8262      0.8216   0.8321         0.8273   0.8333      0.8243   0.8333
Specificity
  None               0.9059   0.9059      0.8918   0.8918         0.9009   0.9009      0.9109   0.9109
  Random US          0.8725   0.8759      0.8583   0.8909         0.8632   0.8659      0.8676   0.8768
  K-Medoids          0.8851   0.8959      0.8742   0.9018         0.8669   0.8668      0.8638   0.8627
  Random OS          0.8855   0.8768      0.8789   0.8918         0.8855   0.8818      0.8884   0.8868
  SMOTE              0.8988   0.9059      0.8913   0.9059         0.8787   0.8809      0.8806   0.8859
AUC
  None               0.9849   0.8761      0.9809   0.8616         0.9788   0.8486      0.9846   0.8564
  Random US          0.8606   0.8546      0.8562   0.8764         0.8511   0.8503      0.8533   0.8566
  K-Medoids          0.8606   0.8686      0.8608   0.8798         0.8533   0.8577      0.8490   0.8486
  Random OS          0.8752   0.8514      0.8703   0.8572         0.8717   0.8421      0.8761   0.8468
  SMOTE              0.8615   0.8743      0.8613   0.8778         0.8574   0.8625      0.8564   0.8593
12 R. Dubey et al. / NeuroImage xxx (2013) xxx–xxx
performed better than the oversampling approaches for all prediction tasks. This may be because, under undersampling, the data points selected for the training set accurately represented the original class distribution, and any bias introduced in selecting points from the majority class was minimized. Oversampling approaches, on the other hand, can disturb the within-class data distribution, either by overfitting or by generating synthetic data points that do not follow the original class distribution, as we have very little information about the minority class. Also, the majority voting results were shown to be better than the corresponding averaged performance measures, which demonstrates the effectiveness of performing multiple rounds of undersampling.
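The repeated-undersampling-plus-majority-voting scheme described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it uses random undersampling and a toy nearest-centroid classifier standing in for Random Forest or SVM, and all function names are ours.

```python
import numpy as np

def undersample(X, y, rng):
    """Randomly undersample the majority class down to the minority-class size."""
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    maj_idx = rng.choice(np.where(y == majority)[0], size=counts.min(), replace=False)
    idx = np.concatenate([np.where(y == minority)[0], maj_idx])
    return X[idx], y[idx]

def centroid_fit_predict(X_tr, y_tr, X_te):
    """Toy nearest-centroid classifier standing in for RF/SVM."""
    cents = {c: X_tr[y_tr == c].mean(axis=0) for c in np.unique(y_tr)}
    labels = np.array(sorted(cents))
    d = np.stack([np.linalg.norm(X_te - cents[c], axis=1) for c in labels])
    return labels[np.argmin(d, axis=0)]

def ensemble_predict(X_tr, y_tr, X_te, n_rounds=11, seed=0):
    """Majority vote over classifiers trained on repeatedly undersampled sets."""
    rng = np.random.default_rng(seed)
    votes = np.stack([
        centroid_fit_predict(*undersample(X_tr, y_tr, rng), X_te)
        for _ in range(n_rounds)
    ])
    # majority vote across the rounds, per test sample
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Because each round draws a different majority-class subset, the vote aggregates classifiers that have each seen a balanced view of the data, which is what narrows the sensitivity–specificity gap relative to a single classifier.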
To corroborate our findings, we extended our study to include several other data resampling approaches proposed by other researchers. The first experiment in this series compared our ensemble framework with the combination scheme proposed by Chawla et al. (2002) on the ADNI dataset. In our experimental setup, we ensured balanced training sets with varying degrees of undersampling (using K-Medoids) and oversampling (using SMOTE), as noted in Section 3.2. The results support our ensemble system: complete K-Medoids undersampling outperformed all other resampling approaches. These findings suggest that the complexity of the ADNI dataset makes it difficult to generate synthetic data points that fit the natural class distribution well. Undersampling, on the other hand, selects the data points from the original
Table 10
NC versus MCIC & AD prediction task using 147 proteomics features. Summary of data used in the train–test set in each cross fold for different data resampling techniques. MCIC & AD includes both MCI converter (163) and AD (112) subjects.

                          No sampling      K-Medoids/Random US   SMOTE/Random OS
Target          Sample #  Train    Test    Train    Test         Train    Test
NL (−)          58        52       6       52       6            247      6
MCIC & AD (+)   275       247      28      52       28           247      28
Total           333       299      34      104      34           494      34
Please cite this article as: Dubey, R., et al., Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study, NeuroImage (2013), http://dx.doi.org/10.1016/j.neuroimage.2013.10.005
class distribution and hence introduces less distortion; most of the remaining selection bias is mitigated by the repeated application of K-Medoids.
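As an illustration of how medoid-based selection keeps the retained majority points spread over the original class distribution, the following sketch selects k representative majority-class points with a simple alternating K-Medoids. This is a lightweight stand-in assuming Euclidean distance and greedy farthest-point seeding, not the exact PAM variant used in the paper; the function name is ours.

```python
import numpy as np

def kmedoids_undersample(X_maj, k, n_iter=20):
    """Select k representative majority-class points: seed medoids greedily,
    assign every point to its nearest medoid, then move each medoid to the
    cluster member minimising total within-cluster distance. Returns the
    indices of the chosen medoids."""
    d = np.linalg.norm(X_maj[:, None, :] - X_maj[None, :, :], axis=-1)
    # farthest-point seeding: start from point 0, then repeatedly add the
    # point farthest from the current medoid set
    medoids = [0]
    for _ in range(k - 1):
        medoids.append(int(d[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        assign = np.argmin(d[:, medoids], axis=1)  # nearest-medoid labels
        new = []
        for c in range(k):
            members = np.where(assign == c)[0]
            # cluster member with the smallest total distance to the rest
            new.append(members[d[np.ix_(members, members)].sum(axis=1).argmin()])
        new = np.array(new)
        if set(new) == set(medoids):               # converged
            break
        medoids = new
    return np.sort(medoids)
```

Training on `X_maj[kmedoids_undersample(X_maj, n_minority)]` plus the full minority class yields a balanced set whose majority samples are cluster representatives rather than a random draw.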
When analyzing different rates of data resampling, where the training data need not be balanced, we made the same observation: the ensemble system using complete K-Medoids undersampling performed best (Section 3.3). Performance decreased as the amount of SMOTE oversampling increased, which again indicates the failure of synthetic data generation techniques on ADNI. Increasing the percentage of K-Medoids undersampling not only reduces the gap between sensitivity and specificity, but also reduces the class bias due to the majority class, which is a desirable property. We further compared our approach with the multi-classifier meta-learning approach proposed by Chan and Stolfo (1998). Their approach splits the majority class into non-overlapping subsets such that each subset is roughly the size of the minority class, unlike random undersampling and our K-Medoids undersampling. Our experiments on ADNI data showed that both random and K-Medoids undersampling outperformed Chan's approach.
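For reference, the synthetic generation step whose rate is varied above can be sketched as the basic SMOTE interpolation of Chawla et al. (2002): each synthetic point lies on the line segment between a minority sample and one of its k nearest minority-class neighbours. This is a minimal numpy sketch, not the exact implementation used in our experiments; the function name is ours.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    selected point toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class; exclude self-matches
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(X_min) - 1)
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per point
    base = rng.integers(0, len(X_min), n_new)   # minority points to grow from
    nbr = nn[base, rng.integers(0, k, n_new)]   # one random neighbour of each
    gap = rng.random((n_new, 1))                # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

Because every synthetic point is a convex combination of two observed minority samples, SMOTE can only fill the space between existing points; with as little minority-class information as in ADNI, the generated points need not follow the true class distribution, which is consistent with the degradation we observed at high SMOTE rates.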
Comparison with pioneering disease diagnosis research in ADNI. We compared our ensemble system's performance with some of the earlier work on the ADNI dataset. As noted earlier, MRI features are very popular among researchers owing to their widespread availability and high discriminative power (Devanand et al., 2007; Dickerson et al., 2001). Seminal research by Ray et al. (2007) laid the ground for blood-based proteins as biomarkers for early AD diagnosis (Gomez Ravetti and Moscato, 2008; Johnstone et al., 2012; O'Bryant et al., 2011).
Early identification of potential AD cases, before any symptoms of cognitive decline are visible, has been examined in several studies. Ray and colleagues used molecular tests to identify 18 signaling proteins that could discriminate between control and AD cases with nearly 90% accuracy (Ray et al., 2007). Gomez Ravetti and Moscato (2008) identified a 5-protein signature from Ray et al.'s 18-protein set which achieved 96% accuracy in discriminating non-demented from AD cases. Johnstone et al. (2012) identified an 11-protein signature on the ADNI dataset using a multivariate approach based on combinatorial optimization ((α, β)-k Feature Set Selection). They achieved 86% sensitivity and 65% specificity when assessed on the full set of
Table 11
NC/MCIC & AD: Comparison of different sampling approaches using the top 10 proteomics features, averaged across 10 cross folds, in terms of accuracy, sensitivity, specificity, and AUC. For each feature selection algorithm (SLR + SS, then T-test), columns give the RF Avg, RF MajVote, SVM Avg, and SVM MajVote results. (In the typeset table, the best value in each column for each performance metric is underlined to compare sampling approaches, and the highest value in each row is in bold to compare feature selection algorithms and classifiers.)

                     SLR + SS                                    T-test
Sampling type        RF Avg   RF MajVote  SVM Avg  SVM MajVote  RF Avg   RF MajVote  SVM Avg  SVM MajVote

Accuracy (%)
None                 88.224   88.224      89.771   89.771       88.301   88.301      87.865   87.865
Random US            78.350   83.965      78.782   83.889       77.395   80.795      78.112   79.837
K-Medoids            78.499   83.671      79.038   84.771       77.648   81.166      78.474   83.224
Random OS            84.771   86.536      85.436   86.024       84.962   86.242      85.565   87.418
SMOTE                85.629   86.536      86.822   86.830       86.355   86.318      86.118   86.536

Sensitivity
None                 0.9857   0.9857      0.9707   0.9707       0.9857   0.9857      0.9679   0.9679
Random US            0.7822   0.8270      0.7870   0.8370       0.7684   0.8033      0.7774   0.7953
K-Medoids            0.7834   0.8306      0.7924   0.8512       0.7767   0.8076      0.7870   0.8326
Random OS            0.9162   0.9227      0.9000   0.9020       0.9255   0.9370      0.8841   0.8941
SMOTE                0.9234   0.9314      0.9316   0.9314       0.9232   0.9242      0.9195   0.9234

Specificity
None                 0.3833   0.3833      0.5500   0.5500       0.3917   0.3917      0.4583   0.4583
Random US            0.7906   0.9000      0.7922   0.8500       0.8022   0.8333      0.8011   0.8167
K-Medoids            0.7931   0.8667      0.7803   0.8333       0.7781   0.8333      0.7767   0.8333
Random OS            0.5300   0.6000      0.6425   0.6667       0.4975   0.5167      0.7250   0.7833
SMOTE                0.5361   0.5500      0.5656   0.5667       0.5828   0.5750      0.5881   0.5917

AUC
None                 0.3774   0.8121      0.5300   0.8823       0.3857   0.8148      0.4399   0.8310
Random US            0.7800   0.9149      0.7827   0.9068       0.7804   0.8776      0.7814   0.8851
K-Medoids            0.7808   0.9099      0.7781   0.9208       0.7705   0.8808      0.7779   0.9091
Random OS            0.6977   0.8357      0.7512   0.8428       0.6818   0.7941      0.7863   0.9030
SMOTE                0.7213   0.8561      0.7438   0.8466       0.7448   0.8521      0.7471   0.8477
control and AD samples (54 and 112, respectively). They also studied balanced approaches using 54 samples from both classes and demonstrated balanced sensitivity and specificity measures of 73.1%. Shen et al. (2011) proposed elastic net classifiers based on regularized logistic regression; using the ADNI dataset with 146 proteomics features, 57 control subjects, and 106 AD cases, they achieved a best accuracy of 83.7% and an AUC of 89.9%. These results are very close to our observations, where our ensemble system composed of SLR + SS and the no-sampling approach yielded a best accuracy of 84.86%, 91.67% sensitivity, 72.5% specificity, and 91.25% AUC using the top 10 features (see Table 7). On a balanced dataset obtained with the undersampling approach and the top 10 features, we achieved a best accuracy of 84.16%, 83.33% sensitivity, 85.83% specificity, and an AUC of 91.94%.
Shen et al. (2011) also used reduced MRI features and a subset of control and AD subjects (54 and 106) from the ADNI samples, reporting 86.6% prediction accuracy. Yang et al. (2011) proposed an independent component analysis (ICA) based method for studying the discriminative power of MRI features by coupling ICA with an SVM classifier. Their experiments on the ADNI dataset yielded a highest accuracy of 76.9%, with 74% sensitivity and 79.5% specificity, for the control vs AD (236 vs 202) prediction task. Our ensemble framework for MRI features performed significantly better, giving 87.38% accuracy, 83.3% sensitivity, and 90.18% specificity using the K-Medoids sampling approach and the SLR + SS feature selection algorithm.
MCI is an intermediate stage of AD progression in which the patient shows signs of cognitive decline but is not yet demented.
Table 12
NC versus MCIC & AD prediction task using 305 MRI features. Summary of data used in the train–test set in each cross fold for different data resampling techniques. MCIC & AD includes both MCI converter (142) and AD (138) subjects.

                          No sampling      K-Medoids/Random US   SMOTE/Random OS
Target          Sample #  Train    Test    Train    Test         Train    Test
NL (−)          191       171      20      171      20           252      20
MCIC & AD (+)   280       252      28      171      28           252      28
Total           471       423      48      342      48           504      48
An examination of control and prodromal AD cases can give valuable information about the initial signs and factors responsible for memory impairment. There are many prior works on the automated disease diagnosis problem, including partial least squares based feature selection on MRI (Zhou et al., 2011), feature extraction methods based on MRI data (Cuingnet et al., 2011), and support vector machines combining MRI, PET, and CSF (Kohannim et al., 2010). In a recent work on the ADNI dataset, Johnstone et al. (2012) achieved 93.5% sensitivity and 66.9% specificity for the prediction task of control vs MCI converters (54 vs 163) using their multivariate approach. With balanced training data using 54 samples for both categories, they reported 74.3% sensitivity and 79.3% specificity. Shen et al. (2011) studied control vs MCI (57 vs 110) ADNI subjects for the proteomics modality and observed a highest accuracy of 87.4% and 95.3% AUC. We applied our algorithm to discriminate control from MCI subjects (including both converters and non-converters). The ensemble system of K-Medoids with the SLR + SS algorithm resulted in 87.63% accuracy, 87.58% sensitivity, and 88.33% specificity for the top 10 features. The data imbalance ratio was 7:1 in our case, yet we still obtained >85% values for all performance metrics. This clearly demonstrates the validity and potential of our method.
Many researchers have explored control vs MCI classification using MRI features, where the MCI cases include both converters and non-converters (Davatzikos et al., 2010; Fan et al., 2008; Liu et al., 2011; Shen et al., 2011; Yang et al., 2011). Shen et al. (2011) observed 74.3% classification accuracy for the control vs MCI (57 vs 110) prediction task on the ADNI dataset using a reduced MRI feature set. Yang et al.'s ICA method coupled with an SVM classifier was able to discriminate control from MCI cases (236 vs 410) with a highest accuracy of 72%, 71.3% sensitivity, and 68.6% specificity (Yang et al., 2011). Our proposed ensemble framework composed of K-Medoids and SLR + SS gave 69.46% accuracy, 64% sensitivity, 79.5% specificity, and 77.15% AUC for the same prediction task.
In summary, although a direct head-to-head comparison is difficult (e.g., even the MRI features differ between studies), our experimental results were comparable to or outperformed those of some state-of-the-art algorithms, e.g. (Cuingnet et al.,
Table 13
NC/MCIC & AD: Comparison of different sampling approaches using the top 10 MRI features, averaged across 10 cross folds, in terms of accuracy, sensitivity, specificity, and AUC. For each feature selection algorithm (SLR + SS, then T-test), columns give the RF Avg, RF MajVote, SVM Avg, and SVM MajVote results. (In the typeset table, the best value in each column for each performance metric is underlined to compare sampling approaches, and the highest value in each row is in bold to compare feature selection algorithms and classifiers.)

                     SLR + SS                                    T-test
Sampling type        RF Avg   RF MajVote  SVM Avg  SVM MajVote  RF Avg   RF MajVote  SVM Avg  SVM MajVote

Accuracy (%)
None                 85.321   85.321      85.529   85.529       82.356   82.356      83.446   83.446
Random US            83.888   84.904      84.424   85.112       82.540   83.237      82.177   81.939
K-Medoids            83.735   84.279      84.216   85.112       82.603   83.446      82.668   83.029
Random OS            84.014   83.718      83.681   85.272       83.268   83.141      82.901   82.516
SMOTE                85.091   85.529      85.193   86.362       82.913   82.612      82.451   82.612

Sensitivity
None                 0.8750   0.8750      0.8786   0.8786       0.8429   0.8429      0.8607   0.8607
Random US            0.8207   0.8286      0.8273   0.8179       0.8062   0.8107      0.8027   0.8036
K-Medoids            0.8212   0.8250      0.8267   0.8250       0.8113   0.8214      0.8121   0.8179
Random OS            0.8218   0.8179      0.8225   0.8250       0.8061   0.8071      0.7932   0.7857
SMOTE                0.8537   0.8571      0.8619   0.8750       0.8246   0.8214      0.8190   0.8214

Specificity
None                 0.8250   0.8250      0.8250   0.8250       0.7959   0.7959      0.8000   0.8000
Random US            0.8666   0.8800      0.8709   0.9000       0.8554   0.8650      0.8515   0.8450
K-Medoids            0.8625   0.8700      0.8666   0.8900       0.8497   0.8550      0.8498   0.8500
Random OS            0.8645   0.8618      0.8519   0.8909       0.8697   0.8659      0.8794   0.8809
SMOTE                0.8493   0.8550      0.8390   0.8500       0.8378   0.8350      0.8345   0.8350

AUC
None                 0.7236   0.8612      0.7273   0.8658       0.6737   0.8333      0.6930   0.8437
Random US            0.8397   0.8655      0.8452   0.8757       0.8265   0.8530      0.8222   0.8393
K-Medoids            0.8374   0.8612      0.8422   0.8728       0.8254   0.8505      0.8261   0.8450
Random OS            0.8303   0.8736      0.8233   0.8869       0.8246   0.8770      0.8222   0.8699
SMOTE                0.8481   0.8660      0.8464   0.8805       0.8266   0.8425      0.8222   0.8407
Table 14
NC versus MCI prediction task using 147 proteomics features. Summary of data used in the train–test set in each cross fold for combination resampling techniques. MCI includes both MCI converter (163) and MCI stable (233) subjects.

SMOTE %              0%    10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
K-Medoids %          100%  90%   80%   70%   60%   50%   40%   30%   20%   10%   0%

Target    #          Train Train Train Train Train Train Train Train Train Train Train  Test
NL (−)    58         52    82    112   143   173   204   234   264   295   325   356    6
MCI (+)   396        52    82    112   143   173   204   234   264   295   325   356    40
Total     454        104   164   224   286   346   408   468   528   590   650   712    46
2011). More importantly, since we address a fundamental problem, we believe that our work could complement these existing research efforts and may help others to achieve a balanced
Fig. 6. The bar labeled "Complete" represents the data available for analysis. The "Test" bar represents the test data, and the remaining bars in between represent the training data taken from both classes at different resampling rates. For brevity, bar labels are abbreviated; for example, 10% SMOTE oversampling of the minority class combined with 90% K-Medoids undersampling of the majority class is labeled "S10_K90". A train–test dataset is formed by combining a train set and a test set (the test set is kept fixed between different sampling approaches, and it need not be balanced).
and improved performance on the ADNI or other biomedical datasets.
Conclusion
Here we presented a study in which different sampling approaches were thoroughly analyzed to determine their effectiveness in handling imbalanced neuroimaging data, demonstrating the efficacy of the undersampling approach for the class imbalance problem in the ADNI dataset. Several simple and robust ensemble systems were built based on different data sampling approaches. Each ensemble system was composed of a feature selection algorithm and a data-level solution to the class imbalance problem (i.e., a data resampling approach). We studied six state-of-the-art feature selection algorithms: the two-tailed Student's t-test; Relief-F, based on the relevance of features using k-nearest neighbors; the Gini Index, based on a measure of inequality in the frequency distribution values; Information Gain, which measures the reduction in uncertainty in predicting the class label; the Chi-Square test for independence, which determines whether the outcome depends on a feature; and sparse logistic regression with stability selection. The data-level resampling solutions studied in this work included random undersampling, random oversampling, K-Medoids based undersampling, and the Synthetic Minority Oversampling Technique (SMOTE). We also experimented with different rates of under- and oversampling and examined a combination approach in which different rates of under- and oversampling were applied together. The classification model was built using the decision tree based Random Forest algorithm and the decision boundary based Support Vector
Table 15
NC/MCI: Comparison of different sampling approaches using the top 10 proteomics features obtained by SLR + SS and T-test, averaged across 10 cross folds, in terms of accuracy, sensitivity, specificity, and AUC. OS% refers to the SMOTE oversampling percentage and US% to the K-Medoids undersampling percentage: row (0%,0%) gives the results without resampling, (0%,100%) corresponds to complete undersampling, and (100%,0%) to complete oversampling. For each feature selection algorithm (SLR + SS, then T-test), columns give the RF Avg, RF MajVote, SVM Avg, and SVM MajVote results. (In the typeset table, the best value in each column for each performance metric is underlined to compare sampling approaches, and the highest value in each row is in bold to compare feature selection algorithms and classifiers.)

                     SLR + SS                                    T-test
(OS%, US%)           RF Avg   RF MajVote  SVM Avg  SVM MajVote  RF Avg   RF MajVote  SVM Avg  SVM MajVote

Accuracy (%)
(0%, 0%)             94.022   94.565      92.283   92.391       89.130   89.130      90.217   90.217
(0%, 100%)           80.596   85.359      81.384   87.630       78.958   83.217      78.576   81.696
(10%, 90%)           83.587   90.217      85.217   89.130       83.261   85.870      84.674   90.217
(20%, 80%)           84.348   88.043      87.283   89.130       86.304   90.217      85.761   90.217
(30%, 70%)           87.609   90.217      88.804   91.304       87.500   89.130      88.370   91.304
(40%, 60%)           87.391   89.130      90.435   92.391       86.522   89.130      86.630   89.130
(50%, 50%)           88.261   89.130      89.022   90.217       88.804   89.130      89.565   90.217
(60%, 40%)           88.478   89.130      89.565   91.304       88.261   90.217      87.935   88.043
(70%, 30%)           87.717   89.130      89.674   92.391       88.043   89.130      87.935   88.043
(80%, 20%)           87.935   88.043      90.109   92.391       89.457   90.217      89.457   89.130
(90%, 10%)           87.935   88.043      91.087   92.391       89.022   90.217      88.804   89.130
(100%, 0%)           89.971   89.902      90.761   91.054       87.816   88.348      88.517   89.652

Sensitivity
(0%, 0%)             0.984    0.988       0.948    0.950        0.988    0.988       0.975    0.975
(0%, 100%)           0.806    0.848       0.813    0.876        0.791    0.835       0.787    0.823
(10%, 90%)           0.833    0.900       0.855    0.900        0.835    0.863       0.844    0.913
(20%, 80%)           0.854    0.888       0.885    0.900        0.879    0.925       0.873    0.913
(30%, 70%)           0.891    0.925       0.905    0.925        0.893    0.913       0.900    0.925
(40%, 60%)           0.901    0.925       0.924    0.938        0.885    0.913       0.890    0.913
(50%, 50%)           0.913    0.925       0.915    0.925        0.908    0.913       0.914    0.925
(60%, 40%)           0.918    0.925       0.923    0.938        0.910    0.925       0.906    0.913
(70%, 30%)           0.909    0.925       0.924    0.950        0.904    0.913       0.905    0.913
(80%, 20%)           0.913    0.913       0.928    0.950        0.916    0.925       0.923    0.925
(90%, 10%)           0.911    0.913       0.939    0.950        0.914    0.925       0.919    0.913
(100%, 0%)           0.952    0.952       0.955    0.957        0.931    0.934       0.939    0.949

Specificity
(0%, 0%)             0.650    0.667       0.758    0.750        0.250    0.250       0.417    0.417
(0%, 100%)           0.808    0.900       0.826    0.883        0.783    0.817       0.783    0.783
(10%, 90%)           0.858    0.917       0.833    0.833        0.817    0.833       0.867    0.833
(20%, 80%)           0.775    0.833       0.792    0.833        0.758    0.750       0.758    0.833
(30%, 70%)           0.775    0.750       0.775    0.833        0.758    0.750       0.775    0.833
(40%, 60%)           0.692    0.667       0.775    0.833        0.733    0.750       0.708    0.750
(50%, 50%)           0.683    0.667       0.725    0.750        0.758    0.750       0.775    0.750
(60%, 40%)           0.667    0.667       0.717    0.750        0.700    0.750       0.700    0.667
(70%, 30%)           0.667    0.667       0.717    0.750        0.725    0.750       0.708    0.667
(80%, 20%)           0.658    0.667       0.725    0.750        0.750    0.750       0.708    0.667
(90%, 10%)           0.667    0.667       0.725    0.750        0.733    0.750       0.683    0.750
(100%, 0%)           0.539    0.533       0.588    0.600        0.525    0.542       0.523    0.542

AUC
(0%, 0%)             0.801    0.869       0.841    0.900        0.248    0.654       0.413    0.692
(0%, 100%)           0.801    0.933       0.813    0.932        0.782    0.902       0.781    0.873
(10%, 90%)           0.825    0.946       0.829    0.933        0.796    0.883       0.842    0.896
(20%, 80%)           0.800    0.908       0.830    0.915        0.803    0.848       0.794    0.896
(30%, 70%)           0.811    0.848       0.818    0.927        0.807    0.844       0.817    0.898
(40%, 60%)           0.781    0.833       0.832    0.938        0.799    0.844       0.789    0.846
(50%, 50%)           0.782    0.833       0.801    0.885        0.811    0.844       0.825    0.848
(60%, 40%)           0.776    0.833       0.805    0.883        0.784    0.848       0.791    0.838
(70%, 30%)           0.773    0.833       0.796    0.894        0.800    0.842       0.790    0.838
(80%, 20%)           0.766    0.825       0.817    0.894        0.819    0.848       0.807    0.848
(90%, 10%)           0.768    0.825       0.821    0.894        0.811    0.848       0.777    0.846
(100%, 0%)           0.738    0.836       0.766    0.879        0.721    0.831       0.721    0.840
Machine classifiers. The key evaluation criteria were accuracy and AUC, along with sensitivity and specificity values. Since most resampling approaches randomly select the data points to remove or duplicate, the process was repeated several times to remove any bias due to random selection, and we compared the classification metrics obtained by averaging the results with those obtained by majority voting over all repetitions. The experiments conducted as part of this study demonstrated the dominance of undersampling approaches over oversampling techniques. In general, sophisticated techniques such as K-Medoids and SMOTE gave better AUC and more balanced sensitivity and specificity measures than the corresponding random resampling methods. This paper concludes that the ensemble
system consisting of sparse logistic regression with stability selection as the feature selection algorithm and complete K-Medoids undersampling (a training set balanced with respect to the two classes) effectively handles the class imbalance problem in the ADNI dataset. Performance metrics based on majority voting dominate the corresponding averaged metrics.
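The reliance on sensitivity and specificity rather than accuracy alone is central here: on imbalanced data, a classifier that assigns everything to the majority class scores high accuracy while its minority-class measure collapses. A minimal sketch of the three threshold-based measures used throughout (the function name is ours):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true-positive rate), and specificity
    (true-negative rate) for a binary 0/1 labelling."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum()
    tn = ((y_true == 0) & (y_pred == 0)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    return {"accuracy": (tp + tn) / len(y_true),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}
```

With a 7:1 imbalance, predicting the majority class for every subject yields 87.5% accuracy but 0% on the minority-class measure, which is exactly the sensitivity–specificity gap the resampling approaches are designed to close.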
A concerted effort is needed to investigate the class imbalance problem in the ADNI dataset; to the best of our knowledge, this is the first effort in that direction. This work studied the proteomics and MRI modalities; future work will involve other MRI data features, such as the detailed tensor-based morphometry (TBM) features that were used in our voxelwise genome-wide association study (Hibar et al., 2011;
Fig. 7. NC/MCI majority voting classification performance comparison of the SVM classifier, averaged across 10 cross folds, using the top 10 features from SLR + SS for different rates of data sampling.
Stein et al., 2010a; Stein et al., 2010b) and our surface multivariate TBM studies (Wang et al., 2011). Other modalities could also be considered, such as genomics, psychometric assessment scores, and clinical data. An integrative approach using a combination of different modalities could also be studied. Additionally, experiments could be performed on Alzheimer's disease datasets from other sources to check for common patterns.
In this study, we investigated feature selection for imbalanced data. Another popular approach to dimensionality reduction is feature extraction, e.g., principal component analysis or independent component analysis, which transforms the data into a different domain. The presented ensemble system can be extended to perform feature extraction and classification for imbalanced data. The current study focuses on binary classification; an interesting future direction is to extend the sampling techniques to predictive regression (e.g., prediction of clinical measures), in which case the distribution of the clinical measure should be taken into account when resampling the data. To the best of our knowledge, data resampling for regression has not been well studied in the literature, and we plan to explore it in future work.
Table 16
NC versus MCI prediction task using 147 proteomics features. Summary of data used in the train–test set in each cross fold for different rates of oversampling. MCI includes both MCI converter (163) and MCI stable (233) subjects.

SMOTE %              0%    10%   20%   30%   40%   50%   60%   70%   80%   90%   100%

Target    Count      Train Train Train Train Train Train Train Train Train Train Train  Test
NL (−)    58         52    82    112   143   173   204   234   264   295   325   356    6
MCI (+)   396        356   356   356   356   356   356   356   356   356   356   356    40
Total     454        408   438   468   499   529   560   590   620   651   681   712    46

Table 17
NC versus MCI prediction task using 147 proteomics features. Summary of data used in the train–test set in each cross fold for different rates of undersampling. MCI includes both MCI converter (163) and MCI stable (233) subjects.

K-Medoids %          0%    10%   20%   30%   40%   50%   60%   70%   80%   90%   100%

Target    Count      Train Train Train Train Train Train Train Train Train Train Train  Test
NL (−)    58         52    52    52    52    52    52    52    52    52    52    52     6
MCI (+)   396        356   325   295   264   234   204   173   143   112   82    52     40
Total     454        408   377   347   316   286   256   225   195   164   134   104    46

Acknowledgments

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The
Fig. 8. The bar labeled "Complete" represents the data available for analysis. The "Test" bar represents the test data, and the remaining bars in between represent the training data taken from both classes at different resampling rates. For brevity, bar labels are abbreviated; for example, "S30_K0" corresponds to 30% SMOTE oversampling of the minority class with no undersampling of the majority class. A train–test dataset is formed by combining a train set and a test set (the test set is kept fixed between different sampling approaches, and it need not be balanced).
grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129 and K01 AG030514, and by the Dana Foundation.
This work was funded by the National Institute on Aging (AG016570 to PMT and R21 AG043760 to YW), the National Library of Medicine, the National Institute for Biomedical Imaging and Bioengineering, and the National Center for Research Resources (LM05639, EB01651, RR019771 to PMT), the US National Science Foundation (NSF) (IIS-0812551 and IIS-0953662 to JY), and the National Library of Medicine (R01 LM010730 to JY).
Appendix A
In this appendix, we detail the six feature selection algorithms which were adopted in our experiments.
Student's t-test
The Student's t-test is a statistical hypothesis test in which the test statistic follows a Student's t-distribution when the null hypothesis, denoted H0, holds; the alternative hypothesis, denoted H1, is that H0 does not hold. The test is suited to small samples that are approximately normally distributed with unknown variance. This work employed the unpaired two-tailed t-test, which compares two independent, identically distributed samples; for example, one sample drawn from the population of control subjects and another drawn from the population of subjects with illness. The null hypothesis states that the two samples have equal means and equal variances. The p-value is computed for each feature independently from the t-score (the test statistic) and is defined as the probability of observing a sample statistic as extreme or more extreme
Eas test statistic under the null hypothesis. The null hypothesis is rejectedif p-value is less than or equal to the significance level, usually denotedby α ≤ 0.05. Features are arranged in increasing order of p-value suchthat the most important feature has least p-value. The Matlab's built-in T-test function is used for this algorithm.
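As a concrete illustration, the per-feature t-test ranking described above can be sketched in Python (using scipy in place of Matlab's t-test function; the function name and toy data below are ours, not the paper's):

```python
import numpy as np
from scipy import stats

def ttest_rank(X, y):
    """Rank features by unpaired two-tailed t-test p-value.

    X: (n_samples, n_features) array, y: binary labels (0/1).
    Returns (order, pvals) where order lists feature indices from
    smallest p-value (most discriminative) to largest.
    """
    pvals = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        # equal-variance two-sample t-test, one feature at a time
        _, pvals[j] = stats.ttest_ind(X[y == 0, j], X[y == 1, j])
    return np.argsort(pvals), pvals

# Toy data: feature 0 separates the groups, feature 1 is pure noise
rng = np.random.default_rng(0)
X = np.column_stack([np.r_[rng.normal(0, 1, 50), rng.normal(3, 1, 50)],
                     rng.normal(0, 1, 100)])
y = np.r_[np.zeros(50), np.ones(50)]
order, pvals = ttest_rank(X, y)   # order[0] is the separating feature
```

Note that `scipy.stats.ttest_ind` assumes equal variances by default, matching the null hypothesis stated above.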
Relief-F
Relief-F is an extension of Relief [26], one of the most successful feature subset selection algorithms based on the relevance of features. Most feature selection algorithms estimate the quality of a feature from its relationship to the target class considered in isolation; the Relief algorithm instead assesses the significance of a feature by its ability to distinguish neighboring instances. The underlying principle is as follows: for each feature, if the distance between neighboring data points from the same class is large, the feature separates data points within the same class. Such a feature is unhelpful, and its weight is therefore reduced. Conversely, if the difference between neighboring data points from different classes is large, the feature distinguishes the two classes, which is exactly what feature selection requires, and its weight is increased. Significant features are thus arranged in descending order of their weights. The Relief-F algorithm improves on Relief by considering the k nearest neighbors from each class (Robnik-Šikonja and Kononenko, 2003).
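A minimal Relief-F sketch in Python follows (Manhattan distances, binary labels, and comparable feature scales are assumed; this simplified weight update is our illustration, not the authors' implementation):

```python
import numpy as np

def relief_f(X, y, k=3):
    """Simplified Relief-F feature weights (higher = more relevant).

    For each instance, near-hit differences (k nearest neighbors of the
    same class) decrease a feature's weight, and near-miss differences
    (k nearest neighbors of the other class) increase it.
    """
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        diff = np.abs(X - X[i])        # per-feature distance to every point
        dist = diff.sum(axis=1)        # Manhattan distance
        dist[i] = np.inf               # never pick the instance itself
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(dist[same])[:k]]
        misses = other[np.argsort(dist[other])[:k]]
        w -= diff[hits].mean(axis=0) / n    # penalize within-class spread
        w += diff[misses].mean(axis=0) / n  # reward between-class spread
    return w

# Feature 0 separates the classes; feature 1 is noise
rng = np.random.default_rng(1)
X = np.column_stack([np.r_[rng.normal(0, 1, 30), rng.normal(3, 1, 30)],
                     rng.normal(0, 1, 60)])
y = np.r_[np.zeros(30, int), np.ones(30, int)]
w = relief_f(X, y)                     # expect w[0] > w[1]
```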
Gini Index
Gini Index (GI), also known as the Gini Coefficient or Gini Ratio, measures the inequality in a frequency distribution. This statistical measure of dispersion is commonly used to quantify wealth or income inequality within a population or among countries, but it can be applied in many other fields as well. Mathematically, it is defined as the ratio of the area between the Lorenz curve and the line of equality to the total area under the line of equality [18]. GI measures the ability of a feature to
Table 18
NC/MCI: Comparison of different sampling approaches using the top 10 proteomics features obtained by SLR + SS and t-test, averaged across 10 cross-validation folds, in terms of accuracy, sensitivity, specificity, and AUC. The best value in each column for each performance metric is underlined to compare sampling approaches, and the highest value in each row is highlighted in bold to compare feature selection algorithms and classifiers. OS% refers to the SMOTE oversampling percentage and US% to the K-Medoids undersampling percentage. Row (0%,0%) reports results without resampling; (0%,100%) refers to complete undersampling and (100%,0%) to complete oversampling.

Columns, left to right — SLR + SS: RF Avg, RF MajVote, SVM Avg, SVM MajVote; T-test: RF Avg, RF MajVote, SVM Avg, SVM MajVote.

Accuracy (%)
(0%,0%)    94.022 94.565 92.283 92.391 | 90.000 89.130 90.000 90.217
(10%,0%)   91.848 92.391 93.804 94.565 | 88.370 88.043 90.543 89.130
(20%,0%)   90.217 89.130 92.500 93.478 | 89.348 90.217 91.304 90.217
(30%,0%)   90.543 90.217 91.630 92.391 | 89.891 89.130 89.783 89.130
(40%,0%)   88.913 90.217 91.522 92.391 | 89.565 90.217 88.913 89.130
(50%,0%)   89.130 89.130 91.630 91.304 | 89.348 89.130 89.022 89.130
(60%,0%)   89.457 90.217 91.957 92.391 | 89.565 89.130 89.130 88.043
(70%,0%)   89.239 89.130 90.652 92.391 | 89.457 89.130 89.565 89.130
(80%,0%)   87.717 88.043 90.326 91.304 | 90.000 90.217 89.022 89.130
(90%,0%)   87.609 88.043 89.348 90.217 | 90.000 90.217 89.674 89.130
(100%,0%)  90.036 88.043 89.638 90.217 | 91.993 93.478 90.870 91.304
(0%,10%)   93.043 93.478 93.043 93.478 | 90.217 90.217 89.348 90.217
(0%,20%)   93.370 93.478 93.152 94.565 | 89.783 90.217 89.457 89.130
(0%,30%)   92.717 93.478 92.283 92.391 | 89.457 90.217 88.913 90.217
(0%,40%)   92.500 92.391 91.630 91.304 | 89.565 89.130 88.478 88.043
(0%,50%)   93.261 93.478 92.283 93.478 | 89.022 89.130 88.804 88.043
(0%,60%)   92.500 93.478 91.196 91.304 | 89.674 89.130 90.652 90.217
(0%,70%)   92.174 93.478 89.022 91.304 | 89.130 90.217 88.587 90.217
(0%,80%)   89.457 90.217 89.348 94.565 | 88.261 89.130 88.152 90.217
(0%,90%)   80.000 84.783 82.065 88.043 | 85.326 89.130 85.870 90.217
(0%,100%)  81.522 84.783 82.355 90.217 | 79.420 83.696 79.493 82.609

Sensitivity
(0%,0%)    0.984 0.988 0.948 0.950 | 0.989 0.988 0.975 0.975
(10%,0%)   0.968 0.975 0.959 0.963 | 0.958 0.950 0.960 0.950
(20%,0%)   0.949 0.938 0.951 0.963 | 0.948 0.950 0.955 0.938
(30%,0%)   0.944 0.938 0.948 0.963 | 0.944 0.938 0.938 0.925
(40%,0%)   0.928 0.938 0.944 0.950 | 0.936 0.938 0.930 0.925
(50%,0%)   0.928 0.925 0.941 0.938 | 0.928 0.925 0.926 0.925
(60%,0%)   0.929 0.938 0.949 0.950 | 0.930 0.925 0.933 0.925
(70%,0%)   0.926 0.925 0.934 0.950 | 0.925 0.925 0.925 0.925
(80%,0%)   0.910 0.913 0.928 0.938 | 0.926 0.925 0.923 0.925
(90%,0%)   0.908 0.913 0.916 0.925 | 0.923 0.925 0.926 0.913
(100%,0%)  0.928 0.913 0.920 0.925 | 0.942 0.963 0.939 0.950
(0%,10%)   0.983 0.988 0.953 0.950 | 0.986 0.988 0.970 0.975
(0%,20%)   0.986 0.988 0.954 0.963 | 0.976 0.975 0.966 0.963
(0%,30%)   0.988 0.988 0.948 0.950 | 0.974 0.975 0.961 0.963
(0%,40%)   0.980 0.975 0.940 0.938 | 0.965 0.963 0.954 0.950
(0%,50%)   0.983 0.988 0.944 0.950 | 0.958 0.963 0.948 0.950
(0%,60%)   0.970 0.975 0.936 0.938 | 0.955 0.950 0.953 0.950
(0%,70%)   0.961 0.975 0.908 0.925 | 0.939 0.950 0.926 0.938
(0%,80%)   0.920 0.925 0.903 0.950 | 0.913 0.925 0.896 0.913
(0%,90%)   0.788 0.838 0.811 0.875 | 0.873 0.913 0.873 0.913
(0%,100%)  0.797 0.825 0.804 0.888 | 0.775 0.813 0.775 0.800

Specificity
(0%,0%)    0.650 0.667 0.758 0.750 | 0.308 0.250 0.400 0.417
(10%,0%)   0.592 0.583 0.800 0.833 | 0.392 0.417 0.542 0.500
(20%,0%)   0.592 0.583 0.750 0.750 | 0.533 0.583 0.633 0.667
(30%,0%)   0.650 0.667 0.708 0.667 | 0.600 0.583 0.633 0.667
(40%,0%)   0.633 0.667 0.725 0.750 | 0.625 0.667 0.617 0.667
(50%,0%)   0.650 0.667 0.750 0.750 | 0.667 0.667 0.650 0.667
(60%,0%)   0.667 0.667 0.725 0.750 | 0.667 0.667 0.617 0.583
(70%,0%)   0.667 0.667 0.725 0.750 | 0.692 0.667 0.700 0.667
(80%,0%)   0.658 0.667 0.742 0.750 | 0.725 0.750 0.675 0.667
(90%,0%)   0.667 0.667 0.742 0.750 | 0.750 0.750 0.700 0.750
(100%,0%)  0.719 0.667 0.739 0.750 | 0.772 0.750 0.708 0.667
(0%,10%)   0.583 0.583 0.783 0.833 | 0.342 0.333 0.383 0.417
(0%,20%)   0.583 0.583 0.783 0.833 | 0.375 0.417 0.417 0.417
(0%,30%)   0.525 0.583 0.758 0.750 | 0.367 0.417 0.408 0.500
(0%,40%)   0.558 0.583 0.758 0.750 | 0.433 0.417 0.425 0.417
(0%,50%)   0.600 0.583 0.783 0.833 | 0.442 0.417 0.492 0.417
(0%,60%)   0.625 0.667 0.750 0.750 | 0.508 0.500 0.600 0.583
(0%,70%)   0.658 0.667 0.775 0.833 | 0.575 0.583 0.617 0.667
(0%,80%)   0.725 0.750 0.833 0.917 | 0.683 0.667 0.783 0.833
(0%,90%)   0.883 0.917 0.883 0.917 | 0.725 0.750 0.767 0.833
(0%,100%)  0.936 1.000 0.956 1.000 | 0.925 1.000 0.928 1.000
Table 18 (continued)

Columns, left to right — SLR + SS: RF Avg, RF MajVote, SVM Avg, SVM MajVote; T-test: RF Avg, RF MajVote, SVM Avg, SVM MajVote.

AUC
(0%,0%)    0.801 0.869 0.841 0.900 | 0.615 0.654 0.652 0.692
(10%,0%)   0.754 0.813 0.861 0.956 | 0.638 0.683 0.731 0.731
(20%,0%)   0.751 0.792 0.831 0.904 | 0.717 0.800 0.784 0.833
(30%,0%)   0.776 0.844 0.812 0.856 | 0.749 0.792 0.765 0.840
(40%,0%)   0.761 0.844 0.822 0.894 | 0.755 0.840 0.759 0.840
(50%,0%)   0.767 0.833 0.835 0.894 | 0.777 0.833 0.772 0.840
(60%,0%)   0.779 0.844 0.823 0.894 | 0.780 0.833 0.748 0.833
(70%,0%)   0.778 0.833 0.815 0.894 | 0.791 0.833 0.804 0.840
(80%,0%)   0.762 0.825 0.823 0.885 | 0.816 0.848 0.787 0.840
(90%,0%)   0.768 0.825 0.808 0.875 | 0.822 0.848 0.790 0.846
(100%,0%)  0.818 0.873 0.823 0.954 | 0.852 0.892 0.818 0.879
(0%,10%)   0.762 0.817 0.861 0.946 | 0.635 0.654 0.640 0.704
(0%,20%)   0.767 0.817 0.856 0.956 | 0.643 0.692 0.653 0.690
(0%,30%)   0.724 0.817 0.829 0.900 | 0.640 0.692 0.647 0.738
(0%,40%)   0.740 0.813 0.830 0.898 | 0.670 0.685 0.658 0.683
(0%,50%)   0.777 0.817 0.844 0.956 | 0.668 0.685 0.677 0.683
(0%,60%)   0.775 0.865 0.833 0.898 | 0.704 0.742 0.750 0.790
(0%,70%)   0.791 0.865 0.826 0.935 | 0.739 0.800 0.745 0.840
(0%,80%)   0.803 0.892 0.857 0.969 | 0.766 0.833 0.821 0.896
(0%,90%)   0.821 0.894 0.832 0.935 | 0.782 0.844 0.801 0.896
(0%,100%)  0.864 0.967 0.874 0.973 | 0.850 0.977 0.850 0.977
differentiate between the target classes. When all the samples belong to the same target class, GI is zero, indicating maximum inequality and hence the most useful information. On the other hand, if the samples are equally distributed between the target classes, GI reaches its maximum value, denoting the minimum information that can be obtained from the feature. Hence, features are arranged in increasing order of GI, with the most significant feature having the smallest GI.
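The ranking rule above can be illustrated with a median-split Gini impurity in Python (an illustrative surrogate for the paper's exact computation; the function name and toy data are ours):

```python
import numpy as np

def gini_index(x, y):
    """Weighted Gini impurity of a median split on feature x.

    Lower values mean the split separates the classes better, so
    features are ranked in increasing order of this score.
    """
    def impurity(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)

    mask = x <= np.median(x)
    n = len(y)
    return (mask.sum() / n) * impurity(y[mask]) \
        + ((~mask).sum() / n) * impurity(y[~mask])

y = np.r_[np.zeros(20, int), np.ones(20, int)]
x_good = np.r_[np.zeros(20), np.ones(20)]       # separates the classes
x_noise = np.random.default_rng(0).normal(size=40)  # uninformative
```

A perfectly separating feature yields a score of 0, while a noise feature stays close to the impurity of the unsplit labels.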
Information Gain
Information Gain (IG) is also known as information divergence, Kullback–Leibler divergence, or relative entropy. Information gain is commonly used as a surrogate for approximating a conditional distribution in the classification setting (Cover and Thomas, 1991). It represents the
Fig. 9. NC/MCI majority voting classification performance comparison of the SVM classifier, averaged across 10 cross-validation folds, using the top 10 features from SLR + SS for different rates of data sampling. Note the decreasing sensitivity–specificity gap as the rate of undersampling is increased. The completely undersampled dataset (labeled S0_K100) showed the smallest gap.
reduction in uncertainty in predicting the class label Y given a feature vector xa that can take up to k possible values. In other words, IG measures the reduction in entropy in moving from the prior distribution P(Y) to the posterior distribution P(Y | xa); both Y and xa are assumed to be discrete. An attribute with a higher IG value is considered more relevant and is assigned a higher weight, and features are arranged in decreasing order of their IG values. The measure is asymmetric, i.e. IG(Y | xa) ≠ IG(xa | Y), and is not suitable for attributes (feature vectors) that can take a large number of discrete values, as this may cause overfitting.
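For discrete features, IG(Y | xa) = H(Y) − H(Y | xa) can be computed directly; a short Python sketch (toy data and names are ours):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a vector of integer labels."""
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """IG(Y | x) = H(Y) - H(Y | x) for a discrete feature x."""
    n = len(y)
    # conditional entropy: weight the entropy of y within each value of x
    h_cond = sum((np.sum(x == v) / n) * entropy(y[x == v])
                 for v in np.unique(x))
    return entropy(y) - h_cond

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
ig_dep = information_gain(y.copy(), y)              # fully informative: 1 bit
ig_ind = information_gain(np.array([0, 1] * 4), y)  # independent of y: 0 bits
```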
Chi-Square test
The Chi-Square (χ2) test is a statistical test performed on samples that follow the χ2 distribution, a special case of the gamma distribution. It is a
[Fig. 10 schematic: (a) the given minority–majority training set; (b) seven balanced datasets generated by pairing the full minority class with each of seven non-overlapping, minority-sized subsets of the majority class.]
Fig. 10. Generation of classification models for imbalanced data using the Chan and Stolfo (1998) approach. The majority class (represented by orange rectangles in the figure) is evenly divided into minority-class-sized, non-overlapping subsets.
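The partitioning depicted in Fig. 10 can be sketched as follows (the function name and toy data are ours; this is an illustrative reading of the Chan and Stolfo scheme, not the paper's code):

```python
import numpy as np

def chan_stolfo_subsets(X, y, minority_label, rng=None):
    """Partition the majority class into minority-sized, non-overlapping
    subsets and pair each with the full minority class, as in Chan and
    Stolfo (1998). Returns a list of balanced (X, y) training sets.
    """
    rng = np.random.default_rng(rng)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = rng.permutation(np.where(y != minority_label)[0])
    k = len(maj_idx) // len(min_idx)          # number of balanced subsets
    sets = []
    for chunk in np.array_split(maj_idx[:k * len(min_idx)], k):
        idx = np.r_[min_idx, chunk]           # minority + one majority chunk
        sets.append((X[idx], y[idx]))
    return sets

# Toy data: 10 minority and 70 majority samples -> 7 balanced sets of 20
X = np.arange(80, dtype=float).reshape(80, 1)
y = np.r_[np.ones(10, int), np.zeros(70, int)]
sets = chan_stolfo_subsets(X, y, minority_label=1, rng=0)
```

A classifier is then trained on each balanced set and the predictions are combined, e.g. by majority voting, as in the ensemble experiments above.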
Table 19
NC/MCI: Comparison of undersampling approaches using the top 10 proteomics features obtained by SLR + SS and t-test, averaged across 10 cross-validation folds, in terms of accuracy, sensitivity, specificity, and AUC. The best value in each column for each performance metric is underlined to compare sampling approaches, and the highest value in each row is highlighted in bold to compare feature selection algorithms and classifiers. Chan US refers to undersampling using the Chan and Stolfo (1998) approach.

Columns, left to right — SLR + SS: RF Avg, RF MajVote, SVM Avg, SVM MajVote; T-test: RF Avg, RF MajVote, SVM Avg, SVM MajVote.

Accuracy (%)
Random US  80.146 84.772 80.965 86.326 | 78.607 83.685 78.344 82.630
K-Medoids  80.596 85.359 81.384 87.630 | 78.958 83.217 78.576 81.696
Chan US    87.210 91.304 87.030 89.467 | 86.284 89.250 85.787 90.087

Sensitivity
Random US  0.802 0.846 0.808 0.861 | 0.786 0.836 0.782 0.826
K-Medoids  0.806 0.848 0.813 0.876 | 0.791 0.835 0.787 0.823
Chan US    0.942 0.985 0.937 0.977 | 0.932 0.972 0.923 0.964

Specificity
Random US  0.803 0.867 0.824 0.883 | 0.787 0.850 0.798 0.833
K-Medoids  0.808 0.900 0.826 0.883 | 0.783 0.817 0.783 0.783
Chan US    0.398 0.433 0.419 0.342 | 0.393 0.350 0.420 0.475

AUC
Random US  0.798 0.914 0.811 0.927 | 0.782 0.911 0.784 0.899
K-Medoids  0.801 0.933 0.813 0.932 | 0.782 0.902 0.781 0.873
Chan US    0.619 0.830 0.620 0.800 | 0.611 0.771 0.626 0.808
Fig. 11. NC/MCI majority voting classification performance comparison of the SVM classifier for different undersampling approaches, averaged across 10 cross-validation folds, using the top 10 features from SLR + SS for different rates of data sampling, depicting the efficacy of the K-Medoids and random undersampling approaches over the solution proposed by Chan and Stolfo (1998).
continuous, asymmetric, right-skewed distribution with K degrees of freedom, such that the mean of the distribution equals K and the variance equals 2K. The χ2 distribution is widely used in the χ2 test to assess goodness of fit and independence of criteria, and to estimate confidence intervals and standard deviations. In feature selection, the χ2 test for independence is employed to determine whether the outcome depends on a feature. The null hypothesis states that the occurrences of the outcomes of an observation are statistically independent. The p-value is the probability of obtaining a test statistic as extreme as the observed value under the null hypothesis, and is computed from the χ2 distribution table given the χ2 test statistic and K. The null hypothesis is rejected if the p-value is less than the specified significance level α, often α = 0.05. Rejecting the hypothesis makes the result statistically significant and confirms the dependence of the outcome on the feature value. Features are arranged in increasing order of p-value.
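A per-feature χ2 independence test can be sketched with scipy (illustrative; discrete feature values are assumed, and the function name and toy data are ours):

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_pvalue(x, y):
    """p-value of a chi-square test of independence between a discrete
    feature x and the class label y; smaller p-values indicate stronger
    dependence, so features are ranked in increasing order of p."""
    xs, ys = np.unique(x), np.unique(y)
    # contingency table of feature value vs. class label
    table = np.array([[np.sum((x == a) & (y == b)) for b in ys]
                      for a in xs])
    _, p, _, _ = chi2_contingency(table)
    return p

y = np.r_[np.zeros(20, int), np.ones(20, int)]
p_dep = chi2_pvalue(y, y)                    # perfectly dependent feature
p_ind = chi2_pvalue(np.tile([0, 1], 20), y)  # independent feature
```

Continuous features would first need to be discretized (e.g. binned) before applying the test.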
Sparse Logistic Regression
Sparse Logistic Regression (SLR) is an embedded feature selection algorithm which applies ℓ1-norm regularization to Logistic Regression. It is one of the most attractive feature selection algorithms for applications dealing with high-dimensional data. Logistic Regression (LR) is a classification technique that uses a linear discriminative model to maximize the quality of output on the training data. For a two-class (binomial) classification problem, it assigns a probability to class labels using the sigmoid function hθ(x), such that the class label is positive if hθ(x) ≥ 0.5 and negative otherwise. LR tends to overfit when the sample size is limited and the data is very high dimensional. To reduce overfitting and obtain better LR classifiers, regularization is applied to LR's objective function. The guiding principle in sparse logistic regression is to regularize the logistic loss function such that irrelevant features are given zero weight (Liu et al., 2009b). To induce sparsity, the ℓ1-norm regularized logistic loss function is used (Duchi et al., 2008; Friedman et al., 2010; Fu, 1998). Features are ranked in decreasing order of their weights. The Matlab code is taken from the SLEP package (Liu et al., 2009c).
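As an illustration of the same idea outside Matlab/SLEP, scikit-learn's ℓ1-penalized logistic regression can rank features by absolute weight (the toy data and the regularization strength C are our assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: only features 0 and 1 carry signal, the other 18 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100) > 0).astype(int)

# l1 penalty drives irrelevant coefficients exactly to zero
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
weights = np.abs(clf.coef_).ravel()
ranking = np.argsort(weights)[::-1]   # features in decreasing |weight|
```

The strongest signal feature lands at the top of the ranking, while many noise features receive exactly zero weight, which is the embedded selection effect described above.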
References
Akbani, R., Kwek, S., Japkowicz, N., 2004. Applying support vector machines to imbalanced datasets. Proceedings of the 15th European Conference on Machine Learning (ECML), pp. 39–50.
Alzheimer's Association, 2012. Alzheimer's Disease Facts and Figures. Alzheimer's Association (Available from: http://www.alz.org).
Bartzokis, G., 2004. Age-related myelin breakdown: a developmental model of cognitive decline and Alzheimer's disease. Neurobiol. Aging 25 (1), 5–18 (author reply 49–62).
Bernal-Rusiel, J.L., Greve, D.N., Reuter, M., Fischl, B., Sabuncu, M.R., 2012. Statistical analysis of longitudinal neuroimage data with Linear Mixed Effects models. Neuroimage 66C, 249–260.
Bradford, J.P., Kunz, C., Kohavi, R., Brunk, C., Brodley, C.E., 1998. Pruning decision trees with misclassification costs. Proceedings of the European Conference on Machine Learning, pp. 131–136.
Chan, P.K., Stolfo, S.J., 1998. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press, pp. 164–168.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16 (1), 321–357.
Chawla, N., Japkowicz, N., Kołcz, A., 2003. ICML'2003 Workshop on Learning from Imbalanced Data Sets (II). Washington DC, US.
Chawla, N.V., Japkowicz, N., Kotcz, A., 2004. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6 (1), 1–6.
Chen, C., Liaw, A., Breiman, L., 2004. Using Random Forest to Learn Imbalanced Data. University of California, Berkeley.
Corder, E.H., Saunders, A.M., Strittmatter, W.J., Schmechel, D.E., Gaskell, P.C., Small, G.W., Roses, A.D., Haines, J.L., Pericak-Vance, M.A., 1993. Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science 261 (5123), 921–923.
Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley.
Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehericy, S., Habert, M.O., Chupin, M., Benali, H., Colliot, O., 2011. Automatic classification of patients with Alzheimer's disease from structural MRI: a comparison of ten methods using the ADNI database. Neuroimage 56 (2).
Davatzikos, C., Bhatt, P., Shaw, L.M., Batmanghelich, K.N., Trojanowski, J.Q., 2010. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiol. Aging.
Devanand, D.P., Pradhaban, G., Liu, X., Khandji, A., De Santi, S., Segal, S., Rusinek, H., Pelton, G.H., Honig, L.S., Mayeux, R., Stern, Y., Tabert, M.H., de Leon, M.J., 2007. Hippocampal and entorhinal atrophy in mild cognitive impairment: prediction of Alzheimer disease. Neurology 68 (11), 828–836.
Dickerson, B.C., Goncharova, I., Sullivan, M.P., Forchetti, C., Wilson, R.S., Bennett, D.A., Beckett, L.A., deToledo-Morrell, L., 2001. MRI-derived entorhinal and hippocampal atrophy in incipient and very mild Alzheimer's disease. Neurobiol. Aging 22 (5), 747–754.
Drummond, C., Holte, R.C., 2003. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. Working Notes of the ICML'03 Workshop on Learning from Imbalanced Data Sets, Washington, DC.
Dubey, R., 2012. Machine Learning Methods for Biosignature Discovery. (Masters Thesis) Arizona State University.
Duchesnay, E., Cachia, A., Boddaert, N., Chabane, N., Mangin, J.F., Martinot, J.L., Brunelle, F., Zilbovicius, M., 2011. Feature selection and classification of imbalanced datasets: application to PET images of children with autistic spectrum disorders. Neuroimage 57 (3), 1003–1014.
Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T., 2008. Efficient projections onto the l1-ball for learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning. ACM, Helsinki, Finland, pp. 272–279.
Elkan, C., 2001. The foundations of cost-sensitive learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence, volume 2. Morgan Kaufmann Publishers Inc., Seattle, WA, USA, pp. 973–978.
Elkan, C., 2003. Invited Talk: The Real Challenges in Data Mining: A Contrarian View.http://www.site.uottawa.ca/nat/Workshop2003/realchallenges2.ppt.
Ertekin, S., Huang, J., Bottou, L., Giles, L., 2007. Learning on the border: active learning in imbalanced data classification. Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. ACM, Lisbon, Portugal, pp. 127–136.
Estabrooks, A., 2000. A combination scheme for inductive learning from imbalanced data sets. (Master thesis) Computer Science, Dalhousie University.
Estabrooks, A., Jo, T., Japkowicz, N., 2004. A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20 (1), 18–36.
Fan, Y., Resnick, S.M., Wu, X., Davatzikos, C., 2008. Structural and functional biomarkers of prodromal Alzheimer's disease: a high-dimensional pattern classification study. Neuroimage 41 (2), 277–285.
Fitzmaurice, G., Laird, N., Ware, J., 2011. Applied Longitudinal Analysis. Wiley.
Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 (1), 1–22.
Frisoni, G.B., Fox, N.C., Jack, C.R., Scheltens, P., Thompson, P.M., 2010. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol. 6 (2), 67–77.
Fu, W., 1998. Penalized regressions: the bridge versus the lasso. J. Comput. Graph. Stat. 7 (3), 397–416.
Gomez Ravetti, M., Moscato, P., 2008. Identification of a 5-protein biomarker molecular signature for predicting Alzheimer's disease. PLoS One 3 (9), e3111.
He, H., Garcia, E.A., 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21 (9), 1263–1284.
Hibar, D.P., Stein, J.L., Kohannim, O., Jahanshad, N., Saykin, A.J., Shen, L., Kim, S., Pankratz, N., Foroud, T., Huentelman, M.J., Potkin, S.G., Jack Jr., C.R., Weiner, M.W., Toga, A.W., Thompson, P.M., 2011. Voxelwise gene-wide association study (vGeneWAS): multivariate gene-based association testing in 731 elderly subjects. Neuroimage 56 (4), 1875–1891.
Jack Jr., C.R., Bernstein, M.A., Fox, N.C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P.J., Whitwell, J.L., Ward, C., Dale, A.M., Felmlee, J.P., Gunter, J.L., Hill, D.L.G., Killiany, R., Schuff, N., Fox-Bosetti, S., Lin, C., Studholme, C., DeCarli, C.S., Krueger, G., Ward, H.A., Metzger, G.J., Scott, K.T., Mallozzi, R., Blezek, D., Levy, J., Debbins, J.P., Fleisher, A.S., Albert, M., Green, R., Bartzokis, G., Glover, G., Mugler, J., Weiner, M.W., Study, A., 2008. The Alzheimer's disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging 27 (4), 685–691.
Japkowicz, N., 2000a. Learning from imbalanced data sets: a comparison of various strategies. In: Japkowicz, N. (Ed.), Proceedings of Learning from Imbalanced Data Sets, Papers from the AAAI Workshop, pp. 10–15.
Japkowicz, N., 2000b. The class imbalance problem: significance and strategies. Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI), pp. 111–117.
Japkowicz, N., 2001. Supervised versus unsupervised binary-learning by feedforward neural networks. Mach. Learn. 42 (1–2), 97–122.
Japkowicz, N., Stephen, S., 2002. The class imbalance problem: a systematic study. Intell. Data Anal. 6 (5), 429–449.
Jiang, X., El-Kareh, R., Ohno-Machado, L., 2011. Improving predictions in imbalanced data using Pairwise Expanded Logistic Regression. AMIA Annu. Symp. Proc. 2011, 625–634.
Jo, T., Japkowicz, N., 2004. Class imbalances versus small disjuncts. SIGKDD Explor. Newsl. 6 (1), 40–49.
Johnstone, D., Milward, E.A., Berretta, R., Moscato, P., 2012. Multivariate protein signatures of pre-clinical Alzheimer's disease in the Alzheimer's disease neuroimaging initiative (ADNI) plasma proteome dataset. PLoS One 7 (4), e34341.
Joshi, M.V., Kumar, V., Agarwal, R.C., 2001. Evaluating boosting algorithms to classify rare classes: comparison and improvements. Proceedings of the 2001 IEEE International Conference on Data Mining. IEEE Comput. Soc., pp. 257–264.
Knoll, U., Nakhaeizadeh, G., Tausend, B., 1994. Cost-sensitive pruning of decision trees.Mach. Learn. ECML-94 (784), 383–386.
Kohannim, O., Hua, X., Hibar, D.P., Lee, S., Chou, Y.Y., Toga, A.W., Jack Jr., C.R., Weiner,M.W., Thompson, P.M., 2010. Boosting power for clinical trials using classifiersbased on multiple biomarkers. Neurobiol. Aging 31 (8), 1429–1442.
Kołcz, A., Chowdhury, A., Alspector, J., 2003. Data duplication: an imbalance problem. Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets.
Kubat, M., Matwin, S., 1997. Addressing the curse of imbalanced training sets: one-sided selection. Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann, pp. 179–186.
Lee, K.J., Hwang, Y.S., Kim, S., Rim, H.C., 2004. Biomedical named entity recognition using two-phase model based on SVMs. J. Biomed. Inform. 37 (6), 436–447.
Ling, C., Li, C., 1998. Data mining for direct marketing: problems and solutions. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98). AAAI Press, pp. 73–79.
Liu, X.Y., Wu, J., Zhou, Z.H., 2009a. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. B Cybern. 39 (2), 539–550.
Liu, J., Chen, J., Ye, J., 2009b. Large-scale sparse logistic regression. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, Paris, France, pp. 547–556.
Liu, J., Ji, S., Ye, J., 2009c. SLEP: Sparse Learning with Efficient Projections. Arizona State University. http://www.public.asu.edu/jye02/Software/SLEP.
Liu, Y., Paajanen, T., Zhang, Y., Westman, E., Wahlund, L.O., Simmons, A., Tunnard, C., Sobow, T., Mecocci, P., Tsolaki, M., Vellas, B., Muehlboeck, S., Evans, A., Spenger, C., Lovestone, S., Soininen, H., 2011. Combination analysis of neuropsychological tests and structural MRI measures in differentiating AD, MCI and control groups—the AddNeuroMed study. Neurobiol. Aging 32 (7), 1198–1206.
Maloof, M.A., 2003. Learning when data sets are imbalanced and when costs are unequal and unknown. ICML-2003 Workshop on Learning from Imbalanced Data Sets II.
Mayeux, R., Saunders, A.M., Shea, S., Mirra, S., Evans, D., Roses, A.D., Hyman, B.T., Crain, B., Tang, M.X., Phelps, C.H., 1998. Utility of the apolipoprotein E genotype in the diagnosis of Alzheimer's disease. Alzheimer's Disease Centers Consortium on Apolipoprotein E and Alzheimer's Disease. N. Engl. J. Med. 338 (8), 506–511.
Meinshausen, N., Bühlmann, P., 2010. Stability selection. J. R. Stat. Soc. Ser. B 72 (4),417–473.
Mueller, S.G., Weiner, M.W., Thal, L.J., Petersen, R.C., Jack, C.R., Jagust, W., Trojanowski, J.Q.,Toga, A.W., Beckett, L., 2005. Ways toward an early diagnosis in Alzheimer's disease:the Alzheimer's Disease Neuroimaging Initiative (ADNI). Alzheimers Dement. 1 (1),55–66.
O'Bryant, S.E., Xiao, G., Barber, R., Huebinger, R., Wilhelmsen, K., Edwards, M., Graff-Radford, N., Doody, R., Diaz-Arrastia, R., 2011. A blood-based screening tool for Alzheimer's disease that spans serum and plasma: findings from TARC and ADNI. PLoS One 6 (12), e28092.
Padmaja, T.M., Krishna, P.R., Bapi, R.S., 2008. Majority filter-based minority prediction (MFMP): an approach for unbalanced datasets. TENCON 2008–2008 IEEE Region 10 Conference, pp. 1–6.
Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C., 1994. Reducing misclassification costs. Proceedings of the 11th International Conference on Machine Learning, pp. 217–225.
Provost, F., 2000. Machine Learning from Imbalanced Data Sets 101. Workshop on Learning from Imbalanced Data Sets. AAAI, Texas, US.
Provost, F., Fawcett, T., 2001. Robust classification for imprecise environments. Mach. Learn. 42 (3), 203–231.
Ray, S., Britschgi, M., Herbert, C., Takeda-Uchimura, Y., Boxer, A., Blennow, K., Friedman, L.F., Galasko, D.R., Jutel, M., Karydas, A., Kaye, J.A., Leszek, J., Miller, B.L., Minthon, L., Quinn, J.F., Rabinovici, G.D., Robinson, W.H., Sabbagh, M.N., So, Y.T., Sparks, D.L., Tabaton, M., Tinklenberg, J., Yesavage, J.A., Tibshirani, R., Wyss-Coray, T., 2007. Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins. Nat. Med. 13 (11), 1359–1362.
Reiman, E.M., Jagust, W.J., 2011. Brain imaging in the study of Alzheimer's disease. Neuroimage.
Robnik-Šikonja, M., Kononenko, I., 2003. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 53 (1–2), 23–69.
Shaw, L.M., Vanderstichele, H., Knapik-Czajka, M., Clark, C.M., Aisen, P.S., Petersen, R.C., Blennow, K., Soares, H., Simon, A., Lewczuk, P., Dean, R., Siemers, E., Potter, W., Lee,
V.M., Trojanowski, J.Q., 2009. Cerebrospinal fluid biomarker signature in Alzheimer'sdisease neuroimaging initiative subjects. Ann. Neurol. 65 (4), 403–413.
Shen, L., Kim, S., Qi, Y., Inlow, M., Swaminathan, S., Nho, K., Wan, J., Risacher, S.L., Shaw,L.M., Trojanowski, J.Q., Weiner, M.W., Saykin, A.J., 2011. Identifying neuroimagingand proteomic biomarkers for MCI and AD via the elastic net. Proceedings of theFirst International Conference on Multimodal Brain Image Analysis. Springer-Verlag,Toronto, Canada, pp. 27–34.
Sperling, R.A., Aisen, P.S., Beckett, L.A., Bennett, D.A., Craft, S., Fagan, A.M., Iwatsubo, T.,Jack Jr., C.R., Kaye, J., Montine, T.J., Park, D.C., Reiman, E.M., Rowe, C.C., Siemers, E.,Stern, Y., Yaffe, K., Carrillo, M.C., Thies, B., Morrison-Bogorad, M., Wagster, M.V.,Phelps, C.H., 2011. Toward defining the preclinical stages of Alzheimer's disease: rec-ommendations from the National Institute on Aging-Alzheimer's Associationworkgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement. 7(3), 280–292.
Stein, J.L., Hua, X., Lee, S., Ho, A.J., Leow, A.D., Toga, A.W., Saykin, A.J., Shen, L., Foroud, T.,Pankratz, N., Huentelman, M.J., Craig, D.W., Gerber, J.D., Allen, A.N., Corneveaux, J.J.,Dechairo, B.M., Potkin, S.G., Weiner, M.W., Thompson, P.M., 2010a. Voxelwisegenome-wide association study (vGWAS). Neuroimage 53 (3), 1160–1174.
Stein, J.L., Hua, X., Morra, J.H., Lee, S., Hibar, D.P., Ho, A.J., Leow, A.D., Toga, A.W., Sul, J.H.,Kang, H.M., Eskin, E., Saykin, A.J., Shen, L., Foroud, T., Pankratz, N., Huentelman, M.J.,Craig, D.W., Gerber, J.D., Allen, A.N., Corneveaux, J.J., Stephan, D.A., Webster, J.,DeChairo, B.M., Potkin, S.G., Jack Jr., C.R., Weiner, M.W., Thompson, P.M., 2010b.Genome-wide analysis reveals novel genes influencing temporal lobe structurewith relevance to neurodegeneration in Alzheimer's disease. Neuroimage 51 (2),542–554.
Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A., 2007. Experimental perspectives on learning from imbalanced data. Proceedings of the 24th International Conference on Machine Learning. ACM, Corvallis, Oregon, pp. 935–942.
Visa, S., Ralescu, A., 2005. Issues in mining imbalanced data sets — a review paper. Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, pp. 67–73.
Vlkolinský, R., Cairns, N., Fountoulakis, M., Lubec, G., 2001. Decreased brain levels of 2′,3′-cyclic nucleotide-3′-phosphodiesterase in Down syndrome and Alzheimer's disease. Neurobiol. Aging 22 (4), 547–553.
Wang, Y., Song, Y., Rajagopalan, P., An, T., Liu, K., Chou, Y.Y., Gutman, B., Toga, A.W., Thompson, P.M., 2011. Surface-based TBM boosts power to detect disease effects on the brain: an N = 804 ADNI study. Neuroimage 56 (4), 1993–2010.
Weiner, M.W., Veitch, D.P., Aisen, P.S., Beckett, L.A., Cairns, N.J., Green, R.C., Harvey, D., Jack, C.R., Jagust, W., Liu, E., Morris, J.C., Petersen, R.C., Saykin, A.J., Schmidt, M.E., Shaw, L., Siuciak, J.A., Soares, H., Toga, A.W., Trojanowski, J.Q., 2012. The Alzheimer's Disease Neuroimaging Initiative: a review of papers published since its inception. Alzheimers Dement. 8 (1 Suppl.), S1–S68.
Yang, Q., Wu, X., 2006. 10 challenging problems in data mining research. Int. J. Inf. Technol. Decis. Mak. 5 (4), 597–604.
Yang, W., Lui, R.L., Gao, J.H., Chan, T.F., Yau, S.T., Sperling, R.A., Huang, X., 2011. Independent component analysis-based classification of Alzheimer's disease MRI data. J. Alzheimers Dis. 24 (4), 775–783.
Yen, S.-J., Lee, Y.-S., 2006. Cluster-based sampling approaches to imbalanced data distributions. In: Tjoa, A., Trujillo, J. (Eds.), Springer Berlin Heidelberg, pp. 427–436.
Yuan, L., Wang, Y., Thompson, P.M., Narayan, V.A., Ye, J., 2012. Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. Neuroimage 61 (3), 622–632.
Zadrozny, B., Langford, J., Abe, N., 2003. Cost-sensitive learning by cost-proportionate example weighting. Proceedings of the Third IEEE International Conference on Data Mining. IEEE Comput. Soc., p. 435.
Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand, A., Liu, H., 2011. Advancing feature selection research. ASU Feature Selection Repository.
Zheng, Z., Srihari, R., 2003. Optimally combining positive and negative features for text categorization. Workshop for Learning from Imbalanced Datasets II, Proceedings of the ICML.
Zhou, L., Wang, Y., Li, Y., Yap, P.T., Shen, D., 2011. Hierarchical anatomical brain networks for MCI prediction: revisiting volumetric measures. PLoS One 6 (7), e21935.