Transcript of Competition II: Springleaf (sjsu.edu)

Page 1:

Competition II: Springleaf

Sha Li (Team leader), Xiaoyan Chong, Minglu Ma, Yue Wang

CAMCOS Fall 2015, San Jose State University

Page 2:

Agenda

• Kaggle Competition: Springleaf dataset introduction
• Data Preprocessing
• Classification Methodologies & Results
  • Logistic Regression
  • Random Forest
  • XGBoost
  • Stacking
• Summary & Conclusion

Page 3:

Kaggle Competition: Springleaf

Objective: Predict whether customers will respond to a direct mail loan offer.

• Customers: 145,231
• Independent variables: 1,932 ("anonymous" features)
• Dependent variable:
  – target = 0: DID NOT RESPOND
  – target = 1: RESPONDED
• Training set: 96,820 obs.
• Test set: 48,411 obs.

Page 4:

Dataset facts

• R package used to read the file: data.table::fread (see the sketch below)
• Target = 0 obs.: 111,458 (76.7%)
• Target = 1 obs.: 33,773 (23.3%)
• Numerical variables: 1,876
• Character variables: 51
• Constant variables: 5
• Variable level counts: 67.0% of columns have <= 100 levels
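A minimal sketch of the loading step in R (the file name train.csv and the object names are assumptions, not the team's actual script):

  library(data.table)

  # fread handles the 145,231 x 1,934 CSV far faster than read.csv
  springleaf <- fread("train.csv", showProgress = FALSE)

  dim(springleaf)                    # rows x columns
  table(springleaf$target)           # target = 0 vs. target = 1 counts
  table(sapply(springleaf, class))   # numeric vs. character columns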

[Figures: count of levels for each column; class 0 vs. class 1 counts; variable type counts]

Page 5:

Missing values

• "", "NA": 0.6%
• "[]", -1: 2.0%
• -99999, 96, …, 999, …, 99999999: 24.9%
• 25.3% of columns have missing values

[Figures: count of NAs in each column; count of NAs in each row (61.7%)]
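A minimal sketch of recoding these sentinel codes to NA (the sentinel list is abbreviated, as on the slide; springleaf comes from the loading sketch above and everything else is illustrative):

  # Values that encode "missing" in this dataset (abbreviated list)
  sentinels <- c("", "NA", "[]", "-1", "-99999", "96", "999", "99999999")

  for (col in setdiff(names(springleaf), c("ID", "target"))) {
    x <- springleaf[[col]]
    x[as.character(x) %in% sentinels] <- NA            # recode sentinels to NA
    data.table::set(springleaf, j = col, value = x)    # in-place update
  }

  # Share of columns that now contain at least one NA
  mean(sapply(springleaf, anyNA))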

Page 6:

Challenges for classification

• Huge dataset (145,231 x 1,932)
• "Anonymous" features
• Uneven distribution of the response variable
• 27.6% missing values
• Must handle both numerical and categorical variables
• Undetermined portion of categorical variables
• Data preprocessing complexity

Page 7:

Data preprocessing

• Remove ID and target
• Replace "[]" and -1 with NA
• Remove duplicate columns
• Encode character columns
• Remove low-variance columns
• Handle NAs: replace by the median, replace randomly, or regard NA as a new group
• Normalize with log(1 + |x|)
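A minimal sketch of a few of these steps (object names are illustrative, not the team's actual code; springleaf comes from the earlier sketches):

  y <- springleaf$target
  X <- as.data.frame(springleaf[, !c("ID", "target"), with = FALSE])

  # Remove constant and duplicated columns
  X <- X[, sapply(X, function(col) length(unique(col)) > 1)]
  X <- X[, !duplicated(as.list(X))]

  # Encode character columns as integer levels (NA kept as its own level)
  for (col in names(X)[sapply(X, is.character)])
    X[[col]] <- as.integer(factor(X[[col]], exclude = NULL))

  # Median-impute remaining NAs, then compress heavy tails with log(1 + |x|)
  for (col in names(X)) {
    x <- as.numeric(X[[col]])
    x[is.na(x)] <- median(x, na.rm = TRUE)
    X[[col]] <- log1p(abs(x))
  }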

Page 8:

Principal Component Analysis

When the number of principal components reaches about 400, they explain 90% of the variance.

[Figure: variance explained by the principal components.]
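A minimal sketch (using the preprocessed matrix X from the sketch above; running prcomp on the full data is slow, so a subsample may be needed in practice):

  pca <- prcomp(as.matrix(X), center = TRUE, scale. = TRUE)

  # Cumulative proportion of variance explained
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cum_var >= 0.90)[1]    # about 400 components on this data, per the slide
  plot(cum_var, type = "l", xlab = "Number of PCs",
       ylab = "Cumulative variance explained")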

Page 9:

LDA: Linear discriminant analysis

• We are interested in the most discriminatory direction, not the direction of maximum variance.
• Find the direction that best separates the two classes.

[Figure: when Var1 and Var2 are large and µ1 and µ2 are close, the projected classes show significant overlap.]
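A minimal sketch of this idea with MASS::lda on the principal-component scores (the use of 100 PCs and all object names are assumptions):

  library(MASS)

  scores  <- pca$x[, 1:100]                  # first 100 principal components
  fit_lda <- lda(scores, grouping = factor(y))

  # Project onto the single most discriminatory direction (LD1)
  proj <- scores %*% fit_lda$scaling
  boxplot(as.numeric(proj) ~ y, xlab = "target", ylab = "LD1")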

Page 10:

Methodology

• K Nearest Neighbor (KNN)
• Support Vector Machine (SVM)
• Logistic Regression
• Random Forest
• XGBoost (eXtreme Gradient Boosting)
• Stacking

Page 11:

K Nearest Neighbor (KNN)

[Figure: illustration of KNN with Target = 0 and Target = 1 points.]

Accuracy (%) by number of neighbors K:

  K      Overall    Target = 1
  3      72.1       22.8
  5      73.9       18.3
  7      75.0       15.3
  11     76.1       12.1
  15     76.5       10.5
  21     76.8        9.4
  39     77.0        7.5

Overall accuracy rises with K while Target = 1 accuracy falls.
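A minimal sketch with class::knn on the PCA scores (the 80/20 split, the use of PCA scores, and the object names are assumptions; on the full data this is slow, so a subsample may be needed):

  library(class)

  set.seed(1)
  tr_idx <- sample(nrow(scores), floor(0.8 * nrow(scores)))   # train/test split

  for (k in c(3, 5, 7, 11, 15, 21, 39)) {
    pred    <- knn(train = scores[tr_idx, ], test = scores[-tr_idx, ],
                   cl = y[tr_idx], k = k)
    overall <- mean(pred == y[-tr_idx])
    target1 <- mean(pred[y[-tr_idx] == 1] == "1")
    cat(sprintf("k = %2d  overall = %.3f  target = 1: %.3f\n",
                k, overall, target1))
  }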

Page 12:

Support Vector Machine (SVM)

• Expensive; takes a long time for each run
• Good results for numerical data

  Accuracy
  Overall       78.1%
  Target = 1    13.3%
  Target = 0    97.6%

  Confusion matrix        Prediction
                          0         1
  Truth   0               19,609    483
          1               5,247     803
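A minimal sketch with e1071::svm (the radial kernel, the 20,000-row subsample, and the object names are assumptions; scores, y, and tr_idx come from the earlier sketches):

  library(e1071)

  # SVM is expensive, so fit on a random subsample of the training rows
  set.seed(2)
  sub     <- sample(tr_idx, 20000)
  fit_svm <- svm(x = scores[sub, ], y = factor(y[sub]),
                 kernel = "radial", cost = 1)

  pred <- predict(fit_svm, scores[-tr_idx, ])
  table(Truth = y[-tr_idx], Prediction = pred)   # confusion matrix as above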

Page 13:

Logistic Regression

• Logistic regression is a regression model in which the dependent variable is categorical.

• It models the relationship between the dependent variable and the independent variables by estimating class probabilities.
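A minimal sketch with glm and a binomial family, fit on the principal-component scores (scores holds the first 100 PCs from the LDA sketch; the accuracy-vs-PC curves on the next slide vary this number, and the object names are illustrative):

  df_train <- data.frame(target = y[tr_idx],  scores[tr_idx, ])
  df_test  <- data.frame(target = y[-tr_idx], scores[-tr_idx, ])

  fit_lr <- glm(target ~ ., data = df_train, family = binomial)

  prob <- predict(fit_lr, newdata = df_test, type = "response")
  pred <- as.integer(prob > 0.5)
  table(Truth = df_test$target, Prediction = pred)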

Page 14:

Logistic Regression

  Accuracy
  Overall       79.2%
  Target = 1    28.1%
  Target = 0    94.5%

  Confusion matrix        Prediction
                          0         1
  Truth   0               53,921    3,159
          1               12,450    4,853

[Figures: overall accuracy and Target = 1 accuracy vs. number of principal components (2 to 320).]

Page 15:

Random Forest

• Machine learning ensemble algorithm: combines multiple predictors
• Based on tree models
• Works for both regression and classification
• Automatic variable selection
• Handles missing values
• Robust; improves model stability and accuracy
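A minimal sketch with the randomForest package (500 trees matches the error plot on the results slide; other settings are defaults, and X, y, tr_idx come from the earlier sketches):

  library(randomForest)

  # Slow on the full data; a subsample of rows or features may be needed
  fit_rf <- randomForest(x = X[tr_idx, ], y = factor(y[tr_idx]), ntree = 500)

  pred <- predict(fit_rf, X[-tr_idx, ])
  table(Truth = y[-tr_idx], Prediction = pred)
  plot(fit_rf)   # misclassification error vs. number of trees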

Page 16:

Random Forest

Train dataset → draw bootstrap samples → build a random tree on each sample → predict with each tree → majority vote.

[Figure: a random tree.]

Page 17:

Random Forest

  Accuracy
  Overall       79.3%
  Target = 1    20.1%
  Target = 0    96.8%

  Confusion matrix        Prediction
                          0         1
  Truth   0               36,157    1,181
          1               8,850     2,223

[Figure: misclassification error vs. number of trees (up to 500), for Target = 1, Overall, and Target = 0.]

Page 18:

XGBoost

• Additive tree model: add new trees that complement the already-built ones
• The response is the optimal linear combination of all decision trees
• Popular in Kaggle competitions for its efficiency and accuracy

[Figure: greedy algorithm; error decreases as trees are added to the additive tree model.]
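A minimal sketch with the xgboost package (the parameter values are illustrative, not the team's tuned settings; X, y, and tr_idx come from the earlier sketches):

  library(xgboost)

  dtrain <- xgb.DMatrix(as.matrix(X[tr_idx, ]),  label = y[tr_idx])
  dtest  <- xgb.DMatrix(as.matrix(X[-tr_idx, ]), label = y[-tr_idx])

  fit_xgb <- xgb.train(
    params    = list(objective = "binary:logistic", eta = 0.05, max_depth = 6,
                     subsample = 0.8, colsample_bytree = 0.8),
    data      = dtrain, nrounds = 500,
    watchlist = list(train = dtrain, test = dtest), verbose = 0
  )

  pred <- as.integer(predict(fit_xgb, dtest) > 0.5)
  table(Truth = y[-tr_idx], Prediction = pred)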


Page 20:

XGBoost

  Accuracy
  Overall       80.0%
  Target = 1    26.8%
  Target = 0    96.1%

  Confusion matrix        Prediction
                          0         1
  Truth   0               35,744    1,467
          1               8,201     2,999

[Figure: train error and test error curves.]

Page 21:

Methods Comparison

Accuracy (%) of each method (bar chart):

  Overall:      77.0   78.1   77.8   79.0   79.2   80.0
  Target = 1:    6.6   13.3   19.0   20.1   28.1   26.8

The overall accuracies are close, but the methods differ sharply in how well they identify the Target = 1 class.

Page 22:

Winner or Combination?

Page 23:

Stacking

• Main idea: learn and combine multiple classifiers.

[Diagram: labeled training data is fed to base learners C1, C2, …, Cn; their predictions form meta-features, which a meta learner combines into the final prediction on the test set.]

Page 24:

Generating Base and Meta Learners

• Base models (for efficiency, accuracy and diversity):
  – Sampling training examples
  – Sampling features
  – Using different learning models
• Meta learner:
  – Majority voting (unsupervised)
  – Weighted averaging (unsupervised)
  – K-means (unsupervised)
  – Higher-level classifier, supervised (XGBoost)

Page 25:

Stacking model

[Diagram: the total data, in several forms (sparse, condensed, low-level, PCA, …), is fed to base learners (XGBoost, Logistic Regression, Random Forest); their predictions become meta-features, which are combined and passed to an XGBoost meta learner that produces the final prediction.]
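A minimal sketch of this scheme: out-of-fold predictions from two base learners become meta-features, and an XGBoost meta learner combines them (the 5-fold split, the parameters, and the object names are illustrative, and only two of the base learners above are shown):

  library(xgboost)

  Xm      <- as.matrix(X)
  k_folds <- 5
  set.seed(3)
  folds   <- sample(rep(1:k_folds, length.out = length(tr_idx)))
  meta_tr <- matrix(NA_real_, nrow = length(tr_idx), ncol = 2,
                    dimnames = list(NULL, c("xgb", "logit")))

  for (f in 1:k_folds) {
    fit_rows  <- tr_idx[folds != f]     # rows used to fit the base learners
    pred_rows <- tr_idx[folds == f]     # held-out rows to predict (meta-features)

    # Base learner 1: XGBoost on the full feature matrix
    b1 <- xgboost(data = Xm[fit_rows, ], label = y[fit_rows],
                  objective = "binary:logistic", nrounds = 200, verbose = 0)
    meta_tr[folds == f, "xgb"] <- predict(b1, Xm[pred_rows, ])

    # Base learner 2: logistic regression on the PCA scores
    b2 <- glm(target ~ ., family = binomial,
              data = data.frame(target = y[fit_rows], scores[fit_rows, ]))
    meta_tr[folds == f, "logit"] <-
      predict(b2, newdata = data.frame(scores[pred_rows, ]), type = "response")
  }

  # Meta learner: XGBoost trained on the stacked out-of-fold predictions
  meta_fit <- xgboost(data = meta_tr, label = y[tr_idx],
                      objective = "binary:logistic", nrounds = 100, verbose = 0)

As the stacking results on the next page show, the XGBoost meta learner outperformed simple averaging and K-means.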

Page 26:

Stacking Results

  Base Model                              Accuracy    Accuracy (target = 1)
  XGB + total data                        80.0%       28.5%
  XGB + condensed data                    79.5%       27.9%
  XGB + low-level data                    79.5%       27.7%
  Logistic regression + sparse data       78.2%       26.8%
  Logistic regression + condensed data    79.1%       28.1%
  Random forest + PCA                     77.6%       20.9%

  Meta Model     Accuracy    Accuracy (target = 1)
  XGB            81.11%      29.21%
  Averaging      79.44%      27.31%
  K-means        77.45%      23.91%

[Figures: accuracy of each base model and of the XGB meta model, overall and for target = 1.]


Page 28:

Summary and Conclusion

• Data mining project in the real world: huge and noisy data
• Data preprocessing:
  – Feature encoding
  – Different missing-value treatments: new level, median / mean, or random assignment
• Classification techniques:
  – Distance-based classifiers are not suitable
  – Classifiers that handle mixed variable types are preferred
  – Categorical variables are dominant
  – Stacking gives a further improvement
• The biggest improvements came from model selection, parameter tuning, and stacking
• Result comparison: winner's result 80.4%, our result 79.5%

Page 29:

Acknowledgements

We would like to express our deep gratitude to the following people and organizations:

• Profs. Bremer and Simic for the proposal that made this project possible
• The Woodward Foundation for funding
• Prof. Simic and CAMCOS for all the support
• Prof. Chen for his guidance, valuable comments and suggestions

Page 30:

QUESTIONS?