
Competition II: Springleaf

Sha Li (Team leader), Xiaoyan Chong, Minglu Ma, Yue Wang

CAMCOS Fall 2015, San Jose State University

Agenda

• Kaggle Competition: Springleaf dataset introduction
• Data Preprocessing
• Classification Methodologies & Results
  – Logistic Regression
  – Random Forest
  – XGBoost
  – Stacking
• Summary & Conclusion

Kaggle Competition: Springleaf

Objective: Predict whether customers will respond to a direct mail loan offer

• Customers: 145,231
• Independent variables: 1,932
• “Anonymous” features
• Dependent variable:
  – target = 0: DID NOT RESPOND
  – target = 1: RESPONDED
• Training set: 96,820 obs.
• Testing set: 48,411 obs.

Dataset facts

• File read with data.table::fread
• Target = 0 obs.: 111,458
• Target = 1 obs.: 33,773
• Numerical variables: 1,876
• Character variables: 51
• Constant variables: 5
• Variable level counts:
  – 67.0% of columns have <= 100 levels

[Figures: count of levels for each column; class 0 vs. class 1 counts (76.7% / 23.3%); variable counts by type]
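A minimal R sketch of reading the data and reproducing these counts (the file name and helper code are illustrative assumptions, not the team's actual script):

library(data.table)

# Read the Springleaf training file (file name assumed)
train <- fread("train.csv")

# Class balance of the response: 0 = did not respond, 1 = responded
table(train$target)

# Variable types
sum(sapply(train, is.character))   # character columns
sum(sapply(train, is.numeric))     # numerical columns

# Distinct levels per column; share of columns with <= 100 levels
n_levels <- sapply(train, function(x) length(unique(x)))
mean(n_levels <= 100)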

Missing values

• "", "NA": 0.6%
• "[]", -1: 2.0%
• -99999, 96, …, 999, …, 99999999: 24.9%
• 25.3% of columns have missing values

[Figures: count of NAs in each column; count of NAs in each row]
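One way to recode these disguised missing values in R; only the placeholder codes listed above are used, and the values elided by "…" on the slide are not guessed:

# Placeholder codes that actually mean "missing" (from the bullets above)
na_strings <- c("", "NA", "[]")
na_numbers <- c(-1, -99999, 96, 999, 99999999)   # plus similar sentinel codes elided on the slide

recode_na <- function(x) {
  if (is.character(x)) x[x %in% na_strings] <- NA
  if (is.numeric(x))   x[x %in% na_numbers] <- NA
  x
}

train <- as.data.frame(train)
train[] <- lapply(train, recode_na)

mean(is.na(train))                              # overall missing rate
mean(sapply(train, function(x) any(is.na(x))))  # share of columns with NAs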

Challenges for classification

• Huge dataset (145,231 × 1,932)
• “Anonymous” features
• Uneven distribution of the response variable
• 27.6% missing values
• Both numerical and categorical variables
• Undetermined portion of categorical variables
• Data pre-processing complexity

Data preprocessing

• Remove ID and target
• Replace "[]" and -1 with NA
• Remove duplicate columns
• Replace (encode) character columns
• Remove low-variance columns
• Missing values: replace NA by the median, replace NA randomly, or regard NA as a new group
• Normalize: log(1 + |x|)
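A rough base-R sketch of these steps (column names, thresholds and the encoding choice are assumptions; NA recoding was sketched earlier):

# Separate the features from the ID and the response
X <- train[, setdiff(names(train), c("ID", "target"))]
y <- train$target

# Remove constant / low-variance columns
keep <- sapply(X, function(x) length(unique(na.omit(x))) > 1)
X <- X[, keep]

# Remove exactly duplicated columns
X <- X[, !duplicated(as.list(X))]

# Encode character columns as integer codes (one simple option)
is_char <- sapply(X, is.character)
X[is_char] <- lapply(X[is_char], function(x) as.integer(factor(x)))

# Normalize heavy-tailed numeric columns: log(1 + |x|)
is_num <- sapply(X, is.numeric)
X[is_num] <- lapply(X[is_num], function(x) log1p(abs(x)))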

Principal Component Analysis

With around 400 principal components, about 90% of the variance is explained.

[Figure: cumulative proportion of variance explained vs. number of principal components]
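A short sketch of this step with prcomp (the 90% threshold comes from the slide; everything else is illustrative):

# PCA on the preprocessed feature matrix (centered and scaled)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
which(cum_var >= 0.90)[1]        # roughly 400 on this data, per the slide

# Scores on the first 400 PCs (used below purely for illustration)
X_pca <- pca$x[, 1:400]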

LDA: Linear Discriminant Analysis

• We are interested in the most discriminatory direction, not the direction of maximum variance.
• Find the direction that best separates the two classes.

[Figure: when the within-class variances Var1 and Var2 are large and the class means µ1 and µ2 are close, the two classes overlap significantly]
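That direction can be computed with MASS::lda; a minimal sketch reusing the hypothetical X_pca and y from above:

library(MASS)

# LDA maximizes between-class separation relative to within-class spread
fit_lda <- lda(x = X_pca, grouping = factor(y))

# Project onto the single discriminant direction and classify
proj <- predict(fit_lda)
table(truth = y, prediction = proj$class)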

Methodology

• K Nearest Neighbor (KNN)
• Support Vector Machine (SVM)
• Logistic Regression
• Random Forest
• XGBoost (eXtreme Gradient Boosting)
• Stacking

K Nearest Neighbor (KNN)

Accuracy (%) by K, for the overall test set and for the target = 1 class:

K            3     5     7     11    15    21    39
Overall      72.1  73.9  75.0  76.1  76.5  76.8  77.0
Target = 1   22.8  18.3  15.3  12.1  10.5   9.4   7.5
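A hedged class::knn sketch behind this table (the train/test split and use of PCA scores are assumptions; the K values follow the slide):

library(class)

# Split into the 96,820 training and 48,411 testing observations (indices assumed)
idx_train <- 1:96820

knn_pred <- knn(train = X_pca[idx_train, ],
                test  = X_pca[-idx_train, ],
                cl    = factor(y[idx_train]),
                k     = 15)

# Overall accuracy and accuracy on the minority class (target = 1)
mean(knn_pred == y[-idx_train])
mean(knn_pred[y[-idx_train] == 1] == 1)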

Support Vector Machine (SVM)

• Expensive; each run takes a long time
• Good results for numerical data

Accuracy
  Overall     78.1%
  Target = 1  13.3%
  Target = 0  97.6%

Confusion matrix (rows = truth, columns = prediction)
           Pred 0   Pred 1
Truth 0     19609      483
Truth 1      5247      803
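An e1071::svm sketch (training on a subsample since full runs are expensive; kernel, cost and subsample size are illustrative):

library(e1071)

# Train on a random subsample of the training indices to keep the run tractable
set.seed(1)
sub <- sample(idx_train, 20000)

fit_svm <- svm(x = X_pca[sub, ], y = factor(y[sub]),
               kernel = "radial", cost = 1)

svm_pred <- predict(fit_svm, X_pca[-idx_train, ])
table(truth = y[-idx_train], prediction = svm_pred)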

Logistic Regression

• Logistic regression is a regression model where the dependent variable is categorical.
• It measures the relationship between the dependent variable and the independent variables by estimating probabilities.

Logistic Regression

Accuracy
  Overall     79.2%
  Target = 1  28.1%
  Target = 0  94.5%

Confusion matrix (rows = truth, columns = prediction)
           Pred 0   Pred 1
Truth 0     53921     3159
Truth 1     12450     4853

[Figures: Logistic Regression accuracy vs. number of principal components (2 to 320), for the overall test set and for target = 1]
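The fit itself is a one-line glm call; a sketch on the principal-component scores (the number of PCs kept is a tuning choice, as the curves above suggest):

# Logistic regression on the first 200 PCs (PC count is illustrative)
df_train <- data.frame(target = y[idx_train], X_pca[idx_train, 1:200])
df_test  <- data.frame(X_pca[-idx_train, 1:200])

fit_glm <- glm(target ~ ., data = df_train, family = binomial())

# Predicted probabilities, thresholded at 0.5
p <- predict(fit_glm, newdata = df_test, type = "response")
glm_pred <- as.integer(p > 0.5)
table(truth = y[-idx_train], prediction = glm_pred)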

Random Forest

• Machine learning ensemble algorithm: combines multiple predictors
• Based on tree models
• For both regression and classification
• Automatic variable selection
• Handles missing values
• Robust, improving model stability and accuracy

Random forest workflow: training dataset → draw bootstrap samples → build a random tree on each sample → predict with each tree → majority vote.

[Figure: a random tree]

Random Forest

Accuracy
  Overall     79.3%
  Target = 1  20.1%
  Target = 0  96.8%

Confusion matrix (rows = truth, columns = prediction)
           Pred 0   Pred 1
Truth 0     36157     1181
Truth 1      8850     2223

[Figure: misclassification error vs. number of trees (up to 500), for the overall error and for target = 0 and target = 1]
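A randomForest sketch with 500 trees as in the error plot (other settings are package defaults or assumptions):

library(randomForest)

fit_rf <- randomForest(x = X_pca[idx_train, ],
                       y = factor(y[idx_train]),
                       ntree = 500)

rf_pred <- predict(fit_rf, X_pca[-idx_train, ])
table(truth = y[-idx_train], prediction = rf_pred)

# Out-of-bag misclassification error by number of trees (as plotted above)
head(fit_rf$err.rate)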

XGBoost

• Additive tree model: add new trees that complement the already-built ones
• Response is the optimal linear combination of all decision trees
• Popular in Kaggle competitions for efficiency and accuracy

[Figures: additive tree model built by a greedy algorithm; error vs. number of trees]


XGBoost

Accuracy
  Overall     80.0%
  Target = 1  26.8%
  Target = 0  96.1%

[Figure: train error and test error curves]

Confusion matrix (rows = truth, columns = prediction)
           Pred 0   Pred 1
Truth 0     35744     1467
Truth 1      8201     2999
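An xgboost sketch matching the additive-tree description (learning rate, depth and number of rounds are illustrative, not the tuned values):

library(xgboost)

dtrain <- xgb.DMatrix(data = X_pca[idx_train, ],  label = y[idx_train])
dtest  <- xgb.DMatrix(data = X_pca[-idx_train, ], label = y[-idx_train])

# Each boosting round greedily adds a tree that complements the current model
fit_xgb <- xgb.train(params = list(objective = "binary:logistic",
                                   eta = 0.1, max_depth = 6),
                     data = dtrain, nrounds = 300,
                     watchlist = list(train = dtrain, test = dtest),
                     verbose = 0)

xgb_pred <- as.integer(predict(fit_xgb, dtest) > 0.5)
table(truth = y[-idx_train], prediction = xgb_pred)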

Methods Comparison

[Figure: overall and target = 1 accuracy by method. Overall: 77.0, 78.1, 77.8, 79.0, 79.2, 80.0; Target = 1: 6.6, 13.3, 19.0, 20.1, 28.1, 26.8]

Winner or Combination?

Stacking

• Main idea: learn and combine multiple classifiers

[Figure: labeled data is split into train and test; base learners C1, C2, …, Cn produce meta-features, which a meta learner combines into the final prediction]

Generating Base and Meta Learners

• Base models (efficiency, accuracy and diversity):
  – Sampling training examples
  – Sampling features
  – Using different learning models
• Meta learner:
  – Unsupervised: majority voting, weighted averaging, K-means
  – Supervised: a higher-level classifier (XGBoost)
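A condensed sketch of the stacking pipeline (base-learner predictions become meta-features for an XGBoost meta learner; it reuses the hypothetical fits above, and in practice the training-set meta-features should be out-of-fold predictions to avoid leakage):

# Meta-features: base-learner predictions (out-of-fold in a careful implementation)
meta_train <- cbind(xgb = predict(fit_xgb, dtrain),
                    glm = predict(fit_glm, newdata = data.frame(X_pca[idx_train, 1:200]),
                                  type = "response"),
                    rf  = predict(fit_rf, X_pca[idx_train, ], type = "prob")[, 2])
meta_test  <- cbind(xgb = predict(fit_xgb, dtest),
                    glm = p,
                    rf  = predict(fit_rf, X_pca[-idx_train, ], type = "prob")[, 2])

# Meta learner: a second-level XGBoost on the meta-features
dm_train <- xgb.DMatrix(meta_train, label = y[idx_train])
fit_meta <- xgb.train(params = list(objective = "binary:logistic", eta = 0.1),
                      data = dm_train, nrounds = 100)

final_pred <- as.integer(predict(fit_meta, xgb.DMatrix(meta_test)) > 0.5)
mean(final_pred == y[-idx_train])      # overall stacking accuracy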

Stacking model

[Figure: stacking architecture. Base learners (XGBoost on total / condensed / low-level data, Logistic Regression on sparse / condensed data, Random Forest on PCA data) produce predictions; these meta-features, combined with the total data, feed an XGBoost meta learner that gives the final prediction]

Stacking Results

Base Model                             Accuracy   Accuracy (target = 1)
XGB + total data                       80.0%      28.5%
XGB + condensed data                   79.5%      27.9%
XGB + low-level data                   79.5%      27.7%
Logistic regression + sparse data      78.2%      26.8%
Logistic regression + condensed data   79.1%      28.1%
Random forest + PCA                    77.6%      20.9%

Meta Model   Accuracy   Accuracy (target = 1)
XGB          81.11%     29.21%
Averaging    79.44%     27.31%
K-means      77.45%     23.91%

[Figures: accuracy of the base models; accuracy of the XGB meta model]


Summary and Conclusion

• Data mining project in the real world
  – Huge and noisy data
• Data preprocessing
  – Feature encoding
  – Different missing-value treatments: new level, median/mean, or random assignment
• Classification techniques
  – Distance-based classifiers are not suitable
  – Classifiers that handle mixed variable types are preferred
  – Categorical variables are dominant
  – Stacking gives a further improvement
• Biggest improvement came from model selection, parameter tuning, and stacking
• Result comparison: winner's result 80.4%, our result 79.5%

Acknowledgements

We would like to express our deep gratitude to the following people and organizations:

• Profs. Bremer and Simic for their proposal that made this project possible

• Woodward Foundation for funding
• Prof. Simic and CAMCOS for all the support
• Prof. Chen for his guidance, valuable comments and suggestions

QUESTIONS?