Predictive model

Post on 10-Jan-2017

Income Analysis

Ping Yin
11/10/2016

Contents
• Executive Summary
• Introduction
• Purpose
• Methodology
  • Data Selection
  • Exploration
  • Preparation & Transformation
  • Model Development & Assessment
  • Model Comparison
• Options and Recommendations
• Summary
• Appendix

Executive Summary
• After data preparation and partitioning, three models were built in SAS Studio, Enterprise Miner (EM), and DataRobot
• The same test dataset was scored by all three models
• The model built in EM performed best

Introduction
• Can we predict Income level based on age, gender, education, etc.?

• What is my income level after I graduate?

Purpose

• Identify the best predictive model for the Income dataset

• Predict my Income level

• Practice data preparation, model building, and model assessment

Data Selection
• The Income dataset was originally extracted from the 1994 Census Bureau database
• Downloaded from Kaggle.com
• Reasons for choosing it:
  • The target variable, Income, is categorical
  • Medium size: 10+ columns and 30K+ rows
  • Used in the Macro and DataRobot projects

Exploration
• Using SAS Studio to explore the data
• 32,561 observations
• 15 variables: 6 Num, 9 Char
  • Num: Age, Capitalgain, Capitalloss, Weekhour, Edunum, Fnlwgt
  • Char: Income, Relationship, Education, Occupation, Sex, Marital, Workclass, Race, Nativecountry
• Target: Income (“>50K”, “<=50K”)

Exploration
Data issues:
• Missing values: Workclass, Occupation, Nativecountry
• Many levels: Education, Marital, Workclass, Nativecountry
• Numeric variables: Capitalgain, Capitalloss
• Screen variable: Fnlwgt
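
The missing-value and level-count checks listed above could be reproduced outside SAS Studio roughly as follows. This is a minimal Python sketch, not the author's code: the sample rows are invented, and it assumes missing values are coded as "?" in the file.

```python
from collections import Counter

MISSING = "?"  # assumption: how missing values are coded in the file

# Hypothetical sample rows; the real dataset has 32,561 observations.
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "State-gov", "Occupation": "Adm-clerical", "Nativecountry": "?"},
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "Cuba"},
]


def profile(rows, column):
    """Return (number of missing values, number of distinct non-missing levels)."""
    counts = Counter(row[column] for row in rows)
    missing = counts.pop(MISSING, 0)
    return missing, len(counts)


summary = {col: profile(rows, col) for col in rows[0]}
for col, (missing, levels) in summary.items():
    print(f"{col}: {missing} missing, {levels} levels")
```

Columns with a non-zero missing count are candidates for imputation, and columns with many levels are candidates for binning.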

Preparation & Transformations
• Solutions:
  • Imputing missing values using subject-matter knowledge: missing values of Workclass and Occupation are imputed with “Unemployed”
  • Imputing missing values using the mode: missing values of Nativecountry are imputed with “United-States”
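
The two imputation rules above can be sketched in a few lines of Python. The sample rows and the helper names `impute_constant` and `impute_mode` are hypothetical, and the "?" missing-value code is an assumption; the original work was done in SAS.

```python
from collections import Counter

MISSING = "?"  # assumption: how missing values are coded in the file

# Hypothetical sample rows.
rows = [
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "United-States"},
    {"Workclass": "?", "Occupation": "?", "Nativecountry": "United-States"},
    {"Workclass": "Private", "Occupation": "Sales", "Nativecountry": "?"},
]


def impute_constant(rows, column, value):
    """Replace missing entries in `column` with a fixed value."""
    for row in rows:
        if row[column] == MISSING:
            row[column] = value


def impute_mode(rows, column):
    """Replace missing entries in `column` with the most frequent observed level."""
    observed = Counter(r[column] for r in rows if r[column] != MISSING)
    mode = observed.most_common(1)[0][0]
    impute_constant(rows, column, mode)
    return mode


# Subject-matter rule: no work class or occupation is read as not employed.
for col in ("Workclass", "Occupation"):
    impute_constant(rows, col, "Unemployed")

# Mode rule for the country of origin.
mode_used = impute_mode(rows, "Nativecountry")
```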

Preparation & Transformations
• Solutions:
  • Converting Capitalgain and Capitalloss from Num to Char
  • Binning variables with many levels: Education, Marital, Workclass
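
One way to turn a numeric amount such as Capitalgain into a character variable is to map it to a small set of named levels. The sketch below is only an illustration: the cutoffs and level names are hypothetical, since the slides do not state which ones were used.

```python
def to_category(amount, cutoffs=(0, 5000)):
    """Map a numeric amount to a text level.

    The cutoffs are hypothetical; the slides do not state the actual rule.
    """
    low, high = cutoffs
    if amount <= low:
        return "None"
    if amount <= high:
        return "Low"
    return "High"


capitalgain = [0, 2174, 14084, 0]  # invented sample values
capitalgain_char = [to_category(v) for v in capitalgain]
print(capitalgain_char)
```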

Preparation & Transformations
• Solutions:
  • Binning Nativecountry to create a new variable: Region
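
Binning a many-level variable into a new one is essentially a lookup. The sketch below derives a Region field from Nativecountry; the country-to-region groups and the "Other" fallback are hypothetical, because the slides do not list the actual mapping.

```python
# Hypothetical grouping; the actual country-to-region mapping is not given.
REGION_BY_COUNTRY = {
    "United-States": "North America",
    "Canada": "North America",
    "Mexico": "Latin America",
    "Cuba": "Latin America",
    "India": "Asia",
    "China": "Asia",
    "Germany": "Europe",
    "England": "Europe",
}


def add_region(rows, default="Other"):
    """Add a Region field to each row, derived from Nativecountry."""
    for row in rows:
        row["Region"] = REGION_BY_COUNTRY.get(row["Nativecountry"], default)
    return rows


rows = add_region([
    {"Nativecountry": "United-States"},
    {"Nativecountry": "India"},
    {"Nativecountry": "Iran"},  # not in the lookup, so it falls back to "Other"
])
```

The same lookup pattern would apply to Education, Marital, and Workclass, each with its own mapping.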

Preparation & Transformations
• Reasons for dropping the variable Fnlwgt:
  • It is the weight from the Current Population Survey files, not original Census data
  • It showed near-zero importance in last week's DataRobot project

Preparation & Transformations
• Reasons for leaving the variable Occupation unchanged:
  • It has 15 levels
  • There is no sound criterion for binning them
• Reasons for leaving the variables Race and Relationship unchanged:
  • They have only 5-6 levels
  • Each level is meaningful

Preparation & Transformations
After preparation: [figures not reproduced in the transcript]

Preparation & Transformations
• Data partition using the Strata method
• The training dataset and the test dataset are used in SAS Studio, Enterprise Miner, and DataRobot

Now it is ready to go!
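
A stratified partition keeps the share of each Income level the same in the training and test datasets. The following is a minimal Python sketch of that idea, not the SAS procedure used in the project: the function name `stratified_split`, the 70% training fraction, the seed, and the sample data are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, train_fraction=0.7, seed=42):
    """Split rows into train/test so each target level keeps its share."""
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[target]].append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible partition
    train, test = [], []
    for level_rows in by_level.values():
        rng.shuffle(level_rows)
        cut = round(len(level_rows) * train_fraction)
        train.extend(level_rows[:cut])
        test.extend(level_rows[cut:])
    return train, test


# Invented data: 30 rows of ">50K" and 70 rows of "<=50K".
rows = [{"id": i, "Income": ">50K" if i < 30 else "<=50K"} for i in range(100)]
train, test = stratified_split(rows, "Income", train_fraction=0.7)
```

Because the split is done within each level, both partitions keep the original 30/70 mix of the two Income levels.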

Model Development & Assessment: SAS Studio

Model Development & Assessment: EM

Model Development & Assessment: DataRobot

Model Comparison
• The best model in this project, compared across EM, SAS Studio, and DataRobot [figure not reproduced in the transcript]
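
Scoring the same test dataset with every model makes the results directly comparable. The sketch below shows one way to rank models on a shared test set. Accuracy is an assumed metric (the slides do not name the one used), and the labels, predictions, and model names are invented for illustration.

```python
def accuracy(actual, predicted):
    """Share of test rows where the predicted level equals the actual level."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


# Invented labels and predictions, for illustration only.
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
predictions = {
    "model_a": [">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    "model_b": [">50K", "<=50K", ">50K", "<=50K", "<=50K"],
}

scores = {name: accuracy(actual, pred) for name, pred in predictions.items()}
best = max(scores, key=scores.get)  # model with the highest accuracy
```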

Model Comparison: Predict my Income level
Ping Dataset scored by EM, SAS Studio, and DataRobot [results not reproduced in the transcript]

Options and Recommendations
• Using 60% of the data to build a model
• Using 70% of the data to build a model

Options and Recommendations
• Macro Project
• DataRobot Project
• The overall best model

Options and Recommendations
• Factors that may cause these differences:
  • Dropping the variable Fnlwgt
  • Reducing levels
  • Variable transformation: Capitalgain, Capitalloss
• These steps increase speed but decrease model performance

Options
• Use DataRobot to build models without handling the “data issues”
• Keep trying in SAS Studio

Summary
• We can predict Income level based on these characteristics
• For the Income dataset, DataRobot is the most robust tool for building models
• Be aware that data preparation can have unexpected outcomes
• Go back and forth until reaching a satisfactory result

Appendix
Link to Data:

https://www.kaggle.com/uciml/adult-census-Income

Thanks!