Measuring Variable Importance with Target Shuffling
Dean Abbott
Abbott Analytics
KNIME Fall Summit
#KNIMEFallSummit16
September 16, 2016
Twitter: @deanabb
Measuring Variable Importance with Target Input Shuffling
Dean Abbott
Co-Founder, Chief Data Scientist, and Chief Technology Officer, SmarterHQ
Twitter: @deanabb
© Abbott Analytics, 2001-2016
A SaaS contextual marketing technology Tier 1 brands use to drive conversion and loyalty through multi-channel personalization
AWS: Redshift, MySQL/Aurora, EC2, S3, Kinesis
Why Am I Talking About This Arcane Topic?
• I’ve been bothered by this for decades… yes, I’m that old
• It’s conceptually easy to do.
Variable Importance in Linear Regression
Variable Importance in Decision Trees
• Decision Trees
• You think they are easy to explain?
Variable Importance in Neural Networks
• Huh?
Variable Importance in Neural Networks
• Or what neural nets really look like…
Naïve Bayes Model Outputs
• Essentially a series of cross-tabs for every variable!
• Remember, the final probability is the product of the individual variable probabilities.
SVM Output
Neural Networks: Interpretation via Sensitivities
• Sensitivities reflect the amount of change in the outputs when each input is changed, or “wiggled,” by some small amount; a larger sensitivity means the output changes more for a small change in the input.
• Provides a measure of the importance of each input variable in the model (by itself)
• Sensitivities can be used to reduce the input variables in other neural network, decision tree, or regression models
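The wiggle-and-measure idea can be sketched with a simple finite-difference probe. This is a minimal illustration, not the KNIME implementation; the `predict` function here is a hypothetical stand-in for any trained network’s scoring function:

```python
import numpy as np

# A stand-in for any trained black-box model's scoring function
# (e.g., a neural network); here a hypothetical nonlinear function.
def predict(X):
    return 0.5 * X[:, 0] + 0.2 * X[:, 1] ** 2 + 0.3 * np.sin(X[:, 2])

rng = np.random.default_rng(0)
X = rng.normal(20, 5, size=(1000, 3))

# Wiggle each input by a small delta and measure the average absolute
# change in the predictions: larger value = more sensitive input.
delta = 0.01
sensitivities = []
for j in range(X.shape[1]):
    X_up = X.copy()
    X_up[:, j] += delta
    sensitivities.append(np.mean(np.abs(predict(X_up) - predict(X))) / delta)

print(sensitivities)  # input 2 dominates here because of its squared term
```

Note the sensitivity of a purely linear term is just its coefficient, which is why this probe reduces to the coefficient story for linear regression.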
KNIME Random Forest Node Helps with Importance
Outline
• Classical variable importance: linear regression
• Hack #1: use linear regression model statistics to infer variable importance
• Hack #2: use target shuffling to infer variable importance
The Data: Easiest Possible!
• 3 inputs: each is a random Normal: mean = 20, std = 5
• Target variable: 0.5*var1 + 0.2*var2 + 0.3*var3
• 95,412 records (same size as cup98lrn)
Let’s Start with Normal
Variable Importance Using Linear Regression Coefficients
• Coefficients match (by definition) the proportions used to build the target variable
• This is the average influence of each input on the predictions for all records
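A minimal sketch of this setup: generate the idealized data from the earlier slide and fit ordinary least squares (here via `numpy.linalg.lstsq` as a stand-in for whatever regression node is used). Because the target is a noise-free linear combination, the fit recovers the construction weights exactly:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 95_412  # same size as cup98lrn

# Three independent Normal(mean=20, std=5) inputs
X = rng.normal(20, 5, size=(n, 3))

# Target is the exact (noise-free) linear combination from the slides
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + 0.3 * X[:, 2]

# Ordinary least squares recovers the construction weights
A = np.column_stack([X, np.ones(n)])  # add an intercept column
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coefs[:3])  # ≈ [0.5, 0.2, 0.3]; intercept ≈ 0
```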
t-Proportion for Each Variable to Assess Influence
• The t-value measures the significance of the relationship.
• It turns out that the proportion of the t-values for the exact model matches the coefficients
Variable Importance Using Prediction Proportion
• How would an empiricist compute influence?
1. Compute the proportion of the prediction that comes from each term in the model
   1. Influence of variable 1 = W1 * var1
   2. Influence of variable 2 = W2 * var2
   3. Influence of variable 3 = W3 * var3
2. Average the influences over all records
Variable Importance Using Prediction Proportion
• Compute the contribution of each term in the linear regression model separately (for each record).
  • Var1_influence = $var1coef$ * $var1$, etc.
• Compute each contribution’s proportion of the predicted target variable value
• Average the proportions over all records to compute the average influence of each variable
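The steps above can be sketched in a few lines; this is an illustrative version assuming the idealized synthetic inputs and known coefficients, not the KNIME Math Formula workflow:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(20, 5, size=(10_000, 3))
W = np.array([0.5, 0.2, 0.3])  # fitted regression coefficients

# Per-record contribution of each term: W_j * var_j
contrib = np.abs(X * W)        # use abs() for all calculations

# Proportion of each record's prediction coming from each term,
# then averaged over all records to get the average influence
proportions = contrib / contrib.sum(axis=1, keepdims=True)
influence = proportions.mean(axis=0)
print(influence)  # roughly [0.5, 0.2, 0.3] for these idealized inputs
```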
So Far So Good
• Now let’s do the same analysis for
  • Neural Networks
  • Support Vector Machines
• Uh… maybe not
Do It the YACK Way
• Yet Another Creative use of KNIME
Why “Target Shuffling”?
• We don’t always have nice metrics to identify the best inputs with predictive models (NNets, SVMs, … anything other than regression!)
• Even with regression, we don’t always have nice inputs
• See John Elder’s introduction of Target Shuffling to the data mining community
http://semanticommunity.info/@api/deki/files/30744/Elder_-_Target_Shuffling_Sept.2014.pdf
Input Distributions Are Not Always Ideal
Why “Target Shuffling”?
• We don’t care about the “target” part
• The Target Shuffling node doesn’t care either
• It scrambles (randomly) a single column, which can be an input variable
• The Target Shuffling node doesn’t have to be in a loop; it can scramble one column while leaving the others in their natural order
• This captures the actual distribution of the data
Let’s Call It Input Shuffling
Principles of Input Shuffling
• Key: randomly re-select the value of a single input variable while leaving all other variables with their original values
• Compute the standard deviation (or some other measure of perturbation) for each record
  • Of the target variable predictions
  • NOT the actual target variable
• This perturbation is a measure of how influential the variable is in the model
  • High standard deviation -> lots of influence
  • Low standard deviation -> not much influence
  • ~0 standard deviation -> no influence
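The recipe can be sketched outside KNIME in a few lines. This is a minimal version assuming the idealized three-input data, with the known exact linear model standing in for any trained predictor:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
X = rng.normal(20, 5, size=(n, 3))

# Stand-in for any trained model's predict function; for the idealized
# data the exact linear model is known.
def predict(X):
    return X @ np.array([0.5, 0.2, 0.3])

n_shuffles = 50
influence = []
for j in range(X.shape[1]):
    preds = np.empty((n_shuffles, n))
    for s in range(n_shuffles):
        X_shuf = X.copy()
        # Shuffle only column j; all other inputs keep original values
        X_shuf[:, j] = rng.permutation(X_shuf[:, j])
        preds[s] = predict(X_shuf)
    # Std of each record's predictions across shuffles, averaged over records
    influence.append(np.std(preds, axis=0).mean())

# Normalize to proportions (the measure used on the slides that follow)
influence = np.array(influence) / np.sum(influence)
print(influence)  # roughly proportional to [0.5, 0.2, 0.3]
```

For a linear model, shuffling column j perturbs the prediction by roughly the coefficient times the input’s standard deviation, which is why the proportions land near the construction weights here.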
Shuffled Inputs Meta Node
• Two loops: (1) loop on the input variables and (2) loop on the shuffled input variable (50x or so)
Shuffling Inputs
• All inputs and target vs. just one input at a time
Single Record: What It Looks Like
• A single record across 50 “shuffles”: Row0
Average for All Records in the Data (~9K for this data set)
• Measures the spread of the predictions when randomly perturbing the single input variable
Variable Importance Using Input Shuffling for Idealized Linear Regression Data
• Compute the proportion of the average standard deviation from shuffling each input (keeping the others at their original values)
• (Yes, I know I’m averaging standard deviations!)
Realistic Data: KDD Cup 1998
• 95,412 records: cup98lrn from the KDD Cup 1998 Competition
• Use only the responders (4,843) in linear regression models
• Hundreds of fields in the data, but only 4 are used for research purposes
  • LASTGIFT, NGIFTALL, RFA_2F, D_RFA_2A
  • Continuous target
  • Two continuous inputs
  • One ordinal (RFA_2F)
  • One dummy (D_RFA_2A)
Realistic Data: KDD Cup 1998
• Heavy skew of LASTGIFT, NGIFTALL, TARGET_D
  • Makes visualization difficult
  • Biases regression coefficients (if one cares)
Could Use Normalized Data
• To remove influence of skew and scale
• Log10 transform LASTGIFT, NGIFTALL, TARGET_D
• Scale all variables (post log10) to [0, 1]
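A sketch of that normalization, using hypothetical right-skewed dollar amounts in place of LASTGIFT (the actual KDD Cup columns are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical heavily right-skewed gift amounts standing in for
# LASTGIFT / NGIFTALL / TARGET_D
lastgift = rng.lognormal(mean=2.5, sigma=0.8, size=1000)

# Log10 transform tames the skew...
lastgift_log10 = np.log10(lastgift)

# ...then min-max scale (post-log10) to [0, 1]
scaled = (lastgift_log10 - lastgift_log10.min()) / (
    lastgift_log10.max() - lastgift_log10.min()
)
print(scaled.min(), scaled.max())  # 0.0 1.0
```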
Normalized Data
• Relationships clearer
• LASTGIFT: strong positive correlation with TARGET_D
• NGIFTALL, RFA_2F, D_RFA_2A all have apparently slight negative correlation with TARGET_D
The Basic Model: Linear Regression
• Coefficient (use abs() for all calculations)
Linear Regression: Compare Influence Using Different Methods
• Coefficient
• t-Proportion
• Prediction Proportion
• Input Shuffling
(Use abs() for all coefficient and t-proportion calculations.)
Linear Regression, Neural Network, and Random Forest: Input Shuffling Influence
• Input Shuffling - LR, Input Shuffling - MLP
Repeat for More Inputs – KDD Cup 98
Apply Input Shuffling to the Larger KDD Cup 98 Data
Shuffle LASTGIFT_log10
Variable Influence from Regression Diagnostics
Input Shuffling Variable Influence: Regression
currentColumnName VariableInfluence_Linear VariableInfluence_RF VariableInfluence_GBM
D_RFA_2A 0.0518 0.0139 0.0051
LASTGIFT_log10 0.0477 0.0383 0.0596
E_RFA_2A 0.0426 0.0155 0.0153
F_RFA_2A 0.0266 0.0105 0.0037
MINRAMNT_log10 0.0077 0.0127 0.0113
RFA_2F 0.0073 0.0122 0.0063
A_GEOCODE2 0.0060 0.0020 0.0008
B_GEOCODE2 0.0057 0.0011 0.0002
MINRDATE 0.0040 0.0061 0.0085
NGIFTALL 0.0038 0.0075 0.0066
MAXRDATE 0.0028 0.0035 0.0044
C_GEOCODE2 0.0025 0.0005 0.0000
NUMPRM12 0.0024 0.0022 0.0033
DOMAIN3 0.0021 0.0008 0.0009
CARDPM12 0.0016 0.0026 0.0037
LASTDATE 0.0005 0.0029 0.0018
AGE_imputerand 0.0004 0.0029 0.0046
DOMAIN2 0.0002 0.0012 0.0002
NUMPROM 0.0001 0.0036 0.0067
DOMAIN1 0.0000 0.0000 0.0000
Accuracy Comparison on Testing Data
• Linear Regression, Random Forests, Gradient Boosting
Input Shuffling Variable Influence: Regression (Unnormalized!)
currentColumnName VariableInfluence_Linear VariableInfluence_RF VariableInfluence_GBM
E_RFA_2A 4.337 0.807 0.396
LASTGIFT 4.052 2.252 4.016
D_RFA_2A 3.566 0.625 0.245
F_RFA_2A 3.552 0.457 0.000
RAMNTALL 2.429 0.540 1.239
NGIFTALL 2.258 0.692 0.957
MINRAMNT 2.111 0.708 0.722
RFA_2F 1.274 0.618 0.480
FISTDATE 0.970 0.298 0.731
A_GEOCODE2 0.754 0.130 0.086
B_GEOCODE2 0.519 0.082 0.017
DOMAIN3 0.362 0.052 0.066
DOMAIN1 0.358 0.080 0.036
C_GEOCODE2 0.307 0.028 0.000
NUMPRM12 0.304 0.154 0.262
DOMAIN2 0.289 0.072 0.028
MAXRDATE 0.213 0.297 0.444
MINRDATE 0.200 0.345 0.455
CARDPM12 0.178 0.139 0.296
AGE_imputerand 0.174 0.202 0.363
MAXRAMNT 0.168 1.791 1.547
LASTDATE 0.036 0.240 0.243
Input Shuffling Variable Influence: Classification
Variable VariableInfluence_Logistic VariableInfluence_RF VariableInfluence_GBM
FISTDATE 0.0123 0.0349 0.0124
D_RFA_2A 0.0080 0.0024 0.0027
RFA_2F 0.0080 0.0176 0.0040
DOMAIN3 0.0072 0.0056 0.0057
E_RFA_2A 0.0069 0.0069 0.0055
NGIFTALL 0.0057 0.0347 0.0180
DOMAIN1 0.0011 0.0084 0.0013
LASTGIFT 0.0004 0.0236 0.0132
F_RFA_2A 0.0003 0.0103 0.0001
Discussion
• Why Input Shuffling is good
  • Works for any input distribution
  • Works with any algorithm
  • Measures importance based on the other input variables in their natural patterns rather than an idealized value (like the mean or mode)
  • Can use many metrics to measure what “importance” means to you
• Why Input Shuffling is not so good
  • Takes a long time to run if you have lots of inputs and lots of records
  • No statistically defensible metric to use (yet)
Conclusion
• Variable influence can be computed as a single score for any model
  • Coefficients aren’t good measures unless the variables conform to linear regression assumptions
  • Some models don’t have “coefficients” at all, so we can’t use the linear regression approach
  • Using target shuffling, we can generate randomized sensitivity scores easily for any model
• If inputs are not normally distributed, average overall influence doesn’t tell the full story (or may even tell a misleading story) about how valuable the variable is in predicting the target
• Breaking predictions into bins (deciles or another number of bins) allows us to compute an influence score for every part of the predicted range
  • Answers the question: for high predicted values, which variables are most influential?
Binning Predicted Values into Buckets (Deciles, Quintiles, …)
• Deciling the predicted values allows us to compute variable influence for each of these ranges of the predicted values. Note that the top and bottom bins have much larger variances.
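A sketch of per-decile influence, reusing the idealized linear data and shuffling only the first input (the `predict` function is again a hypothetical stand-in for a trained model, so the per-bin values here come out roughly flat rather than showing the skew-driven patterns on the following slides):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
X = rng.normal(20, 5, size=(n, 3))
W = np.array([0.5, 0.2, 0.3])

def predict(X):
    return X @ W  # stand-in for any trained model

base_preds = predict(X)

# Per-record perturbation from shuffling input 0, as before
n_shuffles = 50
preds = np.empty((n_shuffles, n))
for s in range(n_shuffles):
    X_shuf = X.copy()
    X_shuf[:, 0] = rng.permutation(X_shuf[:, 0])
    preds[s] = predict(X_shuf)
record_influence = np.std(preds, axis=0)

# Bin records into deciles of the (unshuffled) predicted value and
# average the influence within each bin
cuts = np.quantile(base_preds, np.linspace(0.1, 0.9, 9))
deciles = np.searchsorted(cuts, base_preds)
per_bin = [record_influence[deciles == b].mean() for b in range(10)]
print(np.round(per_bin, 3))
```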
LASTGIFT Influence
• LASTGIFT has stronger influence (positive) at the high end of predictions
• Significant influence for all predicted values
• Nearly constant influence for Bins 7-10
• Monotonic influence vs. predicted values
RFA_2F Influence
• RFA_2F has stronger influence (negative) at the low end of predictions
• Almost no influence for Bin 7 through Bin 10
• Monotonic influence vs. predicted values
NGIFTALL Influence
• NGIFTALL has stronger influence (negative) at the low end of predictions
• Mostly monotonic influence vs. predicted values
D_RFA_2A Influence
• D_RFA_2A has strong influence at the low end of predictions only (Bin 1 and Bin 2)
• No influence at all for Bin 3 through Bin 10