Measuring Variable Importance with Target Shuffling
Dean Abbott
Abbott Analytics
KNIME Fall Summit
#KNIMEFallSummit16
September 16, 2016
Twitter: @deanabb
Measuring Variable Importance with Target Input Shuffling
Dean Abbott
Co-Founder, Chief Data Scientist, and Chief Technology Officer, SmarterHQ
Twitter: @deanabb
© Abbott Analytics, 2001-2016
A SaaS contextual marketing technology Tier 1 brands use to drive conversion and loyalty through multi-channel personalization
AWS: Redshift, MySQL/Aurora, EC2, S3, Kinesis
Why Am I Talking About This Arcane Topic?
• I’ve been bothered by this for decades… yes, I’m that old
• It’s conceptually easy to do.
Variable Importance in Linear Regression
Variable Importance in Decision Trees
• Decision Trees
• You think they are easy to explain?
Variable Importance in Neural Networks
• Huh?
Variable Importance in Neural Networks
• Or what neural nets really look like…
Naïve Bayes Model Outputs
• Essentially a series of cross-tabs for every variable!
• Remember, the final probability is the product of the individual variable probabilities.
SVM Output
Neural Networks: Interpretation via Sensitivities
• Sensitivities reflect the amount of change in the outputs when each input is changed, or “wiggled,” by some small amount; a larger sensitivity means the output changes more for a small change in the input.
• Provides a measure of the importance of each input variable in the model (by itself)
• Sensitivities can be used to reduce the input variables in other neural network, decision tree, or regression models
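The wiggle-and-measure idea can be sketched with a simple finite-difference probe. This is a minimal illustration, not the KNIME implementation; the `predict` function here is a hypothetical stand-in for any trained network’s scoring function:

```python
import numpy as np

# A stand-in for any trained black-box model's scoring function
# (e.g., a neural network); here a hypothetical nonlinear function.
def predict(X):
    return 0.5 * X[:, 0] + 0.2 * X[:, 1] ** 2 + 0.3 * np.sin(X[:, 2])

rng = np.random.default_rng(0)
X = rng.normal(20, 5, size=(1000, 3))

# Wiggle each input by a small delta and measure the average absolute
# change in the predictions: larger value = more sensitive input.
delta = 0.01
sensitivities = []
for j in range(X.shape[1]):
    X_up = X.copy()
    X_up[:, j] += delta
    sensitivities.append(np.mean(np.abs(predict(X_up) - predict(X))) / delta)

print(sensitivities)  # input 2 dominates here because of its squared term
```

Note the sensitivity of a purely linear term is just its coefficient, which is why this probe reduces to the coefficient story for linear regression.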
KNIME Random Forest Node Helps with Importance
Outline
• Classical variable importance: linear regression
• Hack #1: use linear regression model statistics to infer variable importance
• Hack #2: use target shuffling to infer variable importance
The Data: Easiest Possible!
• 3 inputs: each is a random Normal: mean = 20, std = 5
• Target variable: 0.5*var1 + 0.2*var2 + 0.3*var3
• 95,412 records (same size as cup98lrn)
Let’s Start with Normal
Variable Importance Using Linear Regression Coefficients
• Coefficients match (by definition) the proportions used to build the target variable
• This is the average influence of each input on the predictions for all records
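A minimal sketch of this setup: generate the idealized data from the earlier slide and fit ordinary least squares (here via `numpy.linalg.lstsq` as a stand-in for whatever regression node is used). Because the target is a noise-free linear combination, the fit recovers the construction weights exactly:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 95_412  # same size as cup98lrn

# Three independent Normal(mean=20, std=5) inputs
X = rng.normal(20, 5, size=(n, 3))

# Target is the exact (noise-free) linear combination from the slides
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + 0.3 * X[:, 2]

# Ordinary least squares recovers the construction weights
A = np.column_stack([X, np.ones(n)])  # add an intercept column
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coefs[:3])  # ≈ [0.5, 0.2, 0.3]; intercept ≈ 0
```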
t-Proportion for Each Variable to Assess Influence
• The t-value measures the significance of the relationship.
• It turns out that the proportion of the t-values for the exact model matches the coefficients
Variable Importance Using Prediction Proportion
• How would an empiricist compute influence?
1. Compute the proportion of the prediction that comes from each term in the model
   1. Influence of variable 1 = W1 * var1
   2. Influence of variable 2 = W2 * var2
   3. Influence of variable 3 = W3 * var3
2. Average the influences over all records
Variable Importance Using Prediction Proportion
• Compute the contribution of each term in the linear regression model separately (for each record).
  • Var1_influence = $var1coef$ * $var1$, etc.
• Compute each contribution’s proportion of the predicted target variable value
• Average the proportions over all records to compute the average influence of each variable
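The steps above can be sketched in a few lines; this is an illustrative version assuming the idealized synthetic inputs and known coefficients, not the KNIME Math Formula workflow:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(20, 5, size=(10_000, 3))
W = np.array([0.5, 0.2, 0.3])  # fitted regression coefficients

# Per-record contribution of each term: W_j * var_j
contrib = np.abs(X * W)        # use abs() for all calculations

# Proportion of each record's prediction coming from each term,
# then averaged over all records to get the average influence
proportions = contrib / contrib.sum(axis=1, keepdims=True)
influence = proportions.mean(axis=0)
print(influence)  # roughly [0.5, 0.2, 0.3] for these idealized inputs
```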
So Far So Good
• Now let’s do the same analysis for
  • Neural Networks
  • Support Vector Machines
• Uh… maybe not
Do It the YACK Way
• Yet Another Creative use of KNIME
Why “Target Shuffling”?
• We don’t always have nice metrics to identify the best inputs with predictive models (NNets, SVMs, … anything other than regression!)
• Even with regression, we don’t always have nice inputs
• See John Elder’s introduction of Target Shuffling to the data mining community
http://semanticommunity.info/@api/deki/files/30744/Elder_-_Target_Shuffling_Sept.2014.pdf
Input Distributions Are Not Always Ideal
Why “Target Shuffling”?
• We don’t care about the “target” part
• The Target Shuffling node doesn’t care either
• It scrambles (randomly) a single column, which can be an input variable
• The Target Shuffling node doesn’t have to be in a loop; it can scramble one column while leaving the others in their natural order
• This captures the actual distribution of the data
Let’s Call It Input Shuffling
Principles of Input Shuffling
• Key: randomly re-select the value of a single input variable while leaving all other variables with their original values
• Compute the standard deviation (or some other measure of perturbation) for each record
  • Of the target variable predictions
  • NOT the actual target variable
• This perturbation is a measure of how influential the variable is in the model
  • High standard deviation -> lots of influence
  • Low standard deviation -> not much influence
  • ~0 standard deviation -> no influence
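The recipe can be sketched outside KNIME in a few lines. This is a minimal version assuming the idealized three-input data, with the known exact linear model standing in for any trained predictor:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
X = rng.normal(20, 5, size=(n, 3))

# Stand-in for any trained model's predict function; for the idealized
# data the exact linear model is known.
def predict(X):
    return X @ np.array([0.5, 0.2, 0.3])

n_shuffles = 50
influence = []
for j in range(X.shape[1]):
    preds = np.empty((n_shuffles, n))
    for s in range(n_shuffles):
        X_shuf = X.copy()
        # Shuffle only column j; all other inputs keep original values
        X_shuf[:, j] = rng.permutation(X_shuf[:, j])
        preds[s] = predict(X_shuf)
    # Std of each record's predictions across shuffles, averaged over records
    influence.append(np.std(preds, axis=0).mean())

# Normalize to proportions (the measure used on the slides that follow)
influence = np.array(influence) / np.sum(influence)
print(influence)  # roughly proportional to [0.5, 0.2, 0.3]
```

For a linear model, shuffling column j perturbs the prediction by roughly the coefficient times the input’s standard deviation, which is why the proportions land near the construction weights here.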
Shuffled Inputs Meta Node
• Two loops: (1) loop on the input variables and (2) loop on the shuffled input variable (50x or so)
Shuffling Inputs
• All inputs and target vs. just one input at a time
Single Record: What It Looks Like
• A single record across 50 “shuffles”: Row0
Average for All Records in the Data (~9K for this data set)
• Measures the spread of the predictions when randomly perturbing the single input variable
Variable Importance Using Input Shuffling for Idealized Linear Regression Data
• Compute the proportion of the average standard deviation from shuffling each input (keeping the others at their original values)
• (Yes, I know I’m averaging standard deviations!)
Realistic Data: KDD Cup 1998
• 95,412 records: cup98lrn from the KDD Cup 1998 Competition
• Use only the responders (4,843) in linear regression models
• Hundreds of fields in the data, but only 4 are used for research purposes
  • LASTGIFT, NGIFTALL, RFA_2F, D_RFA_2A
  • Continuous target
  • Two continuous inputs
  • One ordinal (RFA_2F)
  • One dummy (D_RFA_2A)
Realistic Data: KDD Cup 1998
• Heavy skew of LASTGIFT, NGIFTALL, TARGET_D
  • Makes visualization difficult
  • Biases regression coefficients (if one cares)
Could Use Normalized Data
• To remove influence of skew and scale
• Log10 transform LASTGIFT, NGIFTALL, TARGET_D
• Scale all variables (post log10) to [0, 1]
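A sketch of that normalization, using hypothetical right-skewed dollar amounts in place of LASTGIFT (the actual KDD Cup columns are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical heavily right-skewed gift amounts standing in for
# LASTGIFT / NGIFTALL / TARGET_D
lastgift = rng.lognormal(mean=2.5, sigma=0.8, size=1000)

# Log10 transform tames the skew...
lastgift_log10 = np.log10(lastgift)

# ...then min-max scale (post-log10) to [0, 1]
scaled = (lastgift_log10 - lastgift_log10.min()) / (
    lastgift_log10.max() - lastgift_log10.min()
)
print(scaled.min(), scaled.max())  # 0.0 1.0
```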
Normalized Data
• Relationships clearer
• LASTGIFT: strong positive correlation with TARGET_D
• NGIFTALL, RFA_2F, D_RFA_2A all have apparently slight negative correlation with TARGET_D
The Basic Model: Linear Regression
• Coefficient (use abs() for all calculations)
Linear Regression: Compare Influence Using Different Methods
• Coefficient
• t-Proportion
• Prediction Proportion
• Input Shuffling
(Use abs() for all coefficient and t-proportion calculations.)
Linear Regression, Neural Network, and Random Forest: Input Shuffling Influence
• Input Shuffling - LR, Input Shuffling - MLP
Repeat for More Inputs – KDD Cup 98
Apply Input Shuffling to the Larger KDD Cup 98 Data
Shuffle LASTGIFT_log10
Variable Influence from Regression Diagnostics
Input Shuffling Variable Influence: Regression
currentColumnName VariableInfluence_Linear VariableInfluence_RF VariableInfluence_GBM
D_RFA_2A 0.0518 0.0139 0.0051
LASTGIFT_log10 0.0477 0.0383 0.0596
E_RFA_2A 0.0426 0.0155 0.0153
F_RFA_2A 0.0266 0.0105 0.0037
MINRAMNT_log10 0.0077 0.0127 0.0113
RFA_2F 0.0073 0.0122 0.0063
A_GEOCODE2 0.0060 0.0020 0.0008
B_GEOCODE2 0.0057 0.0011 0.0002
MINRDATE 0.0040 0.0061 0.0085
NGIFTALL 0.0038 0.0075 0.0066
MAXRDATE 0.0028 0.0035 0.0044
C_GEOCODE2 0.0025 0.0005 0.0000
NUMPRM12 0.0024 0.0022 0.0033
DOMAIN3 0.0021 0.0008 0.0009
CARDPM12 0.0016 0.0026 0.0037
LASTDATE 0.0005 0.0029 0.0018
AGE_imputerand 0.0004 0.0029 0.0046
DOMAIN2 0.0002 0.0012 0.0002
NUMPROM 0.0001 0.0036 0.0067
DOMAIN1 0.0000 0.0000 0.0000
Accuracy Comparison on Testing Data
• Linear Regression, Random Forests, Gradient Boosting
Input Shuffling Variable Influence: Regression (Unnormalized!)
currentColumnName VariableInfluence_Linear VariableInfluence_RF VariableInfluence_GBM
E_RFA_2A 4.337 0.807 0.396
LASTGIFT 4.052 2.252 4.016
D_RFA_2A 3.566 0.625 0.245
F_RFA_2A 3.552 0.457 0.000
RAMNTALL 2.429 0.540 1.239
NGIFTALL 2.258 0.692 0.957
MINRAMNT 2.111 0.708 0.722
RFA_2F 1.274 0.618 0.480
FISTDATE 0.970 0.298 0.731
A_GEOCODE2 0.754 0.130 0.086
B_GEOCODE2 0.519 0.082 0.017
DOMAIN3 0.362 0.052 0.066
DOMAIN1 0.358 0.080 0.036
C_GEOCODE2 0.307 0.028 0.000
NUMPRM12 0.304 0.154 0.262
DOMAIN2 0.289 0.072 0.028
MAXRDATE 0.213 0.297 0.444
MINRDATE 0.200 0.345 0.455
CARDPM12 0.178 0.139 0.296
AGE_imputerand 0.174 0.202 0.363
MAXRAMNT 0.168 1.791 1.547
LASTDATE 0.036 0.240 0.243
Input Shuffling Variable Influence: Classification
Variable VariableInfluence_Logistic VariableInfluence_RF VariableInfluence_GBM
FISTDATE 0.0123 0.0349 0.0124
D_RFA_2A 0.0080 0.0024 0.0027
RFA_2F 0.0080 0.0176 0.0040
DOMAIN3 0.0072 0.0056 0.0057
E_RFA_2A 0.0069 0.0069 0.0055
NGIFTALL 0.0057 0.0347 0.0180
DOMAIN1 0.0011 0.0084 0.0013
LASTGIFT 0.0004 0.0236 0.0132
F_RFA_2A 0.0003 0.0103 0.0001
Discussion
• Why Input Shuffling is good
  • Works for any input distribution
  • Works with any algorithm
  • Measures importance based on the other input variables in their natural patterns rather than an idealized value (like the mean or mode)
  • Can use many metrics to measure what “importance” means to you
• Why Input Shuffling is not so good
  • Takes a long time to run if you have lots of inputs and lots of records
  • No statistically defensible metric to use (yet)
Conclusion
• Variable influence can be computed as a single score for any model
  • Coefficients aren’t good measures unless the variables conform to linear regression assumptions
  • Some models don’t have “coefficients” at all, so we can’t use the linear regression approach
  • Using target shuffling, we can generate randomized sensitivity scores easily for any model
• If inputs are not normally distributed, average overall influence doesn’t tell the full story (or may even tell a misleading story) about how valuable the variable is in predicting the target
• Breaking predictions into bins (deciles or another number of bins) allows us to compute an influence score for every part of the predicted range
  • Answers the question: for high predicted values, which variables are most influential?
Binning Predicted Values into Buckets (Deciles, Quintiles, …)
• Deciling the predicted values allows us to compute variable influence for each of these ranges of the predicted values. Note that the top and bottom bins have much larger variances.
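A sketch of per-decile influence, reusing the idealized linear data and shuffling only the first input (the `predict` function is again a hypothetical stand-in for a trained model, so the per-bin values here come out roughly flat rather than showing the skew-driven patterns on the following slides):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
X = rng.normal(20, 5, size=(n, 3))
W = np.array([0.5, 0.2, 0.3])

def predict(X):
    return X @ W  # stand-in for any trained model

base_preds = predict(X)

# Per-record perturbation from shuffling input 0, as before
n_shuffles = 50
preds = np.empty((n_shuffles, n))
for s in range(n_shuffles):
    X_shuf = X.copy()
    X_shuf[:, 0] = rng.permutation(X_shuf[:, 0])
    preds[s] = predict(X_shuf)
record_influence = np.std(preds, axis=0)

# Bin records into deciles of the (unshuffled) predicted value and
# average the influence within each bin
cuts = np.quantile(base_preds, np.linspace(0.1, 0.9, 9))
deciles = np.searchsorted(cuts, base_preds)
per_bin = [record_influence[deciles == b].mean() for b in range(10)]
print(np.round(per_bin, 3))
```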
LASTGIFT Influence
• LASTGIFT has stronger influence (positive) at the high end of predictions
• Significant influence for all predicted values
• Nearly constant influence for Bins 7-10
• Monotonic influence vs. predicted values
RFA_2F Influence
• RFA_2F has stronger influence (negative) at the low end of predictions
• Almost no influence for Bin 7 through Bin 10
• Monotonic influence vs. predicted values
NGIFTALL Influence
• NGIFTALL has stronger influence (negative) at the low end of predictions
• Mostly monotonic influence vs. predicted values
D_RFA_2A Influence
• D_RFA_2A has strong influence at the low end of predictions only (Bin 1 and Bin 2)
• No influence at all for Bin 3 through Bin 10