The zen of predictive modelling

Presciient Training The Zen of Predictive Modelling Eugene Dubossarsky [email protected] +61414573322 @cargomoose

Transcript of The zen of predictive modelling

Page 1: The zen of predictive modelling

Presciient Training

The Zen of Predictive Modelling

Eugene Dubossarsky [email protected] +61414573322 @cargomoose

Page 2: The zen of predictive modelling

What This Talk Isn’t About

But worth mentioning anyway:

R and The Sydney Users of R Forum

Analyst First

My Courses

Page 3: The zen of predictive modelling

Sydney Users of R Forum

• Just 1 shy of 500 members

• Regular meetups

• Study groups: introduction to R, “Machine Learning for Hackers”, “Elements of Statistical Learning”

Page 4: The zen of predictive modelling
Page 5: The zen of predictive modelling

R

• Do a Google image search for “ggplot2”

• Look for “r4stats”, “popularity”

• Join SURF

• Download R and start using it.

Page 6: The zen of predictive modelling
Page 7: The zen of predictive modelling

Analyst First

• Strategic, Cultural, Organisational, Human issues in analytics

• Making analytics work in organisations

• Focus on the Human side of analytics

• International: Australia, NZ, Singapore, US, Japan, India, Hong Kong

• analystfirst.com – see “core principles” and “what is analyst first?”

Page 8: The zen of predictive modelling

My Analytics Training Courses

• Predictive Modelling, Data Mining, R, Forensic Analytics, Visualisation, Forecasting training courses

• Sydney, Melbourne, Canberra, Singapore

• Public and in-house

• Pre-prepared or customised

• Informal coaching/mentoring

• Strategy, Review, Advice and Assistance with Analytics Capability Development in your organisation

Page 9: The zen of predictive modelling

The Zen of Predictive Modelling

• The Most Important Part of My “Predictive Modelling and Data Mining Course”

• What every user of predictive modelling should know

• What every manager and owner of predictive modelling capability must know

• “Open Secrets” known to the masters

Page 10: The zen of predictive modelling

The Zen of Predictive Modelling

• To save people time

• To see the forest for the trees

• To get real value out of predictive analytics

Page 11: The zen of predictive modelling

The Right Point of View

Which is unlike the other two?

• Kohonen neural network

• Backpropagation neural network

• CART decision tree

Page 12: The zen of predictive modelling

The Right Point of View

Which is unlike the other two?

• CART decision tree

• Random Forest

• Support Vector Machine

Page 13: The zen of predictive modelling

The Right Point of View

Which is unlike the other two?

• Backpropagation Neural Network

• Linear Model

• CART Decision Tree

Page 14: The zen of predictive modelling

The Right Point of View

• Out Of Sample Accuracy

• Robustness (Out of Time Accuracy)

• Interpretability

• Implementability

Page 17: The zen of predictive modelling

The Right Point of View

Why build predictive models?

• Insights

• Operational prediction

• “What-if” analysis

Page 18: The zen of predictive modelling

What Do All Predictive Models Have in Common?

All Predictive Models:

• Have a training set of predictors and outcomes

• Probably have a cross-validation and test set of predictors and outcomes too.

• Are “fit” (optimised) to minimise an error function between their actual and target outcomes

• Are probably cross-validated to control overfitting on an out-of-sample data set

• Provide information on the relationship between the predictors and outcomes in the data

• Can be used to score new data (make new predictions)

• Can be deployed in IT systems

• Can be interrogated for insights

• Are only as accurate as the data allows

• Provide a (fairly) accurate estimate of how well they will predict on new data
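
The common ground listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, and the names `fit_line`, `predict` and `rmse` are hypothetical helpers, not part of any library. The model is a one-predictor least-squares line, chosen purely because it is short; the same train, fit, score and estimate steps would apply to any model family.

```python
import random

def fit_line(xs, ys):
    """Fit y = a + b*x by minimising squared error on the training set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(model, xs):
    """Score records: apply the fitted parameters to predictor values."""
    a, b = model
    return [a + b * x for x in xs]

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted outcomes."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Training set and held-out test set of predictors and outcomes
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

model = fit_line(train_x, train_y)
train_error = rmse(train_y, predict(model, train_x))
test_error = rmse(test_y, predict(model, test_x))   # estimate of error on new data
new_scores = predict(model, [4.0, 7.5])             # scoring new records
```

The held-out `test_error`, not `train_error`, is the figure to quote as the estimate of how well the model will predict on new data.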

Page 19: The zen of predictive modelling

What Do All Predictive Model Insights Have in Common?

All Predictive Models:

• Have variable importance measures (a number of which can be applied to any model)

• Allow plotting predictors vs outcomes

• Have variable accuracy measures

• Can be resampled for more robust measures of accuracy
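
One importance measure that can be applied to any model is permutation importance: shuffle one predictor in the out-of-sample data and see how much the error grows. The sketch below is an assumption-laden illustration, not a method taken from the slides. The data is synthetic, the "black box" is a small hand-written nearest-neighbour predictor, and all function names are hypothetical.

```python
import random

def knn_predict(train_rows, train_y, row, k=5):
    """Black-box model: average outcome of the k nearest training records."""
    nearest = sorted(
        range(len(train_rows)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_rows[i], row)),
    )[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(model, rows, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, ys)) / len(ys)

def permutation_importance(model, rows, ys, column, rng):
    """Increase in out-of-sample error after shuffling one predictor column."""
    baseline = mse(model, rows, ys)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(model, permuted, ys) - baseline

# Synthetic data (an assumption): the outcome depends on x1 only, x2 is noise
rng = random.Random(1)
rows = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(300)]
ys = [5.0 * x1 + rng.gauss(0, 0.1) for x1, _ in rows]
train_rows, test_rows = rows[:200], rows[200:]
train_y, test_y = ys[:200], ys[200:]

# The importance code only ever calls model(row); it never looks inside
model = lambda row: knn_predict(train_rows, train_y, row)
importance = {
    name: permutation_importance(model, test_rows, test_y, col, rng)
    for col, name in enumerate(["x1", "x2"])
}
```

Because `permutation_importance` only needs a prediction function, the same code would work unchanged with a tree, a forest or a linear model in place of the nearest-neighbour predictor.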

Page 20: The zen of predictive modelling

What Do All Predictive Model Predictions Have in Common?

All Predictive Models:

• Make predictions that are numeric: estimates of amount for regression, and probability for classification

• Make predictions by applying the underlying model structure and parameters (the formula) to new predictor data sets

• Make deterministic predictions: once a model is fitted, the predictions for a given record will be the same every time. (The prediction may be a distribution rather than a fixed point. Note also that model fitting itself may be random: some models may differ slightly each time they are fitted to the same data set.)
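
The distinction between randomised fitting and deterministic prediction can be made concrete with a toy example. The sketch below assumes synthetic data and a deliberately simple "bagged" model (the average of least-squares slopes over bootstrap resamples); `fit_bagged_slope` and `predict` are hypothetical names used only for illustration.

```python
import random

def fit_bagged_slope(xs, ys, n_models=25, seed=None):
    """Randomised fitting: average least-squares slopes (through the origin)
    over bootstrap resamples of the training records."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sample records with replacement
        sxy = sum(xs[i] * ys[i] for i in idx)
        sxx = sum(xs[i] ** 2 for i in idx)
        slopes.append(sxy / sxx)
    return sum(slopes) / n_models

def predict(slope, x):
    """Deterministic prediction: apply the fitted parameter to a new record."""
    return slope * x

# Synthetic data (an assumption): y = 2x + noise
data_rng = random.Random(0)
xs = [data_rng.uniform(1, 10) for _ in range(100)]
ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]

model_a = fit_bagged_slope(xs, ys, seed=1)
model_b = fit_bagged_slope(xs, ys, seed=2)   # same data, different random fit

same_every_time = predict(model_a, 5.0) == predict(model_a, 5.0)
fit_difference = abs(model_a - model_b)      # small, but not zero
```

Fixing the seed makes the fit itself reproducible, which is usually worth doing before comparing models or reporting results.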

Page 21: The zen of predictive modelling

How Do Predictive Model Families Differ?

• Classification vs Regression (most families can do both)

• Predictive accuracy vs insights

• Predictive accuracy vs stability

• Deterministic fitting vs randomised fitting

• Specific insights

• Structure and complexity

• Model assumptions (linear models, neural nets)

• Model structure (trees vs additive models vs SVM vs Neural Nets etc)

• The kinds of insights models provide

• Tendency to overfit (most, but not all)

• Dependence on metrics

• Sensitivity to missing values and categorical variables

Page 22: The zen of predictive modelling

Becoming a Master of Modelling Kung Fu

• Predictive models should be thought of as a “black box” initially, with the characteristics that all models have in common recognised

• The focus should be on the data, not the model.

• Focusing on the specific characteristics of the model is important when deciding on the degree of accuracy desired and the kinds of insights desired.

• It is good to start by working with one highly accurate, simple-to-use method (randomForest is a good choice) and one or two highly interpretable models (rpart decision trees and (generalised) linear models are good here).

• In fact, you can go a long way with just randomForest alone.

Page 23: The zen of predictive modelling

Becoming a Master of Modelling Kung Fu

• Master an adequate tool.

• Empty your mind of the tool. It is an illusion.

• Meditate on the data.

Page 24: The zen of predictive modelling

Meditating on Data

• Start with a highly accurate, nonparametric model you are comfortable with.

• The accuracy of a highly accurate method is close to the theoretical limit of accuracy possible on the data. World class experts may get closer, but not a whole lot closer.

• So once you build the model, forget about the specific family you used. It is just a tool.

• Each predictor may provide a unique amount of predictability to the model. Measure it.

• Each predictor may be masked by other predictors. Be careful.

• Check relationships between data and strongest predictors

Page 25: The zen of predictive modelling

Meditating on Data

• There are at least 3 ways that a predictor can be important. They are not the same:

• What is the unique contribution of the predictor to the accuracy of the model?

• What is the individual predictive power of the predictor alone?

• How vital is the predictor to the structure of a particular model?

• The first two are about the data, the third is more about the specific model. Which is more important?
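
The first two kinds of importance can give very different answers when predictors are correlated. The sketch below is an illustration under assumptions: the data is synthetic, with `x2` constructed as a noisy copy of `x1`, and R-squared from an ordinary least-squares fit stands in for "accuracy". The helper names are hypothetical. The third kind of importance (how vital a predictor is to a particular model's structure) depends on the model family and is not shown.

```python
import random

def r2_one(x, y):
    """R-squared of a least-squares line using a single predictor."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def r2_two(x1, x2, y):
    """R-squared of a least-squares fit on two predictors (2x2 normal equations)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [b - my for b in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    syy = sum(b * b for b in cy)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return (b1 * s1y + b2 * s2y) / syy

# Synthetic data (an assumption): x2 is a noisy copy of x1; y depends on x1 only
rng = random.Random(3)
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2 = [a + rng.gauss(0, 0.3) for a in x1]
y = [2.0 * a + rng.gauss(0, 1) for a in x1]

full = r2_two(x1, x2, y)
# Individual predictive power: accuracy using the predictor on its own
alone = {"x1": r2_one(x1, y), "x2": r2_one(x2, y)}
# Unique contribution: accuracy lost when the predictor is dropped from the full model
unique = {"x1": full - alone["x2"], "x2": full - alone["x1"]}
```

Here `x2` looks strong on its own yet adds almost nothing once `x1` is in the model, and `x1`'s unique contribution is also much smaller than its individual power because `x2` masks it.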

Page 27: The zen of predictive modelling

The Predictive Modelling Master’s Data Meditation

• Start with a highly accurate, nonparametric model you are comfortable with.

• The accuracy of a highly accurate method is close to the theoretical limit of accuracy possible on the data. World class experts may get closer, but not a whole lot closer.

• So once you build the model, forget about the specific family you used. It is just a tool.

• Measure model accuracy on out-of-sample data. Pay attention to any imbalances in class or data subset accuracy.

• Measure model stability if necessary (it almost always is)

• Measure the importance of all variables, using the three main techniques.

• Measure again, holding some of the main predictors constant

• Measure (visualise) the effects of each predictor

• Build an interpretable model to help tell the story
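
The advice above to watch for imbalances in class accuracy is easy to check mechanically. The sketch below uses made-up labels (an assumption for illustration) and a hypothetical helper `per_class_accuracy`; it shows how a respectable overall accuracy can hide a class the model never gets right.

```python
from collections import Counter

def per_class_accuracy(actual, predicted):
    """Fraction of records predicted correctly, separately for each actual class."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up out-of-sample results: 90 "no" records, 10 "yes" records
actual = ["no"] * 90 + ["yes"] * 10
predicted = ["no"] * 100   # a model that always predicts the majority class

overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
by_class = per_class_accuracy(actual, predicted)
# overall is 0.9, yet accuracy on the "yes" class is 0.0
```

The same breakdown can be run over any data subset of interest (region, customer segment, time period) by passing the subset's labels instead.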

Page 28: The zen of predictive modelling

The Master Sharpens the Sword: Getting More Accuracy

• There is never enough data

• Some model accuracy can result from trying other model families. Usually not much, and not the best use of time, though for some reason it is the favourite activity of new data miners.

• Some more model accuracy can result from tweaking model parameters. This is perhaps less of a waste of time, but still not the ideal focus.

• The most dramatic improvement in model accuracy comes from new predictors.

• New predictors may be entirely new data sets, or complex new transformations of existing data.

• A large, multi-tabular data set may well have information that has not been captured in the data.

• The most common information of this type involves relations between individual records (e.g. time series windows, geographic neighbourhoods or social network statistics per record).
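
As a small illustration of building predictors from relations between records, the sketch below derives time series window features for each record from the records that precede it. The series, the window length and the feature names are all assumptions made for the example; `add_window_features` is a hypothetical helper.

```python
def add_window_features(values, window=3):
    """For each record, summarise the preceding `window` records (excluding itself)."""
    rows = []
    for i, value in enumerate(values):
        past = values[max(0, i - window):i]
        rows.append({
            "value": value,
            "past_mean": sum(past) / len(past) if past else None,
            "past_max": max(past) if past else None,
            "change_from_prev": value - values[i - 1] if i > 0 else None,
        })
    return rows

# Made-up daily figures, in time order
daily_sales = [10, 12, 9, 15, 14, 20]
features = add_window_features(daily_sales)
# features[3] summarises the three days before the fourth record:
# past_max is 12 and change_from_prev is 6
```

Only earlier records feed each row's features, so the new predictors do not leak information from the future into the model. Geographic neighbourhood or social network statistics follow the same pattern, with "preceding records" replaced by "nearby" or "connected" records.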

Page 29: The zen of predictive modelling

Illusions On the Path

• Colossal wastes of time can include:

• Trying to find the “right” model family

• Getting stuck in data preprocessing trying to get all the predictors “right”

• Trying to figure out what the targets should be (usually a sign that the business problem is not well understood)

• Trying to “improve” the model without defining what that means

Page 30: The zen of predictive modelling

The Sun Tzu of Modelling: Be Prepared

• Know what you are modelling and for what purpose.

• Know what your target variable is. You may have more than one.

• Do not hesitate: model with what you have, and add more predictors later.

• Messy data is better than no data

• Use the right error measures

• Know the connection between the model and your business

• Evaluate and interrogate the model accordingly

• Always question the business value of the analysis

• Always be ready to suggest the business use of the analysis

• Don’t assume that the client understands what to do with the model
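
On using the right error measures: the choice of measure can change which model looks best. The sketch below uses two made-up sets of prediction errors (an assumption for illustration) and compares mean absolute error with root mean squared error; which one is "right" depends on what large misses cost the business.

```python
def mae(errors):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: large misses are penalised more heavily."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

# Made-up out-of-sample errors for two hypothetical models
models = {
    "A": [1.0, -1.0, 1.0, -1.0],   # consistently off by a little
    "B": [0.0, 0.0, 0.0, 3.0],     # usually exact, occasionally far off
}

best_by_mae = min(models, key=lambda name: mae(models[name]))
best_by_rmse = min(models, key=lambda name: rmse(models[name]))
# MAE prefers model B (0.75 vs 1.0); RMSE prefers model A (1.0 vs 1.5)
```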

Page 31: The zen of predictive modelling

Strategy and Tactics

• Why are you (re)building the model?

• If Strategic: what is going to be done with the insights? By whom?

• If Operational: what are the key metrics – accuracy, value, deployability?

Page 32: The zen of predictive modelling

Questions?