A Manager’s Guide Through Random Forests Kirk Monteverde Student Loan Risk Advisors February 2012.
A Manager’s Guide Through Random Forests
Kirk Monteverde, Student Loan Risk Advisors
February 2012
Inferno: Canto I
Midway upon the journey of our life
I found myself within a forest dark,
Leo Breiman (1928-2005), our Virgil
Breiman’s Intellectual Journey
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees (CART), Wadsworth, New York.
Breiman, L. (1996a). Bagging predictors, Machine Learning 26: 123–140.
Breiman, L. (2001). Random forests, Machine Learning 45: 5–32.
What is CART?
CART is one of a number of recursive “tree-structure” techniques (others include CHAID and C4.5)
DVs can be categorical or continuous (we concentrate here on dichotomous DV models)
The idea is to start by choosing the predictor variable (and its cut-point) that separates your development sample into the two piles that best differentiate on the DV
Repeat this process for each of the daughter nodes and continue until some stopping rule is reached
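The deck shows no code, but the splitting step just described can be sketched in Python. This is an illustrative Gini-impurity search, not the author's implementation; all names are made up:

```python
import numpy as np

def gini(y):
    """Gini impurity of a 0/1 label array."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2 * p * (1 - p)

def best_split(X, y):
    """Search every predictor and cut-point for the split that most
    reduces weighted Gini impurity (the CART splitting criterion)."""
    n, m = X.shape
    best = (None, None, gini(y))   # (column, cut-point, resulting impurity)
    for j in range(m):
        # drop the largest value so the right-hand pile is never empty
        for cut in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, cut, score)
    return best

# Toy dichotomous-DV sample: predictor 0 separates the classes cleanly
X = np.array([[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # splits on predictor 0 at 2.0, impurity 0.0
```

Applying `best_split` to each daughter node in turn is exactly the recursion the bullet above describes.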
CART topics
Non-parametric (non-linear) modeling
Nearest Neighbor approach to modeling
Bias / Variance trade-off and Over-fitting
Contrast with Logistic Regression (linear model for categorical DVs)
CART topics: non-linearity
Contrast with Logistic Regression (linear model for categorical DVs): modeling with Predictor A only vs. with both A and B
CART topics: non-linearity
Contrast with Logistic Regression (Predicted Good Odds)
CART topics: non-linearity
Contrast with Logistic Regression (perturbed data)
CART topics: non-linearity
Recursive Tree (CART) approach (Perturbed Data: Split Depth 1)
CART topics: non-linearity
Recursive Tree (CART) approach (Perturbed Data: Split Depth 2)
CART topics: non-linearity
Recursive Tree (CART) approach
“After putting the rabbit into the hat in the full view of the audience, it does not seem necessary to make so much fuss about drawing it out again.”
Robinson, Joan (1966), “Comment on Samuelson and Modigliani”, Review of Economic Studies, 33, pp. 307-308.
Unfair characterization of standard tree techniques
◦ precisely because of the concern for over-fitting, trees are typically not grown to their full depth (and/or they are “pruned”)
Yet Random Forests does grow maximum-depth trees
◦ controlling for over-fitting in another manner
CART topics: non-linearity
Linear Fit
CART topics: nearest neighbor
1-Nearest Neighbor
CART topics: nearest neighbor
15-Nearest Neighbors
CART topics: nearest neighbor
Model bias and variance: using a development sample
Measure of a model’s “usefulness” has two parts
Model Bias: How far from the population’s true value do we expect the model’s prediction to be? A model that perfectly fits its random sample is unbiased, but an unbiased model need not fit perfectly as long as we do not expect it to be off target.
Model Variance: How far off do we expect this model’s predictions (however biased) to be from the mean of all predictions that the model would give were we to draw multiple samples from the population and average all the model’s predictions over those samples?
Mean Squared Error (MSE)= Bias(θ*)² + Var(θ*)
It is MSE that one should seek to minimize
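The decomposition is easy to verify numerically. The sketch below (illustrative Python, not from the slides) uses a deliberately biased estimator, the sample mean shrunk 10% toward zero, and shows that its empirical MSE equals squared bias plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 5.0

# Draw many development samples; each yields one biased estimate
estimates = np.array([0.9 * rng.normal(true_theta, 2.0, size=50).mean()
                      for _ in range(20000)])

bias_sq = (estimates.mean() - true_theta) ** 2     # squared bias
variance = estimates.var()                         # variance across samples
mse = ((estimates - true_theta) ** 2).mean()       # mean squared error

print(bias_sq, variance, mse)
assert abs((bias_sq + variance) - mse) < 1e-9      # the identity is exact
```

The identity holds exactly for the empirical distribution, not just in expectation, which is why the assertion uses a tight tolerance.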
CART topics: bias/variance trade-off
Model bias and variance: using a development sample
In Ordinary Least Squares (OLS) the bias term is zero(again, a model need not perfectly fit its sample to be unbiased)
“Goodness of fit” reduces to finding the model with the smallest variance
OLS models are Best Linear Unbiased Estimators (“BLUE”); Gauss-Markov Theorem
BLUE models may not be the best overall estimators
◦ Non-linear models may have lower MSE
◦ Biased models may have lower MSE (e.g., Ridge Regression)
CART topics: bias/variance trade-off
Problem of Over-Fitting
For prediction, model “usefulness” needs to be assessed using data not used to develop the model
◦ Task is to minimize MSE of the Test or “Hold-out” sample
◦ Over-fitting is related to the bias/variance trade-off
Expected Prediction Error (EPE) example using “k”NN
◦ Assume Y = f(x) + ε, with E(ε) = 0 and Var(ε) = σ²
◦ EPE(xi) = σ² + [f(xi) − average of the k nearest neighbors]² + σ²/k
As k increases, bias increases and variance decreases
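The trade-off can be watched directly. In this illustrative Python sketch (f, σ, and the design grid are arbitrary choices, not from the deck), the noise is redrawn many times and the kNN prediction at one point is tracked:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2                   # an arbitrary true regression function
sigma = 1.0                            # noise standard deviation
x = np.linspace(0, 1, 101)             # fixed design points
x0 = 0.0                               # predict at this point

var_by_k = {}
for k in (1, 5, 25):
    idx = np.argsort(np.abs(x - x0))[:k]         # the k nearest neighbors
    preds = np.array([(f(x[idx]) + rng.normal(0, sigma, k)).mean()
                      for _ in range(5000)])     # redraw the noise 5000 times
    var_by_k[k] = preds.var()
    print(k, round(preds.mean() - f(x0), 4), round(var_by_k[k], 4))

# bias (middle column) creeps up with k; variance tracks sigma^2 / k down
```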
CART topics: bias/variance trade-off
Over-fitting
The over-fitting issue is not restricted to non-traditional methods
Issue is famously illustrated by fitting an OLS model to variables first transformed to higher-degree polynomials
◦ A line can be perfectly fit through two development sample points
◦ A quadratic curve (2nd-degree polynomial) can be perfectly fit through 3 points
◦ An n-degree polynomial can be passed perfectly through n+1 data points
Adding “extra” variables to an OLS model can over-fit
◦ Occam’s razor
◦ Albert Einstein’s advice on modeling
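The polynomial illustration is easy to reproduce (illustrative Python, with made-up data): a 3rd-degree polynomial threads perfectly through 4 points of pure noise, then extrapolates absurdly.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([0.0, 1.0, 2.0, 3.0])   # 4 development points
y = rng.normal(size=4)               # pure-noise "responses"

coeffs = np.polyfit(x, y, deg=3)     # 3rd-degree polynomial through 4 points
fitted = np.polyval(coeffs, x)
assert np.allclose(fitted, y)        # in-sample fit is perfect ...

print(np.polyval(coeffs, 10.0))      # ... off-sample prediction is wild
```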
CART topics: bias/variance trade-off
Over-fitting
CART topics: bias/variance trade-off
e.g., the value of traditional tree-growing techniques is their focus on avoiding over-fitting using such approaches as cross-validation
Bagging topics
Aggregation
Bootstrapping
Bagging = Bootstrap Aggregating
Aggregation as Variance Reduction
“Bagging helps… in short because averaging reduces variance and leaves bias alone”
Hastie, Tibshirani, and Friedman, Elements of Statistical Learning, 2nd edition, p. 285
Therefore aggregation typically helps improve performance of high variance/low bias techniques (e.g., trees) and does not improve linear models
And variance reduction derives from the fundamental observation that the square of an average is less than (or equal to) the average of the squares
mars.csie.ntu.edu.tw/~cychen/papersurvey/Bagging.ppt (see especially the slide entitled Why Bagging Works (2))
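That observation is Jensen’s inequality applied to squared error; a one-screen check (illustrative Python, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
truth = 50.0
guesses = rng.normal(loc=40.0, scale=15.0, size=1000)   # a badly biased crowd

sq_error_of_average = (guesses.mean() - truth) ** 2     # error of the average
average_sq_error = ((guesses - truth) ** 2).mean()      # average of the errors

print(sq_error_of_average, average_sq_error)
assert sq_error_of_average <= average_sq_error          # holds for ANY guesses
```

The inequality is never violated, however biased the crowd; averaging can only remove the variance component of the error, never add to it.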
BAGGING topics: aggregation
The Wisdom of Crowds, James Surowiecki, 2004
Francis Galton’s experience at the 1906 West of England Fat Stock and Poultry Exhibition
Jack Treynor’s jelly-beans-in-the-jar experiment
◦ Only one of 56 student guessers came closer to the truth than the average of the class’s guesses
Who Wants to Be a Millionaire?
◦ Call an expert? 65% correct
◦ Ask the audience? 91% correct
BAGGING topics: aggregation
Scott E. Page, The Difference, 2007
BAGGING topics: aggregation
Which person from the following list was not a member of the Monkees (a 1960s pop band)?
(A) Peter Tork (B) Davy Jones (C) Roger Noll (D) Michael Nesmith
The non-Monkee is Roger Noll, a Stanford economist. Now imagine a crowd of 100 people with knowledge distributed as:
7 know all 3 of the Monkees
10 know 2 of the Monkees
15 know 1 of the Monkees
68 have no clue
So Noll will garner, on average, 34 votes versus 22 votes for each of the other choices.
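The vote counts follow from assuming that anyone who cannot identify the non-Monkee guesses uniformly among the options they cannot eliminate (illustrative Python):

```python
# known Monkees -> number of people holding that knowledge
crowd = {3: 7, 2: 10, 1: 15, 0: 68}

# someone who knows k Monkees eliminates k options and guesses
# uniformly among the remaining 4 - k choices
noll_votes = sum(n / (4 - k) for k, n in crowd.items())
per_monkee = (100 - noll_votes) / 3   # the rest split evenly over 3 Monkees

print(noll_votes, per_monkee)   # → 34.0 22.0
```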
Crowd Wisdom: more than reduced variance
The implication of Surowiecki’s examples is that one should not expend energy trying to identify an expert within a group but instead rely on the group’s collective wisdom, but
◦ Opinions must be independent
◦ Some knowledge of the truth must reside with some group members
Kindergartners guessing the weight of a 747
◦ The squared distance from the truth of the average of guesses, no matter how bad the guesses, is still no greater than (and is usually much smaller than) the average squared distance from the truth of the individual guesses
BAGGING topics: aggregation
Hastie, Tibshirani, and Friedman
BAGGING topics: aggregation
Bootstrapping: a method, not a concept
First used to quantify, via a simulation-like approach, the accuracy of statistical estimates (e.g. the variance of a predicted y value at a specific value of x)
METHOD
◦ Draw from one’s development sample a selection of records, one at a time, returning each selected record back into the pool each time, giving it a chance to be selected repeatedly
◦ Make this new “bootstrapped” sample the same size as the original development sample
◦ Repeat to construct a series of such “bootstrapped” samples
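The METHOD above is a few lines in any language; an illustrative Python sketch (the record values are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
development = np.arange(100)    # stand-ins for 100 development records

# Same size as the original, drawn one at a time WITH replacement,
# so a record can be selected repeatedly (or not at all)
boot = rng.choice(development, size=len(development), replace=True)

print(len(boot), len(np.unique(boot)))
# roughly 63% (1 - 1/e) of the distinct records appear in any one
# bootstrapped sample; the omitted ones become the "out-of-bag" records
# that Random Forests later exploits
```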
BAGGING topics: bootstrapping
Bootstrapping and Aggregation
Bootstrapping is used to create different predictive models which are then aggregated
The same approach (e.g., CART/ tree navigation) can lead to very different models (i.e., different y values predicted for the same x value) when different bootstrapped samples are used
An example of a situation which can benefit from bagging is one where model predictors are highly correlated and the modeling technique is CART
BAGGING topics: bootstrapping
Hastie, Tibshirani, and Friedman
5 candidate predictor variables (standardized normal) correlated to one another at .95 (multicollinearity)
Only one is in reality related to the dichotomous DV
◦ If “true predictor” value is greater than its mean, the DV has an 80% chance of being “YES”, otherwise a 20% chance
◦ “YES” values associated with high values of “true predictor”; “NO” values associated with low values of “true predictor”
When run with bootstrapped sample data, CART often uses some of the four causally unrelated (but deceptively correlated) variables to parse the tree
Development sample size: 30
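The experiment is easy to mimic (illustrative Python, not the book’s code): with n = 30 and correlation .95, a one-split “stump” on a spurious variable frequently matches or beats the true predictor on a bootstrapped sample, which is exactly why the bootstrapped trees disagree:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30                                            # development sample size
cov = np.full((5, 5), 0.95) + 0.05 * np.eye(5)    # pairwise correlation .95
X = rng.multivariate_normal(np.zeros(5), cov, size=n)

# Only variable 0 truly matters: P(YES) = .8 above its mean, .2 below
y = rng.random(n) < np.where(X[:, 0] > 0, 0.8, 0.2)

# Draw one bootstrapped sample and score a mean-split stump per variable
boot = rng.integers(0, n, size=n)
Xb, yb = X[boot], y[boot]
accs = []
for j in range(5):
    split = Xb[:, j] > Xb[:, j].mean()
    accs.append(max((split == yb).mean(), (split != yb).mean()))
    print(f"variable {j}: stump accuracy {accs[-1]:.2f}")
```

Re-running with different bootstrap draws shows different “winning” variables from run to run, mirroring the book’s six divergent CART models.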
BAGGING topics: bootstrapping
Digression on Bayes Error Rate
If the “true predictor” value is greater than its mean, then the DV has an 80% chance of being “YES”; otherwise it has only a 20% chance
BAGGING topics: bootstrapping
Bayes Error Rate: the expected misclassification rate even when the true data-generating algorithm and the values of the predictors that matter are known: .1 + .1 = .2
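The .2 figure can be checked by simulation (illustrative Python): even the optimal rule, predict YES exactly when the true predictor exceeds its mean, is wrong 20% of the time because the outcome itself is random:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=200_000)                     # the "true predictor"
y = rng.random(x.size) < np.where(x > x.mean(), 0.8, 0.2)   # the DV

pred = x > x.mean()                              # the best possible rule
bayes_error = (pred != y).mean()
print(round(bayes_error, 3))                     # ~ 0.2 = .5(.2) + .5(.2)
```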
Hastie, Tibshirani, and Friedman
Original and first 5 bootstrapped CART models
BAGGING topics: bootstrapping
Hastie, Tibshirani, and Friedman
BAGGING topics: bootstrapping
Random Forests
“This concept [bagged classifiers] has been popularized outside of statistics as the Wisdom of Crowds (Surowiecki, 2004) — the collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting. Of course, the main caveat here is ‘independent,’ and bagged trees are not.”
Hastie, Tibshirani, and Friedman, Elements of Statistical Learning, 2nd edition, p. 286
The Random Forests technique focuses on making tree models (more) independent before bagging them
Disclaimer
Random Forests has attracted “neural net”-like buzz
Its strength is its easy-to-understand pedigree
Other recent advanced non-linear techniques are arguably “better” and are certainly more robust (e.g., Stochastic Gradient Boosting)
Features touted in the method (and implemented in most software packages) not covered here include
◦ Out-of-Bag samples
◦ Variable Importance algorithm
◦ Proximity Plots
RANDOM FORESTS
De-correlating tree algorithms
Trees built using bootstrapped samples are correlated, meaning that they tend to give the same estimates for the Dependent Variable
◦ they are built, after all, using the same set of predictors, varying only the composition of their bootstrapped samples
The trick to de-correlating trees is to randomly select only a subset from among available predictors as one builds the trees
The rule of thumb for random forest classifiers is to use a number of predictors equal to the square root of the number of available predictors
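A minimal sketch of the rule (illustrative Python; scikit-learn exposes the same idea via the `max_features="sqrt"` parameter of its `RandomForestClassifier`):

```python
import math
import numpy as np

rng = np.random.default_rng(7)
p = 16                       # predictors available in the data set
mtry = int(math.sqrt(p))     # rule of thumb: consider only 4 of the 16

# At EVERY split of EVERY tree a fresh random subset of mtry predictors
# is drawn, so the trees stop agreeing on which variables to split on
candidates = sorted(rng.choice(p, size=mtry, replace=False).tolist())
print(mtry, candidates)
```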
RANDOM FORESTS
Important Predecessor Paper
Ho, Tin Kam (1995). Random Decision Forests, in M. Kavavaugh and P. Storms (eds), Proc. Third International Conference on Document Analysis and Recognition, Vol. 1, IEEE Computer Society Press, New York, pp. 278–282.
Short, readable paper from a Bell Labs scientist; an engineer’s interest in solving a practical problem (handwritten digit recognition)
Introduced idea of using only a subset of candidate predictors for each of an ensemble of tree growths (but all against single training data set)
Tree navigation not using CART, but rather two geometric-inspired classifying algorithms (reminiscent of Nearest Neighbor)
Trees are grown to full depth (no pruning) with each tree “voting”
Insight that for all trees, the training set is 100% correctly classified (except where 2 sample points with the same IV values are associated with different DV classifications); the technique is only useful for classifying points not present in the sample (points lying “between” sample points)
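Ho’s insight is the same one a 1-nearest-neighbor classifier makes obvious (illustrative Python, with made-up data): on its own training sample the classifier is just a lookup table, so all the interesting behavior is off-sample:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.random((50, 4))       # 50 training points, 4 predictors
y = rng.integers(0, 2, 50)    # arbitrary 0/1 labels

def predict_1nn(point):
    # a fully grown tree behaves like this on its own training data:
    # it simply looks the point up
    return y[np.argmin(((X - point) ** 2).sum(axis=1))]

train_acc = np.mean([predict_1nn(xi) == yi for xi, yi in zip(X, y)])
print(train_acc)   # → 1.0: the sample is memorized
```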
RANDOM FORESTS
Recalling problems of trees grown to their full depth
For categorical (unordered) independent variables, a fully grown tree is nothing more than a multi-dimensional crosstab; CART, or any tree navigation method, is simply the means to a known end
◦ Social Scientist’s Saturated Model
◦ Logistic Regression run including all possible interaction effects
And if all cells of this multi-dimensional crosstab are populated (e.g., all possible predictor level combinations are represented in the sample data), then there are no levels of predictor variable combinations “between” those levels observed in the sample
For ordinal level predictor data (“good, bad, or ugly”) and interval level data, tree navigation (e.g., using CART) still involves forming groupings but the mechanism of group formation is constrained and, for interval level data, groups are defined as ranges
RANDOM FORESTS
Add bootstrapping and use 1-NN (not CART) for prediction
RANDOM FORESTS
First Bootstrapped “Collective”
6-member collective (collective defined as all combinations of 4 variables taken 2 at a time)
RANDOM FORESTS
First Collective’s voting on TEST sample point .75, .75, .75, .75, BLUE
RANDOM FORESTS
RF: the complete algorithm (from Hastie, Tibshirani, and Friedman)
RANDOM FORESTS
Summary and Final Caution
CART topics
◦ Non-parametric (non-linear) modeling
◦ Nearest Neighbor approach to modeling
◦ Bias / Variance trade-off and Over-fitting
Bagging Topics
◦ Aggregation and the Wisdom of Crowds
◦ Bootstrapping
Random Forests
Importance of explicitly addressing the relative costs of misclassification when using non-linear classification algorithms (e.g., CART and RF)