A Manager’s Guide Through Random Forests Kirk Monteverde Student Loan Risk Advisors February 2012.
A Manager’s Guide Through Random Forests
Kirk Monteverde, Student Loan Risk Advisors
February 2012
Inferno: Canto I
Midway upon the journey of our life
I found myself within a forest dark,
Leo Breiman (1928-2005), our Virgil
Breiman’s Intellectual Journey
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees (CART), Wadsworth, New York.
Breiman, L. (1996a). Bagging predictors, Machine Learning 26: 123–140.
Breiman, L. (2001). Random forests, Machine Learning 45: 5–32.
What is CART?
CART is one of a number of recursive “tree-structure” techniques (others include CHAID and C4.5)
DVs can be categorical or continuous (we concentrate here on dichotomous DV models)
The idea is to start by choosing the predictor variable (and its cut-point) that separates your development sample into the two piles that best differentiate on the DV
Repeat this process for each of the daughter nodes and continue until some stopping rule is reached
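The deck shows no code, but the splitting step just described can be sketched in Python. This is an illustrative Gini-impurity search, not the author's implementation; all names are made up:

```python
import numpy as np

def gini(y):
    """Gini impurity of a 0/1 label array."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2 * p * (1 - p)

def best_split(X, y):
    """Search every predictor and cut-point for the split that most
    reduces weighted Gini impurity (the CART splitting criterion)."""
    n, m = X.shape
    best = (None, None, gini(y))   # (column, cut-point, resulting impurity)
    for j in range(m):
        # drop the largest value so the right-hand pile is never empty
        for cut in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, cut, score)
    return best

# Toy dichotomous-DV sample: predictor 0 separates the classes cleanly
X = np.array([[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # splits on predictor 0 at 2.0, impurity 0.0
```

Applying `best_split` to each daughter node in turn is exactly the recursion the bullet above describes.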
CART topics
Non-parametric (non-linear) modeling
Nearest Neighbor approach to modeling
Bias / Variance trade-off and Over-fitting
Contrast with Logistic Regression (linear model for categorical DVs)
CART topics: non-linearity
Contrast with Logistic Regression (linear model for categorical DVs): modeling with Predictor A only vs. with both A and B
CART topics: non-linearity
Contrast with Logistic Regression (Predicted Good Odds)
CART topics: non-linearity
Contrast with Logistic Regression (perturbed data)
CART topics: non-linearity
Recursive Tree (CART) approach (Perturbed Data: Split Depth 1)
CART topics: non-linearity
Recursive Tree (CART) approach (Perturbed Data: Split Depth 2)
CART topics: non-linearity
Recursive Tree (CART) approach
“After putting the rabbit into the hat in the full view of the audience, it does not seem necessary to make so much fuss about drawing it out again.”
Robinson, Joan (1966), “Comment on Samuelson and Modigliani”, Review of Economic Studies, 33, pp. 307-308.
Unfair characterization of standard tree techniques
◦ precisely because of the concern for over-fitting, trees are typically not grown to their full depth (and/or they are “pruned”)
Yet Random Forests does grow maximum-depth trees
◦ controlling for over-fitting in another manner
CART topics: non-linearity
Linear Fit
CART topics: nearest neighbor
1-Nearest Neighbor
CART topics: nearest neighbor
15-Nearest Neighbors
CART topics: nearest neighbor
Model bias and variance: using a development sample
Measure of a model’s “usefulness” has two parts
Model Bias: How far from the population’s true value do we expect the model’s prediction to be? A model that perfectly fits its random sample is unbiased, but an unbiased model need not fit perfectly as long as we do not expect it to be off target.
Model Variance: How far off do we expect this model’s predictions (however biased) to be from the mean of all predictions that the model would give were we to draw multiple samples from the population and average all the model’s predictions over those samples?
Mean Squared Error (MSE)= Bias(θ*)² + Var(θ*)
It is MSE that one should seek to minimize
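The decomposition is easy to verify numerically. The sketch below (illustrative Python, not from the slides) uses a deliberately biased estimator, the sample mean shrunk 10% toward zero, and shows that its empirical MSE equals squared bias plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 5.0

# Draw many development samples; each yields one biased estimate
estimates = np.array([0.9 * rng.normal(true_theta, 2.0, size=50).mean()
                      for _ in range(20000)])

bias_sq = (estimates.mean() - true_theta) ** 2     # squared bias
variance = estimates.var()                         # variance across samples
mse = ((estimates - true_theta) ** 2).mean()       # mean squared error

print(bias_sq, variance, mse)
assert abs((bias_sq + variance) - mse) < 1e-9      # the identity is exact
```

The identity holds exactly for the empirical distribution, not just in expectation, which is why the assertion uses a tight tolerance.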
CART topics: bias/variance trade-off
Model bias and variance: using a development sample
In Ordinary Least Squares (OLS) the bias term is zero(again, a model need not perfectly fit its sample to be unbiased)
“Goodness of fit” reduces to finding the model with the smallest variance
OLS models are Best Linear Unbiased Estimators (“BLUE”); Gauss-Markov Theorem
BLUE models may not be the best overall estimators
◦ Non-linear models may have lower MSE
◦ Biased models may have lower MSE (e.g., Ridge Regression)
CART topics: bias/variance trade-off
Problem of Over-Fitting
For prediction, model “usefulness” needs to be assessed using data not used to develop the model
◦ Task is to minimize MSE of the Test or “Hold-out” sample
◦ Over-fitting is related to the bias/variance trade-off
Expected Prediction Error (EPE) example using “k”NN
◦ Assume Y = f(x) + ε, with E(ε) = 0 and Var(ε) = σ²
◦ EPE(xi) = σ² + [f(xi) − average of the k nearest neighbors]² + σ²/k
As k increases, bias increases and variance decreases
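The trade-off can be watched directly. In this illustrative Python sketch (f, σ, and the design grid are arbitrary choices, not from the deck), the noise is redrawn many times and the kNN prediction at one point is tracked:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x ** 2                   # an arbitrary true regression function
sigma = 1.0                            # noise standard deviation
x = np.linspace(0, 1, 101)             # fixed design points
x0 = 0.0                               # predict at this point

var_by_k = {}
for k in (1, 5, 25):
    idx = np.argsort(np.abs(x - x0))[:k]         # the k nearest neighbors
    preds = np.array([(f(x[idx]) + rng.normal(0, sigma, k)).mean()
                      for _ in range(5000)])     # redraw the noise 5000 times
    var_by_k[k] = preds.var()
    print(k, round(preds.mean() - f(x0), 4), round(var_by_k[k], 4))

# bias (middle column) creeps up with k; variance tracks sigma^2 / k down
```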
CART topics: bias/variance trade-off
Over-fitting
The over-fitting issue is not restricted to non-traditional methods
Issue is famously illustrated by fitting an OLS model to variables first transformed to higher-degree polynomials
◦ A line can be perfectly fit through two development sample points
◦ A quadratic curve (2nd-degree polynomial) can be perfectly fit through 3 points
◦ An n-degree polynomial can be passed perfectly through n+1 data points
Adding “extra” variables to an OLS model can over-fit
◦ Occam’s razor
◦ Albert Einstein’s advice on modeling
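The polynomial illustration is easy to reproduce (illustrative Python, with made-up data): a 3rd-degree polynomial threads perfectly through 4 points of pure noise, then extrapolates absurdly.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([0.0, 1.0, 2.0, 3.0])   # 4 development points
y = rng.normal(size=4)               # pure-noise "responses"

coeffs = np.polyfit(x, y, deg=3)     # 3rd-degree polynomial through 4 points
fitted = np.polyval(coeffs, x)
assert np.allclose(fitted, y)        # in-sample fit is perfect ...

print(np.polyval(coeffs, 10.0))      # ... off-sample prediction is wild
```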
CART topics: bias/variance trade-off
Over-fitting
CART topics: bias/variance trade-off
e.g., the value of traditional tree-growing techniques is their focus on avoiding over-fitting using such approaches as cross-validation
Bagging topics
Aggregation
Bootstrapping
Bagging = Bootstrap Aggregating
Aggregation as Variance Reduction
“Bagging helps… in short because averaging reduces variance and leaves bias alone”
Hastie, Tibshirani, and Friedman, Elements of Statistical Learning, 2nd edition, p. 285
Therefore aggregation typically helps improve performance of high variance/low bias techniques (e.g., trees) and does not improve linear models
And variance reduction derives from the fundamental observation that the square of an average is less than (or equal to) the average of the squares
mars.csie.ntu.edu.tw/~cychen/papersurvey/Bagging.ppt (see especially the slide entitled Why Bagging Works (2))
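That observation is Jensen’s inequality applied to squared error; a one-screen check (illustrative Python, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
truth = 50.0
guesses = rng.normal(loc=40.0, scale=15.0, size=1000)   # a badly biased crowd

sq_error_of_average = (guesses.mean() - truth) ** 2     # error of the average
average_sq_error = ((guesses - truth) ** 2).mean()      # average of the errors

print(sq_error_of_average, average_sq_error)
assert sq_error_of_average <= average_sq_error          # holds for ANY guesses
```

The inequality is never violated, however biased the crowd; averaging can only remove the variance component of the error, never add to it.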
BAGGING topics: aggregation
The Wisdom of Crowds, James Surowiecki, 2004
Francis Galton’s experience at the 1906 West of England Fat Stock and Poultry Exhibition
Jack Treynor’s jelly-beans-in-the-jar experiment
◦ Only one of 56 student guessers came closer to the truth than the average of the class’s guesses
Who Wants to Be a Millionaire?
◦ Call an expert? 65% correct
◦ Ask the audience? 91% correct
BAGGING topics: aggregation
Scott E. Page, The Difference, 2007
BAGGING topics: aggregation
Which person from the following list was not a member of the Monkees (a 1960s pop band)?
(A) Peter Tork (B) Davy Jones (C) Roger Noll (D) Michael Nesmith
The non-Monkee is Roger Noll, a Stanford economist. Now imagine a crowd of 100 people with knowledge distributed as:
7 know all 3 of the Monkees
10 know 2 of the Monkees
15 know 1 of the Monkees
68 have no clue
So Noll will garner, on average, 34 votes versus 22 votes for each of the other choices.
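The vote counts follow from assuming that anyone who cannot identify the non-Monkee guesses uniformly among the options they cannot eliminate (illustrative Python):

```python
# known Monkees -> number of people holding that knowledge
crowd = {3: 7, 2: 10, 1: 15, 0: 68}

# someone who knows k Monkees eliminates k options and guesses
# uniformly among the remaining 4 - k choices
noll_votes = sum(n / (4 - k) for k, n in crowd.items())
per_monkee = (100 - noll_votes) / 3   # the rest split evenly over 3 Monkees

print(noll_votes, per_monkee)   # → 34.0 22.0
```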
Crowd Wisdom: more than reduced variance
The implication of Surowiecki’s examples is that one should not expend energy trying to identify an expert within a group but instead rely on the group’s collective wisdom, but
◦ Opinions must be independent
◦ Some knowledge of the truth must reside with some group members
Kindergartners guessing the weight of a 747
◦ The squared distance from the truth of the average of guesses, no matter how bad the guesses, is still no greater than (and is usually much smaller than) the average squared distance from the truth of the individual guesses
BAGGING topics: aggregation
Hastie, Tibshirani, and Friedman
BAGGING topics: aggregation
Bootstrapping: a method, not a concept
First used to quantify, via a simulation-like approach, the accuracy of statistical estimates (e.g. the variance of a predicted y value at a specific value of x)
METHOD
◦ Draw from one’s development sample a selection of records, one at a time, returning each selected record back into the pool each time, giving it a chance to be selected repeatedly
◦ Make this new “bootstrapped” sample the same size as the original development sample
◦ Repeat to construct a series of such “bootstrapped” samples
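The METHOD above is a few lines in any language; an illustrative Python sketch (the record values are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
development = np.arange(100)    # stand-ins for 100 development records

# Same size as the original, drawn one at a time WITH replacement,
# so a record can be selected repeatedly (or not at all)
boot = rng.choice(development, size=len(development), replace=True)

print(len(boot), len(np.unique(boot)))
# roughly 63% (1 - 1/e) of the distinct records appear in any one
# bootstrapped sample; the omitted ones become the "out-of-bag" records
# that Random Forests later exploits
```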
BAGGING topics: bootstrapping
Bootstrapping and Aggregation
Bootstrapping is used to create different predictive models which are then aggregated
The same approach (e.g., CART/ tree navigation) can lead to very different models (i.e., different y values predicted for the same x value) when different bootstrapped samples are used
An example of a situation which can benefit from bagging is one where model predictors are highly correlated and the modeling technique is CART
BAGGING topics: bootstrapping
Hastie, Tibshirani, and Friedman
5 candidate predictor variables (standardized normal) correlated to one another at .95 (multicollinearity)
Only one is in reality related to the dichotomous DV
◦ If “true predictor” value is greater than its mean, the DV has an 80% chance of being “YES”, otherwise a 20% chance
◦ “YES” values associated with high values of “true predictor”; “NO” values associated with low values of “true predictor”
When run with bootstrapped sample data, CART often uses some of the four causally unrelated (but deceptively correlated) variables to parse the tree
Development sample size: 30
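The experiment is easy to mimic (illustrative Python, not the book’s code): with n = 30 and correlation .95, a one-split “stump” on a spurious variable frequently matches or beats the true predictor on a bootstrapped sample, which is exactly why the bootstrapped trees disagree:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30                                            # development sample size
cov = np.full((5, 5), 0.95) + 0.05 * np.eye(5)    # pairwise correlation .95
X = rng.multivariate_normal(np.zeros(5), cov, size=n)

# Only variable 0 truly matters: P(YES) = .8 above its mean, .2 below
y = rng.random(n) < np.where(X[:, 0] > 0, 0.8, 0.2)

# Draw one bootstrapped sample and score a mean-split stump per variable
boot = rng.integers(0, n, size=n)
Xb, yb = X[boot], y[boot]
accs = []
for j in range(5):
    split = Xb[:, j] > Xb[:, j].mean()
    accs.append(max((split == yb).mean(), (split != yb).mean()))
    print(f"variable {j}: stump accuracy {accs[-1]:.2f}")
```

Re-running with different bootstrap draws shows different “winning” variables from run to run, mirroring the book’s six divergent CART models.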
BAGGING topics: bootstrapping
Digression on Bayes Error Rate
If the “true predictor” value is greater than its mean, then the DV has an 80% chance of being “YES”; otherwise it has only a 20% chance
BAGGING topics: bootstrapping
Bayes Error Rate: the expected misclassification rate even when the true data-generating algorithm and the values of the predictors that matter are known: .1 + .1 = .2
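The .2 figure can be checked by simulation (illustrative Python): even the optimal rule, predict YES exactly when the true predictor exceeds its mean, is wrong 20% of the time because the outcome itself is random:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=200_000)                     # the "true predictor"
y = rng.random(x.size) < np.where(x > x.mean(), 0.8, 0.2)   # the DV

pred = x > x.mean()                              # the best possible rule
bayes_error = (pred != y).mean()
print(round(bayes_error, 3))                     # ~ 0.2 = .5(.2) + .5(.2)
```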
Hastie, Tibshirani, and Friedman
Original and first 5 bootstrapped CART models
BAGGING topics: bootstrapping
Hastie, Tibshirani, and Friedman
BAGGING topics: bootstrapping
Random Forests
“This concept [bagged classifiers] has been popularized outside of statistics as the Wisdom of Crowds (Surowiecki, 2004) — the collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting. Of course, the main caveat here is ‘independent,’ and bagged trees are not.”
Hastie, Tibshirani, and Friedman, Elements of Statistical Learning, 2nd edition, p. 286
The Random Forests technique focuses on making tree models (more) independent before bagging them
Disclaimer
Random Forests has attracted “neural net”-like buzz
Its strength is its easy-to-understand pedigree
Other recent advanced non-linear techniques are arguably “better” and are certainly more robust (e.g., Stochastic Gradient Boosting)
Features touted in the method (and implemented in most software packages) not covered here include
◦ Out-of-Bag samples
◦ Variable Importance algorithm
◦ Proximity Plots
RANDOM FORESTS
De-correlating tree algorithms
Trees built using bootstrapped samples are correlated, meaning that they tend to give the same estimates for the Dependent Variable
◦ they are built, after all, using the same set of predictors, varying only the composition of their bootstrapped samples
The trick to de-correlating trees is to randomly select only a subset from among available predictors as one builds the trees
The rule of thumb for random forest classifiers is to use a number of predictors equal to the square root of the number of available predictors
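A minimal sketch of the rule (illustrative Python; scikit-learn exposes the same idea via the `max_features="sqrt"` parameter of its `RandomForestClassifier`):

```python
import math
import numpy as np

rng = np.random.default_rng(7)
p = 16                       # predictors available in the data set
mtry = int(math.sqrt(p))     # rule of thumb: consider only 4 of the 16

# At EVERY split of EVERY tree a fresh random subset of mtry predictors
# is drawn, so the trees stop agreeing on which variables to split on
candidates = sorted(rng.choice(p, size=mtry, replace=False).tolist())
print(mtry, candidates)
```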
RANDOM FORESTS
Important Predecessor Paper
Ho, Tin Kam (1995). Random Decision Forests, in M. Kavavaugh and P. Storms (eds), Proc. Third International Conference on Document Analysis and Recognition, Vol. 1, IEEE Computer Society Press, New York, pp. 278–282.
Short, readable paper from a Bell Labs scientist; an engineer’s interest in solving a practical problem (handwritten digit recognition)
Introduced idea of using only a subset of candidate predictors for each of an ensemble of tree growths (but all against single training data set)
Tree navigation not using CART, but rather two geometric-inspired classifying algorithms (reminiscent of Nearest Neighbor)
Trees are grown to full depth (no pruning) with each tree “voting”
Insight that for all trees, the training set is 100% correctly classified (except where 2 sample points with the same IV values are associated with different DV classifications); the technique is only useful for classifying points not present in the sample (points lying “between” sample points)
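Ho’s insight is the same one a 1-nearest-neighbor classifier makes obvious (illustrative Python, with made-up data): on its own training sample the classifier is just a lookup table, so all the interesting behavior is off-sample:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.random((50, 4))       # 50 training points, 4 predictors
y = rng.integers(0, 2, 50)    # arbitrary 0/1 labels

def predict_1nn(point):
    # a fully grown tree behaves like this on its own training data:
    # it simply looks the point up
    return y[np.argmin(((X - point) ** 2).sum(axis=1))]

train_acc = np.mean([predict_1nn(xi) == yi for xi, yi in zip(X, y)])
print(train_acc)   # → 1.0: the sample is memorized
```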
RANDOM FORESTS
Recalling problems of trees grown to their full depth
For categorical (unordered) independent variables, a fully grown tree is nothing more than a multi-dimensional crosstab; CART, or any tree navigation method, is simply the means to a known end
◦ Social Scientist’s Saturated Model
◦ Logistic Regression run including all possible interaction effects
And if all cells of this multi-dimensional crosstab are populated (e.g., all possible predictor level combinations are represented in the sample data), then there are no levels of predictor variable combinations “between” those levels observed in the sample
For ordinal level predictor data (“good, bad, or ugly”) and interval level data, tree navigation (e.g., using CART) still involves forming groupings but the mechanism of group formation is constrained and, for interval level data, groups are defined as ranges
RANDOM FORESTS
Add bootstrapping and use 1-NN (not CART) for prediction
RANDOM FORESTS
First Bootstrapped “Collective”
6-member collective (collective defined as all combinations of 4 variables taken 2 at a time)
RANDOM FORESTS
First Collective’s voting on TEST sample point .75, .75, .75, .75, BLUE
RANDOM FORESTS
RF: the complete algorithm (from Hastie, Tibshirani, and Friedman)
RANDOM FORESTS
Summary and Final Caution
CART topics
◦ Non-parametric (non-linear) modeling
◦ Nearest Neighbor approach to modeling
◦ Bias / Variance trade-off and Over-fitting
Bagging Topics
◦ Aggregation and the Wisdom of Crowds
◦ Bootstrapping
Random Forests
Importance of explicitly addressing the relative costs of misclassification when using non-linear classification algorithms (e.g., CART and RF)