Transcript of Tree net and_randomforests_2009
Introduction to Random Forests and Stochastic Gradient Boosting
Dan Steinberg, Mykhaylo Golovnya
August, 2009
Salford Systems © Copyright 2009
Initial Ideas on Combining Trees
Idea that combining good methods could yield promising results was suggested by researchers more than a decade ago
In tree-structured analysis, suggestion stems from:
Wray Buntine (1991)
Kwok and Carter (1990)
Heath, Kasif and Salzberg (1993)
Notion is that if the trees can somehow get at different aspects of the data, the combination will be “better”
Better in this context means more accurate in classification and prediction for future cases
The original implementation of CART already included bagging (Bootstrap Aggregation) and ARCing (Adaptive Resampling and Combining) approaches to build tree ensembles
Past Decade Development
The original bagging and boosting approaches relied on sampling with replacement techniques to obtain a new modeling dataset
Subsequent approaches focused on refining the sampling machinery or changing the modeling emphasis from the original dependent variable to current model generalized residuals
Most important variants (and dates of published articles) are:
Bagging (Breiman, 1996, “Bootstrap Aggregation”)
Boosting (Freund and Schapire, 1995)
Multiple Additive Regression Trees (Friedman, 1999, aka MART™ or TreeNet™)
RandomForests™ (Breiman, 2001)
Work continues with major refinements underway (Friedman in collaboration with Salford Systems)
Simplest example:
Grow a tree on training data
Find a way to grow another tree, different from those currently available (change something in the setup)
Repeat many times, say 500 replications
Average the results or create a voting scheme; for example, relate PD to the fraction of trees predicting default for a given case
Beauty of the method is that every new tree starts with a complete set of data. Any one tree can run out of data, but when that happens we just start again with a new tree and all the data (before sampling)
Prediction via voting
Multi Tree Methods
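The grow-many-trees-and-vote recipe above can be sketched in a few lines. Everything here (the one-split stump learner, the function names) is illustrative only, not Salford's implementation:

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, direction) split with the
    fewest misclassifications; return a predict function. A deliberately
    weak base learner standing in for a full tree."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                pred = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                errors = sum(p != yi for p, yi in zip(pred, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    _, j, t, sign = best
    return lambda row: 1 if sign * (row[j] - t) > 0 else 0

def bagged_ensemble(X, y, n_trees=25, seed=0):
    """Grow each stump on a bootstrap sample (same size as the original,
    drawn with replacement) -- every new tree starts from the full data pool."""
    rng = random.Random(seed)
    n, trees = len(X), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    def vote(row):
        return Counter(tree(row) for tree in trees).most_common(1)[0][0]
    return vote, trees

def predicted_fraction(trees, row):
    """Fraction of trees voting 1 -- e.g. relate PD to the share of trees
    predicting default for a given case."""
    return sum(tree(row) for tree in trees) / len(trees)
```

With ~500 replications instead of 25, the voted prediction stabilizes even though each individual stump is weak.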
Random Forest
A random forest is a collection of single trees grown in a special way
The overall prediction is determined by voting (in classification) or averaging (in regression)
Accuracy is achieved by using a large number of trees
The Law of Large Numbers ensures convergence
The key to accuracy is low correlation and bias
To keep bias and correlation low, trees are grown to maximum depth
Using more trees does not lead to overfitting, because each tree is grown independently
Correlation is kept low through explicitly introduced randomness
RandomForests™ often works well when other methods work poorly
The reasons for this are poorly understood
Sometimes other methods work well and RandomForests™ doesn’t
Randomness is introduced in order to keep correlation low
Randomness is introduced in two distinct ways
Each tree is grown on a bootstrap sample from the learning set
Default bootstrap sample size equals original sample size
Smaller bootstrap sample sizes are sometimes useful
A number R is specified (the square root of the number of predictors by default) such that it is noticeably smaller than the total number of available predictors
During tree growing phase, at each node only R predictors are randomly selected and tried
Randomness also reduces the signal to noise ratio in a single tree
A low correlation between trees is more important than a high signal when many trees contribute to forming the model
RandomForests™ trees often have very low signal strength, even when the signal strength of the forest is high
Important to Keep Correlation Low
Averaging many base learners improves the signal to noise ratio dramatically provided that the correlation of errors is kept low
Hundreds of base learners are needed for the most noticeable effect
[Figure: signal-to-noise multiplier as a function of the correlation of errors (0.05 to 0.8), plotted for ensembles of 1, 2, 5, 10, 20, 50, 100, 200, 500, and 1000 trees; the multiplier rises toward about 20 as correlation falls and tree count grows]
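The curves in the figure follow from a standard variance argument: if N base learners have errors of equal variance and pairwise correlation ρ, the variance of the averaged error is σ²(1 + (N−1)ρ)/N, so the variance-reduction multiplier is N/(1 + (N−1)ρ). A sketch (that this is the exact quantity plotted on the slide is an assumption):

```python
def snr_multiplier(n_trees, rho):
    """Variance-reduction factor from averaging n_trees base learners whose
    errors have pairwise correlation rho:
    Var(mean error) = sigma^2 * (1 + (n-1)*rho) / n,
    so the multiplier relative to one learner is n / (1 + (n-1)*rho)."""
    return n_trees / (1 + (n_trees - 1) * rho)
```

Note the two limits: with ρ = 0 the multiplier is N itself, while for large N it saturates at 1/ρ, which is why keeping correlation low matters more than adding ever more trees.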
Randomness in Split Selection
Topic discussed by several Machine Learning researchers
Possibilities:
Select splitter, split point, or both at random
Choose splitter at random from the top K splitters
Random Forests: Suppose we have M available predictors
Select R eligible splitters at random and let the best of them split the node
If R=1 this is just random splitter selection
If R=M this becomes Breiman’s bagger
If R << M then we get Breiman’s Random Forests
Breiman suggests R=sqrt(M) as a good rule of thumb
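The split-selection rule amounts to drawing a random subset of predictors at each node. A minimal sketch (function and parameter names are illustrative):

```python
import math
import random

def candidate_splitters(n_predictors, r=None, rng=None):
    """Pick the predictors eligible to split the current node.
    r=1 -> pure random splitter selection; r=M -> Breiman's bagger;
    r=None -> the sqrt(M) rule of thumb for Random Forests."""
    rng = rng or random.Random()
    if r is None:
        r = max(1, int(math.sqrt(n_predictors)))
    # sample without replacement from the predictor indices
    return rng.sample(range(n_predictors), r)
```

A fresh subset is drawn at every node of every tree, which is what keeps the trees decorrelated.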
Performance as a Function of R
[Figure: error as a function of the number of variables R searched at each split (N Vars, 0 to 120), shown for the 1st tree and for 100 trees]
In this experiment, we ran RF with 100 trees on sample data (772x111) using different values for the number of variables R (N Vars) searched at each split
Combining trees always improves performance, with the optimal number of sampled predictors settling at around 11
Usage Notes
RF does not require an explicit test sample
Capable of capturing high-order interactions
Both running speed and resources consumed depend for the most part on the row dimension of the data
Trees are grown in as simple a way as feasible to keep run times low (no surrogates, no priors, etc.)
Classification models produce pseudo-probability scores (percent of votes)
Performance-wise, RF is capable of matching the performance of modern boosting techniques, including MART (described later)
Naturally allows parallel processing
The final model code is usually bulky, voluminous, and impossible to interpret directly
Current stable implementations include multinomial classification and least squares regression, with ongoing research in the more advanced fields of predictive modeling (survival, choice, etc.)
Proximity Matrix – Raw Material for Further Advances
RF introduces a novel way to define proximity between two observations:
For a dataset of size N define an N x N matrix of proximities
Initialize all proximities to zeroes
For any given tree, apply the tree to the dataset
If case i and case j both end up in the same node, increase the proximity Prox(i,j) between i and j by one
Accumulate over all trees in RF and normalize by twice the number of trees in RF
The resulting matrix provides intrinsic measure of proximity
Observations that are “alike” will have proximities close to one
The closer the proximity to 0, the more dissimilar cases i and j are
The measure is invariant to monotone transformations
The measure is clearly defined for any type of independent variables, including categorical
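Given the terminal-node assignment of every case in every tree (an N × T array, as produced by an `apply`-style method), the accumulation above reduces to a few lines. This sketch normalizes by the number of trees so that identical cases receive proximity exactly 1:

```python
import numpy as np

def proximity_matrix(leaves):
    """leaves[i, t] = terminal node reached by case i in tree t.
    Returns the N x N matrix whose (i, j) entry is the fraction of trees
    in which cases i and j land in the same terminal node."""
    n_cases, n_trees = leaves.shape
    prox = np.zeros((n_cases, n_cases))
    for t in range(n_trees):
        col = leaves[:, t]
        # broadcast comparison: 1 where cases i and j share a node in tree t
        prox += (col[:, None] == col[None, :])
    return prox / n_trees
```

Since the entries depend only on which node each case reaches, the measure inherits the trees' invariance to monotone transformations of the predictors.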
Based on proximities one can:
Proceed with a well-defined clustering solution
Note: the solution is guided by the target variable used in the RF model
Detect outliers
By computing average proximity between the current observation and all the remaining observations sharing the same class
Generate informative data views/projections using scaling coordinates
Non-metric multidimensional scaling produces most satisfactory results here
Do missing value imputation using current proximities as weights in the nearest neighbor imputation techniques
Ongoing work on possible expansion of the above to the unsupervised learning area of data mining
Post Processing and Interpretation
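The within-class outlier idea can be sketched directly from the proximity matrix (function and variable names are illustrative; it assumes each class has at least two cases):

```python
import numpy as np

def outlier_scores(prox, labels):
    """For each case, the average proximity to all other cases sharing its
    class; unusually low values flag potential outliers."""
    labels = np.asarray(labels)
    n = len(labels)
    scores = np.empty(n)
    for i in range(n):
        # all other cases with the same class label as case i
        same = (labels == labels[i]) & (np.arange(n) != i)
        scores[i] = prox[i, same].mean()
    return scores
```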
Introduction to Stochastic Gradient Boosting
TreeNet (TN) is a new approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University
Co-author of CART® with Breiman, Olshen and Stone
Author of MARS®, PRIM, Projection Pursuit, COSA, RuleFit™ and more
Also known as Stochastic Gradient Boosting and MART (Multiple Additive Regression Trees)
Naturally supports the following classes of predictive models
Regression (continuous target, LS and LAD loss functions)
Binary classification (binary target, logistic likelihood loss function)
Multinomial classification (multiclass target, multinomial likelihood loss function)
Poisson regression (counting target, Poisson likelihood loss function)
Exponential survival (positive target with censoring)
Proportional hazard Cox survival model
TN builds on the notions of committees of experts and boosting but is substantially different in key implementation details
Predictive Modeling
We are interested in studying the conditional distribution of the dependent variable Y given X in the predictor space
We assume that some quantity f can be used to fully or partially describe such distribution
In regression problems f is usually the mean or the median
In binary classification problems f is the log-odds of Y=1
In Cox survival problems f is the scaling factor in the unknown hazard function
Thus we want to construct a “nice” function f (X) which in turn can be used to study the behavior of Y at the given point in the predictor space
Function f (X) is sometimes referred to as “response surface”
We need to define how “nice” can be measured
[Diagram: X → Model → f]
Loss Functions
In predictive modeling the problem is usually attacked by introducing a well chosen loss function L(Y, X, f(X))
In stochastic gradient boosting we need a loss function for which gradients can easily be computed and used to construct good base learners
The loss function used on the test data does not need the same properties
Practical ways of constructing loss functions
Direct interpretation of f(Xi) as an estimate of Yi or of a population statistic of the distribution of Y conditional on X
Least Squares Loss (LS), fi is an estimate of E(Y|Xi)
Least Absolute Deviation Loss (LAD), fi is an estimate of median(Y|Xi)
Huber-M Loss, fi is an estimate of Yi
Choosing a conditional distribution for Y|X, defining f(X) as a parameter of that distribution, and using the negative log-likelihood as the loss function
Logistic Loss (conditional Bernoulli, f(X) is the half log-odds of Y=1)
Poisson Loss (conditional Poisson, f(X) is log(λ))
Exponential Loss (conditional Exponential, f(X) is log(λ))
More general likelihood functions, for example, multinomial discrete choice, the Cox model
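For the losses listed above, the negative gradients (the "generalized residuals" used later in the boosting loop) have simple closed forms. A sketch, using y ∈ {−1, +1} for the logistic case:

```python
import numpy as np

def ls_gradient(y, f):
    """LS loss (y - f)^2 / 2: the negative gradient is the ordinary residual."""
    return y - f

def lad_gradient(y, f):
    """LAD loss |y - f|: the negative gradient is the sign of the residual."""
    return np.sign(y - f)

def huber_gradient(y, f, delta=1.0):
    """Huber-M: behaves like LS inside +/- delta, like LAD outside."""
    r = y - f
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def logistic_gradient(y, f):
    """Logistic loss log(1 + exp(-2yf)) with y in {-1, +1} and f the half
    log-odds; the negative gradient is 2y / (1 + exp(2yf))."""
    return 2 * y / (1 + np.exp(2 * y * f))
```

The Huber gradient makes the "reasonable compromise" on the next slide concrete: it tracks the LS residual for small errors but caps its magnitude at δ for large ones.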
Regression and Classification Losses
[Figure, left panel: regression losses L as a function of y−f on (−3, 3) for LAD, Huber-M (δ = 0.5, 1.0, 1.5), and LS]
[Figure, right panel: classification losses L as a function of f on (−3, 3) when observed y = +1: (y−p)^2, |y−p|, exp(−2yf), and log(1+exp(−2yf))]
Huber-M regression loss is a reasonable compromise between the classical LS loss and robust LAD loss
Logistic log-likelihood based loss strikes the middle ground between the extremely sensitive exponential loss on one side and conventional LS and LAD losses on the other side
Practical Estimate
In reality, we have a set of N observed pairs (Xi, yi) from the population, not the entire population
Hence, we use sample-based estimates of L(Y, X, f(X))
To avoid biased estimates, one usually partitions the data into independent learn and test samples using the latter to compute an unbiased estimate of the population loss
In stochastic gradient boosting the problem is attacked by acting as if we are trying to minimize the loss function on the learn sample, but doing so in a slow, constrained way
This results in a series of models that move closer and closer to the f(X) that minimizes the loss on the learn sample. Eventually new models become overfit to the learn sample
From this sequence the function f(X) with the lowest loss on the test sample is chosen
By choosing from a fixed set of models overfitting to the test data is avoided
Sometimes the loss functions used on the test data and learn data differ
Parametric Approach
The function f(X) is introduced as a known function of a fixed set of unknown parameters
The problem then reduces to finding a set of optimal parameter estimates using classical optimization techniques
In linear regression and logistic regression: f(X) is a linear combination of fixed predictors; the parameters are the intercept and the slope coefficients
Major problem: the function and predictors need to be specified beforehand – this can result in a lengthy specification search process by trial and error
If this trial-and-error process uses the same data as the final model, that model will be overfit. This is the classical overfitting problem
If new data are used to estimate the final model and the model performs poorly, the specification search process must be repeated
This approach shows most benefits on small datasets where only simple specifications can be justified, or on datasets where there is strong a priori knowledge of the correct specification
Non-parametric Approach
Construct f(X) using data driven incremental approach
Start with a constant, then at each stage adjust the values of f(X) by small increments in various regions of data
It is important to keep the adjustment rate low – the resulting model will become smoother and be less subject to overfitting
Treating fi = f(Xi) at all individual observed data points as separate parameters, the negative of the gradient points in the direction of change in f(X) that results in the steepest reduction of the loss
G = {gi = −∂R/∂fi; i = 1, …, N}, where R denotes the learn-sample loss
The components of the negative gradient will be called generalized residuals
We want to limit the number of currently allowed separate adjustments to a small number M – a natural way to proceed then is to find an orthogonal partition of the X-space into M mutually exclusive regions such that the variance of the residuals within each region is minimized
This job is accomplished by building a fixed size M-node regression tree using the generalized residuals as the current target variable
TreeNet Process
Begin with the sample mean (e.g., for the logit model, set p = the sample share for all observations)
Add one very small tree as initial model based on gradients
For regression and logit, residuals are gradients
Could be as small as ONE split generating 2 terminal nodes
Typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes
Output is a continuous response surface (e.g. log-odds for binary classification)
Model is intentionally “weak”
Multiply contribution by a learning factor before adding it to model
Model is now: mean + Tree1
Compute new gradients (residuals)
The actual definition of the residual is driven by the type of the loss function
Grow second small tree to predict the residuals from the first tree
New model is now: mean + Tree1 + Tree2
Repeat iteratively while checking performance on an independent test sample
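The loop above, specialized to LS loss on a single predictor, can be sketched as follows. The one-split "tree", all names, and the omission of the random subsampling and the test-sample check are simplifications for illustration, not TreeNet itself:

```python
import numpy as np

def fit_regression_stump(x, r):
    """One-split regression tree: choose the threshold minimizing SSE and
    predict the mean residual in each of the two terminal nodes."""
    best = None
    for t in np.unique(x)[:-1]:                 # exclude max so both sides are non-empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda x_new: np.where(x_new <= t, lo, hi)

def gradient_boost_fit(x, y, n_trees=50, learn_rate=0.1):
    """Start from the sample mean, then repeatedly fit a tiny tree to the
    residuals (the LS negative gradient) and add a shrunken copy to the model."""
    f0 = y.mean()
    f = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - f                       # generalized residuals for LS loss
        tree = fit_regression_stump(x, residuals)
        f = f + learn_rate * tree(x)            # slow, constrained update
        trees.append(tree)
    def predict(x_new):
        out = np.full(np.shape(x_new), f0, dtype=float)
        for tree in trees:
            out = out + learn_rate * tree(np.asarray(x_new))
        return out
    return predict
```

The model is exactly the additive expansion described above: mean + Tree1 + Tree2 + …, each tree scaled by the learning factor.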
Benefits of TreeNet
Built on CART trees and thus
immune to outliers
selects variables,
results invariant with monotone transformations of variables
handles missing values automatically
Resistant to mislabeled target data
In medicine cases are commonly misdiagnosed
In business, occasionally non-responders flagged as “responders”
Resistant to over training – generalizes very well
Can be remarkably accurate with little effort
Trains very rapidly; comparable to CART
2009 KDD Cup 2nd place “Fast Scoring on Large Database”
2007 PAKDD competition: home loans up-sell to credit card owners 2nd place
Model built in half a day using previous year submission as a blueprint
2006 PAKDD competition: customer type discrimination 3rd place
Model built in one day. 1st place accuracy 81.9% TreeNet Accuracy 81.2%
2005 BI-CUP Sponsored by University of Chile attracted 60 competitors
2004 KDD Cup “Most Accurate”
2003 Duke University/NCR Teradata CRM modeling competition: “Most Accurate” and “Best Top Decile Lift” on both in- and out-of-time samples
A major financial services company has tested TreeNet across a broad range of targeted marketing and risk models for the past 2 years
TreeNet consistently outperforms previous best models (by around 10% in AUROC)
TreeNet models can be built in a fraction of the time previously devoted
TreeNet reveals previously undetected predictive power in data
TN Successes
Trees are kept small (2-6 nodes common)
Updates are small – can be as small as .01, .001, .0001
Use random subsets of the training data in each cycle
Never train on all the training data in any one cycle
Highly problematic cases are IGNORED
If model prediction starts to diverge substantially from observed data, that data will not be used in further updates
TN allows very flexible control over interactions:
Strictly Additive Models (no interactions allowed)
Low level interactions allowed
High level interactions allowed
Constraints: only specific interactions allowed (TN PRO)
Key Controls
As TN models consist of hundreds or even thousands of trees there is no useful way to represent the model via a display of one or two trees
However, the model can be summarized in a variety of ways
Partial Dependency Plots: These exhibit the relationship between the target and any predictor – as captured by the model
Variable Importance Rankings: These stable rankings give an excellent assessment of the relative importance of predictors
ROC and Gains Curves: TN models produce scores that are typically unique for each scored record
Confusion Matrix: Using an adjustable score threshold this matrix displays the model false positive and false negative rates
TreeNet models based on 2-node trees by definition EXCLUDE interactions
Model may be highly nonlinear but is by definition strictly additive
Every term in the model is based on a single variable (single split)
Build TreeNet on a larger tree (default is 6 nodes)
Permits up to 5-way interaction but in practice is more like 3-way interaction
Can conduct informal likelihood ratio test TN(2-node) versus TN(6-node)
Large differences signal important interactions
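Of the summaries above, the partial dependency plot is the easiest to sketch generically: sweep one predictor over a grid, overwrite that column in every row, and average the model's predictions. The function below works with any predict function (the names are illustrative):

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """For each grid value v, set column `feature` of every row of X to v
    and average the model's predictions -- the marginal relationship the
    partial dependency plot displays."""
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v       # force the chosen predictor to v everywhere
        pd_values.append(predict(Xv).mean())
    return np.array(pd_values)
```

For a strictly additive model (e.g. one built from 2-node trees), these curves recover each variable's contribution exactly up to a constant.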
Interpreting TN Models
Example: Boston Housing
The results of running TN on the Boston Housing dataset are shown
All of the key insights agree with similar findings by MARS and CART
Variable Score
LSTAT 100.00 ||||||||||||||||||||||||||||||||||||||||||
RM 83.71 |||||||||||||||||||||||||||||||||||
DIS 45.45 |||||||||||||||||||
CRIM 31.91 |||||||||||||
NOX 30.69 ||||||||||||
AGE 28.62 |||||||||||
PT 22.81 |||||||||
TAX 19.74 |||||||
INDUS 12.19 ||||
CHAS 11.93 ||||
References
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2000). The Elements of Statistical Learning. Springer.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.