Transcript of Tree net and_randomforests_2009
Introduction to Random Forests and Stochastic Gradient Boosting
Dan Steinberg, Mykhaylo Golovnya
August, 2009
Salford Systems © Copyright 2009
Initial Ideas on Combining Trees
Idea that combining good methods could yield promising results was suggested by researchers more than a decade ago
In tree-structured analysis, suggestion stems from:
Wray Buntine (1991)
Kwok and Carter (1990)
Heath, Kasif and Salzberg (1993)
Notion is that if the trees can somehow get at different aspects of the data, the combination will be “better”
Better in this context means more accurate in classification and prediction for future cases
The original implementation of CART already included bagging (Bootstrap Aggregation) and ARCing (Adaptive Resampling and Combining) approaches to build tree ensembles
Past Decade Development
The original bagging and boosting approaches relied on sampling with replacement techniques to obtain a new modeling dataset
Subsequent approaches focused on refining the sampling machinery or changing the modeling emphasis from the original dependent variable to current model generalized residuals
Most important variants (and dates of published articles) are:
Bagging (Breiman, 1996, “Bootstrap Aggregation”)
Boosting (Freund and Schapire, 1995)
Multiple Additive Regression Trees (Friedman, 1999, aka MART™ or TreeNet™)
RandomForests™ (Breiman, 2001)
Work continues with major refinements underway (Friedman in collaboration with Salford Systems)
Simplest example:
Grow a tree on training data
Find a way to grow another tree, different from those currently available (change something in the setup)
Repeat many times, say 500 replications
Average the results or create a voting scheme; for example, relate PD to the fraction of trees predicting default for a given case
Beauty of the method is that every new tree starts with a complete set of data. Any one tree can run out of data, but when that happens we just start again with a new tree and all the data (before sampling)
Prediction via voting
Multi Tree Methods
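The grow-many-trees-and-vote recipe above can be sketched in a few lines. Everything here (the one-split stump learner, the function names) is illustrative only, not Salford's implementation:

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, direction) split with the
    fewest misclassifications; return a predict function. A deliberately
    weak base learner standing in for a full tree."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                pred = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                errors = sum(p != yi for p, yi in zip(pred, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    _, j, t, sign = best
    return lambda row: 1 if sign * (row[j] - t) > 0 else 0

def bagged_ensemble(X, y, n_trees=25, seed=0):
    """Grow each stump on a bootstrap sample (same size as the original,
    drawn with replacement) -- every new tree starts from the full data pool."""
    rng = random.Random(seed)
    n, trees = len(X), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    def vote(row):
        return Counter(tree(row) for tree in trees).most_common(1)[0][0]
    return vote, trees

def predicted_fraction(trees, row):
    """Fraction of trees voting 1 -- e.g. relate PD to the share of trees
    predicting default for a given case."""
    return sum(tree(row) for tree in trees) / len(trees)
```

With ~500 replications instead of 25, the voted prediction stabilizes even though each individual stump is weak.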
Random Forest
A random forest is a collection of single trees grown in a special way
The overall prediction is determined by voting (in classification) or averaging (in regression)
Accuracy is achieved by using a large number of trees
The Law of Large Numbers ensures convergence
The key to accuracy is low correlation and bias
To keep bias and correlation low, trees are grown to maximum depth
Using more trees does not lead to overfitting, because each tree is grown independently
Correlation is kept low through explicitly introduced randomness
RandomForests™ often works well when other methods work poorly
The reasons for this are poorly understood
Sometimes other methods work well and RandomForests™ doesn’t
Randomness is introduced in order to keep correlation low
Randomness is introduced in two distinct ways
Each tree is grown on a bootstrap sample from the learning set
Default bootstrap sample size equals original sample size
Smaller bootstrap sample sizes are sometimes useful
A number R is specified (the square root of the number of predictors by default) such that it is noticeably smaller than the total number of available predictors
During tree growing phase, at each node only R predictors are randomly selected and tried
Randomness also reduces the signal to noise ratio in a single tree
A low correlation between trees is more important than a high signal when many trees contribute to forming the model
RandomForests™ trees often have very low signal strength, even when the signal strength of the forest is high
Important to Keep Correlation Low
Averaging many base learners improves the signal to noise ratio dramatically provided that the correlation of errors is kept low
Hundreds of base learners are needed for the most noticeable effect
[Figure: signal-to-noise multiplier as a function of the correlation of errors (0.05 to 0.8), plotted for ensembles of 1, 2, 5, 10, 20, 50, 100, 200, 500, and 1000 trees; the multiplier rises toward about 20 as correlation falls and tree count grows]
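The curves in the figure follow from a standard variance argument: if N base learners have errors of equal variance and pairwise correlation ρ, the variance of the averaged error is σ²(1 + (N−1)ρ)/N, so the variance-reduction multiplier is N/(1 + (N−1)ρ). A sketch (that this is the exact quantity plotted on the slide is an assumption):

```python
def snr_multiplier(n_trees, rho):
    """Variance-reduction factor from averaging n_trees base learners whose
    errors have pairwise correlation rho:
    Var(mean error) = sigma^2 * (1 + (n-1)*rho) / n,
    so the multiplier relative to one learner is n / (1 + (n-1)*rho)."""
    return n_trees / (1 + (n_trees - 1) * rho)
```

Note the two limits: with ρ = 0 the multiplier is N itself, while for large N it saturates at 1/ρ, which is why keeping correlation low matters more than adding ever more trees.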
Randomness in Split Selection
Topic discussed by several Machine Learning researchers
Possibilities:
Select splitter, split point, or both at random
Choose splitter at random from the top K splitters
Random Forests: Suppose we have M available predictors
Select R eligible splitters at random and let the best of them split the node
If R=1 this is just random splitter selection
If R=M this becomes Breiman’s bagger
If R << M then we get Breiman’s Random Forests
Breiman suggests R=sqrt(M) as a good rule of thumb
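The split-selection rule amounts to drawing a random subset of predictors at each node. A minimal sketch (function and parameter names are illustrative):

```python
import math
import random

def candidate_splitters(n_predictors, r=None, rng=None):
    """Pick the predictors eligible to split the current node.
    r=1 -> pure random splitter selection; r=M -> Breiman's bagger;
    r=None -> the sqrt(M) rule of thumb for Random Forests."""
    rng = rng or random.Random()
    if r is None:
        r = max(1, int(math.sqrt(n_predictors)))
    # sample without replacement from the predictor indices
    return rng.sample(range(n_predictors), r)
```

A fresh subset is drawn at every node of every tree, which is what keeps the trees decorrelated.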
Performance as a Function of R
[Figure: error as a function of the number of variables R searched at each split (N Vars, 0 to 120), shown for the 1st tree and for 100 trees]
In this experiment, we ran RF with 100 trees on sample data (772x111) using different values for the number of variables R (N Vars) searched at each split
Combining trees always improves performance, with the optimal number of sampled predictors settling at around 11
Usage Notes
RF does not require an explicit test sample
Capable of capturing high-order interactions
Both running speed and resources consumed depend for the most part on the row dimension of the data
Trees are grown in as simple a way as feasible to keep run times low (no surrogates, no priors, etc.)
Classification models produce pseudo-probability scores (percent of votes)
Performance-wise, RF is capable of matching the performance of modern boosting techniques, including MART (described later)
Naturally allows parallel processing
The final model code is usually bulky, voluminous, and impossible to interpret directly
Current stable implementations include multinomial classification and least squares regression, with ongoing research in the more advanced fields of predictive modeling (survival, choice, etc.)
Proximity Matrix – Raw Material for Further Advances
RF introduces a novel way to define proximity between two observations:
For a dataset of size N define an N x N matrix of proximities
Initialize all proximities to zeroes
For any given tree, apply the tree to the dataset
If case i and case j both end up in the same node, increase the proximity Prox(i,j) between i and j by one
Accumulate over all trees in RF and normalize by twice the number of trees in RF
The resulting matrix provides intrinsic measure of proximity
Observations that are “alike” will have proximities close to one
The closer the proximity to 0, the more dissimilar cases i and j are
The measure is invariant to monotone transformations
The measure is clearly defined for any type of independent variables, including categorical
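Given the terminal-node assignment of every case in every tree (an N × T array, as produced by an `apply`-style method), the accumulation above reduces to a few lines. This sketch normalizes by the number of trees so that identical cases receive proximity exactly 1:

```python
import numpy as np

def proximity_matrix(leaves):
    """leaves[i, t] = terminal node reached by case i in tree t.
    Returns the N x N matrix whose (i, j) entry is the fraction of trees
    in which cases i and j land in the same terminal node."""
    n_cases, n_trees = leaves.shape
    prox = np.zeros((n_cases, n_cases))
    for t in range(n_trees):
        col = leaves[:, t]
        # broadcast comparison: 1 where cases i and j share a node in tree t
        prox += (col[:, None] == col[None, :])
    return prox / n_trees
```

Since the entries depend only on which node each case reaches, the measure inherits the trees' invariance to monotone transformations of the predictors.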
Based on proximities one can:
Proceed with a well-defined clustering solution
Note: the solution is guided by the target variable used in the RF model
Detect outliers
By computing average proximity between the current observation and all the remaining observations sharing the same class
Generate informative data views/projections using scaling coordinates
Non-metric multidimensional scaling produces most satisfactory results here
Do missing value imputation using current proximities as weights in the nearest neighbor imputation techniques
Ongoing work on possible expansion of the above to the unsupervised learning area of data mining
Post Processing and Interpretation
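The within-class outlier idea can be sketched directly from the proximity matrix (function and variable names are illustrative; it assumes each class has at least two cases):

```python
import numpy as np

def outlier_scores(prox, labels):
    """For each case, the average proximity to all other cases sharing its
    class; unusually low values flag potential outliers."""
    labels = np.asarray(labels)
    n = len(labels)
    scores = np.empty(n)
    for i in range(n):
        # all other cases with the same class label as case i
        same = (labels == labels[i]) & (np.arange(n) != i)
        scores[i] = prox[i, same].mean()
    return scores
```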
Introduction to Stochastic Gradient Boosting
TreeNet (TN) is a new approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University
Co-author of CART® with Breiman, Olshen and Stone
Author of MARS®, PRIM, Projection Pursuit, COSA, RuleFit™ and more
Also known as Stochastic Gradient Boosting and MART (Multiple Additive Regression Trees)
Naturally supports the following classes of predictive models
Regression (continuous target, LS and LAD loss functions)
Binary classification (binary target, logistic likelihood loss function)
Multinomial classification (multiclass target, multinomial likelihood loss function)
Poisson regression (counting target, Poisson likelihood loss function)
Exponential survival (positive target with censoring)
Proportional hazard Cox survival model
TN builds on the notions of committees of experts and boosting but is substantially different in key implementation details
Predictive Modeling
We are interested in studying the conditional distribution of the dependent variable Y given X in the predictor space
We assume that some quantity f can be used to fully or partially describe such distribution
In regression problems f is usually the mean or the median
In binary classification problems f is the log-odds of Y=1
In Cox survival problems f is the scaling factor in the unknown hazard function
Thus we want to construct a “nice” function f (X) which in turn can be used to study the behavior of Y at the given point in the predictor space
Function f (X) is sometimes referred to as “response surface”
We need to define how “nice” can be measured
[Diagram: X → Model → f]
Loss Functions
In predictive modeling the problem is usually attacked by introducing a well chosen loss function L(Y, X, f(X))
In stochastic gradient boosting we need a loss function for which gradients can easily be computed and used to construct good base learners
The loss function used on the test data does not need the same properties
Practical ways of constructing loss functions
Direct interpretation of f(Xi) as an estimate of Yi or of a population statistic of the distribution of Y conditional on X
Least Squares Loss (LS), fi is an estimate of E(Y|Xi)
Least Absolute Deviation Loss (LAD), fi is an estimate of median(Y|Xi)
Huber-M Loss, fi is an estimate of Yi
Choosing a conditional distribution for Y|X, defining f(X) as a parameter of that distribution, and using the negative log-likelihood as the loss function
Logistic Loss (conditional Bernoulli, f(X) is the half log-odds of Y=1)
Poisson Loss (conditional Poisson, f(X) is log(λ))
Exponential Loss (conditional Exponential, f(X) is log(λ))
More general likelihood functions, for example, multinomial discrete choice, the Cox model
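For the losses listed above, the negative gradients (the "generalized residuals" used later in the boosting loop) have simple closed forms. A sketch, using y ∈ {−1, +1} for the logistic case:

```python
import numpy as np

def ls_gradient(y, f):
    """LS loss (y - f)^2 / 2: the negative gradient is the ordinary residual."""
    return y - f

def lad_gradient(y, f):
    """LAD loss |y - f|: the negative gradient is the sign of the residual."""
    return np.sign(y - f)

def huber_gradient(y, f, delta=1.0):
    """Huber-M: behaves like LS inside +/- delta, like LAD outside."""
    r = y - f
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def logistic_gradient(y, f):
    """Logistic loss log(1 + exp(-2yf)) with y in {-1, +1} and f the half
    log-odds; the negative gradient is 2y / (1 + exp(2yf))."""
    return 2 * y / (1 + np.exp(2 * y * f))
```

The Huber gradient makes the "reasonable compromise" on the next slide concrete: it tracks the LS residual for small errors but caps its magnitude at δ for large ones.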
Regression and Classification Losses
[Figure, left panel: regression losses L as a function of y−f on (−3, 3) for LAD, Huber-M (δ = 0.5, 1.0, 1.5), and LS]
[Figure, right panel: classification losses L as a function of f on (−3, 3) when observed y = +1: (y−p)^2, |y−p|, exp(−2yf), and log(1+exp(−2yf))]
Huber-M regression loss is a reasonable compromise between the classical LS loss and robust LAD loss
Logistic log-likelihood based loss strikes the middle ground between the extremely sensitive exponential loss on one side and conventional LS and LAD losses on the other side
Practical Estimate
In reality, we have a set of N observed pairs (Xi, yi) from the population, not the entire population
Hence, we use sample-based estimates of L(Y, X, f(X))
To avoid biased estimates, one usually partitions the data into independent learn and test samples using the latter to compute an unbiased estimate of the population loss
In stochastic gradient boosting the problem is attacked by acting as if we are trying to minimize the loss function on the learn sample, but doing so in a slow, constrained way
This results in a series of models that move closer and closer to the f(X) that minimizes the loss on the learn sample. Eventually new models become overfit to the learn sample
From this sequence the function f(X) with the lowest loss on the test sample is chosen
By choosing from a fixed set of models overfitting to the test data is avoided
Sometimes the loss functions used on the test data and learn data differ
Parametric Approach
The function f(X) is introduced as a known function of a fixed set of unknown parameters
The problem then reduces to finding a set of optimal parameter estimates using classical optimization techniques
In linear regression and logistic regression: f(X) is a linear combination of fixed predictors; the parameters are the intercept and the slope coefficients
Major problem: the function and predictors need to be specified beforehand – this can result in a lengthy specification search process by trial and error
If this trial-and-error process uses the same data as the final model, that model will be overfit. This is the classical overfitting problem
If new data are used to estimate the final model and the model performs poorly, the specification search process must be repeated
This approach shows most benefits on small datasets where only simple specifications can be justified, or on datasets where there is strong a priori knowledge of the correct specification
Non-parametric Approach
Construct f(X) using data driven incremental approach
Start with a constant, then at each stage adjust the values of f(X) by small increments in various regions of data
It is important to keep the adjustment rate low – the resulting model will become smoother and be less subject to overfitting
Treating fi = f(Xi) at all individual observed data points as separate parameters, the negative of the gradient points in the direction of change in f(X) that results in the steepest reduction of the loss
G = {gi = −∂R/∂fi; i = 1, …, N}, where R denotes the learn-sample loss
The components of the negative gradient will be called generalized residuals
We want to limit the number of currently allowed separate adjustments to a small number M – a natural way to proceed then is to find an orthogonal partition of the X-space into M mutually exclusive regions such that the variance of the residuals within each region is minimized
This job is accomplished by building a fixed size M-node regression tree using the generalized residuals as the current target variable
TreeNet Process
Begin with the sample mean (e.g., for the logit model, set p = the sample share for all observations)
Add one very small tree as initial model based on gradients
For regression and logit, residuals are gradients
Could be as small as ONE split generating 2 terminal nodes
Typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes
Output is a continuous response surface (e.g. log-odds for binary classification)
Model is intentionally “weak”
Multiply contribution by a learning factor before adding it to model
Model is now: mean + Tree1
Compute new gradients (residuals)
The actual definition of the residual is driven by the type of the loss function
Grow second small tree to predict the residuals from the first tree
New model is now: mean + Tree1 + Tree2
Repeat iteratively while checking performance on an independent test sample
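The loop above, specialized to LS loss on a single predictor, can be sketched as follows. The one-split "tree", all names, and the omission of the random subsampling and the test-sample check are simplifications for illustration, not TreeNet itself:

```python
import numpy as np

def fit_regression_stump(x, r):
    """One-split regression tree: choose the threshold minimizing SSE and
    predict the mean residual in each of the two terminal nodes."""
    best = None
    for t in np.unique(x)[:-1]:                 # exclude max so both sides are non-empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda x_new: np.where(x_new <= t, lo, hi)

def gradient_boost_fit(x, y, n_trees=50, learn_rate=0.1):
    """Start from the sample mean, then repeatedly fit a tiny tree to the
    residuals (the LS negative gradient) and add a shrunken copy to the model."""
    f0 = y.mean()
    f = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - f                       # generalized residuals for LS loss
        tree = fit_regression_stump(x, residuals)
        f = f + learn_rate * tree(x)            # slow, constrained update
        trees.append(tree)
    def predict(x_new):
        out = np.full(np.shape(x_new), f0, dtype=float)
        for tree in trees:
            out = out + learn_rate * tree(np.asarray(x_new))
        return out
    return predict
```

The model is exactly the additive expansion described above: mean + Tree1 + Tree2 + …, each tree scaled by the learning factor.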
Benefits of TreeNet
Built on CART trees and thus
immune to outliers
selects variables,
results invariant with monotone transformations of variables
handles missing values automatically
Resistant to mislabeled target data
In medicine cases are commonly misdiagnosed
In business, occasionally non-responders flagged as “responders”
Resistant to over training – generalizes very well
Can be remarkably accurate with little effort
Trains very rapidly; comparable to CART
2009 KDD Cup 2nd place “Fast Scoring on Large Database”
2007 PAKDD competition: home loans up-sell to credit card owners 2nd place
Model built in half a day using previous year submission as a blueprint
2006 PAKDD competition: customer type discrimination 3rd place
Model built in one day. 1st place accuracy 81.9% TreeNet Accuracy 81.2%
2005 BI-CUP Sponsored by University of Chile attracted 60 competitors
2004 KDD Cup “Most Accurate”
2003 Duke University/NCR Teradata CRM modeling competition: “Most Accurate” and “Best Top Decile Lift” on both in- and out-of-time samples
A major financial services company has tested TreeNet across a broad range of targeted marketing and risk models for the past 2 years
TreeNet consistently outperforms previous best models (by around 10% in AUROC)
TreeNet models can be built in a fraction of the time previously devoted
TreeNet reveals previously undetected predictive power in data
TN Successes
Trees are kept small (2-6 nodes common)
Updates are small – can be as small as .01, .001, .0001
Use random subsets of the training data in each cycle
Never train on all the training data in any one cycle
Highly problematic cases are IGNORED
If model prediction starts to diverge substantially from observed data, that data will not be used in further updates
TN allows very flexible control over interactions:
Strictly Additive Models (no interactions allowed)
Low level interactions allowed
High level interactions allowed
Constraints: only specific interactions allowed (TN PRO)
Key Controls
As TN models consist of hundreds or even thousands of trees there is no useful way to represent the model via a display of one or two trees
However, the model can be summarized in a variety of ways
Partial Dependency Plots: These exhibit the relationship between the target and any predictor – as captured by the model
Variable Importance Rankings: These stable rankings give an excellent assessment of the relative importance of predictors
ROC and Gains Curves: TN models produce scores that are typically unique for each scored record
Confusion Matrix: Using an adjustable score threshold this matrix displays the model false positive and false negative rates
TreeNet models based on 2-node trees by definition EXCLUDE interactions
Model may be highly nonlinear but is by definition strictly additive
Every term in the model is based on a single variable (single split)
Build TreeNet on a larger tree (default is 6 nodes)
Permits up to 5-way interaction but in practice is more like 3-way interaction
Can conduct informal likelihood ratio test TN(2-node) versus TN(6-node)
Large differences signal important interactions
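Of the summaries above, the partial dependency plot is the easiest to sketch generically: sweep one predictor over a grid, overwrite that column in every row, and average the model's predictions. The function below works with any predict function (the names are illustrative):

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """For each grid value v, set column `feature` of every row of X to v
    and average the model's predictions -- the marginal relationship the
    partial dependency plot displays."""
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v       # force the chosen predictor to v everywhere
        pd_values.append(predict(Xv).mean())
    return np.array(pd_values)
```

For a strictly additive model (e.g. one built from 2-node trees), these curves recover each variable's contribution exactly up to a constant.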
Interpreting TN Models
Example: Boston Housing
The results of running TN on the Boston Housing dataset are shown
All of the key insights agree with similar findings by MARS and CART
Variable Score
LSTAT 100.00 ||||||||||||||||||||||||||||||||||||||||||
RM 83.71 |||||||||||||||||||||||||||||||||||
DIS 45.45 |||||||||||||||||||
CRIM 31.91 |||||||||||||
NOX 30.69 ||||||||||||
AGE 28.62 |||||||||||
PT 22.81 |||||||||
TAX 19.74 |||||||
INDUS 12.19 ||||
CHAS 11.93 ||||
References
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2000). The Elements of Statistical Learning. Springer.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.