Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)

Description

Dr. Trevor Hastie of Stanford University discusses the data science behind Gradient Boosted Regression and Classification (GBM).

Transcript of Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)

Page 1: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Ensemble Learners

• Classification Trees (Done)

• Bagging: Averaging Trees

• Random Forests: Cleverer Averaging of Trees

• Boosting: Cleverest Averaging of Trees

• Details in chaps 10, 15 and 16

Methods for dramatically improving the performance of weak learners such as Trees. Classification trees are adaptive and robust, but do not make good prediction models. The techniques discussed here enhance their performance considerably.

Page 2: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Properties of Trees

• [✔] Can handle huge datasets
• [✔] Can handle mixed predictors—quantitative and qualitative
• [✔] Easily ignore redundant variables
• [✔] Handle missing data elegantly through surrogate splits
• [✔] Small trees are easy to interpret
• [✖] Large trees are hard to interpret
• [✖] Prediction performance is often poor

Page 3: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Pruned classification tree for the SPAM data. Internal nodes split on word and character frequencies such as ch$ < 0.0555 (the root split), remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, receive < 0.125, edu < 0.045, our < 1.2, CAPAVE < 2.7505, free < 0.065, business < 0.145, and 1999 < 0.58; each node shows its misclassified/total counts (600/1536 at the root), and terminal nodes are labeled spam or email.]

Page 4: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: ROC curve (sensitivity versus specificity) for the pruned tree on the SPAM data. TREE error: 8.7%.]

SPAM Data

Overall error rate on test data: 8.7%.

The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0.

Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam: Specificity 95% =⇒ Sensitivity 79%.
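As a rough illustration of how such a curve is traced, the hedged R sketch below sweeps the threshold c0 over the predicted spam probabilities of a fitted tree; the objects fit and test and the class labels are assumptions, not the talk's actual code.

library(rpart)

# Assumed setup: 'fit' is an rpart classification tree and 'test' a held-out
# data frame with a factor response 'spam' taking values email/spam.
p_spam <- predict(fit, test, type = "prob")[, "spam"]   # Pr(spam | x)
thresholds <- seq(0, 1, by = 0.01)                      # candidate values of c0

roc <- t(sapply(thresholds, function(c0) {
  pred <- ifelse(p_spam > c0, "spam", "email")          # C(x) = +1 if Pr(+1|x) > c0
  c(sensitivity = mean(pred[test$spam == "spam"]  == "spam"),   # true spam identified
    specificity = mean(pred[test$spam == "email"] == "email"))  # true email identified
}))

plot(1 - roc[, "specificity"], roc[, "sensitivity"], type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity")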

Page 5: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Toy Classification Problem

[Figure: Training data for the toy classification problem, plotted in the (X1, X2) plane with the two classes marked 0 and 1 and the Bayes decision boundary drawn in black. Bayes error rate: 0.25.]

• Data X and Y, with Y taking values +1 or −1.
• Here X ∈ R².
• The black boundary is the Bayes Decision Boundary, the best one can do. Here the Bayes error is 25%.
• Goal: given N training pairs (Xi, Yi), produce a classifier C(X) ∈ {−1, +1}.
• Also estimate the probability of the class labels, Pr(Y = +1|X).

Page 6: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Toy Example — No Noise

[Figure: Training data for the noiseless toy example in the (X1, X2) plane, with the two classes marked 0 and 1. Bayes error rate: 0.]

• Deterministic problem; noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here the Bayes error is 0%.

Page 7: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Classification Tree

[Figure: Classification tree fit to the 200-point training sample. The root splits on x.2 < -1.06711 (94/200 misclassified/total), with further splits on x.2 < 1.14988, x.1 < 1.13632, x.1 < -0.900735, x.1 < -1.1668, x.1 < -1.07831 and x.2 < -0.823968; each node shows its misclassified/total count and majority class (0 or 1).]

Page 8: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Decision Boundary: Tree

[Figure: Decision boundary of the classification tree in the (X1, X2) plane, overlaid on the training data. Error rate: 0.073.]

• The eight-node tree is boxy, and has a test-error rate of 7.3%.
• When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule C(X), with error rates around 30%.
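A minimal R sketch of this kind of experiment, under stated assumptions: the radius threshold, the sample sizes and the rpart settings below are illustrative choices, not the exact ones behind the figures.

library(rpart)

# Simulate a deterministic "nested spheres"-style problem and fit a tree.
make_toy <- function(n, p = 2, r2 = qchisq(0.5, df = p)) {  # threshold chosen so classes balance
  X <- matrix(rnorm(n * p), n, p)
  y <- factor(ifelse(rowSums(X^2) > r2, 1, 0))              # class determined by squared radius
  data.frame(X, y = y)
}

set.seed(1)
train <- make_toy(200)        # training sample of size 200, as on the slide
test  <- make_toy(10000)

fit  <- rpart(y ~ ., data = train, method = "class")
pred <- predict(fit, test, type = "class")
mean(pred != test$y)          # test error of the boxy tree rule (the slide reports 7.3% for its version)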

Page 9: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Model Averaging

Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
• Random Forests (Breiman, 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.

Page 10: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Bagging

• Bagging or bootstrap aggregation averages a noisy fitted function refit to many bootstrap samples, to reduce its variance. Bagging is a poor man's Bayes; see pp. 282.
• Suppose C(S, x) is a fitted classifier, such as a tree, fit to our training data S, producing a predicted class label at input point x.
• To bag C, we draw bootstrap samples S∗1, . . . , S∗B, each of size N with replacement from the training data. Then
  Cbag(x) = Majority Vote {C(S∗b, x)}, b = 1, . . . , B.
• Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.
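The recipe above is short enough to sketch directly in R; a hedged sketch follows, where the data frame train (with factor response y), the test set and B are assumed placeholders.

library(rpart)

# Bagging: fit a large tree to each of B bootstrap samples, classify by majority vote.
bag_trees <- function(train, test, B = 100) {
  n <- nrow(train)
  votes <- sapply(1:B, function(b) {
    boot <- train[sample(n, n, replace = TRUE), ]        # bootstrap sample S*b
    tree <- rpart(y ~ ., data = boot, method = "class",
                  control = rpart.control(cp = 0))       # grow the tree large
    as.character(predict(tree, test, type = "class"))    # C(S*b, x) on the test points
  })
  apply(votes, 1, function(v) names(which.max(table(v)))) # majority vote over the B trees
}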

Page 11: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: The original tree and eleven trees (b = 1, ..., 11) grown on bootstrap samples of the same data; the bootstrap trees choose different split variables and thresholds (e.g. x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.2 < 0.285, x.3 < 0.985, x.4 < -1.36).]

Page 12: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Decision Boundary: Bagging

[Figure: Decision boundary from bagging in the (X1, X2) plane, overlaid on the training data. Error rate: 0.032.]

• Bagging averages many trees, and produces smoother decision boundaries.
• Bagging reduces error by variance reduction.
• Bias is slightly increased because bagged trees are slightly shallower.

Page 13: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Random forests

• Refinement of bagged trees; very popular. See Chapter 15.
• At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2 p, where p is the number of features.
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" or OOB error rate.
• Random forests tries to improve on bagging by "de-correlating" the trees, and so reduce the variance.
• Each tree has the same expectation, so increasing the number of trees does not alter the bias of bagging or random forests.
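A hedged sketch of the corresponding call to the randomForest package (listed on the Software slide at the end); the data frame and the settings are placeholders.

library(randomForest)

# Random forest with m = sqrt(p) variables tried at each split, and its OOB error.
p  <- ncol(train) - 1                      # number of features, excluding the response y
rf <- randomForest(y ~ ., data = train,
                   ntree = 500,            # number of bootstrap trees
                   mtry  = floor(sqrt(p))) # m features sampled at each split
rf$err.rate[rf$ntree, "OOB"]               # out-of-bag error estimate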

Page 14: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Wisdom of Crowds: expected number correct out of 10 versus P, the probability of an informed person being correct, for the consensus vote and for an individual, with P ranging from 0.25 to 1.00.]

Wisdom of Crowds

• 10 movie categories, 4 choices in each category.
• 50 people; for each question, 15 informed and 35 guessers.
• For informed people, Pr(correct) = p, Pr(incorrect) = (1 − p)/3.
• For guessers, Pr(all choices) = 1/4.
Consensus (orange) improves over individuals (teal) even for moderate p.
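This setup is easy to simulate; the hedged R sketch below follows the description on the slide, with the number of simulated replays an arbitrary choice.

# Wisdom-of-crowds simulation: 10 questions with 4 choices, 50 voters per question
# (15 informed with Pr(correct) = p, 35 pure guessers); answer 1 is taken as correct.
wisdom <- function(p, nsim = 2000) {
  mean(replicate(nsim, {
    correct <- sapply(1:10, function(q) {
      informed <- sample(1:4, 15, replace = TRUE, prob = c(p, rep((1 - p) / 3, 3)))
      guessers <- sample(1:4, 35, replace = TRUE)
      votes <- table(factor(c(informed, guessers), levels = 1:4))
      which.max(votes) == 1                 # does the consensus pick the right answer?
    })
    sum(correct)                            # number correct out of 10 for the consensus
  }))
}

wisdom(0.6)    # compare with 10 * 0.6 = 6 expected correct for a single informed person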

Page 15: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: ROC curves (sensitivity versus specificity) for TREE, SVM and Random Forest on the SPAM data. Random Forest error: 4.75%; SVM error: 6.7%; TREE error: 8.7%.]

TREE, SVM and RF

Random Forest dominates both other methods on the SPAM data, with 4.75% error. Used 500 trees with default settings for the randomForest package in R.

Page 16: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Correlation between trees as a function of m, the number of randomly selected splitting variables.]

RF: decorrelation

• Var fRF(x) = ρ(x) σ²(x).
• The correlation is because we are growing trees to the same data.
• The correlation decreases as m decreases; m is the number of variables sampled at each split.
Based on a simple experiment with random forest regression models.

Page 17: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Random Forest Ensemble: mean squared error, squared bias and variance as a function of m.]

Bias & Variance

• The smaller m, the lower the variance of the random forest ensemble.
• Small m is also associated with higher bias, because important variables can be missed by the sampling.

Page 18: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Boosting schematic: classifiers C1(x), C2(x), C3(x), ..., CM(x) are fit in sequence, the first to the training sample and each subsequent one to a reweighted version of it.]

Boosting

• Breakthrough invention of Freund & Schapire (1997).
• Average many trees, each grown to reweighted versions of the training data.
• Weighting decorrelates the trees, by focussing on regions missed by past trees.
• Final classifier is a weighted average of classifiers:
  C(x) = sign[ ∑_{m=1}^{M} αm Cm(x) ].
Note: Cm(x) ∈ {−1, +1}.

Page 19: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Test error versus number of terms for Bagging and AdaBoost, using 100-node trees.]

Boosting vs Bagging

• 2000 points from Nested Spheres in R10.
• Bayes error rate is 0%.
• Trees are grown best-first without pruning.
• Leftmost term is a single tree.

Page 20: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N.
2. For m = 1 to M repeat steps (a)–(d):
   (a) Fit a classifier Cm(x) to the training data using weights wi.
   (b) Compute the weighted error of the newest tree:
       errm = ∑_{i=1}^{N} wi I(yi ≠ Cm(xi)) / ∑_{i=1}^{N} wi.
   (c) Compute αm = log[(1 − errm)/errm].
   (d) Update the weights for i = 1, . . . , N:
       wi ← wi · exp[αm · I(yi ≠ Cm(xi))]
       and renormalize the wi to sum to 1.
3. Output C(x) = sign[ ∑_{m=1}^{M} αm Cm(x) ].
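The algorithm translates almost line for line into R; below is a hedged sketch that uses rpart stumps as the weak classifiers Cm(x). The data format (a data frame X of predictors, y coded −1/+1) and M are assumptions.

library(rpart)

adaboost_stumps <- function(X, y, M = 100) {
  N <- nrow(X)
  w <- rep(1 / N, N)                                # 1. initialize observation weights
  trees <- vector("list", M); alpha <- numeric(M)
  dat <- data.frame(X, y = factor(y))
  for (m in 1:M) {
    fit <- rpart(y ~ ., data = dat, weights = w, method = "class",   # (a) weighted stump
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2, xval = 0))
    pred <- ifelse(predict(fit, dat, type = "class") == "1", 1, -1)
    err  <- sum(w * (pred != y)) / sum(w)           # (b) weighted error
    err  <- min(max(err, 1e-10), 1 - 1e-10)         # guard against log(0)
    alpha[m] <- log((1 - err) / err)                # (c) classifier weight
    w <- w * exp(alpha[m] * (pred != y))            # (d) upweight misclassified points
    w <- w / sum(w)                                 #     and renormalize
    trees[[m]] <- fit
  }
  list(trees = trees, alpha = alpha)
}

adaboost_predict <- function(model, X) {
  F <- Reduce(`+`, lapply(seq_along(model$trees), function(m) {
    p <- predict(model$trees[[m]], data.frame(X), type = "class")
    model$alpha[m] * ifelse(p == "1", 1, -1)
  }))
  sign(F)                                           # 3. weighted majority vote
}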

Page 21: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Test error versus boosting iterations for boosted stumps, compared with a single stump and a 400-node tree.]

Boosting Stumps

• A stump is a two-node tree, after a single split.
• Boosting stumps works remarkably well on the nested-spheres problem.

Page 22: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Training and test error versus number of terms.]

Training Error

• Nested spheres in 10 dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero (green curve).
• Further iterations continue to improve the test error in many examples (red curve).

Page 23: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Training and test error versus number of terms, with the Bayes error marked.]

Noisy Problems

• Nested Gaussians in 10 dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.

Page 24: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Stagewise Additive Modeling

Boosting builds an additive model

F(x) = ∑_{m=1}^{M} βm b(x; γm).

Here b(x; γm) is a tree, and γm parametrizes the splits. We do things like that in statistics all the time!
• GAMs: F(x) = ∑_j fj(xj)
• Basis expansions: F(x) = ∑_{m=1}^{M} θm hm(x)
Traditionally the parameters fm, θm are fit jointly (i.e. least squares, maximum likelihood). With boosting, the parameters (βm, γm) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.

Page 25: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Example: Least Squares Boosting

1. Start with the function F0(x) = 0 and residual r = y; set m = 0.
2. m ← m + 1
3. Fit a CART regression tree to r, giving g(x).
4. Set fm(x) ← ǫ · g(x).
5. Set Fm(x) ← Fm−1(x) + fm(x) and r ← r − fm(x); repeat steps 2–5 many times.
• g(x) is typically a shallow tree, e.g. constrained to have only k = 3 splits (depth).
• Stagewise fitting (no backfitting!). Slows down the (over-)fitting process.
• 0 < ǫ ≤ 1 is a shrinkage parameter; it slows overfitting even further.
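A hedged R sketch of steps 1–5, using shallow rpart regression trees for g(x); the data, M, ǫ and the depth are illustrative values.

library(rpart)

ls_boost <- function(X, y, M = 500, eps = 0.01, depth = 3) {
  r <- y                                        # 1. F0 = 0, residual starts at y
  trees <- vector("list", M)
  for (m in 1:M) {
    g <- rpart(r ~ ., data = data.frame(X, r = r),        # 3. regression tree fit to r
               control = rpart.control(maxdepth = depth, cp = 0, xval = 0))
    trees[[m]] <- g
    r <- r - eps * predict(g, data.frame(X))    # 4-5. shrink by eps, update F and r
  }
  structure(list(trees = trees, eps = eps), class = "ls_boost")
}

predict.ls_boost <- function(object, X, ...) {
  # F_M(x) = sum over m of eps * g_m(x)
  Reduce(`+`, lapply(object$trees, function(g) object$eps * predict(g, data.frame(X))))
}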

Page 26: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Least Squares Boosting and ǫ

[Figure: Prediction error versus the number of steps m, for ǫ = 1, ǫ = 0.01, and a single tree.]

• If instead of trees we use linear regression terms, boosting with small ǫ is very similar to the lasso!

Page 27: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


AdaBoost: Stagewise Modeling

• AdaBoost builds an additive logistic regression model

  F(x) = log[ Pr(Y = 1|x) / Pr(Y = −1|x) ] = ∑_{m=1}^{M} αm fm(x)

  by stagewise fitting using the loss function

  L(y, F(x)) = exp(−y F(x)).

• Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See pp. 343.

Page 28: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Why Exponential Loss?

[Figure: Loss as a function of the margin y · F, comparing misclassification, exponential, binomial deviance, squared error and support vector losses.]

• y · F is the margin; a positive margin means correct classification.
• e^(−yF(x)) is a monotone, smooth upper bound on misclassification loss at x.
• Leads to a simple reweighting scheme.
• Has the logit transform as its population minimizer:
  F∗(x) = (1/2) log[ Pr(Y = 1|x) / Pr(Y = −1|x) ]
• Other more robust loss functions, like binomial deviance, are also available.

Page 29: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


General Boosting Algorithms

The boosting paradigm can be extended to general loss functions — not only squared error or exponential.
1. Initialize F0(x) = 0.
2. For m = 1 to M:
   (a) Compute (βm, γm) = argmin_{β,γ} ∑_{i=1}^{N} L(yi, Fm−1(xi) + β b(xi; γ)).
   (b) Set Fm(x) = Fm−1(x) + βm b(x; γm).
Sometimes we replace step (b) in item 2 by
   (b∗) Set Fm(x) = Fm−1(x) + ǫ βm b(x; γm).
Here ǫ is a shrinkage factor; in the gbm package in R, the default is ǫ = 0.001. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.

Page 30: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Gradient Boosting

• General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.
• Tree size is a parameter that determines the order of interaction (next slide).
• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in Section 10.10, and is implemented in Greg Ridgeway's gbm package in R.
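A hedged sketch of what a gbm fit along these lines looks like; the data frame name spam, its 0/1 response column y, and the tuning values are assumptions, not the talk's exact settings.

library(gbm)

fit <- gbm(y ~ ., data = spam,
           distribution = "bernoulli",   # logistic (binomial deviance) loss: models the logit
           n.trees = 2500,               # number of boosting iterations M
           interaction.depth = 5,        # 5 splits per tree, i.e. J = 6 terminal nodes
           shrinkage = 0.001,            # the shrinkage factor epsilon
           cv.folds = 5)                 # cross-validation to choose the number of trees

best <- gbm.perf(fit, method = "cv")     # selected number of trees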

Page 31: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Spam Data: test error versus number of trees for Bagging, Random Forest, and Gradient Boosting (5-node trees), over 0 to 2500 trees.]

Page 32: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Spam Example Results

With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model

f(x) = log[ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.

Gradient Boosting achieves a test error of 4.5%, compared to 5.3% for an additive GAM, 4.75% for Random Forests, and 8.7% for CART.

Page 33: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Spam: Variable Importance

[Figure: Relative importance (scaled 0-100) of the spam predictors; the most important include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you and our, with the remaining predictors trailing off toward zero.]

Page 34: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Spam: Partial Dependence

[Figure: Partial dependence of the log-odds of spam on four predictors: !, remove, edu and hp.]
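For a fitted gbm model such as the sketch a few slides back, displays like the last two slides correspond roughly to the calls below; fit, best and the variable names are assumed.

summary(fit, n.trees = best)                         # relative influence of each predictor
plot(fit, i.var = "remove", n.trees = best)          # partial dependence on a single variable
plot(fit, i.var = c("hp", "remove"), n.trees = best) # joint partial dependence on two variables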

Page 35: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: California Housing Data: test average absolute error versus number of trees for RF (m = 2), RF (m = 6), GBM (depth = 4) and GBM (depth = 6).]

Page 36: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


[Figure: Test error versus number of terms for boosted stumps, 10-node trees, 100-node trees, and AdaBoost.]

Tree Size

The tree size J determines the interaction order of the model:

η(X) = ∑_j ηj(Xj) + ∑_{jk} ηjk(Xj, Xk) + ∑_{jkl} ηjkl(Xj, Xk, Xl) + · · ·

Page 37: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Stumps win!

Since the true decision boundary is the surface of a sphere, the function that describes it has the form

f(X) = X1² + X2² + · · · + Xp² − c = 0.

Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.

[Figure: Coordinate functions for additive logistic trees: f1(x1), f2(x2), ..., f10(x10).]

Page 38: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Learning Ensembles

Ensemble Learning consists of two steps:
1. Construction of a dictionary D = {T1(X), T2(X), . . . , TM(X)} of basis elements (weak learners) Tm(X).
2. Fitting a model f(X) = ∑_{m∈D} αm Tm(X).
Simple examples of ensembles:
• Linear regression: The ensemble consists of the coordinate functions Tm(X) = Xm. The fitting is done by least squares.
• Random Forests: The ensemble consists of trees grown to bootstrapped versions of the data, with additional randomization at each split. The fitting simply averages.
• Gradient Boosting: The ensemble is grown in an adaptive fashion, but then simply averaged at the end.

Page 39: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Learning Ensembles and the Lasso

Proposed by Popescu and Friedman (2004):
Fit the model f(X) = ∑_{m∈D} αm Tm(X) in step (2) using Lasso regularization:

α(λ) = argmin_α ∑_{i=1}^{N} L[yi, α0 + ∑_{m=1}^{M} αm Tm(xi)] + λ ∑_{m=1}^{M} |αm|.

Advantages:
• Often ensembles are large, involving thousands of small trees. The post-processing selects a smaller subset of these and combines them efficiently.
• Often the post-processing improves the process that generated the ensemble.
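A hedged sketch of this post-processing idea: generate the dictionary with a random forest and re-weight its individual trees with an L1-penalized fit via glmnet. The object names, the class coding, and the use of training-set predictions are simplifying assumptions.

library(randomForest)
library(glmnet)

rf <- randomForest(x = X, y = factor(y), ntree = 1000)   # step 1: build the dictionary of trees

# Per-tree predictions on the training data: an N x ntree matrix of class labels.
votes <- predict(rf, X, predict.all = TRUE)$individual
Z <- ifelse(votes == "1", 1, -1)        # code each tree's vote as +/-1 (assumes classes "0"/"1")

# Step 2: lasso combination of the trees (binomial loss, alpha = 1 gives the L1 penalty).
post <- cv.glmnet(Z, factor(y), family = "binomial", alpha = 1)
coef(post, s = "lambda.min")            # many tree coefficients are shrunk exactly to zero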

Page 40: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Post-Processing Random Forest Ensembles

[Figure: Spam Data: test error versus number of trees for Random Forest, Random Forest (5%, 6), and Gradient Boost (5-node trees).]

Page 41: Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)


Software

• R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
  – randomForest: implementation of Leo Breiman's algorithms.
  – rpart: Terry Therneau's implementation of classification and regression trees.
  – gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
• Salford Systems: commercial implementation of trees, random forests and gradient boosting.
• Splus (Insightful): commercial version of S.
• Weka: GPL software from the University of Waikato, New Zealand. Includes trees, random forests and many other procedures.