Applied Machine Learning
Bootstrap, Bagging and Boosting
Siamak Ravanbakhsh
COMP 551 (winter 2020)
Learning objectives
bootstrap for uncertainty estimation
bagging for variance reduction
random forests
boosting: AdaBoost, gradient boosting, relationship to L1 regularization
Bootstrap
a simple approach to estimate the uncertainty in prediction
given the dataset D = {(x^(n), y^(n))}_{n=1}^N
subsample with replacement B datasets of size N: D_b = {(x^(n,b), y^(n,b))}_{n=1}^N, b = 1, …, B
train a model on each of these bootstrap datasets (called bootstrap samples)
produce a measure of uncertainty from these models: for model parameters, or for predictions
this resampling scheme is the non-parametric bootstrap
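The procedure above fits in a few lines of NumPy; a minimal sketch on a made-up dataset, using the sample mean as the quantity whose uncertainty we want (the data and the choice of statistic are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # toy dataset (an assumption)

B = 2000
means = np.empty(B)
for b in range(B):
    # subsample with replacement, same size as the original dataset
    sample = rng.choice(data, size=data.size, replace=True)
    means[b] = sample.mean()  # the statistic computed on each bootstrap sample

# spread of the statistic across bootstrap samples = uncertainty estimate
lo, hi = np.quantile(means, [.05, .95])  # 90% bootstrap interval
```

The same recipe works for any statistic: refit the model (or recompute the quantity) on each bootstrap sample and look at the spread of the results.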
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), ϕ_k(x) = exp(−(x − μ_k)²/s²)
y^(n) = sin(x^(n)) + cos(|x^(n)|) + ϵ

    # x: N,  y: N
    plt.plot(x, y, 'b.')
    phi = lambda x, mu: np.exp(-(x - mu)**2)
    mu = np.linspace(0, 10, 10)         # 10 Gaussian bases
    Phi = phi(x[:, None], mu[None, :])  # N x 10
    w = np.linalg.lstsq(Phi, y)[0]
    yh = np.dot(Phi, w)
    plt.plot(x, yh, 'g-')

figure: the data before adding noise, the noise, and our fit to the data using 10 Gaussian bases
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), using B=500 bootstrap samples

    # Phi: N x D,  y: N
    B = 500
    ws = np.zeros((B, D))
    for b in range(B):
        inds = np.random.randint(N, size=(N))
        Phi_b = Phi[inds, :]  # N x D
        y_b = y[inds]         # N
        # fit the subsampled data
        ws[b, :] = np.linalg.lstsq(Phi_b, y_b)[0]
    plt.hist(ws, bins=50)

gives a measure of uncertainty of the parameters
figure: histogram of the bootstrap weights; each color is a different weight w_d
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), using B=500 bootstrap samples
also gives a measure of uncertainty of the predictions

    # Phi: N x D,  Phi_test: Nt x D,  y: N
    # ws: B x D from previous code
    y_hats = np.zeros((B, Nt))
    for b in range(B):
        wb = ws[b, :]
        y_hats[b, :] = np.dot(Phi_test, wb)
    # get 5% and 95% quantiles
    y_5 = np.quantile(y_hats, .05, axis=0)
    y_95 = np.quantile(y_hats, .95, axis=0)

figure: the red lines are the 5% and 95% quantiles (for each point we can get these across bootstrap model predictions)
Winter 2020 | Applied Machine Learning (COMP551)
Bagging
use bootstrap for more accurate prediction (not just uncertainty)
variance of a sum of random variables:
Var(z_1 + z_2) = E[(z_1 + z_2)²] − E[z_1 + z_2]²
= E[z_1² + z_2² + 2 z_1 z_2] − (E[z_1] + E[z_2])²
= E[z_1²] + E[z_2²] + 2 E[z_1 z_2] − E[z_1]² − E[z_2]² − 2 E[z_1] E[z_2]
= Var(z_1) + Var(z_2) + 2 Cov(z_1, z_2)
for uncorrelated variables the covariance term is zero
Bagging
use bootstrap for more accurate prediction (not just uncertainty)
the average of uncorrelated random variables has a lower variance:
if z_1, …, z_B are uncorrelated random variables with mean μ and variance σ², the average z̄ = (1/B) ∑_b z_b has mean μ and variance
Var((1/B) ∑_b z_b) = (1/B²) ∑_b Var(z_b) = B σ² / B² = σ² / B
use this to reduce the variance of our models (the bias remains the same)
regression: average the model predictions f̄(x) = (1/B) ∑_b f_b(x)
issue: model predictions are not uncorrelated (trained using the same data)
bagging (bootstrap aggregation): use bootstrap samples to reduce correlation
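The σ²/B variance reduction is easy to verify numerically; a small simulation (the distribution and the values of B and σ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
B, trials, sigma = 50, 20000, 1.0
# in each trial, B uncorrelated variables with variance sigma^2
z = rng.normal(0.0, sigma, size=(trials, B))
zbar = z.mean(axis=1)  # their average, one per trial

var_single = z[:, 0].var()  # ≈ sigma^2
var_avg = zbar.var()        # ≈ sigma^2 / B
```

The empirical variance of the average comes out roughly B times smaller than that of a single variable, matching the derivation above.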
Bagging for classification
averaging makes sense for regression; how about classification?
wisdom of crowds: suppose z_1, …, z_B ∈ {0, 1} are IID Bernoulli random variables with mean μ = .5 + ϵ, and let z̄ = (1/B) ∑_b z_b
for ϵ > 0, p(z̄ > .5) goes to 1 as B grows
the mode of iid classifiers that are better than chance is a better classifier: use voting
crowds are wiser when individuals are better than random and votes are uncorrelated
bagging (bootstrap aggregation): use bootstrap samples to reduce correlation
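The wisdom-of-crowds claim can be checked with a quick simulation: majority vote over independent voters that are each right with probability .5 + ϵ (the values of ϵ, B, and the number of trials are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.1, 5000  # each voter is correct with probability .5 + eps

def p_majority_correct(B):
    # votes[i, b] = 1 if voter b is correct in trial i
    votes = rng.random((trials, B)) < 0.5 + eps
    # the majority is correct when more than half the votes are correct
    return (votes.mean(axis=1) > 0.5).mean()

p1 = p_majority_correct(1)      # a single voter: ≈ .5 + eps
p101 = p_majority_correct(101)  # a large crowd: close to 1
```

With independent voters the majority's accuracy climbs toward 1 as B grows; correlated voters would break this, which is exactly why bagging decorrelates the models.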
Bagging decision trees
example setup: synthetic dataset with 5 correlated features; the 1st feature is a noisy predictor of the label
bootstrap samples create different decision trees (due to high variance)
to aggregate the B trees: vote for the most probable class, or average the predicted probabilities
compared to decision trees, the ensemble is no longer interpretable!
Random forests
further reduce the correlation between decision trees
feature sub-sampling: only a random subset of features is available for the split at each step
this further reduces the dependence between decision trees
a common default is √D features per split (a magic number?); this is really a hyper-parameter and can be optimized using CV
Out Of Bag (OOB) samples: the instances not included in a bootstrap dataset can be used for validation
this gives simultaneous validation of the decision trees in a forest; no need to set aside data for cross validation
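Why do OOB samples exist at all? Sampling N points with replacement leaves each instance out with probability (1 − 1/N)^N ≈ e⁻¹ ≈ 37%, so every tree has a built-in validation set. A small check of that fraction (the values of N and B are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 1000, 200
oob_fracs = np.empty(B)
for b in range(B):
    inds = rng.integers(N, size=N)   # one bootstrap sample (with replacement)
    in_bag = np.zeros(N, dtype=bool)
    in_bag[inds] = True
    oob_fracs[b] = (~in_bag).mean()  # fraction of instances left out-of-bag

avg_oob = oob_fracs.mean()  # ≈ (1 - 1/N)^N ≈ e^-1 ≈ 0.368
```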
Example: spam detection
Dataset: N=4601 emails; binary classification task: spam vs. not spam; D=57 features:
48 words: percentage of words in the email that match these words, e.g., business, address, internet, free, George (customized per user)
6 characters: again the percentage of characters that match these: ch; , ch( , ch[ , ch! , ch$ , ch#
average, max, and sum of the length of uninterrupted sequences of capital letters: CAPAVE, CAPMAX, CAPTOT
this is an example of feature engineering
figure: average value of these features in the spam and non-spam emails
Example: spam detection
figure: decision tree after pruning; the number of leaves (17) in the optimal pruning is decided based on cross-validation error; the plot shows the cv error and the test error (misclassification rate on test data)
Example: spam detection
Bagging and Random Forests do much better than a single decision tree!
Out Of Bag (OOB) error can be used for parameter tuning (e.g., size of the forest)
Summary so far...
Bootstrap is a powerful technique to get uncertainty estimates
Bootstrap aggregation (Bagging) can reduce the variance of unstable models
Random forests: Bagging + further de-correlation via feature sub-sampling at each split
OOB validation instead of CV
destroy the interpretability of decision trees
perform well in practice
can fail if only a few relevant features exist (due to feature-sampling)
Adaptive bases
f(x) = ∑_d w_d ϕ_d(x; v_d)
several methods can be classified as learning these bases adaptively: decision trees, generalized additive models, boosting, neural networks
in boosting each basis is a classifier or regression function (a weak learner, or base learner); we create a strong learner by sequentially combining weak learners
Forward stagewise additive modelling
model: f(x) = ∑_{t=1}^T w^{t} ϕ(x; v^{t}), where ϕ is a simple model, such as a decision stump (a decision tree with one node)
cost: J({w^{t}, v^{t}}_t) = ∑_{n=1}^N L(y^(n), f(x^(n)))
so far we have seen L2 loss, log loss and hinge loss
optimizing this cost is difficult given the form of f
optimization idea: add one weak learner in each stage t, to reduce the error of the previous stage
1. find the best weak learner: (v^{t}, w^{t}) = argmin_{v,w} ∑_{n=1}^N L(y^(n), f^{t−1}(x^(n)) + w ϕ(x^(n); v))
2. add it to the current model: f^{t}(x) = f^{t−1}(x) + w^{t} ϕ(x; v^{t})
L2 loss & forward stagewise linear model
using L2 loss for regression; consider weak learners that are individual features: ϕ^{t}(x) = w^{t} x_{d^{t}}
cost at stage t:
argmin_{d, w_d} ½ ∑_{n=1}^N (y^(n) − (f^{t−1}(x^(n)) + w_d x_d^(n)))²
the term r^(n) = y^(n) − f^{t−1}(x^(n)) is the residual
optimization: recall that the optimal weight for each d is w_d = ∑_n x_d^(n) r^(n) / ∑_n (x_d^(n))²
pick the feature that most significantly reduces the residual
the model at time-step t is f^{t}(x) = ∑_t α w^{t} x_{d^{t}}
using a small α helps with test error
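This stagewise procedure can be sketched directly in NumPy; a minimal example on a made-up sparse linear problem (the data, the step size α, and the number of stages are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 10
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [3.0, -2.0, 1.0]            # sparse ground truth (an assumption)
y = X @ w_true + 0.1 * rng.normal(size=N)

alpha, T = 0.1, 500
w = np.zeros(D)
r = y.copy()                              # residual of the current model
for t in range(T):
    # optimal weight for every feature: w_d = sum_n x_d r / sum_n x_d^2
    wd = (X * r[:, None]).sum(0) / (X**2).sum(0)
    # pick the feature that most reduces the squared residual
    d = int(np.argmax(np.abs(wd) * np.sqrt((X**2).sum(0))))
    w[d] += alpha * wd[d]                 # take a small step on that feature
    r -= alpha * wd[d] * X[:, d]          # update the residual

mse = (r**2).mean()
```

Because only one coordinate moves per stage, the path of w over t resembles a regularization path: early stopping at small t behaves like strong regularization.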
L2 loss & forward stagewise linear model
is this related to L1-regularized linear regression?
example: using a small learning rate α = .01, L2 Boosting has a similar regularization path to lasso
figure: w_d against t for boosting, and w_d against ∑_d |w_d| for lasso; at each time-step only one feature d^{t} is updated / added
we can view boosting as doing feature (base learner) selection in exponentially large spaces (e.g., all trees of size K)
the number of steps t plays a similar role to (the inverse of) the regularization hyper-parameter
Exponential loss & AdaBoost
loss functions for binary classification y ∈ {−1, +1}; the predicted label is ŷ = sign(f(x))
misclassification loss (0-1 loss): L(y, f(x)) = I(y f(x) < 0)
log-loss (aka cross entropy loss or binomial deviance): L(y, f(x)) = log(1 + e^{−y f(x)})
hinge loss (support vector loss): L(y, f(x)) = max(0, 1 − y f(x))
yet another loss function is the exponential loss: L(y, f(x)) = e^{−y f(x)}
note that this loss grows faster than the other surrogate losses (more sensitive to outliers)
a useful property when working with additive models:
L(y, f^{t−1}(x) + w^{t} ϕ(x; v^{t})) = L(y, f^{t−1}(x)) ⋅ L(y, w^{t} ϕ(x; v^{t}))
treat the first factor as a weight q for an instance: instances that were not properly classified before receive a higher weight
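The multiplicative property is just e^{a+b} = e^a e^b applied to the additive model; a quick numerical check on made-up values (the points, weak-learner outputs, and w are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=100)       # labels
f_prev = rng.normal(size=100)               # f^{t-1}(x) on 100 points
phi = rng.choice([-1.0, 1.0], size=100)     # weak learner output, +/- 1
w = 0.7

# exponential loss of the updated model ...
lhs = np.exp(-y * (f_prev + w * phi))
# ... factors into (old loss) * (loss of the new term)
rhs = np.exp(-y * f_prev) * np.exp(-y * w * phi)
max_gap = np.abs(lhs - rhs).max()
```

No other common surrogate loss factorizes this way, which is what makes the exponential loss so convenient for deriving AdaBoost.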
Exponential loss & AdaBoost
cost, using the exponential loss:
J({w^{t}, v^{t}}_t) = ∑_{n=1}^N L(y^(n), f^{t−1}(x^(n)) + w^{t} ϕ(x^(n); v^{t})) = ∑_n q^(n) L(y^(n), w^{t} ϕ(x^(n); v^{t}))
where q^(n) = L(y^(n), f^{t−1}(x^(n))) is the loss for this instance at the previous stage
discrete AdaBoost: assume the weak learner is a simple classifier, so its output is +/- 1
J({w^{t}, v^{t}}_t) = ∑_n q^(n) e^{−y^(n) w^{t} ϕ(x^(n); v^{t})}
= e^{−w^{t}} ∑_n q^(n) I(y^(n) = ϕ(x^(n); v^{t})) + e^{w^{t}} ∑_n q^(n) I(y^(n) ≠ ϕ(x^(n); v^{t}))
= e^{−w^{t}} ∑_n q^(n) + (e^{w^{t}} − e^{−w^{t}}) ∑_n q^(n) I(y^(n) ≠ ϕ(x^(n); v^{t}))
optimization: the objective is to find the weak learner minimizing the cost above
the first term does not depend on the weak learner; assuming w^{t} ≥ 0, the weak learner should minimize the second sum — this is classification with weighted instances
Exponential loss & AdaBoost
minimizing the weighted classification error gives v^{t}
we still need to find the optimal w^{t}: setting ∂J/∂w^{t} = 0 gives
w^{t} = ½ log((1 − ℓ^{t}) / ℓ^{t})
where ℓ^{t} = ∑_n q^(n) I(ϕ(x^(n); v^{t}) ≠ y^(n)) / ∑_n q^(n) is the weight-normalized misclassification error
since the weak learner is better than chance, ℓ^{t} < .5 and so w^{t} ≥ 0
we can now update the instance weights q for the next iteration (multiply by the new loss):
q^(n),{t+1} = q^(n),{t} e^{−w^{t} y^(n) ϕ(x^(n); v^{t})}
since w^{t} > 0, the weights q of misclassified points increase and the rest decrease
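The closed form for the stage weight follows by differentiating the cost; a short derivation (writing w for w^{t} and ℓ for the weight-normalized error):

```latex
J = e^{-w}\sum_n q^{(n)} + \left(e^{w} - e^{-w}\right)\sum_n q^{(n)}\,\mathbb{I}\!\left(y^{(n)} \neq \phi(x^{(n)}; v)\right)
\\[4pt]
\frac{\partial J}{\partial w}
= -e^{-w}\sum_n q^{(n)} + \left(e^{w} + e^{-w}\right)\sum_n q^{(n)}\,\mathbb{I}\!\left(y^{(n)} \neq \phi(x^{(n)}; v)\right) = 0
\\[4pt]
\Rightarrow\quad
e^{2w} = \frac{\sum_n q^{(n)} - \sum_n q^{(n)}\,\mathbb{I}(y^{(n)} \neq \phi(x^{(n)}; v))}{\sum_n q^{(n)}\,\mathbb{I}(y^{(n)} \neq \phi(x^{(n)}; v))}
= \frac{1-\ell}{\ell}
\quad\Rightarrow\quad
w = \tfrac{1}{2}\log\frac{1-\ell}{\ell}
```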
Exponential loss & AdaBoost

overall algorithm for discrete AdaBoost:

initialize $q^{(n)} := \frac{1}{N}$ for all $n$
for t = 1:T
  fit the simple classifier $\phi(x; v^{t})$ to the weighted dataset
  $\ell^{t} := \frac{\sum_n q^{(n)}\, \mathbb{I}(\phi(x^{(n)}; v^{t}) \neq y^{(n)})}{\sum_n q^{(n)}}$
  $w^{t} := \frac{1}{2} \log \frac{1 - \ell^{t}}{\ell^{t}}$
  $q^{(n)} := q^{(n)}\, e^{-w^{t} y^{(n)} \phi(x^{(n)}; v^{t})}$ for all $n$
return $f(x) = \mathrm{sign}(\sum_t w^{t} \phi(x; v^{t}))$

the final classifier combines the weak learners $w^{1}\phi(x; v^{1}), w^{2}\phi(x; v^{2}), \dots, w^{T}\phi(x; v^{T})$.

8 . 4
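The algorithm above can be sketched in plain Python; the exhaustive decision-stump weak learner, the helper names, and the toy dataset are mine (not from the slides), so treat this as a minimal illustration rather than a reference implementation.

```python
import math

def stump_fit(X, y, q):
    # exhaustive search over (feature, threshold, polarity) decision stumps,
    # minimizing the weighted misclassification error with instance weights q
    best = None
    for d in range(len(X[0])):
        vals = sorted(set(x[d] for x in X))
        thresholds = [(a + b) / 2 for a, b in zip(vals, vals[1:])] + [vals[0] - 1, vals[-1] + 1]
        for thr in thresholds:
            for s in (1, -1):
                err = sum(qi for x, yi, qi in zip(X, y, q)
                          if s * (1 if x[d] > thr else -1) != yi)
                if best is None or err < best[0]:
                    best = (err, d, thr, s)
    _, d, thr, s = best
    return lambda x: s * (1 if x[d] > thr else -1)

def adaboost(X, y, T):
    # discrete AdaBoost: labels y must be in {-1, +1}
    N = len(X)
    q = [1.0 / N] * N                      # initialize uniform instance weights
    ensemble = []                          # list of (w^{t}, stump)
    for _ in range(T):
        phi = stump_fit(X, y, q)
        # weight-normalized misclassification error l^{t}
        l = sum(qi for x, yi, qi in zip(X, y, q) if phi(x) != yi) / sum(q)
        l = min(max(l, 1e-12), 1 - 1e-12)  # avoid log(0)
        w = 0.5 * math.log((1 - l) / l)    # optimal coefficient w^{t}
        # multiply each weight by the new exponential loss term
        q = [qi * math.exp(-w * yi * phi(x)) for x, yi, qi in zip(X, y, q)]
        ensemble.append((w, phi))
    def f(x):                              # f(x) = sign(sum_t w^{t} phi(x; v^{t}))
        return 1 if sum(w * phi(x) for w, phi in ensemble) >= 0 else -1
    return f

# tiny 1-D toy example that no single stump can separate (pattern + + - - + +)
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
f = adaboost(X, y, T=3)
print([f(x) for x in X])  # → [1, 1, -1, -1, 1, 1]
```

After three rounds the boosted stumps already fit this training set, even though each individual stump misclassifies at least two points.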
AdaBoost

example: each weak learner is a decision stump (dashed line); $\hat{y} = \mathrm{sign}(\sum_t w^{t} \phi(x; v^{t}))$

[figure: decision boundaries after t = 1, 2, 3, 6, 10, 150 rounds; green is the decision boundary of $f^{t}$; circle size is proportional to $q^{(n),t}$]

8 . 5
AdaBoost

example: features $x_1^{(n)}, \dots, x_{10}^{(n)}$ are samples from a standard Gaussian, with label $y^{(n)} = \mathbb{I}(\sum_d (x_d^{(n)})^2 > 9.34)$ and N = 2000 training examples.

notice that the test error does not increase: AdaBoost is very slow to overfit.

8 . 6

Winter 2020 | Applied Machine Learning (COMP551)
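For reference, this synthetic dataset can be generated in a few lines of Python; the function name and the seed are arbitrary choices of mine.

```python
import random

def make_example(N, D=10, seed=0):
    # the synthetic example from the slides: features from a standard Gaussian,
    # label +1 iff the squared norm exceeds 9.34 (roughly the median of a
    # chi-squared with 10 degrees of freedom), so the classes are near-balanced
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(N):
        x = [rng.gauss(0, 1) for _ in range(D)]
        X.append(x)
        y.append(1 if sum(v * v for v in x) > 9.34 else -1)
    return X, y

X, y = make_example(2000)
print(sum(1 for t in y if t == 1) / len(y))  # close to 0.5
```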
application: Viola-Jones face detection

Haar features are computationally efficient; each feature is a weak learner, and AdaBoost picks one feature at a time (label: face/no-face).

This can still be inefficient, so use the fact that faces are rare (.01% of subwindows are faces): build a cascade of classifiers, where each stage keeps ~100% detection at some false-positive rate, so the cumulative FP rate shrinks with each stage.

the cascade is applied over all image subwindows; fast enough for real-time (object) detection.

image source: David Lowe slides

9
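The cascade idea can be sketched as follows; representing each stage as a score function plus a threshold is a simplification of mine, not the exact Viola-Jones structure.

```python
def cascade_predict(x, stages):
    # stages: list of (classifier, threshold); each stage is tuned for very
    # high detection (recall) at a modest false-positive rate. A subwindow
    # must pass every stage to be declared a face; most non-faces are
    # rejected by the first cheap stages, keeping the average cost low.
    for clf, thr in stages:
        if clf(x) < thr:
            return False  # early rejection: stop evaluating further stages
    return True

# toy stages: scores are just the input itself, with increasing thresholds
stages = [(lambda x: x, 0.0), (lambda x: x, 0.5)]
print(cascade_predict(1.0, stages), cascade_predict(0.2, stages))
```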
Gradient boosting

idea: fit the weak learner to the gradient of the cost.

let $f^{t} = [f^{t}(x^{(1)}), \dots, f^{t}(x^{(N)})]^\top$ and the true labels $y = [y^{(1)}, \dots, y^{(N)}]^\top$.

ignoring the structure of $f$, we want $f^* = \arg\min_f L(f, y)$.

if we use gradient descent to minimize the loss, write $f$ as a sum of steps:
$f = f^{T} = f^{0} - \sum_{t=1}^{T} w^{t} g^{t}$, where $g^{t} = \frac{\partial}{\partial f} L(f^{t-1}, y)$ is the gradient vector; its role is similar to a residual.

we can look for the optimal step size: $w^{t} = \arg\min_w L(f^{t-1} - w\, g^{t})$.

so far we treated $f$ as a parameter vector; now fit the weak learner to the negative of the gradient:
$v^{t} = \arg\min_v \frac{1}{2} \|\phi_v - (-g^{t})\|_2^2$, where $\phi_v = [\phi(x^{(1)}; v), \dots, \phi(x^{(N)}; v)]^\top$.

note that we are fitting the gradient using the L2 loss regardless of the original loss function.

10 . 1
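As a concrete instance of the gradient vector $g^{t}$: for the squared loss the negative gradient is exactly the residual, and for the binary log-loss (labels in $\{-1,+1\}$) it is a softened version of it. A quick sketch; the helper names are mine.

```python
import math

def neg_grad_squared(y, f):
    # L = 1/2 (y - f)^2  =>  -dL/df = y - f  (the residual)
    return y - f

def neg_grad_logloss(y, f):
    # L = log(1 + exp(-y f))  =>  -dL/df = y / (1 + exp(y f))
    return y / (1.0 + math.exp(y * f))

print(neg_grad_squared(1.0, 0.25))            # → 0.75
print(round(neg_grad_logloss(1.0, 0.0), 2))   # → 0.5
```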
Gradient tree boosting

apply gradient boosting to CART (classification and regression trees):

initialize $f^{0}$ to predict a constant
for t = 1:T
  calculate the negative of the gradient $r = -\frac{\partial}{\partial f} L(f^{t-1}, y)$
  fit a regression tree to $X$ ($N \times D$) and $r$ ($N$), producing regions $R_1, \dots, R_K$
  readjust the prediction per region: $w_k = \arg\min_w \sum_{x^{(n)} \in R_k} L(y^{(n)}, f^{t-1}(x^{(n)}) + w)$ (this is effectively the line search)
  update $f^{t}(x) = f^{t-1}(x) + \alpha \sum_{k=1}^{K} w_k\, \mathbb{I}(x \in R_k)$
return $f^{T}(x)$

decide T using a validation set (early stopping).
shallow trees of K = 4-8 leaves usually work well as weak learners.
using a small learning rate $\alpha$ here improves test error (shrinkage).
stochastic gradient boosting combines bootstrap and boosting: use a subsample at each iteration above, similar to stochastic gradient descent.

10 . 2
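A minimal sketch of the loop above for the squared loss, where the per-region line search has the closed form $w_k$ = mean residual in $R_k$. Depth-one "trees" (stumps, K = 2) stand in for CART here, and all names are mine.

```python
def fit_stump(X, r, d_range):
    # regression stump: best (feature, threshold) split of the residuals r,
    # predicting the mean residual in each region (for the squared loss this
    # mean is exactly the line-search value w_k)
    best = None
    for d in d_range:
        vals = sorted(set(x[d] for x in X))
        for thr in [(a + b) / 2 for a, b in zip(vals, vals[1:])]:
            left = [ri for x, ri in zip(X, r) if x[d] <= thr]
            right = [ri for x, ri in zip(X, r) if x[d] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((ri - ml) ** 2 for ri in left)
                   + sum((ri - mr) ** 2 for ri in right))
            if best is None or sse < best[0]:
                best = (sse, d, thr, ml, mr)
    if best is None:                       # no valid split: predict a constant
        m = sum(r) / len(r)
        return lambda x: m
    _, d, thr, ml, mr = best
    return lambda x: ml if x[d] <= thr else mr

def gradient_tree_boost(X, y, T, alpha=0.5):
    f0 = sum(y) / len(y)                   # f^{0}: constant initial prediction
    preds = [f0] * len(X)
    trees = []
    for _ in range(T):
        # negative gradient of 1/2 (y - f)^2 is the residual y - f
        r = [yi - fi for yi, fi in zip(y, preds)]
        tree = fit_stump(X, r, range(len(X[0])))
        preds = [fi + alpha * tree(x) for fi, x in zip(preds, X)]  # shrinkage
        trees.append(tree)
    return lambda x: f0 + alpha * sum(t(x) for t in trees)

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
f = gradient_tree_boost(X, y, T=20)
print([round(f(x), 2) for x in X])  # → [0.0, 0.0, 1.0, 1.0]
```

With shrinkage $\alpha = 0.5$ the residuals halve each round, so after 20 rounds the training predictions are within $\sim 10^{-6}$ of the targets.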
Gradient tree boosting

example: recall the synthetic example with features $x_1^{(n)}, \dots, x_{10}^{(n)}$ sampled from a standard Gaussian, label $y^{(n)} = \mathbb{I}(\sum_d (x_d^{(n)})^2 > 9.34)$, and N = 2000 training examples.

gradient tree boosting (using the log-loss) works better than AdaBoost; since the label depends on a sum over individual features, using stumps works best.

10 . 3
Gradient tree boosting

example (same synthetic setup), with shrinkage $\alpha = .2$; deviance = cross-entropy = log-loss.

[figure: test deviance and misclassification error for stumps (K=2) and K=6 trees]

in both cases using shrinkage helps; while the test loss may increase, the test misclassification error does not.

10 . 4
Gradient tree boosting

example (same synthetic setup).

[figure: test error for $\alpha = 1$, $\alpha = .1$, stochastic with batch size 50%, and $\alpha = .1$ combined with stochastic]

both shrinkage and subsampling can help, at the cost of more hyper-parameters to tune.

10 . 5
Gradient tree boosting
example
see the interactive demo: https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html
10 . 6
Summary

two ensemble methods:

bagging & random forests (reduce variance): produce models with minimal correlation and use their average prediction.

boosting (reduces the bias of the weak learner): models are added in steps and a single cost function is minimized; for the exponential loss this can be interpreted as re-weighting the instances (AdaBoost); gradient boosting fits the weak learner to the negative of the gradient; boosting also has an interpretation as L1 regularization for "weak learner" selection, and is related to max-margin classification (for a large number of steps T).

random forests and (gradient) boosting generally perform very well.

11
Gradient boosting

gradients for some loss functions, using
$f = f^{T} = f^{0} - \sum_{t=1}^{T} w^{t} \frac{\partial}{\partial f} L(f^{t-1}, y)$
and one-hot coding for C-class classification:

setting | loss function | negative gradient $-\frac{\partial}{\partial f} L$
regression | $\frac{1}{2}\|y - f\|_2^2$ | $y - f$
regression | $\|y - f\|_1$ | $\mathrm{sign}(y - f)$
multiclass classification | multi-class cross-entropy | $Y - P$ (both $N \times C$), where the predicted class probabilities are $P_{n,c} = \mathrm{softmax}(f^{(n)})_c$
binary classification | exponential loss $\exp(-y f)$ | $y \exp(-y f)$

12
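The negative-gradient column can be sanity-checked numerically with a central finite difference; a sketch for the squared and exponential losses in the scalar case (helper name is mine).

```python
import math

def num_neg_grad(L, f, eps=1e-6):
    # central finite-difference estimate of -dL/df at f
    return -(L(f + eps) - L(f - eps)) / (2 * eps)

y, f = 1.0, 0.3
# squared loss: -d/df 1/2 (y - f)^2 = y - f
assert abs(num_neg_grad(lambda g: 0.5 * (y - g) ** 2, f) - (y - f)) < 1e-4
# exponential loss: -d/df exp(-y f) = y exp(-y f)
assert abs(num_neg_grad(lambda g: math.exp(-y * g), f) - y * math.exp(-y * f)) < 1e-4
print("ok")
```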