Applied Machine Learning
Bootstrap, Bagging and Boosting

Siamak Ravanbakhsh
COMP 551 (Winter 2020)

1

Learning objectives

bootstrap for uncertainty estimation
bagging for variance reduction
random forests
boosting
  AdaBoost
  gradient boosting
  relationship to L1 regularization

2

Bootstrap

a simple approach to estimate the uncertainty in prediction

given the dataset $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$

subsample with replacement B datasets of size N:
$D_b = \{(x^{(n,b)}, y^{(n,b)})\}_{n=1}^{N}, \quad b = 1, \ldots, B$

train a model on each of these bootstrap datasets (called bootstrap samples)

produce a measure of uncertainty from these models:
  for model parameters
  for predictions

this is the non-parametric bootstrap

3.1
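A minimal numpy sketch of this recipe, not from the slides: the estimator here is just the sample mean, and the dataset and B are illustrative. Resample with replacement B times, recompute the estimate on each bootstrap sample, and read the spread of the estimates as uncertainty.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)          # toy dataset (N = 100)
    B = 500

    # bootstrap estimates: resample indices with replacement, re-estimate
    boot_means = np.array([x[rng.integers(0, len(x), size=len(x))].mean()
                           for _ in range(B)])

    print(boot_means.std())                     # uncertainty of the estimate
    print(np.quantile(boot_means, [.05, .95]))  # 90% bootstrap interval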

Bootstrap: example

Recall: linear model with nonlinear Gaussian bases (N = 100), $\phi_k(x) = e^{-\frac{(x - \mu_k)^2}{s^2}}$

data: $y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon$

    # x: N,  y: N
    plt.plot(x, y, 'b.')
    phi = lambda x, mu: np.exp(-(x - mu)**2)
    mu = np.linspace(0, 10, 10)          # 10 Gaussian bases
    Phi = phi(x[:, None], mu[None, :])   # N x 10
    w = np.linalg.lstsq(Phi, y)[0]
    yh = np.dot(Phi, w)
    plt.plot(x, yh, 'g-')

(figure: the noisy data, the function before adding noise, and our fit to the data using 10 Gaussian bases)

3.2

Bootstrap: example

Recall: linear model with nonlinear Gaussian bases (N = 100), using B = 500 bootstrap samples

    # Phi: N x D,  y: N
    B = 500
    ws = np.zeros((B, D))
    for b in range(B):
        inds = np.random.randint(N, size=N)   # sample N indices with replacement
        Phi_b = Phi[inds, :]                  # N x D
        y_b = y[inds]                         # N
        # fit the subsampled data
        ws[b, :] = np.linalg.lstsq(Phi_b, y_b)[0]
    plt.hist(ws, bins=50)

this gives a measure of uncertainty of the parameters

(figure: histogram of the bootstrap estimates; each color is a different weight $w_d$)

3.3

Bootstrap: example

Recall: linear model with nonlinear Gaussian bases (N = 100), using B = 500 bootstrap samples

it also gives a measure of uncertainty of the predictions

    # Phi: N x D,  Phi_test: Nt x D,  y: N
    # ws: B x D from the previous code
    y_hats = np.zeros((B, Nt))
    for b in range(B):
        wb = ws[b, :]
        y_hats[b, :] = np.dot(Phi_test, wb)
    # get the 5% and 95% quantiles
    y_5 = np.quantile(y_hats, .05, axis=0)
    y_95 = np.quantile(y_hats, .95, axis=0)

(figure: the red lines are the 5% and 95% quantiles; for each point we can get these across the bootstrap model predictions)

3.4

Bagging

use the bootstrap for more accurate prediction (not just uncertainty)

variance of a sum of random variables:

$\mathrm{Var}(z_1 + z_2) = E[(z_1 + z_2)^2] - E[z_1 + z_2]^2$
$= E[z_1^2 + z_2^2 + 2 z_1 z_2] - (E[z_1] + E[z_2])^2$
$= E[z_1^2] + E[z_2^2] + 2E[z_1 z_2] - E[z_1]^2 - E[z_2]^2 - 2E[z_1]E[z_2]$
$= \mathrm{Var}(z_1) + \mathrm{Var}(z_2) + 2\,\mathrm{Cov}(z_1, z_2)$

for uncorrelated variables the covariance term is zero

4.1

Bagging

the average of uncorrelated random variables has a lower variance:

$z_1, \ldots, z_B$ are uncorrelated random variables with mean $\mu$ and variance $\sigma^2$;
the average $\bar{z} = \frac{1}{B}\sum_b z_b$ has mean $\mu$ and variance
$\mathrm{Var}\big(\tfrac{1}{B}\sum_b z_b\big) = \tfrac{1}{B^2}\sum_b \mathrm{Var}(z_b) = \tfrac{1}{B^2} B\sigma^2 = \tfrac{1}{B}\sigma^2$

use this to reduce the variance of our models (the bias remains the same)

regression: average the model predictions, $\bar{f}(x) = \frac{1}{B}\sum_b f_b(x)$

issue: model predictions are not uncorrelated (they are trained using the same data)

bagging (bootstrap aggregation): use bootstrap samples to reduce the correlation

4.2
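A quick numerical check of the variance formula above, with illustrative numbers (not from the slides): averaging B uncorrelated variables of variance $\sigma^2$ gives variance close to $\sigma^2 / B$.

    import numpy as np

    rng = np.random.default_rng(0)
    B, sigma, trials = 10, 2.0, 100_000

    z = rng.normal(0.0, sigma, size=(trials, B))  # B uncorrelated variables per trial
    z_bar = z.mean(axis=1)

    print(z_bar.var())      # empirical variance of the average
    print(sigma**2 / B)     # theoretical value sigma^2 / B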

Bagging for classification

averaging makes sense for regression; how about classification?

wisdom of crowds: $z_1, \ldots, z_B \in \{0, 1\}$ are IID Bernoulli random variables with mean $\mu = .5 + \epsilon$,
and $\bar{z} = \frac{1}{B}\sum_b z_b$; for $\epsilon > 0$, $p(\bar{z} > .5)$ goes to 1 as B grows

the mode of iid classifiers that are better than chance is a better classifier: use voting

crowds are wiser when
  individuals are better than random
  votes are uncorrelated

bagging (bootstrap aggregation): use bootstrap samples to reduce the correlation

4.3
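A small simulation of the claim above, with illustrative values for $\epsilon$ and B: each of B independent voters is correct with probability $.5 + \epsilon$, and the probability that the majority is correct approaches 1 as B grows.

    import numpy as np

    rng = np.random.default_rng(0)
    eps, trials = 0.05, 20_000

    for B in [1, 11, 101, 1001]:
        votes = rng.random((trials, B)) < 0.5 + eps   # 1 = this voter is correct
        majority_correct = votes.mean(axis=1) > 0.5
        print(B, majority_correct.mean())             # approaches 1 as B grows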

Bagging decision trees

example setup: synthetic dataset; 5 correlated features; the 1st feature is a noisy predictor of the label

bootstrap samples create different decision trees (due to the high variance of trees)

to aggregate the B trees we can either
  vote for the most probable class, or
  average the predicted probabilities

compared to a single decision tree, the bagged model is no longer interpretable!

4.4
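A minimal bagging sketch in the spirit of this slide; the synthetic data below is made up (independent features rather than the correlated ones in the slide's figure) and only illustrates the two aggregation rules.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    N, B = 500, 50
    X = rng.normal(size=(N, 5))                                 # 5 features (toy data)
    y = (X[:, 0] + 0.5 * rng.normal(size=N) > 0).astype(int)    # 1st feature is a noisy predictor

    probs, votes = [], []
    for b in range(B):
        idx = rng.integers(0, N, size=N)                        # bootstrap sample
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        probs.append(tree.predict_proba(X)[:, 1])
        votes.append(tree.predict(X))

    y_avg  = (np.mean(probs, axis=0) > 0.5).astype(int)         # average probabilities
    y_vote = (np.mean(votes, axis=0) > 0.5).astype(int)         # majority vote
    print((y_avg == y).mean(), (y_vote == y).mean())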

Random forests

further reduce the correlation between decision trees

feature sub-sampling: only a random subset of features is available for the split at each step
(√D is the usual magic number; this is a hyper-parameter and can be optimized using CV)

Out-Of-Bag (OOB) samples:
  the instances not included in a bootstrap dataset can be used for validation
  this gives simultaneous validation of the decision trees in a forest
  no need to set aside data for cross-validation

4.5
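A short usage sketch of these two ideas with scikit-learn (dataset and parameter values are illustrative): max_features controls the per-split feature sub-sampling, and oob_score reuses the out-of-bag samples for validation.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, :3].sum(axis=1) > 0).astype(int)   # toy labels

    # sqrt(D) features considered at each split; OOB samples used for validation
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(rf.oob_score_)   # OOB accuracy, no separate validation set needed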

Example: spam detection

Dataset (an example of feature engineering):
  N = 4601 emails; binary classification task: spam vs. not spam
  D = 57 features:
    48 words: the percentage of words in the email that match a given word,
      e.g., business, address, internet, free, George (customized per user)
    6 characters: again the percentage of characters that match these: ch; , ch( , ch[ , ch! , ch$ , ch#
    average, max and sum of the length of uninterrupted sequences of capital letters: CAPAVE, CAPMAX, CAPTOT

(figure: average value of these features in the spam and non-spam emails)

4.6

Example: spam detection

decision tree after pruning: the number of leaves (17) in the optimal pruning is decided based on cross-validation error

(figure: cv error and test error versus tree size; misclassification rate on test data)

4.7

Example: spam detection

Bagging and Random Forests do much better than a single decision tree!

the Out-Of-Bag (OOB) error can be used for parameter tuning (e.g., the size of the forest)

4.8

Summary so far...

Bootstrap is a powerful technique to get uncertainty estimates
Bootstrap aggregation (Bagging) can reduce the variance of unstable models
Random forests:
  Bagging + further de-correlation via feature sub-sampling at each split
  OOB validation instead of CV
  destroy the interpretability of decision trees
  perform well in practice
  can fail if only a few relevant features exist (due to feature sub-sampling)

5

Adaptive bases

$f(x) = \sum_d w_d \phi_d(x; v_d)$

several methods can be classified as learning these bases adaptively:
  decision trees
  generalized additive models
  boosting
  neural networks

in boosting, each basis is a classifier or regression function (a weak learner, or base learner);
we create a strong learner by sequentially combining weak learners

6.1

Forward stagewise additive modelling

model: $f(x) = \sum_{t=1}^{T} w^{\{t\}} \phi(x; v^{\{t\}})$, where $\phi$ is a simple model, such as a decision stump (a decision tree with one node)

cost: $J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L(y^{(n)}, f(x^{(n)}))$
so far we have seen the L2 loss, log loss and hinge loss

optimizing this cost directly is difficult given the form of f

optimization idea: add one weak learner in each stage t, to reduce the error of the previous stage
  1. find the best weak learner: $v^{\{t\}}, w^{\{t\}} = \arg\min_{v,w} \sum_{n=1}^{N} L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w\,\phi(x^{(n)}; v)\big)$
  2. add it to the current model: $f^{\{t\}}(x) = f^{\{t-1\}}(x) + w^{\{t\}} \phi(x; v^{\{t\}})$

6.2

L2 loss & forward stagewise linear model

model: consider weak learners that are individual features, $\phi^{\{t\}}(x) = w^{\{t\}} x_{d^{\{t\}}}$

cost at stage t, using the L2 loss for regression:
$\arg\min_{d, w_d} \frac{1}{2}\sum_{n=1}^{N} \big(y^{(n)} - (f^{\{t-1\}}(x^{(n)}) + w_d x_d^{(n)})\big)^2$
where $y^{(n)} - f^{\{t-1\}}(x^{(n)})$ is the residual $r^{(n)}$

optimization: recall that the optimal weight for each d is $w_d = \frac{\sum_n x_d^{(n)} r^{(n)}}{\sum_n (x_d^{(n)})^2}$;
pick the feature that most significantly reduces the residual

the model at time-step t: $f^{\{t\}}(x) = \alpha \sum_{t' \le t} w^{\{t'\}} x_{d^{\{t'\}}}$
using a small $\alpha$ helps with the test error

is this related to L1-regularized linear regression?

7.1

L2 loss & forward stagewise linear model

example: using a small learning rate ($\alpha = .01$), L2 Boosting has a similar regularization path to the lasso

at each time-step only one feature $d^{\{t\}}$ is updated / added

(figure: coefficient paths $w_d^{\{t\}}$, with boosting plotted against t and the lasso plotted against $\sum_d |w_d|$)

we can view boosting as doing feature (base learner) selection in exponentially large spaces (e.g., all trees of size K);
the number of steps t plays a similar role to (the inverse of) the regularization hyper-parameter

7.2
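To look at this connection empirically, one can trace the lasso path on the same kind of toy data used in the previous sketch and compare it with the boosting path of w recorded there; everything below (data, penalty grid) is illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ np.array([3., -2., 0., 0., 1., 0., 0., 0., 0., 0.]) + 0.1 * rng.normal(size=200)

    # lasso coefficients along a decreasing sequence of penalties (the regularization path)
    alphas = np.logspace(1, -3, 50)
    path = np.array([Lasso(alpha=a, max_iter=10_000).fit(X, y).coef_ for a in alphas])
    print(np.round(path[0], 2))    # strong penalty: all weights ~ 0
    print(np.round(path[-1], 2))   # weak penalty: close to the least-squares solution
    # plotting these rows against sum(|w_d|), next to the boosting path of w from the
    # previous sketch, shows features entering the model in a very similar order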

Exponential loss & AdaBoost

loss functions for binary classification, $y \in \{-1, +1\}$; the predicted label is $\hat{y} = \mathrm{sign}(f(x))$

misclassification (0-1) loss: $L(y, f(x)) = \mathbb{I}(yf(x) < 0)$
log-loss (aka cross-entropy loss or binomial deviance): $L(y, f(x)) = \log(1 + e^{-yf(x)})$
hinge loss (support vector loss): $L(y, f(x)) = \max(0, 1 - yf(x))$

yet another loss function is the exponential loss: $L(y, f(x)) = e^{-yf(x)}$
note that this loss grows faster than the other surrogate losses (it is more sensitive to outliers)

a useful property when working with additive models:
$L\big(y, f^{\{t-1\}}(x) + w^{\{t\}}\phi(x; v^{\{t\}})\big) = L\big(y, f^{\{t-1\}}(x)\big) \cdot L\big(y, w^{\{t\}}\phi(x; v^{\{t\}})\big)$
treat the first factor as a weight q for the instance; instances that were not properly classified before receive a higher weight

8.1

Exponential loss & AdaBoost

cost, using the exponential loss:
$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w^{\{t\}}\phi(x^{(n)}; v^{\{t\}})\big) = \sum_n q^{(n)} L\big(y^{(n)}, w^{\{t\}}\phi(x^{(n)}; v^{\{t\}})\big)$
where $q^{(n)} = L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)})\big)$ is the loss for this instance at the previous stage

discrete AdaBoost: assume the weak learner is a simple classifier, so its output is +/- 1, which gives
$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_n q^{(n)} e^{-y^{(n)} w^{\{t\}} \phi(x^{(n)}; v^{\{t\}})}$

optimization: the objective is to find the weak learner minimizing the cost above

$= e^{-w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} = \phi(x^{(n)}; v^{\{t\}})\big) + e^{w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} \neq \phi(x^{(n)}; v^{\{t\}})\big)$
$= e^{-w^{\{t\}}} \sum_n q^{(n)} + \big(e^{w^{\{t\}}} - e^{-w^{\{t\}}}\big) \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} \neq \phi(x^{(n)}; v^{\{t\}})\big)$

the first term does not depend on the weak learner, so assuming $w^{\{t\}} \geq 0$ the weak learner should minimize the remaining weighted sum: this is classification with weighted instances

8.2

Exponential loss & AdaBoost

minimizing that weighted classification error gives $v^{\{t\}}$; we still need to find the optimal $w^{\{t\}}$

setting $\frac{\partial J}{\partial w^{\{t\}}} = 0$ gives $w^{\{t\}} = \frac{1}{2}\log\frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}$,
where $\ell^{\{t\}} = \frac{\sum_n q^{(n)} \mathbb{I}(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)})}{\sum_n q^{(n)}}$ is the weight-normalized misclassification error

since the weak learner is better than chance, $\ell^{\{t\}} < .5$ and so $w^{\{t\}} \geq 0$

we can now update the instance weights q for the next iteration (multiply by the new loss):
$q^{(n),\{t+1\}} = q^{(n),\{t\}}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})}$

since $w^{\{t\}} > 0$, the weights q of misclassified points increase and the rest decrease

8.3

Exponential loss & AdaBoost

overall algorithm for discrete AdaBoost:

    initialize $q^{(n)} := \frac{1}{N}$ for all n
    for t = 1:T
        fit the simple classifier $\phi(x; v^{\{t\}})$ to the weighted dataset
        $\ell^{\{t\}} := \frac{\sum_n q^{(n)} \mathbb{I}(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)})}{\sum_n q^{(n)}}$
        $w^{\{t\}} := \frac{1}{2}\log\frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}$
        $q^{(n)} := q^{(n)}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})}$ for all n
    return $f(x) = \mathrm{sign}\big(\sum_t w^{\{t\}} \phi(x; v^{\{t\}})\big)$

the weak learners $w^{\{1\}}\phi(x; v^{\{1\}}), w^{\{2\}}\phi(x; v^{\{2\}}), \ldots, w^{\{T\}}\phi(x; v^{\{T\}})$ are combined into $f(x) = \mathrm{sign}\big(\sum_t w^{\{t\}} \phi(x; v^{\{t\}})\big)$

8.4
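A compact sketch of this loop in numpy / scikit-learn, using depth-1 trees as decision stumps; the data and the number of rounds T are illustrative, not from the slides.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    N, T = 400, 100
    X = rng.normal(size=(N, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)          # labels in {-1, +1}

    q = np.full(N, 1.0 / N)                             # instance weights
    stumps, ws = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=q)
        pred = stump.predict(X)
        err = np.sum(q * (pred != y)) / q.sum()         # weight-normalized error
        w = 0.5 * np.log((1 - err) / err)               # weak-learner weight
        q = q * np.exp(-w * y * pred)                   # re-weight the instances
        stumps.append(stump); ws.append(w)

    F = sum(w * s.predict(X) for w, s in zip(ws, stumps))
    print((np.sign(F) == y).mean())                     # training accuracy of the ensemble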

AdaBoost: example

each weak learner is a decision stump (the dashed line); the prediction is $\hat{y} = \mathrm{sign}\big(\sum_t w^{\{t\}} \phi(x; v^{\{t\}})\big)$

(figure: decision boundaries at t = 1, 2, 3, 6, 10, 150; green is the decision boundary of $f^{\{t\}}$; circle size is proportional to $q^{(n),\{t\}}$)

8.5

AdaBoost: example

features $x_1^{(n)}, \ldots, x_{10}^{(n)}$ are samples from a standard Gaussian

label: $y^{(n)} = \mathbb{I}\big(\sum_d (x_d^{(n)})^2 > 9.34\big)$

N = 2000 training examples

notice that the test error does not increase: AdaBoost is very slow to overfit

8.6
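A sketch of an experiment in this spirit with scikit-learn's AdaBoostClassifier (its default weak learner is a depth-1 stump); the data-generating rule follows the label definition above, while the sample sizes beyond N = 2000 and the number of stages are illustrative.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    def make_data(n):
        X = rng.normal(size=(n, 10))                    # 10 standard-Gaussian features
        y = (np.sum(X**2, axis=1) > 9.34).astype(int)   # label rule stated above
        return X, y

    X_tr, y_tr = make_data(2000)                        # N = 2000 training examples
    X_te, y_te = make_data(10000)

    # the default weak learner is a depth-1 decision stump
    clf = AdaBoostClassifier(n_estimators=400).fit(X_tr, y_tr)
    test_err = [1 - (p == y_te).mean() for p in clf.staged_predict(X_te)]
    print(test_err[0], test_err[-1])                    # test error keeps shrinking with more stages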

Application: Viola-Jones face detection

Haar features are computationally efficient; each feature is a weak learner, and AdaBoost picks one feature at a time (label: face / no-face)

this can still be inefficient:
  use the fact that faces are rare (.01% of subwindows are faces)
  a cascade of classifiers exploits this small rate

(figure: the cascade; annotations: 100% detection, FP rate, cumulative FP rate)

the cascade is applied over all image subwindows; it is fast enough for real-time (object) detection

image source: David Lowe slides

9

Gradient boosting

idea: fit the weak learner to the gradient of the cost

let $f^{\{t\}} = [f^{\{t\}}(x^{(1)}), \ldots, f^{\{t\}}(x^{(N)})]^\top$ and the true labels $y = [y^{(1)}, \ldots, y^{(N)}]^\top$

$\hat{f} = \arg\min_f L(f, y)$

ignoring the structure of $f$ and treating it (so far) as a parameter vector, if we use gradient descent to minimize the loss,
we can write $\hat{f}$ as a sum of steps: $\hat{f} = f^{\{T\}} = f^{\{0\}} - \sum_{t=1}^T w^{\{t\}} g^{\{t\}}$

the gradient vector $g^{\{t\}} = \frac{\partial}{\partial f} L(f^{\{t-1\}}, y)$ plays a role similar to the residual

we can look for the optimal step size: $w^{\{t\}} = \arg\min_w L(f^{\{t-1\}} - w\, g^{\{t\}}, y)$

fit the weak learner to the negative of the gradient: $v^{\{t\}} = \arg\min_v \frac{1}{2} \lVert \phi_v - (-g) \rVert_2^2$, where $\phi_v = [\phi(x^{(1)}; v), \ldots, \phi(x^{(N)}; v)]^\top$

note that we fit the gradient using the L2 loss regardless of the original loss function

10 . 1
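As a rough illustration of this loop (not code from the slides), the sketch below fits a shallow regression tree to the negative gradient at each step and picks the step size by a crude grid search; the loss and its gradient are passed in as functions, and all names are mine.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, loss_grad, T=100,
                   step_grid=np.linspace(0.01, 1.0, 50)):
    """Generic gradient boosting: fit a weak learner to -gradient at each step."""
    f = np.zeros(len(y))                              # f^{0}: predict 0 everywhere
    learners, steps = [], []
    for t in range(T):
        g = loss_grad(f, y)                           # gradient of the cost at f^{t-1}
        weak = DecisionTreeRegressor(max_depth=2)     # weak learner (shallow tree)
        weak.fit(X, -g)                               # L2 fit to the negative gradient
        phi = weak.predict(X)
        w = min(step_grid, key=lambda w: loss(f + w * phi, y))   # crude line search
        f = f + w * phi
        learners.append(weak)
        steps.append(w)
    return learners, steps

# example usage with the squared error  L(f, y) = 0.5 * ||y - f||^2 :
# loss      = lambda f, y: 0.5 * np.sum((y - f) ** 2)
# loss_grad = lambda f, y: f - y    # so the weak learner is fit to y - f, the residual
```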

Gradient tree boosting

apply gradient boosting to CART (classification and regression trees)

initialize $f^{\{0\}}$ to predict a constant
for t = 1:T
    calculate the negative of the gradient $r = -\frac{\partial}{\partial f} L(f^{\{t-1\}}, y)$
    fit a regression tree to $X$ ($N \times D$) and $r$ ($N$), producing regions $R_1, \ldots, R_K$
    re-adjust the prediction per region: $w_k = \arg\min_w \sum_{x^{(n)} \in R_k} L(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w)$   (this is effectively the line search)
    update $f^{\{t\}}(x) = f^{\{t-1\}}(x) + \alpha \sum_{k=1}^K w_k \, \mathbb{I}(x \in R_k)$
return $f^{\{T\}}(x)$

using a small learning rate $\alpha$ in the update improves test error (shrinkage)
shallow trees with K = 4-8 leaves usually work well as weak learners
decide T using a validation set (early stopping)

stochastic gradient boosting
combines bootstrap and boosting: use a subsample of the data at each iteration above, similar to stochastic gradient descent

10 . 2
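A minimal sketch of this algorithm for binary classification with the log-loss, assuming scikit-learn's DecisionTreeRegressor as the regression tree; the per-region weights use a single Newton step as the line search, which is one common choice rather than necessarily the one intended above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def log_loss_grad(f, y):
    """Gradient of the log-loss w.r.t. the logits f, with labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-f))
    return p - y                      # so the negative gradient is y - p

def gradient_tree_boost(X, y, T=100, alpha=0.1, max_leaf_nodes=8):
    f = np.zeros(len(y))                                  # f^{0}: constant 0 (p = 0.5)
    ensemble = []
    for t in range(T):
        r = -log_loss_grad(f, y)                          # negative gradient (pseudo-residuals)
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, r)
        leaf = tree.apply(X)                              # region R_k of each training point
        w = {}
        for k in np.unique(leaf):
            idx = leaf == k
            p = 1.0 / (1.0 + np.exp(-f[idx]))
            # per-region line search via one Newton step for the log-loss
            w[k] = r[idx].sum() / max((p * (1.0 - p)).sum(), 1e-12)
        f = f + alpha * np.array([w[k] for k in leaf])    # shrinkage with learning rate alpha
        ensemble.append((tree, w))
    return ensemble   # at test time, sum alpha * w[tree.apply(x)] over the ensemble
```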

Gradient tree boosting

example: recall the synthetic example
features $x_1^{(n)}, \ldots, x_{10}^{(n)}$ are samples from a standard Gaussian
label $y^{(n)} = \mathbb{I}\!\left(\sum_d (x_d^{(n)})^2 > 9.34\right)$
N = 2000 training examples

gradient tree boosting (using the log-loss) works better than AdaBoost
since the label depends on a sum over individual features, stumps work best as weak learners

10 . 3
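A rough way to reproduce this comparison with scikit-learn; this is my reconstruction of the setup, and the slides' exact hyper-parameters, number of rounds, and test-set size are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.standard_normal((n, 10))
    y = (np.sum(X ** 2, axis=1) > 9.34).astype(int)   # 9.34 ~ median of chi-squared(10)
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(10000)

gb = GradientBoostingClassifier(n_estimators=400, max_depth=1,   # stumps, log-loss
                                learning_rate=0.2).fit(X_train, y_train)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),    # stumps, exponential loss
                         n_estimators=400).fit(X_train, y_train)

print("gradient tree boosting test error:", 1 - gb.score(X_test, y_test))
print("AdaBoost test error:              ", 1 - ada.score(X_test, y_test))
```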

Gradient tree boosting

example (continued): the same synthetic data

[figure: test deviance (deviance = cross entropy = log-loss) and test misclassification error vs. number of boosting iterations, for stumps (K = 2) and trees with K = 6 leaves]

in both cases using shrinkage (α = .2) helps
while the test loss may eventually increase, the test misclassification error does not

10 . 4

Gradient tree boosting

example (continued): the same synthetic data, comparing shrinkage and subsampling

[figure: test performance vs. number of boosting iterations for four settings — α = 1; α = .1; stochastic with batch size 50%; α = .1 and stochastic]

both shrinkage and subsampling can help
the cost is more hyper-parameters to tune

10 . 5
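In scikit-learn's GradientBoostingClassifier these knobs correspond to learning_rate (shrinkage), subsample (stochastic gradient boosting), and validation_fraction / n_iter_no_change (choosing T by early stopping); the specific values below are illustrative, not the slide's settings.

```python
from sklearn.ensemble import GradientBoostingClassifier

# shrinkage only (alpha = .1), deterministic: every tree sees all the data
gb_shrink = GradientBoostingClassifier(learning_rate=0.1, subsample=1.0,
                                       n_estimators=500, max_depth=1)

# stochastic gradient boosting: shrinkage + a 50% subsample per tree
gb_stochastic = GradientBoostingClassifier(learning_rate=0.1, subsample=0.5,
                                           n_estimators=500, max_depth=1)

# early stopping: hold out 20% of the training data and stop when it stops improving
gb_early = GradientBoostingClassifier(learning_rate=0.1, subsample=0.5,
                                      n_estimators=2000, max_depth=1,
                                      validation_fraction=0.2, n_iter_no_change=10)

# each model is then trained the usual way, e.g. gb_stochastic.fit(X_train, y_train)
```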

Gradient tree boosting

example: see the interactive demo: https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html

10 . 6

Summary

two ensemble methods

bagging & random forests (reduce variance)
produce models with minimal correlation
use their average prediction

boosting (reduces the bias of the weak learner)
models are added in steps
a single cost function is minimized
for the exponential loss: interpret as re-weighting the instances (AdaBoost)
gradient boosting: fit the weak learner to the negative of the gradient
interpretation as L1 regularization for "weak learner" selection
also related to max-margin classification (for a large number of steps T)

random forests and (gradient) boosting generally perform very well

11

Gradient boosting

gradients for some loss functions

recall $\hat{f} = f^{\{T\}} = f^{\{0\}} - \sum_{t=1}^T w^{\{t\}} \frac{\partial}{\partial f} L(f^{\{t-1\}}, y)$

setting                    | loss function                  | negative gradient $-\frac{\partial}{\partial f} L(f, y)$
regression                 | $\frac{1}{2}\lVert y - f \rVert_2^2$ | $y - f$
regression                 | $\lVert y - f \rVert_1$              | $\mathrm{sign}(y - f)$
binary classification      | exponential loss $\exp(-yf)$         | $y \exp(-yf)$
multiclass classification  | multi-class cross-entropy            | $Y - P$

here $Y$ is the $N \times C$ one-hot coding of the labels for C-class classification, and $P$ is the $N \times C$ matrix of predicted class probabilities with $P_{:,c} = \mathrm{softmax}(f)[c]$

12
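The same negative gradients written out as small NumPy functions (my implementation, following the table above); these are the pseudo-residuals a gradient-boosting step would fit the weak learner to.

```python
import numpy as np

def neg_grad_squared(f, y):        # L = 1/2 ||y - f||^2
    return y - f

def neg_grad_absolute(f, y):       # L = ||y - f||_1
    return np.sign(y - f)

def neg_grad_exponential(f, y):    # L = exp(-y f), with y in {-1, +1}
    return y * np.exp(-y * f)

def neg_grad_softmax_xent(F, Y):   # multi-class cross-entropy; F (logits) and Y (one-hot) are N x C
    F = F - F.max(axis=1, keepdims=True)                   # stabilized softmax
    P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)   # predicted class probabilities
    return Y - P
```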