Applied Machine Learning
Bootstrap, Bagging and Boosting
Siamak Ravanbakhsh
COMP 551 (winter 2020)
Learning objectives
bootstrap for uncertainty estimation
bagging for variance reduction
random forests
boosting: AdaBoost, gradient boosting, relationship to L1 regularization
Bootstrap
a simple approach to estimate the uncertainty in prediction
given the dataset D = {(x^(n), y^(n))}_{n=1}^N
subsample with replacement B datasets of size N: D_b = {(x^(n,b), y^(n,b))}_{n=1}^N, b = 1, …, B
train a model on each of these bootstrap datasets (called bootstrap samples)
produce a measure of uncertainty from these models: for model parameters, or for predictions
this resampling scheme is the non-parametric bootstrap
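The procedure above fits in a few lines of NumPy; a minimal sketch on a made-up dataset, using the sample mean as the quantity whose uncertainty we want (the data and the choice of statistic are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # toy dataset (an assumption)

B = 2000
means = np.empty(B)
for b in range(B):
    # subsample with replacement, same size as the original dataset
    sample = rng.choice(data, size=data.size, replace=True)
    means[b] = sample.mean()  # the statistic computed on each bootstrap sample

# spread of the statistic across bootstrap samples = uncertainty estimate
lo, hi = np.quantile(means, [.05, .95])  # 90% bootstrap interval
```

The same recipe works for any statistic: refit the model (or recompute the quantity) on each bootstrap sample and look at the spread of the results.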
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), ϕ_k(x) = exp(−(x − μ_k)²/s²)
y^(n) = sin(x^(n)) + cos(|x^(n)|) + ϵ

    # x: N,  y: N
    plt.plot(x, y, 'b.')
    phi = lambda x, mu: np.exp(-(x - mu)**2)
    mu = np.linspace(0, 10, 10)         # 10 Gaussian bases
    Phi = phi(x[:, None], mu[None, :])  # N x 10
    w = np.linalg.lstsq(Phi, y)[0]
    yh = np.dot(Phi, w)
    plt.plot(x, yh, 'g-')

figure: the data before adding noise, the noise, and our fit to the data using 10 Gaussian bases
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), using B=500 bootstrap samples

    # Phi: N x D,  y: N
    B = 500
    ws = np.zeros((B, D))
    for b in range(B):
        inds = np.random.randint(N, size=(N))
        Phi_b = Phi[inds, :]  # N x D
        y_b = y[inds]         # N
        # fit the subsampled data
        ws[b, :] = np.linalg.lstsq(Phi_b, y_b)[0]
    plt.hist(ws, bins=50)

gives a measure of uncertainty of the parameters
figure: histogram of the bootstrap weights; each color is a different weight w_d
Bootstrap: example
Recall: linear model with nonlinear Gaussian bases (N=100), using B=500 bootstrap samples
also gives a measure of uncertainty of the predictions

    # Phi: N x D,  Phi_test: Nt x D,  y: N
    # ws: B x D from previous code
    y_hats = np.zeros((B, Nt))
    for b in range(B):
        wb = ws[b, :]
        y_hats[b, :] = np.dot(Phi_test, wb)
    # get 5% and 95% quantiles
    y_5 = np.quantile(y_hats, .05, axis=0)
    y_95 = np.quantile(y_hats, .95, axis=0)

figure: the red lines are the 5% and 95% quantiles (for each point we can get these across bootstrap model predictions)
Winter 2020 | Applied Machine Learning (COMP551)
Bagging
use bootstrap for more accurate prediction (not just uncertainty)
variance of a sum of random variables:
Var(z_1 + z_2) = E[(z_1 + z_2)²] − E[z_1 + z_2]²
= E[z_1² + z_2² + 2 z_1 z_2] − (E[z_1] + E[z_2])²
= E[z_1²] + E[z_2²] + 2 E[z_1 z_2] − E[z_1]² − E[z_2]² − 2 E[z_1] E[z_2]
= Var(z_1) + Var(z_2) + 2 Cov(z_1, z_2)
for uncorrelated variables the covariance term is zero
Bagging
use bootstrap for more accurate prediction (not just uncertainty)
the average of uncorrelated random variables has a lower variance:
if z_1, …, z_B are uncorrelated random variables with mean μ and variance σ², the average z̄ = (1/B) ∑_b z_b has mean μ and variance
Var((1/B) ∑_b z_b) = (1/B²) ∑_b Var(z_b) = B σ² / B² = σ² / B
use this to reduce the variance of our models (the bias remains the same)
regression: average the model predictions f̄(x) = (1/B) ∑_b f_b(x)
issue: model predictions are not uncorrelated (trained using the same data)
bagging (bootstrap aggregation): use bootstrap samples to reduce correlation
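The σ²/B variance reduction is easy to verify numerically; a small simulation (the distribution and the values of B and σ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
B, trials, sigma = 50, 20000, 1.0
# in each trial, B uncorrelated variables with variance sigma^2
z = rng.normal(0.0, sigma, size=(trials, B))
zbar = z.mean(axis=1)  # their average, one per trial

var_single = z[:, 0].var()  # ≈ sigma^2
var_avg = zbar.var()        # ≈ sigma^2 / B
```

The empirical variance of the average comes out roughly B times smaller than that of a single variable, matching the derivation above.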
Bagging for classification
averaging makes sense for regression; how about classification?
wisdom of crowds: suppose z_1, …, z_B ∈ {0, 1} are IID Bernoulli random variables with mean μ = .5 + ϵ, and let z̄ = (1/B) ∑_b z_b
for ϵ > 0, p(z̄ > .5) goes to 1 as B grows
the mode of iid classifiers that are better than chance is a better classifier: use voting
crowds are wiser when individuals are better than random and votes are uncorrelated
bagging (bootstrap aggregation): use bootstrap samples to reduce correlation
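The wisdom-of-crowds claim can be checked with a quick simulation: majority vote over independent voters that are each right with probability .5 + ϵ (the values of ϵ, B, and the number of trials are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.1, 5000  # each voter is correct with probability .5 + eps

def p_majority_correct(B):
    # votes[i, b] = 1 if voter b is correct in trial i
    votes = rng.random((trials, B)) < 0.5 + eps
    # the majority is correct when more than half the votes are correct
    return (votes.mean(axis=1) > 0.5).mean()

p1 = p_majority_correct(1)      # a single voter: ≈ .5 + eps
p101 = p_majority_correct(101)  # a large crowd: close to 1
```

With independent voters the majority's accuracy climbs toward 1 as B grows; correlated voters would break this, which is exactly why bagging decorrelates the models.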
Bagging decision trees
example setup: synthetic dataset with 5 correlated features; the 1st feature is a noisy predictor of the label
bootstrap samples create different decision trees (due to high variance)
to aggregate the B trees: vote for the most probable class, or average the predicted probabilities
compared to decision trees, the ensemble is no longer interpretable!
Random forests
further reduce the correlation between decision trees
feature sub-sampling: only a random subset of features is available for the split at each step
this further reduces the dependence between decision trees
a common default is √D features per split (a magic number?); this is really a hyper-parameter and can be optimized using CV
Out Of Bag (OOB) samples: the instances not included in a bootstrap dataset can be used for validation
this gives simultaneous validation of the decision trees in a forest; no need to set aside data for cross validation
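Why do OOB samples exist at all? Sampling N points with replacement leaves each instance out with probability (1 − 1/N)^N ≈ e⁻¹ ≈ 37%, so every tree has a built-in validation set. A small check of that fraction (the values of N and B are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 1000, 200
oob_fracs = np.empty(B)
for b in range(B):
    inds = rng.integers(N, size=N)   # one bootstrap sample (with replacement)
    in_bag = np.zeros(N, dtype=bool)
    in_bag[inds] = True
    oob_fracs[b] = (~in_bag).mean()  # fraction of instances left out-of-bag

avg_oob = oob_fracs.mean()  # ≈ (1 - 1/N)^N ≈ e^-1 ≈ 0.368
```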
Example: spam detection
Dataset: N=4601 emails; binary classification task: spam vs. not spam; D=57 features:
48 words: percentage of words in the email that match these words, e.g., business, address, internet, free, George (customized per user)
6 characters: again the percentage of characters that match these: ch; , ch( , ch[ , ch! , ch$ , ch#
average, max, and sum of the length of uninterrupted sequences of capital letters: CAPAVE, CAPMAX, CAPTOT
this is an example of feature engineering
figure: average value of these features in the spam and non-spam emails
Example: spam detection
figure: decision tree after pruning; the number of leaves (17) in the optimal pruning is decided based on cross-validation error; the plot shows the cv error and the test error (misclassification rate on test data)
Example: spam detection
Bagging and Random Forests do much better than a single decision tree!
Out Of Bag (OOB) error can be used for parameter tuning (e.g., size of the forest)
Summary so far...
Bootstrap is a powerful technique to get uncertainty estimates
Bootstrap aggregation (Bagging) can reduce the variance of unstable models
Random forests: Bagging + further de-correlation via feature sub-sampling at each split
OOB validation instead of CV
destroy the interpretability of decision trees
perform well in practice
can fail if only a few relevant features exist (due to feature-sampling)
Adaptive bases
f(x) = ∑_d w_d ϕ_d(x; v_d)
several methods can be classified as learning these bases adaptively: decision trees, generalized additive models, boosting, neural networks
in boosting each basis is a classifier or regression function (a weak learner, or base learner); we create a strong learner by sequentially combining weak learners
Forward stagewise additive modelling
model: f(x) = ∑_{t=1}^T w^{t} ϕ(x; v^{t}), where ϕ is a simple model, such as a decision stump (a decision tree with one node)
cost: J({w^{t}, v^{t}}_t) = ∑_{n=1}^N L(y^(n), f(x^(n)))
so far we have seen L2 loss, log loss and hinge loss
optimizing this cost is difficult given the form of f
optimization idea: add one weak learner in each stage t, to reduce the error of the previous stage
1. find the best weak learner: (v^{t}, w^{t}) = argmin_{v,w} ∑_{n=1}^N L(y^(n), f^{t−1}(x^(n)) + w ϕ(x^(n); v))
2. add it to the current model: f^{t}(x) = f^{t−1}(x) + w^{t} ϕ(x; v^{t})
L2 loss & forward stagewise linear model
using L2 loss for regression; consider weak learners that are individual features: ϕ^{t}(x) = w^{t} x_{d^{t}}
cost at stage t:
argmin_{d, w_d} ½ ∑_{n=1}^N (y^(n) − (f^{t−1}(x^(n)) + w_d x_d^(n)))²
the term r^(n) = y^(n) − f^{t−1}(x^(n)) is the residual
optimization: recall that the optimal weight for each d is w_d = ∑_n x_d^(n) r^(n) / ∑_n (x_d^(n))²
pick the feature that most significantly reduces the residual
the model at time-step t is f^{t}(x) = ∑_t α w^{t} x_{d^{t}}
using a small α helps with test error
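This stagewise procedure can be sketched directly in NumPy; a minimal example on a made-up sparse linear problem (the data, the step size α, and the number of stages are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 10
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [3.0, -2.0, 1.0]            # sparse ground truth (an assumption)
y = X @ w_true + 0.1 * rng.normal(size=N)

alpha, T = 0.1, 500
w = np.zeros(D)
r = y.copy()                              # residual of the current model
for t in range(T):
    # optimal weight for every feature: w_d = sum_n x_d r / sum_n x_d^2
    wd = (X * r[:, None]).sum(0) / (X**2).sum(0)
    # pick the feature that most reduces the squared residual
    d = int(np.argmax(np.abs(wd) * np.sqrt((X**2).sum(0))))
    w[d] += alpha * wd[d]                 # take a small step on that feature
    r -= alpha * wd[d] * X[:, d]          # update the residual

mse = (r**2).mean()
```

Because only one coordinate moves per stage, the path of w over t resembles a regularization path: early stopping at small t behaves like strong regularization.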
L2 loss & forward stagewise linear model
is this related to L1-regularized linear regression?
example: using a small learning rate α = .01, L2 Boosting has a similar regularization path to lasso
figure: w_d against t for boosting, and w_d against ∑_d |w_d| for lasso; at each time-step only one feature d^{t} is updated / added
we can view boosting as doing feature (base learner) selection in exponentially large spaces (e.g., all trees of size K)
the number of steps t plays a similar role to (the inverse of) the regularization hyper-parameter
Exponential loss & AdaBoost
loss functions for binary classification y ∈ {−1, +1}; the predicted label is ŷ = sign(f(x))
misclassification loss (0-1 loss): L(y, f(x)) = I(y f(x) < 0)
log-loss (aka cross entropy loss or binomial deviance): L(y, f(x)) = log(1 + e^{−y f(x)})
hinge loss (support vector loss): L(y, f(x)) = max(0, 1 − y f(x))
yet another loss function is the exponential loss: L(y, f(x)) = e^{−y f(x)}
note that this loss grows faster than the other surrogate losses (more sensitive to outliers)
a useful property when working with additive models:
L(y, f^{t−1}(x) + w^{t} ϕ(x; v^{t})) = L(y, f^{t−1}(x)) ⋅ L(y, w^{t} ϕ(x; v^{t}))
treat the first factor as a weight q for an instance: instances that were not properly classified before receive a higher weight
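The multiplicative property is just e^{a+b} = e^a e^b applied to the additive model; a quick numerical check on made-up values (the points, weak-learner outputs, and w are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=100)       # labels
f_prev = rng.normal(size=100)               # f^{t-1}(x) on 100 points
phi = rng.choice([-1.0, 1.0], size=100)     # weak learner output, +/- 1
w = 0.7

# exponential loss of the updated model ...
lhs = np.exp(-y * (f_prev + w * phi))
# ... factors into (old loss) * (loss of the new term)
rhs = np.exp(-y * f_prev) * np.exp(-y * w * phi)
max_gap = np.abs(lhs - rhs).max()
```

No other common surrogate loss factorizes this way, which is what makes the exponential loss so convenient for deriving AdaBoost.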
Exponential loss & AdaBoost
cost, using the exponential loss:
J({w^{t}, v^{t}}_t) = ∑_{n=1}^N L(y^(n), f^{t−1}(x^(n)) + w^{t} ϕ(x^(n); v^{t})) = ∑_n q^(n) L(y^(n), w^{t} ϕ(x^(n); v^{t}))
where q^(n) = L(y^(n), f^{t−1}(x^(n))) is the loss for this instance at the previous stage
discrete AdaBoost: assume the weak learner is a simple classifier, so its output is +/- 1
J({w^{t}, v^{t}}_t) = ∑_n q^(n) e^{−y^(n) w^{t} ϕ(x^(n); v^{t})}
= e^{−w^{t}} ∑_n q^(n) I(y^(n) = ϕ(x^(n); v^{t})) + e^{w^{t}} ∑_n q^(n) I(y^(n) ≠ ϕ(x^(n); v^{t}))
= e^{−w^{t}} ∑_n q^(n) + (e^{w^{t}} − e^{−w^{t}}) ∑_n q^(n) I(y^(n) ≠ ϕ(x^(n); v^{t}))
optimization: the objective is to find the weak learner minimizing the cost above
the first term does not depend on the weak learner; assuming w^{t} ≥ 0, the weak learner should minimize the second sum — this is classification with weighted instances
Exponential loss & AdaBoost
minimizing the weighted classification error gives v^{t}
we still need to find the optimal w^{t}: setting ∂J/∂w^{t} = 0 gives
w^{t} = ½ log((1 − ℓ^{t}) / ℓ^{t})
where ℓ^{t} = ∑_n q^(n) I(ϕ(x^(n); v^{t}) ≠ y^(n)) / ∑_n q^(n) is the weight-normalized misclassification error
since the weak learner is better than chance, ℓ^{t} < .5 and so w^{t} ≥ 0
we can now update the instance weights q for the next iteration (multiply by the new loss):
q^(n),{t+1} = q^(n),{t} e^{−w^{t} y^(n) ϕ(x^(n); v^{t})}
since w^{t} > 0, the weights q of misclassified points increase and the rest decrease
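The closed form for the stage weight follows by differentiating the cost; a short derivation (writing w for w^{t} and ℓ for the weight-normalized error):

```latex
J = e^{-w}\sum_n q^{(n)} + \left(e^{w} - e^{-w}\right)\sum_n q^{(n)}\,\mathbb{I}\!\left(y^{(n)} \neq \phi(x^{(n)}; v)\right)
\\[4pt]
\frac{\partial J}{\partial w}
= -e^{-w}\sum_n q^{(n)} + \left(e^{w} + e^{-w}\right)\sum_n q^{(n)}\,\mathbb{I}\!\left(y^{(n)} \neq \phi(x^{(n)}; v)\right) = 0
\\[4pt]
\Rightarrow\quad
e^{2w} = \frac{\sum_n q^{(n)} - \sum_n q^{(n)}\,\mathbb{I}(y^{(n)} \neq \phi(x^{(n)}; v))}{\sum_n q^{(n)}\,\mathbb{I}(y^{(n)} \neq \phi(x^{(n)}; v))}
= \frac{1-\ell}{\ell}
\quad\Rightarrow\quad
w = \tfrac{1}{2}\log\frac{1-\ell}{\ell}
```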
Exponential loss & AdaBoost

overall algorithm for discrete AdaBoost:

initialize $q^{(n)} := \frac{1}{N}$ for all $n$
for t = 1:T
  fit the simple classifier $\phi(x; v^{t})$ to the weighted dataset
  $\ell^{t} := \frac{\sum_n q^{(n)}\, \mathbb{I}(\phi(x^{(n)}; v^{t}) \neq y^{(n)})}{\sum_n q^{(n)}}$
  $w^{t} := \frac{1}{2} \log \frac{1 - \ell^{t}}{\ell^{t}}$
  $q^{(n)} := q^{(n)}\, e^{-w^{t} y^{(n)} \phi(x^{(n)}; v^{t})}$ for all $n$
return $f(x) = \mathrm{sign}(\sum_t w^{t} \phi(x; v^{t}))$

the final classifier combines the weak learners $w^{1}\phi(x; v^{1}), w^{2}\phi(x; v^{2}), \dots, w^{T}\phi(x; v^{T})$.

8 . 4
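The algorithm above can be sketched in plain Python; the exhaustive decision-stump weak learner, the helper names, and the toy dataset are mine (not from the slides), so treat this as a minimal illustration rather than a reference implementation.

```python
import math

def stump_fit(X, y, q):
    # exhaustive search over (feature, threshold, polarity) decision stumps,
    # minimizing the weighted misclassification error with instance weights q
    best = None
    for d in range(len(X[0])):
        vals = sorted(set(x[d] for x in X))
        thresholds = [(a + b) / 2 for a, b in zip(vals, vals[1:])] + [vals[0] - 1, vals[-1] + 1]
        for thr in thresholds:
            for s in (1, -1):
                err = sum(qi for x, yi, qi in zip(X, y, q)
                          if s * (1 if x[d] > thr else -1) != yi)
                if best is None or err < best[0]:
                    best = (err, d, thr, s)
    _, d, thr, s = best
    return lambda x: s * (1 if x[d] > thr else -1)

def adaboost(X, y, T):
    # discrete AdaBoost: labels y must be in {-1, +1}
    N = len(X)
    q = [1.0 / N] * N                      # initialize uniform instance weights
    ensemble = []                          # list of (w^{t}, stump)
    for _ in range(T):
        phi = stump_fit(X, y, q)
        # weight-normalized misclassification error l^{t}
        l = sum(qi for x, yi, qi in zip(X, y, q) if phi(x) != yi) / sum(q)
        l = min(max(l, 1e-12), 1 - 1e-12)  # avoid log(0)
        w = 0.5 * math.log((1 - l) / l)    # optimal coefficient w^{t}
        # multiply each weight by the new exponential loss term
        q = [qi * math.exp(-w * yi * phi(x)) for x, yi, qi in zip(X, y, q)]
        ensemble.append((w, phi))
    def f(x):                              # f(x) = sign(sum_t w^{t} phi(x; v^{t}))
        return 1 if sum(w * phi(x) for w, phi in ensemble) >= 0 else -1
    return f

# tiny 1-D toy example that no single stump can separate (pattern + + - - + +)
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
f = adaboost(X, y, T=3)
print([f(x) for x in X])  # → [1, 1, -1, -1, 1, 1]
```

After three rounds the boosted stumps already fit this training set, even though each individual stump misclassifies at least two points.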
AdaBoost

example: each weak learner is a decision stump (dashed line); $\hat{y} = \mathrm{sign}(\sum_t w^{t} \phi(x; v^{t}))$

[figure: decision boundaries after t = 1, 2, 3, 6, 10, 150 rounds; green is the decision boundary of $f^{t}$; circle size is proportional to $q^{(n),t}$]

8 . 5
AdaBoost

example: features $x_1^{(n)}, \dots, x_{10}^{(n)}$ are samples from a standard Gaussian, with label $y^{(n)} = \mathbb{I}(\sum_d (x_d^{(n)})^2 > 9.34)$ and N = 2000 training examples.

notice that the test error does not increase: AdaBoost is very slow to overfit.

8 . 6

Winter 2020 | Applied Machine Learning (COMP551)
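For reference, this synthetic dataset can be generated in a few lines of Python; the function name and the seed are arbitrary choices of mine.

```python
import random

def make_example(N, D=10, seed=0):
    # the synthetic example from the slides: features from a standard Gaussian,
    # label +1 iff the squared norm exceeds 9.34 (roughly the median of a
    # chi-squared with 10 degrees of freedom), so the classes are near-balanced
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(N):
        x = [rng.gauss(0, 1) for _ in range(D)]
        X.append(x)
        y.append(1 if sum(v * v for v in x) > 9.34 else -1)
    return X, y

X, y = make_example(2000)
print(sum(1 for t in y if t == 1) / len(y))  # close to 0.5
```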
application: Viola-Jones face detection

Haar features are computationally efficient; each feature is a weak learner, and AdaBoost picks one feature at a time (label: face/no-face).

This can still be inefficient, so use the fact that faces are rare (.01% of subwindows are faces): build a cascade of classifiers, where each stage keeps ~100% detection at some false-positive rate, so the cumulative FP rate shrinks with each stage.

the cascade is applied over all image subwindows; fast enough for real-time (object) detection.

image source: David Lowe slides

9
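The cascade idea can be sketched as follows; representing each stage as a score function plus a threshold is a simplification of mine, not the exact Viola-Jones structure.

```python
def cascade_predict(x, stages):
    # stages: list of (classifier, threshold); each stage is tuned for very
    # high detection (recall) at a modest false-positive rate. A subwindow
    # must pass every stage to be declared a face; most non-faces are
    # rejected by the first cheap stages, keeping the average cost low.
    for clf, thr in stages:
        if clf(x) < thr:
            return False  # early rejection: stop evaluating further stages
    return True

# toy stages: scores are just the input itself, with increasing thresholds
stages = [(lambda x: x, 0.0), (lambda x: x, 0.5)]
print(cascade_predict(1.0, stages), cascade_predict(0.2, stages))
```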
Gradient boosting

idea: fit the weak learner to the gradient of the cost.

let $f^{t} = [f^{t}(x^{(1)}), \dots, f^{t}(x^{(N)})]^\top$ and the true labels $y = [y^{(1)}, \dots, y^{(N)}]^\top$.

ignoring the structure of $f$, we want $f^* = \arg\min_f L(f, y)$.

if we use gradient descent to minimize the loss, write $f$ as a sum of steps:
$f = f^{T} = f^{0} - \sum_{t=1}^{T} w^{t} g^{t}$, where $g^{t} = \frac{\partial}{\partial f} L(f^{t-1}, y)$ is the gradient vector; its role is similar to a residual.

we can look for the optimal step size: $w^{t} = \arg\min_w L(f^{t-1} - w\, g^{t})$.

so far we treated $f$ as a parameter vector; now fit the weak learner to the negative of the gradient:
$v^{t} = \arg\min_v \frac{1}{2} \|\phi_v - (-g^{t})\|_2^2$, where $\phi_v = [\phi(x^{(1)}; v), \dots, \phi(x^{(N)}; v)]^\top$.

note that we are fitting the gradient using the L2 loss regardless of the original loss function.

10 . 1
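As a concrete instance of the gradient vector $g^{t}$: for the squared loss the negative gradient is exactly the residual, and for the binary log-loss (labels in $\{-1,+1\}$) it is a softened version of it. A quick sketch; the helper names are mine.

```python
import math

def neg_grad_squared(y, f):
    # L = 1/2 (y - f)^2  =>  -dL/df = y - f  (the residual)
    return y - f

def neg_grad_logloss(y, f):
    # L = log(1 + exp(-y f))  =>  -dL/df = y / (1 + exp(y f))
    return y / (1.0 + math.exp(y * f))

print(neg_grad_squared(1.0, 0.25))            # → 0.75
print(round(neg_grad_logloss(1.0, 0.0), 2))   # → 0.5
```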
Gradient tree boosting

apply gradient boosting to CART (classification and regression trees):

initialize $f^{0}$ to predict a constant
for t = 1:T
  calculate the negative of the gradient $r = -\frac{\partial}{\partial f} L(f^{t-1}, y)$
  fit a regression tree to $X$ ($N \times D$) and $r$ ($N$), producing regions $R_1, \dots, R_K$
  readjust the prediction per region: $w_k = \arg\min_w \sum_{x^{(n)} \in R_k} L(y^{(n)}, f^{t-1}(x^{(n)}) + w)$ (this is effectively the line search)
  update $f^{t}(x) = f^{t-1}(x) + \alpha \sum_{k=1}^{K} w_k\, \mathbb{I}(x \in R_k)$
return $f^{T}(x)$

decide T using a validation set (early stopping).
shallow trees of K = 4-8 leaves usually work well as weak learners.
using a small learning rate $\alpha$ here improves test error (shrinkage).
stochastic gradient boosting combines bootstrap and boosting: use a subsample at each iteration above, similar to stochastic gradient descent.

10 . 2
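A minimal sketch of the loop above for the squared loss, where the per-region line search has the closed form $w_k$ = mean residual in $R_k$. Depth-one "trees" (stumps, K = 2) stand in for CART here, and all names are mine.

```python
def fit_stump(X, r, d_range):
    # regression stump: best (feature, threshold) split of the residuals r,
    # predicting the mean residual in each region (for the squared loss this
    # mean is exactly the line-search value w_k)
    best = None
    for d in d_range:
        vals = sorted(set(x[d] for x in X))
        for thr in [(a + b) / 2 for a, b in zip(vals, vals[1:])]:
            left = [ri for x, ri in zip(X, r) if x[d] <= thr]
            right = [ri for x, ri in zip(X, r) if x[d] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((ri - ml) ** 2 for ri in left)
                   + sum((ri - mr) ** 2 for ri in right))
            if best is None or sse < best[0]:
                best = (sse, d, thr, ml, mr)
    if best is None:                       # no valid split: predict a constant
        m = sum(r) / len(r)
        return lambda x: m
    _, d, thr, ml, mr = best
    return lambda x: ml if x[d] <= thr else mr

def gradient_tree_boost(X, y, T, alpha=0.5):
    f0 = sum(y) / len(y)                   # f^{0}: constant initial prediction
    preds = [f0] * len(X)
    trees = []
    for _ in range(T):
        # negative gradient of 1/2 (y - f)^2 is the residual y - f
        r = [yi - fi for yi, fi in zip(y, preds)]
        tree = fit_stump(X, r, range(len(X[0])))
        preds = [fi + alpha * tree(x) for fi, x in zip(preds, X)]  # shrinkage
        trees.append(tree)
    return lambda x: f0 + alpha * sum(t(x) for t in trees)

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
f = gradient_tree_boost(X, y, T=20)
print([round(f(x), 2) for x in X])  # → [0.0, 0.0, 1.0, 1.0]
```

With shrinkage $\alpha = 0.5$ the residuals halve each round, so after 20 rounds the training predictions are within $\sim 10^{-6}$ of the targets.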
Gradient tree boosting

example: recall the synthetic example with features $x_1^{(n)}, \dots, x_{10}^{(n)}$ sampled from a standard Gaussian, label $y^{(n)} = \mathbb{I}(\sum_d (x_d^{(n)})^2 > 9.34)$, and N = 2000 training examples.

gradient tree boosting (using the log-loss) works better than AdaBoost; since the label depends on a sum over individual features, using stumps works best.

10 . 3
Gradient tree boosting

example (same synthetic setup), with shrinkage $\alpha = .2$; deviance = cross-entropy = log-loss.

[figure: test deviance and misclassification error for stumps (K=2) and K=6 trees]

in both cases using shrinkage helps; while the test loss may increase, the test misclassification error does not.

10 . 4
Gradient tree boosting

example (same synthetic setup).

[figure: test error for $\alpha = 1$, $\alpha = .1$, stochastic with batch size 50%, and $\alpha = .1$ combined with stochastic]

both shrinkage and subsampling can help, at the cost of more hyper-parameters to tune.

10 . 5
Gradient tree boosting
example
see the interactive demo: https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html
10 . 6
Summary

two ensemble methods:

bagging & random forests (reduce variance): produce models with minimal correlation and use their average prediction.

boosting (reduces the bias of the weak learner): models are added in steps and a single cost function is minimized; for the exponential loss this can be interpreted as re-weighting the instances (AdaBoost); gradient boosting fits the weak learner to the negative of the gradient; boosting also has an interpretation as L1 regularization for "weak learner" selection, and is related to max-margin classification (for a large number of steps T).

random forests and (gradient) boosting generally perform very well.

11
Gradient boosting

gradients for some loss functions, using
$f = f^{T} = f^{0} - \sum_{t=1}^{T} w^{t} \frac{\partial}{\partial f} L(f^{t-1}, y)$
and one-hot coding for C-class classification:

setting | loss function | negative gradient $-\frac{\partial}{\partial f} L$
regression | $\frac{1}{2}\|y - f\|_2^2$ | $y - f$
regression | $\|y - f\|_1$ | $\mathrm{sign}(y - f)$
multiclass classification | multi-class cross-entropy | $Y - P$ (both $N \times C$), where the predicted class probabilities are $P_{n,c} = \mathrm{softmax}(f^{(n)})_c$
binary classification | exponential loss $\exp(-y f)$ | $y \exp(-y f)$

12
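The negative-gradient column can be sanity-checked numerically with a central finite difference; a sketch for the squared and exponential losses in the scalar case (helper name is mine).

```python
import math

def num_neg_grad(L, f, eps=1e-6):
    # central finite-difference estimate of -dL/df at f
    return -(L(f + eps) - L(f - eps)) / (2 * eps)

y, f = 1.0, 0.3
# squared loss: -d/df 1/2 (y - f)^2 = y - f
assert abs(num_neg_grad(lambda g: 0.5 * (y - g) ** 2, f) - (y - f)) < 1e-4
# exponential loss: -d/df exp(-y f) = y exp(-y f)
assert abs(num_neg_grad(lambda g: math.exp(-y * g), f) - y * math.exp(-y * f)) < 1e-4
print("ok")
```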