Online Learning: Wrap Up & Boosting: Interpretations, Extensions, Theory


Page 1:

Online Learning: Wrap Up & Boosting: Interpretations, Extensions, Theory

Arindam Banerjee

Page 2:

Online: “Things we liked”

Adaptive (Natural)

Computationally efficient

Good guarantees [2]

Gradient descent based

No assumptions [2]

Page 3:

Online: “Things we didn’t like”

Assumes some good expert or oracle [6]

Not “robust” (noisy data)

Tracking of outliers

Intuition about convergence rates

Unsupervised versions

Stock market assumptions

Worst-case focus (conservative)

Learning rate (how to choose?)

Page 4:

Online: “General Questions”

Relations with Bayesian inference (e.g., Dirichlet)

Convergence proofs are good, but what about intuitive algorithms?

Is there a batch-to-online continuum?

Pages 5–11:

Some Key Ideas

Hypothesis Spaces and Oracles

Approximation error and Estimation error

Complexity of Hypothesis Spaces (VC dimensions)

Regularization and “Being Conservative”

Sufficient conditions for “Learning”

Empirical Estimation Theory and Falsifiability

Let’s get back to real life ... to Boosting ...

Page 12:

Weak Learning

A weak learner predicts “slightly better” than random

The PAC setting (basics):
Let X be an instance space, c : X → {0, 1} be a target concept, and H be a hypothesis space (h : X → {0, 1})
Let Q be a fixed (but unknown) distribution on X

An algorithm, after training on (xi, c(xi)), i = 1, . . . , m, selects h ∈ H such that

Px∼Q[h(x) ≠ c(x)] ≤ 1/2 − γ

The algorithm is called a γ-weak learner

We assume the existence of such a learner
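For concreteness, here is a minimal sketch of a weighted decision-stump learner of the kind usually plugged in as the weak learner. The stump form, the ±1 label convention (matching the later Adaboost slides rather than the {0, 1} labels above), and all names are illustrative assumptions, not part of the slides.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Return the decision stump (feature, threshold, sign) with the smallest
    weighted error under D.  X: (m, d) array, y in {-1, +1}, D: distribution over examples."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(D[pred != y])              # weighted error under D
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    j, thr, sign = best
    return lambda Xnew: np.where(Xnew[:, j] <= thr, sign, -sign)

# gamma-weak learning on a distribution D means:
#   h = stump_weak_learner(X, y, D);  np.sum(D[h(X) != y]) <= 1/2 - gamma
```

Whether such a stump actually achieves error at most 1/2 − γ on every distribution is exactly the weak-learning assumption above.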

Page 13:

The Boosting Model

Boosting converts a weak learner to a strong learner

Boosting proceeds in rounds:
The booster constructs a distribution Dt on X, the training set
The weak learner produces a hypothesis ht ∈ H so that

Px∼Dt[ht(x) ≠ c(x)] ≤ 1/2 − γt

After T rounds, the weak hypotheses ht, t = 1, . . . , T, are combined into a final hypothesis hfinal

We need procedures (a sketch follows this list)
for obtaining Dt at each step
for combining the weak hypotheses
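As a sketch of that abstract structure (not an algorithm from the slides), the two required procedures can be passed in as callables; the function and argument names here are hypothetical.

```python
import numpy as np

def boost(X, y, weak_learn, build_distribution, combine, T):
    """Generic boosting loop with the two plug-in procedures the slide asks for:
    build_distribution produces Dt+1, combine produces the final hypothesis."""
    m = len(y)
    D = np.full(m, 1.0 / m)                     # D1: uniform over the training set
    hypotheses = []
    for t in range(T):
        h = weak_learn(X, y, D)                 # weak hypothesis ht trained under Dt
        hypotheses.append(h)
        D = build_distribution(D, h, X, y)      # booster constructs Dt+1
    return combine(hypotheses)                  # combined final hypothesis hfinal
```

Adaboost, on the next page, is one particular choice of these two procedures.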

Page 14:

Adaboost

Input: Training set (x1, y1), . . . , (xm, ym)

Algorithm: Initialize D1(i) = 1/m

For t = 1, . . . , T

Train a weak learner using distribution Dt

Get weak hypothesis ht with error εt = Prx∼Dt[ht(x) ≠ y]

Choose αt = (1/2) ln((1 − εt)/εt)

Update

Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt

where Zt is the normalization factor

Output: h(x) = sign( ∑_{t=1}^T αt ht(x) )
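A compact numpy rendering of this pseudocode might look as follows; the function names, the clipping of εt, and the early stop when εt ≥ 1/2 are implementation choices, not part of the slide. Any callable weak_learn(X, y, D) that returns a ±1-valued hypothesis will do.

```python
import numpy as np

def adaboost(X, y, weak_learn, T):
    """Adaboost as on the slide. y must be in {-1, +1}; weak_learn(X, y, D)
    returns a callable hypothesis h with h(X) in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                                      # D1(i) = 1/m
    alphas, hyps = [], []
    for t in range(T):
        h = weak_learn(X, y, D)                                  # train weak learner using Dt
        pred = h(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-12, 1 - 1e-12)    # weighted error eps_t
        if eps >= 0.5:                                           # no longer a weak hypothesis: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)                    # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                                             # dividing by Zt, the normalizer
        alphas.append(alpha)
        hyps.append(h)

    def h_final(Xnew):                                           # h(x) = sign(sum_t alpha_t h_t(x))
        return np.sign(sum(a * g(Xnew) for a, g in zip(alphas, hyps)))
    return h_final
```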

Page 15:

Training Error, Margin Error

[Figure: loss functions for boosting, plotted against the margin yf(x); curves for the 0−1 loss, the Adaboost (exponential) loss, and the Logitboost (logistic) loss]

The 0-1 training set loss with convex upper bounds: exponential loss and logistic loss
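Since the figure itself is not reproduced here, a quick way to see the same comparison is to tabulate the three losses on a grid of margins. This snippet just evaluates the formulas given on the later slides (natural log for the logistic loss, as written there); the grid and the tie-breaking at margin 0 are choices of the illustration.

```python
import numpy as np

margins = np.linspace(-1, 1, 9)                    # values of the margin y*f(x)
zero_one = (margins <= 0).astype(float)            # 0-1 loss (counting margin 0 as an error)
exp_loss = np.exp(-margins)                        # Adaboost (exponential) loss
logit_loss = np.log1p(np.exp(-margins))            # Logitboost (logistic) loss

for z, l0, le, ll in zip(margins, zero_one, exp_loss, logit_loss):
    print(f"yf(x) = {z:+.2f}   0-1: {l0:.0f}   exp: {le:.3f}   logistic: {ll:.3f}")
```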

Page 16:

More on the Training Error

The training error of the final classifier is bounded:

(1/m) |{i : h(xi) ≠ yi}| ≤ (1/m) ∑_{i=1}^m exp(−yi h(xi)) = ∏_{t=1}^T Zt

Training error can be reduced most rapidly by choosing the αt that minimizes

Zt = ∑_{i=1}^m Dt(i) exp(−αt yi ht(xi))

Adaboost chooses the optimal αt

Margin is (sort of) maximized

Other boosting algorithms minimize other upper bounds
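As a quick numerical sanity check of this bound (a toy sketch: random biased ±1 prediction vectors stand in for trained weak hypotheses, and the sizes and seed are arbitrary), the product of the Zt equals the average exponential loss, which in turn upper bounds the training error.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 200, 10
y = rng.choice([-1, 1], size=m)
# Stand-in "weak hypotheses": random +/-1 predictions that agree with y 65% of the time
H = np.where(rng.random((T, m)) < 0.65, y, -y)

D = np.full(m, 1.0 / m)
alphas, Zs = [], []
for t in range(T):
    eps = np.sum(D[H[t] != y])                    # eps_t under D_t
    alpha = 0.5 * np.log((1 - eps) / eps)         # Adaboost's alpha_t
    w = D * np.exp(-alpha * y * H[t])
    Z = w.sum()                                   # Z_t, the normalizer
    D = w / Z
    alphas.append(alpha)
    Zs.append(Z)

f = np.dot(alphas, H)                             # f(x_i) = sum_t alpha_t h_t(x_i)
print("training error      :", np.mean(y * f <= 0))
print("(1/m) sum exp(-y f) :", np.mean(np.exp(-y * f)))
print("prod_t Z_t          :", np.prod(Zs))       # equals the line above and bounds the first
```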

Page 17:

Boosting as Entropy Projection

For a general α we have

D^α_{t+1}(i) = Dt(i) exp(−α yi ht(xi)) / Zt(α)

where Zt(α) = ∑_{i=1}^m Dt(i) exp(−α yi ht(xi)). Then

min_{D : ED[y ht(x)] = 0} KL(D, Dt) = max_α (− ln Zt(α)) .

Further,

argmin_{D : ED[y ht(x)] = 0} KL(D, Dt) = D^{αt}_{t+1} ,

which is the update Adaboost uses.
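A small numerical check of this claim (a sketch with synthetic data; the use of scipy's bounded scalar minimizer is an implementation choice): the α maximizing −ln Zt(α) matches Adaboost's closed-form αt, and the resulting distribution satisfies the projection constraint ED[y ht(x)] = 0.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
m = 100
y = rng.choice([-1, 1], size=m)
h = np.where(rng.random(m) < 0.7, y, -y)        # predictions of one weak hypothesis h_t
D_t = rng.random(m)
D_t /= D_t.sum()                                # current distribution D_t

Z = lambda a: np.sum(D_t * np.exp(-a * y * h))  # Z_t(alpha)
# maximizing -ln Z_t(alpha) is the same as minimizing Z_t(alpha)
alpha = minimize_scalar(Z, bounds=(-10, 10), method="bounded").x

eps = np.sum(D_t[h != y])
print(alpha, 0.5 * np.log((1 - eps) / eps))     # matches the closed-form alpha_t

D_next = D_t * np.exp(-alpha * y * h) / Z(alpha)
print(np.dot(D_next, y * h))                    # ~0: the constraint E_D[y h(x)] = 0 holds
```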

But what if labels are noisy?

Page 18:

Additive Models

An additive model:

FT(x) = ∑_{t=1}^T wt ft(x)

Fit the t-th component to minimize the “residual” error

Error is measured by C(yf(x)), an upper bound on [y ≠ f(x)]

For Adaboost, C(yf(x)) = exp(−yf(x))

For Logitboost,

C(yf(x)) = log(1 + exp(−yf(x)))

Page 19:

Upper Bounds on Training Errors

[Figure (same plot as Page 15): loss functions for boosting, plotted against the margin yf(x); curves for the 0−1 loss, the Adaboost (exponential) loss, and the Logitboost (logistic) loss]

The 0-1 training set loss with convex upper bounds: exponential loss and logistic loss

Page 20:

Adaboost as Gradient Descent

Margin Cost function C(yh(x)) = exp(−yh(x))

Current classifier Ht−1(x), next weak learner ht(x)

Next classifier Ht(x) = Ht−1(x) + αtht(x)

The cost function to be minimized:

∑_{i=1}^m C(yi [Ht−1(xi) + αt ht(xi)])

The optimum solution:

αt = (1/2) ln( ∑_{i: ht(xi)=yi} Dt(i) / ∑_{i: ht(xi)≠yi} Dt(i) ) = (1/2) ln((1 − εt)/εt)
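This closed form can be checked against a direct line search on the exponential cost (a toy sketch: the random stand-in for Ht−1, the synthetic weak hypothesis, and the scipy line search are assumptions of the illustration, not part of the slide).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
m = 150
y = rng.choice([-1, 1], size=m)
H_prev = rng.normal(size=m)                      # current classifier H_{t-1}(x_i)
h_t = np.where(rng.random(m) < 0.7, y, -y)       # next weak hypothesis h_t(x_i)

# Line search: minimize sum_i exp(-y_i [H_{t-1}(x_i) + alpha h_t(x_i)]) over alpha
cost = lambda a: np.sum(np.exp(-y * (H_prev + a * h_t)))
alpha_num = minimize_scalar(cost, bounds=(-10, 10), method="bounded").x

# Closed form: D_t proportional to exp(-y_i H_{t-1}(x_i)), eps_t measured under D_t
D = np.exp(-y * H_prev)
D /= D.sum()
eps = np.sum(D[h_t != y])
print(alpha_num, 0.5 * np.log((1 - eps) / eps))  # the two agree (up to solver tolerance)
```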

Page 21:

Gradient Descent in Function Space

Choose any appropriate margin cost function C

Goal is to minimize margin cost on the training set

Current classifier Ht−1(x), next weak learner ht(x)

Next classifier Ht(x) = Ht−1(x) + αtht(x)

Choose αt to minimize

C(Ht(x)) = C(Ht−1(x) + αt ht(x)) .

“Anyboost” class of algorithms
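One round of such an update can be sketched generically, with the margin cost passed in as a function; the helper name, the scalar line search, and the example costs below are illustrative assumptions, not the original Anyboost pseudocode.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def boost_round(H_prev, h_t, y, margin_cost):
    """One boosting round by line search on an arbitrary margin cost C:
    pick alpha_t minimizing sum_i C(y_i (H_{t-1}(x_i) + alpha h_t(x_i)))."""
    total = lambda a: np.sum(margin_cost(y * (H_prev + a * h_t)))
    alpha = minimize_scalar(total, bounds=(-10, 10), method="bounded").x
    return H_prev + alpha * h_t, alpha

# Example margin costs: exponential (Adaboost) and logistic (Logitboost)
exp_cost = lambda z: np.exp(-z)
logit_cost = lambda z: np.log1p(np.exp(-z))
```

Plugging exp_cost into this round recovers the Adaboost update from the previous page.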

Page 22:

Margin Bounds: Why Boosting works

Test set error decreases even after train set error is 0

No simple “bias-variance” explanation

Bounds on the error rate of the boosted classifier depend on

the number of training examples

the margin achieved on the train set

the complexity of the weak learners

It does not depend on the number of classifiers combined

Page 23:

Margin Bound for Boosting

H: Class of binary classifiers of VC-dimension dH

co(H) = { h : h(x) = ∑_i αi hi(x),  αi ≥ 0,  ∑_i αi = 1 }

With probability at least (1 − δ) over the draw of a train set S of size m, for all h ∈ co(H) and θ > 0 we have

Pr[yh(x) ≤ 0] ≤ PrS[yh(x) ≤ θ] + O( (1/θ) √(dH/m) ) + O( √(log(1/δ)/m) )
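The empirical term PrS[yh(x) ≤ θ] is easy to compute for a boosted ensemble once the votes are normalized so that h lies in co(H); a small sketch follows (the function name and the normalization by ∑ αt are assumptions for illustration).

```python
import numpy as np

def empirical_margin_cdf(alphas, preds, y, thetas):
    """Pr_S[y h(x) <= theta] for h(x) = sum_t alpha_t h_t(x) / sum_t alpha_t,
    which lies in co(H) when all alpha_t >= 0 (as for Adaboost with eps_t < 1/2).
    preds: (T, m) array with preds[t, i] = h_t(x_i) in {-1, +1}."""
    alphas = np.asarray(alphas, dtype=float)
    margins = y * (alphas @ preds) / alphas.sum()
    return [(theta, np.mean(margins <= theta)) for theta in thetas]

# e.g. empirical_margin_cdf(alphas, preds, y, thetas=np.linspace(0.0, 1.0, 5))
```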

Page 24:

Maximizing the Margin

Do boosting algorithms maximize the margin?

Minimize upper bound on training error

Often get large margin in the process

Do not explicitly maximize the margin

Explicit margin maximization

Classical Boost: Minimize bound on PrS[yh(x) ≤ 0]

Margin Boost: Minimize (bound on) PrS[yh(x) ≤ θ]

Page 25:

Boosting: Wrap Up (for now)

Strong learner from convex combinations of weak learners

Choice of loss function is critical

Aggressive loss functions for clean data

Robust loss functions for noisy data

Gradient descent in function space of additive models

Generalization by implicit margin maximization

Explicit maximum margin boosting algorithms possible

Is bound minimization a good idea?
