Transcript of An Introduction to Boosting

Page 1: An Introduction to boosting

An Introduction to Boosting

Yoav Freund

Banter Inc.

Page 2: An Introduction to boosting

Plan of talk

• Generative vs. non-generative modeling

• Boosting

• Alternating decision trees

• Boosting and over-fitting

• Applications

Page 3: An Introduction to boosting

Toy Example

• Computer receives telephone call

• Measures Pitch of voice

• Decides gender of caller

[Diagram: Human Voice → decision: Male / Female]

Page 4: An Introduction to boosting

Generative modeling

[Figure: class-conditional probability densities over voice pitch, one Gaussian per class, with parameters (mean1, var1) and (mean2, var2).]
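As a rough illustration of the generative approach (a sketch, not code from the talk; the pitch values and function names are made up), one can fit a one-dimensional Gaussian to each class and label a new call by whichever class-conditional density is higher:

```python
import numpy as np

def fit_gaussian(pitches):
    """Estimate mean and variance of one class's voice-pitch samples."""
    return np.mean(pitches), np.var(pitches)

def gaussian_density(x, mean, var):
    """Gaussian density with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def classify_generative(pitch, male_params, female_params):
    """Label the caller by whichever fitted Gaussian assigns higher density."""
    if gaussian_density(pitch, *male_params) >= gaussian_density(pitch, *female_params):
        return "male"
    return "female"

# Toy usage with made-up pitch measurements (Hz)
male_params = fit_gaussian(np.array([110.0, 120.0, 130.0, 125.0]))
female_params = fit_gaussian(np.array([200.0, 210.0, 190.0, 205.0]))
print(classify_generative(150.0, male_params, female_params))
```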

Page 5: An Introduction to boosting

Discriminative approach

[Figure: number of mistakes as a function of the decision threshold on voice pitch.]
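A minimal discriminative counterpart (again only a sketch with hypothetical data): instead of fitting densities, search directly for the pitch threshold that makes the fewest mistakes on the training calls.

```python
import numpy as np

def best_threshold(pitches, labels):
    """Pick the pitch threshold with the fewest training mistakes.

    labels are +1/-1; the rule predicts +1 when pitch > threshold.
    """
    best_t, best_mistakes = None, len(labels) + 1
    for t in np.sort(pitches):
        preds = np.where(pitches > t, 1, -1)
        mistakes = int(np.sum(preds != labels))
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes
```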

Page 6: An Introduction to boosting

Ill-behaved data

[Figure: ill-behaved data. Top: class-conditional probability densities over voice pitch with means mean1 and mean2. Bottom: number of mistakes as a function of the decision threshold.]

Page 7: An Introduction to boosting

Traditional Statistics vs. Machine Learning

[Diagram: Statistics maps Data to an Estimated world state; Decision Theory maps the estimated world state to Predictions / Actions; Machine Learning maps Data directly to Predictions / Actions.]

Page 8: An Introduction to boosting

Comparison of methodologies

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications

Page 9: An Introduction to boosting
Page 10: An Introduction to boosting

A weak learner

A weak learner receives a weighted training set (x1,y1,w1), (x2,y2,w2), …, (xn,yn,wn) and outputs a weak rule h.

• Instances x1, x2, x3, …, xn are feature vectors
• Labels y1, y2, y3, …, yn are binary
• Weights w1, …, wn are non-negative and sum to 1

The weak requirement: h must do slightly better than random guessing with respect to the weights,
$\sum_{i=1}^{n} w_i\,\mathbf{1}[h(x_i) \neq y_i] \le \tfrac{1}{2} - \gamma$ for some $\gamma > 0$.
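A concrete weak learner of this kind could be a decision stump trained on weighted examples. The sketch below is illustrative (one real-valued feature, names invented here), not code from the talk:

```python
import numpy as np

def stump_weak_learner(x, y, w):
    """Return (threshold, polarity, weighted error) of the best decision stump.

    x: 1-D feature array, y: labels in {-1,+1}, w: non-negative weights summing to 1.
    The stump predicts polarity * sign(x > threshold).
    """
    best = (None, 1, 1.0)  # (threshold, polarity, weighted error)
    for t in np.unique(x):
        for polarity in (+1, -1):
            preds = polarity * np.where(x > t, 1, -1)
            err = float(np.sum(w[preds != y]))   # weighted misclassification rate
            if err < best[2]:
                best = (t, polarity, err)
    return best
```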

Page 11: An Introduction to boosting

The boosting process

(x1,y1,1/n), …, (xn,yn,1/n) → weak learner → h1
(x1,y1,w1), …, (xn,yn,wn) → weak learner → h2
(x1,y1,w1), …, (xn,yn,wn) → weak learner → h3
…
(x1,y1,w1), …, (xn,yn,wn) → weak learner → hT

Final rule: sign[ α1 h1 + α2 h2 + … + αT hT ]

Page 12: An Introduction to boosting

Adaboost

• Binary labels $y \in \{-1,+1\}$

• $\mathrm{margin}(x,y) = y \sum_t \alpha_t h_t(x)$

• $P(x,y) = \frac{1}{Z} \exp(-\mathrm{margin}(x,y))$

• Given $h_t$, we choose $\alpha_t$ to minimize $\sum_{(x,y)} \exp(-\mathrm{margin}(x,y))$
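Putting the pieces together, a compact AdaBoost loop might look like the sketch below. It reuses the hypothetical stump_weak_learner from the earlier sketch and follows the standard weight-update rule, so treat it as illustrative rather than as the exact code behind the talk:

```python
import numpy as np

def adaboost(x, y, rounds, weak_learner):
    """Train sign(sum_t alpha_t h_t(x)) by reweighting examples each round."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # initial weights (x_i, y_i, 1/n)
    hypotheses = []
    for _ in range(rounds):
        t, polarity, err = weak_learner(x, y, w)
        err = max(err, 1e-12)                # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        preds = polarity * np.where(x > t, 1, -1)
        w = w * np.exp(-alpha * y * preds)   # exponential reweighting of examples
        w /= w.sum()                         # renormalize so weights sum to 1
        hypotheses.append((alpha, t, polarity))
    return hypotheses

def predict(hypotheses, x):
    """Final rule: sign of the weighted sum of weak-rule outputs."""
    score = sum(a * p * np.where(x > t, 1, -1) for a, t, p in hypotheses)
    return np.sign(score)
```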

Page 13: An Introduction to boosting

Main property of adaboost

• If the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$, then the in-sample error of the final rule is at most $\exp\!\left(-2 \sum_{t=1}^{T} \gamma_t^2\right)$ (w.r.t. the initial weights).
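For a quick illustration (my numbers, not from the talk): if every weak rule has advantage $\gamma_t = 0.1$, then after $T = 100$ rounds the bound gives $\exp(-2 \cdot 100 \cdot 0.01) = e^{-2} \approx 0.14$, so the combined rule misclassifies at most roughly 14% of the (initially weighted) training set.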

Page 14: An Introduction to boosting

A demo

• www.cs.huji.ac.il/~yoavf/adabooost

Page 15: An Introduction to boosting

Adaboost as gradient descent

• Discriminator class: a linear discriminator in the space of “weak hypotheses”

• Original goal: find the hyperplane with the smallest number of mistakes – known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space)

• Computational method: Use exponential loss as a surrogate, perform gradient descent.

Page 16: An Introduction to boosting

Margins view

• $x, w \in R^n$; $y \in \{-1,+1\}$
• Prediction $= \mathrm{sign}(w \cdot x)$
• Margin $= y\,(w \cdot x)$

[Figure: training examples (+ and −) projected onto the direction w; a plot of the cumulative number of examples against the margin, with mistakes at negative margins and correct predictions at positive margins.]
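To make the picture concrete, a few lines (illustrative only; the data are invented) compute the margin $y\,(w \cdot x)$ of each example and count mistakes as the examples with negative margin:

```python
import numpy as np

def margins(w, X, y):
    """Margin of each example: y_i * (w . x_i); a negative margin is a mistake."""
    return y * (X @ w)

# Hypothetical data: rows of X are feature vectors, y in {-1,+1}
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5]])
y = np.array([1, -1, 1])
w = np.array([1.0, 1.0])
m = margins(w, X, y)
print(m, "mistakes:", int(np.sum(m < 0)))
```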

Page 17: An Introduction to boosting

Adaboost et al.

[Figure: loss as a function of the margin for several algorithms. Adaboost uses the exponential loss $e^{-y(w \cdot x)}$; Logitboost and Brownboost use different loss curves; the 0-1 loss is shown for comparison, with mistakes at negative margins and correct predictions at positive margins.]
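The surrogate losses on this slide can be compared directly as functions of the margin. The snippet below is my sketch, not part of the talk: it evaluates the exponential loss used by Adaboost, a natural-log logistic loss as a stand-in for Logitboost's loss, and the 0-1 loss; Brownboost's loss is omitted because it involves extra parameters.

```python
import numpy as np

def losses(margin):
    """Loss as a function of the margin y * (w . x)."""
    return {
        "0-1":        float(margin <= 0),
        "adaboost":   float(np.exp(-margin)),          # exponential loss e^{-y(w.x)}
        "logistic":   float(np.log1p(np.exp(-margin))), # one common logistic-loss form
    }

for m in (-1.0, 0.0, 1.0, 2.0):
    print(m, losses(m))
```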

Page 18: An Introduction to boosting

One coordinate at a time

• Adaboost performs gradient descent on exponential loss

• Adds one coordinate (“weak learner”) at each iteration.

• Weak learning in binary classification = slightly better than random guessing.

• Weak learning in regression – unclear.

• Uses example-weights to communicate the gradient direction to the weak learner

• Solves a computational problem

Page 19: An Introduction to boosting

What is a good weak learner?

• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.

• Small enough to allow exhaustive search for the minimal weighted training error.

• Small enough to avoid over-fitting.

• Should be able to calculate predicted label very efficiently.

• Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).

Page 20: An Introduction to boosting

Alternating Trees

Joint work with Llew Mason

Page 21: An Introduction to boosting

Decision Trees

[Figure: a small decision tree. The root tests X>3; if no, predict -1; if yes, test Y>5; if no, predict -1; if yes, predict +1. Beside it, the corresponding partition of the (X,Y) plane, split at X=3 and Y=5, with regions labeled +1, -1, -1.]
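Written as code, this tree is just nested conditionals (the tests and labels below are the values as I read them from the figure):

```python
def decision_tree(x, y):
    """The small tree from the slide: test X>3, then Y>5."""
    if x > 3:
        if y > 5:
            return +1
        return -1
    return -1
```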

Page 22: An Introduction to boosting

Decision tree as a sum

[Figure: the same tree rewritten as a sum of real-valued contributions: a constant -0.2 at the root, plus -0.1 or +0.1 depending on whether X>3, plus -0.3 or +0.2 depending on whether Y>5. The classification (+1, -1, -1 over the regions of the plane) is the sign of the sum.]

Page 23: An Introduction to boosting

An alternating decision tree

[Figure: an alternating decision tree. The score is the sum of a root contribution of -0.2 and contributions from decision nodes X>3 (no: -0.1, yes: +0.1), Y>5 (no: -0.3, yes: +0.2), and Y<1 (no: 0.0, yes: +0.7); the prediction is the sign of the score. The (X,Y) plane is shown with regions labeled +1, -1, -1, +1.]
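In code, prediction with an alternating decision tree is a running sum of contributions from every rule whose precondition holds, followed by a sign. The sketch below hard-codes the numbers as I read them off this slide and, for simplicity, attaches all three decision nodes at the root, so treat both structure and values as illustrative:

```python
def adtree_predict(x, y):
    """Sum the contributions of the root and of each decision node, then take the sign."""
    score = -0.2                        # root contribution
    score += +0.1 if x > 3 else -0.1    # decision node X>3
    score += +0.2 if y > 5 else -0.3    # decision node Y>5
    score += +0.7 if y < 1 else 0.0     # decision node Y<1
    return +1 if score >= 0 else -1
```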

Page 24: An Introduction to boosting

Example: Medical Diagnostics

• Cleve dataset from the UC Irvine database.

• Heart disease diagnostics (+1 = healthy, -1 = sick)

• 13 features from tests (real-valued and discrete).

• 303 instances.

Page 25: An Introduction to boosting

Adtree for Cleveland heart-disease diagnostics problem

Page 26: An Introduction to boosting

Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%

Page 27: An Introduction to boosting
Page 28: An Introduction to boosting

Curious phenomenon: boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters

Page 29: An Introduction to boosting

Explanation using margins

[Figure: 0-1 loss as a function of the margin.]

Page 30: An Introduction to boosting

Explanation using margins

[Figure: 0-1 loss as a function of the margin, with a threshold Θ marked. No examples with small margins!]

Page 31: An Introduction to boosting

Experimental Evidence

Page 32: An Introduction to boosting

Theorem

For any convex combination of weak rules and any threshold $\Theta > 0$:

$\Pr[\text{mistake}] \le \big(\text{fraction of training examples with margin} \le \Theta\big) + \tilde{O}\!\left(\sqrt{\frac{d}{m\,\Theta^2}}\right)$

where $m$ is the size of the training sample and $d$ is the VC dimension of the weak rules.

No dependence on the number of weak rules that are combined!

Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998

Page 33: An Introduction to boosting

Suggested optimization problem

[Figure: the suggested optimization problem, trading off the margin threshold Θ against a complexity term involving the sample size m and the VC dimension d.]

Page 34: An Introduction to boosting

Idea of Proof

Page 35: An Introduction to boosting
Page 36: An Introduction to boosting

Applications of Boosting

• Academic research

• Applied research

• Commercial deployment

Page 37: An Introduction to boosting

Academic research

Database     Other (% test error)   Boosting (% test error)   Error reduction
Cleveland    27.2 (DT)              16.5                      39%
Promoters    22.0 (DT)              11.8                      46%
Letter       13.8 (DT)              3.5                       74%
Reuters 4    5.8, 6.0, 9.8          2.95                      ~60%
Reuters 8    11.3, 12.1, 13.4       7.4                       ~40%

% test error rates

Page 38: An Introduction to boosting

Applied research

• “AT&T, How may I help you?”

• Classify voice requests

• Voice → text → category

• Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time

Schapire, Singer, Gorin 98

Page 39: An Introduction to boosting

Examples

• “Yes I’d like to place a collect call long distance please”

• “Operator I need to make a call but I need to bill it to my office”

• “Yes I’d like to place a call on my master card please”

• “I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill”

Categories assigned: collect, third party, billing credit, calling card

Page 40: An Introduction to boosting

Weak rules generated by “boostexter”

[Figure: example weak rules. Each rule tests whether a particular word occurs in the utterance and outputs a score for each category (calling card, collect call, third party), with one set of scores if the word occurs and another if it does not.]
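A BoosTexter-style weak rule can be sketched as a simple function of whether a word occurs in the utterance, producing a score per category. The words, scores, and function names below are made up for illustration, not taken from the system described in the talk:

```python
def word_rule(word, scores_if_present, scores_if_absent):
    """Weak rule: score each category depending on whether `word` occurs in the utterance."""
    def rule(utterance):
        present = word in utterance.lower().split()
        return scores_if_present if present else scores_if_absent
    return rule

# Hypothetical rule: the word "collect" pushes toward the "collect" category
collect_rule = word_rule(
    "collect",
    {"collect": +1.2, "calling card": -0.3, "third party": -0.3},
    {"collect": -0.4, "calling card": +0.1, "third party": +0.1},
)
print(collect_rule("yes i'd like to place a collect call please"))
```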

Page 41: An Introduction to boosting

Results

• 7844 training examples – hand transcribed

• 1000 test examples – hand / machine transcribed

• Accuracy with 20% rejected:
  – Machine transcribed: 75%
  – Hand transcribed: 90%

Page 42: An Introduction to boosting

Commercial deployment

• Distinguish business/residence customers

• Using statistics from call-detail records

• Alternating decision trees
  – Similar to boosting decision trees, more flexible
  – Combines very simple rules
  – Can over-fit; cross validation used to stop

Freund, Mason, Rogers, Pregibon, Cortes 2000

Page 43: An Introduction to boosting

Massive datasets

• 260M calls / day

• 230M telephone numbers

• Label unknown for ~30%

• Hancock: software for computing statistical signatures.

• 100K randomly selected training examples (~10K is enough)

• Training takes about 2 hours.

• Generated classifier has to be both accurate and efficient

Page 44: An Introduction to boosting

Alternating tree for “buizocity”

Page 45: An Introduction to boosting

Alternating Tree (Detail)

Page 46: An Introduction to boosting

Precision/recall graphs

[Figure: accuracy as a function of the score threshold.]

Page 47: An Introduction to boosting

Business impact

• Increased coverage from 44% to 56%

• Accuracy ~94%

• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.

Page 48: An Introduction to boosting

Summary

• Boosting is a computational method for learning accurate classifiers

• Resistance to over-fitting is explained by margins

• Underlying explanation – large “neighborhoods” of good classifiers

• Boosting has been applied successfully to a variety of classification problems

Page 49: An Introduction to boosting

Come talk with me!

[email protected]

• http://www.cs.huji.ac.il/~yoavf