Boosting (Part II) and Self-Organizing Maps

By Marc Sobel

Page 1: Boosting (Part II) and Self-Organizing Maps

Boosting (Part II) and Self-Organizing Maps
By Marc Sobel

Page 2: Boosting (Part II) and Self-Organizing Maps

Generalized Boosting

Consider the problem of analyzing surveys. A large variety of people are surveyed to determine how likely they are to vote for conviction on juries. It is advantageous to design surveys that link respondents' gender, political affiliation, etc. to conviction. It is also advantageous to divide conviction ordinally into 5 categories corresponding to how strongly people feel about conviction.

Page 3: Boosting (Part II) and Self-Organizing Maps

Generalized Boosting example

For the response variable, we have 5 separate values; the higher the response, the greater the tendency to convict. We would like to predict how likely participants are to vote for conviction based on their sex, participation in sports, etc. Logistic discrimination does not work in this example because it does not capture the complicated relationships between the predictor and response variables.

Page 4: Boosting (Part II) and Self-Organizing Maps

Generalized boosting for the conviction problem

We assign a score $h(x,y) = \eta\,\varphi\bigl(|y - y_{\text{correct}}|/\sigma\bigr)$, which increases the closer the response $y$ is to the correct response.
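As a small illustration (a sketch only, assuming $\varphi$ denotes the standard normal density; $\eta$ and $\sigma$ are tuning constants not specified in the slides), the score might be computed as:

```python
import numpy as np

# Illustrative only: phi is assumed here to be the standard normal density,
# so the score is largest when y equals the correct response and decays
# as |y - y_correct| grows.
def score(y, y_correct, eta=1.0, sigma=1.0):
    z = np.abs(y - y_correct) / sigma
    return eta * np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
```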

We put weights on all possible responses $(x_i, y)$, both for $y = y_i$ and for $y \neq y_i$. We update not only the former but also the latter weights in a two-stage procedure: first we update a weight for each case, and second we update the weights within each single case. We update the weights as follows.

Page 5: Boosting (Part II) and Self-Organizing Maps

Generalized boosting (part 1)

We update weights via:

$$w_{i,y}^{(t+1)} \;=\; w_{i,y}^{(t)}\,\beta_t^{\tfrac12\left(1 + h_t(x_i,y_i) - h_t(x_i,y)\right)}, \qquad y \neq y_i,$$

where $\beta_t = \varepsilon_t/(1-\varepsilon_t)$ for the pseudo-loss $\varepsilon_t$ defined on a later slide, and where the two stages are

$$w_{i}^{(t)} \;=\; \sum_{y \neq y_i} w_{i,y}^{(t)} \quad \text{(first stage)}; \qquad q_t(i,y) \;=\; \frac{w_{i,y}^{(t)}}{w_{i}^{(t)}}, \;\; y \neq y_i \quad \text{(second stage)}.$$

Page 6: Boosting (Part II) and Self-Organizing Maps

Generalized boosting (explained)

The error incorporates all possible mistaken responses rather than just a single one. The algorithm differs from two-valued boosting in that it updates a matrix of weights rather than a vector. It works much better than the comparable algorithm which gives weight 1 to mistakes and weight 0 to correct responses.

Page 7: Boosting (Part II) and Self-Organizing Maps

Why does generalized boosting work?

The pseudo-loss of the weak learner WL on training case $(x_i, y_i)$,

$$\mathrm{ploss}_i \;=\; \tfrac12\Bigl(1 - WL(x_i, y_i) + \sum_{y \neq y_i} q(i, y)\, WL(x_i, y)\Bigr),$$

defines the error made by the weak learner in case $i$.

The goal is to minimize the (weighted) average of the pseudo-losses.
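A minimal sketch in Python (not the author's code) of one round of this pseudo-loss boosting; the array names, the 0-based label convention, and the convention that the weight matrix carries zeros on each case's correct label are assumptions:

```python
import numpy as np

# One round of pseudo-loss boosting: two-stage weights, pseudo-loss,
# and the multiplicative weight update from the earlier slide.
def boosting_round(w, scores, labels):
    """w, scores: (n, K) arrays; labels: (n,) correct labels in 0..K-1."""
    n, K = w.shape
    idx = np.arange(n)

    # First stage: one weight per case.
    w_case = w.sum(axis=1)                       # w_i = sum_{y != y_i} w_{i,y}
    D = w_case / w_case.sum()

    # Second stage: weights within each case.
    q = w / w_case[:, None]                      # q(i, y) = w_{i,y} / w_i

    # Pseudo-loss of the weak learner at this round.
    correct = scores[idx, labels]                # WL(x_i, y_i)
    wrong = (q * scores).sum(axis=1)             # sum_{y != y_i} q(i, y) WL(x_i, y)
    ploss = 0.5 * float(np.sum(D * (1.0 - correct + wrong)))

    # Weight update: labels the learner confuses with the truth keep weight.
    beta = ploss / (1.0 - ploss)
    w_new = w * beta ** (0.5 * (1.0 + correct[:, None] - scores))
    w_new[idx, labels] = 0.0                     # correct labels stay unweighted
    return w_new, ploss
```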

Page 8: Boosting (Part II) and Self-Organizing Maps

Another way to do Generalized Boosting (GB II)

Run 5 AdaBoosts A[1],…,A[5]. AdaBoost A[l] distinguishes whether the correct response is l or not. At the final stage we choose between the labels by maximizing the weighted-average scores:

$$r_l(x_i) \;=\; \frac{\displaystyle\sum_{t=1}^{T} \log\bigl(1/\beta_t[l]\bigr)\, WL_t[l](x_i)}{\displaystyle\sum_{t=1}^{T} \log\bigl(1/\beta_t[l]\bigr)}$$

Page 9: Boosting (Part II) and Self-Organizing Maps

Generalized Boosting

We choose the label which gives the highest weighted average:

$$WL(x_i) \;=\; \arg\max_l\;\bigl\{\, r_l(x_i) : l = 1, \dots, 5 \,\bigr\}$$
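A minimal sketch of this prediction step, under an assumed representation of the five fitted AdaBoosts (a list of (beta, weak learner) pairs per label; the interface is an assumption, not the author's code):

```python
import numpy as np

# GB II prediction: compute the weighted-average score r_l(x) for each of the
# five one-vs-rest AdaBoosts and report the label with the largest score.
def gb2_predict(x, adaboosts):
    scores = []
    for rounds in adaboosts:                   # one AdaBoost per label
        num = den = 0.0
        for beta_t, wl_t in rounds:
            alpha_t = np.log(1.0 / beta_t)     # weight log(1 / beta_t[l])
            num += alpha_t * wl_t(x)
            den += alpha_t
        scores.append(num / den)               # r_l(x): normalized weighted average
    return int(np.argmax(scores)) + 1          # report a label in 1..5
```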

Page 10: Boosting (Part II) and Self-Organizing Maps

Percentage predicted correctly among the training data for GB II

Page 11: Boosting (Part II) and Self-Organizing Maps

Histogram showing the probability of being incorrectly chosen for each item

Page 12: Boosting (Part II) and Self-Organizing Maps

Outliers

Note that in the previous slide there are 2000 (out of 2800) items which are almost never incorrectly predicted. But there are also 400 outliers which are almost always predicted incorrectly. We would like to downweight these outliers.

Page 13: Boosting (Part II) and Self-Organizing Maps

Boosting in the presence of outliers: Hedge algorithms

Devise a loss vector $\ell^{\,t} = (\ell^{\,t}_1, \dots, \ell^{\,t}_n)$ at each time $t$, and update the weights via:

$$w^{t+1}_i \;=\; w^{t}_i\,\beta^{\ell^{\,t}_i}, \qquad \beta \in (0,1).$$
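A minimal sketch of this update (the value of $\beta$ and the renormalization step are assumptions for illustration):

```python
import numpy as np

# Hedge update: items that keep incurring loss (e.g., persistent outliers)
# have their weights shrunk geometrically; beta in (0, 1) controls how
# strongly losses are punished.
def hedge_update(w, loss, beta=0.5):
    """One step: w_i <- w_i * beta ** loss_i, then renormalize."""
    w = w * beta ** loss
    return w / w.sum()

# Example: the third item keeps losing, so its weight shrinks each round.
w = np.ones(4) / 4
for _ in range(3):
    w = hedge_update(w, np.array([0.0, 0.1, 1.0, 0.2]))
```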

Page 14: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting

Assume training data $(X_i, Y_i)$, $i = 1, \dots, n$, with responses taking, e.g., real values. We want to estimate the relationship between the X's and the Y's.

Assume a model of the form,

$$Y_i \;=\; \sum_{j} \beta_j\, g(X_i \mid \Theta_j) \;+\; \varepsilon_i, \qquad i = 1, \dots, n.$$

Page 15: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting (continued)

We use a two-stage procedure. Define the residuals

$$\mathrm{Res}_l(X_i^*, Y_i^*) \;=\; Y_i^* \;-\; \sum_{j=1}^{l} \beta_j\, g(X_i^* \mid \Theta_j),$$

where the starred values denote a bootstrap sample of the training data.

First, given $\beta_1, \dots, \beta_{l+1}$ and $\Theta_1, \dots, \Theta_l$, we minimize

$$\Theta_{l+1} \;=\; \arg\min_{\Theta_{l+1}}\; \sum_{i=1}^{n} \Bigl(\mathrm{Res}_l(X_i^*, Y_i^*) \;-\; \beta_{l+1}\, g(X_i^* \mid \Theta_{l+1})\Bigr)^2.$$

Page 16: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting

We fit the beta's (using the bootstrap) via:

$$\beta_{l+1} \;=\; \arg\min_{\beta_{l+1}}\; \sum_{i=1}^{n} \Bigl(\mathrm{Res}_l(X_i^*, Y_i^*) \;-\; \beta_{l+1}\, g(X_i^* \mid \Theta_{l+1})\Bigr)^2 \;=\; \frac{\displaystyle\sum_{i=1}^{n} \mathrm{Res}_l(X_i^*, Y_i^*)\, g(X_i^* \mid \Theta_{l+1})}{\displaystyle\sum_{i=1}^{n} g(X_i^* \mid \Theta_{l+1})^2}.$$

Page 17: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting

Note that the new weights (i.e., the beta's) are proportional to the residual error. Bootstrapping the estimates of the new parameters has the effect of making them robust.
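A minimal sketch of one way the procedure above could be assembled (the regression-stump base learner and all variable names are assumptions, not the author's implementation):

```python
import numpy as np

# Stochastic gradient boosting sketch: at each round draw a bootstrap sample,
# fit a base learner g(. | theta) to the current residuals, and set its
# coefficient beta to the least-squares value from the previous slide.
def fit_stump(x, r):
    """One-split regression stump fit to residuals r; theta = (split, left, right)."""
    best = None
    for s in np.unique(x):
        left = r[x <= s].mean()
        right = r[x > s].mean() if np.any(x > s) else 0.0
        err = np.sum((r - np.where(x <= s, left, right)) ** 2)
        if best is None or err < best[0]:
            best = (err, (s, left, right))
    return best[1]

def stump_predict(x, theta):
    s, left, right = theta
    return np.where(x <= s, left, right)

def boost(x, y, rounds=50, rng=np.random.default_rng(0)):
    ensemble, resid = [], y.astype(float).copy()
    for _ in range(rounds):
        idx = rng.integers(0, len(x), len(x))            # bootstrap sample
        theta = fit_stump(x[idx], resid[idx])            # fit g(. | theta) to residuals
        g = stump_predict(x[idx], theta)
        beta = (resid[idx] * g).sum() / ((g ** 2).sum() + 1e-12)
        ensemble.append((beta, theta))
        resid -= beta * stump_predict(x, theta)          # update the residuals
    return ensemble
```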

Page 18: Boosting (Part II) and Self-Organizing Maps

Self-Organizing Maps

Create maps of large data sets using an ordered set of map vectors. We view each map vector as controlling a neighborhood $N_i$ of the data (with $n_i$ members), $i = 1, \dots, k$, i.e., those points closest to it.

From a statistical standpoint we can view the map vectors as parameters; the idea is that they will provide a simple ordered mapping of the data. The data and the map vectors are

$$\mathbf{X} = (X_1, \dots, X_n), \qquad \mathbf{M} = (M_1, \dots, M_k).$$

Page 19: Boosting (Part II) and Self-Organizing Maps

Kohonen in a typical pose

Page 20: Boosting (Part II) and Self-Organizing Maps

Example from politics

Self-Organizing Map showing US Congress voting patterns visualized in Synapse. The boxes show clustering and distances. Red means a yes vote while blue means a no vote in the component planes.

Page 21: Boosting (Part II) and Self-Organizing Maps

Bayesian View

Assume that for each map vector $M_i$ there are $n_i$ copies $\mu_{i,s}$, $s = 1, \dots, n_i$, distributed according to a Gaussian prior with mean $a$ and variance $\tau^2$. These vectors are the means of the members $X_{i,s}$ of the neighborhood (which have variance $\sigma^2$). The posterior distribution of $\mu_{i,s}$ is:

$$\mu_{i,s} \mid X \;\sim\; N\!\left(\frac{X_{i,s}/\sigma^2 + a/\tau^2}{1/\sigma^2 + 1/\tau^2},\;\; \frac{1}{1/\sigma^2 + 1/\tau^2}\right).$$

Page 22: Boosting (Part II) and Self-Organizing Maps

Bayesian View of SOM

Now estimate the prior mean $a$ by the 'old' map vector $M_i[\text{old}]$. Let

$$\alpha \;=\; \frac{\tau^2}{\sigma^2 + \tau^2}$$

and use the posterior estimate $E[\mu_{i,s} \mid X]$ for the 'new' map vector. Averaging over the neighborhood, we then get the equation

$$M_i[\text{new}] \;=\; M_i[\text{old}] \;+\; \alpha\,\bigl(\bar X_i - M_i[\text{old}]\bigr),$$

where $\bar X_i$ is the mean of the data points in neighborhood $N_i$.
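A minimal sketch of this map-vector update, under the assumption that each map vector is moved a fraction $\alpha$ of the way toward the mean of the data points in its neighborhood (the points closest to it):

```python
import numpy as np

# Batch SOM-style update: M_i[new] = M_i[old] + alpha * (mean of N_i - M_i[old]).
def som_batch_update(M, X, alpha=0.2):
    """M: (k, d) map vectors; X: (n, d) data.  Returns updated map vectors."""
    dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)   # (n, k) distances
    nearest = dist.argmin(axis=1)                                  # neighborhood assignment
    M_new = M.astype(float).copy()
    for i in range(len(M)):
        members = X[nearest == i]
        if len(members):
            M_new[i] += alpha * (members.mean(axis=0) - M_new[i])
    return M_new
```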

Page 23: Boosting (Part II) and Self-Organizing Maps

Bayesian Extension: Possible Project

The Bayes update should really incorporate a random factor associated with the posterior standard deviation, e.g.,

$$M_i[\text{new}] \;\sim\; N\!\left(M_i[\text{old}] + \alpha\,\bigl(\bar X_i - M_i[\text{old}]\bigr),\;\; \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right).$$

Page 24: Boosting (Part II) and Self-Organizing Maps

Gradient Descent Viewpoint

We can also view the SOM algorithm as part of a Newton-Raphson algorithm. Here we view the likelihood as based on independent Gaussians whose means are the map vectors.

The mean vectors in a given neighborhood are all the same. The Newton-Raphson updating algorithm operates sequentially on parts of the likelihood.

Page 25: Boosting (Part II) and Self-Organizing Maps

Gradient Descent

We have

$$M_i[\text{new}] \;=\; M_i[\text{old}] \;-\; \varepsilon\,\bigl(M_i[\text{old}] - X_s\bigr)$$

for a data point $X_s$ in the neighborhood of $M_i$ and step size $\varepsilon$.
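A small sketch of this online form of the update (the constant step size and the nearest-vector assignment rule are assumptions for illustration):

```python
import numpy as np

# Online (gradient-descent) SOM step: each data point X_s moves only its
# nearest map vector a small step eps toward itself.
def som_online_pass(M, X, eps=0.1):
    M = M.astype(float).copy()
    for x_s in X:
        i = int(np.argmin(np.linalg.norm(M - x_s, axis=1)))   # nearest map vector
        M[i] = M[i] - eps * (M[i] - x_s)                      # M_i[new] = M_i[old] - eps (M_i[old] - X_s)
    return M
```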

Page 26: Boosting (Part II) and Self-Organizing Maps

SOM applied to the Conviction Data

Page 27: Boosting (Part II) and Self-Organizing Maps

Correlation chart showing the relationship between conviction and the first four questions

1.0000  0.9860  0.9843  0.9853  0.8018
0.9860  1.0000  0.9839  0.9841  0.7708
0.9843  0.9839  1.0000  0.9927  0.7883
0.9853  0.9841  0.9927  1.0000  0.7782
0.8018  0.7708  0.7883  0.7782  1.0000

Page 28: Boosting (Part II) and Self-Organizing Maps

Proof of log ratio result

Recall that we had the risk function

$$\mathrm{Risk}(\beta_{l+1}) \;=\; \sum_{i=1}^{n} \exp\Bigl(-y_i f_l(x_i) \;-\; \beta_{l+1}\, y_i\, \varphi_{l+1}(x_i)\Bigr),$$

which we would like to minimize in $\beta_{l+1}$. First divide the sum into two parts: the first where $\varphi_{l+1}$ correctly predicts $y$, the second where it does not:

$$\mathrm{Risk}(\beta_{l+1}) \;=\; \sum_{i:\, y_i\varphi_{l+1}(x_i) = 1} \exp\bigl(-y_i f_l(x_i) - \beta_{l+1}\bigr) \;+\; \sum_{i:\, y_i\varphi_{l+1}(x_i) = -1} \exp\bigl(-y_i f_l(x_i) + \beta_{l+1}\bigr).$$

Page 29: Boosting (Part II) and Self-Organizing Maps

Conclusion

We have that:

$$\mathrm{Risk}(\beta_{l+1}) \;=\; e^{-\beta_{l+1}} \sum_{i:\, y_i\varphi_{l+1}(x_i) = 1} \exp\bigl(-y_i f_l(x_i)\bigr) \;+\; e^{\beta_{l+1}} \sum_{i:\, y_i\varphi_{l+1}(x_i) = -1} \exp\bigl(-y_i f_l(x_i)\bigr).$$

Setting the derivative with respect to $\beta_{l+1}$ to zero gives the log-ratio result

$$\beta_{l+1} \;=\; \frac12\,\log\!\left(\frac{\sum_{i:\, y_i\varphi_{l+1}(x_i)=1} \exp\bigl(-y_i f_l(x_i)\bigr)}{\sum_{i:\, y_i\varphi_{l+1}(x_i)=-1} \exp\bigl(-y_i f_l(x_i)\bigr)}\right).$$

Page 30: Boosting (Part II) and Self-Organizing Maps

Bounding the error made by boosting

Theorem: Put

$$\varepsilon_t \;=\; P_{w_t}\bigl(y_i \neq WL_t(x_i)\bigr); \qquad \varepsilon \;=\; P\bigl(y_i \neq WL(x_i)\bigr);$$

the weighted error of the weak learner at stage $t$ and the training error of the final boosted hypothesis $WL$. We have that

$$\varepsilon \;\le\; \prod_{t=1}^{T} \sqrt{4\,\varepsilon_t(1-\varepsilon_t)}.$$

Proof: We have, for the weights associated with incorrect classification, that

$$\varepsilon_t \;=\; P_{w_t}\bigl(Y_i \neq WL_t(X_i)\bigr) \;=\; \frac{\sum_i w_{i,t}\,\mathbf{1}\{WL_t(X_i)\neq Y_i\}}{\sum_i w_{i,t}},$$

so that the weight update $w_{i,t+1} = w_{i,t}\,\beta_t^{\,1-\mathbf{1}\{WL_t(X_i)\neq Y_i\}}$ gives

$$\sum_i w_{i,t+1} \;\le\; \Bigl(\sum_i w_{i,t}\Bigr)\bigl(\varepsilon_t + \beta_t(1-\varepsilon_t)\bigr) \;=\; \Bigl(\sum_i w_{i,t}\Bigr)\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr).$$

Page 31: Boosting (Part II) and Self-Organizing Maps

Boosting error bound (continued)

We have that:

$$\sum_i w_{i,t+1} \;\le\; \Bigl(\sum_i w_{i,t}\Bigr)\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr).$$

Putting this together over all the iterations (the initial weights $w_{i,1} = 1/N$ sum to one):

$$\sum_i w_{i,T+1} \;\le\; \prod_{t=1}^{T}\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr).$$

Page 32: Boosting (Part II) and Self-Organizing Maps

Lower bounding the error

We can lower bound the weights as follows. Writing $h_t$ for the weak learner $WL_t$, the final hypothesis makes a mistake in predicting $y_i$ if (see Bayes)

$$\prod_{t=1}^{T} \beta_t^{-|h_t(x_i)-y_i|} \;\ge\; \Bigl(\prod_{t=1}^{T}\beta_t\Bigr)^{-1/2},$$

i.e., if $\sum_{t=1}^{T}\log(1/\beta_t)\,|h_t(x_i)-y_i| \ge \tfrac12\sum_{t=1}^{T}\log(1/\beta_t)$. The final weight on an instance is

$$w_{i,T+1} \;=\; w_{i,1}\,\prod_{t=1}^{T} \beta_t^{\,1-|y_i-h_t(x_i)|}.$$

Page 33: Boosting (Part II) and Self-Organizing Maps

Lower Bound on weights

Putting together the last slide: for the instances on which the final hypothesis $h$ errs, $w_{i,T+1} \ge w_{i,1}\bigl(\prod_{t=1}^{T}\beta_t\bigr)^{1/2}$, so that

$$\sum_i w_{i,T+1} \;\ge\; \sum_{i:\, h(x_i)\neq y_i} w_{i,1}\,\Bigl(\prod_{t=1}^{T}\beta_t\Bigr)^{1/2} \;=\; \varepsilon\,\Bigl(\prod_{t=1}^{T}\beta_t\Bigr)^{1/2},$$

since $w_{i,1} = 1/N$ and $\varepsilon$ is the fraction of training instances misclassified by the final hypothesis.

Page 34: Boosting (Part II) and Self-Organizing Maps

Conclusion of Proof:

Putting the former result together with

$$\sum_i w_{i,T+1} \;\le\; \prod_{t=1}^{T}\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr),$$

we get

$$\varepsilon \;\le\; \prod_{t=1}^{T}\frac{1-(1-\beta_t)(1-\varepsilon_t)}{\sqrt{\beta_t}},$$

and substituting $\beta_t = \varepsilon_t/(1-\varepsilon_t)$ gives the conclusion $\varepsilon \le \prod_{t=1}^{T}\sqrt{4\,\varepsilon_t(1-\varepsilon_t)}$.