Boosting (Part II) and Self-Organizing Maps

By Marc Sobel

Page 1: Boosting (Part II) and Self-Organizing Maps

Boosting (Part II) and Self-Organizing Maps
By Marc Sobel

Page 2: Boosting (Part II) and Self-Organizing Maps

Generalized Boosting

Consider the problem of analyzing surveys. A large variety of people are surveyed to determine how likely they are to vote for conviction on juries. It is advantageous to design surveys that link respondents' gender, political affiliation, etc. to conviction. It is also advantageous to divide conviction ordinally into 5 categories corresponding to how strongly people feel about conviction.

Page 3: Boosting (Part II) and Self-Organizing Maps

Generalized Boosting example

For the response variable, we have 5 separate values; the higher the response, the greater the tendency to convict. We would like to predict how likely participants are to vote for conviction based on their sex, participation in sports, etc. Logistic discrimination does not work in this example because it does not capture the complicated relationships between the predictor and response variables.

Page 4: Boosting (Part II) and Self-Organizing Maps

Generalized boosting for the conviction problem

We assign a score $h(x,y) = \eta\,\varphi\bigl(|y - y_{\text{correct}}|/\sigma\bigr)$, which increases the closer the response $y$ is to the correct response.
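As a small illustration (a sketch only, assuming $\varphi$ denotes the standard normal density; $\eta$ and $\sigma$ are tuning constants not specified in the slides), the score might be computed as:

```python
import numpy as np

# Illustrative only: phi is assumed here to be the standard normal density,
# so the score is largest when y equals the correct response and decays
# as |y - y_correct| grows.
def score(y, y_correct, eta=1.0, sigma=1.0):
    z = np.abs(y - y_correct) / sigma
    return eta * np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
```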

We put weights on all possible responses $(x_i, y)$, both for $y = y_i$ and for $y \neq y_i$. We update not only the former but also the latter weights in a two-stage procedure: first we update a weight for each case, and second we update the weights within each single case. We update the weights as follows.

Page 5: Boosting (Part II) and Self-Organizing Maps

Generalized boosting (part 1)

We update weights via:

$$w_{i,y}^{(t+1)} \;=\; w_{i,y}^{(t)}\,\beta_t^{\tfrac12\left(1 + h_t(x_i,y_i) - h_t(x_i,y)\right)}, \qquad y \neq y_i,$$

where $\beta_t = \varepsilon_t/(1-\varepsilon_t)$ for the pseudo-loss $\varepsilon_t$ defined on a later slide, and where the two stages are

$$w_{i}^{(t)} \;=\; \sum_{y \neq y_i} w_{i,y}^{(t)} \quad \text{(first stage)}; \qquad q_t(i,y) \;=\; \frac{w_{i,y}^{(t)}}{w_{i}^{(t)}}, \;\; y \neq y_i \quad \text{(second stage)}.$$

Page 6: Boosting (Part II) and Self-Organizing Maps

Generalized boosting (explained)

The error incorporates all possible mistaken responses rather than just a single one. The algorithm differs from two-valued boosting in that it updates a matrix of weights rather than a vector. It works much better than the comparable algorithm which gives weight 1 to mistakes and weight 0 to correct responses.

Page 7: Boosting (Part II) and Self-Organizing Maps

Why does generalized boosting work?

The pseudo-loss of the weak learner WL on training case $(x_i, y_i)$,

$$\mathrm{ploss}_i \;=\; \tfrac12\Bigl(1 - WL(x_i, y_i) + \sum_{y \neq y_i} q(i, y)\, WL(x_i, y)\Bigr),$$

defines the error made by the weak learner in case $i$.

The goal is to minimize the (weighted) average of the pseudo-losses.
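A minimal sketch in Python (not the author's code) of one round of this pseudo-loss boosting; the array names, the 0-based label convention, and the convention that the weight matrix carries zeros on each case's correct label are assumptions:

```python
import numpy as np

# One round of pseudo-loss boosting: two-stage weights, pseudo-loss,
# and the multiplicative weight update from the earlier slide.
def boosting_round(w, scores, labels):
    """w, scores: (n, K) arrays; labels: (n,) correct labels in 0..K-1."""
    n, K = w.shape
    idx = np.arange(n)

    # First stage: one weight per case.
    w_case = w.sum(axis=1)                       # w_i = sum_{y != y_i} w_{i,y}
    D = w_case / w_case.sum()

    # Second stage: weights within each case.
    q = w / w_case[:, None]                      # q(i, y) = w_{i,y} / w_i

    # Pseudo-loss of the weak learner at this round.
    correct = scores[idx, labels]                # WL(x_i, y_i)
    wrong = (q * scores).sum(axis=1)             # sum_{y != y_i} q(i, y) WL(x_i, y)
    ploss = 0.5 * float(np.sum(D * (1.0 - correct + wrong)))

    # Weight update: labels the learner confuses with the truth keep weight.
    beta = ploss / (1.0 - ploss)
    w_new = w * beta ** (0.5 * (1.0 + correct[:, None] - scores))
    w_new[idx, labels] = 0.0                     # correct labels stay unweighted
    return w_new, ploss
```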

Page 8: Boosting (Part II) and Self-Organizing Maps

Another way to do Generalized Boosting (GB II)

Run 5 AdaBoosts A[1],…,A[5]. AdaBoost A[l] distinguishes whether the correct response is l or not. At the final stage we choose between the labels by maximizing the weighted-average scores:

$$r_l(x_i) \;=\; \frac{\displaystyle\sum_{t=1}^{T} \log\bigl(1/\beta_t[l]\bigr)\, WL_t[l](x_i)}{\displaystyle\sum_{t=1}^{T} \log\bigl(1/\beta_t[l]\bigr)}$$

Page 9: Boosting (Part II) and Self-Organizing Maps

Generalized Boosting

We choose the label which gives the highest weighted average:

$$WL(x_i) \;=\; \arg\max_l\;\bigl\{\, r_l(x_i) : l = 1, \dots, 5 \,\bigr\}$$
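A minimal sketch of this prediction step, under an assumed representation of the five fitted AdaBoosts (a list of (beta, weak learner) pairs per label; the interface is an assumption, not the author's code):

```python
import numpy as np

# GB II prediction: compute the weighted-average score r_l(x) for each of the
# five one-vs-rest AdaBoosts and report the label with the largest score.
def gb2_predict(x, adaboosts):
    scores = []
    for rounds in adaboosts:                   # one AdaBoost per label
        num = den = 0.0
        for beta_t, wl_t in rounds:
            alpha_t = np.log(1.0 / beta_t)     # weight log(1 / beta_t[l])
            num += alpha_t * wl_t(x)
            den += alpha_t
        scores.append(num / den)               # r_l(x): normalized weighted average
    return int(np.argmax(scores)) + 1          # report a label in 1..5
```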

Page 10: Boosting (Part II) and Self-Organizing Maps

Percentage predicted correctly among the training data for GB II

Page 11: Boosting (Part II) and Self-Organizing Maps

Histogram showing the probability of being incorrectly chosen for each item

Page 12: Boosting (Part II) and Self-Organizing Maps

Outliers

Note that in the previous slide there are 2000 (out of 2800) items which are almost never incorrectly predicted. But there are also 400 outliers which are almost always predicted incorrectly. We would like to downweight these outliers.

Page 13: Boosting (Part II) and Self-Organizing Maps

Boosting in the presence of outliers: Hedge algorithms

Devise a loss vector $\ell^{\,t} = (\ell^{\,t}_1, \dots, \ell^{\,t}_n)$ at each time $t$, and update the weights via:

$$w^{t+1}_i \;=\; w^{t}_i\,\beta^{\ell^{\,t}_i}, \qquad \beta \in (0,1).$$
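A minimal sketch of this update (the value of $\beta$ and the renormalization step are assumptions for illustration):

```python
import numpy as np

# Hedge update: items that keep incurring loss (e.g., persistent outliers)
# have their weights shrunk geometrically; beta in (0, 1) controls how
# strongly losses are punished.
def hedge_update(w, loss, beta=0.5):
    """One step: w_i <- w_i * beta ** loss_i, then renormalize."""
    w = w * beta ** loss
    return w / w.sum()

# Example: the third item keeps losing, so its weight shrinks each round.
w = np.ones(4) / 4
for _ in range(3):
    w = hedge_update(w, np.array([0.0, 0.1, 1.0, 0.2]))
```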

Page 14: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting

Assume training data $(X_i, Y_i)$, $i = 1, \dots, n$, with responses taking, e.g., real values. We want to estimate the relationship between the X's and the Y's.

Assume a model of the form,

$$Y_i \;=\; \sum_{j} \beta_j\, g(X_i \mid \Theta_j) \;+\; \varepsilon_i, \qquad i = 1, \dots, n.$$

Page 15: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting (continued)

We use a two-stage procedure. Define the residuals

$$\mathrm{Res}_l(X_i^*, Y_i^*) \;=\; Y_i^* \;-\; \sum_{j=1}^{l} \beta_j\, g(X_i^* \mid \Theta_j),$$

where the starred values denote a bootstrap sample of the training data.

First, given $\beta_1, \dots, \beta_{l+1}$ and $\Theta_1, \dots, \Theta_l$, we minimize

$$\Theta_{l+1} \;=\; \arg\min_{\Theta_{l+1}}\; \sum_{i=1}^{n} \Bigl(\mathrm{Res}_l(X_i^*, Y_i^*) \;-\; \beta_{l+1}\, g(X_i^* \mid \Theta_{l+1})\Bigr)^2.$$

Page 16: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting

We fit the beta's (using the bootstrap) via:

$$\beta_{l+1} \;=\; \arg\min_{\beta_{l+1}}\; \sum_{i=1}^{n} \Bigl(\mathrm{Res}_l(X_i^*, Y_i^*) \;-\; \beta_{l+1}\, g(X_i^* \mid \Theta_{l+1})\Bigr)^2 \;=\; \frac{\displaystyle\sum_{i=1}^{n} \mathrm{Res}_l(X_i^*, Y_i^*)\, g(X_i^* \mid \Theta_{l+1})}{\displaystyle\sum_{i=1}^{n} g(X_i^* \mid \Theta_{l+1})^2}.$$

Page 17: Boosting (Part II) and Self-Organizing Maps

Stochastic Gradient Boosting

Note that the new weights (i.e., the beta's) are proportional to the residual error. Bootstrapping the estimates of the new parameters has the effect of making them robust.
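A minimal sketch of one way the procedure above could be assembled (the regression-stump base learner and all variable names are assumptions, not the author's implementation):

```python
import numpy as np

# Stochastic gradient boosting sketch: at each round draw a bootstrap sample,
# fit a base learner g(. | theta) to the current residuals, and set its
# coefficient beta to the least-squares value from the previous slide.
def fit_stump(x, r):
    """One-split regression stump fit to residuals r; theta = (split, left, right)."""
    best = None
    for s in np.unique(x):
        left = r[x <= s].mean()
        right = r[x > s].mean() if np.any(x > s) else 0.0
        err = np.sum((r - np.where(x <= s, left, right)) ** 2)
        if best is None or err < best[0]:
            best = (err, (s, left, right))
    return best[1]

def stump_predict(x, theta):
    s, left, right = theta
    return np.where(x <= s, left, right)

def boost(x, y, rounds=50, rng=np.random.default_rng(0)):
    ensemble, resid = [], y.astype(float).copy()
    for _ in range(rounds):
        idx = rng.integers(0, len(x), len(x))            # bootstrap sample
        theta = fit_stump(x[idx], resid[idx])            # fit g(. | theta) to residuals
        g = stump_predict(x[idx], theta)
        beta = (resid[idx] * g).sum() / ((g ** 2).sum() + 1e-12)
        ensemble.append((beta, theta))
        resid -= beta * stump_predict(x, theta)          # update the residuals
    return ensemble
```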

Page 18: Boosting (Part II) and Self-Organizing Maps

Self-Organizing Maps

Create maps of large data sets using an ordered set of map vectors. We view each map vector as controlling a neighborhood $N_i$ of the data (with $n_i$ members), $i = 1, \dots, k$, i.e., those points closest to it.

From a statistical standpoint we can view the map vectors as parameters; the idea is that they will provide a simple ordered mapping of the data. The data and the map vectors are

$$\mathbf{X} = (X_1, \dots, X_n), \qquad \mathbf{M} = (M_1, \dots, M_k).$$

Page 19: Boosting (Part II) and Self-Organizing Maps

Kohonen in a typical pose

Page 20: Boosting (Part II) and Self-Organizing Maps

Example from politics

Self-Organizing Map showing US Congress voting patterns visualized in Synapse. The boxes show clustering and distances. Red means a yes vote while blue means a no vote in the component planes.

Page 21: Boosting (Part II) and Self-Organizing Maps

Bayesian View

Assume that for each map vector $M_i$ there are $n_i$ copies $\mu_{i,s}$, $s = 1, \dots, n_i$, distributed according to a Gaussian prior with mean $a$ and variance $\tau^2$. These vectors are the means of the members $X_{i,s}$ of the neighborhood (which have variance $\sigma^2$). The posterior distribution of $\mu_{i,s}$ is:

$$\mu_{i,s} \mid X \;\sim\; N\!\left(\frac{X_{i,s}/\sigma^2 + a/\tau^2}{1/\sigma^2 + 1/\tau^2},\;\; \frac{1}{1/\sigma^2 + 1/\tau^2}\right).$$

Page 22: Boosting (Part II) and Self-Organizing Maps

Bayesian View of SOM

Now estimate the prior mean $a$ by the 'old' map vector $M_i[\text{old}]$. Let

$$\alpha \;=\; \frac{\tau^2}{\sigma^2 + \tau^2}$$

and use the posterior estimate $E[\mu_{i,s} \mid X]$ for the 'new' map vector. Averaging over the neighborhood, we then get the equation

$$M_i[\text{new}] \;=\; M_i[\text{old}] \;+\; \alpha\,\bigl(\bar X_i - M_i[\text{old}]\bigr),$$

where $\bar X_i$ is the mean of the data points in neighborhood $N_i$.
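A minimal sketch of this map-vector update, under the assumption that each map vector is moved a fraction $\alpha$ of the way toward the mean of the data points in its neighborhood (the points closest to it):

```python
import numpy as np

# Batch SOM-style update: M_i[new] = M_i[old] + alpha * (mean of N_i - M_i[old]).
def som_batch_update(M, X, alpha=0.2):
    """M: (k, d) map vectors; X: (n, d) data.  Returns updated map vectors."""
    dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)   # (n, k) distances
    nearest = dist.argmin(axis=1)                                  # neighborhood assignment
    M_new = M.astype(float).copy()
    for i in range(len(M)):
        members = X[nearest == i]
        if len(members):
            M_new[i] += alpha * (members.mean(axis=0) - M_new[i])
    return M_new
```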

Page 23: Boosting (Part II) and Self-Organizing Maps

Bayesian Extension: Possible Project

The Bayes update should really incorporate a random factor associated with the posterior standard deviation, e.g.,

$$M_i[\text{new}] \;\sim\; N\!\left(M_i[\text{old}] + \alpha\,\bigl(\bar X_i - M_i[\text{old}]\bigr),\;\; \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right).$$

Page 24: Boosting (Part II) and Self-Organizing Maps

Gradient Descent Viewpoint

We can also view the SOM algorithm as part of a Newton-Raphson algorithm. Here we view the likelihood as based on independent Gaussians whose means are the map vectors.

The mean vectors in a given neighborhood are all the same. The Newton-Raphson updating algorithm operates sequentially on parts of the likelihood.

Page 25: Boosting (Part II) and Self-Organizing Maps

Gradient Descent

We have

$$M_i[\text{new}] \;=\; M_i[\text{old}] \;-\; \varepsilon\,\bigl(M_i[\text{old}] - X_s\bigr)$$

for a data point $X_s$ in the neighborhood of $M_i$ and step size $\varepsilon$.
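A small sketch of this online form of the update (the constant step size and the nearest-vector assignment rule are assumptions for illustration):

```python
import numpy as np

# Online (gradient-descent) SOM step: each data point X_s moves only its
# nearest map vector a small step eps toward itself.
def som_online_pass(M, X, eps=0.1):
    M = M.astype(float).copy()
    for x_s in X:
        i = int(np.argmin(np.linalg.norm(M - x_s, axis=1)))   # nearest map vector
        M[i] = M[i] - eps * (M[i] - x_s)                      # M_i[new] = M_i[old] - eps (M_i[old] - X_s)
    return M
```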

Page 26: Boosting (Part II) and Self-Organizing Maps

SOM applied to the Conviction Data

Page 27: Boosting (Part II) and Self-Organizing Maps

Correlation chart showing the relationship between conviction and the first four questions

1.0000  0.9860  0.9843  0.9853  0.8018
0.9860  1.0000  0.9839  0.9841  0.7708
0.9843  0.9839  1.0000  0.9927  0.7883
0.9853  0.9841  0.9927  1.0000  0.7782
0.8018  0.7708  0.7883  0.7782  1.0000

Page 28: Boosting (Part II) and Self-Organizing Maps

Proof of log ratio result

Recall that we had the risk function

$$\mathrm{Risk}(\beta_{l+1}) \;=\; \sum_{i=1}^{n} \exp\Bigl(-y_i f_l(x_i) \;-\; \beta_{l+1}\, y_i\, \varphi_{l+1}(x_i)\Bigr),$$

which we would like to minimize in $\beta_{l+1}$. First divide the sum into two parts: the first where $\varphi_{l+1}$ correctly predicts $y$, the second where it does not:

$$\mathrm{Risk}(\beta_{l+1}) \;=\; \sum_{i:\, y_i\varphi_{l+1}(x_i) = 1} \exp\bigl(-y_i f_l(x_i) - \beta_{l+1}\bigr) \;+\; \sum_{i:\, y_i\varphi_{l+1}(x_i) = -1} \exp\bigl(-y_i f_l(x_i) + \beta_{l+1}\bigr).$$

Page 29: Boosting (Part II) and Self-Organizing Maps

Conclusion

We have that:

$$\mathrm{Risk}(\beta_{l+1}) \;=\; e^{-\beta_{l+1}} \sum_{i:\, y_i\varphi_{l+1}(x_i) = 1} \exp\bigl(-y_i f_l(x_i)\bigr) \;+\; e^{\beta_{l+1}} \sum_{i:\, y_i\varphi_{l+1}(x_i) = -1} \exp\bigl(-y_i f_l(x_i)\bigr).$$

Setting the derivative with respect to $\beta_{l+1}$ to zero gives the log-ratio result

$$\beta_{l+1} \;=\; \frac12\,\log\!\left(\frac{\sum_{i:\, y_i\varphi_{l+1}(x_i)=1} \exp\bigl(-y_i f_l(x_i)\bigr)}{\sum_{i:\, y_i\varphi_{l+1}(x_i)=-1} \exp\bigl(-y_i f_l(x_i)\bigr)}\right).$$

Page 30: Boosting (Part II) and Self-Organizing Maps

Bounding the error made by boosting

Theorem: Put

$$\varepsilon_t \;=\; P_{w_t}\bigl(y_i \neq WL_t(x_i)\bigr); \qquad \varepsilon \;=\; P\bigl(y_i \neq WL(x_i)\bigr);$$

the weighted error of the weak learner at stage $t$ and the training error of the final boosted hypothesis $WL$. We have that

$$\varepsilon \;\le\; \prod_{t=1}^{T} \sqrt{4\,\varepsilon_t(1-\varepsilon_t)}.$$

Proof: We have, for the weights associated with incorrect classification, that

$$\varepsilon_t \;=\; P_{w_t}\bigl(Y_i \neq WL_t(X_i)\bigr) \;=\; \frac{\sum_i w_{i,t}\,\mathbf{1}\{WL_t(X_i)\neq Y_i\}}{\sum_i w_{i,t}},$$

so that the weight update $w_{i,t+1} = w_{i,t}\,\beta_t^{\,1-\mathbf{1}\{WL_t(X_i)\neq Y_i\}}$ gives

$$\sum_i w_{i,t+1} \;\le\; \Bigl(\sum_i w_{i,t}\Bigr)\bigl(\varepsilon_t + \beta_t(1-\varepsilon_t)\bigr) \;=\; \Bigl(\sum_i w_{i,t}\Bigr)\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr).$$

Page 31: Boosting (Part II) and Self-Organizing Maps

Boosting error bound (continued)

We have that:

$$\sum_i w_{i,t+1} \;\le\; \Bigl(\sum_i w_{i,t}\Bigr)\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr).$$

Putting this together over all the iterations (the initial weights $w_{i,1} = 1/N$ sum to one):

$$\sum_i w_{i,T+1} \;\le\; \prod_{t=1}^{T}\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr).$$

Page 32: Boosting (Part II) and Self-Organizing Maps

Lower bounding the error

We can lower bound the weights as follows. Writing $h_t$ for the weak learner $WL_t$, the final hypothesis makes a mistake in predicting $y_i$ if (see Bayes)

$$\prod_{t=1}^{T} \beta_t^{-|h_t(x_i)-y_i|} \;\ge\; \Bigl(\prod_{t=1}^{T}\beta_t\Bigr)^{-1/2},$$

i.e., if $\sum_{t=1}^{T}\log(1/\beta_t)\,|h_t(x_i)-y_i| \ge \tfrac12\sum_{t=1}^{T}\log(1/\beta_t)$. The final weight on an instance is

$$w_{i,T+1} \;=\; w_{i,1}\,\prod_{t=1}^{T} \beta_t^{\,1-|y_i-h_t(x_i)|}.$$

Page 33: Boosting (Part II) and Self-Organizing Maps

Lower Bound on weights

Putting together the last slide: for the instances on which the final hypothesis $h$ errs, $w_{i,T+1} \ge w_{i,1}\bigl(\prod_{t=1}^{T}\beta_t\bigr)^{1/2}$, so that

$$\sum_i w_{i,T+1} \;\ge\; \sum_{i:\, h(x_i)\neq y_i} w_{i,1}\,\Bigl(\prod_{t=1}^{T}\beta_t\Bigr)^{1/2} \;=\; \varepsilon\,\Bigl(\prod_{t=1}^{T}\beta_t\Bigr)^{1/2},$$

since $w_{i,1} = 1/N$ and $\varepsilon$ is the fraction of training instances misclassified by the final hypothesis.

Page 34: Boosting (Part II) and Self-Organizing Maps

Conclusion of Proof:

Putting the former result together with

$$\sum_i w_{i,T+1} \;\le\; \prod_{t=1}^{T}\bigl(1-(1-\beta_t)(1-\varepsilon_t)\bigr),$$

we get

$$\varepsilon \;\le\; \prod_{t=1}^{T}\frac{1-(1-\beta_t)(1-\varepsilon_t)}{\sqrt{\beta_t}},$$

and substituting $\beta_t = \varepsilon_t/(1-\varepsilon_t)$ gives the conclusion $\varepsilon \le \prod_{t=1}^{T}\sqrt{4\,\varepsilon_t(1-\varepsilon_t)}$.