(Useful) Information Geometry
Shane Gu and Nilesh Tripuraneni
[email protected], [email protected]
02/04/2015


AdaBoost

- Input: a pool of weak learners/rules H, training data (x_i, y_i) \in X \times \{-1, 1\}, and an initial sample weight distribution D_0.

- Weak learner: find a weak rule h_t \in H that gives the smallest weighted error \varepsilon_t under D_t.

- Booster: adjust the sample weights

  D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp(-\alpha_t y_i h_t(x_i))   (1)

  where \alpha_t = \frac{1}{2} \ln \frac{1-\varepsilon_t}{\varepsilon_t} and Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)).

- Repeat until convergence or satisfaction.

- Output: the strong learner/rule F(x) = \mathrm{sgn}\left(\sum_t \alpha_t h_t(x)\right).
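As a concrete illustration, here is a minimal sketch of the loop above in Python (not from the slides: numpy, 1-D threshold stumps as a hypothetical pool H, and all names are ours):

```python
import numpy as np

def adaboost(X, y, T=20):
    """Minimal AdaBoost sketch. X: (N,) 1-D features; y: labels in {-1, +1}.
    Weak rules are threshold stumps h(x) = s * sign(x - theta)."""
    N = len(X)
    D = np.full(N, 1.0 / N)                 # initial sample weights (uniform)
    learners = []                           # chosen (s, theta, alpha_t) triples
    for t in range(T):
        # Weak learner: stump with the smallest weighted error under D_t.
        best = None
        for theta in np.unique(X):
            for s in (+1, -1):
                pred = s * np.sign(X - theta + 1e-12)
                eps = D[pred != y].sum()
                if best is None or eps < best[0]:
                    best = (eps, s, theta, pred)
        eps, s, theta, pred = best
        if eps <= 0.0 or eps >= 0.5:        # perfect rule, or none better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # Booster: reweight, eq. (1)...
        D /= D.sum()                        # ...and renormalize by Z_t
        learners.append((s, theta, alpha))
    return learners

def predict(learners, X):
    # Strong rule F(x) = sgn(sum_t alpha_t h_t(x)).
    F = sum(a * s * np.sign(X - th + 1e-12) for s, th, a in learners)
    return np.sign(F)
```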


AdaBoost

- At the end of each round, the weight of example i is exponentially proportional to its loss:

  D_{t+1}(i) = \prod_{t'=1}^{t} \frac{\exp(-y_i \alpha_{t'} h_{t'}(x_i))}{Z_{t'}} \propto \exp(-y_i F_t(x_i))

- The total loss is proportional to Z_t:

  \sum_i \exp(-y_i F_t(x_i)) = \sum_i \exp(-y_i (F_{t-1}(x_i) + \alpha_t h_t(x_i))) \propto \sum_i D_t(i) \exp(-y_i \alpha_t h_t(x_i)) \equiv Z_t
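Both identities are easy to verify numerically. A small sketch (not from the slides; random ±1 vectors stand in for the weak-rule outputs, and the identities hold for any choice of the α_t):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 8, 5
y = rng.choice([-1, 1], size=N)
h = rng.choice([-1, 1], size=(T, N))       # h[t, i] stands in for h_t(x_i)
alpha = rng.uniform(0.1, 1.0, size=T)      # arbitrary step sizes

D = np.full(N, 1.0 / N)                    # uniform initial weights
Z = []
for t in range(T):
    w = D * np.exp(-alpha[t] * y * h[t])
    Z.append(w.sum())                      # Z_t
    D = w / Z[-1]                          # D_{t+1}

F = (alpha[:, None] * h).sum(axis=0)       # F_T(x_i) = sum_t alpha_t h_t(x_i)
loss = np.exp(-y * F)
assert np.allclose(D, loss / loss.sum())   # D_{T+1}(i) proportional to exp(-y_i F_T(x_i))
assert np.isclose(loss.sum(), N * np.prod(Z))  # total loss = N * prod_t Z_t
```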


Sequential Error Minimization

Find \alpha_t and h_t to minimize Z_t(\alpha_t, h_t):

Z_t(\alpha_t, h_t) = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = \sum_{i: y_i = h_t(x_i)} D_t(i) \exp(-\alpha_t) + \sum_{i: y_i \neq h_t(x_i)} D_t(i) \exp(\alpha_t)

= (\exp(\alpha_t) - \exp(-\alpha_t)) \sum_{i=1}^{N} D_t(i) \mathbb{I}(y_i \neq h_t(x_i)) + \exp(-\alpha_t) \sum_{i=1}^{N} D_t(i)

Choose h_t to minimize the weighted error \varepsilon_t = \sum_i D_t(i) \mathbb{I}(y_i \neq h_t(x_i)), the only term above that depends on h_t. Then

Z_t(\alpha_t, h_t) = \exp(-\alpha_t)(1 - \varepsilon_t) + \exp(\alpha_t) \varepsilon_t

Choose \alpha_t such that \frac{dZ_t}{d\alpha_t} = 0 \implies \alpha_t = \frac{1}{2} \ln \frac{1-\varepsilon_t}{\varepsilon_t}.
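The closed form can be checked against a numerical minimizer (a sketch; scipy's scalar minimizer is used only for comparison):

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3                                        # weighted error of the chosen rule
Z = lambda a: np.exp(-a) * (1 - eps) + np.exp(a) * eps
alpha_closed = 0.5 * np.log((1 - eps) / eps)     # stationary point of Z
alpha_numeric = minimize_scalar(Z).x
assert np.isclose(alpha_closed, alpha_numeric, atol=1e-6)
```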


Orthogonality of D

So \alpha_t is chosen such that \frac{dZ_t}{d\alpha_t} = 0, and h_t is chosen to minimize the weighted error \sum_i D_t(i) \mathbb{I}(y_i \neq h_t(x_i)). Then the booster constructs a new distribution D_{t+1} whose correlation with h_t is zero:

\sum_i D_{t+1}(i) y_i h_t(x_i) = \frac{1}{Z_t} \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) y_i h_t(x_i) = -\frac{1}{Z_t} \frac{dZ_t}{d\alpha_t} = 0.
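This zero correlation is also easy to confirm numerically (a sketch; the rule h is fabricated by flipping a few labels so that 0 < ε_t < 1):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
y = rng.choice([-1, 1], size=N)
h = y.copy(); h[::3] *= -1                  # a rule that errs on every third example
D = rng.dirichlet(np.ones(N))               # any current distribution D_t

eps = D[h != y].sum()
alpha = 0.5 * np.log((1 - eps) / eps)       # the optimal alpha_t
D_next = D * np.exp(-alpha * y * h)
D_next /= D_next.sum()
assert np.isclose((D_next * y * h).sum(), 0.0)   # D_{t+1} decorrelated from h_t
```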


Alternative View of AdaBoost

- Weak learner: given D_t, find h_t \in H minimizing the weighted error

  \min_{h_t \in H} \sum_i D_t(i) \mathbb{I}(h_t(x_i) \neq y_i)

- Booster: given h_t, compute D_{t+1} such that

  \sum_i D_{t+1}(i) y_i h_t(x_i) = 0

i.e., is the booster pursuing a distribution D such that

\sum_i D(i) y_i h_j(x_i) = 0

for every h_j \in H? This is a set of constraints linear in D (see the sketch below).
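In matrix form, the booster's target set is an intersection of hyperplanes with the probability simplex. A small sketch (toy data; names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
J, N = 3, 12                           # hypothetical pool size and sample count
y = rng.choice([-1, 1], size=N)
H = rng.choice([-1, 1], size=(J, N))   # H[j, i] stands in for h_j(x_i)

A = y[None, :] * H                     # A[j, i] = y_i * h_j(x_i)
# Target set: {D : A @ D = 0, D >= 0, D.sum() == 1} -- all constraints linear in D.
D = np.full(N, 1.0 / N)
print(A @ D)                           # correlations of the uniform D with each h_j
```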


Optimization Problem for AdaBoost?

Solve:

\min_D \; KL(D \| U)

such that

\sum_i D(i) y_i h_j(x_i) = 0 \quad \forall j
D(i) \geq 0 \quad \forall i
\sum_i D(i) = 1

Let us assume the feasible set P defined by these constraints is non-empty.
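For small problems this projection can be handed to a generic solver directly (a sketch, not how AdaBoost computes it; it reuses the toy A from the previous block and assumes that random instance is feasible):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
J, N = 3, 12
y = rng.choice([-1, 1], size=N)
H = rng.choice([-1, 1], size=(J, N))
A = y[None, :] * H

# KL(D || U) = sum_i D(i) ln(N * D(i)); the floor avoids log(0) at the boundary.
kl_to_uniform = lambda D: np.sum(D * np.log(np.maximum(D, 1e-12) * N))
cons = [{"type": "eq", "fun": lambda D: A @ D},          # zero correlations
        {"type": "eq", "fun": lambda D: D.sum() - 1.0}]  # normalization
res = minimize(kl_to_uniform, np.full(N, 1.0 / N), method="SLSQP",
               bounds=[(0.0, 1.0)] * N, constraints=cons)
print(res.success, res.x)              # the KL projection of U onto P, if feasible
```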


Iterative Projections

- Initialize D_1 = U.
- Choose h_t \in H corresponding to one constraint (weak learner).
- Find D_{t+1} = \mathrm{argmin}_{D: \sum_i D(i) y_i h_t(x_i) = 0} KL(D \| D_t) (booster).
- Iterate.

Greedy selection of constraints: choose h_t so that KL(D_{t+1} \| D_t) is maximized.

Each round of iterative projection is equivalent to one round of AdaBoost (a sketch follows).
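For ±1-valued rules, the projection onto a single constraint has a closed form, and it is exactly the AdaBoost update. A sketch of one greedy round (names ours; assumes 0 < ε_j < 1 for the chosen rule):

```python
import numpy as np

def projection_round(D, y, H):
    """One round of iterative projection = one round of AdaBoost.
    At the optimal alpha, KL(D_{t+1} || D_t) = -ln(2 sqrt(eps (1 - eps))),
    so the greedy choice is the rule whose error is farthest from 1/2."""
    errs = np.array([D[h != y].sum() for h in H])   # weighted errors eps_j
    j = np.argmax(np.abs(0.5 - errs))
    alpha = 0.5 * np.log((1 - errs[j]) / errs[j])   # closed-form projection step
    D_new = D * np.exp(-alpha * y * H[j])
    return D_new / D_new.sum(), j, alpha
```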


Equivalence Proof

- Booster: using Lagrange multipliers/duality, solve

  \max_{\alpha, \mu} \min_D L(\alpha, \mu, D) = KL(D \| D_t) + \alpha \sum_i D(i) y_i h_t(x_i) + \mu \left( \sum_i D(i) - 1 \right)

  0 = \frac{\partial L}{\partial D(i)} = \ln \frac{D(i)}{D_t(i)} + 1 + \alpha y_i h_t(x_i) + \mu

  D^*(i) = D_t(i) \exp(-\alpha y_i h_t(x_i) - 1 - \mu) = \frac{1}{Z(\alpha)} D_t(i) \exp(-\alpha y_i h_t(x_i))

  L(\alpha) = -\ln Z(\alpha)

  Choosing \alpha to maximize L is choosing \alpha to minimize Z_t, so D^*, \alpha, Z coincide with D_{t+1}, \alpha_t, Z_t for the same h_t.

- Weak learner: find h_t to maximize

  KL(D_{t+1} \| D_t) = \sum_i D_{t+1}(i) (-\alpha_t y_i h_t(x_i) - \ln Z_t) = -\ln Z_t

  (the first term vanishes by the orthogonality of D_{t+1} and h_t). This is equivalent to choosing h_t to minimize Z_t.
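The identity KL(D_{t+1} || D_t) = -ln Z_t can be checked directly (a sketch with fabricated y, h, D):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
y = rng.choice([-1, 1], size=N)
h = y.copy(); h[:3] *= -1                   # a rule with 0 < eps < 1
D = rng.dirichlet(np.ones(N))               # current D_t

eps = D[h != y].sum()
alpha = 0.5 * np.log((1 - eps) / eps)
w = D * np.exp(-alpha * y * h)
Z = w.sum()
D_next = w / Z
kl = np.sum(D_next * np.log(D_next / D))
assert np.isclose(kl, -np.log(Z))           # KL(D_{t+1} || D_t) = -ln Z_t >= 0
```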


Convergence of AdaBoost

- Recall that P is the feasible set of the constraints, and define Q as the set of D \propto \exp(-\sum_j \lambda_j y_i h_j(x_i)). If d \in P \cap Q, then by the Pythagorean theorem (as before) d uniquely solves \min_{p \in P} KL(p \| U).

- D_t computed by iterative projection converges to the unique point d \in P \cap Q. By the Pythagorean theorem,

  KL(D^* \| D_{t+1}) = KL(D^* \| D_t) - KL(D_{t+1} \| D_t)

  so we are always getting closer!

- i.e., KL(D^* \| D_t) \geq 0 and is non-increasing, so the drop KL(D_{t+1} \| D_t) must converge to 0.

- Moreover, if the drop in loss is 0, then D \in P.

- The construction of d implies D^* \in Q.
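The Pythagorean identity itself is easy to verify for any point satisfying the constraint (a sketch; p below stands in for D^*):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10
y = rng.choice([-1, 1], size=N)
h = y.copy(); h[:4] *= -1                    # a rule with some mistakes
D = rng.dirichlet(np.ones(N))                # D_t

eps = D[h != y].sum()
alpha = 0.5 * np.log((1 - eps) / eps)
w = D * np.exp(-alpha * y * h)
D1 = w / w.sum()                             # D_{t+1}, the projection of D_t

# Any p with sum_i p(i) y_i h(x_i) = 0 plays the role of D*: put half the
# mass uniformly on the correct examples and half on the mistakes.
p = np.where(h == y, 0.5 / (h == y).sum(), 0.5 / (h != y).sum())

KL = lambda a, b: np.sum(a * np.log(a / b))
assert np.isclose(KL(p, D1), KL(p, D) - KL(D1, D))   # Pythagorean identity
```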


Duality

Minimizing the exponential loss \mathbb{E}[\exp(-y F(x))] is the convex dual of solving the KL-projection problem subject to linear constraints.


Afterthoughts and Bregman Divergences

Why and when does this work? For a convex function F, the induced Bregman divergence is:

B_F(p \| q) = F(p) - F(q) - \nabla F(q)^\top (p - q)

Bregman divergences are in one-to-one correspondence with exponential families (i.e., contours of equal density define the Bregman distance).

Theorem: for a large family of Bregman divergences, there exists a unique d^* satisfying

- d^* \in P \cap Q
- d^* = \mathrm{argmin}_{p \in P} B_F(p \| q_0)
- the Pythagorean theorem.
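A short sketch of the definition, recovering two familiar special cases (names ours):

```python
import numpy as np

def bregman(F, gradF, p, q):
    """B_F(p || q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - gradF(q) @ (p - q)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# F(x) = sum x ln x (negative entropy) induces the KL divergence on the simplex:
kl = bregman(lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1, p, q)
assert np.isclose(kl, np.sum(p * np.log(p / q)))

# F(x) = ||x||^2 / 2 induces half the squared Euclidean distance:
sq = bregman(lambda x: 0.5 * x @ x, lambda x: x, p, q)
assert np.isclose(sq, 0.5 * np.sum((p - q) ** 2))
```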


Citations

R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms.

M. Collins, R. E. Schapire, and Y. Singer, "Logistic regression, AdaBoost and Bregman distances," Machine Learning, vol. 48, no. 1-3, pp. 253-285, 2002.
