Lossless or Quantized Boosting with Integer Arithmetic
— Supplementary Material —

Richard Nock, Data61, The Australian National University & The University of Sydney
Robert C. Williamson, The Australian National University & Data61
Abstract

This is the Supplementary Material to the paper "Lossless or Quantized Boosting with Integer Arithmetic", appearing in the proceedings of ICML 2019. The notation "main file" indicates a reference to the paper.
1 Table of contents

Supplementary material on proofs
- Proof of Theorem 5
  - Comments on properness vs the Q-loss
  - Detailed proof
- Proof of Lemma 6
- Proof of Theorem 7
- Proof of Theorem 8
- Proof of Theorem 10

Supplementary material on experiments
- Implementation
- Domain summary Table
- UCI fertility
- UCI haberman
- UCI transfusion
- UCI banknote
- UCI breastwisc
- UCI ionosphere
- UCI sonar
- UCI yeast
- UCI winered
- UCI cardiotocography
- UCI creditcardsmall
- UCI abalone
- UCI qsar
- UCI winewhite
- UCI page
- UCI mice
- UCI hill+noise
- UCI hill+nonoise
- UCI firmteacher
- UCI magic
- UCI eeg
- UCI skin
- UCI musk
- UCI hardware
- UCI twitter
- Summary of Results
2 Proof of Theorem 5
2.1 Comments on properness vs the Q-loss

We explain here why we have left the unit interval open in the definition of (2), and why the parameter ε in the definition of the partial losses of the Q-loss is important for its properness, even though the actual value of ε has absolutely no influence on RATBOOST nor on the decision-tree induction algorithm using L^Q. A large class of partial losses is defined in Buja et al. (2005, Theorem 1)¹, from which the following,

$$\ell_1(u) \doteq \int_u^{1-\varepsilon} (1-c)\, w(dc), \qquad (1)$$

$$\ell_{-1}(u) \doteq \int_\varepsilon^u c\, w(dc), \qquad (2)$$

defines the partial losses of a proper loss, where w is a positive measure required to be finite on any interval (ε, 1−ε), with² 0 < ε ≤ 1/2. The definition of proper losses in Reid & Williamson (2010, Theorem 6) implicitly assumes that the integrals are proper, so that the limits of (1), (2) exist for ε → 0.
In our case, it is not hard to reconstruct the partial losses of Definition 4 from (1), (2) provided we pick

$$w(dc) \doteq \frac{\varrho\cdot dc}{\mathrm{err}(c)^2}, \qquad (3)$$

which indeed meets the requirements of Buja et al. (2005, Theorem 1) (see (9) below). So, the Q-loss implicitly constrains the domain of the pointwise Bayes risk to be (ε, 1−ε) for it to fit (1), (2). While this brings the benefit of preventing infinite values for the pointwise Bayes risk ($\lim_0 L^Q(u) = \lim_1 L^Q(u) = -\infty$), it also does not represent a restriction for learning:
• this restricts in theory the image of H_T in RATBOOST to [ψ(ε), −ψ(ε)] using the canonical link, that is,

$$\mathrm{Im}\, H_T \subseteq \varrho\cdot\left(\frac{1}{\varepsilon} - 2\right)\cdot[-1,1], \qquad (4)$$

but all components of H_T have finite values in RATBOOST (including the images of weak hypotheses, wlog), so we can just consider that ε is implicitly fixed small enough for (4) to hold (again, learning H_T in RATBOOST does not depend on ε);
• this restricts in theory the proportion p of examples of class ±1 at each leaf of a decision tree to be in (ε, 1−ε) for the tree to be learned with L^Q, but this happens not to be restrictive, for three reasons. First, all classical top-down induction algorithms use losses whose Bayes risk vanishes at 0 and 1, so we can train those trees by discarding pure leaves in the computation of L (Section 7). Second, discarding pure leaves from the computation of the loss does not endanger the weak learning assumption. Third, in practice DTs are pruned for good generalization: classical statistical methods will in general end up with trees with pure leaves removed (Kearns & Mansour, 1998).
¹ And an even larger class is defined in Schervish (1989, Theorem 4.2).
² Buja et al. (2005, Theorem 1) is slightly more general, as the integral bounds depending on ε are replaced by variables in (ε, 1−ε).
2.2 Detailed proof

We use Shuford, Jr et al. (1966, Theorem 1) and Reid & Williamson (2010, Theorem 1) to show that the Q-loss is proper. For this to hold, we just need to show that $-u\,\ell_1^{Q\prime}(u) = (1-u)\,\ell_{-1}^{Q\prime}(u)$ for all $u\in(0,1)$, where ′ denotes the derivative. We then check that whenever u ≤ 1/2, we have $\ell_1^{Q\prime}(u) = \varrho\cdot(-1/u^2 + 1/u)$ and $\ell_{-1}^{Q\prime}(u) = \varrho\cdot(1/u)$, so that

$$-u\,\ell_1^{Q\prime}(u) = \varrho\cdot\left(\frac{1}{u} - 1\right) = \varrho\cdot\left(\frac{1-u}{u}\right); \qquad (1-u)\,\ell_{-1}^{Q\prime}(u) = \varrho\cdot\left(\frac{1-u}{u}\right), \qquad (5)$$
so the Q-loss is proper. To show that it is strictly proper is just a matter of completing two steps: (i) computing the pointwise Bayes risk L^Q, and (ii) computing its weight function w^Q(u) and showing that it is strictly positive for any u ∈ [0,1] (Reid & Williamson, 2010, Theorem 6). To achieve step (i), we remark that, because ℓ^Q is proper (Reid & Williamson, 2010),

$$\frac{1}{\varrho}\cdot L^Q(u) = \frac{1}{\varrho}\cdot L^Q(u,u) = \frac{1}{\varrho}\cdot\left(u\cdot\ell_1^Q(u) + (1-u)\cdot\ell_{-1}^Q(u)\right)$$

$$= \begin{cases} -u\log\varepsilon - 2u + 1 + u\log u - (1-u)\log\varepsilon + (1-u)\log u & \text{if } u\le 1/2\\ -u\log\varepsilon + u\log(1-u) - (1-u)\log\varepsilon - 2(1-u) + 1 + (1-u)\log(1-u) & \text{otherwise}\end{cases} \qquad (6)$$

$$= -\log\varepsilon + \begin{cases} -2u + 1 + \log u & \text{if } u\le 1/2\\ -2(1-u) + 1 + \log(1-u) & \text{otherwise}\end{cases} \qquad (7)$$

$$= -\log\varepsilon + \log\mathrm{err}(u) + 1 - 2\,\mathrm{err}(u) = \log\left(\frac{\mathrm{err}(u)}{\varepsilon}\right) + 1 - 2\,\mathrm{err}(u), \qquad (8)$$
and we retrieve (11). We then easily check that its weight function equals (Buja et al., 2005)

$$w^Q(u) \doteq -L^{Q\prime\prime}(u) = -\varrho\cdot\left(\begin{cases}\frac{1}{u} - 2 & \text{if } u\le 1/2\\ -\frac{1}{1-u} + 2 & \text{otherwise}\end{cases}\right)' = \varrho\cdot\begin{cases}\frac{1}{u^2} & \text{if } u\le 1/2\\ \frac{1}{(1-u)^2} & \text{otherwise}\end{cases} = \frac{\varrho}{\mathrm{err}(u)^2}, \qquad (9)$$
which is indeed > 0 for any u ∈ [0,1], and shows that the Q-loss is strictly proper. We also remark that L^Q is twice differentiable. The computation of the inverse link is then, from (5) (we recall that K = 0),

$$\psi_Q^{-1}(z) \doteq \left(-L^{Q\prime}\right)^{-1}(z) = \left(\varrho\cdot\begin{cases} 2 - \frac{1}{u} & \text{if } u\le 1/2\\ -2 + \frac{1}{1-u} & \text{otherwise}\end{cases}\right)^{-1}(z) \qquad (10)$$

$$= \begin{cases}\dfrac{1}{2 - \frac{z}{\varrho}} & \text{if } z\le 0\\[2mm] \dfrac{1 + \frac{z}{\varrho}}{2 + \frac{z}{\varrho}} & \text{otherwise}\end{cases} \;=\; \frac{\varrho + \mathrm{H}(-z)}{2\varrho + |z|}, \qquad (11)$$

(here H(x) = max{0, −x}, the negative part, cf. (12) below),
as claimed (the link is immediate from (10)). The convex surrogate for the Q-loss is obtained from (7), and we first search for (−L)*:

$$(-L^Q)^\star(z) \doteq \sup_{z'\in\mathrm{dom}(L^Q)}\{zz' + L^Q(z')\} = \sup_{u\in[0,1]}\left\{zu + \varrho\cdot\left(\log\left(\frac{\mathrm{err}(u)}{\varepsilon}\right) + 1 - 2\,\mathrm{err}(u)\right)\right\}$$

$$= \varrho\cdot(1-\log\varepsilon) + \max\left\{\sup_{u\in[0,1/2]}\{(z-2\varrho)u + \varrho\log u\},\; -2\varrho + \sup_{u\in(1/2,1]}\{(z+2\varrho)u + \varrho\log(1-u)\}\right\}$$

$$= \varrho\cdot(1-\log\varepsilon) + \max\begin{cases}\varrho\log\varrho + \varrho\cdot\frac{z-2\varrho}{2\varrho-z} - \varrho\log(2\varrho-z) & \text{for } u = \frac{\varrho}{2\varrho-z}\in[0,1/2]\\[1mm] \varrho\log\varrho - 2\varrho + \frac{(z+\varrho)(z+2\varrho)}{z+2\varrho} - \varrho\log(2\varrho+z) & \text{for } u = \frac{z+\varrho}{z+2\varrho}\in(1/2,1]\end{cases}$$

$$= \varrho\log\varrho - \varrho\log\varepsilon + \max\begin{cases}-\varrho\log(2\varrho-z) & \text{for } u = \frac{\varrho}{2\varrho-z}\in[0,1/2]\\[1mm] z - \varrho\log(2\varrho+z) & \text{for } u = \frac{z+\varrho}{z+2\varrho}\in(1/2,1]\end{cases}$$

$$= -\varrho\log\left(\frac{\varepsilon}{\varrho}\right) + \max\begin{cases}-\varrho\log(2\varrho-z) & \text{for } z\le 0\\ z - \varrho\log(2\varrho+z) & \text{for } z>0\end{cases} = -\varrho\log\left(\frac{\varepsilon}{\varrho}\right) + \begin{cases}-\varrho\log(2\varrho-z) & \text{for } z\le 0\\ z - \varrho\log(2\varrho+z) & \text{for } z>0\end{cases}$$

$$= -\varrho\cdot\log\left(2\varepsilon + \frac{\varepsilon|z|}{\varrho}\right) + \mathrm{H}(-z), \qquad (12)$$
and we get

$$F^Q(z) \doteq (-L^Q)^\star(-z) \qquad (13)$$

$$= -\varrho\cdot\log\left(2\varepsilon + \frac{\varepsilon|z|}{\varrho}\right) + \mathrm{H}(z), \qquad (14)$$
as claimed. This derivation also allows us to prove that the Q-loss is proper canonical, using Nock & Nielsen (2008, Lemma 1). That the Q-loss is symmetric is just a consequence of its definition (Reid & Williamson, 2010). This ends the proof of Theorem 5.
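The closed forms (11) and (14) are easy to sanity-check numerically. The sketch below is ours (not the paper's code), written in Java to match the implementation language of Section 7; it assumes H(x) = max{0, −x} and checks that the closed form of the inverse link agrees with the two branches of (10), and that ε shifts F^Q by an additive constant only, which is why its value cannot influence learning (Section 2.1).

```java
// Sketch (ours): the inverse canonical link (11) and convex surrogate (14)
// of the Q-loss, with H(x) = max(-x, 0) as used throughout the proofs.
public class QLossCheck {
    static double H(double x) { return Math.max(-x, 0.0); }

    // Closed form (11): psi^{-1}(z) = (rho + H(-z)) / (2 rho + |z|).
    static double psiInv(double z, double rho) {
        return (rho + H(-z)) / (2.0 * rho + Math.abs(z));
    }

    // Piecewise form (10): z <= 0 maps to u <= 1/2, z > 0 to u > 1/2.
    static double psiInvPiecewise(double z, double rho) {
        return (z <= 0) ? 1.0 / (2.0 - z / rho)
                        : (1.0 + z / rho) / (2.0 + z / rho);
    }

    // Convex surrogate (14): F^Q(z) = -rho*log(2*eps + eps*|z|/rho) + H(z).
    static double F(double z, double rho, double eps) {
        return -rho * Math.log(2.0 * eps + eps * Math.abs(z) / rho) + H(z);
    }

    public static void main(String[] args) {
        double rho = 2.0;
        double s0 = F(0.0, rho, 0.1) - F(0.0, rho, 0.01);
        for (double z = -5.0; z <= 5.0; z += 0.25) {
            if (Math.abs(psiInv(z, rho) - psiInvPiecewise(z, rho)) > 1e-12)
                throw new AssertionError("link mismatch at z = " + z);
            // eps only shifts F^Q by the constant -rho*(log eps1 - log eps2):
            if (Math.abs(F(z, rho, 0.1) - F(z, rho, 0.01) - s0) > 1e-9)
                throw new AssertionError("eps is not a constant shift");
        }
        System.out.println("psiInv(0) = " + psiInv(0.0, rho)); // prints 0.5
    }
}
```

An uninformative score z = 0 maps to weight 1/2, as expected of a canonical link.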
3 Proof of Lemma 6

Denote for short

$$v \doteq z + \varrho\cdot\left(\frac{1-2u}{\mathrm{err}(u)}\right). \qquad (15)$$

It is not hard to check that indeed

$$z\odot u = \frac{\varrho + \mathrm{H}(v)}{2\varrho + |v|} \doteq g(v), \qquad (16)$$
as well as g(−v) = 1 − g(v). So, we focus on the second equality. Denote for short u ≐ n_u/d_u and z ≐ ϱ·n_z/d_z. We remark that the definition of z makes ϱ cancel:

$$z\odot u = \frac{1 + \mathrm{H}\!\left(\frac{n_z}{d_z} + \frac{1 - 2\cdot\frac{n_u}{d_u}}{\frac{n_u}{d_u}\wedge\frac{d_u-n_u}{d_u}}\right)}{2 + \left|\frac{n_z}{d_z} + \frac{1 - 2\cdot\frac{n_u}{d_u}}{\frac{n_u}{d_u}\wedge\frac{d_u-n_u}{d_u}}\right|} = \frac{1 + \mathrm{H}\!\left(\frac{n_z}{d_z} + \frac{d_u-2n_u}{n_u\wedge(d_u-n_u)}\right)}{2 + \left|\frac{n_z}{d_z} + \frac{d_u-2n_u}{n_u\wedge(d_u-n_u)}\right|}. \qquad (17)$$
Case 1: v ≥ 0 and n_u ≤ d_u − n_u. We have

$$z\odot u = \frac{1}{2 + \frac{n_z}{d_z} + \frac{d_u-2n_u}{n_u}} = \frac{1}{\frac{n_z}{d_z} + \frac{d_u}{n_u}} = \frac{n_u d_z}{n_u n_z + d_u d_z}. \qquad (18)$$
Case 2: v ≥ 0 and n_u > d_u − n_u. We have

$$z\odot u = \frac{1}{2 + \frac{n_z}{d_z} + \frac{d_u-2n_u}{d_u-n_u}} = \frac{1}{3 + \frac{n_z}{d_z} - \frac{n_u}{d_u-n_u}} = \frac{(d_u-n_u)\,d_z}{(d_u-n_u)(3d_z + n_z) - n_u d_z} = \frac{(d_u-n_u)\,d_z}{(d_u-n_u)\,n_z + d_u d_z + 2(d_u-2n_u)\,d_z}. \qquad (19)$$
Folding cases 1 and 2 together, we get

$$z\odot u = \frac{(n_u\wedge(d_u-n_u))\,d_z}{(n_u\wedge(d_u-n_u))\,n_z + d_u d_z - 2\,\mathrm{H}(d_u-2n_u)\,d_z}. \qquad (20)$$

Note that this holds when v ≥ 0, which is equivalent to

$$\frac{n_z}{d_z} + \frac{d_u-2n_u}{n_u\wedge(d_u-n_u)} > 0, \qquad (21)$$
that is, assuming wlog d_z > 0,

$$(n_u\wedge(d_u-n_u))\,n_z > -(d_u-2n_u)\,d_z. \qquad (22)$$

So, let us denote a ≐ (n_u ∧ (d_u−n_u))·d_z, b ≐ (n_u ∧ (d_u−n_u))·n_z, c ≐ d_u·d_z, d ≐ 2(d_u−2n_u)·d_z. We get that if b + (d/2) ≥ 0, then

$$z\odot u = \frac{a}{b + c - \mathrm{H}(d)}, \qquad (23)$$

and if b + (d/2) < 0, then we remark that −b − (d/2) > 0, so

$$z\odot u = 1 - \frac{a}{-b + c - \mathrm{H}(-d)} = \frac{-b - a + c - \mathrm{H}(-d)}{-b + c - \mathrm{H}(-d)}, \qquad (24)$$

as claimed.
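The point of Lemma 6 is that the weight update can be carried out entirely in integer arithmetic: with u = n_u/d_u and z = ϱ·n_z/d_z given by integer numerators and denominators, (23)–(24) produce the new weight as an exact integer fraction. Below is a minimal sketch of this update (ours, not the paper's code; variable names are ours, the operator of (16) is rendered ⊙, H(x) = max{0, −x}, and there is no overflow guard, unlike RATBOOSTE in Section 7):

```java
// Sketch of the integer-only weight update of Lemma 6 (eqs. (23)-(24)).
// Inputs: u = nu/du with 0 < nu < du, z = rho * nz/dz with dz > 0 (wlog).
// Output: {num, den} such that the new weight z (.) u equals num/den exactly.
public class IntWeightUpdate {
    static long H(long x) { return Math.max(-x, 0); }   // negative part

    static long[] update(long nu, long du, long nz, long dz) {
        long min = Math.min(nu, du - nu);               // nu ^ (du - nu)
        long a = min * dz;
        long b = min * nz;
        long c = du * dz;
        long d = 2 * (du - 2 * nu) * dz;
        if (2 * b + d >= 0)                             // b + d/2 >= 0: eq. (23)
            return new long[] { a, b + c - H(d) };
        return new long[] { -b - a + c - H(-d), -b + c - H(-d) };  // eq. (24)
    }

    public static void main(String[] args) {
        // u = 1/4, z = rho * 1/2: the new weight is 2/9.
        long[] w = update(1, 4, 1, 2);
        System.out.println(w[0] + "/" + w[1]);          // prints 2/9
    }
}
```

The test `2*b + d >= 0` avoids any division, since d is even by construction; the whole update stays in the integers.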
4 Proof of Theorem 7

The proof revolves around two simple facts about F^Q: (i) since F^Q is convex and differentiable, we have F^Q(y) − F^Q(x) − (y−x)F^{Q′}(x) ≥ 0 (the left-hand side is just the Bregman divergence with generator F^Q); (ii) F^Q being twice differentiable, Taylor's theorem says that for any x, y we can expand the derivative as F^{Q′}(y) = F^{Q′}(x) + (y−x)F^{Q″}(z) for some z ∈ [x, y]. Using (i) and (ii) in this order, we get that for any i ∈ {1, 2, ..., m}, there exist α_i ∈ [0,1] and

$$\beta_i = y_iH_t(x_i) + \alpha_i\delta_t y_ih_t(x_i) \in [y_iH_t(x_i),\, y_iH_{t+1}(x_i)] \qquad (25)$$
such that:

$$\mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_t(x_i))\right] - \mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_{t+1}(x_i))\right] \ge \mathbb{E}_{i\sim D}\!\left[(y_iH_t(x_i) - y_iH_{t+1}(x_i))\,F^{Q\prime}(y_iH_{t+1}(x_i))\right] \qquad (26)$$

$$= \underbrace{\mathbb{E}_{i\sim D}\!\left[(y_iH_t(x_i) - y_iH_{t+1}(x_i))\,F^{Q\prime}(y_iH_t(x_i))\right]}_{\doteq X} - \underbrace{\mathbb{E}_{i\sim D}\!\left[(y_iH_t(x_i) - y_iH_{t+1}(x_i))^2\,F^{Q\prime\prime}(\beta_i)\right]}_{\doteq Y}. \qquad (27)$$
Because F^Q is convex, Y ≥ 0. We want to show not only that X ≥ 0, but in fact that the difference X − Y is sufficiently large for the bound of the Theorem to hold. We first remark that

$$X \doteq \mathbb{E}_{i\sim D}\!\left[(y_iH_t(x_i) - y_iH_{t+1}(x_i))\,F^{Q\prime}(y_iH_t(x_i))\right] = -\delta_t\,\mathbb{E}_{i\sim D}\!\left[y_ih_t(x_i)\cdot\left(-\psi_Q^{-1}(-y_iH_t(x_i))\right)\right] = \delta_t\,\mathbb{E}_{i\sim D}\!\left[w_{ti}\,y_ih_t(x_i)\right] = \delta_t\cdot\frac{\sum_i w_{ti}\,y_ih_t(x_i)}{m} = a\cdot\eta_t^2. \qquad (28)$$
We also have $F^{Q\prime\prime}(z) = \varrho/(2\varrho+|z|)^2$, so

$$Y \doteq \mathbb{E}_{i\sim D}\!\left[(y_iH_t(x_i)-y_iH_{t+1}(x_i))^2\,F^{Q\prime\prime}(\beta_i)\right] = \varrho\cdot\mathbb{E}_{i\sim D}\!\left[\frac{(y_iH_t(x_i)-y_iH_{t+1}(x_i))^2}{(2\varrho+|\beta_i|)^2}\right] = \varrho\,\delta_t^2\cdot\mathbb{E}_{i\sim D}\!\left[\frac{h_t^2(x_i)}{(2\varrho+|\beta_i|)^2}\right]. \qquad (29)$$
Now, because of assumption (M), we get:

$$\mathbb{E}_{i\sim D}\!\left[\frac{h_t^2(x_i)}{(2\varrho+|\beta_i|)^2}\right] \le \frac{1}{4\varrho^2}\cdot\mathbb{E}_{i\sim D}\!\left[h_t^2(x_i)\right] \le \frac{M^2}{4\varrho^2}. \qquad (30)$$
So,

$$Y \le \frac{\delta_t^2 M^2}{4\varrho} = \frac{a^2\eta_t^2 M^2}{4\varrho}. \qquad (31)$$
We finally get

$$\mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_t(x_i))\right] - \mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_{t+1}(x_i))\right] \ge X - Y \ge \underbrace{\left(1 - \frac{aM^2}{4\varrho}\right)\cdot a}_{\doteq Z(a)}\cdot\eta_t^2. \qquad (32)$$
Suppose now that we fix any π ∈ [0,1] and then choose any

$$a \in \frac{2\varrho}{M^2}\cdot[1-\pi,\, 1+\pi]. \qquad (33)$$

It is not hard to check that Z(a) then satisfies

$$Z(a) \ge (1-\pi^2)\cdot\frac{\varrho}{M^2}, \qquad (34)$$
so we get

$$\mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_t(x_i))\right] - \mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_{t+1}(x_i))\right] \ge \frac{(1-\pi^2)\,\varrho\,\eta_t^2}{M^2}, \quad\forall t, \qquad (35)$$
and so the final classifier H_T satisfies

$$\mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_T(x_i))\right] \le F^Q(0) - \frac{(1-\pi^2)\,\varrho\cdot\sum_{t=1}^T \eta_t^2}{M^2}. \qquad (36)$$
Remark that this holds regardless of the sequence {η_t}_t. If we want to guarantee that $\mathbb{E}_{i\sim D}[F^Q(y_iH_T(x_i))] \le F^Q(z^*)$ for some z* ≥ 0, then it suffices to iterate until

$$\sum_{t=1}^T \eta_t^2 \ge \frac{F^Q(0) - F^Q(z^*)}{(1-\pi^2)\,\varrho}\cdot M^2, \qquad (37)$$

and we get the statement of the Theorem.
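The stopping rule (37) is easy to instantiate numerically. The sketch below is ours (not the paper's code) and assumes, purely for illustration, a uniform edge η_t = η at every iteration; note that ε cancels in F^Q(0) − F^Q(z*), so the iteration budget does not depend on it.

```java
// Sketch: the iteration budget implied by eq. (37) of Theorem 7, assuming
// (for illustration only) a uniform per-round edge eta_t = eta.
public class BoostBudget {
    // Convex surrogate (14), with H(x) = max(-x, 0).
    static double F(double z, double rho, double eps) {
        return -rho * Math.log(2.0 * eps + eps * Math.abs(z) / rho) + Math.max(-z, 0.0);
    }

    // Smallest T with T * eta^2 >= (F^Q(0) - F^Q(zstar)) * M^2 / ((1 - pi^2) * rho).
    static long iterations(double eta, double pi, double rho, double eps,
                           double M, double zstar) {
        double need = (F(0.0, rho, eps) - F(zstar, rho, eps)) * M * M
                      / ((1.0 - pi * pi) * rho);
        return (long) Math.ceil(need / (eta * eta));
    }

    public static void main(String[] args) {
        // eta = 0.1, pi = 0, rho = M = 1, target z* = 1: 41 iterations suffice.
        System.out.println(iterations(0.1, 0.0, 1.0, 0.1, 1.0, 1.0)); // prints 41
    }
}
```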
5 Proof of Theorem 8

The proof uses the same basic steps as the proof of Theorem 7. Denote for short

$$\tilde{w}_{ti} \doteq w_{ti} + \kappa_{ti}, \qquad (38)$$

where $w_{ti} \doteq \psi_Q^{-1}(-y_iH_t(x_i))$ are the non-quantized weights and κ_{ti} is the quantization shift in the weights. Note that we do not have access to w_{ti}. We indicate with a tilde the quantities that depend on w̃.
This time, we have for X the expression:

$$X = -\tilde{\delta}_t\,\mathbb{E}_{i\sim D}\!\left[y_ih_t(x_i)\cdot\left(-\psi_Q^{-1}(-y_iH_t(x_i))\right)\right] = \tilde{\delta}_t\,\mathbb{E}_{i\sim D}\!\left[w_{ti}\,y_ih_t(x_i)\right] = \tilde{\delta}_t\cdot\left(\frac{\sum_i \tilde{w}_{ti}\,y_ih_t(x_i)}{m} - \frac{\sum_i \kappa_{ti}\,y_ih_t(x_i)}{m}\right) = a\cdot\tilde{\eta}_t^2 - a\cdot\tilde{\eta}_t\cdot\frac{\sum_i \kappa_{ti}\,y_ih_t(x_i)}{m}, \qquad (39)$$
while the expression of Y does not change (now with the "tilde" parameters affected by the quantization of the weights). Denote for short

$$\Delta_t \doteq \frac{\sum_i \kappa_{ti}\,y_ih_t(x_i)}{m}. \qquad (40)$$
We get, in lieu of (32),

$$\mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_t(x_i))\right] - \mathbb{E}_{i\sim D}\!\left[F^Q(y_iH_{t+1}(x_i))\right] \ge X - Y \ge \left(1 - \frac{\Delta_t}{\tilde{\eta}_t} - \frac{aM^2}{4\varrho}\right)\cdot a\,\tilde{\eta}_t^2 = \underbrace{\left(\frac{4\varrho}{M^2}\cdot\frac{\tilde{\eta}_t - \Delta_t}{\tilde{\eta}_t} - a\right)\cdot a}_{\doteq Z(a)}\cdot\frac{M^2\tilde{\eta}_t^2}{4\varrho}. \qquad (41)$$
Choose

$$a \in \frac{2\varrho}{M^2}\cdot\left[\frac{\tilde{\eta}_t - \Delta_t}{\tilde{\eta}_t} - \pi,\; \frac{\tilde{\eta}_t - \Delta_t}{\tilde{\eta}_t} + \pi\right], \qquad (42)$$
for any 0 ≤ π ≤ |η̃_t − Δ_t|/η̃_t. It follows that

$$Z(a)\cdot\frac{M^2\tilde{\eta}_t^2}{4\varrho} \ge \left(\left(\frac{\tilde{\eta}_t - \Delta_t}{\tilde{\eta}_t}\right)^2 - \pi^2\right)\cdot\frac{\varrho}{M^2}\cdot\tilde{\eta}_t^2. \qquad (43)$$
Suppose that the quantization shift satisfies |η̃_t − Δ_t| ≥ ζ·|η̃_t| (which holds if |Δ_t| ≤ (1−ζ)·|η̃_t|) for some ζ > 0. We obtain that for any 0 ≤ π < ζ,

$$Z(a)\cdot\frac{M^2\tilde{\eta}_t^2}{4\varrho} \ge \left(\zeta^2 - \pi^2\right)\cdot\frac{\varrho}{M^2}\cdot\tilde{\eta}_t^2 > 0, \qquad (44)$$
which leads to the statement of the Theorem after posing κ_t ≐ |Δ_t|.

Remark: assumption (Q) is in fact stronger than what would really be needed to obtain the Theorem. Under some conditions, we could indeed accept |Δ_t| > (1−ζ)·|η̃_t|; but in the derivations above, the shift in weights due to quantization would then act as a disguised way of strengthening weak learning. Clearly, such an assumption, where quantization compensates for the weakness of the weak classifiers, is unfit in a boosting setting.
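Assumption (Q) is checkable at run time: Δ_t in (40) is computable from the quantization shifts κ_ti, and progress is guaranteed whenever |Δ_t| ≤ (1−ζ)·|η̃_t|. A small sketch (ours, with toy arrays; not the paper's code):

```java
// Sketch: run-time check of assumption (Q) in Theorem 8.
// Delta_t = (1/m) * sum_i kappa_ti * y_i * h_t(x_i), eq. (40);
// progress holds whenever |Delta_t| <= (1 - zeta) * |eta_t~|.
public class QuantizationCheck {
    static double delta(double[] kappa, int[] y, double[] h) {
        double s = 0.0;
        for (int i = 0; i < kappa.length; i++) s += kappa[i] * y[i] * h[i];
        return s / kappa.length;
    }

    static boolean safe(double deltaT, double etaTilde, double zeta) {
        return Math.abs(deltaT) <= (1.0 - zeta) * Math.abs(etaTilde);
    }

    public static void main(String[] args) {
        double[] kappa = { 0.01, -0.02, 0.005, 0.0 };  // toy quantization shifts
        int[] y = { 1, -1, 1, -1 };
        double[] h = { 0.5, -0.25, 1.0, 0.75 };
        double d = delta(kappa, y, h);
        System.out.println("Delta_t = " + d + ", safe: " + safe(d, 0.2, 0.5));
    }
}
```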
6 Proof of Theorem 10

We assume basic knowledge of the proofs of Kearns & Mansour (1996). We shall briefly present the proof scheme as well as the notations, which we keep identical to Kearns & Mansour (1996) for readability.
The basis of the proof is to show that each time a leaf is replaced by a split under the weak learning assumption, there is a sufficient decrease of L(H). Denote by H⁺ the tree H in which a leaf λ has been replaced by a split indexed by some g : ℝ → {0,1} satisfying the weak learning assumption. The decrease in L(·), Δ ≐ L(H) − L(H⁺), is lowerbounded as a function of γ, and then used to lowerbound the number of iterations (each of which is the replacement of a leaf by a binary subtree) needed to reach a given value of L(·).
It turns out that Δ can be abstracted into a quantity that is easier to analyze, Δ ≐ ω(λ)·ΔL^Q(q, τ, δ), with

$$\Delta L^Q(q,\tau,\delta) \doteq L^Q(q) - (1-\tau)\,L^Q(q-\tau\delta) - \tau\,L^Q(q+(1-\tau)\delta), \qquad (45)$$

where q ≐ q(λ) and δ = γq(1−q)/(τ(1−τ)), with τ denoting the relative proportion of examples for which g = +1 in leaf λ, following Kearns & Mansour (1996). The following Lemma is the key to the proof of Theorem 10.
Lemma 1 Suppose the weak hypothesis assumption is satisfied for the current split, for some constant γ > 0. For any q, τ ∈ [0,1], using δ = γq(1−q)/(τ(1−τ)) yields:

$$\Delta L^Q(q,\tau,\delta) \ge \frac{\gamma^2}{2}. \qquad (46)$$
Proof Our proof follows the proof of Kearns & Mansour (1996).
Lemma 2 Suppose τ ≤ 1/2, q > 1/2 or τ ≥ 1/2, q < 1/2. If γ ≤ 1/25, then ΔL^Q(q, τ, δ) is minimized by some τ ∈ [0.4, 0.6].
Proof To prove the Lemma, we use the trick of Kearns & Mansour (1996, Lemma 4), which consists of studying the function

$$U(q,X) \doteq L^Q(q-X) + X\,L^{Q\prime}(q-X) = \begin{cases}\log(q-X) + \frac{X}{q-X} + 1 - 2q & \text{if } q-X \le \frac{1}{2}\\[1mm] \log(1-q+X) - \frac{X}{1-q+X} - 1 + 2q & \text{if } q-X > \frac{1}{2}\end{cases} \qquad (47)$$
and show that

$$U(q,\tau\delta) \le U(q,-(1-\tau)\delta), \quad\forall\tau\le 0.4, \qquad (48)$$
$$U(q,\tau\delta) \ge U(q,-(1-\tau)\delta), \quad\forall\tau\ge 0.6. \qquad (49)$$
Case 1: τ ≤ 0.4 (and therefore q < 1/2). We have two subcases to show (48).
Case 1.1: q + (1−τ)δ < 1/2. In this case, q − X < 1/2 for both instantiations of X in (48). We then have

$$U(q,\tau\delta) = \log\left(1 - \frac{\gamma(1-q)}{1-\tau}\right) + \frac{\frac{\gamma(1-q)}{1-\tau}}{1 - \frac{\gamma(1-q)}{1-\tau}} + 1 - 2q + \log q \qquad (50)$$

$$= \log\left(\frac{\tau - 1 + \gamma(1-q)}{\tau - 1}\right) - \frac{\gamma(1-q)}{\tau - 1 + \gamma(1-q)} + 1 - 2q + \log q, \qquad (51)$$

$$U(q,-(1-\tau)\delta) = \log\left(1 + \frac{\gamma(1-q)}{\tau}\right) - \frac{\frac{\gamma(1-q)}{\tau}}{1 + \frac{\gamma(1-q)}{\tau}} + 1 - 2q + \log q \qquad (52)$$

$$= \log\left(\frac{\tau + \gamma(1-q)}{\tau}\right) - \frac{\gamma(1-q)}{\tau + \gamma(1-q)} + 1 - 2q + \log q, \qquad (53)$$
so (48) is equivalent to showing

$$\log\left(\frac{\tau-1+\gamma(1-q)}{\tau-1}\right) - \frac{\gamma(1-q)}{\tau-1+\gamma(1-q)} \le \log\left(\frac{\tau+\gamma(1-q)}{\tau}\right) - \frac{\gamma(1-q)}{\tau+\gamma(1-q)}, \qquad (54)$$

which, after reorganising and simplification, amounts to showing

$$\log\left(1 - \frac{\gamma(1-q)}{(\tau+\gamma(1-q))(1-\tau)}\right) \le -\frac{\gamma(1-q)}{(\tau+\gamma(1-q))(1-\tau-\gamma(1-q))}. \qquad (55)$$
We remark that for the log in (51) to be defined, we must have τ < 1 − γ(1−q), which implies that the RHS of (55) is negative. To show (55), we use the fact that log(1−X) ≤ −X − X²/2 when X ≥ 0; so, fixing X ≐ γ(1−q)/((τ+γ(1−q))(1−τ)), we obtain

$$\log\left(1 - \frac{\gamma(1-q)}{(\tau+\gamma(1-q))(1-\tau)}\right) \le -\frac{\gamma(1-q)}{\tau+\gamma(1-q)}\cdot\left(\frac{1}{1-\tau} + \frac{\gamma(1-q)}{2(\tau+\gamma(1-q))(1-\tau)^2}\right). \qquad (56)$$
To show (55), it is then enough to show

$$\frac{1}{1-\tau-\gamma(1-q)} \le \frac{1}{1-\tau} + \frac{\gamma(1-q)}{2(\tau+\gamma(1-q))(1-\tau)^2}, \qquad (57)$$

which, after simplification, is equivalent to

$$\frac{1-\tau-\gamma(1-q)}{2(\tau+\gamma(1-q))(1-\tau)} \ge 1, \qquad (58)$$
or, equivalently, 3τ − 2τ² + 3γ(1−q) − 2τγ(1−q) ≤ 1. Since τ ≤ 2/5, we have 3τ − 2τ² ≤ 22/25. If we pick γ ≤ 1/25, then 3γ(1−q) − 2τγ(1−q) ≤ 3γ(1−q) ≤ 3γ = 3/25, so that 3τ − 2τ² + 3γ(1−q) − 2τγ(1−q) ≤ 1, as claimed (end of Case 1.1).
Case 1.2: q + (1−τ)δ > 1/2. In this case,

$$U(q,-(1-\tau)\delta) = \log\left(1 - \frac{\gamma q}{\tau}\right) + \frac{\frac{\gamma q}{\tau}}{1 - \frac{\gamma q}{\tau}} + 1 - 2(1-q) + \log(1-q) \qquad (59)$$

$$= \log\left(\frac{\tau-\gamma q}{\tau}\right) + \frac{\gamma q}{\tau-\gamma q} + 2q - 1 + \log(1-q). \qquad (60)$$
We also remark that 1 − 2q + log q ≤ 2q − 1 + log(1−q) for q < 1/2, so to prove (48) it is sufficient to show

$$\log\left(\frac{\tau-1+\gamma(1-q)}{\tau-1}\right) - \frac{\gamma(1-q)}{\tau-1+\gamma(1-q)} \le \log\left(\frac{\tau-\gamma q}{\tau}\right) + \frac{\gamma q}{\tau-\gamma q}, \qquad (61)$$
which reduces after simplification to showing that

$$\log\left(1 + \frac{\gamma(q-\tau)}{(\tau-\gamma q)(1-\tau)}\right) \le \frac{\gamma(q-\tau)}{(\tau-\gamma q)(1-\tau-\gamma(1-q))}. \qquad (62)$$
Because q + (1−τ)δ > 1/2, if τ ≥ 10γq(1−q), then q > 0.4 and therefore q > τ. If, on the other hand, τ ≤ 10γq(1−q), then γ ≤ 1/10 also implies τ ≤ q. To summarize, q + (1−τ)δ > 1/2 and γ ≤ 1/10 imply q ≥ τ.
Using the fact that log(1+X) ≤ X and γ(1−q) ≥ 0, we easily obtain the proof of (62) via the chain of inequalities

$$\log\left(1 + \frac{\gamma(q-\tau)}{(\tau-\gamma q)(1-\tau)}\right) \le \frac{\gamma(q-\tau)}{(\tau-\gamma q)(1-\tau)} \le \frac{\gamma(q-\tau)}{(\tau-\gamma q)(1-\tau-\gamma(1-q))}. \qquad (63)$$
This ends the proof of Case 1.
Case 2: τ ≥ 0.6 (and therefore q > 1/2). We again have two subcases, this time to show (49).
Case 2.1: q − τδ > 1/2. In this case, q − X > 1/2 for both instantiations of X in (49). We then have

$$U(q,\tau\delta) = \log\left(1 + \frac{\gamma q}{1-\tau}\right) - \frac{\gamma q}{1-\tau+\gamma q} - 1 + 2q + \log(1-q), \qquad (64)$$

$$U(q,-(1-\tau)\delta) = \log\left(1 - \frac{\gamma q}{\tau}\right) + \frac{\gamma q}{\tau-\gamma q} - 1 + 2q + \log(1-q). \qquad (65)$$
To show (49), it is thus sufficient to show that

$$\log\left(1 + \frac{\gamma q}{1-\tau}\right) - \frac{\gamma q}{1-\tau+\gamma q} \ge \log\left(1 - \frac{\gamma q}{\tau}\right) + \frac{\gamma q}{\tau-\gamma q}, \qquad (66)$$
or equivalently, after reordering and simplifying,

$$\log\left(1 - \frac{\gamma q}{\tau(1-\tau+\gamma q)}\right) \le -\frac{\gamma q}{(\tau-\gamma q)(1-\tau+\gamma q)}, \qquad (67)$$
which is (55) with the substitutions τ ↦ 1−τ and q ↦ 1−q. Since then 1−τ ≤ 0.4, we can directly apply the proof of (55), which ends the proof of Case 2.1.
Case 2.2: q − τδ < 1/2. In this case,

$$U(q,\tau\delta) = \log\left(1 - \frac{\gamma(1-q)}{1-\tau}\right) + \frac{\gamma(1-q)}{1-\tau-\gamma(1-q)} + 1 - 2q + \log q, \qquad (68)$$

while we still have

$$U(q,-(1-\tau)\delta) = \log\left(1 - \frac{\gamma q}{\tau}\right) + \frac{\gamma q}{\tau-\gamma q} - 1 + 2q + \log(1-q), \qquad (69)$$
and so we want to show

$$\log\left(1 - \frac{\gamma q}{\tau}\right) + \frac{\gamma q}{\tau-\gamma q} - 1 + 2q + \log(1-q) \le \log\left(1 - \frac{\gamma(1-q)}{1-\tau}\right) + \frac{\gamma(1-q)}{1-\tau-\gamma(1-q)} + 1 - 2q + \log q. \qquad (70)$$
We also remark that −1 + 2q + log(1−q) ≤ 1 − 2q + log q for q > 1/2, so to prove (70) it is sufficient to show

$$\log\left(1 - \frac{\gamma q}{\tau}\right) + \frac{\gamma q}{\tau-\gamma q} \le \log\left(1 - \frac{\gamma(1-q)}{1-\tau}\right) + \frac{\gamma(1-q)}{1-\tau-\gamma(1-q)}, \qquad (71)$$
which reduces after simplification to showing that

$$\log\left(1 + \frac{\gamma(\tau-q)}{(1-\tau-\gamma(1-q))\,\tau}\right) \le \frac{\gamma(\tau-q)}{(\tau-\gamma q)(1-\tau-\gamma(1-q))}, \qquad (72)$$
which turns out to be (62) with the substitutions τ ↦ 1−τ and q ↦ 1−q. Since then 1−τ ≤ 0.4, we can directly apply the proof of (62), which ends the proof of Case 2.2, and the proof of Lemma 2 as well. (End of the proof of Lemma 2.)
Following Kearns & Mansour (1996), we define

$$F_{L^Q}(q,\tau,\delta) \doteq -\frac{\tau(1-\tau)\delta^2}{2}\,L^{Q\prime\prime}(q) - \frac{\tau(1-\tau)(1-2\tau)\delta^3}{6}\,L^{Q(3)}(q). \qquad (73)$$
We now state and prove the equivalent of Kearns & Mansour (1996, Lemma 3).

Lemma 3 For any q, τ, δ ∈ [0,1],

$$\Delta L^Q(q,\tau,\delta) \ge F_{L^Q}(q,\tau,\delta). \qquad (74)$$
Proof We have

$$L^{Q(k)}(q) = \varrho\cdot\begin{cases}\frac{(-1)^{k-1}(k-1)!}{q^k} - 2\cdot\llbracket k=1\rrbracket & \text{if } q < 1/2\\[1mm] -\frac{(k-1)!}{(1-q)^k} + 2\cdot\llbracket k=1\rrbracket & \text{if } q > 1/2\end{cases}, \qquad (75)$$
and we check that only the first and second order derivatives are defined at q = 1/2. Since L^Q is symmetric around 1/2, ΔL^Q satisfies

$$\Delta L^Q(1-q, 1-\tau, \delta) = L^Q(1-q) - \tau\,L^Q(1-q-(1-\tau)\delta) - (1-\tau)\,L^Q(1-q+\tau\delta)$$

$$= L^Q(q) - \tau\,L^Q\big(1-(q+(1-\tau)\delta)\big) - (1-\tau)\,L^Q\big(1-(q-\tau\delta)\big)$$

$$= L^Q(q) - \tau\,L^Q(q+(1-\tau)\delta) - (1-\tau)\,L^Q(q-\tau\delta) = \Delta L^Q(q,\tau,\delta), \qquad (76)$$
so we can study ΔL^Q for q > 1/2 without loss of generality. In this case, all derivatives of L^Q of order k ≥ 4 are negative, which from Kearns & Mansour (1996, Lemma 3) guarantees that

$$\Delta L^Q(q,\tau,\delta) \ge F_{L^Q}(q,\tau,\delta), \qquad (77)$$

as claimed. (End of the proof of Lemma 3.)
We now lowerbound F_{L^Q}(q,τ,δ), which, from Lemma 3, will also provide a lowerbound on the decrease ΔL^Q(q,τ,δ), and in fact will prove Lemma 1. From now on, let us fix δ = γq(1−q)/(τ(1−τ)). If we denote V(τ,q) ≐ (1−2τ)(q − ⟦q < 1/2⟧), then

$$F_{L^Q}(q,\tau,\delta) = \max\{q,1-q\}^2\,\gamma^2\cdot\left(\frac{1}{2\tau(1-\tau)} + \frac{\gamma}{3\tau^2(1-\tau)^2}\cdot V(\tau,q)\right). \qquad (78)$$
We immediately obtain:

Lemma 4 Let δ = γq(1−q)/(τ(1−τ)). Then, for any τ, q such that V(τ,q) ≥ 0,

$$F_{L^Q}(q,\tau,\delta) \ge \frac{\gamma^2}{2}. \qquad (79)$$

Proof For any τ, q such that V(τ,q) ≥ 0, we have

$$F_{L^Q}(q,\tau,\delta) \ge \max\{q,1-q\}^2\,\gamma^2\cdot\frac{1}{2\tau(1-\tau)} \ge \frac{1}{4}\cdot\gamma^2\cdot 2 = \frac{\gamma^2}{2}, \qquad (80)$$

as claimed. (End of the proof of Lemma 4.)
Lemma 4 means that when τ ≤ 1/2, q < 1/2 or τ ≥ 1/2, q > 1/2, the drop ΔL^Q(q,τ,δ) is guaranteed to be "big". If this does not happen, we make use of Lemma 2. In this case, picking wlog τ ≤ 1/2, q > 1/2, we get:
$$F_{L^Q}(q,\tau,\delta) = \max\{q,1-q\}^2\,\gamma^2\cdot\left(\frac{1}{2\tau(1-\tau)} - \frac{\gamma(1-2\tau)(1-q)}{3\tau^2(1-\tau)^2}\right) \ge \frac{\gamma^2}{2}\cdot\left(2 - \frac{\gamma(1-2\cdot 0.4)}{3\cdot 0.4^2\,(1-0.4)^2}\right) = \gamma^2\cdot\left(1 - \frac{625\gamma}{216}\right) \ge \gamma^2\cdot\left(1 - \frac{25}{216}\right) \ge \frac{\gamma^2}{2},$$

which therefore implies that F_{L^Q}(q,τ,δ) ≥ γ²/2 in all cases. We just have to use Lemma 3 to finish the proof of Lemma 1. (End of the proof of Lemma 1.)
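Lemma 1 is easy to spot-check numerically: taking ϱ = 1 and dropping the additive constant −log ε (it cancels in (45), since the mixture coefficients sum to zero), the drop ΔL^Q(q, τ, δ) can be evaluated on a grid and compared with γ²/2. The sketch below is ours, not the paper's code:

```java
// Sketch (ours): numerical spot-check of Lemma 1 with rho = 1.
// L^Q(u) = log(err(u)) + 1 - 2 err(u), up to the constant -log(eps),
// which cancels in Delta L^Q (coefficients 1, -(1-tau), -tau sum to 0).
public class Lemma1Check {
    static double err(double u) { return Math.min(u, 1.0 - u); }
    static double L(double u) { return Math.log(err(u)) + 1.0 - 2.0 * err(u); }

    // The drop of eq. (45).
    static double drop(double q, double tau, double delta) {
        return L(q) - (1.0 - tau) * L(q - tau * delta) - tau * L(q + (1.0 - tau) * delta);
    }

    public static void main(String[] args) {
        double gamma = 0.02;                               // well within gamma <= 1/25
        for (double q = 0.1; q <= 0.91; q += 0.1)
            for (double tau = 0.1; tau <= 0.91; tau += 0.1) {
                double delta = gamma * q * (1.0 - q) / (tau * (1.0 - tau));
                if (drop(q, tau, delta) < gamma * gamma / 2.0)
                    throw new AssertionError("Lemma 1 violated at q=" + q + ", tau=" + tau);
            }
        System.out.println("Lemma 1 holds on the grid");
    }
}
```

The bound is nearly tight at q = τ = 1/2, where the drop is only slightly above γ²/2, consistent with the second-order term of (73) dominating there.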
We can now finish the proof of Theorem 10. Suppose the current tree H has t leaves. There must be a leaf with ω(λ) ≥ 1/t, so

$$\Delta \doteq L^Q(H) - L^Q(H^+) = \omega(\lambda)\,\Delta L^Q(q,\tau,\delta) \ge \frac{\gamma^2}{2t} \ge \frac{\gamma^2}{2t}\cdot\frac{L^Q(H)}{L^Q(H_0)}, \qquad (81)$$
where the last inequality follows from the concavity of L^Q, letting H_0 be the single-root-node tree, for which L^Q(H_0) = L^Q(q(S)), and more generally H_t a tree with t+1 leaves (thus obtained after t iterations of the boosting procedure). This yields the recurrence relation

$$L^Q(H_{t+1}) \le \left(1 - \frac{\gamma^2}{2L^Q(q(S))\cdot t}\right)\cdot L^Q(H_t), \qquad (82)$$
and we get (see Kearns & Mansour (1996), proof of Theorem 10)

$$L^Q(H_t) \le \exp\left(-\frac{\gamma^2\log t}{4L^Q(q(S))}\right)\cdot L^Q(q(S)); \qquad (83)$$
to obtain L^Q(H_t) ≤ ρ·L^Q(q(S)) for ρ ∈ (0,1], it therefore suffices that

$$t \ge \left(\frac{1}{\rho}\right)^{\frac{4L^Q(q(S))}{\gamma^2}}. \qquad (84)$$
We finally remark that L^Q(q(S)) ≤ ϱ·log(1/(2ε)) and conclude that (84) holds when

$$t \ge \left(\frac{1}{\rho}\right)^{\frac{4\varrho}{\gamma^2}\log\frac{1}{2\varepsilon}}, \qquad (85)$$

as claimed.

Remark: at this stage we can compare our guarantees to those of Kearns & Mansour (1996). Knowledge of their proofs immediately sheds light on the fact that our lowerbound on ΔL^Q(q,τ,δ) in Lemma 1 does not depend on q, whereas all of theirs do (Kearns & Mansour, 1996, Lemmata 5, 6, 7), and in fact vanish as q → 0, 1. A closer look at the weak learning assumption shows that it in fact precludes this extreme regime for q, as it enforces q ∈ [τδ, 1 − (1−τ)δ] when δ ≤ 1; as a consequence, their bounds can also be reformulated to exclude q, and the convergence rate of their best splitting criterion is within the same order as ours.
7 Experiments in extenso
7.1 Implementation

We give here a few details on the implementation. The Java implementation of the algorithms, available separately, implements the versions of Nock & Nielsen (2006) and Schapire & Singer (1999), respectively, for ADABOOSTR and AdaBoost.
The implementation of RATBOOSTE uses methods from class Math that can throw an ArithmeticException when a long overflow happens, in which case we catch the exception and redo the corresponding method after quantization. To make the code faster, we have also included the possibility to trigger quantization when the encoding length of the longs exceeds a user-fixed threshold.
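The Math methods in question are presumably the exact-arithmetic family (Math.multiplyExact, Math.addExact), which throw ArithmeticException on long overflow. A minimal sketch of the catch-and-redo pattern described above (ours; the quantize step is a hypothetical stand-in, not RATBOOSTE's actual quantization):

```java
// Sketch of the overflow-triggered quantization pattern. Math.multiplyExact
// throws ArithmeticException on long overflow; we catch it and retry after
// shortening the operands. quantize() is a stand-in for the real step.
public class OverflowGuard {
    // Stand-in quantization: drop low-order bits to shorten encodings.
    static long quantize(long x, int bits) { return x >> bits; }

    static long safeMul(long x, long y) {
        try {
            return Math.multiplyExact(x, y);
        } catch (ArithmeticException overflow) {
            // Redo the operation on quantized (shortened) operands.
            return Math.multiplyExact(quantize(x, 16), quantize(y, 16));
        }
    }

    public static void main(String[] args) {
        // 2^40 * 2^40 overflows long; the quantized product is 2^48.
        System.out.println(safeMul(1L << 40, 1L << 40));
    }
}
```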
The implementation of RATBOOSTAb uses a regular k-means with Forgy initialization. If you want to optimize this with your best hard-clustering algorithm, you just have to rewrite a few methods from class KMeans R in file Misc.java. Note that the implementation also allows using stochastic weight assignation with adaptive quantization (a combination of RATBOOSTAb and RATBOOSTQb), but it is not reported (see README).
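Adaptive quantization clusters the current weights into 2^b values, for which a plain 1-D k-means with Forgy initialization (random data points as initial centroids) suffices. A minimal sketch (ours, not the code of KMeans R in Misc.java):

```java
import java.util.Random;

// Minimal 1-D k-means with Forgy initialization, as a stand-in for the
// weight quantization of RATBOOSTAb: weights are mapped to k = 2^b centroids.
public class WeightKMeans {
    static double[] cluster(double[] w, int k, int iters, long seed) {
        Random rnd = new Random(seed);
        double[] c = new double[k];
        for (int j = 0; j < k; j++) c[j] = w[rnd.nextInt(w.length)]; // Forgy init
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[k];
            int[] cnt = new int[k];
            for (double x : w) {                       // assignment step
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(x - c[j]) < Math.abs(x - c[best])) best = j;
                sum[best] += x; cnt[best]++;
            }
            for (int j = 0; j < k; j++)                // update step
                if (cnt[j] > 0) c[j] = sum[j] / cnt[j];
        }
        return c;
    }

    public static void main(String[] args) {
        double[] w = { 0.01, 0.02, 0.03, 0.5, 0.52, 0.9, 0.91, 0.95 };
        double[] c = cluster(w, 4, 20, 7L);            // b = 2 bits -> 4 centroids
        java.util.Arrays.sort(c);
        System.out.println(java.util.Arrays.toString(c));
    }
}
```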
Domain summary Table

Table 1 details the UCI domains we have used (Blake et al., 1998). We now detail the per-domain training curves when there is no stopping criterion (other than boosting for 10 000 iterations). In the results reported in Tables 1 (main file) and 2 (this file), we keep the classifier that minimizes the empirical risk among all iterations, which amounts to a cutoff point for boosting around the minimal values of each curve (because of statistical uncertainty, we have no guarantee that this is minimal on testing). Results of ADABOOSTR are omitted so as not to clutter the plots, but they are included in the full Table 2.
| Domain | m | d |
|---|---|---|
| Fertility | 100 | 9 |
| Haberman | 306 | 3 |
| Transfusion | 748 | 4 |
| Banknote | 1 372 | 4 |
| Breast wisc | 699 | 9 |
| Ionosphere | 351 | 33 |
| Sonar | 208 | 60 |
| Yeast | 1 484 | 7 |
| Wine-red | 1 599 | 11 |
| Cardiotocography (*) | 2 126 | 9 |
| CreditCardSmall (**) | 1 000 | 23 |
| Abalone | 4 177 | 8 |
| Qsar | 1 055 | 41 |
| Wine-white | 4 898 | 11 |
| Page | 5 473 | 10 |
| Mice | 1 080 | 77 |
| Hill+noise | 1 212 | 100 |
| Hill+nonoise | 1 212 | 100 |
| Firmteacher | 10 800 | 16 |
| Magic | 19 020 | 10 |
| EEG | 14 980 | 14 |
| Skin | 245 057 | 3 |
| Musk | 6 598 | 166 |
| Hardware | 28 179 | 95 |
| Twitter (***) | 583 250 | 77 |

Table 1: UCI domains considered in our experiments (m = total number of examples, d = number of features), ordered by increasing m × d. (*) we used features 13-21 as descriptors; (**) we used the first 1 000 examples of the UCI domain; (***) due to the size of the domain, only AdaBoost and ADABOOSTR were run for T = 5000 iterations; the other algorithms were run for a smaller T′ = 1000 iterations.
UCI fertility

Figure 1: UCI domain fertility. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

[Five panels: b = 2, b = 3, b = 4, b = 5, b = 6]

Figure 2: UCI domain fertility. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI haberman

Figure 3: UCI domain haberman. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

[Five panels: b = 2, b = 3, b = 4, b = 5, b = 6]

Figure 4: UCI domain haberman. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI transfusion

Figure 5: UCI domain transfusion. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

[Five panels: b = 2, b = 3, b = 4, b = 5, b = 6]

Figure 6: UCI domain transfusion. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI banknote

Figure 7: UCI domain banknote. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

[Five panels: b = 2, b = 3, b = 4, b = 5, b = 6]

Figure 8: UCI domain banknote. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI breastwisc

Figure 9: UCI domain breastwisc. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

[Five panels: b = 2, b = 3, b = 4, b = 5, b = 6]

Figure 10: UCI domain breastwisc. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI ionosphere

Figure 11: UCI domain ionosphere. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

[Five panels: b = 2, b = 3, b = 4, b = 5, b = 6]

Figure 12: UCI domain ionosphere. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI sonar

Figure 13: UCI domain sonar. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

[Five panels: b = 2, b = 3, b = 4, b = 5, b = 6]

Figure 14: UCI domain sonar. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI yeast

Figure 15: UCI domain yeast. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 16: UCI domain yeast. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI winered

Figure 17: UCI domain winered. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 18: UCI domain winered. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI cardiotocography

Figure 19: UCI domain cardiotocography. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 20: UCI domain cardiotocography. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI creditcardsmall

Figure 21: UCI domain creditcardsmall. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 22: UCI domain creditcardsmall. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI abalone

Figure 23: UCI domain abalone. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 24: UCI domain abalone. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI qsar

Figure 25: UCI domain qsar. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 26: UCI domain qsar. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI winewhite

Figure 27: UCI domain winewhite. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 28: UCI domain winewhite. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI page

Figure 29: UCI domain page. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 30: UCI domain page. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI mice

Figure 31: UCI domain mice. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 32: UCI domain mice. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI hill+noise

Figure 33: UCI domain hill+noise. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 34: UCI domain hill+noise. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI hill+nonoise

Figure 35: UCI domain hill+nonoise. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 36: UCI domain hill+nonoise. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI firmteacher

Figure 37: UCI domain firmteacher. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 38: UCI domain firmteacher. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI magic

Figure 39: UCI domain magic. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 40: UCI domain magic. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI eeg

Figure 41: UCI domain eeg. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 42: UCI domain eeg. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI skin

Figure 43: UCI domain skin. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 44: UCI domain skin. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI musk

Figure 45: UCI domain musk. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 46: UCI domain musk. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI hardware

Figure 47: UCI domain hardware. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 10000 iterations.

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 48: UCI domain hardware. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 10000 iterations.
UCI twitter

Figure 49: UCI domain twitter. Results comparing AdaBoost (blue), RATBOOST (green) and RATBOOSTE (purple). Note: there is no other stopping criterion apart from running for T = 5000 iterations (AdaBoost) and T′ = 1000 iterations (RATBOOST, RATBOOSTE).

b = 2 b = 3 b = 4 b = 5 b = 6

Figure 50: UCI domain twitter. Results comparing AdaBoost (blue), RATBOOST (green) and the quantized versions RATBOOSTAb (black) / RATBOOSTQb (thin orange) / RATBOOSTSb (red), for various values of the quantization index bit-size b. Note: there is no other stopping criterion apart from running for T = 5000 iterations (AdaBoost) and T′ = 1000 iterations (RATBOOST, RATBOOSTAb, RATBOOSTQb, RATBOOSTSb).
Summary of Results
References

Blake, C. L., Keogh, E., and Merz, C. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Buja, A., Stuetzle, W., and Shen, Y. Loss functions for binary class probability estimation and classification: structure and applications. Technical Report, University of Pennsylvania, 2005.

Kearns, M. and Mansour, Y. On the boosting ability of top-down decision tree learning algorithms. In Proc. of the 28th ACM STOC, pp. 459–468, 1996.

Kearns, M. J. and Mansour, Y. A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Proc. of the 15th International Conference on Machine Learning, pp. 269–277, 1998.

Nock, R. and Nielsen, F. A real generalization of discrete AdaBoost. In Proc. of the 17th European Conference on Artificial Intelligence, pp. 509–515, 2006.

Nock, R. and Nielsen, F. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pp. 1201–1208, 2008.

Reid, M. D. and Williamson, R. C. Composite binary losses. JMLR, 11:2387–2422, 2010.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297–336, 1999.

Schervish, M. J. A general method for comparing probability assessors. Ann. of Stat., 17(4):1856–1879, 1989.

Shuford, Jr., E. H., Albert, A., and Massengill, H. E. Admissible probability measurement procedures. Psychometrika, 31:125–145, 1966.
| Domain | AdaBoost | ADABOOSTR | RATBOOST | RATBOOSTE | Q, b=2 | Q, b=3 | Q, b=4 | Q, b=5 | Q, b=6 | S, b=2 | S, b=3 | S, b=4 | S, b=5 | S, b=6 | A, b=2 | A, b=3 | A, b=4 | A, b=5 | A, b=6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F | 38.00±10.33 | 37.00±9.49 | 40.00±9.43 | 40.00±11.55 | 47.00±14.94 | 47.00±14.94 | 39.00±15.24 | 42.00±11.35 | 42.00±11.35 | 38.00±7.89 | 46.00±18.38 | 39.00±7.38 | 47.00±14.18 | 52.00±16.19 | 41.00±7.38 | 46.00±12.65 | 43.00±9.49 | 47.00±14.94 | 39.00±8.76 |
| H | 25.53±8.79 | 25.53±8.79 | 25.85±8.32 | 26.81±9.71 | 25.84±9.83 | 25.84±9.83 | 25.84±9.83 | 26.48±9.07 | 26.81±9.71 | 25.84±9.34 | 26.51±8.86 | 26.49±8.96 | 25.52±9.65 | 26.17±10.01 | 25.52±9.65 | 25.52±8.90 | 29.80±11.78 | 28.18±11.36 | 25.84±8.31 |
| T | 38.78±6.86 | 39.05±6.68 | 38.92±7.15 | 34.91±7.25 | 39.18±7.02 | 39.18±7.22 | 38.53±9.35 | 38.52±7.20 | 40.66±7.98 | 30.24±8.27 | 34.65±8.25 | 35.31±6.62 | 37.58±5.44 | 38.24±6.92 | 33.59±8.06 | 30.91±8.44 | 32.37±6.14 | 32.78±5.57 | 35.71±5.39 |
| B | 2.70±1.62 | 2.63±1.55 | 2.99±1.70 | 4.89±1.89 | 15.46±2.59 | 15.46±2.59 | 15.46±2.59 | 13.64±2.96 | 13.93±2.97 | 7.73±3.12 | 4.45±2.32 | 2.77±1.78 | 2.99±2.08 | 3.43±1.33 | 12.54±3.22 | 11.01±4.25 | 8.53±2.85 | 5.32±2.07 | 3.57±2.16 |
| BW | 2.86±2.78 | 2.86±2.78 | 3.29±2.52 | 3.14±2.59 | 10.02±3.83 | 4.01±3.07 | 4.72±2.94 | 4.15±2.56 | 3.58±2.72 | 2.86±2.13 | 3.14±2.68 | 3.00±2.73 | 2.86±2.43 | 3.00±2.65 | 4.29±3.09 | 4.29±2.85 | 3.29±2.34 | 3.29±1.91 | 3.00±2.37 |
| I | 11.39±4.01 | 11.11±3.91 | 11.68±3.92 | 12.54±5.26 | 25.37±5.82 | 15.10±4.28 | 13.67±4.82 | 14.82±5.85 | 13.69±4.25 | 12.83±3.64 | 13.97±4.58 | 12.82±4.89 | 13.97±4.38 | 13.40±5.41 | 14.53±4.94 | 11.40±2.72 | 11.96±2.21 | 11.41±4.49 | 13.40±3.60 |
| S | 20.67±7.12 | 20.67±7.12 | 21.64±6.47 | 25.48±9.88 | 30.69±12.30 | 25.88±15.02 | 25.50±11.50 | 27.40±11.23 | 27.38±9.72 | 24.02±8.71 | 28.40±8.74 | 28.86±9.11 | 28.38±9.25 | 26.07±9.29 | 23.10±9.62 | 22.62±9.88 | 26.90±7.79 | 28.83±9.37 | 24.05±11.76 |
| Y | 48.18±4.43 | 48.18±4.43 | 48.59±4.59 | 34.04±7.09 | 48.52±4.00 | 48.52±4.00 | 48.79±3.47 | 49.06±4.15 | 48.45±4.60 | 49.33±3.83 | 47.17±5.18 | 46.23±4.47 | 47.11±4.47 | 46.77±3.67 | 48.79±2.64 | 49.53±3.18 | 47.71±4.02 | 48.72±3.69 | 49.33±3.97 |
| WR | 26.14±3.02 | 26.14±3.15 | 25.45±3.70 | 26.27±3.18 | 30.77±3.48 | 27.52±3.13 | 27.46±3.66 | 26.83±4.30 | 27.02±4.23 | 27.08±3.36 | 26.71±4.02 | 26.20±3.94 | 26.20±4.04 | 26.89±4.08 | 27.64±3.45 | 27.14±4.02 | 27.14±2.67 | 26.21±3.51 | 25.96±3.39 |
| Ca | 41.63±4.62 | 41.58±4.55 | 39.23±4.46 | 37.91±2.88 | 45.86±2.04 | 45.86±2.04 | 42.43±3.06 | 42.80±3.09 | 42.47±2.80 | 42.06±4.33 | 36.08±2.38 | 38.10±1.93 | 40.59±3.16 | 42.00±2.27 | 38.05±3.89 | 37.77±4.40 | 38.62±4.98 | 36.60±4.20 | 37.11±3.54 |
| CCS | 40.00±4.62 | 39.90±4.70 | 40.90±3.31 | 39.90±4.56 | 57.90±4.12 | 57.90±4.12 | 57.60±4.22 | 53.40±7.82 | 42.10±5.95 | 39.90±3.87 | 35.30±5.36 | 37.00±3.40 | 36.50±2.88 | 36.40±4.40 | 42.60±4.58 | 39.60±5.27 | 43.30±3.65 | 41.40±4.90 | 40.90±4.33 |
| Ab | 21.64±1.81 | 21.62±1.86 | 21.35±1.60 | 22.10±1.41 | 24.18±1.55 | 24.18±1.55 | 24.18±1.55 | 24.37±1.47 | 24.40±1.42 | 22.86±1.54 | 21.81±1.17 | 22.07±1.48 | 21.83±1.54 | 21.52±1.46 | 24.28±1.40 | 23.01±1.98 | 22.82±1.02 | 22.60±1.61 | 21.74±1.45 |
| Q | 22.47±6.54 | 22.37±6.50 | 19.81±5.14 | 20.48±5.55 | 31.47±4.82 | 23.99±4.87 | 24.65±4.45 | 22.56±4.95 | 22.65±4.44 | 22.47±5.49 | 21.33±4.53 | 20.47±4.82 | 20.28±2.64 | 22.37±5.72 | 24.55±4.60 | 22.00±4.42 | 23.22±4.56 | 22.47±6.05 | 21.24±5.74 |
| WW | 30.36±2.18 | 30.32±2.09 | 29.77±1.95 | 29.64±2.03 | 35.87±2.05 | 31.69±1.82 | 31.73±1.76 | 31.56±2.04 | 31.69±1.68 | 31.46±2.12 | 31.93±1.95 | 31.63±1.89 | 31.28±1.89 | 31.58±2.23 | 31.30±2.30 | 29.87±2.44 | 29.44±2.13 | 30.38±1.64 | 29.75±1.23 |
| P | 19.26±1.91 | 19.24±1.84 | 6.01±1.18 | 7.80±1.45 | 35.61±1.93 | 28.69±2.38 | 22.33±1.87 | 21.98±1.47 | 21.14±1.77 | 11.04±2.20 | 23.55±2.38 | 23.61±2.31 | 19.99±1.75 | 14.80±2.04 | 22.35±1.68 | 10.12±2.47 | 8.04±1.91 | 7.23±1.19 | 7.22±1.07 |
| Mi | 4.07±2.15 | 3.89±2.04 | 4.44±2.30 | 7.41±3.55 | 26.11±4.32 | 23.15±6.19 | 13.89±4.05 | 11.02±3.74 | 10.09±3.77 | 11.94±3.61 | 11.48±2.23 | 9.54±2.73 | 8.80±2.32 | 8.33±2.90 | 13.70±2.72 | 11.02±2.67 | 8.70±3.27 | 7.87±2.66 | 7.31±2.97 |
| H+n | 41.91±5.96 | 41.91±5.96 | 35.15±5.32 | 39.93±5.56 | 49.25±4.85 | 49.25±4.85 | 49.34±4.87 | 49.34±4.87 | 44.06±6.79 | 45.72±6.26 | 42.74±6.75 | 40.26±4.07 | 39.20±6.40 | 40.18±5.64 | 42.99±5.54 | 40.84±6.88 | 36.80±7.16 | 35.23±8.04 | 28.96±9.20 |
| H+nn | 41.99±5.45 | 41.99±5.45 | 32.91±5.07 | 37.95±4.98 | 48.76±4.78 | 48.76±4.78 | 48.76±4.78 | 48.76±4.78 | 41.99±8.40 | 42.58±6.41 | 46.28±5.53 | 32.93±8.25 | 35.98±7.08 | 37.47±5.10 | 41.66±4.26 | 37.87±5.86 | 20.46±8.14 | 28.22±12.40 | 19.63±8.76 |
| Ft | 12.23±0.93 | 12.39±0.90 | 12.33±0.85 | 13.56±1.07 | 33.78±1.62 | 33.78±1.62 | 19.81±1.48 | 15.31±1.14 | 13.74±0.75 | 17.13±1.13 | 12.83±0.81 | 13.04±0.74 | 13.37±1.11 | 12.56±0.82 | 21.12±1.42 | 15.46±0.93 | 13.54±0.80 | 12.45±0.67 | 12.57±0.90 |
| Ma | 21.00±1.00 | 21.01±0.93 | 20.91±0.97 | 20.94±0.98 | 26.41±0.97 | 21.41±0.88 | 21.45±0.93 | 21.46±0.92 | 21.11±0.88 | 20.95±0.93 | 20.93±1.04 | 20.94±1.08 | 21.01±1.06 | 20.94±0.96 | 21.45±0.88 | 21.38±0.91 | 21.03±0.91 | 21.08±0.94 | 21.01±0.96 |
| E | 45.55±1.48 | 45.55±1.49 | 43.48±1.36 | 42.92±0.81 | 47.26±1.43 | 47.12±1.39 | 46.46±1.88 | 45.23±1.93 | 44.07±1.45 | 45.63±1.71 | 44.64±1.63 | 44.28±1.36 | 44.75±1.06 | 44.51±1.80 | 44.83±1.50 | 44.06±1.82 | 43.40±1.01 | 42.60±1.19 | 42.25±0.72 |
| Sk | 9.62±0.22 | 10.18±0.29 | 10.74±0.21 | 9.65±0.23 | 33.97±0.29 | 33.97±0.29 | 18.79±0.31 | 10.68±0.24 | 9.87±0.23 | 9.74±0.24 | 9.61±0.23 | 9.62±0.23 | 9.61±0.23 | 9.61±0.23 | 9.62±0.23 | 9.61±0.23 | 7.99±1.46 | 7.19±0.75 | 6.77±0.34 |
| Mu | 23.36±1.19 | 23.28±1.24 | 19.48±1.12 | 22.26±1.19 | 46.07±5.17 | 39.18±6.37 | 29.46±2.12 | 28.22±1.87 | 26.20±1.46 | 32.92±2.97 | 32.11±2.72 | 32.80±2.15 | 30.96±1.96 | 28.13±2.28 | 28.54±2.03 | 27.25±2.59 | 25.40±2.77 | 23.79±1.66 | 24.72±1.04 |
| Ha | 1.94±0.23 | 3.11±0.31 | 3.11±0.33 | 1.76±0.25 | 9.73±0.44 | 2.28±0.21 | 2.28±0.21 | 2.29±0.21 | 2.28±0.20 | 1.80±0.32 | 6.28±0.28 | 4.85±0.21 | 3.85±0.30 | 2.72±0.19 | 2.29±0.23 | 2.26±0.23 | 2.14±0.34 | 1.66±0.19 | 1.66±0.17 |
| Tw | 7.45±0.08 | 7.45±0.08 | 4.42±0.11 | 4.72±0.10 | 6.63±0.07 | 6.55±0.08 | 6.55±0.08 | 6.25±0.09 | 5.65±0.14 | 4.34±0.11 | 5.80±0.09 | 5.21±0.15 | 5.07±0.12 | 5.07±0.11 | 5.21±0.19 | 5.13±0.23 | 5.10±0.12 | 4.97±0.12 | 4.90±0.15 |

Table 2: Complete results for Table 1 (in main file). Domains ordered following Table 1 (in this SM). Each result is the average ± stddev of the classifiers retained at each CV fold. The classifier retained at each fold is the one minimizing the empirical risk among the T, T′ boosting iterations. Columns Q, S and A abbreviate RATBOOSTQb, RATBOOSTSb and RATBOOSTAb, respectively, for quantization index bit-size b.
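The per-fold selection rule in the caption of Table 2 (retain, at each CV fold, the classifier minimizing the empirical risk among the boosting iterations, then report the average ± stddev of the retained classifiers' test errors) can be sketched as follows; `risks_per_fold` and `test_errs_per_fold` are hypothetical per-fold, per-iteration arrays, not the paper's code:

```python
import statistics

def select_and_summarize(risks_per_fold, test_errs_per_fold):
    """At each CV fold, retain the iteration with minimum empirical risk,
    then report mean and stddev of the retained classifiers' test errors.

    Hypothetical data layout: one list of per-iteration values per fold.
    """
    retained = []
    for risks, errs in zip(risks_per_fold, test_errs_per_fold):
        best_t = min(range(len(risks)), key=risks.__getitem__)  # argmin risk
        retained.append(errs[best_t])
    return statistics.mean(retained), statistics.stdev(retained)
```

For example, with two folds and three iterations each, the fold-wise argmin picks one test error per fold, and the summary is their mean ± sample stddev, matching the "average ± stddev" convention of the table.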