
Discriminative Learning of Relaxed Hierarchy for Large-scale Visual Recognition — Supplementary Material

Tianshi Gao
Dept. of Electrical Engineering
Stanford University
[email protected]

Daphne Koller
Dept. of Computer Science
Stanford University
[email protected]

1. Introduction

In this supplementary material, we first discuss how to efficiently solve the optimization problem (2) in Section 3.2 of the main paper. Then, we present two theorems for the generalization error of our model and discuss how our method is related to minimizing the error bound in Section 3. In Section 4, we prove the main theorems. Finally, we show the learned hierarchies for both the Caltech-256 and the SUN datasets in Section 5.

2. Optimization

In this section, we present an efficient algorithm to solve the optimization problem (2) in Section 3.2 of the main paper. We first ignore the constraints $\sum_{k=1}^{K} \mathbb{1}\{\mu_k > 0\} \ge 1$ and $\sum_{k=1}^{K} \mathbb{1}\{\mu_k < 0\} \ge 1$, and discuss how to solve (1) efficiently.

$$
\begin{aligned}
\min_{\{\mu_k\}}\quad & C \sum_{k=1}^{K} n_k \left( \mathbb{1}\{\mu_k = +1\}\left(\frac{1}{n_k}\sum_{i:y_i=k}\xi_i^+ - \frac{A}{C}\right) + \mathbb{1}\{\mu_k = -1\}\left(\frac{1}{n_k}\sum_{i:y_i=k}\xi_i^- - \frac{A}{C}\right) \right) \\
\text{s.t.}\quad & \mu_k \in \{-1, 0, +1\},\ \forall k \in \mathcal{Y} \\
& -B \le \sum_{k=1}^{K} \mu_k \le B
\end{aligned}
\tag{1}
$$

Let
$$
f_k(\mu_k) =
\begin{cases}
C n_k\left(\frac{1}{n_k}\sum_{i:y_i=k}\xi_i^- - \rho\right) & \text{if } \mu_k = -1 \\[4pt]
0 & \text{if } \mu_k = 0 \\[4pt]
C n_k\left(\frac{1}{n_k}\sum_{i:y_i=k}\xi_i^+ - \rho\right) & \text{if } \mu_k = +1
\end{cases}
\tag{2}
$$

where $\rho = \frac{A}{C}$. Then the objective function in (1) can be written as $\sum_k f_k(\mu_k)$. Let $\mu'_k = \arg\min_{\mu_k} f_k(\mu_k)$. Then $\sum_k f_k(\mu'_k)$ is the lowest possible value the objective function can take regardless of the constraint. To facilitate later discussion, we define $S'_+ = \{k \mid \mu'_k = 1\}$, $S'_0 = \{k \mid \mu'_k = 0\}$ and $S'_- = \{k \mid \mu'_k = -1\}$. Furthermore, to simplify the discussion we assume $B$ is a positive integer, i.e., $B \ge 1$. For $B = 0$, the principles of the following discussions remain the same with some special considerations of corner cases. Let $B' = \sum_k \mu'_k$. According to the relation between $B'$ and $B$, there are three cases.

First, if $-B \le B' \le B$, then the optimal solution of (1) is $\mu^\star_k = \mu'_k$, $\forall k$.

Second, if $B' > B$, we first characterize the relations between the optimal solution $\{\mu^\star_k\}_{k=1}^K$ and $\{\mu'_k\}_{k=1}^K$, and then provide an efficient algorithm to find such an optimal solution. Note that for the third case, i.e., $B' < -B$, we can use an analogous procedure to find the optimal solution, similar to the case of $B' > B$, by symmetry.
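
For concreteness, the per-class costs $f_k$ and the unconstrained minimizers $\mu'_k$ can be computed directly from the per-class slack sums. The following Python sketch illustrates this step; the function name, the interface, and the assumption that the slack sums $\sum_{i:y_i=k}\xi_i^{\pm}$ have already been collected from the binary SVM are ours, not part of the paper's implementation.

```python
def unconstrained_coloring(xi_plus_sum, xi_minus_sum, n, C, rho):
    """Per-class costs f_k(mu) of Eq. (2) and their unconstrained minimizers mu'_k.

    xi_plus_sum[k] / xi_minus_sum[k]: sums of the slacks xi_i^+ / xi_i^- over
    the examples of class k; n[k]: number of examples of class k.
    (Illustrative sketch only; names and interface are assumptions.)"""
    K = len(n)
    # f_k(-1), f_k(0), f_k(+1); note C*n_k*(avg_slack - rho) = C*(slack_sum - rho*n_k)
    f = [{-1: C * (xi_minus_sum[k] - rho * n[k]),
           0: 0.0,
          +1: C * (xi_plus_sum[k] - rho * n[k])} for k in range(K)]
    mu_prime = [min(fk, key=fk.get) for fk in f]   # argmin over {-1, 0, +1}
    B_prime = sum(mu_prime)
    return f, mu_prime, B_prime
```

If $-B \le B' \le B$, this coloring is already feasible and hence optimal; the theorems that follow characterize how it must be adjusted when $B' > B$.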

We first prove the following theorem characterizing the optimal solution of (1).

Theorem 2.1. If $B' = \sum_k \mu'_k > B$, then there exists an optimal solution $\{\mu^\star_k\}_{k=1}^K$ such that $B^\star = B$ or $B - 1$, where $B^\star = \sum_k \mu^\star_k$ (note that $B - 1 > -B$, since we assume $B \ge 1$).

Proof. Given an optimal solution $\{\mu^\star_k\}_{k=1}^K$, suppose $B^\star \le B - 2$. We have $\sum_k \mu'_k - \sum_k \mu^\star_k = \sum_k (\mu'_k - \mu^\star_k) = \sum_{k \in S'_-}(\mu'_k - \mu^\star_k) + \sum_{k \in S'_0}(\mu'_k - \mu^\star_k) + \sum_{k \in S'_+}(\mu'_k - \mu^\star_k) = B' - B^\star \ge B' - B + 2$. Since $\sum_{k \in S'_-}(\mu'_k - \mu^\star_k) = \sum_{k \in S'_-}(-1 - \mu^\star_k) \le 0$ (because $\mu^\star_k \in \{-1, 0, 1\}$ implies $\mu^\star_k \ge -1$), we have $\sum_{k \in S'_0}(\mu'_k - \mu^\star_k) + \sum_{k \in S'_+}(\mu'_k - \mu^\star_k) \ge B' - B + 2 > 2$. This implies that there exists at least one $\tilde{k} \in S'_0 \cup S'_+$ such that $\mu'_{\tilde{k}} - \mu^\star_{\tilde{k}} = 1$ or $2$. Now we construct a solution by changing $\mu^\star_{\tilde{k}}$ to $\mu'_{\tilde{k}}$ and keeping the other $\mu^\star_k$'s unchanged. For the new solution, $\tilde{B} = \mu'_{\tilde{k}} + \sum_{k \ne \tilde{k}} \mu^\star_k = B^\star + \mu'_{\tilde{k}} - \mu^\star_{\tilde{k}} \le B^\star + 2 \le B$, so the new solution is feasible. In terms of the objective value, since $f_{\tilde{k}}(\mu'_{\tilde{k}}) \le f_{\tilde{k}}(\mu^\star_{\tilde{k}})$ (by the definition of $\mu'_{\tilde{k}}$), the objective value of the new solution is no worse than that of $\{\mu^\star_k\}_{k=1}^K$. If $f_{\tilde{k}}(\mu'_{\tilde{k}})$ is strictly less than $f_{\tilde{k}}(\mu^\star_{\tilde{k}})$, then we have found a feasible solution with a better objective value, which contradicts the optimality of $\{\mu^\star_k\}_{k=1}^K$. Thus, $B^\star > B - 2$, i.e., $B^\star = B$ or $B - 1$. Otherwise, if $f_{\tilde{k}}(\mu'_{\tilde{k}}) = f_{\tilde{k}}(\mu^\star_{\tilde{k}})$: if $\tilde{B} > B - 2$, then we have found an optimal solution satisfying the statement of the theorem; otherwise, if $\tilde{B} \le B - 2$, we can recursively apply the same argument to the newly constructed optimal solution, which will either lead to a contradiction or eventually produce an optimal solution satisfying the condition in the theorem.

We present another theorem showing a further property of the optimal solution.

Theorem 2.2. If $B' = \sum_k \mu'_k > B$, then among the optimal solutions satisfying the condition in Theorem 2.1, there exists an optimal solution $\{\mu^\star_k\}_{k=1}^K$ such that $\forall k \in S'_-$, $\mu^\star_k = \mu'_k$.

Proof. Suppose there exists an optimal solution (satisfying the condition in Theorem 2.1) with some $\tilde{k} \in S'_-$ such that $\mu^\star_{\tilde{k}} \ne \mu'_{\tilde{k}} = -1$. Similar to the analysis in the proof of Theorem 2.1, we have $\sum_k \mu'_k - \sum_k \mu^\star_k = \sum_k (\mu'_k - \mu^\star_k) = \sum_{k \in S'_-}(\mu'_k - \mu^\star_k) + \sum_{k \in S'_0}(\mu'_k - \mu^\star_k) + \sum_{k \in S'_+}(\mu'_k - \mu^\star_k) = B' - B^\star \ge B' - B \ge 1$. Thus, $\sum_{k \in S'_0}(\mu'_k - \mu^\star_k) + \sum_{k \in S'_+}(\mu'_k - \mu^\star_k) \ge 1 - \sum_{k \in S'_-}(\mu'_k - \mu^\star_k) \ge 2$ (because $\mu'_{\tilde{k}} - \mu^\star_{\tilde{k}} \le -1$). This implies that there exist at least two $j \in S'_0 \cup S'_+$ such that $\mu'_j - \mu^\star_j = 1$, and/or at least one $i \in S'_+$ such that $\mu'_i - \mu^\star_i = 2$. Now we construct a new solution by first changing $\mu^\star_{\tilde{k}}$ to $\mu'_{\tilde{k}} = -1$. If $\mu^\star_{\tilde{k}}$ was $1$, then we also change two $\mu^\star_j$ to $\mu'_j = \mu^\star_j + 1$, or change one $\mu^\star_i$ to $\mu'_i = \mu^\star_i + 2$. If $\mu^\star_{\tilde{k}}$ was $0$, we change one $\mu^\star_j$ to $\mu'_j = \mu^\star_j + 1$, or change one $\mu^\star_i$ to $\mu'_i = \mu^\star_i + 2$ (only when there is no single $j$ such that $\mu'_j - \mu^\star_j = 1$). The new solution still has $B^{\text{new}} = B^\star$ or $B^\star + 1$ (the latter only when there is no single $j$ such that $\mu'_j - \mu^\star_j = 1$, which implies $B^\star = B - 1$, as shown in the algorithm). Therefore, the new solution is still feasible. Moreover, since we changed $\mu^\star_{\tilde{k}}$ to $\mu'_{\tilde{k}}$ and some $\mu^\star_j$ to $\mu'_j$ or $\mu^\star_i$ to $\mu'_i$, the objective value becomes no worse. If the new solution has a lower objective value, then by contradiction we conclude that $\forall k \in S'_-$, $\mu^\star_k = \mu'_k$. If the new solution has the same objective value, we can apply the same argument to this new solution, which will either lead to a contradiction (thus proving the theorem) or construct an optimal solution such that $\forall k \in S'_-$, $\mu^\star_k = \mu'_k$, which is consistent with the statement of the theorem.

Similar to Theorem 2.2, the following statement is also true.

Theorem 2.3. If $B' = \sum_k \mu'_k > B$, then among the optimal solutions satisfying both the conditions in Theorem 2.1 and Theorem 2.2, there exists an optimal solution $\{\mu^\star_k\}_{k=1}^K$ such that $\forall k \in S'_0$, $\mu^\star_k = \mu'_k = 0$ or $\mu^\star_k = -1$. In other words, there is no $k \in S'_0$ such that $\mu^\star_k = 1$.

Proof. The proof follows the same principle as the proof of Theorem 2.2. Basically, if there is some $\tilde{k} \in S'_0$ such that $\mu^\star_{\tilde{k}} = 1 > \mu'_{\tilde{k}} = 0$, then we can always construct another solution by changing $\mu^\star_{\tilde{k}}$ to $\mu'_{\tilde{k}} = 0$ and increasing some other $\mu^\star_j$ to $\mu'_j$ where $j \in S'_0 \cup S'_+$. This new solution is feasible and has an objective value that is no worse than that of the previous optimal solution. If the new solution has a lower objective value, then we have a contradiction, thus proving the theorem. Otherwise, we can repeat the construction, which will either lead to a contradiction or eventually construct another optimal solution satisfying the condition in the theorem.

Based on the above theorems, we can transform the original problem (1) into:
$$
\begin{aligned}
\min_{\{\mu_k\}}\quad & \sum_{k=1}^{K} \big( f_k(\mu_k) - f_k(\mu'_k) \big) \\
\text{s.t.}\quad & \mu_k = \mu'_k = -1, \ \forall k \in S'_- \\
& \mu_k \in \{-1, 0\}, \ \forall k \in S'_0 \\
& \mu_k \in \{-1, 0, 1\}, \ \forall k \in S'_+ \\
& \sum_{k \in S'_0} (\mu'_k - \mu_k) + \sum_{k \in S'_+} (\mu'_k - \mu_k) = B' - B \ \text{ or } \ B' - B + 1
\end{aligned}
\tag{3}
$$

Since $f_k(\mu_k) \ge f_k(\mu'_k)$, any $k$ with $\mu_k \ne \mu'_k$ incurs a non-negative increase in the objective. Thus, the optimal solution $\{\mu^\star_k\}$ should differ from $\{\mu'_k\}$ as little as possible. However, the last constraint in (3) requires that the total difference between $\{\mu'_k\}$ and $\{\mu_k\}$ be $B' - B$ or $B' - B + 1$. Problem (3) is therefore equivalent to finding a subset of $k$'s from $S'_0 \cup S'_+$ and changing the values of those $\mu_k$ from $\mu'_k$ to other values such that the total difference is $B' - B$ or $B' - B + 1$ while the cost of these changes is minimal (corresponding to the minimum objective value). Specifically, for $k \in S'_0$, changing $\mu_k$ from $\mu'_k = 0$ to $-1$ contributes $\mu'_k - \mu_k = 1$ to the difference, at a cost of $\Delta_k = f_k(-1) - f_k(0)$ in the objective. For $k \in S'_+$, there are two cases. The first is to change $\mu_k$ from $\mu'_k = 1$ to $0$, which contributes $\mu'_k - \mu_k = 1$ to the difference at a cost of $\Delta_{k,0} = f_k(0) - f_k(1)$. The second is to change $\mu_k$ from $\mu'_k = 1$ to $-1$, which contributes $\mu'_k - \mu_k = 2$ to the difference at a cost of $\Delta_{k,-} = f_k(-1) - f_k(1)$. We also denote $\Delta_k = f_k(-1) - f_k(0)$ for $k \in S'_+$; note that $\Delta_{k,-} = \Delta_{k,0} + \Delta_k$. The interpretation of $\Delta_k$, $\Delta_{k,0}$ and $\Delta_{k,-}/2$ is that they are the cost per unit increase in $\sum_{k \in S'_0}(\mu'_k - \mu_k) + \sum_{k \in S'_+}(\mu'_k - \mu_k)$ for the different $k$'s. Therefore, we can sort these costs and select the changes with the smallest costs until the constraint is satisfied. We formalize this idea in Algorithm 1. Note that the computation in Algorithm 1 consists of linear-scan components, i.e., computing and selecting the $\Delta$'s, and a sorting component, so the overall complexity is $O(K \log K)$.
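
As a simple correctness check on this greedy scheme, problem (3) can also be solved exactly by a small dynamic program over the total decrease $\sum_{k \in S'_0 \cup S'_+}(\mu'_k - \mu_k)$. The sketch below continues the earlier snippet; the names, interface, and the dynamic-programming formulation itself are ours, giving an $O(K(B'-B))$ reference solver rather than the $O(K \log K)$ procedure of Algorithm 1.

```python
def solve_coloring_dp(f, mu_prime, B, B_prime):
    """Exact reference solver for problem (3) when B' > B (assumes B >= 1).

    f[k]: dict mapping a color in {-1, 0, +1} to f_k(color);
    mu_prime: list of unconstrained minimizers mu'_k.
    Returns a coloring whose total decrease sum_k (mu'_k - mu_k) is
    B'-B or B'-B+1 with minimum extra cost.  (Illustrative sketch only.)"""
    K = len(mu_prime)
    T = B_prime - B + 1                       # largest admissible total decrease
    INF = float("inf")
    best = [0.0] + [INF] * T                  # best[t]: min extra cost for decrease t
    back = []                                 # back[k][t]: (units, color) chosen for class k
    for k in range(K):
        # moves available to class k: (units of decrease, extra cost, resulting color)
        moves = [(0, 0.0, mu_prime[k])]
        if mu_prime[k] == 0:                  # k in S'_0 may move to -1
            moves.append((1, f[k][-1] - f[k][0], -1))
        elif mu_prime[k] == 1:                # k in S'_+ may move to 0 or -1
            moves.append((1, f[k][0] - f[k][1], 0))
            moves.append((2, f[k][-1] - f[k][1], -1))
        new_best, bk = [INF] * (T + 1), [None] * (T + 1)
        for t in range(T + 1):
            for units, cost, color in moves:
                if units <= t and best[t - units] + cost < new_best[t]:
                    new_best[t], bk[t] = best[t - units] + cost, (units, color)
        best = new_best
        back.append(bk)
    # choose the cheaper of the two admissible totals, then walk the choices back
    t = min((B_prime - B, B_prime - B + 1), key=lambda s: best[s])
    mu = list(mu_prime)
    for k in reversed(range(K)):
        units, color = back[k][t]
        mu[k] = color
        t -= units
    return mu
```

Algorithm 1 reaches the same optimal value in $O(K \log K)$; the dynamic program above is only meant to make the structure of (3) explicit and can serve as a check on an implementation.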

Now we discuss the case where we add the constraints $\sum_{k=1}^{K} \mathbb{1}\{\mu_k > 0\} \ge 1$ and $\sum_{k=1}^{K} \mathbb{1}\{\mu_k < 0\} \ge 1$ back to (1). If the optimal solution $\{\mu^\star_k\}_{k=1}^K$ of (1) already satisfies the new constraints, then no modification is needed. Otherwise, we discuss three cases corresponding to the different relations between $B'$ and $B$.

First, for $-B \le B' \le B$ (where $\mu^\star_k = \mu'_k$), if no $\mu^\star_k$ equals $-1$, then we select the $j$ for which the increase in the objective incurred by changing $\mu^\star_j$ to $-1$ is smallest, and set $\mu^\star_j = -1$. It is easy to show that this new solution always satisfies the constraint $-B \le \sum_{k=1}^{K} \mu^\star_k \le B$. We apply the analogous step if no $\mu^\star_k$ equals $1$. Since the increase in the objective (with respect to the best possible value $\sum_k f_k(\mu'_k)$) is the minimum, the new solution is optimal.

Second, for $B' > B$, we focus on the case $B > 1$, since then there is at least one $\mu^\star_k = 1$ (because $B^\star = \sum_k \mu^\star_k = B$ or $B - 1 > 0$). In addition, as long as $S'_-$ is not empty, there is at least one $\mu^\star_k = -1$ with $k \in S'_-$. If $S'_-$ is empty and no $\mu^\star_k$ in the solution of (1) equals $-1$, we can find the smallest $\Delta_j$ in $S_\Delta$ when Algorithm 1 terminates and the largest $\Delta_{i,0}$ in $\bar{S}_\Delta$, and set $\mu^\star_j = -1$ and $\mu^\star_i = 1$.

Third, for $B' < -B$, we can use an analogous procedure to find the optimal solution, similar to the case of $B' > B$, by symmetry.

3. Analysis of Generalization Error

In this section we analyze the generalization error of our model based on the techniques developed for perceptron decision trees [2]. Given a model represented by $G$, for each node $j \in G$, define the margin at node $j$ as
$$
\gamma_j = \min_{\{x_i \,:\, \mu_{j,y_i} = \pm 1,\ (x_i, y_i) \in D\}} \left| w_j^T x_i + b_j \right|
\tag{4}
$$
where $D$ is a given training set, $\{\mu_{j,k}\}$ is the coloring at node $j$, and $(w_j, b_j)$ specifies the decision hyperplane.
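
As a concrete reading of (4), the margin at a node is simply the smallest absolute score among training points whose class is colored $\pm 1$ at that node; points of ignored classes (color $0$) do not constrain it. A minimal NumPy sketch, with our own variable names and interface, is:

```python
import numpy as np

def node_margin(w_j, b_j, X, y, mu_j):
    """Margin gamma_j of Eq. (4) at a single decision node.

    X: (m, d) array of training features; y: length-m array of class labels;
    mu_j: dict mapping each class label to its color in {-1, 0, +1} at node j.
    (Illustrative sketch; the interface is an assumption.)"""
    active = np.array([mu_j[label] != 0 for label in y])   # classes colored +/-1
    scores = X[active] @ w_j + b_j                         # w_j^T x_i + b_j for active points
    return np.abs(scores).min()
```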

Theorem 3.1. Suppose $m$ random samples are correctly classified using $G$ on $K$ classes containing $N_G$ decision nodes with margins $\{\gamma_j, \forall j \in G\}$. Then with probability greater than $1 - \delta$, the generalization error is less than
$$
\frac{130 R^2}{m} \left( D' \log(4em)\log(4m) + N_G \log(2m) - \log\frac{\delta}{2} \right)
\tag{5}
$$
where $D' = \sum_{j=1}^{N_G} \frac{1}{\gamma_j^2}$, and $R$ is the radius of a ball containing the distribution's support.

We first discuss the implications of this theorem and then present the proof in Section 4. Theorem 3.1 reveals two important factors for reducing the generalization error bound: the first is to enlarge the margins $\gamma_j$ and the second is to control the complexity of the model by reducing the number of decision nodes $N_G$. There is a trade-off between these two quantities. Consider two extreme cases. On one hand, to reduce $N_G$, we want a balanced binary tree where each node encourages all classes to participate; however, in this case the margins are likely to be small, since a large subset of classes has to be separated from another subset. On the other hand, to increase the margins, each node can consider only two classes, which is the case of DAGSVM; however, the number of nodes is then quadratic in the number of classes $K$, while in the first case it is only linear in $K$. Therefore, a good balance has to be struck to enjoy a good generalization error. Our method, characterized by the optimization problem (1) in the main paper, seeks such a trade-off: the objective function encourages classes to participate, which controls the model complexity, and ignores a subset of classes with small margins, which maximizes the resulting margins. The particular trade-off between these two parts is controlled by the parameter $\rho$ in our model. Compared to previous methods: relative to DAGSVM, our method takes advantage of the hierarchical structure in the label space to reduce the model complexity; relative to other hierarchical methods, our method achieves larger margins by ignoring a subset of classes at each node and by the joint max-margin optimization of the partition and the binary classifier.
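
To make the trade-off concrete, the right-hand side of (5) can be evaluated for any candidate hierarchy once its margins are measured. The helper below is a direct transcription of (5); the function name and interface are ours, not from the paper's code.

```python
import math

def error_bound(margins, R, m, delta):
    """Evaluate the generalization bound (5).

    margins: list of per-node margins gamma_j (so N_G = len(margins));
    R: radius of a ball containing the support; m: number of samples;
    delta: confidence parameter.  (Illustrative helper only.)"""
    N_G = len(margins)
    D_prime = sum(1.0 / g ** 2 for g in margins)          # D' = sum_j 1/gamma_j^2
    return (130.0 * R ** 2 / m) * (D_prime * math.log(4 * math.e * m) * math.log(4 * m)
                                   + N_G * math.log(2 * m)
                                   - math.log(delta / 2.0))
```

Plugging in a balanced tree (few nodes, potentially small margins) versus a DAGSVM-style configuration (quadratically many nodes, large margins) shows that the bound favors hierarchies keeping both $N_G$ and $\sum_j 1/\gamma_j^2$ moderate, which is exactly what problem (1) trades off through $\rho$.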

Instead of considering the overall error across all classes, we can also bound the generalization error for a specific class $k$ in terms of the nodes in which class $k$ is active. We define the probability of error for class $k$ as
$$
\varepsilon_k(G) = P_{(x,y) \sim D}\left\{ (y = k \text{ but } f_G(x) \ne k) \ \text{or} \ (y \ne k \text{ but } f_G(x) = k) \right\}
\tag{6}
$$
where $f_G : \mathcal{X} \to \mathcal{Y}$ is the decision function induced by $G$ and $D$ is some distribution from which the $(x, y)$'s are drawn.

Theorem 3.2. Suppose that in a random $m$-sample we are able to correctly distinguish class $k$ from the other classes using $G$, containing $N_G$ decision nodes with margins $\{\gamma_j, \forall j \in G\}$. In addition, suppose the number of nodes in which class $k$ is active is $N_G(k)$. Then with probability greater than $1 - \delta$,
$$
\varepsilon_k(G) \le \frac{130 R^2}{m} \left( D'_k \log(4em)\log(4m) + N_G(k) \log(2m) - \log\frac{\delta}{2} \right)
\tag{7}
$$
where $D'_k = \sum_{j \in k\text{-nodes}} \frac{1}{\gamma_j^2}$ ($k$-nodes are the set of nodes in which class $k$ is active), and $R$ is the radius of a ball containing the distribution's support.

Please refer to the next section for the proof. This theorem implies that the error for a class $k$ will be small if it has a short decision path with large margins along the path, and vice versa.

4. Proof of Theorems

The proofs of Theorem 3.1 and Theorem 3.2 closely resemble the proof of Theorem 1 in [3], which is based on the techniques developed for studying perceptron decision trees [2]. We present the key steps for proving the theorems; more details on this style of proof can be found in [2].

Definition 1. Let $F$ be a set of real-valued functions. We say that a set of points $X$ is $\gamma$-shattered by $F$ relative to $r = (r_x)_{x \in X}$ if there are real numbers $r_x$, indexed by $x \in X$, such that for all binary vectors $b$ indexed by $X$, there is a function $f_b \in F$ satisfying $f_b(x) \ge r_x + \gamma$ if $b_x = 1$, and $f_b(x) \le r_x - \gamma$ otherwise. The fat-shattering dimension $\mathrm{fat}_F$ of the set $F$ is a function from the positive real numbers to the integers that maps a value $\gamma$ to the size of the largest $\gamma$-shattered set, if this is finite, or to infinity otherwise.

We consider the class of linear functions $F_{\text{lin}} = \{ x \mapsto \langle w, x \rangle + \theta : \|w\| = 1 \}$.

Theorem 4.1 (Bartlett and Shawe-Taylor [1]). Let $F_{\text{lin}}$ be restricted to points in a ball of $n$ dimensions of radius $R$ about the origin. Then
$$
\mathrm{fat}_{F_{\text{lin}}}(\gamma) \le \min\left\{ \frac{R^2}{\gamma^2},\ n + 1 \right\}
\tag{8}
$$

We now give a lemma which bounds the probability of error over a sequence of $2m$ samples, where the first $m$ samples $\bar{x}_1$ have zero error and the second $m$ samples $\bar{x}_2$ have error greater than an appropriate $\varepsilon$. We denote the model by $G$ and the number of decision nodes by $N_G$.

Lemma 1. Let $G$ be a model on $K$ classes with $N_G$ decision nodes with margins $\{\gamma_1, \gamma_2, \dots, \gamma_{N_G}\}$. Let $k_i = \mathrm{fat}_{F_{\text{lin}}}(\gamma_i / 8)$, where fat is continuous from the right. Then the following bound holds:
$$
P^{2m}\big\{\bar{x}_1 \bar{x}_2 : \exists \text{ a graph } G \text{ such that } \forall x \in \bar{x}_1 \text{ and } \forall i,\ G \text{ correctly separates classes } S^-_i \text{ and } S^+_i \text{ at node } i, \text{ but the fraction of points misclassified in } \bar{x}_2 > \varepsilon(m, K, \delta)\big\} < \delta,
$$
where $\varepsilon(m, K, \delta) = \frac{1}{m}\left( D \log(8m) + \log\frac{2 N_G}{\delta} \right)$, $D = \sum_{i=1}^{N_G} k_i \log\frac{4em}{k_i}$, and $S^-_i$ and $S^+_i$ are the negative and positive classes at node $i$.

Proof. The proof follows the same arguments as the proof of Lemma 3.7 in [2] or the proof of Lemma 4 in [3].

Lemma 1 applies to a particular $G$ with specified margins $\{\gamma_i\}$. In practice, we observe the margins after learning the model. To obtain a bound that can be better applied in practice, we want to bound the probabilities uniformly over all possible margins that can arise. This gives rise to Theorem 3.1. The proof follows the same arguments as the proof of Theorem 1 in [3], based on Lemma 1.

Now we proceed to discuss the generalization error bound for the probability of error for a single class $k$. Let $S_{k\text{-nodes}}$ be the set of nodes in $G$ where class $k$ is active. We denote the probability of error over $S_{k\text{-nodes}}$ by $\varepsilon'_k(G)$, i.e., $\varepsilon'_k(G)$ is the probability that any node $i \in S_{k\text{-nodes}}$ makes a mistake. We first present a lemma giving a generalization error bound for $\varepsilon'_k(G)$.

Lemma 2. Suppose that in a random $m$-sample, no mistake is made by any node $i \in S_{k\text{-nodes}}$. In addition, $G$ has margins $\{\gamma_j, \forall j \in G\}$. Denote $N_G(k) = |S_{k\text{-nodes}}|$, i.e., the number of nodes in $G$ where class $k$ is active. Then with probability greater than $1 - \delta$,
$$
\varepsilon'_k(G) \le \frac{130 R^2}{m} \left( D'_k \log(4em)\log(4m) + N_G(k) \log(2m) - \log\frac{\delta}{2} \right)
\tag{9}
$$
where $D'_k = \sum_{j \in S_{k\text{-nodes}}} \frac{1}{\gamma_j^2}$, and $R$ is the radius of a ball containing the distribution's support.

Proof. The proof follows exactly the proofs of Lemma 1 and Theorem 3.1.

Now we prove the last theorem by exploiting the relation between $\varepsilon'_k(G)$ and $\varepsilon_k(G)$.

Proof of Theorem 3.2. We first prove that $\varepsilon'_k(G) \ge \varepsilon_k(G)$. Consider a node $i \in S_{k\text{-nodes}}$, whose positive and negative classes are $S^+_i$ and $S^-_i$ respectively. We study how a binary error at node $i$ relates to the error for class $k$. Without loss of generality, assume $k \in S^+_i$. Suppose a mistake is made on an instance $x$ from $S^+_i \setminus \{k\}$, i.e., $x$ is classified as negative by the binary classifier. This implies that $x$ will not be classified as class $k$, which is a correct classification given the definition of the error for class $k$. Furthermore, if a mistake is made on an instance $x$ from $S^-_i$, i.e., $x$ is classified as positive at node $i$, then it is still possible that lower levels will prune away class $k$ for $x$, leading to a correct classification if we only consider the error for class $k$. Finally, if there is a binary mistake on an instance $x$ from class $k$, then it also counts as an error in terms of the error for class $k$. In summary, binary classification errors at nodes in $S_{k\text{-nodes}}$ may not count as errors when we consider only class $k$. On the other hand, if there is an error for class $k$, either because we classified an instance from class $k$ as some other class or because we classified an instance from a non-$k$ class as $k$, there must be at least one binary error at some $i \in S_{k\text{-nodes}}$ (since no error in any of the binary classifications implies no error for class $k$). Therefore, $\varepsilon'_k(G) \ge \varepsilon_k(G)$. With this relation and Lemma 2, the statement of Theorem 3.2 follows.

5. Learned Hierarchy

We present three learned hierarchies using different representations on two datasets. Figure 1 shows the first five levels of the hierarchy learned using dense SIFT based features with an extended Gaussian kernel based on the χ² distance on the Caltech-256 dataset. Figure 2 shows the first five levels of the hierarchy learned using GIST with an RBF kernel on the SUN dataset. Figure 3 shows the first five levels of the hierarchy learned using spatial HOG with a histogram intersection kernel on the SUN dataset. Note that the hierarchies learned on the same dataset with different representations are different (compare Figure 2 and Figure 3).

References

[1] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods — Support Vector Learning, 1998.

[2] K. P. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu. Enlarging the margins in perceptron decision trees. Machine Learning, 2000.

[3] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In NIPS, 2000.

Algorithm 1 An efficient algorithm for solving (3)

1: Initialize $\mu_k = \mu'_k$ for all $k$. Let $d = 0$, and let $S_\Delta$ and $\bar{S}_\Delta$ be two empty sets.
2: For each $k \in S'_0$, compute $\Delta_k = f_k(-1) - f_k(0)$, and let $S_\Delta \leftarrow S_\Delta \cup \{\Delta_k\}$.
3: For each $k \in S'_+$, compute $\Delta_k = f_k(-1) - f_k(0)$, $\Delta_{k,0} = f_k(0) - f_k(1)$ and $\Delta_{k,-} = \Delta_{k,0} + \Delta_k$. If $\Delta_{k,0} \le \Delta_k$, then let $S_\Delta \leftarrow S_\Delta \cup \{\Delta_{k,0}, \Delta_k\}$; otherwise, let $S_\Delta \leftarrow S_\Delta \cup \{\Delta_{k,-}/2, \Delta_{k,0}\}$.
4: repeat
5:   Pop the first (minimum) element $\Delta_{\min}$ from $S_\Delta$: $S_\Delta \leftarrow S_\Delta \setminus \{\Delta_{\min}\}$.
6:   if $\Delta_{\min} = \Delta_k$ then
7:     Let $\mu_k = -1$ and $d \leftarrow d + 1$.
8:   end if
9:   if $\Delta_{\min} = \Delta_{k,0}$ then
10:    Let $\mu_k = 0$ and $d \leftarrow d + 1$.
11:  end if
12:  if $\Delta_{\min} = \Delta_{k,-}/2$ then
13:    if $d \le B' - B - 2$ then
14:      Let $\mu_k = -1$ and $d \leftarrow d + 2$; $S_\Delta \leftarrow S_\Delta \setminus \{\Delta_{k,0}\}$.
15:    else   // $d = B' - B - 1$
16:      Find the next smallest $\Delta_j$ or $\Delta_{j,0}$ in $S_\Delta$ and call it $\Delta_{j,\min}$.
17:      Find the largest $\Delta_i$ or $\Delta_{i,0}$ in $\bar{S}_\Delta$ and call it $\Delta_{i,\max}$ (if no such $i$ exists, let $\Delta_{i,\max} = 0$).
18:      For $\Delta_{l,-} \in \bar{S}_\Delta$, find the $l$ with the largest $\Delta_{l,-} - \Delta_{l,0}$, and denote the largest difference by $\Delta_{l,\max}$ (if $\bar{S}_\Delta$ contains no $\Delta_{l,-}$, let $\Delta_{l,\max} = 0$).
19:      Let $\Delta_{\text{one-step-min}}$ be the minimum of $\{\Delta_{j,\min},\ \Delta_{k,-} - \Delta_{i,\max},\ \Delta_{k,-} - \Delta_{l,\max}\}$.
20:      if $\Delta_{\text{one-step-min}} = \Delta_{j,\min}$ then
21:        If $\Delta_{j,\min} = \Delta_j$, let $\mu_j = -1$; else let $\mu_j = 0$. Then $d \leftarrow d + 1$.
22:      else if $\Delta_{\text{one-step-min}} = \Delta_{k,-} - \Delta_{i,\max}$ then
23:        Let $\mu_k = -1$. If $i$ exists, let $\mu_i = \mu'_i$ and $d \leftarrow d + 1$; else $d \leftarrow d + 2$.
24:      else   // $\Delta_{\text{one-step-min}} = \Delta_{k,-} - \Delta_{l,\max}$
25:        Let $\mu_l = 0$, $\mu_k = -1$, and $d \leftarrow d + 1$.
26:      end if
27:    end if
28:  end if
29:  $\bar{S}_\Delta \leftarrow \bar{S}_\Delta \cup \{\Delta_{\min}\}$
30: until $d = B' - B$ or $d = B' - B + 1$
Return: $\{\mu_k\}_{k=1}^{K}$

[Figure 1: tree diagram of the first five levels of the learned hierarchy on Caltech-256 categories; each decision node lists its positive, negative, and ignored classes.]

Figure 1. The first five levels of the learned hierarchy using dense SIFT based features with an extended Gaussian kernel based on the χ² distance on the Caltech-256 dataset.

[Figure 2: tree diagram of the first five levels of the learned hierarchy on SUN scene categories; each decision node lists its positive, negative, and ignored classes.]

Figure 2. The first five levels of the learned hierarchy using GIST with RBF kernel on the SUN dataset.

[Figure 3: tree diagram of the first five levels of the learned hierarchy on SUN scene categories; each decision node lists its positive, negative, and ignored classes.]

Figure 3. The first five levels of the learned hierarchy using spatial HOG with histogram intersection kernel on the SUN dataset.