Extreme Classificationsaketha/research/AkshatMTP2018.pdf · Place : MLAV\A.,6 ~ Prof. Sunita...

$: Extreme Classificationsaketha/research/AkshatMTP2018.pdf · Place : MLAV\A.,6 ~ Prof. Sunita Sarawagi Dept o f C SE , ITT 13ornhay S u perviso r De p t. o f CSE , IIT Bo mbay Internal$
Extreme Classification

Submitted in partial fulfillment of the requirements

of the degree of

Master of Technology

by

Akshat Jaiswal

(Roll no. 163050069)

Guided By:

Prof. Saketha Nath

Prof. Sunita Sarawagi

Department of Computer Science & Engineering

Indian Institute of Technology Bombay

2018

lleport Approval

T his pnij1•ct report 1·11lillt•cl " Extrnu1e Classification " , s ul.J1111I ll'tl liy A b ltn t .lri.iswol (Roll

Nn. l(i;lQ5(l(H>!l) , is npprowd fur the nwmd nf tl1•gr<·c uf fl. luster of lh:h11o logy in Cc,mputcr

Sri\'lll'L' & F.nginl'ning.

Lft Dept. of CSS, !IT Hydcrnl.iad

Supervisor

9-¾ LJoJ,_ ~ Ajit Rajwade

Dept. of CSE, IIT Born I.Jay

lnternfll Examiner

Prof.

Prof. Ganesh Ramakrislman

Dept . of CSE, IIT Bombay

C hairperson

Date: '3P. June 2018

Place: MLAV\A.,6 ~

Prof. Sunita Sarawagi

Dept of CSE, ITT 13ornhay

Supervisor

Dept . of CSE, IIT Bombay

Internal Examiner

Declaration of A11thorship

l d\'dan• th .1t this wril"t.L'll suhrnissin11 rt'Jm•st•11ts Ill)' i(kas i11 111v 1rn·11 \\·orris nnd wlll'n·'

ot hers · idr::ts or words lw.n' lil't'll i11ch1dt>d, I haw ad<'quat.01:v ri t.ed nnd 1vkrcm·rd t.ht·

original snmce:; . 1 abo dcdarc l hat I haw :..1dhN0d t.o nil prinl'ipks of antclc•mic ho11est.y

Jml integrity and hnw JlL)\ 111i:m'prr:-:e11tcd or fa bricnt.ccl or falsified an~r id~•n/d nta/fac-

1 / :-:1J lllTl' i11 Ill\' :-:ub111is:-:ion . I understand that am· violation nf the n.boV<' will l.w rans<' ' . . fpr di:-:riplinnry ar1 in11 liy tlw lnstit11t e nud c:rn alsn eYnk<• 1w11nl action fro m t.lw soun-1•s

\\'l11d1 hove t bus llOt been pro1wrl\' cited or from whom proper 1wrn1ission has not l.H•e11

t n k1•11 wlw11 lll'e d1•d

0,1 t1' . ~ -- .hJlll' :2lH~

11

Sip; 11a tmr: _ .. . ~ .. .... ... .. . .

Akshat .Jaiswal

L liJllSOOLi0

Abstract

Over the past few years extreme classification has become an important research problem

and has gained focus of many researchers all over the world due to its wide range of ap-

plications in ranking, recommendation and tagging. Extreme classification is process of

assigning a relevant set of items/labels to a data point/observation, from a very large set

of labels. In real world there are many tasks that require ranking of huge set of items,

documents or labels and return only few of them. For example collaborative filtering, im-

age/video annotation etc. are some of the problem which have very large label space and

can be readily casted into an extreme classification problem. The large size of label space

makes existing approaches intractable in terms of scalability, memory footprint, predic-

tion time etc. This report presents some of the techniques used by researchers in the past

to address this problem. Most of the existing approaches makes use of the assumption

that there exist a latent structure between the labels which they try to utilize in their

techniques. However recent developments have shown that state of the art results can be

achieved without making any such assumption and treating each label as an independent

entity. In this work we propose various formulations that tries to combine the strength

of these recent independent learners with the latent structural information present in the

label space.

iii

Contents

Report Approval i

Declaration of Authorship ii

Abstract iii

List of Tables vi

1 Introduction 1

1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.5 Road-map of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Literature Survey 5

2.1 Binary Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Label Partitioning for Sublinear Ranking . . . . . . . . . . . . . . . . . . . 6

2.3 Low Rank Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . 9

2.4 FastXML: A Fast, Accurate and Stable Tree-classifier for eXtreme Multi-

label Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5 Sparse Local Embeddings for Extreme Classification . . . . . . . . . . . . . 13

2.6 PDSparse: A Primal and Dual Sparse Approach to Extreme Multiclass

and Multilabel Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 16

iv

Contents CONTENTS

3 Analysis of Existing Approaches 19

3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2 Scalability and Real World Usage . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Proposed Formulations 23

4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.2 Extreme Classification as Structured Prediction . . . . . . . . . . . . . . . 25

4.3 General Formulations for Structured Objects . . . . . . . . . . . . . . . . . 30

4.4 Large Margin Learning with Label Embeddings . . . . . . . . . . . . . . . 34

5 Implementation Results and Analysis 37

5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Conclusion and Future Work 42

6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Acknowledgements 47

v

List of Tables

4.1 Different Variants of Embedding Formulation . . . . . . . . . . . . . . . . 36

5.1 Dataset Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.2 Comparison of PDSparse with Structured Learning Formulation 4.2 . . . . 38

5.3 Comparison of different variants of Structured learning formulation, Sec-

tion 4.2, with PDSparse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.4 PDSparse Vs Version 6: Formulation 4.9 without edge weights . . . . . . . 39

5.5 PDSparse Vs Version 7: Margin Rescaled formulation with recursive con-

straints without edge parameters . . . . . . . . . . . . . . . . . . . . . . . 40

5.6 PDSparse Vs Version 8: Negative slack formulation with recursive con-

straints and edges triggered only when both labels are present. . . . . . . . 40

5.7 PDSparse Vs Version 9: Negative Slack only formulation with margin

rescaling, recursive constraints and GloVe Embeddings Formulation 4.15 . 41

5.8 PDSparse Vs Version 10: Negative Slack only formulation with margin

rescaling, recursive constraints and GloVe Embeddings Formulation 4.16 . 41

vi

Chapter 1

Introduction

1.1 Background

In machine learning and statistics, classification or supervised learning is the problem of

assigning a new data point (observation) to previously known set of categories (classes),

on the basis of some prior information available in form of training data containing data

points whose categories are known. Classification is further broadly divided into three

types; binary, multiclass and multi-label, on the basis of number of classes an observation

can belong and total number of possible classes. In binary classification there are only two

classes and an observation can belong to any one of them at a point of time. For example

consider the problem of predicting whether a patient is suffering from a particular diseases

(yes or no) given his medical history, which we refer to as input features for that patient.

In multiclass classification total number of classes are more than two unlike binary

classification however an observation can only belong to any one of them. Consider the

problem of identifying shapes of object, the output classes can be circle, triangle, rectangle

etc. Here the number of classes are finite and more than two however an object can only

belong to any one of them as an object can never have more than one shape unless the

two shapes are generalization of one of them like square and rectangle.

Lastly multi-label classification is same as multiclass classification but the restriction

of an observation belonging to more than one class is removed. In multi-label learning

an observation can belong to more than one class. Consider our previous example of

Chapter 1. Introduction 2

identifying shapes of objects, lets say now we want to predict color of object instead of its

shape. This is a multi-label learning problem as an object can have more than one color.

Considering above definitions of classification problem we define Extreme Classification

as the problem of learning a classifier that can annotate a data point with the most relevant

subset of labels from an extremely large label set [8]. Extreme Classification is essentially

a multi-label learning problem with an extremely large label space (number of classes),

say order of 104 or something in millions.

1.2 Motivation

Extreme multi-label learning is an important research problem as it has many applications

in tagging, recommendation and ranking. Consider the problem of automatically identi-

fying relevant tags/labels for an article or document submitted on Wikipedia. Currently

millions of wiki tags are available which make the above task of selecting a small set of

labels substantially difficult , hence falling under category of extreme multi-label learning.

Similarly, one may wish to learn an extreme classifier that can recommend movies or songs

to a user out of millions of movies and songs available on-line, given his/her past history

of likings . Finally in information retrieval one may wish to rank large set of documents

given a user query e.g. search results by Google or any search engine. In general, any

problem of recommendation, ranking or tagging can be formulated as multi-label learning

problem with each item to be ranked/recommended as a separate label, which when pre-

sented with input features of an observation predicts most relevant set of labels. Similarly

each tag that is to be assigned can be treated as separated label.

Due to such motivating real-life applications, extreme classifications being focus area

of many researchers and organizations all over world. The main problem and the difference

between traditional multi-label learning and extreme classification is scalability and large

memory footprint due to extremely large label spaces. The key challenge being the design

of scalable algorithms that offer real-time predictions and have a small memory footprint.

2


1.3 Problem Statement

Let the dimensionality of input feature vectors be D, total number of labels be K and total

number of training instances be N. Given set of point (xi, yi), i = 1, 2...N where xi ∈ RD

is the input and yi ∈ 0, 1K is set of relevant labels (subset of set of possible labels K).

For any j ∈ [K] yji = 1 indicate presence of that label and yji = 0 indicate label is either

irrelevant or missing.The task is to predict top k relevant labels for a given query point

x∗, when the order of D and K is very large.

1.4 Proposed Solution

In order to address the extreme classification problem we propose various formulations

based on recent development [15]. We classify our formulations into following categories.

1. Extreme classification posed as structured learning problem : Each configuration of

labels is treated as single structured output/object. The objective is to predict an

object closest to the ground truth configuration.

2. General Formulations for Structured Objects: In this part we present loss augmented

formulations along with different forms of constraints for optimization problem.

3. Large Margin Learning in Embedding Space: We combine the idea of representing

high dimensional label space into a lower dimensional embedding and max-margin

learning.

1.5 Road-map of the Report

The organization of the rest of the report is as follows:

• Chapter 2 corresponds to the literature survey where we present various existing

approaches used to address the extreme classification problem. Since extreme clas-

sification is similar to traditional classification problem many different techniques

exists in the field of ranking, recommendation and classification systems which can

3


be used in similar settings. However we only present techniques which cater to the

need of large label space learning.

• Chapter 3 presents comparative study of the different approaches discussed in

the previous section based on various criteria: including training and prediction

time,accuracy etc. and analyze strength and weaknesses of their models.

• Chapter 4 present the various formulations we proposed in order to improve upon

the existing techniques.

• Chapter 5 provides the results of the experiments we conducted on some of the

extreme classification datasets. It compares different formulations with state of the

art technique and analyzes strength and limitations of each model.

• Chapter 6 concludes the report by covering the key aspects of the proposed ideas.

It also mentions about the future work that can be done in the field of extreme

multi-label learning that can further improve the system.

4

Chapter 2

Literature Survey

In this section we present some of the techniques used by researchers in the past to address

the problems with extreme classification. Most of these methods are essentially tree based

or embedding based. We will define what these are in further sections.

2.1 Binary Relevance

In this approach we build a 1vsAll classifier, which we refer to as binary relevance, for

all labels to predict most relevant labels for any query point.Essentially these methods

treat predicting each label as a separate binary classification task. They typically rank

possibilities by scoring each label in turn. The Approach is as follows:

• Training Phase: Learn a label scorer f(x, k)∀k ∈ [K] which when presented with

a data point x and a label k returns a real valued score.

• Prediction Phase: Given a new query point x*, compute these scores zk =

f(x, Lk)∀k ∈ [K]. The top k most relevant labels,L for input x* are predicted

by sorting these scores and extracting the top k ones.

The model that are used to predict these scores could be anything from linear SVMs, kernel

SVMs, neural networks, decision trees etc. These methods have prediction in linear with

respect to the label size as every time it needs to compute scores for each label. Therefore

these methods does not scale when label space,K is very large which is the typical case of

extreme classification.

Chapter 2. Literature Survey 6

2.2 Label Partitioning for Sublinear Ranking

As discussed, above methods become impractical as the number of labels increases to

millions because these algorithms are linear in label space. This paper [14] provide a

wrapper method that can make these existing methods to extreme multilabel learning

tractable while maintaining the accuracy. The paper provide a two step algorithm that

uses the existing label scorers and makes the process of prediction sublinear by reducing

the label space to a subset of original space for each query point. The algorithm works in

following manner

• Input Partitioning- Input space is partitioned to create different clusters (partitions)

• Label Assignment- A subset of labels is assigned to each partition.

At prediction time, the following algorithm is followed to predict(rankings) set of

relevant labels for a new query point x:

1. Given a query point x, corresponding set of partitions p are identified by the input

partitioner p= g(x)

2. label set Lp, is identified corresponding to the set the partitions p

3. scores are calculated for each label y ∈ Lp with the help of label scorer and ranked

to give the final output

The cost of ranking at prediction time is considerably reduced as number of labels Lp is

less than the total number of labels. Given the brief of how algorithm work following is

the exact algorithm provided by the authors of paper for sub-linear Ranking.

Input Partitioning :

It is assumed that user has already trained a label scorer f(x, Lk) using the previous

binary relevance approach.There are two guidelines that need to be taken care of in order

to achieve our objective.

• Examples that share highly relevant labels should be mapped to same partition

6


• Examples for which label scorer performs well should be prioritized while learning

the partitioner.

Based on this they propose the following approaches for input partitioner that works by

closest assigned partition as defined by the partition centroids ci, i=1 ... P.

Weighted Hierarchical Partitioner Following is a straight forward optimization prob-

lem for prioritizing examples for which label scorer is performing well.

N∑i=1

P∑j=1

l(f(xi), yi)||xi − cj||2

where f(x) = (f(x, L1), f(x, L2)..f(x, LK)) is the hypothesis function that returns a vec-

tor containing the relevant labels corresponding to input xi and l(f(xi), yi) is accuracy

measurement of interest (e.g. precision at K). In practice this can be implemented as a

weighted version of hierarchical k-means.

Weighted Embedded Partitioner The above method only incorporates prioritization

of examples but does not fulfill the first guideline that examples sharing highly ranked

relevant labels are mapped together. One way of encoding this constraint is to optimize

the above weighted partitioner in learnt ”embedding ”space.

N∑i=1

P∑j=1

l(f(xi), yi)||Mxi − cj||2

Label Assignment :

This section presents the algorithm to assign labels to the partitions created using the

input partitioner. We define α = 0, 1K , where αi determines if a label Li should be

assigned to the partition (αi = 1) or not (αi = 0).We also define the following

• Rt,i as the rank of label i for example t:

Rt,i = 1 +∑j =i

δ(f(xt, Lj) > f(xt, Li))

• Rt,yt is the rank of true label for example t.

7


We start with the base case where we want to optimize precision at 1 and each example

has only one relevant label. To achieve out objective two conditions must hold : (1) true

label must be assigned to the same partition and (2) true label must be ranked highest

among all other labels in same subset. These constraints are incorporated in following

objective function:

maxα

∑t

αyt(1− maxRt,i<Rt,yt

αi)

subject to

0 ≤ αi ≤ 1

and to restrict the number of labels to be assigned to partition, αi values are ranked and

top C are taken for the partition.

The above formulation is generalized for precision at k > 1 by replacing inner max

with a function that counts the number of violations above relevant label.

maxα

∑t

αyt(1− ϕ(∑

Rt,i<Rt,yt

αi))

subject to

0 ≤ αi ≤ 1

where ϕ(r) = 0ifr < k and 1 otherwise to optimize at k.

In order to extend our formulation for more than one labels instead of considering

only one label in our optimization problem we tend to maximize the the internal term for

maximum number of labels in the partition which is captured in the following:

maxα

∑t

1|yt|

∑y∈yt

αy

w(Rt,y)(1− ϕ(

∑Rt,i<Rt,yt

αi))

subject to

0 ≤ αi ≤ 1

where ϕ(r) is replaced with a sigmoid:ϕ(r) = 11+e(k−r) to approximate the value and relax

the optimization problem and w(Rt,y) = (R)t,yλ, λ ≥ 0 is a weighing factor governed by the

8


rank of that label for that example in order to incorporate the constraint of prioritizing

the examples for which label scorer is performing well.

The paper provides a innovative algorithm for reducing the prediction time in real

world scenarios by partitioning the label space.However experimental results shows that

results are significantly affected by choice of input partitioner and therefore remains the

area of interest for further development.

2.3 Low Rank Empirical Risk Minimization

The above method of input partitioning is a novel approach however it still requires longer

training times(even more than binary relevance) and also its performance critically depend

upon choice of partitioning and label assignment which may or may not be optimal in

many cases. Also their method cannot handle missing labels which are common in most

real-world situations. For example consider the task of assigning wiki tags to an article.

The training data which contains tags for articles are manually assigned by individuals

and it is impossible for an individual to assign all the relevant tags out of millions of wiki

tags. Therefore it is more likely that training data may not contain some labels which are

actually relevant for articles, we call them missing labels.

The paper[16] addresses the above issues related to multi-label classification prob-

lem in real applications. Extreme classification problem is mostly addressed by either

tree based methods or embeddings based methods.This paper uses the latter approach.

Embedding based methods basically project high dimensional label space into low dimen-

sion(embedded space), then learns regressors over the embedded space and at the time of

prediction revert back these predicted embeddings into original label space. They tend

to formulate into learning a low rank linear model Z ∈ RD×Ks.t. ypred = ZTx which can

be casted into a standard empirical risk minimization (ERM) framework to allow use of

various loss functions and regularizers.ERM framework is an abstraction that is based on

the principle that learning algorithms should choose a hypothesis which minimizes the

empirical risk.

The motivation for this framework comes from the fact that although label space is

very large but there exist significant label correlation among labels which allows them

9


to be modeled by low rank constraint. The algorithm works as follows. The hypothesis

function, parametrized by Z, is defined as f(x;Z) = ZTx, where Z ∈ RD×K . Also we

define loss function as l(y, f(x;Z)) which is assumed to be decomposable over labels i.e.

l(y, f(x;Z)) =K∑j=1

l(yj, f j(x;Z)).

Using the above functions the optimization problem can be written as

(Z) = argminZJΩ(Z) =

∑(i,j)∈Ω

l(Yij, fj(xi;Z)) + λ.r(Z) s.t rank(Z) ≤ k

where r(Z) : RD×K −→ R is a regularizer and Ω ⊆ [N ]]× [K] represent the index set

of ”known” labels. Here Standard setting is assumed that Yij = 1or0 for present or absent

and Yij =? for missing. They show that above formulation can be solved using alternating

minimization technique and even have a closed form solution in case of L2 loss.

Algorithm :

It is assumed that Z is a low rank matrix therefore we can have decomposition of Z in

the form Z = WHT where W ∈ RD×k and H ∈ RK×k. Further it is also assumed that

regularizer can be decomposed also in form r(Z) = r1(W ) + r2(H). Therefore above

formulation can be written in form of W and H as:

JΩ(W,H) =∑

(i,j)∈Ωl(Yij, x

Ti Whj) +

λ2(||W ||2F + ||H||2F )

where hTj is the j-th row of H. Here trace norm regularization is used for Z which is

equivalent to sum of frobenius norms of W and H. Now if we make one of W or H con-

stant above formulation becomes a convex function, which permits the use of alternating

minimization technique that is guaranteed to converge to a stationary point when both

minHJΩ(W(t−1), H) and minWJΩ(W,H

t) are uniquely defined.

Once W is fixed, H can be independently updated as follow:

hj ← arg minh∈Rk

∑i:(i,j)∈Ω

l(Yij, fj(xi;Z)) +

λ2.||h||22

which is similar to solving a regression problem over k variable.Now when H is fixed

W can be updated as follows :

If W ∗ = argminW JΩ(W,H)

and we denote w∗ = vec(W ∗), then w∗ = argminw∈Rdk g(w),

g(w) =∑

(i,j)∈Ωl(Yij, w

T xij) +λ2.||w||22

where xij = hj⊗xi , ⊗ denotes the outer product. Taking Squared loss as an example

10


above is equivalent to a regularized least squares problem with dk variables, whose closed

form solution becomes infeasible when d is large. Therefore iterative methods such as

conjugate gradient (CG), are used for this purpose, which often require computation

of gradient and multiplication of a vector with hessian matrix. They provide efficient

methods for computing these two which offers a speedup of O(d) over direct computation

where d is average number of non zero features for an instance.

The paper presents a novel approach of formulating multi-label learning problem with

missing labels into a standard ERM framework with the rank constraints and regularizers

to increase flexibility and efficiency. It also provides algorithms based on alternating

minimization technique to efficiently solve such non convex formulations. However the

method works only for decomposable loss functions and therefore requires further work

to incorporate other non decomposable functions.

2.4 FastXML: A Fast, Accurate and Stable Tree-classifier

for eXtreme Multi-label Learning

Extreme classification as explained before can be addressed using either tree based meth-

ods or embeddings based methods. The paper [11] presents an algorithm to build a tree

based classifier, referred as FastXML, which is more accurate and faster to train than all

the previously discussed techniques.

Tree based methods are often seen to beat the 1-vs-All baseline systems in terms of

prediction accuracy at a fractional cost of the prediction time. The algorithm presented in

[14] can take time even longer than 1-vs-all methods due to additional cost of partitioning

and label assignment. FastXML is like any other tree based algorithm where the objective

is to learn a tree like structure/hierarchy over the data in order to reduce the number of

items(labels) in the leaf nodes, that is assigned to each instance while training/testing.

Training :

FastXML learns hierarchy over feature space instead of label space as opposed to multi-

class setting. The key idea here is that a ranking based partitioning function used to split

11


a node. Similar to any tree learning algorithm, FastXML performs recursive partitioning

of parent’s feature space into its children. Existing approaches uses local measures like

GINI index, Entropy etc. which depend solely on predictions at node being partitioned to

decide the split at node. However in order to increase the overall performance of algorithm,

node partitioning should be done using global measures which require partitioning to be

learnt jointly over all the nodes. Unfortunately optimizing such a global measure can be

very expensive and there its the main reason existing approaches optimizes locally. This

allows the hierarchy to be learnt node by node starting from the root and going down to

the leaves and is more efficient than learning all the nodes jointly. FastXML learns the

hierarchy by directly optimizing a ranking loss function. In particular, it optimizes the

normalized Discounted Cumulative Gain(nDCG) [13].

Let idesc1 , ..., idescK be the permutation indices that sort a real-valued vector y ∈ RK in

descending order are defined such that if j > k then yidescjyidesck

. The rankk(y) operator,

which returns the indices of the k largest elements of y ranked in descending order (with

ties broken randomly), can then be defined as rankk(y) = [idesc1 , ..., idesck ]T

Let π(1, K)denote the set of all permutations of1,2,...,K. The Discounted Cumulative

Gain (DCG) at k of a ranking r ∈ π(1, K) given a ground truth label vector y with binary

levels of relevance is LDCG@k(r, y) =k∑

l=1

yrllog(1+l)

DCG is sensitive to both the ranking and the relevance of predictions unlike precision

and other measures. The normalized DCG, is defined by

LDCG@k(r, y) = Ik(y)k∑

l=1

yrllog(1+l)

, Ik(y) = 1min(k,1T y)∑

l=1

1log(1+l)

The value of nDCG lies between 0 and 1 allowing it to be used to compare rankings

across label vectors with different numbers of positive labels.

FastXML partitions the current node’s feature space by learning a linear separator w

such that

min ||w||1 +∑i

Cδ(δi)log(1 + e−δiwT xi)− Cr

∑i

12(1 + δi)LnDCG@L(r

+, yi)− Cr

∑i

12(1−

δi)LnDCG@L(r−, yi)

w.r.t. w ∈ RD, δ = −1,+1K , r+, r− ∈ π(1, L)

where i indexes all the training points present at the node being partitioned, δi ∈

−1,+1 indicates whether point i was assigned to the negative or positive partition and

12


r+ and r− represent the predicted label rankings for the positive and negative partition

respectively. Cδ and Cr are user defined parameters which determine the relative impor-

tance of the three terms. FastXML uses alternating minimization technique to learn this

separator.

Prediction:

FastXML instead of using a single tree that partitions the nodes using above problem,

learns an ensemble of trees for accurate and stable predictions. Therefore given a novel

point x ∈ RD, FastXML’s top ranked k predictions are given by

r(x) = rankk(1T

T∑t=1

P leaft (x))

where T is the number of trees in the FastXML ensemble and P leaft (x) ∝

∑i∈Sleaf

t (x)

yi

and Sleaf(x)t are label distribution and set of points respectively in leaf node of x in the

tree t.

FastXML learnt an ensemble of trees with prediction costs that were logarithmic in

the number of labels. The key technical contribution in FastXML was a novel node

partitioning formulation which optimized an nDCG based ranking loss over all the labels.

Such a loss was found to be more suitable for extreme multi-label learning than the Gini

index optimized by MLRF[1] or the clustering error optimized by LPSR[14]. nDCG is

known to be a hard loss to optimize using gradient descent based techniques. FastXML

therefore developed an efficient alternating minimization algorithm for its optimization.

2.5 Sparse Local Embeddings for Extreme Classifica-

tion

Embeddings based methods are popular choice for addressing large scale multi-label clas-

sification. They address the problem of large number of labels by projecting label vectors

into low dimensional linear subspace. It is based on the assumption that there exist sig-

nificant label correlations among the set of labels which allows them to be modeled by

a row rank matrix. Despite the computational benefits provided by embeddings based

methods , they still suffer scalability issues and are unable to deliver higher accuracies on

13


most of the real world scenarios. The reason behind is that the underlying assumption

of low rank does not hold because of presence of ”tail” labels in most scenarios. There

exist a large number of labels which occur in very few training instances. The paper[3]

provides an algorithm SLEEC that address all the issues discussed.

Embeddings based approaches: These approaches works in the similar way discussed

above. They transform original high dimensional label vectors in to low dimensional

linear subspace (embeddings). The model is then trained to predict these embeddings

instead of original labels which are then later transformed back to original label space.

Mathematically given training points (xi, yi), i = 1...n, we transform each label vector as

zi = Uyi then learn these embeddings as function of xi as zi = V xi. At last the labels for a

given point is predicted by post processing y = U∗V x where U∗ is the decompression ma-

trix.These approaches are slow at training and prediction time even for small embedding

dimensions.

SLEEC Approach: It is based on the assumption that low rank linear subspace modeling

of label space Y is violated in most real world situations but Y can be accurately predicted

by using a low dimensional non linear manifold. SLEEC is essentially an embeddings based

approach but differs from traditional approaches in following manner. Firstly instead of

global projection, it learns embeddings zi only on set of points which are nearest neighbors

to each other. Secondly, at prediction time instead of using a decompression technique it

uses kNN classifier in embeddings space. kNN helps in tackling the issue of tail labels as

it outperforms discriminative methods when trainings instances are less, the case of ”tail”

labels.

Clustering based speedup: kNN classifier as known are lazy learners therefore are slow

at prediction time. In order to reduce the prediction time SLEEC first clusters the data

over input feature space and then learn separate embeddings for each clusters. SLEEC

tackles the issue of instability in clustering owing to curse of dimensionality by learning

an ensemble of learners over different sets of clusters.

SLEEC algorithm works by first learning an embedding zi = V xi for query point

xi then using kNN classifier over embedding space, S = V x1, V2, ...VN to find the most

relevant labels for that point. The next section presents the algorithms to learn these

embeddings and use of clustering to make algorithm scalable over large dataset.

14


Learning Sparse Embeddings:

SLECC is based on the assumption that Y cannot be modeled using low rank however,

Y can still be accurately modeled using a low-dimensional non-linear manifold. That

is, instead of preserving distances (or inner products) of a given label vector to all the

training points, they attempt to preserve the distance to only a few nearest neighbors. The

optimization problem is to find a D-dimensional embedding matrix Z = [z1, ...zN ] ∈ RD×N

which minimizes

minZ∈RD×n

||PΩ(YTY )− PΩ(Z

TZ)||2F + λ||Z||1

where Ω denotes the set of neighbors to preserve i.e. (i, j) ∈ Ω iff j ∈ Ni where Ni

denotes the nearest neighbors of i. These are the set of points which have largest inner

product with yi. And (PΩ(YTY ))ij =< yi, yj > if (i, j) ∈ Ω otherwise 0. The next thing

required is to learn these embeddings from input features i.e. Z = V X where V ∈ RD×D.

Now combining this with above formulation and adding L2 regularization for V, the final

optimization problem looks like

minZ∈RL×n

||PΩ(YTY )− PΩ(X

TV TV X)||2F + λ||V X||1 + µ||V ||2F

SLEEC uses singular value projection(SVP)for learning embeddings and alternating direc-

tion method of multipliers(ADMM) for learning regressors for these embeddings. SVP is a

simple projected gradient descent method where the projection is onto the set of low-rank

matrices[9]. The alternating direction method of multipliers (ADMM) is an algorithm

that solves convex optimization problems by breaking them into smaller pieces, each of

which are then easier to handle[4].

SLEEC is a novel technique that combines clustering with embeddings to overcome the

limitations of traditional Embeddings-based approaches. It has better scaling properties

than all other embedding methods and achieves higher accuracies even in presence of ”tail”

labels. SLEEC is state of art algorithm that outperforms leading tree based method,

FASTXML and all other embedding based approaches in terms of accuracy.

15


2.6 PDSparse: A Primal and Dual Sparse Approach

to Extreme Multiclass and Multilabel Classifica-

tion

In the Extreme Classification setting, we discussed the two popular approaches; tree

based and embeddings based, this paper[15] presents an algorithm that utilizes the spar-

sity present in primal and dual formulation of problem using a margin maximizing loss

function. They propose a Fully-Corrective Block-Coordinate Frank-Wolfe (FC-BCFW)

algorithm that exploits both primal and dual sparsity to achieve a complexity sublinear

to the number of primal and dual variables.

In there work, instead of making structural assumption on the relation between label,

they assume that for each instance, there are only few correct labels and the feature space

is rich enough for one to distinguish between labels. Note this assumption is much weaker

than other structural assumption. Under this assumption, It was shown that a simple

margin-maximizing loss yields extremely sparse dual solution in the setting of extreme

classification, and furthermore, the loss, when combined with l1 penalty, gives sparse

solution both in the primal and in the dual for any l1 parameter λ > 0 .

They proposed a Fully-Corrective Block Coordinate Frank-Wolfe algorithm to solve

the primal-dual sparse problem given by margin-maximizing loss with l1 − l2 penalties.

Let D be the problem dimension, N be the number of samples, and K be the number

of classes. In case DK >> N , the proposed algorithm has complexity sublinear to the

number of variables by exploiting sparsity in the primal to search active variables in the

dual. In case DK ≤ N , they proposed a stochastic approximation method to further

speed up the search-step in the Frank-Wolfe algorithm.

Problem Formulation :

Let P (y) = k ∈ [K]|yk = 1 denotes the positive(relevant) label indexes, while N(y) =

k ∈ [K]|yk = 0 denotes the negative(irrelevant) label indexes. The only assumption

made is nnz(y) is very small compared to K and not growing linearly with K. Let W =

16


(wk)Kk=1 be a D ×K matrix that represents the parameters of the model.

Loss with Dual Sparsity : In their work, they used the following separation ranking

loss that penalizes the prediction on an instance x by highest response from set of negative

(irrelevant) labels minus lowest response from set of positive labels.

L(z, y) = maxkn∈N(y)

maxkp∈P (y)

(1 + zkn − zkp)+

where zk =< x,wk > denote the response of kth label for a training instance x. The

loss is zero if all the positive labels have higher responses than negative labels. The

motivation behind the loss function is that there are only few labels with high responses

hence accuracy can be boosted by learning how to distinguish between those few confusing

labels.

Primal and dual sparse formulation They formulated extreme classification problem

as minimizing following problem

λK∑k=1

||wk||1 +N∑i=1

L(W Txi, yi)

The optimal solution of this formulation satisfies λρ∗k + XTαk = 0∀k ∈ [K] for some

subgradient ρ∗k ∈ ∂|wk||1 and α∗i ∈ ∂zL(zi, yi) with zi = W ∗Txi. The subgradients α∗

i

satisfies α∗ik = 0 for some k∗ = k only if k∗ is a confusing label that satisfies

k∗ = argmaxk =k⟨wk, xi⟩

which means nnz(αi)≪ K and nnz(A)≪ NK. They further showed that optimal primal

and dual solution (W*,A*) of the above ERM problem with separation loss defined above

satisfies nnz(W ∗) ≤ nnz(A∗) for any λ > 0 if the design matrix X is drawn from a

continuous probability distribution.

The paper presented a novel algorithm that does not uses any structural assumptions,

which has training time, prediction time and model size growing sublinearly with label

space while keeping a competitive accuracy. Extreme classification problem handled either

by tree based or embedding based is basically performing some grouping over the labels

to reduce the overall dimensionality. However in their work they introduced a totally

different approach that does not perform any such grouping, instead utilizes sparsity to

achieve similar performance than state of art techniques and actually outperforms all of

17


them in terms of training and prediction time. This is achieved only by exploiting the

sparsity present in both primal and dual formulations of extreme classification problem

with a margin maximizing loss function.

18

Chapter 3

Analysis of Existing Approaches

In this section we present comparative study of above discussed approaches on based

of different criteria including training and prediction time,accuracy etc. and analyses

strength and weaknesses of these models.

3.1 Assumptions

In this part we present some of the critical assumptions made by authors in their work in

order for algorithms to work successfully.

1. LPSR: It is assumed that user has pre trained label scorers for each label.

2. LEML: The underlying assumption was label space can be modelled using a low

rank linear model as there exist significant label correlation.

3. FastXML: The only assumption made was label vector y is log K sparse, i.e. there

are only few positive labels for each observation .

4. SLEEC: It assumed that the low rank constraint is easily violated in real life situa-

tions due to presence of ”tail label”, however it possible to model them using a low

dimensional non linear manifold.

5. PDSparse: The only assumption made is although label space is very large but there

exist only few positive labels for each observation.

Chapter 3. Analysis of Existing Approaches 20

3.2 Scalability and Real World Usage

:

In real world scenarios, it is required that prediction time should be atleast sublinear

in terms of labels and algorithms should be able to scale efficiently to large label space.

Based on this below we classify whether the algorithm is suitable for real world usage.

1. LPSR: Prediction time is sublinear with respect to label space which makes is suit-

able for real-life applications however training time is even longer than Binary Rel-

evance approach.

2. LEML: Provides efficient methods for gradient and hessian matrix computation that

makes it efficient enough to handle large scale problem. Prediction time is still linear

in label space.

3. FastXML: It uses an ensemble of trees which allows it to scale over large datasets.

The prediction time also is logarithmic in label space, making it suitable for most

of the real world problems.

4. SLEEC: For handling large scale datasets, it uses clustering and learns model (em-

beddings) over individual clusters. To guard against instability in clustering due to

curse of dimensionality it uses an ensemble of model trained over different set of

clusters.

5. PDSparse: Due to sparse problem formulation, it outperforms all of the above

methods in terms of training and prediction time. The algorithm is well suited

for real world usage.

3.3 Advantages

In this part we present some of the advantages offered by these algorithms over others.

They can be advantageous either from research point of view or from real world usage.

20


1. LPSR: Generic method independent of choice of classifier or loss function used to

train label scorers. Prediction time is sublinear making it suitable for real world

scenarios.

2. LEML: Problem formulation in standard ERM framework allows its to use different

loss functions and regularizers without changing the algorithm. It is also designed

to handle missing labels present in training set.

3. FastXML: Logarithmic time predictions and no structural assumptions about the

data.

4. SLEEC: Achieves highest prediction accuracies across all methods and has better

scaling properties than all other embedding based methods.

5. PDSparse: Lowest training and prediction time than all of these methods with

comparable accuracies and no structural assumptions about data.

3.4 Limitations

Lastly we present some of the limitations of these algorithms.

1. LPSR: The performance of algorithm significantly depend upon choice of partition-

ing algorithm which may or may not result in optimal label assignment therefore

remains an area of scope for further research.

2. LEML: Prediction time is linear in label space and underlying assumption of low

rank linear modelling of label space is violated in many real-life scenarios.

3. FastXML: Prediction accuracy is low for tail labels which constitute a significant

part of total label space.

4. SLEEC: High training and prediction time compared to tree based and sparsity

based approaches.Algorithms performance significantly depend upon hyper param-

eters which are sometime difficult to tune. ”Tail labels” still remains an issue of

concern.

21


5. PDSparse: Highly accurate in predicting the top label for a query point however

accuracy in predicting top k results decreases significantly as k increases. Also ”tail

labels” still remains an issue of concern.

22

Chapter 4

Proposed Formulations

In this section we present some of the formulations based on the recent PDSparse[15]

technique, which tries to capture and utilize the hidden structural information present

in the labels. All of these formulations uses the same framework used in [15] with some

modifications.

4.1 Overview

In section 2 we have discussed the various approaches used in the past along with their

advantages and limitations in section 3. We see that there are broadly two classes of

approaches: structural and OneVsAll. Structural approaches are basically embedding

based(low rank) or tree based approaches which makes certain assumptions about the

label space. These approaches provide good accuracies when the assumptions hold but

we see that most of these assumptions gets violated in real life scenarios. OneVsAll

techniques like PDSparse[15] , Dismec[2], etc. as the name suggest does not make any

such and treats every label independently. They have been able to outperform all other

structural techniques which motivates our work. In this work we tried to combine the idea

of label correlations (hidden structure between labels) being present in real-life scenarios

with the OneVsAll approaches, mainly PDSparse. All our formulations are based on the

following large margin structured learning formulation[12]

Chapter 4. Proposed Formulations 24

minw

1

2||w||22 + λ|w|+ C

m∑i=1

ξi

s.t. ⟨w, δψi(xi, y)⟩ ≥ 1− ξi ∀i ∈ [m] (4.1)

ξ ⪰ 0

where δψi(xi, y) = ψ(xi, yi)− ψ(xi, y)

with the general form of our hypothesis function given by f(x;w) = argmaxy⟨w,ψ(x, y)⟩.

Basically, In our work we present different formulation that differs from each other in

terms of the feature maps ψ(x, y) and the way constraints are formulated. As mentioned

previously, all these formulations uses the following framework from [15],[12] for training.Algorithm 1: Cutting Plane Algorithm For Training

1 Initialize active set Ai for all i in 1...m

2 while iter ≤ max_iter do

3 for i in 1...m do

4 Find the most violating constraint: maxy⟨w,−δψi(xi, y)⟩ and add it to active

set Ai

5 Minimize the objective function w.r.t to new set of constraints Ai

6 end

7 end

The above algorithm mainly consist of two subroutines/subprograms, namely finding

the max violater and a constrained optimizer. In next few sections we present different

features maps and learning algorithms for each case. We will be using the following

notation in all the upcoming sections.

• xi ∈ RD, input feature vectors,

• yi ∈ 0, 1K , corresponds to correct ground truth configuration.

• yi ∈ 0, 1K , subset of ground truth labels.

• yi ∈ 0, 1K , a possible combination of labels , which is not a proper subset of

ground truth labels, yi

24


4.2 Extreme Classification as Structured Prediction

The core idea here is to pose extreme classification as structured learning problem where

the objective is to predict a structured object closest to the ground truth available. In

other words we extend the labels space from K to its powerset and tries to learn a classifier

that can directly learn to predict a set from powerset rather than predicting individual

labels. We propose the following constraints with the same objective function defined in

eq. 4.1 to achieve the same.

K∑j=1

yij.wTj xi +

∑yip=yiq

wpq −maxy =yi

K∑j=1

yj.wTj xi +

∑yp=yq

wpq ≥ 1− ξi

wpq ≥ 0, ∀p, q ∈ 1, ..., K (4.2)

In words, we are trying to say that the total score of the ground truth configuration yi

should be greater than any other possible combinations of labels, y. The score of any

configuration is composed of two parts: node potentials/score and edge potentials/score.

The node potential, wTk x is similar to score we learn in a OneVsAll classifier where weights,

wk ∈ RD are separate for each label and are/can be learned independently.

The motive of adding the edge potential, Epq is to capture the relationship between

cooccuring labels, p and q. However, In this formulation we add edge potential to the

overall score of configuration in either case when both labels are present or both are absent

with an additional constraint that says the edge weights learned should be non negative.

The reason for such conditions on edge potentials is to make the problem of finding the

most violating constraint tractable (solvable in polynomial time).

The problem of finding the most violating constraint/label is as follows:

argmaxy

∆(yi, y) + K∑j=1

yj.wTj xi +

∑yp=yq

wpq (4.3)

Note here we are using a loss ∆(y, y) instead of constant margin in previous case. This

problem is equivalent to fthe following Pott’s Model Energy Minimization Problem[5]

when any decomposable loss function is used.

25


E(f) = minf

∑p∈P

D(fp) +∑p,q∈N

upqT (fp = fq) (4.4)

On comparing eq 4.3 and eq 4.4 it can be easily seen that

D(fp) ∼ wTj xiT (yj = 1) + ∆j(yi, y) and upq ∼ wpq

where T(.) is an indicator function with T(True)=1 , zero otherwise and∆j(yi, y being the

similarity (negative loss) associated with jth label difference between two configuations.

Now Pott’s Model energy minimization problem can be solved using graph cuts, to be

specific using multi-way cut[5]. In our case it is even simpler because we have only binary

values for assignment, therefore multiway cut reduces to s-t cut problem for which known

polynomial time algorithms are available. The algorithm for graph construction is same

as described in [5] for Potts Model Energy Minimization.

The next part is to solve the minimization problem with the new set of constraints.

For this purpose we use exactly the same fully corrective block coordinate frank-wolfe

algorithm used in PDSparse. The dual optimization problem for our formulation is as

follows:

minw

1

2||w(α)||22 +

m∑i=1

∑y

αiy

s.t. αi ∈ Ci, ∀i ∈ [m]

w(α) = prox(m∑i=1

∑y

αiyΨ(xi, y), λ) (4.5)

where Ci = α|∑y∈Pi

αiy = −∑y∈Ni

αiy ∈ [0, C] , αiy ≥ 0, αiy ≤ 0,

w = [w1, w2, ..., wk, w00, w01..wKK ]T with wi ∈ RD, wpq ∈ R

Ψ(x, y) = [ψ(x, y), ϕ(x, y)]T ∈ RKD+K2

ϕ(x, y) = ¬[y0 ⊕ y0, y0 ⊕ y1, ..., yK ⊕ yK ]T ∈ RK2

ψ(x, y) = [y0x, y1x, ..., yKx]T ∈ RKD

26


Here Pi, Ni represents the sets of valid and invalid label configurations respectively for an

instance i. In our case their is only one valid configuration of labels(ground truth) and

any other configuration of labels falls under invalid partition.

Note that its not the standard dual we obtain by using lagrangian method. We trans-

formed standard dual into a form similar to PDSparse dual formulation using some linear

transformations in order to use their algorithms. In case of loss augmentation, the linear

part in above dual objective gets replaced withm∑i=1

∑y

∆(yi, y)αiy and is only possible be-

cause we only had one positive configuration. We will discuss in the next section how to

incorporate losses in cases where |Pi| > 1. Also because of the non negativity constraint

for edge weights in our formulation the prox function takes different forms for node weights

and edge weights. The prox function for node weights is given by

prox(x, λ) =

x− λ, x ≥ λ

x+ λ, x ≤ −λ

0, otherwise

(4.6)

Similarly, for edge weights we get the following form of proximity function.

prox(x, λ) =

x− λ, x ≥ λ

0, otherwise

(4.7)

It can be seen that the above dual formulation is very similar to PDSparse formulation,

with exactly same simplex constraints which allowed us to use same algorithms used in

their work. We apply the same strategy of restricting the updates of variables to an active

set of label configurations, Ai for each sample i. For each sample i, α values are updated

by solving the following optimization problem using the same projection algorithm used

in PDSparse.

27


minαAi

∈Ci

||(−αNiAi)− b||2 + ||αPi

Ai− c||

s.t. αPiAi, αNi

Aiare partitions of α w.r.t Pi, Ni

by = (⟨w,Ψ(xi, y) + ∆(yi, y))/Qi − αtAi(y),∀y ∈ Ni

cyi = αtAi(yi)− (⟨w,Ψ(xi, yi))/Qi,∀yi ∈ Pi (4.8)

Here 1Qi

represents the step size. We used different techniques for step size computation

which includes constant step size, exponential decaying and max eigen value of Hessian

matrix. We summarize overall training algorithm as followsAlgorithm 2: Fully-Corrective BCFW

1 Initialize active set A0 = yi , α = 0

2 while t ≤ max_iter do

3 Draw a sample index i ∈ [m] uniformly at Random

4 Find the most violating configuration y via eq. 4.3

5 At+ 1

2i ← At

i ∪ y

6 Update α values viq eq. 4.8

7 At+1i ← A

t+ 12

i /y|αiy=0

8 Maintain w(α)

9 end

We found that this formulation performed well on small datasets but failed to do

so when applied on medium/large datasets. There were basically two major problems:

graph cut algorithm used to find violating label is very expensive (polynomial time in label

space) and presence of only few positive labels(|yi| << K) for an instance. We found that

the constraints are very hard to satisfy as we are comparing scores of few labels with

comparatively very large number of negative(absent) labels. We tried to overcome these

limitations by modifying the formulation and algorithms used. Below are the details of

some of the variants we tried to achieve the objective.

Variant 1 : Same formulation with approximate solver instead of graph cut.

Variant 2 : No special constraint on edge weights, i.e. they are allowed to be negative.

28


Variant 3 : Edge score between any two labels are added(triggered) to total score of

configuration only when both of them are present but with non negativity constraint.

Variant 4 : No edge potentials, i.e. total score of configuration is sum of node potentials

of labels present in it.

We observed that except variant 4, in all other variants the problem of finding the most

violating label configuration becomes intractable. Therefore we used approximation algo-

rithm for finding the violating label. All those algorithms are based on following graph

relaxation algorithm. They do not provide any guarantee or bounds for closeness with the

exact solution but works very fast as compared to graph cut and provide decent results.Algorithm 3: Graph Relaxation Algorithm

Data:

Z= Node potentials (score of individual labels)

E= Edge potentials

Result: y - a set of labels

1 Initialize y = yi, flip_count= |y|, max_score= score(y)

2 while iter ≤ max_iter and flip_count != 0 do

3 flip_count=0

4 for i in 1...K do

5 compute new_score by flipping the state of label i in y, using Z and E

6 if new_score ≥ max_score then

7 max_score=new_score

8 increment flip_count

9 flip the label i in y

10 end

11 end

12 end

13 return y

As mentioned before for variant 4, the problem of finding the most violating label

configuration is tractable since there are no edge potentials all that is required is to take

the labels with non negative node potentials.

29


4.3 General Formulations for Structured Objects

In previous section we proposed different feature maps for learning the structure between

labels. In this section we propose different ways of forming constraints using same or

different feature maps.

We observed that because of very small number of positive labels for an instance the

constraints in section 4.2 are hard to satisfy. We propose the following formulation that

tries to relax those constraints by limiting the size of label configurations being compared.

miny|y|=k

K∑j=1

yj.wTj xi +

∑yp=yq=1

wpq − maxy|y|=k

K∑j=1

yj.wTj xi +

∑yp=yq=1

wpq ≥ 1− ξi (4.9)

In words, The constraints says that the score of any combination of size k of ground

truth labels (positives) should greater than the score of any combination of size k which

contains at least one negative label.

Lets say the number of active labels for example x be r.

1. r>k, score of all the Crk combinations of active labels should be greater than any

other combination of same size which contains a negative label.

2. r=k, same as above

3. r<k, then we set k=r for that example and above constraint is applied

Special Case: PDSparse is a special case of this formulation when edge parameters

are zero and we optimize for k=1.

The idea behind the above formulation is that it will boost the score of a group of labels

instead of individual labels and therefore will result in better accuracies for top k predic-

tions. It utilizes the same training algorithm we discussed in previous section. Although

it faces the same problem as those variants that finding max violator is intractable when

edge weights are used, therefore we used approximate solver based on algorithm 3.

30


Loss Augmentation:

We observed that we cannot use the same projection algorithm used in PDSparse to update

the dual parameters when we augment a loss function (margin rescaling) to formulation

4.9. It is because we found that the standard margin/slack rescaled formulation cannot

be converted into an equivalent dual program with same constraints as in eq. 4.5.

Standard Lagrangian dual for margin rescaled formulation is as follows

minw

1

2||w(α)||22 −

m∑i=1

∑y∈Pi,y∈Ni

∆(y, y)αyyi

s.t. αi ⪰ 0, αTi 1 = C ∀i ∈ [m]


∑y∈Pi,y∈Ni

αyyi (Ψ(xi, y)−Ψ(xi, y)), λ) (4.10)

Note that constraints on dual parameters, α values, are different from eq. 4.5 and are

much simpler. We used different algorithms to perform the dual update which includes

projected gradient descent (projection to probability simplex[13]), frank-wolfe or condi-

tional gradient method and variants of frank-wolfe algorithm [6]. The overall training

algorithm is summarized below.Algorithm 4: Fully-Corrective BCFW for Margin Rescaled Formulations

1 Initialize α = 0

2 Initialize positive set Pi = y|set of all possible positive combinations of size k∀i

3 Initialize negative set Ni = , set containing invalid/negative combinations

4 while t ≤ max_iter do

5 Draw a sample index i ∈ [m] uniformly at Random

6 Find the most violating configuration y = argmaxy ∈Pi∪Ni,|y|=k

⟨w,ψ(xi, y)⟩

7 N t+1i ← N t

i ∪ y

8 Update α values by minimizing eq. 4.10 with updated Ni

9 Maintain w(α)

10 end

As one can see step 6 in above algorithm requires to find a violating configuration of size

k which is currently not present in active sets Pi and Ni. We developed following algorithm

31


for this purpose. However, this algorithm works only when total score is decomposable

over individual label scores. For edge cases we used the approximate algorithms based on

algorithm 3.

Algorithm 5: Growing Sets AlgorithmData:Ap = Y 1, ..., Y i : Positive Active Set where Y i = 1, ..., KkAn = Y 1, ..., Y j : Negative Active Set where Y j = 1, ..., Kk

Y = 1, ..., Kx : Ground truth labels for current exampleZ ∈ RK : scores of each individual label where zi = ⟨wi, x⟩k= size of required label configurationResult: Y ∗ = argmax

Y ∈Ap∪An,|Y |=k

∑j∈Y

Zj

1 begin2 L←− min(|Y |+ k ∗ |An|+ 1, K)3 Γ←− Top L labels according to scores Z (Using Heap of size L)4 Let level0, level1, ..., levelk be k+1 lists containing an empty set (ϕ)5 for each Label li in Γ do6 for j in k,...,1 do7 for each set Y in levelj−1 do8 Y = Y + li9 if j==k and Y ∈ Ap ∪ An then

10 return y

11 else12 levelj ←− levelj + Y

Recursive Constraints :

We found out that although we relaxed the constraints by limiting size of label configu-

rations but they were less powerful as compared to the original PDSparse constraint, i.e.

over individual labels. Therefore, we decided to incorporate those constraints also together

with our grouping constraint leading to following recursive constraints formulation.

minw,ξ≥0

1

2||w||22 + λ|w|+

m∑i=1

k∑p=1

Cpξpi

s.t ⟨w,ψ(xi, yi)− ψ(xi, yi)⟩ ≥ ∆(y, y)− ξpi , ∀i ∈ [m], |yi| = |yi| = p ∈ 1, 2, ..., k

(4.11)

32


Corresponding dual program:

minw

1

2||w(α)||22 −

m∑i=1

k∑p=1

∑y∈P p

i ,y∈Npi

∆(y, y)αyyip

s.t. αip ⪰ 0, αTip1 = C ∀i ∈ [m], p ∈ [k]


k∑p=1

∑y∈P p

i ,y∈Npi

αyyip (Ψ(xi, y)−Ψ(xi, y)), λ) (4.12)

Note that constraint here are of same form as in eq. 4.10. Therefore, we used the same

algorithms we used to solve previous formulation 4.10. Also we did small modifications in

growing sets algorithm (algorithm 5) for simultaneously finding the maximum violating

label configurations of different sizes.

Negative Slack Only :

In above mentioned formulations we found that number of constraints being added in each

iteration are more that the original PDSparse formulation and therefore taking longer

time. The reason being for each negative configuration added to active set Ni, |Pi| con-

straints are added to problem. Also it is not possible to remove constraints at the end

of each iteration like we did in algorithm 2. In order to solve these we propose following

formulation that separates out constraints for positive and negative configurations.

minw,ξ≥0

1

2||w||22 + λ|w|+ C

m∑i=1

ξi

s.t ⟨w,ψ(xi, yi)⟩ ≥ 1,

⟨w,ψ(xi, yi)⟩ ≤ δ(yi, yi) + ξi, ∀i ∈ [m],

where δ(yi, yi) =|yi ∩ yi||yi|

(4.13)

Here for δ(y, yi) we used hamming loss but in general can be any loss function. Basically,

we want that all the positive labels/configurations should have a positive score and only

few confusing/negative configurations should be allowed to have a positive score by al-

lowing a slack for them.

33


Corresponding Dual Program:

minw

1

2||w(α)||22 −

m∑i=1

∑y∈Ni

δ(y, yi)αyi −

∑y∈Pi

αyi

s.t. αi ⪰ 0,∑y∈Ni

αyi = C ∀i ∈ [m]


∑y∈Pi∪Ni

αyiΨ(xi, y), λ) (4.14)

We used same algorithms, projected gradient descent, frank-wolfe etc. for updating the

dual parameters corresponding to negative configurations and for positive ones the con-

straint is much simpler and we just clip the dual parameters, α values, after gradient

update. This formulation is very much similar to traditional binary SVM formulation

with slight modifications in margin and slack. In our work we combined this formulation

with many different formulations we discussed including recursive constraints, limiting

size of label configurations, etc.

4.4 Large Margin Learning with Label Embeddings

The core of these formulations remains the same, i.e. combine the structural information

with the max-margin learner (OneVsAll). We already presented formulations in section

4.2 that tries to capture this information by introducing edges between labels but it had

its limitations. In this work instead of using edges we used label embeddings as a way of

capturing interaction between co-occurring labels. It is different from embedding based

approaches like [3],[16] etc. in a way that instead of trying to learn regressors(compressors)

and decompressors, it only uses embeddings to learn a better margin between labels in

original space. We propose the following formulation based on above discussion.

Formulation 1 : We use the same l1 − l2 regularized objective from eq. 4.1 with the

modification in the score computation/feature map of any configuration. Previously,

the total score was simply the sum/average of individual node potentials but in

addition to that we have score of label embeddings, V.

34


Mathematically, constraints have the following form :

minyi

K∑j=1

yij(wTj xi + λe⟨E, Vj⟩) −max

yi

K∑j=1

yij(wTj xi + λe⟨E, Vj⟩) ≥ 1− ξi (4.15)

where λe is a hyperparameter controlling the importance given to node and em-

bedding scores, Vj ∈ RK is the embedding corresponding to jth label and E ∈ RK

corresponds to shared weights, similar to edge weights in section 4.2.

Formulation 2 : In the above version we are not using feature vector, xi in embedding

score computation. In this version we are simultaneously trying to maximize the

score of a positive configuration and learning to predict the embeddings from xi.

Mathematically,

miny

K∑j=1

yj(wTj xi+λe⟨Exi, Vj⟩)−max

y

K∑j=1

yj(wTj xi+λe⟨Exi, Vj⟩) ≥ 1−ξi (4.16)

where Vj ∈ RK is the embedding corresponding to jth label, same as before, and

E ∈ RK×D is the transformation matrix/regressor that transform features into em-

bedding space.

Note in both of these formulations the objective is to learn weights w = [w1, w2, wK , vec(E)]

not the embeddings V. In other words, we are not simultaneously learning to predict em-

beddings from feature space and transforming labels to embedding space. Instead we used

unsupervised technique, glove embeddings [10] specifically, to learn V from label space.

The reason for using glove embeddings is it only requires co-occurrence statistics from

corpus which is very cheap to compute given only few labels are active/present for each

instance. However, one can use any supervised or unsupervised technique to transform

labels into smaller dimensional embeddings and use any of the above formulations.

Since here also the problem of finding the max violated constraint required a pair of con-

figurations, min. from positives and max. from negatives, we adopted the same strategy

of already putting all possible valid/positive configurations in the active set and searching

for only negative configurations. We used the same growing sets algorithm, discussed in

the previous section , to find the max violating configuration from negatives as the total

35


score of any configuration is decomposable over individual labels unlike edge weights in

formulation 4.2 and its variants. We also tried following variants of above formulations

based on the formulations discussed in section 4.3.

Table 4.1: Different Variants of Embedding Formulation

Variant \Features Margin Rescaling Recursive Constraints Negative Slack OnlyVariant 1 Yes Yes NoVariant 2 Yes No YesVariant 3 Yes Yes Yes

Summary

In this chapter we presented various formulations based on the existing PDSparse ap-

proach. In particular, how one can use existing PDSparse framework by defining their

own feature maps and algorithms to find the most violating constraints. In next section

we present empirical results of these formulations when applied on some of the extreme

classification datasets.

36

Chapter 5

Implementation Results and Analysis

In this chapter we provide empirical analysis of the various approaches we discussed in

previous chapter. It presents the results obtained by various different formulations and

their comparison with the state of the art system. We further present analysis based on

the results obtained explaining possible reasons behind the observed behavior.

5.1 Datasets

We experimented on various datasets from different domains. Some of these are very

small in terms of label and feature space, hence they do not fall under extreme classifi-

cation category. However, we used those datasets to test our hypothesis before applying

to medium/large datasets. Table 5.1 summarizes the details of each dataset. Although,

Table 5.1: Dataset Details

Dataset Type #instances Dimensionality Avg. LabelsPer PointTrain Dev Test Features Labels

yeast multilabel 1500 400 517 103 14 5scene multilabel 1211 598 598 294 6 1.07emotions multilabel 391 100 102 72 6 1.8sector multiclass 7793 865 961 55197 105 1aloi.bin multiclass 90000 10000 8000 636911 1000 1bookmarks multilabel 58863 ‘17396 11597 2150 208 2bibtex multilabel 4954 1465 976 1836 159 2.4Eur-Lex multilabel 15643 1738 1933 5000 3956 5rcv1_regions multilabel 20835 2314 5000 47237 225 1.2

Chapter 5. Implementation Results and Analysis 38

all our datasets does not comply with the extreme label space we referred earlier in the

introduction but we found them providing enough evidence for drawing conclusions. An-

other reason for not using large dataset is the time complexity. Those datasets typically

take hours or days for training using existing PDSparse technique and most of our formu-

lations take more time PDSparse. Some of them even took hours to train on these smaller

datasets for which PDSparse takes minutes.

5.2 Experimental Results

In this section we present results of some of the experiments we conducted on above

mentioned datasets. Initially we worked on smaller datasets to test our hypothesis. Table

5.2 presents the results of our base structured learning formulation on very small datasets.

Table 5.2: Comparison of PDSparse with Structured Learning Formulation 4.2

Dataset Train Time AccuracyPDSparse Our Method PDSparse Our Method

Emotions 0.1 0.97 58.82 63.72Scene 0.7688 2.52 68.62 59.89Yeast 0.43 6.55 64.99 77.17

It was observed that it was outperforming PDSparse which motivated our work. However

the problem with this formulation was it takes very long to train when applied on medium

datasets. Therefore we tried following variants to reduce the time complexity of algorithm.

Version1 No edge parameters with exact Solver for violating constraint

Version2 Edge triggered only when both labels are active, with approximate Solver

Version3 Approximate solver with negative edge weights permitted

Version4 Brute Force solver for smaller datasets (non negativity constraint)

Version5 Brute Force solver for smaller datasets (negative edge weights allowed)

38


Table 5.3: Comparison of different variants of Structured learning formulation, Section4.2, with PDSparse

Dataset AccuracyPDSparse (Original) Version1 Version2 Version 3 Version 4 Version 5

Emotions 58.52 63.72 65.68 63.72 64.7 62.74Scene(new) 73.91 71.97 72.47 70.4 70.97 69.39Yeast 64.99 76.01 77.57 76.01

Bibtex 63.5 55.94 53.27 51.22Eur-Lex 74.9 60.47 61.4Rcv1_regions 96.48 95.08 95.6 92.86

Sector 95.94 94.7digits 96.91 95.37

We found that it does not perform well on medium datasets. The reason being the only

few positive labels for instance. We found that constraints are very hard to satisfy are

they are trying to compare score of few positive labels with entire label space. We tried

to relax the constraints by limiting the size of label configurations as discussed in section

4.3. We summarizes the results obtained in Table 5.4.

Table 5.4: PDSparse Vs Version 6: Formulation 4.9 without edge weights

Dataset Precision@1 Precision@2 Precision@3 Algorithm Trained@bibtex 63.11 48.05 38.76 PDSparse

63.93 48.05 38.52 Version 6 k=161.02 45.69 37.77 k=257.07 45.39 36.4 k=3

rcv1_regions 96.48 57.75 40.68 PDSparse96.82 57.86 40.78 Version 6 k=196.74 57.84 40.68 k=296.59 57.85 40.78 k=3

EurLex 75.99 67.58 61.4 PDSparse76.3 66.45 59.83 Version 6 k=166.2 61.45 56.02 k=261.97 56.31 51.59 k=3

sector 95.62 PDSparse95.42 Version 6 k=1

aloi.bin 96.51 PDSparse96.45 Version 6 k=1

It was observed that accuracies dropped as we increased the size of label configurations

which totally contradicted the idea behind grouping of labels together. We found that

there are two possible reasons behind the observations. Firstly, there was no loss function

39


(hard margin) i.e. the optimizer penalizes two invalid configurations same irrespective of

their closeness to valid configurations. Secondly, we observed that smaller size (refers to

size of labels, k=1,2,..) constraints are more powerful than larger ones unless there is

some additional information present in larger ones. Basically, if there are no edge param-

eters then satisfying smaller size constraints will automatically satisfy larger ones. The

idea was that in cases where smaller size constraints are hard to satisfy edge parameters

will help in satisfying larger ones and therefore may increase the overall performance of

algorithm. Table 5.5 and Table 5.6 presents the results we obtained by incorporating

recursive constraints and augmenting a loss function.

Table 5.5: PDSparse Vs Version 7: Margin Rescaled formulation with recursive constraintswithout edge parameters

Dataset Precison@1 Precison@2 Precison@3 Algorithm Trained@Bibtex 63.21 48.1 39.13 PDSparse

61.57 48.25 39.17 Version 7 k=162.09 47.49 38.31 k=2

rcv1_regions 96.6 57.67 PDSparse96.54 57.89 Version 7 k=196.92 57.66 k=2

Eur-Lex 75.63 67.58 60.78 PDSparse78.06 68.05 61.44 Version 7 k=171.702 65.46 60.28 k=2

sector 95.73 PDSparse95.42 Version 7 k=1

aloi.bin 96.33 PDSparse95.58 Version 7 k=1

Table 5.6: PDSparse Vs Version 8: Negative slack formulation with recursive constraintsand edges triggered only when both labels are present.


63.62 47.69 38.96 Version 8 k=164.95 47.18 38.11 k=2



40


It was observed that by accompanying edges, recursive constraints and margin rescaling

performance improved for larger size constraints. However, we used brute force solvers to

find the maximum violator which takes a lot of time and is not practical. As mentioned

earlier, some additional information is required in larger size constraints which will help

them when smaller size constraints are violated so we thought of using embeddings in

place of edges. The idea was that it will be able to capture the label correlations and

will provide the additional information needed with a slight expense of time (very less

compared to edges). Table 5.7 presents the result of the above discussed idea.

Table 5.7: PDSparse Vs Version 9: Negative Slack only formulation with margin rescaling,recursive constraints and GloVe Embeddings Formulation 4.15


63.83 (64.03) 48.3 (48.51) 38.96 (39.54) Version 9 k=164.03 (63.93) 47.28 (46.10) 38.55 (38.21) k=2

rcv1_regions 96.6 57.67 PDSparse96.4 (96.26) 57.57 (57.53) Version 9 k=196.42 (96.7) 57.09 (57.34) k=2

Eur-Lex 75.63 67.58 60.78 PDSparse77.34 (79.04) 69.83 (70.58) 63.06 (63.32) Version 9 k=172.47 ( 74.75) 64.25 (67.41) 57.80 (59.90) k=2

Note: Values in brackets are accuracies obtained after applying Stochastic WeightAveraging (SWA)[7] while training.

Table 5.8: PDSparse Vs Version 10: Negative Slack only formulation with margin rescal-ing, recursive constraints and GloVe Embeddings Formulation 4.16


63.42 47.95 39.27 Version 10 k=163.52 46.82 37.36 k=2



Table 5.8 shows the results obtained after incorporating input features in embedding score.

It was observed that they do not make much difference as embeddings are learned with

using them. Although it increased the time complexity of algorithm a lot.

41

Chapter 6

Conclusion and Future Work

This being the final chapter of the report, summarizes the entire work done for improving

the performance of Extreme classification systems and gives an overview of the work that

can be done in future.

The organization of the chapter is as follows. First, section 6.1 points out some inter-

esting observations from the entire report and then section 6.2 gives some of the future

works that can be done to further improve the system.

6.1 Conclusion

This section describes some of the key takeaways from the entire report. The important

observations are as follows:

• Despite the recent advancements in the field of deep learning and its applications

in major fields of vision, speech etc. we found that traditional machine learning

techniques like max-margin learners (SVMs) outperform them in case of extreme

classification.

• Power/Performance of state of art OneVsAll learners can be improved by incorpo-

rating the latent information present in the label space.

• We presented various ways in which we can use the exiting PDSparse framework for

learning to predict structured objects by defining feature maps and algorithms to

find maximum violator.

Chapter 6. Conclusion and Future Work 43

6.2 Future Work

This section describes some of the future works that can be done in extreme classification

to further improve the system. Some of them are:

1. Low accuracy for ”Tails Labels” remains the area of further research as most of the

algorithms fails to perform well in those cases.

2. In real life scenarios, we often encounter missing data i.e not all the relevant labels

are marked for an instance, which questions the performance measures we use to

compare different techniques. Currently, most of the algorithms treats given data

as complete and therefore it may/may not be able to learn the true relationship

between input features and labels. Also the performance measures we use treats all

the absent labels as negatives which is not true in most cases. Hence we need better

measures and techniques which can incorporate the idea of missing or incomplete

information.

3. In our work we tried different ways of incorporating the structural information

present in the labels with state of the art OneVsAll learners to boost their perfor-

mance. They have shown improvements but it still requires a lot of work to combine

the two approaches which can give us better results.

43

Bibliography

[1] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-

label learning with millions of labels: Recommending advertiser bid phrases for web

pages. In Proceedings of the 22nd international conference on World Wide Web. ACM,

pages 13–24.

[2] Rohit Babbar and Bernhard Schölkopf. 2017. Dismec: distributed sparse machines

for extreme multi-label classification. In Proceedings of the Tenth ACM International

Conference on Web Search and Data Mining. ACM, pages 721–729.

[3] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. 2015. Sparse local embeddings

for extreme multi-label classification. In Advances in Neural Information Processing

Systems.

[4] Stephen Boyd. 2011. Alternating direction method of multipliers. In Talk at NIPS

Workshop on Optimization and Machine Learning.

[5] Yuri Boykov, Olga Veksler, and Ramin Zabih. 2001. Fast approximate energy mini-

mization via graph cuts. IEEE Transactions on pattern analysis and machine intelli-

gence 23(11):1222–1239.

[6] Donald Goldfarb, Garud Iyengar, and Chaoxu Zhou. 2017. Linear convergence of

stochastic frank wolfe variants. arXiv preprint arXiv:1703.07269 .

[7] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry P. Vetrov, and An-

drew Gordon Wilson. 2018. Averaging weights leads to wider optima and better gener-

alization. CoRR abs/1803.05407.

44

Bibliography 45

[8] H. Jain, Y. Prabhu, and M. Varma. 2016. Extreme multi-label loss functions for

recommendation, tagging, ranking and other missing label applications. In Proceedings

of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

[9] Prateek Jain, Raghu Meka, and Inderjit S Dhillon. 2010. Guaranteed rank mini-

mization via singular value projection. In Advances in Neural Information Processing

Systems. pages 937–945.

[10] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global

vectors for word representation. In Proceedings of the 2014 conference on empirical

methods in natural language processing (EMNLP). pages 1532–1543.

[11] Y. Prabhu and M. Varma. 2014. Fastxml: A fast, accurate and stable tree-classifier

for extreme multi-label learning. In Proceedings of the ACM SIGKDD Conference on

Knowledge Discovery and Data Mining.

[12] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun.

2005. Large margin methods for structured and interdependent output variables. Jour-

nal of machine learning research 6(Sep):1453–1484.

[13] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, and Wei Chen. 2013. A

theoretical analysis of ndcg type ranking measures. arXiv preprint arXiv:1304.6480 .

[14] Jason Weston, Ameesh Makadia, and Hector Yee. 2013. Label partitioning for sub-

linear ranking. In Sanjoy Dasgupta and David Mcallester, editors, Proceedings of the

30th International Conference on Machine Learning (ICML-13). JMLR Workshop and

Conference Proceedings, volume 28, pages 181–189.

[15] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit

Dhillon. 2016. Pd-sparse : A primal and dual sparse approach to extreme multiclass

and multilabel classification. In Maria Florina Balcan and Kilian Q. Weinberger, edi-

tors, Proceedings of The 33rd International Conference on Machine Learning. PMLR,

New York, New York, USA, volume 48 of Proceedings of Machine Learning Research,

pages 3069–3077.

45

Bibliography 46

[16] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. 2014. Large-

scale multi-label learning with missing labels. In International Conference on Machine

Learning (ICML). volume 32.

46

A ckrio1v ledge'm1ents

I would likP to t'X!Jr<:'~s my si11cC're gratit11clP to my advisors Prof. Saketh Nath and

Prof. Sunita Sarawagi for t lw co11ti11uo11s support of 111_v !\I.Tech study and r f-! lat.ed

resPa.rch_ for their patience, motivation , and immense knowledge_ Their guidance helped

me in all thP time of research and ,vriting of this thesis.I could not have imagined havrng

bet t,er ad visors and mentors for my \L Tech study_

Signature:

Da te: d1:) .J11l!P 2018

~~ -~~---- --·- ---- ·-- -

Akshat Jaiswal

163050069

Extreme Classificationsaketha/research/AkshatMTP2018.pdf · Place : MLAV\A.,6 ~ Prof. Sunita...

Documents

Transcript of Extreme Classificationsaketha/research/AkshatMTP2018.pdf · Place : MLAV\A.,6 ~ Prof. Sunita...