Extreme Classification
Submitted in partial fulfillment of the requirements
of the degree of
Master of Technology
by
Akshat Jaiswal
(Roll no. 163050069)
Guided By:
Prof. Saketha Nath
Prof. Sunita Sarawagi
Department of Computer Science & Engineering
Indian Institute of Technology Bombay
2018
Report Approval

This project report entitled "Extreme Classification", submitted by Akshat Jaiswal (Roll No. 163050069), is approved for the award of the degree of Master of Technology in Computer Science & Engineering.

Prof. Saketha Nath
Dept. of CSE, IIT Hyderabad
Supervisor

Prof. Sunita Sarawagi
Dept. of CSE, IIT Bombay
Supervisor

Prof. Ajit Rajwade
Dept. of CSE, IIT Bombay
Internal Examiner

Prof. Ganesh Ramakrishnan
Dept. of CSE, IIT Bombay
Chairperson

Date: June 2018
Place: Mumbai
Declaration of Authorship

I declare that this written submission represents my ideas in my own words and where others' ideas or words have been included, I have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.

Date: June 2018
Signature:
Akshat Jaiswal
163050069
Abstract
Over the past few years extreme classification has become an important research problem and has gained the focus of many researchers all over the world due to its wide range of applications in ranking, recommendation and tagging. Extreme classification is the process of assigning a relevant set of items/labels to a data point/observation from a very large set of labels. In the real world there are many tasks that require ranking a huge set of items, documents or labels and returning only a few of them; collaborative filtering and image/video annotation, for example, have very large label spaces and can be readily cast as extreme classification problems. The large size of the label space makes existing approaches intractable in terms of scalability, memory footprint, prediction time etc. This report presents some of the techniques used by researchers in the past to address this problem. Most of the existing approaches make use of the assumption that there exists a latent structure among the labels, which they try to exploit in their techniques. However, recent developments have shown that state-of-the-art results can be achieved without making any such assumption, treating each label as an independent entity. In this work we propose various formulations that try to combine the strength of these recent independent learners with the latent structural information present in the label space.
Contents
Report Approval i
Declaration of Authorship ii
Abstract iii
List of Tables vi
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Road-map of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Literature Survey 5
2.1 Binary Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Label Partitioning for Sublinear Ranking . . . . . . . . . . . . . . . . . . . 6
2.3 Low Rank Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . 9
2.4 FastXML: A Fast, Accurate and Stable Tree-classifier for eXtreme Multi-
label Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Sparse Local Embeddings for Extreme Classification . . . . . . . . . . . . . 13
2.6 PDSparse: A Primal and Dual Sparse Approach to Extreme Multiclass
and Multilabel Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Analysis of Existing Approaches 19
3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Scalability and Real World Usage . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Proposed Formulations 23
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Extreme Classification as Structured Prediction . . . . . . . . . . . . . . . 25
4.3 General Formulations for Structured Objects . . . . . . . . . . . . . . . . . 30
4.4 Large Margin Learning with Label Embeddings . . . . . . . . . . . . . . . 34
5 Implementation Results and Analysis 37
5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Conclusion and Future Work 42
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Acknowledgements 47
List of Tables
4.1 Different Variants of Embedding Formulation . . . . . . . . . . . . . . . . 36
5.1 Dataset Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Comparison of PDSparse with Structured Learning Formulation 4.2 . . . . 38
5.3 Comparison of different variants of Structured learning formulation, Sec-
tion 4.2, with PDSparse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4 PDSparse Vs Version 6: Formulation 4.9 without edge weights . . . . . . . 39
5.5 PDSparse Vs Version 7: Margin Rescaled formulation with recursive con-
straints without edge parameters . . . . . . . . . . . . . . . . . . . . . . . 40
5.6 PDSparse Vs Version 8: Negative slack formulation with recursive con-
straints and edges triggered only when both labels are present. . . . . . . . 40
5.7 PDSparse Vs Version 9: Negative Slack only formulation with margin
rescaling, recursive constraints and GloVe Embeddings Formulation 4.15 . 41
5.8 PDSparse Vs Version 10: Negative Slack only formulation with margin
rescaling, recursive constraints and GloVe Embeddings Formulation 4.16 . 41
Chapter 1
Introduction
1.1 Background
In machine learning and statistics, classification or supervised learning is the problem of assigning a new data point (observation) to a previously known set of categories (classes), on the basis of prior information available in the form of training data containing data points whose categories are known. Classification is broadly divided into three types: binary, multiclass and multi-label, on the basis of the number of classes an observation can belong to and the total number of possible classes. In binary classification there are only two classes and an observation can belong to only one of them at a time. For example, consider the problem of predicting whether a patient is suffering from a particular disease (yes or no) given his medical history, which we refer to as the input features for that patient.
In multiclass classification the total number of classes is more than two, unlike binary classification, but an observation can still belong to only one of them. Consider the problem of identifying the shapes of objects, where the output classes can be circle, triangle, rectangle etc. Here the number of classes is finite and more than two, yet an object can only belong to one of them, as an object can never have more than one shape unless one shape is a generalization of the other, like square and rectangle.
Lastly, multi-label classification is the same as multiclass classification but the restriction of an observation belonging to at most one class is removed: in multi-label learning an observation can belong to more than one class. Consider our previous example of identifying shapes of objects, but say we now want to predict the color of an object instead of its shape. This is a multi-label learning problem as an object can have more than one color.
Considering the above definitions, we define extreme classification as the problem of learning a classifier that can annotate a data point with the most relevant subset of labels from an extremely large label set [8]. Extreme classification is essentially a multi-label learning problem with an extremely large label space (number of classes), say of the order of 10^4 or even in the millions.
1.2 Motivation
Extreme multi-label learning is an important research problem as it has many applications in tagging, recommendation and ranking. Consider the problem of automatically identifying relevant tags/labels for an article or document submitted to Wikipedia. Millions of wiki tags are currently available, which makes the task of selecting a small set of labels substantially difficult, so it falls under the category of extreme multi-label learning. Similarly, one may wish to learn an extreme classifier that can recommend movies or songs to a user out of the millions of movies and songs available online, given his/her past history of likings. Finally, in information retrieval one may wish to rank a large set of documents given a user query, e.g. search results by Google or any other search engine. In general, any problem of recommendation, ranking or tagging can be formulated as a multi-label learning problem, with each item to be ranked/recommended (or each tag to be assigned) treated as a separate label; the classifier, when presented with the input features of an observation, predicts the most relevant set of labels.
Due to such motivating real-life applications, extreme classification is a focus area of many researchers and organizations all over the world. The main difference between traditional multi-label learning and extreme classification is the scalability challenge and large memory footprint caused by extremely large label spaces. The key challenge is the design of scalable algorithms that offer real-time predictions and have a small memory footprint.
1.3 Problem Statement
Let the dimensionality of the input feature vectors be D, the total number of labels be K and the total number of training instances be N. We are given a set of points (x_i, y_i), i = 1, 2, ..., N, where x_i ∈ R^D is the input and y_i ∈ {0, 1}^K encodes the set of relevant labels (a subset of the K possible labels). For any j ∈ [K], y_i^j = 1 indicates the presence of that label and y_i^j = 0 indicates the label is either irrelevant or missing. The task is to predict the top k relevant labels for a given query point x*, when D and K are very large.
1.4 Proposed Solution
In order to address the extreme classification problem we propose various formulations based on recent developments [15]. We classify our formulations into the following categories.

1. Extreme classification posed as a structured learning problem: each configuration of labels is treated as a single structured output/object. The objective is to predict an object closest to the ground-truth configuration.

2. General formulations for structured objects: in this part we present loss-augmented formulations along with different forms of constraints for the optimization problem.

3. Large margin learning in embedding space: we combine the idea of representing the high dimensional label space in a lower dimensional embedding with max-margin learning.
1.5 Road-map of the Report
The organization of the rest of the report is as follows:
• Chapter 2 corresponds to the literature survey, where we present various existing approaches used to address the extreme classification problem. Since extreme classification is similar to the traditional classification problem, many different techniques exist in the fields of ranking, recommendation and classification which can be used in similar settings. However, we only present techniques which cater to the needs of large label space learning.

• Chapter 3 presents a comparative study of the different approaches discussed in the previous chapter based on various criteria, including training and prediction time, accuracy etc., and analyzes the strengths and weaknesses of their models.

• Chapter 4 presents the various formulations we propose in order to improve upon the existing techniques.

• Chapter 5 provides the results of the experiments we conducted on some of the extreme classification datasets. It compares the different formulations with a state-of-the-art technique and analyzes the strengths and limitations of each model.

• Chapter 6 concludes the report by covering the key aspects of the proposed ideas. It also mentions the future work that can be done in the field of extreme multi-label learning to further improve the system.
Chapter 2
Literature Survey
In this chapter we present some of the techniques used by researchers in the past to address the problems of extreme classification. Most of these methods are essentially tree based or embedding based; we will define these categories in the following sections.
2.1 Binary Relevance
In this approach we build a 1-vs-All classifier, which we refer to as binary relevance, for every label, and use these classifiers to predict the most relevant labels for any query point. Essentially these methods treat the prediction of each label as a separate binary classification task, and they typically rank candidates by scoring each label in turn. The approach is as follows:

• Training Phase: learn a label scorer f(x, k) ∀k ∈ [K] which, when presented with a data point x and a label k, returns a real-valued score.

• Prediction Phase: given a new query point x*, compute the scores z_k = f(x*, k) ∀k ∈ [K]. The top k most relevant labels L for the input x* are predicted by sorting these scores and extracting the top k ones.

The models used to produce these scores could be anything from linear SVMs and kernel SVMs to neural networks, decision trees etc. These methods have prediction time linear in the number of labels, as every prediction requires computing a score for each label. Therefore they do not scale when the label space K is very large, which is the typical case in extreme classification.
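The two phases above can be sketched as follows. This is a minimal illustration, not the thesis's own code: ridge regression stands in for the per-label binary classifiers, and the function names (`train_binary_relevance`, `predict_top_k`) are hypothetical.

```python
import numpy as np

def train_binary_relevance(X, Y, lam=1.0):
    """Fit one ridge-regression scorer per label (a stand-in for any
    per-label binary classifier such as a linear SVM).
    X: (N, D) inputs, Y: (N, K) binary label matrix; returns a (D, K)
    weight matrix, one column per label."""
    D = X.shape[1]
    A = X.T @ X + lam * np.eye(D)       # shared Gram matrix for all K labels
    return np.linalg.solve(A, X.T @ Y)  # closed-form ridge solution

def predict_top_k(W, x, k):
    """Score every label, then return the k best: prediction cost is O(K)."""
    scores = x @ W                       # one real-valued score per label
    return np.argsort(-scores)[:k]

# Toy usage: 3 labels, label 0 fires when feature 0 is large.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
Y = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0]])
W = train_binary_relevance(X, Y, lam=0.1)
print(predict_top_k(W, np.array([1.0, 0.0]), k=1))
```

The O(K) scan in `predict_top_k` is exactly the linear-in-labels cost that the following sections try to avoid.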
2.2 Label Partitioning for Sublinear Ranking
As discussed above, these methods become impractical as the number of labels grows to millions because the algorithms are linear in the label space. The paper [14] provides a wrapper method that makes existing methods tractable for extreme multi-label learning while maintaining accuracy. It gives a two-step algorithm that reuses an existing label scorer and makes prediction sublinear by reducing the label space to a subset of the original space for each query point. The algorithm works in the following manner:

• Input Partitioning: the input space is partitioned to create different clusters (partitions).

• Label Assignment: a subset of labels is assigned to each partition.

At prediction time, the following algorithm is followed to predict (rank) the set of relevant labels for a new query point x:

1. Given a query point x, the corresponding set of partitions p is identified by the input partitioner, p = g(x).

2. The label set L_p is identified corresponding to the set of partitions p.

3. Scores are calculated for each label y ∈ L_p with the help of the label scorer and ranked to give the final output.

The cost of ranking at prediction time is considerably reduced as the number of labels |L_p| is much smaller than the total number of labels. Given this overview of how the algorithm works, the following is the exact algorithm provided by the authors for sublinear ranking.
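The three prediction steps above can be sketched as follows. This is an illustrative skeleton under simplifying assumptions, not the paper's implementation: the partitioner g(x) is taken to be nearest-centroid assignment, and `scorer`, `partition_labels` and the linear weights `W` are hypothetical placeholders for a pre-trained label scorer and a precomputed label assignment.

```python
import numpy as np

def lpsr_predict(x, centroids, partition_labels, scorer, k):
    """Sublinear-style prediction: (1) route x to its nearest partition,
    (2) fetch that partition's label subset, (3) score and rank only those.
    `scorer(x, label)` is assumed to be a pre-trained per-label scorer."""
    # Step 1: input partitioner g(x) = index of the nearest centroid.
    p = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
    # Step 2: candidate labels L_p assigned to that partition.
    candidates = partition_labels[p]
    # Step 3: rank only the candidates instead of all K labels.
    scores = {lab: scorer(x, lab) for lab in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy usage with a hypothetical linear scorer w_lab . x over 3 labels.
W = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([0.5, 0.5])}
scorer = lambda x, lab: float(W[lab] @ x)
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
partition_labels = {0: [0, 2], 1: [1, 2]}
print(lpsr_predict(np.array([0.9, 0.1]), centroids, partition_labels, scorer, k=1))
```

Only |L_p| scorer calls are made per query, which is where the sublinear speedup over the 1-vs-All scan comes from.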
Input Partitioning:
It is assumed that the user has already trained a label scorer f(x, L_k) using the binary relevance approach above. There are two guidelines that need to be respected in order to achieve our objective:

• Examples that share highly relevant labels should be mapped to the same partition.

• Examples for which the label scorer performs well should be prioritized while learning the partitioner.

Based on this they propose the following approaches for an input partitioner that assigns each input to the closest partition as defined by the partition centroids c_i, i = 1, ..., P.
Weighted Hierarchical Partitioner: The following is a straightforward optimization problem for prioritizing examples on which the label scorer performs well:

    min Σ_{i=1}^{N} Σ_{j=1}^{P} l(f(x_i), y_i) ||x_i − c_j||²

where f(x) = (f(x, L_1), f(x, L_2), ..., f(x, L_K)) is the hypothesis function that returns the vector of label scores for input x and l(f(x_i), y_i) is the accuracy measure of interest (e.g. precision at k). In practice this can be implemented as a weighted version of hierarchical k-means.
Weighted Embedded Partitioner: The above method only incorporates the prioritization of examples but does not fulfill the first guideline, that examples sharing highly ranked relevant labels are mapped together. One way of encoding this constraint is to optimize the above weighted partitioner in a learnt "embedding" space:

    min Σ_{i=1}^{N} Σ_{j=1}^{P} l(f(x_i), y_i) ||M x_i − c_j||²
Label Assignment :
This section presents the algorithm to assign labels to the partitions created using the input partitioner. We define α ∈ {0, 1}^K, where α_i determines whether label L_i should be assigned to the partition (α_i = 1) or not (α_i = 0). We also define the following:

• R_{t,i}, the rank of label i for example t:

    R_{t,i} = 1 + Σ_{j ≠ i} δ(f(x_t, L_j) > f(x_t, L_i))

• R_{t,y_t}, the rank of the true label for example t.
We start with the base case where we want to optimize precision at 1 and each example has only one relevant label. To achieve our objective two conditions must hold: (1) the true label must be assigned to the partition, and (2) the true label must be ranked highest among all other labels in the same subset. These constraints are incorporated in the following objective function:

    max_α Σ_t α_{y_t} (1 − max_{R_{t,i} < R_{t,y_t}} α_i)   subject to   0 ≤ α_i ≤ 1

and, to restrict the number of labels assigned to a partition, the α_i values are ranked and the top C are taken for the partition.
The above formulation is generalized to precision at k > 1 by replacing the inner max with a function that counts the number of violations above the relevant label:

    max_α Σ_t α_{y_t} (1 − ϕ(Σ_{R_{t,i} < R_{t,y_t}} α_i))   subject to   0 ≤ α_i ≤ 1

where ϕ(r) = 0 if r < k and 1 otherwise, to optimize precision at k.
In order to extend the formulation to more than one relevant label, instead of considering a single label in the optimization problem we maximize the internal term over as many labels of the partition as possible, which is captured in the following:

    max_α Σ_t (1/|y_t|) Σ_{y ∈ y_t} (α_y / w(R_{t,y})) (1 − ϕ(Σ_{R_{t,i} < R_{t,y}} α_i))   subject to   0 ≤ α_i ≤ 1

where ϕ(r) is replaced with a sigmoid, ϕ(r) = 1 / (1 + e^{k − r}), to smooth and relax the optimization problem, and w(R_{t,y}) = R_{t,y}^λ, λ ≥ 0, is a weighting factor governed by the rank of that label for that example, which incorporates the guideline of prioritizing the examples for which the label scorer performs well.
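To make the relaxed objective concrete, the sketch below evaluates it for a given assignment α. This is our own illustrative reading of the formulation, not the paper's code: `phi` is the sigmoid relaxation above, `w(R) = R^λ` is the rank-based weight, and the data structures (`ranks`, `relevant`) are hypothetical.

```python
import numpy as np

def phi(r, k):
    """Sigmoid relaxation of the step function: close to 0 while fewer than
    k higher-ranked labels are assigned, approaching 1 once the
    precision-at-k budget is exceeded."""
    return 1.0 / (1.0 + np.exp(k - r))

def label_assignment_objective(alpha, ranks, relevant, k=1, lam=0.5):
    """Relaxed per-partition objective from the text.
    ranks[t][i]: rank R_{t,i} of label i for example t (1 = best);
    relevant[t]: the set y_t of relevant labels of example t."""
    total = 0.0
    for t, y_t in enumerate(relevant):
        for y in y_t:
            # Softly count assigned labels ranked above the relevant label y.
            viol = sum(alpha[i] for i in range(len(alpha))
                       if ranks[t][i] < ranks[t][y])
            w = ranks[t][y] ** lam   # w(R) = R^lambda down-weights badly ranked labels
            total += (alpha[y] / w) * (1.0 - phi(viol, k)) / len(y_t)
    return total

# Toy usage: 3 labels, one example whose relevant label 0 is ranked first.
alpha = np.array([1.0, 0.0, 1.0])
ranks = [{0: 1, 1: 2, 2: 3}]
print(round(label_assignment_objective(alpha, ranks, [[0]], k=1), 3))
```

Because ϕ is smooth, this objective can be maximized over α ∈ [0, 1]^K with standard gradient methods before rounding to the top C labels.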
The paper provides an innovative algorithm for reducing the prediction time in real-world scenarios by partitioning the label space. However, experimental results show that performance is significantly affected by the choice of input partitioner, which therefore remains an area of interest for further development.
2.3 Low Rank Empirical Risk Minimization
The above method of input partitioning is a novel approach; however, it still requires long training times (even longer than binary relevance) and its performance critically depends on the choice of partitioning and label assignment, which may not be optimal in many cases. Moreover, the method cannot handle missing labels, which are common in most real-world situations. For example, consider the task of assigning wiki tags to an article. The tags in the training data are manually assigned by individuals, and it is impossible for an individual to assign all the relevant tags out of millions of wiki tags. Therefore the training data may well lack some labels which are actually relevant for an article; we call these missing labels.
The paper [16] addresses the above issues of multi-label classification in real applications. The extreme classification problem is mostly addressed by either tree based or embedding based methods; this paper uses the latter approach. Embedding based methods project the high dimensional label space into a low dimensional (embedded) space, learn regressors over the embedded space, and at prediction time map the predicted embeddings back into the original label space. They amount to learning a low rank linear model Z ∈ R^{D×K} such that y_pred = Z^T x, which can be cast into a standard empirical risk minimization (ERM) framework to allow the use of various loss functions and regularizers. The ERM framework is an abstraction based on the principle that learning algorithms should choose a hypothesis which minimizes the empirical risk.
The motivation for this framework comes from the fact that although the label space is very large, there exists significant correlation among the labels, which allows them to be modeled with a low rank constraint. The algorithm works as follows. The hypothesis function, parametrized by Z ∈ R^{D×K}, is defined as f(x; Z) = Z^T x. We also define a loss function l(y, f(x; Z)) which is assumed to be decomposable over labels, i.e.

    l(y, f(x; Z)) = Σ_{j=1}^{K} l(y^j, f^j(x; Z)).

Using these functions the optimization problem can be written as

    Z* = argmin_Z J_Ω(Z) = Σ_{(i,j)∈Ω} l(Y_ij, f^j(x_i; Z)) + λ · r(Z)   s.t.   rank(Z) ≤ k

where r(Z) : R^{D×K} → R is a regularizer and Ω ⊆ [N] × [K] represents the index set of "known" labels. The standard setting is assumed, where Y_ij = 1 or 0 for a present or absent label and Y_ij = ? for a missing one. They show that the above formulation can be solved using an alternating minimization technique and even has a closed form solution in the case of the L2 loss.
Algorithm :
Algorithm:
Since Z is assumed to be a low rank matrix, we can decompose it as Z = W H^T, where W ∈ R^{D×k} and H ∈ R^{K×k}. It is further assumed that the regularizer also decomposes as r(Z) = r_1(W) + r_2(H). The above formulation can then be written in terms of W and H as

    J_Ω(W, H) = Σ_{(i,j)∈Ω} l(Y_ij, x_i^T W h_j) + (λ/2)(||W||_F² + ||H||_F²)

where h_j^T is the j-th row of H. Here trace norm regularization is used for Z, which is equivalent to the sum of squared Frobenius norms of W and H. If we hold either W or H constant, the formulation becomes a convex function, which permits the use of an alternating minimization technique that is guaranteed to converge to a stationary point when both min_H J_Ω(W^{(t−1)}, H) and min_W J_Ω(W, H^{(t)}) are uniquely defined.
Once W is fixed, each h_j can be updated independently as follows:

    h_j ← argmin_{h ∈ R^k} Σ_{i:(i,j)∈Ω} l(Y_ij, x_i^T W h) + (λ/2)||h||₂²

which amounts to solving a regression problem over k variables. When H is fixed, W can be updated as follows: if W* = argmin_W J_Ω(W, H) and we denote w* = vec(W*), then w* = argmin_{w ∈ R^{dk}} g(w), where

    g(w) = Σ_{(i,j)∈Ω} l(Y_ij, w^T x_ij) + (λ/2)||w||₂²

and x_ij = h_j ⊗ x_i, with ⊗ denoting the Kronecker product. Taking squared loss as an example, the above is equivalent to a regularized least squares problem with dk variables, whose closed form solution becomes infeasible when d is large. Therefore iterative methods such as conjugate gradient (CG) are used, which require computing the gradient and multiplying a vector by the Hessian matrix. The authors provide efficient methods for computing both, offering a speedup of O(d̄) over direct computation, where d̄ is the average number of non-zero features per instance.
The paper presents a novel approach that formulates the multi-label learning problem with missing labels in a standard ERM framework, with rank constraints and regularizers to increase flexibility and efficiency. It also provides algorithms based on alternating minimization to efficiently solve such non-convex formulations. However, the method works only for decomposable loss functions and therefore requires further work to incorporate non-decomposable ones.
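The alternating scheme above can be sketched for the simplest case. This is a minimal illustration under stated assumptions, not the paper's scalable algorithm: it uses squared loss, assumes all labels are observed (Ω = [N] × [K]), and solves the small W-step directly via a Kronecker system instead of conjugate gradient; the name `leml_als` is ours.

```python
import numpy as np

def leml_als(X, Y, k, lam=0.1, iters=20):
    """Alternating least squares for the low-rank model Z = W H^T with
    squared loss and fully observed labels.
    X: (N, D), Y: (N, K); returns W (D, k) and H (K, k)."""
    D, K = X.shape[1], Y.shape[1]
    H = np.random.default_rng(0).standard_normal((K, k))
    for _ in range(iters):
        # Fix H, solve min_W ||X W H^T - Y||_F^2 + lam ||W||_F^2.
        # The stationarity condition (X^T X) W (H^T H) + lam W = X^T Y H
        # vectorizes (column-major) to a Kronecker linear system.
        A = np.kron(H.T @ H, X.T @ X) + lam * np.eye(D * k)
        b = (X.T @ Y @ H).reshape(-1, order="F")
        W = np.linalg.solve(A, b).reshape(D, k, order="F")
        # Fix W: each row h_j of H is an independent k-variable ridge regression.
        P = X @ W                                   # (N, k) projected inputs
        H = np.linalg.solve(P.T @ P + lam * np.eye(k), P.T @ Y).T
    return W, H

# Usage sketch: recover a planted rank-2 model from noiseless data.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
Y = X @ rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
W, H = leml_als(X, Y, k=2, lam=1e-3, iters=30)
```

With missing labels, the sums in both subproblems would run only over the observed index set Ω, which is what breaks the closed form and motivates the CG-based solver in the paper.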
2.4 FastXML: A Fast, Accurate and Stable Tree-classifier
for eXtreme Multi-label Learning
As explained before, extreme classification can be addressed using either tree based or embedding based methods. The paper [11] presents an algorithm to build a tree based classifier, referred to as FastXML, which is more accurate and faster to train than all the previously discussed techniques.

Tree based methods are often seen to beat 1-vs-All baseline systems in prediction accuracy at a fraction of the prediction cost. The algorithm presented in [14] can take even longer than 1-vs-All methods due to the additional cost of partitioning and label assignment. FastXML, like any other tree based algorithm, aims to learn a tree-like structure/hierarchy over the data in order to reduce the number of items (labels) in the leaf nodes that are assigned to each instance during training/testing.
Training :
FastXML learns a hierarchy over the feature space instead of the label space, as opposed to the multiclass setting. The key idea is that a ranking based partitioning function is used to split each node. Like any tree learning algorithm, FastXML recursively partitions a parent's feature space between its children. Existing approaches use local measures such as the Gini index or entropy, which depend solely on predictions at the node being partitioned, to decide the split. To increase the overall performance of the algorithm, node partitioning should ideally be done using global measures, which require the partitioning to be learnt jointly over all nodes; unfortunately, optimizing such a global measure can be very expensive, which is the main reason existing approaches optimize locally. Local optimization allows the hierarchy to be learnt node by node, starting from the root and going down to the leaves, and is more efficient than learning all the nodes jointly. FastXML learns the hierarchy by directly optimizing a ranking loss function; in particular, it optimizes the normalized Discounted Cumulative Gain (nDCG) [13].
Let i_1^desc, ..., i_K^desc be the permutation indices that sort a real-valued vector y ∈ R^K in descending order, i.e. if j > k then y_{i_j^desc} ≤ y_{i_k^desc}. The rank_k(y) operator, which returns the indices of the k largest elements of y ranked in descending order (with ties broken randomly), can then be defined as rank_k(y) = [i_1^desc, ..., i_k^desc]^T.

Let Π(1, K) denote the set of all permutations of {1, 2, ..., K}. The Discounted Cumulative Gain (DCG) at k of a ranking r ∈ Π(1, K), given a ground truth label vector y with binary levels of relevance, is

    L_{DCG@k}(r, y) = Σ_{l=1}^{k} y_{r_l} / log(1 + l)

DCG is sensitive to both the ranking and the relevance of predictions, unlike precision and other measures. The normalized DCG is defined by

    L_{nDCG@k}(r, y) = I_k(y) Σ_{l=1}^{k} y_{r_l} / log(1 + l),   I_k(y) = 1 / Σ_{l=1}^{min(k, 1^T y)} 1 / log(1 + l)

The value of nDCG lies between 0 and 1, allowing it to be used to compare rankings across label vectors with different numbers of positive labels.
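The two quantities above can be computed directly from their definitions. This is a small illustrative sketch (function names are ours) using the log(1 + l) discount and the binary-relevance normalizer I_k(y) from the text.

```python
import numpy as np

def dcg_at_k(ranking, y, k):
    """DCG@k of a predicted ranking (label indices, best first) against a
    binary relevance vector y, with the log(1 + l) positional discount."""
    return sum(y[ranking[l - 1]] / np.log(1 + l) for l in range(1, k + 1))

def ndcg_at_k(ranking, y, k):
    """Normalize by the best achievable DCG, so scores are comparable
    across label vectors with different numbers of positive labels."""
    ideal = sum(1.0 / np.log(1 + l)
                for l in range(1, min(k, int(sum(y))) + 1))
    return dcg_at_k(ranking, y, k) / ideal if ideal > 0 else 0.0

# A ranking that puts the single positive label first scores exactly 1.0.
y = [0, 1, 0, 0]
print(ndcg_at_k([1, 0, 2, 3], y, k=3))
```

Note that swapping the positive label down the ranking reduces the score smoothly, which is what makes nDCG a ranking-sensitive split criterion.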
FastXML partitions the current node's feature space by learning a linear separator w via

    min ||w||₁ + Σ_i C_δ log(1 + e^{−δ_i w^T x_i}) − C_r Σ_i ½(1 + δ_i) L_{nDCG@L}(r⁺, y_i) − C_r Σ_i ½(1 − δ_i) L_{nDCG@L}(r⁻, y_i)

    w.r.t.  w ∈ R^D,  δ_i ∈ {−1, +1},  r⁺, r⁻ ∈ Π(1, L)

where i indexes all the training points present at the node being partitioned, δ_i ∈ {−1, +1} indicates whether point i is assigned to the negative or positive partition, and r⁺ and r⁻ represent the predicted label rankings for the positive and negative partitions respectively. C_δ and C_r are user-defined parameters which determine the relative importance of the three terms. FastXML uses an alternating minimization technique to learn this separator.
Prediction:
Rather than a single tree partitioned with the above problem, FastXML learns an ensemble of trees for accurate and stable predictions. Given a novel point x ∈ R^D, FastXML's top k ranked predictions are given by

    r(x) = rank_k( (1/T) Σ_{t=1}^{T} P_t^leaf(x) )

where T is the number of trees in the FastXML ensemble, P_t^leaf(x) ∝ Σ_{i ∈ S_t^leaf(x)} y_i, and P_t^leaf(x) and S_t^leaf(x) are respectively the label distribution and the set of training points in the leaf node reached by x in tree t.
FastXML thus learns an ensemble of trees with prediction costs that are logarithmic in the number of labels. Its key technical contribution is a novel node partitioning formulation which optimizes an nDCG based ranking loss over all the labels. Such a loss was found to be more suitable for extreme multi-label learning than the Gini index optimized by MLRF [1] or the clustering error optimized by LPSR [14]. nDCG is known to be a hard loss to optimize using gradient descent based techniques; FastXML therefore develops an efficient alternating minimization algorithm for its optimization.
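The ensemble prediction rule r(x) can be sketched as follows. This is an illustrative toy, not FastXML's implementation: trees are hypothetical nested dicts with already-learnt separators `w` at internal nodes and leaf label distributions `P`, and the routing rule (left if w·x ≤ 0) is an assumption.

```python
import numpy as np

def fastxml_predict(x, trees, k):
    """Descend each tree to x's leaf, average the leaf label distributions
    over the ensemble, and return the top-k labels, mirroring
    r(x) = rank_k((1/T) sum_t P_t^leaf(x))."""
    total = None
    for tree in trees:
        node = tree
        while "P" not in node:                  # internal node: apply separator w
            node = node["left"] if node["w"] @ x <= 0 else node["right"]
        total = node["P"].copy() if total is None else total + node["P"]
    return np.argsort(-total / len(trees))[:k]  # rank_k of the averaged distribution

# Toy usage: two depth-1 trees over 3 labels.
t1 = {"w": np.array([1.0, -1.0]),
      "left":  {"P": np.array([0.8, 0.1, 0.1])},
      "right": {"P": np.array([0.1, 0.8, 0.1])}}
t2 = {"w": np.array([1.0, 0.0]),
      "left":  {"P": np.array([0.7, 0.2, 0.1])},
      "right": {"P": np.array([0.0, 0.5, 0.5])}}
print(fastxml_predict(np.array([-1.0, 0.0]), [t1, t2], k=1))
```

Each query touches one root-to-leaf path per tree, which is what makes prediction cost logarithmic in the number of training points rather than linear in the number of labels.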
2.5 Sparse Local Embeddings for Extreme Classification
Embedding based methods are a popular choice for addressing large scale multi-label classification. They address the problem of a large number of labels by projecting label vectors into a low dimensional linear subspace, based on the assumption that there exist significant correlations among the labels which allow the label matrix to be modeled by a low rank matrix. Despite their computational benefits, embedding based methods still suffer from scalability issues and are unable to deliver high accuracies in most real-world scenarios. The reason is that the underlying low rank assumption is violated by the presence of "tail" labels: in most scenarios a large number of labels occur in very few training instances. The paper [3] provides an algorithm, SLEEC, that addresses these issues.
Embedding based approaches: these approaches work in the way discussed above. They transform the original high dimensional label vectors into a low dimensional linear subspace (embeddings). A model is then trained to predict these embeddings instead of the original labels, and predicted embeddings are later transformed back into the original label space. Mathematically, given training points (x_i, y_i), i = 1, ..., n, each label vector is compressed as z_i = U y_i, and the embeddings are then learnt as a function of x_i as z_i = V x_i. Finally, the labels for a given point are predicted by the post-processing y = U* V x, where U* is the decompression matrix. These approaches are slow at training and prediction time even for small embedding dimensions.
SLEEC Approach: SLEEC is based on the observation that low rank linear modeling of
the label space Y is violated in most real world situations, but that Y can still be accurately
predicted using a low dimensional non-linear manifold. SLEEC is essentially an embedding
based approach, but differs from traditional approaches in the following ways. Firstly,
instead of a global projection, it learns embeddings z_i only over sets of points which are
nearest neighbors of each other. Secondly, at prediction time, instead of using a
decompression technique it uses a kNN classifier in the embedding space. kNN helps in
tackling the issue of tail labels, as it outperforms discriminative methods when training
instances are few, which is exactly the case for "tail" labels.
Clustering based speedup: kNN classifiers are lazy learners and are therefore slow at
prediction time. In order to reduce prediction time, SLEEC first clusters the data over
the input feature space and then learns separate embeddings for each cluster. SLEEC
tackles the instability of clustering, owing to the curse of dimensionality, by learning an
ensemble of learners over different sets of clusters.
The SLEEC algorithm works by first learning an embedding z = V x for a query point
x, then using a kNN classifier over the embedding space S = {V x_1, V x_2, ..., V x_N} to
find the most relevant labels for that point. The next section presents the algorithms
used to learn these embeddings and the use of clustering to make the algorithm scalable
over large datasets.
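A minimal sketch of this embed-then-kNN prediction step, assuming the regressor V has already been learned (all data below is synthetic, and aggregating neighbour labels by a simple vote is a simplification of SLEEC's actual scoring):

```python
import numpy as np

def knn_predict(V, X_train, Y_train, x_query, k=3, top=5):
    """Predict labels for x_query via kNN in the learned embedding space.

    V: D x d embedding regressor, X_train: n x d, Y_train: n x K (0/1).
    Returns indices of the `top` highest-voted labels (illustrative)."""
    Z_train = X_train @ V.T                 # n x D training embeddings
    z = V @ x_query                         # D-dim query embedding
    dists = np.linalg.norm(Z_train - z, axis=1)
    nn = np.argsort(dists)[:k]              # k nearest neighbours
    label_scores = Y_train[nn].sum(axis=0)  # vote over neighbour labels
    return np.argsort(-label_scores)[:top]

# Tiny synthetic check (all quantities here are made up for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))
Y = (rng.random(size=(30, 40)) < 0.1).astype(int)
V = rng.normal(size=(4, 10))
print(knn_predict(V, X, Y, X[0]))
```

The clustering-based speedup then amounts to routing the query to one cluster and running this routine against that cluster's points only.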
Learning Sparse Embeddings:
SLEEC is based on the assumption that Y cannot be modeled using a low rank subspace;
however, Y can still be accurately modeled using a low-dimensional non-linear manifold.
That is, instead of preserving the distances (or inner products) of a given label vector to
all the training points, it attempts to preserve the distances to only a few nearest
neighbors. The optimization problem is to find a D-dimensional embedding matrix
Z = [z_1, ..., z_N] ∈ R^{D×N} which minimizes

\min_{Z \in \mathbb{R}^{D \times N}} \; \|P_\Omega(Y^T Y) - P_\Omega(Z^T Z)\|_F^2 + \lambda \|Z\|_1
where Ω denotes the set of neighbor pairs to preserve, i.e. (i, j) ∈ Ω iff j ∈ N_i, where N_i
denotes the nearest neighbors of i. These are the points whose label vectors have the
largest inner product with y_i, and (P_Ω(Y^T Y))_{ij} = ⟨y_i, y_j⟩ if (i, j) ∈ Ω and 0
otherwise. The next step is to learn these embeddings from the input features, i.e.
Z = V X where V ∈ R^{D×D}. Combining this with the above formulation and adding L2
regularization for V, the final optimization problem is

\min_{V} \; \|P_\Omega(Y^T Y) - P_\Omega(X^T V^T V X)\|_F^2 + \lambda \|V X\|_1 + \mu \|V\|_F^2
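The masked objective above can be evaluated directly. The following sketch builds the neighbourhood mask P_Ω from label inner products and computes the objective value; the neighbourhood size, the random data, and the dense implementation are illustrative assumptions:

```python
import numpy as np

def sleec_objective(V, X, Y, n_nbrs=3, lam=0.1):
    """Evaluate ||P_Omega(Y^T Y) - P_Omega((VX)^T VX)||_F^2 + lam*||VX||_1.

    Columns of X (d x N) are feature vectors; columns of Y (K x N) are
    label vectors. Omega keeps, for each point, its n_nbrs largest label
    inner products (a sketch of the nearest-neighbour masking)."""
    G = Y.T @ Y                        # N x N label-vector inner products
    N = G.shape[0]
    mask = np.zeros_like(G, dtype=bool)
    for i in range(N):
        mask[i, np.argsort(-G[i])[:n_nbrs]] = True
    Z = V @ X                          # D x N embeddings
    diff = np.where(mask, G - Z.T @ Z, 0.0)   # apply P_Omega to the residual
    return float(np.sum(diff ** 2) + lam * np.abs(Z).sum())

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 20))
Y = (rng.random(size=(15, 20)) < 0.2).astype(float)
V = 0.01 * rng.normal(size=(4, 8))
print(sleec_objective(V, X, Y))
```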
SLEEC uses singular value projection (SVP) for learning the embeddings and the
alternating direction method of multipliers (ADMM) for learning the regressors for these
embeddings. SVP is a simple projected gradient descent method where the projection is
onto the set of low-rank matrices [9]. ADMM is an algorithm that solves convex
optimization problems by breaking them into smaller pieces, each of which is then easier
to handle [4].
SLEEC is a novel technique that combines clustering with embeddings to overcome the
limitations of traditional embedding based approaches. It has better scaling properties
than all other embedding methods and achieves higher accuracies even in the presence of
"tail" labels. SLEEC is a state-of-the-art algorithm that outperforms the leading tree
based method, FastXML, and all other embedding based approaches in terms of accuracy.
2.6 PDSparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification
In the extreme classification setting we have discussed two popular approaches, tree
based and embedding based. This paper [15] presents an algorithm that utilizes the
sparsity present in the primal and dual formulations of the problem under a margin
maximizing loss function. The authors propose a Fully-Corrective Block-Coordinate
Frank-Wolfe (FC-BCFW) algorithm that exploits both primal and dual sparsity to
achieve a complexity sublinear in the number of primal and dual variables.
In their work, instead of making structural assumptions on the relations between labels,
they assume that for each instance there are only a few correct labels and that the feature
space is rich enough to distinguish between labels. Note that this assumption is much
weaker than other structural assumptions. Under this assumption, it was shown that a
simple margin-maximizing loss yields an extremely sparse dual solution in the extreme
classification setting; furthermore, this loss, when combined with an l1 penalty, gives
solutions that are sparse both in the primal and in the dual for any l1 parameter λ > 0.
They propose a Fully-Corrective Block-Coordinate Frank-Wolfe algorithm to solve the
primal-dual sparse problem given by the margin-maximizing loss with l1-l2 penalties.
Let D be the problem dimension, N the number of samples, and K the number of classes.
When DK ≫ N, the proposed algorithm has complexity sublinear in the number of
variables, exploiting sparsity in the primal to search for active variables in the dual.
When DK ≤ N, they propose a stochastic approximation method to further speed up
the search step of the Frank-Wolfe algorithm.
Problem Formulation:
Let P(y) = {k ∈ [K] | y_k = 1} denote the positive (relevant) label indexes, and
N(y) = {k ∈ [K] | y_k = 0} denote the negative (irrelevant) label indexes. The only
assumption made is that nnz(y) is very small compared to K and does not grow linearly
with K. Let W =
(w_k)_{k=1}^{K} be a D × K matrix that represents the parameters of the model.
Loss with Dual Sparsity: In their work, they use the following separation ranking loss,
which penalizes the prediction on an instance x by the highest response among the
negative (irrelevant) labels minus the lowest response among the positive labels:

L(z, y) = \max_{k_n \in N(y)} \max_{k_p \in P(y)} (1 + z_{k_n} - z_{k_p})_+
where z_k = ⟨x, w_k⟩ denotes the response of the kth label for a training instance x. The
loss is zero if all positive labels have higher responses than all negative labels. The
motivation behind this loss function is that only a few labels have high responses, hence
accuracy can be boosted by learning to distinguish between those few confusing labels.
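The double maximization in the separation ranking loss is attained at the highest-scoring negative label and the lowest-scoring positive label, so it can be computed directly; a small sketch:

```python
import numpy as np

def separation_ranking_loss(z, y):
    """L(z, y) = max over k_n in N(y), k_p in P(y) of (1 + z_{k_n} - z_{k_p})_+.

    z: real-valued responses z_k = <x, w_k>; y: 0/1 label vector."""
    pos, neg = z[y == 1], z[y == 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    # The inner max is attained at the highest negative and lowest positive.
    return max(0.0, 1.0 + float(neg.max()) - float(pos.min()))

z = np.array([2.0, 0.5, -1.0, 3.0])
y = np.array([1, 0, 0, 1])
# Highest negative = 0.5, lowest positive = 2.0, so (1 + 0.5 - 2.0)_+ = 0.
print(separation_ranking_loss(z, y))
```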
Primal and dual sparse formulation: They formulate the extreme classification problem
as minimizing

\min_{W} \; \lambda \sum_{k=1}^{K} \|w_k\|_1 + \sum_{i=1}^{N} L(W^T x_i, y_i)
The optimal solution of this formulation satisfies λρ*_k + X^T α*_k = 0 for all k ∈ [K],
for some subgradients ρ*_k ∈ ∂‖w*_k‖_1 and α*_i ∈ ∂_z L(z_i, y_i) with z_i = W*^T x_i.
The subgradient α*_i satisfies α*_{ik*} ≠ 0 only if k* is a confusing label, i.e. one that
attains the maximum competing response

k^* = \arg\max_{k} \; \langle w_k, x_i \rangle

which means nnz(α_i) ≪ K and nnz(A) ≪ NK. They further show that the optimal
primal and dual solutions (W*, A*) of the above ERM problem with the separation loss
satisfy nnz(W*) ≤ nnz(A*) for any λ > 0, provided the design matrix X is drawn from
a continuous probability distribution.
The paper presents a novel algorithm that does not use any structural assumptions, and
whose training time, prediction time, and model size grow sublinearly with the label
space while maintaining competitive accuracy. The extreme classification problem,
whether handled by tree based or embedding based methods, is basically solved by
performing some grouping over the labels to reduce the overall dimensionality. In their
work, however, they introduce a totally different approach that performs no such
grouping, instead utilizing sparsity to achieve performance comparable to state-of-the-art
techniques, and actually outperforms all of
them in terms of training and prediction time. This is achieved purely by exploiting the
sparsity present in both the primal and dual formulations of the extreme classification
problem with a margin maximizing loss function.
Chapter 3
Analysis of Existing Approaches
In this chapter we present a comparative study of the approaches discussed above, based
on different criteria including training and prediction time, accuracy, etc., and analyse
the strengths and weaknesses of these models.
3.1 Assumptions
In this part we present some of the critical assumptions made by the authors in their
work in order for the algorithms to work successfully.
1. LPSR: It is assumed that the user has pre-trained label scorers for each label.
2. LEML: The underlying assumption is that the label space can be modelled using a low
rank linear model, as there exist significant label correlations.
3. FastXML: The only assumption made is that the label vector y is log K sparse, i.e.
there are only a few positive labels for each observation.
4. SLEEC: It assumes that the low rank constraint is easily violated in real life situations
due to the presence of "tail" labels; however, it is possible to model the label space
using a low dimensional non-linear manifold.
5. PDSparse: The only assumption made is that, although the label space is very large,
there exist only a few positive labels for each observation.
3.2 Scalability and Real World Usage
In real world scenarios, prediction time should be at least sublinear in the number of
labels, and algorithms should scale efficiently to large label spaces. Based on this, we
classify below whether each algorithm is suitable for real world usage.
1. LPSR: Prediction time is sublinear with respect to the label space, which makes it
suitable for real-life applications; however, training time is even longer than the Binary
Relevance approach.
2. LEML: Provides efficient methods for gradient and Hessian matrix computation that
make it efficient enough to handle large scale problems. Prediction time is still linear
in the label space.
3. FastXML: It uses an ensemble of trees, which allows it to scale over large datasets.
The prediction time is also logarithmic in the label space, making it suitable for most
real world problems.
4. SLEEC: For handling large scale datasets, it uses clustering and learns a model
(embeddings) over each individual cluster. To guard against instability in clustering
due to the curse of dimensionality, it uses an ensemble of models trained over different
sets of clusters.
5. PDSparse: Due to its sparse problem formulation, it outperforms all of the above
methods in terms of training and prediction time. The algorithm is well suited for
real world usage.
3.3 Advantages
In this part we present some of the advantages offered by these algorithms over the
others. They can be advantageous either from a research point of view or for real world
usage.
1. LPSR: A generic method, independent of the choice of classifier or loss function used
to train the label scorers. Prediction time is sublinear, making it suitable for real
world scenarios.
2. LEML: The problem formulation in the standard ERM framework allows it to use
different loss functions and regularizers without changing the algorithm. It is also
designed to handle missing labels in the training set.
3. FastXML: Logarithmic time predictions and no structural assumptions about the
data.
4. SLEEC: Achieves the highest prediction accuracies across all methods and has better
scaling properties than all other embedding based methods.
5. PDSparse: The lowest training and prediction times of all these methods, with
comparable accuracies and no structural assumptions about the data.
3.4 Limitations
Lastly, we present some of the limitations of these algorithms.
1. LPSR: The performance of the algorithm depends significantly upon the choice of
partitioning algorithm, which may or may not result in an optimal label assignment,
and therefore remains an area for further research.
2. LEML: Prediction time is linear in the label space, and the underlying assumption of
low rank linear modelling of the label space is violated in many real-life scenarios.
3. FastXML: Prediction accuracy is low for tail labels, which constitute a significant
part of the total label space.
4. SLEEC: High training and prediction times compared to tree based and sparsity
based approaches. The algorithm's performance depends significantly upon
hyperparameters, which are sometimes difficult to tune. "Tail labels" still remain an
issue of concern.
5. PDSparse: Highly accurate in predicting the top label for a query point; however,
the accuracy of the top k predictions decreases significantly as k increases. "Tail
labels" also still remain an issue of concern.
Chapter 4
Proposed Formulations
In this chapter we present some formulations based on the recent PDSparse [15]
technique that try to capture and utilize the hidden structural information present in
the labels. All of these formulations use the same framework as [15], with some
modifications.
4.1 Overview
In section 2 we discussed the various approaches used in the past, along with their
advantages and limitations in section 3. We saw that there are broadly two classes of
approaches: structural and OneVsAll. Structural approaches are basically embedding
based (low rank) or tree based approaches which make certain assumptions about the
label space. These approaches provide good accuracies when the assumptions hold, but
we have seen that most of these assumptions get violated in real life scenarios. OneVsAll
techniques like PDSparse [15], DiSMEC [2], etc., as the name suggests, make no such
assumptions and treat every label independently. They have been able to outperform all
other structural techniques, which motivates our work. In this work we try to combine
the idea of label correlations (hidden structure between labels) being present in real-life
scenarios with the OneVsAll approaches, mainly PDSparse. All our formulations are
based on the following large margin structured learning formulation [12]:
\min_{w} \; \frac{1}{2}\|w\|_2^2 + \lambda \|w\|_1 + C \sum_{i=1}^{m} \xi_i
\text{s.t.} \;\; \langle w, \delta\psi_i(x_i, y) \rangle \ge 1 - \xi_i \quad \forall i \in [m] \qquad (4.1)
\xi \succeq 0
\text{where } \delta\psi_i(x_i, y) = \psi(x_i, y_i) - \psi(x_i, y)

with the general form of our hypothesis function given by f(x; w) = \arg\max_y \langle w, \psi(x, y) \rangle.
Basically, in our work we present different formulations that differ from each other in
terms of the feature maps ψ(x, y) and the way the constraints are formulated. As
mentioned previously, all these formulations use the following framework from [15], [12]
for training.

Algorithm 1: Cutting Plane Algorithm for Training
1. Initialize the active set A_i for all i in 1...m
2. while iter ≤ max_iter do
3.   for i in 1...m do
4.     Find the most violating constraint: max_y ⟨w, −δψ_i(x_i, y)⟩ and add it to the active set A_i
5.     Minimize the objective function w.r.t. the new set of constraints A_i
6.   end
7. end
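The control flow above can be sketched as follows. The two subroutines are passed in as parameters because they are problem specific; the dummy implementations in the demo only exercise the loop and are not a real learner:

```python
import numpy as np

def cutting_plane_train(X, Y, find_most_violating, minimize_over, max_iter=10):
    """Skeleton of the cutting-plane loop of Algorithm 1 (sketch).

    find_most_violating(w, x_i, y_i, A_i) -> a label configuration
    minimize_over(w, active_sets)         -> updated weight vector
    Both subroutines are problem-specific and supplied by the caller."""
    m = len(X)
    active = [set() for _ in range(m)]       # active constraint sets A_i
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        for i in range(m):
            y_viol = find_most_violating(w, X[i], Y[i], active[i])
            active[i].add(y_viol)            # add the most violating constraint
            w = minimize_over(w, active)     # re-optimise over all constraints
    return w, active

# Dummy subroutines just to exercise the control flow (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(4, 6))
Y = [frozenset({0}), frozenset({1}), frozenset({0, 2}), frozenset({2})]
mv = lambda w, x, y, A: frozenset({int(np.argmax(x))})
opt = lambda w, active: w                    # no-op optimiser for the skeleton
w, active = cutting_plane_train(X, Y, mv, opt, max_iter=2)
print(len(active))
```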
The above algorithm consists of two main subroutines, namely finding the maximum
violator and a constrained optimizer. In the next few sections we present different
feature maps and learning algorithms for each case. We will use the following notation in
all the upcoming sections:
• x_i ∈ R^D: the input feature vectors,
• y_i ∈ {0, 1}^K: the correct ground truth configuration,
• ȳ_i ∈ {0, 1}^K: a subset of the ground truth labels,
• ŷ_i ∈ {0, 1}^K: a possible combination of labels which is not a subset of the ground
truth labels y_i.
4.2 Extreme Classification as Structured Prediction
The core idea here is to pose extreme classification as a structured learning problem
where the objective is to predict the structured object closest to the available ground
truth. In other words, we extend the label space from K labels to its powerset and try to
learn a classifier that directly predicts a set from the powerset rather than predicting
individual labels. We propose the following constraints, with the same objective function
defined in eq. 4.1, to achieve this.
\sum_{j=1}^{K} y_{ij}\, w_j^T x_i + \sum_{y_{ip} = y_{iq}} w_{pq} \;-\; \max_{\hat{y} \neq y_i} \Big[ \sum_{j=1}^{K} \hat{y}_j\, w_j^T x_i + \sum_{\hat{y}_p = \hat{y}_q} w_{pq} \Big] \ge 1 - \xi_i
w_{pq} \ge 0, \quad \forall p, q \in \{1, ..., K\} \qquad (4.2)
In words, we require that the total score of the ground truth configuration y_i be greater
than that of any other possible combination of labels ŷ. The score of any configuration is
composed of two parts: node potentials/scores and edge potentials/scores. The node
potential w_k^T x is similar to the score learned by a OneVsAll classifier, where the
weights w_k ∈ R^D are separate for each label and can be learned independently.
The motive for adding the edge potential E_pq is to capture the relationship between
co-occurring labels p and q. In this formulation, however, we add the edge potential to
the overall score of a configuration in either case, when both labels are present or both
are absent, with an additional constraint that the learned edge weights be non-negative.
The reason for these conditions on the edge potentials is to make the problem of finding
the most violating constraint tractable (solvable in polynomial time).
The problem of finding the most violating constraint/label is:

\arg\max_{\hat{y}} \; \Delta(y_i, \hat{y}) + \sum_{j=1}^{K} \hat{y}_j\, w_j^T x_i + \sum_{\hat{y}_p = \hat{y}_q} w_{pq} \qquad (4.3)

Note that here we use a loss Δ(y, ŷ) instead of the constant margin of the previous case.
When any decomposable loss function is used, this problem is equivalent to the following
Potts model energy minimization problem [5]:
E(f) = \min_{f} \sum_{p \in P} D(f_p) + \sum_{p, q \in N} u_{pq}\, T(f_p \neq f_q) \qquad (4.4)

On comparing eq. 4.3 and eq. 4.4 it can be seen that

D(f_p) \sim w_j^T x_i\, T(y_j = 1) + \Delta_j(y_i, \hat{y}) \quad \text{and} \quad u_{pq} \sim w_{pq}

where T(·) is an indicator function with T(True) = 1 and zero otherwise, and Δ_j(y_i, ŷ)
is the similarity (negative loss) associated with a difference in the jth label between the
two configurations. The Potts model energy minimization problem can be solved using
graph cuts, specifically the multi-way cut [5]. In our case it is even simpler: since the
assignments are binary, the multi-way cut reduces to the s-t cut problem, for which
polynomial time algorithms are known. The graph construction is the same as described
in [5] for Potts model energy minimization.
The next part is to solve the minimization problem with the new set of constraints. For
this purpose we use exactly the same fully corrective block coordinate Frank-Wolfe
algorithm used in PDSparse. The dual optimization problem for our formulation is as
follows:
\min_{\alpha} \; \frac{1}{2}\|w(\alpha)\|_2^2 + \sum_{i=1}^{m} \sum_{\hat{y}} \alpha_i^{\hat{y}}
\text{s.t.} \;\; \alpha_i \in C_i, \quad \forall i \in [m]
w(\alpha) = \mathrm{prox}\Big( \sum_{i=1}^{m} \sum_{\hat{y}} \alpha_i^{\hat{y}}\, \Psi(x_i, \hat{y}), \lambda \Big) \qquad (4.5)

where C_i = \{ \alpha \mid \sum_{y \in P_i} \alpha_i^{y} = -\sum_{\hat{y} \in N_i} \alpha_i^{\hat{y}} \in [0, C],\; \alpha_i^{y} \ge 0 \text{ for } y \in P_i,\; \alpha_i^{\hat{y}} \le 0 \text{ for } \hat{y} \in N_i \},
w = [w_1, w_2, ..., w_K, w_{00}, w_{01}, ..., w_{KK}]^T \text{ with } w_i \in \mathbb{R}^D, w_{pq} \in \mathbb{R}
\Psi(x, y) = [\psi(x, y), \phi(x, y)]^T \in \mathbb{R}^{KD + K^2}
\phi(x, y) = \neg[y_0 \oplus y_0, y_0 \oplus y_1, ..., y_K \oplus y_K]^T \in \mathbb{R}^{K^2}
\psi(x, y) = [y_0 x, y_1 x, ..., y_K x]^T \in \mathbb{R}^{KD}
Here P_i and N_i represent the sets of valid and invalid label configurations, respectively,
for an instance i. In our case there is only one valid configuration of labels (the ground
truth), and any other configuration of labels falls in the invalid partition.
Note that this is not the standard dual obtained by the Lagrangian method. We
transformed the standard dual into a form similar to the PDSparse dual formulation,
using some linear transformations, in order to use their algorithms. In the loss
augmented case, the linear part of the above dual objective is replaced with
\sum_{i=1}^{m} \sum_{\hat{y}} \Delta(y_i, \hat{y})\, \alpha_i^{\hat{y}}, which is only possible because we have a single positive
configuration. We will discuss in the next section how to incorporate losses in cases
where |P_i| > 1. Also, because of the non-negativity constraint on the edge weights in
our formulation, the prox function takes different forms for node weights and edge
weights. The prox function for the node weights is given by
and edge weights. The prox function for node weights is given by
prox(x, λ) =
x− λ, x ≥ λ
x+ λ, x ≤ −λ
0, otherwise
(4.6)
Similarly, for the edge weights we get the following form of the proximity function:

\mathrm{prox}(x, \lambda) = \begin{cases} x - \lambda, & x \ge \lambda \\ 0, & \text{otherwise} \end{cases} \qquad (4.7)
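Both prox operators are elementwise shrinkage operations; a direct sketch of eqs. 4.6 and 4.7:

```python
import numpy as np

def prox_node(x, lam):
    """Two-sided soft threshold (eq. 4.6), applied to the node weights."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_edge(x, lam):
    """One-sided shrinkage (eq. 4.7): edge weights stay non-negative."""
    return np.maximum(x - lam, 0.0)

v = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(prox_node(v, 0.5))   # shrinks toward zero from both sides
print(prox_edge(v, 0.5))   # anything below lambda maps to zero
```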
It can be seen that the above dual formulation is very similar to the PDSparse
formulation, with exactly the same simplex constraints, which allows us to use the same
algorithms used in their work. We apply the same strategy of restricting the updates of
the variables to an active set of label configurations A_i for each sample i. For each
sample i, the α values are updated by solving the following optimization problem, using
the same projection algorithm as PDSparse:
\min_{\alpha_{A_i} \in C_i} \; \|(-\alpha_{A_i}^{N_i}) - b\|^2 + \|\alpha_{A_i}^{P_i} - c\|^2
\text{s.t.} \;\; \alpha_{A_i}^{P_i}, \alpha_{A_i}^{N_i} \text{ are the partitions of } \alpha \text{ w.r.t. } P_i, N_i
b_{\hat{y}} = (\langle w, \Psi(x_i, \hat{y}) \rangle + \Delta(y_i, \hat{y}))/Q_i - \alpha_{A_i}^{t}(\hat{y}), \quad \forall \hat{y} \in N_i
c_{y} = \alpha_{A_i}^{t}(y) - \langle w, \Psi(x_i, y) \rangle / Q_i, \quad \forall y \in P_i \qquad (4.8)
Here 1/Q_i represents the step size. We used different techniques for the step size
computation, including a constant step size, exponential decay, and the maximum
eigenvalue of the Hessian matrix. We summarize the overall training algorithm as
follows.

Algorithm 2: Fully-Corrective BCFW
1. Initialize the active set A_i^0 = {y_i}, α = 0
2. while t ≤ max_iter do
3.   Draw a sample index i ∈ [m] uniformly at random
4.   Find the most violating configuration ŷ via eq. 4.3
5.   A_i^{t+1/2} ← A_i^t ∪ {ŷ}
6.   Update the α values via eq. 4.8
7.   A_i^{t+1} ← A_i^{t+1/2} \ {ŷ | α_i^ŷ = 0}
8.   Maintain w(α)
9. end
We found that this formulation performed well on small datasets but failed to do so on
medium and large datasets. There were two major problems: the graph cut algorithm
used to find the violating label is very expensive (polynomial time in the label space),
and there are only a few positive labels (|y_i| ≪ K) for each instance. We found the
constraints very hard to satisfy, as we are comparing the scores of a few labels against a
comparatively very large number of negative (absent) labels. We tried to overcome these
limitations by modifying the formulation and the algorithms used. Below are some of
the variants we tried to achieve this objective.
Variant 1: The same formulation with an approximate solver instead of graph cuts.
Variant 2: No special constraint on the edge weights, i.e. they are allowed to be negative.
Variant 3: The edge score between two labels is added (triggered) to the total score of a
configuration only when both of them are present, but with the non-negativity constraint.
Variant 4: No edge potentials, i.e. the total score of a configuration is the sum of the
node potentials of the labels present in it.
We observed that in all variants except variant 4, the problem of finding the most
violating label configuration becomes intractable. We therefore used an approximation
algorithm for finding the violating label. All of these approximations are based on the
following graph relaxation algorithm. It does not provide any guarantee or bound on
closeness to the exact solution, but it runs very fast compared to graph cuts and provides
decent results.

Algorithm 3: Graph Relaxation Algorithm
Data:
Z = node potentials (scores of the individual labels)
E = edge potentials
Result: y, a set of labels
1. Initialize y = y_i, flip_count = |y|, max_score = score(y)
2. while iter ≤ max_iter and flip_count != 0 do
3.   flip_count = 0
4.   for i in 1...K do
5.     compute new_score by flipping the state of label i in y, using Z and E
6.     if new_score ≥ max_score then
7.       max_score = new_score
8.       increment flip_count
9.       flip label i in y
10.    end
11.  end
12. end
13. return y
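A sketch of this greedy-flip relaxation in code, using the variant-3 convention that an edge fires only when both labels are present (that convention, the strict improvement test used so that ties terminate, and the toy potentials are our own simplifications):

```python
import numpy as np

def config_score(y, Z, E):
    """Node potentials plus edge potentials for co-present label pairs
    (variant-3 convention; other variants change this edge term)."""
    return float(Z @ y + 0.5 * y @ E @ y)

def graph_relaxation(Z, E, y_init, max_iter=20):
    """Greedy single-label flips in the spirit of Algorithm 3."""
    y = y_init.copy()
    best = config_score(y, Z, E)
    for _ in range(max_iter):
        flips = 0
        for i in range(len(Z)):
            y[i] ^= 1                      # tentatively flip label i
            s = config_score(y, Z, E)
            if s > best:
                best, flips = s, flips + 1 # keep the flip
            else:
                y[i] ^= 1                  # undo the flip
        if flips == 0:
            break
    return y, best

Z = np.array([1.0, -0.5, 2.0, -2.0])       # toy node potentials (assumption)
E = np.zeros((4, 4)); E[0, 2] = E[2, 0] = 1.0
y0 = np.zeros(4, dtype=int)
y_star, s = graph_relaxation(Z, E, y0)
print(y_star, s)
```

On this toy instance the relaxation turns on labels 0 and 2, which reinforce each other through their shared edge potential.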
As mentioned before, for variant 4 the problem of finding the most violating label
configuration is tractable: since there are no edge potentials, all that is required is to
take the labels with non-negative node potentials.
4.3 General Formulations for Structured Objects
In the previous section we proposed different feature maps for learning the structure
between labels. In this section we propose different ways of forming the constraints,
using the same or different feature maps.
We observed that, because of the very small number of positive labels per instance, the
constraints in section 4.2 are hard to satisfy. We propose the following formulation that
relaxes those constraints by limiting the size of the label configurations being compared:
\min_{\bar{y}, |\bar{y}| = k} \Big[ \sum_{j=1}^{K} \bar{y}_j\, w_j^T x_i + \sum_{\bar{y}_p = \bar{y}_q = 1} w_{pq} \Big] \;-\; \max_{\hat{y}, |\hat{y}| = k} \Big[ \sum_{j=1}^{K} \hat{y}_j\, w_j^T x_i + \sum_{\hat{y}_p = \hat{y}_q = 1} w_{pq} \Big] \ge 1 - \xi_i \qquad (4.9)
In words, the constraint says that the score of any size-k combination of ground truth
(positive) labels should be greater than the score of any size-k combination which
contains at least one negative label.
Let the number of active labels for an example x be r.
1. r > k: the scores of all C(r, k) combinations of active labels should be greater than
that of any other combination of the same size which contains a negative label.
2. r = k: same as above.
3. r < k: we set k = r for that example and apply the above constraint.
Special Case: PDSparse is a special case of this formulation when the edge parameters
are zero and we optimize for k = 1.
The idea behind the above formulation is that it boosts the score of a group of labels
instead of individual labels, and should therefore result in better accuracies for top-k
predictions. It uses the same training algorithm discussed in the previous section. It
faces the same problem as the earlier variants, namely that finding the maximum
violator is intractable when edge weights are used, so we again used an approximate
solver based on algorithm 3.
Loss Augmentation:
We observed that we cannot use the same projection algorithm as PDSparse to update
the dual parameters when we augment a loss function (margin rescaling) to formulation
4.9. This is because the standard margin/slack rescaled formulation cannot be converted
into an equivalent dual program with the same constraints as in eq. 4.5.
The standard Lagrangian dual for the margin rescaled formulation is as follows:

\min_{\alpha} \; \frac{1}{2}\|w(\alpha)\|_2^2 - \sum_{i=1}^{m} \sum_{\bar{y} \in P_i, \hat{y} \in N_i} \Delta(\bar{y}, \hat{y})\, \alpha_i^{\bar{y}\hat{y}}
\text{s.t.} \;\; \alpha_i \succeq 0, \quad \alpha_i^T \mathbf{1} = C \quad \forall i \in [m]
w(\alpha) = \mathrm{prox}\Big( \sum_{i=1}^{m} \sum_{\bar{y} \in P_i, \hat{y} \in N_i} \alpha_i^{\bar{y}\hat{y}}\, (\Psi(x_i, \bar{y}) - \Psi(x_i, \hat{y})), \lambda \Big) \qquad (4.10)
Note that the constraints on the dual parameters (the α values) are different from
eq. 4.5 and are much simpler. We used different algorithms to perform the dual update,
including projected gradient descent (projection onto the probability simplex [13]), the
Frank-Wolfe (conditional gradient) method, and variants of the Frank-Wolfe algorithm
[6]. The overall training algorithm is summarized below.

Algorithm 4: Fully-Corrective BCFW for Margin Rescaled Formulations
1. Initialize α = 0
2. Initialize the positive set P_i = {ȳ | all possible positive combinations of size k}, ∀i
3. Initialize the negative set N_i = {}, the set containing invalid/negative combinations
4. while t ≤ max_iter do
5.   Draw a sample index i ∈ [m] uniformly at random
6.   Find the most violating configuration ŷ = argmax_{ŷ ∉ P_i ∪ N_i, |ŷ| = k} ⟨w, ψ(x_i, ŷ)⟩
7.   N_i^{t+1} ← N_i^t ∪ {ŷ}
8.   Update the α values by minimizing eq. 4.10 with the updated N_i
9.   Maintain w(α)
10. end
As one can see, step 6 of the above algorithm requires finding a violating configuration of
size k which is not currently present in the active sets P_i and N_i. We developed the
following algorithm
for this purpose. However, this algorithm works only when the total score is
decomposable over the individual label scores. For the cases with edge potentials we
used the approximate algorithms based on algorithm 3.
Algorithm 5: Growing Sets Algorithm
Data:
A_p = {Ȳ^1, ..., Ȳ^i}: positive active set, where each Ȳ ⊆ {1, ..., K} with |Ȳ| = k
A_n = {Ŷ^1, ..., Ŷ^j}: negative active set, where each Ŷ ⊆ {1, ..., K} with |Ŷ| = k
Y ⊆ {1, ..., K}: ground truth labels for the current example
Z ∈ R^K: scores of each individual label, where z_i = ⟨w_i, x⟩
k: size of the required label configuration
Result: Y* = argmax_{Y' ∉ A_p ∪ A_n, |Y'| = k} Σ_{j ∈ Y'} Z_j
1. L ← min(|Y| + k·|A_n| + 1, K)
2. Γ ← top L labels according to the scores Z (using a heap of size L)
3. Let level_0, level_1, ..., level_k be k+1 lists, each containing an empty set (φ)
4. for each label l_i in Γ do
5.   for j in k, ..., 1 do
6.     for each set Y' in level_{j−1} do
7.       Y' = Y' + l_i
8.       if j == k and Y' ∉ A_p ∪ A_n then return Y'
9.       else level_j ← level_j + Y'
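When the score is decomposable (as in variant 4), the same search can be done by brute force over just the top-scoring labels, since the best excluded set can always be built from a small prefix of the score ranking (a bound in the spirit of Algorithm 5's choice of L). The helper name and the brute-force enumeration below are our own illustrative simplification, not the algorithm itself:

```python
from itertools import combinations

import numpy as np

def most_violating_topk(z, k, active):
    """Best size-k label set by total score, excluding sets already in `active`.

    Valid only when the configuration score decomposes over labels. It is
    enough to look at the top L = k + len(active) + 1 labels: at most
    len(active) of the candidate sets built from them can be excluded."""
    L = min(k + len(active) + 1, len(z))
    top = np.argsort(-z)[:L]
    best, best_score = None, -np.inf
    for combo in combinations(top, k):
        s = frozenset(int(c) for c in combo)
        if s in active:
            continue
        score = z[list(combo)].sum()
        if score > best_score:
            best, best_score = s, score
    return best

z = np.array([0.1, 2.0, 1.5, -0.3, 0.9])
active = {frozenset({1, 2})}           # already in the active set
print(most_violating_topk(z, 2, active))
```

Here the highest-scoring pair {1, 2} is skipped because it is already active, so the next-best pair {1, 4} is returned.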
Recursive Constraints:
We found that although we had relaxed the constraints by limiting the size of the label
configurations, they were less powerful than the original PDSparse constraints over
individual labels. Therefore, we decided to also incorporate those constraints, together
with our grouping constraint, leading to the following recursive constraints formulation:
\min_{w, \xi \succeq 0} \; \frac{1}{2}\|w\|_2^2 + \lambda\|w\|_1 + \sum_{i=1}^{m} \sum_{p=1}^{k} C_p\, \xi_i^p
\text{s.t.} \;\; \langle w, \psi(x_i, \bar{y}_i) - \psi(x_i, \hat{y}_i) \rangle \ge \Delta(\bar{y}, \hat{y}) - \xi_i^p, \quad \forall i \in [m],\; |\bar{y}_i| = |\hat{y}_i| = p \in \{1, 2, ..., k\} \qquad (4.11)
The corresponding dual program is:

\min_{\alpha} \; \frac{1}{2}\|w(\alpha)\|_2^2 - \sum_{i=1}^{m} \sum_{p=1}^{k} \sum_{\bar{y} \in P_i^p, \hat{y} \in N_i^p} \Delta(\bar{y}, \hat{y})\, \alpha_{ip}^{\bar{y}\hat{y}}
\text{s.t.} \;\; \alpha_{ip} \succeq 0, \quad \alpha_{ip}^T \mathbf{1} = C \quad \forall i \in [m], p \in [k]
w(\alpha) = \mathrm{prox}\Big( \sum_{i=1}^{m} \sum_{p=1}^{k} \sum_{\bar{y} \in P_i^p, \hat{y} \in N_i^p} \alpha_{ip}^{\bar{y}\hat{y}}\, (\Psi(x_i, \bar{y}) - \Psi(x_i, \hat{y})), \lambda \Big) \qquad (4.12)
Note that the constraints here have the same form as in eq. 4.10, so we used the same
algorithms as for the previous formulation. We also made small modifications to the
growing sets algorithm (algorithm 5) to simultaneously find the maximum violating
label configurations of different sizes.
Negative Slack Only:
In the above formulations we found that the number of constraints added in each
iteration is larger than in the original PDSparse formulation, and training therefore
takes longer. The reason is that for each negative configuration added to the active set
N_i, |P_i| constraints are added to the problem. It is also not possible to remove
constraints at the end of each iteration as we did in algorithm 2. To address this, we
propose the following formulation that separates the constraints for positive and
negative configurations:
\min_{w, \xi \succeq 0} \; \frac{1}{2}\|w\|_2^2 + \lambda\|w\|_1 + C \sum_{i=1}^{m} \xi_i
\text{s.t.} \;\; \langle w, \psi(x_i, \bar{y}_i) \rangle \ge 1,
\langle w, \psi(x_i, \hat{y}_i) \rangle \le \delta(\hat{y}_i, y_i) + \xi_i, \quad \forall i \in [m],
\text{where } \delta(\hat{y}_i, y_i) = \frac{|\hat{y}_i \cap y_i|}{|\hat{y}_i|} \qquad (4.13)
Here for δ(ŷ, y_i) we used the Hamming loss, but in general it can be any loss function.
Basically, we want all positive labels/configurations to have a positive score, while only a
few confusing/negative configurations are allowed a positive score, by granting them a
slack.
The corresponding dual program is:

\min_{\alpha} \; \frac{1}{2}\|w(\alpha)\|_2^2 - \sum_{i=1}^{m} \Big[ \sum_{\hat{y} \in N_i} \delta(\hat{y}, y_i)\, \alpha_i^{\hat{y}} + \sum_{\bar{y} \in P_i} \alpha_i^{\bar{y}} \Big]
\text{s.t.} \;\; \alpha_i \succeq 0, \quad \sum_{\hat{y} \in N_i} \alpha_i^{\hat{y}} = C \quad \forall i \in [m]
w(\alpha) = \mathrm{prox}\Big( \sum_{i=1}^{m} \sum_{y \in P_i \cup N_i} \alpha_i^{y}\, \Psi(x_i, y), \lambda \Big) \qquad (4.14)
We used the same algorithms, projected gradient descent, Frank-Wolfe, etc., for
updating the dual parameters corresponding to the negative configurations; for the
positive ones the constraint is much simpler and we just clip the dual parameters (the α
values) after the gradient update. This formulation is very similar to the traditional
binary SVM formulation, with slight modifications to the margin and slack. In our work
we combined this formulation with many of the formulations we have discussed,
including recursive constraints, limiting the size of label configurations, etc.
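The Euclidean projection onto the probability simplex used in these dual updates can be sketched with a standard sort-based routine in the spirit of the method cited as [13]; the scaling parameter z below is an assumption added for generality:

```python
import numpy as np

def project_simplex(v, z=1.0):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = z} (sort-based)."""
    u = np.sort(v)[::-1]                 # sort entries in decreasing order
    css = np.cumsum(u)
    # Largest index rho (0-based) with u[rho] * (rho+1) > css[rho] - z.
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

p = project_simplex(np.array([0.5, 1.2, -0.3]))
print(p, p.sum())
```

Clipping the positive-configuration duals, by contrast, is just an elementwise `np.maximum` with the constraint bounds.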
4.4 Large Margin Learning with Label Embeddings
The core of these formulations remains the same, i.e. combining the structural
information with the max-margin (OneVsAll) learner. We already presented formulations
in section 4.2 that try to capture this information by introducing edges between labels,
but they had their limitations. Here, instead of using edges, we use label embeddings as
a way of capturing the interactions between co-occurring labels. This differs from
embedding based approaches like [3], [16], etc. in that, instead of trying to learn
regressors (compressors) and decompressors, it only uses the embeddings to learn a
better margin between labels in the original space. We propose the following
formulations based on the above discussion.
Formulation 1: We use the same l1-l2 regularized objective from eq. 4.1 with a
modification to the score computation/feature map of a configuration. Previously,
the total score was simply the sum/average of the individual node potentials; in
addition to that we now have the score of the label embeddings, V.
Mathematically, the constraints have the following form:

\min_{\bar{y}_i} \sum_{j=1}^{K} \bar{y}_{ij}\,(w_j^T x_i + \lambda_e \langle E, V_j \rangle) \;-\; \max_{\hat{y}_i} \sum_{j=1}^{K} \hat{y}_{ij}\,(w_j^T x_i + \lambda_e \langle E, V_j \rangle) \ge 1 - \xi_i \qquad (4.15)

where λ_e is a hyperparameter controlling the relative importance of the node and
embedding scores, V_j ∈ R^K is the embedding corresponding to the jth label, and
E ∈ R^K corresponds to the shared weights, similar to the edge weights in section 4.2.
Formulation 2: In the above version we do not use the feature vector x_i in the
embedding score computation. In this version we simultaneously try to maximize the
score of a positive configuration and learn to predict the embeddings from x_i.
Mathematically,

\min_{\bar{y}} \sum_{j=1}^{K} \bar{y}_j\,(w_j^T x_i + \lambda_e \langle E x_i, V_j \rangle) \;-\; \max_{\hat{y}} \sum_{j=1}^{K} \hat{y}_j\,(w_j^T x_i + \lambda_e \langle E x_i, V_j \rangle) \ge 1 - \xi_i \qquad (4.16)

where V_j ∈ R^K is the embedding corresponding to the jth label, as before, and
E ∈ R^{K×D} is the transformation matrix/regressor that maps features into the
embedding space.
Note that in both formulations the objective is to learn the weights w = [w1, w2, ..., wK, vec(E)],
not the embeddings V. In other words, we are not simultaneously learning to predict em-
beddings from the feature space and to transform labels into the embedding space. Instead, we used
an unsupervised technique, specifically GloVe embeddings [10], to learn V from the label space.
The reason for using GloVe embeddings is that they only require co-occurrence statistics from the
corpus, which are very cheap to compute given that only a few labels are active/present for each
instance. However, one can use any supervised or unsupervised technique to transform
labels into lower-dimensional embeddings and plug it into either of the above formulations.
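The co-occurrence statistics needed by GloVe are cheap precisely because each instance activates only a few labels. A minimal sketch of the counting step (the subsequent GloVe factorization that produces V is done with standard tooling and is not shown):

```python
import numpy as np

def label_cooccurrence(Y):
    """K x K co-occurrence counts from a 0/1 instance-by-label matrix Y.
    Entry (i, j) counts how many instances have labels i and j both active;
    the diagonal is zeroed since self-co-occurrence is not used."""
    Y = np.asarray(Y, dtype=np.float64)
    C = Y.T @ Y                 # pairwise active-together counts
    np.fill_diagonal(C, 0.0)
    return C

Y = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])
C = label_cooccurrence(Y)
# labels 0 and 1 co-occur in two instances, 0 and 2 in one, 1 and 2 never
```

For sparse label matrices the same product can be computed with scipy.sparse in time proportional to the number of active label pairs per instance.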
Since here, too, finding the maximally violated constraint requires a pair of con-
figurations (the minimum over positives and the maximum over negatives), we adopted the same strategy
of placing all valid/positive configurations in the active set upfront and searching
only for negative configurations. We used the same growing-sets algorithm, discussed in
the previous section, to find the maximally violating configuration among the negatives, as the total
score of a configuration is decomposable over individual labels, unlike the edge weights in
formulation 4.2 and its variants. We also tried the following variants of the above formulations,
based on the formulations discussed in section 4.3.
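Because the total score decomposes over labels, the search for the maximally violating negative configuration reduces to per-label thresholding. The sketch below is our reading of that search, not the thesis code; the exact handling of the positive set may differ.

```python
import numpy as np

def max_violating_negative(scores, positive):
    """Highest-scoring configuration that differs from the true label set,
    when the total score is a sum of independent per-label terms.

    scores: (K,) per-label scores (node potential + embedding part).
    positive: set of true label indices.
    The unconstrained maximizer activates exactly the labels with positive
    score; if that happens to equal the true set, flipping the label whose
    flip costs least gives the best negative configuration."""
    best = set(np.flatnonzero(scores > 0).tolist())
    if best != positive:
        return best
    # Flipping label j changes the score by -|scores[j]| (drop a positive-
    # score label, or add a non-positive-score one), so flip the cheapest.
    j = int(np.argmin(np.abs(scores)))
    return best ^ {j}
```

This is linear in K, which is what makes the growing-sets search practical compared to the edge-weight formulations, where the score does not decompose.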
Table 4.1: Different Variants of Embedding Formulation

Variant     Margin Rescaling   Recursive Constraints   Negative Slack Only
Variant 1   Yes                Yes                     No
Variant 2   Yes                No                      Yes
Variant 3   Yes                Yes                     Yes
Summary
In this chapter we presented various formulations based on the existing PDSparse ap-
proach; in particular, we showed how one can use the existing PDSparse framework by
defining custom feature maps and algorithms for finding the most violating constraints.
In the next chapter we present empirical results of these formulations on some of the
extreme classification datasets.
Chapter 5
Implementation Results and Analysis
In this chapter we provide an empirical analysis of the various approaches discussed in
the previous chapter. We present the results obtained by the different formulations and
compare them with the state-of-the-art system. We further analyze the results,
explaining possible reasons behind the observed behavior.
5.1 Datasets
We experimented on various datasets from different domains. Some of these are very
small in terms of label and feature space, and hence do not fall under the extreme classifi-
cation category; we used those datasets to test our hypotheses before applying the methods
to medium/large datasets. Table 5.1 summarizes the details of each dataset. Although
Table 5.1: Dataset Details

Dataset        Type         Train   Dev     Test    Features   Labels   Avg. Labels/Point
yeast          multilabel   1500    400     517     103        14       5
scene          multilabel   1211    598     598     294        6        1.07
emotions       multilabel   391     100     102     72         6        1.8
sector         multiclass   7793    865     961     55197      105      1
aloi.bin       multiclass   90000   10000   8000    636911     1000     1
bookmarks      multilabel   58863   17396   11597   2150       208      2
bibtex         multilabel   4954    1465    976     1836       159      2.4
Eur-Lex        multilabel   15643   1738    1933    5000       3956     5
rcv1_regions   multilabel   20835   2314    5000    47237      225      1.2
not all our datasets comply with the extreme label spaces we referred to earlier in the
introduction, we found that they provide enough evidence for drawing conclusions. An-
other reason for not using large datasets is time complexity: such datasets typically
take hours or days to train with the existing PDSparse technique, and most of our formu-
lations take more time than PDSparse. Some of them even took hours to train on these
smaller datasets, for which PDSparse takes minutes.
5.2 Experimental Results
In this section we present the results of some of the experiments we conducted on the
above-mentioned datasets. We initially worked on smaller datasets to test our hypotheses.
Table 5.2 presents the results of our base structured learning formulation on very small datasets.
Table 5.2: Comparison of PDSparse with Structured Learning Formulation 4.2
Dataset Train Time AccuracyPDSparse Our Method PDSparse Our Method
Emotions 0.1 0.97 58.82 63.72Scene 0.7688 2.52 68.62 59.89Yeast 0.43 6.55 64.99 77.17
We observed that the formulation outperformed PDSparse on two of the three datasets,
which motivated our work. However, the problem with this formulation is that it takes
very long to train on medium-sized datasets. Therefore we tried the following variants
to reduce the time complexity of the algorithm:
Version 1: No edge parameters, with an exact solver for the violating constraint
Version 2: Edges triggered only when both labels are active, with an approximate solver
Version 3: Approximate solver with negative edge weights permitted
Version 4: Brute-force solver for smaller datasets (non-negativity constraint)
Version 5: Brute-force solver for smaller datasets (negative edge weights allowed)
Table 5.3: Comparison of different variants of the structured learning formulation (Section 4.2) with PDSparse (accuracy; blank entries were not reported)

Dataset        PDSparse   Version 1   Version 2   Version 3   Version 4   Version 5
Emotions       58.52      63.72       65.68       63.72       64.7        62.74
Scene (new)    73.91      71.97       72.47       70.4        70.97       69.39
Yeast          64.99      76.01       77.57       76.01
Bibtex         63.5       55.94       53.27       51.22
Eur-Lex        74.9       60.47       61.4
Rcv1_regions   96.48      95.08       95.6        92.86
Sector         95.94      94.7
digits         96.91      95.37
We found that these variants do not perform well on medium datasets. The reason is that
each instance has only a few positive labels, so the constraints are very hard to satisfy:
they compare the score of a few positive labels against the entire label space. We therefore
tried to relax the constraints by limiting the size of the label configurations, as discussed
in section 4.3. Table 5.4 summarizes the results obtained.
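The tables that follow report Precision@k, the fraction of the k highest-scoring labels that are truly relevant. A small reference implementation:

```python
import numpy as np

def precision_at_k(scores, true_labels, k):
    """Precision@k: fraction of the k highest-scoring labels that appear
    in the set of true labels."""
    topk = np.argsort(scores)[::-1][:k]          # indices of top-k scores
    return len(set(topk.tolist()) & set(true_labels)) / k

s = np.array([0.9, 0.1, 0.8, 0.3])
p = precision_at_k(s, {0, 3}, 2)   # top-2 = {0, 2}; only label 0 relevant -> 0.5
```

Precision@k is the standard extreme-classification metric because only the few top-ranked labels matter in applications such as tagging and recommendation.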
Table 5.4: PDSparse vs Version 6: Formulation 4.9 without edge weights

Dataset        Precision@1   Precision@2   Precision@3   Algorithm
bibtex         63.11         48.05         38.76         PDSparse
               63.93         48.05         38.52         Version 6, k=1
               61.02         45.69         37.77         k=2
               57.07         45.39         36.4          k=3
rcv1_regions   96.48         57.75         40.68         PDSparse
               96.82         57.86         40.78         Version 6, k=1
               96.74         57.84         40.68         k=2
               96.59         57.85         40.78         k=3
EurLex         75.99         67.58         61.4          PDSparse
               76.3          66.45         59.83         Version 6, k=1
               66.2          61.45         56.02         k=2
               61.97         56.31         51.59         k=3
sector         95.62                                     PDSparse
               95.42                                     Version 6, k=1
aloi.bin       96.51                                     PDSparse
               96.45                                     Version 6, k=1
We observed that accuracies dropped as we increased the size of the label configurations,
which contradicted the idea behind grouping labels together. We found two possible
reasons for this. Firstly, there was no loss function
(hard margin), i.e. the optimizer penalizes two invalid configurations equally, irrespective of
their closeness to valid configurations. Secondly, we observed that smaller-size constraints
(size refers to the number of labels, k = 1, 2, ...) are more powerful than larger ones unless
the larger ones carry some additional information. Basically, if there are no edge param-
eters, then satisfying the smaller-size constraints automatically satisfies the larger ones. The
idea was that, in cases where the smaller-size constraints are hard to satisfy, edge parameters
would help satisfy the larger ones and thereby increase the overall performance of the
algorithm. Table 5.5 and Table 5.6 present the results obtained by incorporating
recursive constraints and augmenting a loss function.
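Margin rescaling replaces the hard margin of 1 with a configuration-dependent loss, so near-miss configurations are penalized less than configurations far from valid ones. A sketch using Hamming loss; the particular loss function is our assumption, not necessarily the one used in the thesis.

```python
import numpy as np

def loss_augmented_score(scores, y_cand, y_true):
    """Loss-augmented score used when searching for the most violated
    constraint under margin rescaling: candidates far from the true
    configuration (large Hamming loss) must be beaten by a larger margin."""
    hamming = int(np.sum(y_cand != y_true))     # number of label disagreements
    return float(y_cand @ scores) + hamming

scores = np.array([1.0, 2.0])
a = loss_augmented_score(scores, np.array([1, 0]), np.array([0, 1]))  # 1 + 2 = 3
b = loss_augmented_score(scores, np.array([0, 1]), np.array([0, 1]))  # 2 + 0 = 2
```

The most violated constraint then maximizes this augmented score over negative configurations, which keeps the search decomposable because Hamming loss is itself a sum over labels.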
Table 5.5: PDSparse vs Version 7: margin-rescaled formulation with recursive constraints, without edge parameters

Dataset        Precision@1   Precision@2   Precision@3   Algorithm
Bibtex         63.21         48.1          39.13         PDSparse
               61.57         48.25         39.17         Version 7, k=1
               62.09         47.49         38.31         k=2
rcv1_regions   96.6          57.67                       PDSparse
               96.54         57.89                       Version 7, k=1
               96.92         57.66                       k=2
Eur-Lex        75.63         67.58         60.78         PDSparse
               78.06         68.05         61.44         Version 7, k=1
               71.702        65.46         60.28         k=2
sector         95.73                                     PDSparse
               95.42                                     Version 7, k=1
aloi.bin       96.33                                     PDSparse
               95.58                                     Version 7, k=1
Table 5.6: PDSparse vs Version 8: negative slack formulation with recursive constraints and edges triggered only when both labels are present

Dataset        Precision@1   Precision@2   Precision@3   Algorithm
Bibtex         63.21         48.1          39.13         PDSparse
               63.62         47.69         38.96         Version 8, k=1
               64.95         47.18         38.11         k=2
rcv1_regions   96.6          57.67                       PDSparse
               96.52         57.73                       Version 8, k=1
               96.8          57.17                       k=2
Eur-Lex        75.63         67.58         60.78         PDSparse
               78.22         70.33         63.04         Version 8, k=1
               78.06         70.46         62.66         k=2
We observed that by combining edges, recursive constraints and margin rescaling,
performance improved for the larger-size constraints. However, we used brute-force solvers to
find the maximum violator, which take a lot of time and are not practical. As mentioned
earlier, some additional information is needed in the larger-size constraints to help
when the smaller-size constraints are violated, so we considered using embeddings in
place of edges. The idea was that embeddings can capture the label correlations and
provide the additional information needed at a slight expense of time (much less than
with edges). Table 5.7 presents the results of this idea.
Table 5.7: PDSparse vs Version 9: negative-slack-only formulation with margin rescaling, recursive constraints and GloVe embeddings (Formulation 4.15)

Dataset        Precision@1      Precision@2      Precision@3      Algorithm
Bibtex         63.21            48.1             39.13            PDSparse
               63.83 (64.03)    48.3 (48.51)     38.96 (39.54)    Version 9, k=1
               64.03 (63.93)    47.28 (46.10)    38.55 (38.21)    k=2
rcv1_regions   96.6             57.67                             PDSparse
               96.4 (96.26)     57.57 (57.53)                     Version 9, k=1
               96.42 (96.7)     57.09 (57.34)                     k=2
Eur-Lex        75.63            67.58            60.78            PDSparse
               77.34 (79.04)    69.83 (70.58)    63.06 (63.32)    Version 9, k=1
               72.47 (74.75)    64.25 (67.41)    57.80 (59.90)    k=2

Note: Values in brackets are accuracies obtained after applying Stochastic Weight
Averaging (SWA) [7] while training.
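Stochastic Weight Averaging [7], used for the bracketed numbers above, maintains a running average of the weight iterates visited during training and uses the average for prediction. A minimal sketch of the averaging step:

```python
import numpy as np

def swa_update(w_avg, w, n_models):
    """One SWA step: fold the current iterate w into the running average
    w_avg, where n_models iterates have been averaged so far."""
    return (w_avg * n_models + w) / (n_models + 1)

w_avg = np.zeros(3)
iterates = [np.array([1.0, 1.0, 1.0]),
            np.array([3.0, 3.0, 3.0])]
for t, w in enumerate(iterates):
    w_avg = swa_update(w_avg, w, t)
# w_avg is now the mean of the two iterates: [2., 2., 2.]
```

The schedule for which iterates enter the average (every epoch, every few steps, after a warm-up) is a design choice; [7] averages along a cyclical or constant learning-rate trajectory.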
Table 5.8: PDSparse vs Version 10: negative-slack-only formulation with margin rescaling, recursive constraints and GloVe embeddings (Formulation 4.16)

Dataset        Precision@1   Precision@2   Precision@3   Algorithm
Bibtex         63.21         48.1          39.13         PDSparse
               63.42         47.95         39.27         Version 10, k=1
               63.52         46.82         37.36         k=2
rcv1_regions   96.6          57.67                       PDSparse
               96.44         57.69                       Version 10, k=1
               96.5          57.31                       k=2
Eur-Lex        75.63         67.58         60.78         PDSparse
               77.96         69.83         63.26         Version 10, k=1
               72.27         63.47         56.80         k=2
Table 5.8 shows the results obtained after incorporating the input features in the embedding score.
We observed that they do not make much difference, as the embeddings are learned without
using them, while the time complexity of the algorithm increased considerably.
Chapter 6
Conclusion and Future Work
This final chapter summarizes the work done on improving the performance of extreme
classification systems and gives an overview of the work that can be done in the future.
The chapter is organized as follows. Section 6.1 points out some interesting
observations from the entire report, and section 6.2 discusses future directions for
further improving the system.
6.1 Conclusion
This section describes some of the key takeaways from the entire report. The important
observations are as follows:
• Despite the recent advancements in deep learning and its applications in major
fields such as vision and speech, we found that traditional machine learning
techniques like max-margin learners (SVMs) outperform it in the case of extreme
classification.
• The performance of state-of-the-art OneVsAll learners can be improved by incorpo-
rating the latent information present in the label space.
• We presented various ways of using the existing PDSparse framework for
learning to predict structured objects, by defining feature maps and algorithms for
finding the maximum violator.
6.2 Future Work
This section describes some future directions that can further improve extreme
classification systems. Some of them are:
1. Low accuracy on "tail labels" remains an area for further research, as most
algorithms fail to perform well in those cases.
2. In real-life scenarios we often encounter missing data, i.e. not all the relevant labels
are marked for an instance, which calls into question the performance measures we use to
compare different techniques. Currently, most algorithms treat the given data
as complete and therefore may fail to learn the true relationship
between input features and labels. Also, the performance measures we use treat all
absent labels as negatives, which is not true in most cases. Hence we need better
measures and techniques that can incorporate the idea of missing or incomplete
information.
3. In our work we tried different ways of incorporating the structural information
present in the labels into state-of-the-art OneVsAll learners to boost their perfor-
mance. They have shown improvements, but more work is required to combine
the two approaches in a way that gives better results.
Bibliography
[1] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-
label learning with millions of labels: Recommending advertiser bid phrases for web
pages. In Proceedings of the 22nd international conference on World Wide Web. ACM,
pages 13–24.
[2] Rohit Babbar and Bernhard Schölkopf. 2017. DiSMEC: Distributed sparse machines
for extreme multi-label classification. In Proceedings of the Tenth ACM International
Conference on Web Search and Data Mining. ACM, pages 721–729.
[3] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. 2015. Sparse local embeddings
for extreme multi-label classification. In Advances in Neural Information Processing
Systems.
[4] Stephen Boyd. 2011. Alternating direction method of multipliers. In Talk at NIPS
Workshop on Optimization and Machine Learning.
[5] Yuri Boykov, Olga Veksler, and Ramin Zabih. 2001. Fast approximate energy mini-
mization via graph cuts. IEEE Transactions on pattern analysis and machine intelli-
gence 23(11):1222–1239.
[6] Donald Goldfarb, Garud Iyengar, and Chaoxu Zhou. 2017. Linear convergence of
stochastic Frank-Wolfe variants. arXiv preprint arXiv:1703.07269.
[7] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry P. Vetrov, and An-
drew Gordon Wilson. 2018. Averaging weights leads to wider optima and better gener-
alization. CoRR abs/1803.05407.
[8] H. Jain, Y. Prabhu, and M. Varma. 2016. Extreme multi-label loss functions for
recommendation, tagging, ranking and other missing label applications. In Proceedings
of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[9] Prateek Jain, Raghu Meka, and Inderjit S Dhillon. 2010. Guaranteed rank mini-
mization via singular value projection. In Advances in Neural Information Processing
Systems. pages 937–945.
[10] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global
vectors for word representation. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP). pages 1532–1543.
[11] Y. Prabhu and M. Varma. 2014. FastXML: A fast, accurate and stable tree-classifier
for extreme multi-label learning. In Proceedings of the ACM SIGKDD Conference on
Knowledge Discovery and Data Mining.
[12] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun.
2005. Large margin methods for structured and interdependent output variables. Jour-
nal of machine learning research 6(Sep):1453–1484.
[13] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, and Wei Chen. 2013. A
theoretical analysis of NDCG type ranking measures. arXiv preprint arXiv:1304.6480.
[14] Jason Weston, Ameesh Makadia, and Hector Yee. 2013. Label partitioning for sub-
linear ranking. In Sanjoy Dasgupta and David Mcallester, editors, Proceedings of the
30th International Conference on Machine Learning (ICML-13). JMLR Workshop and
Conference Proceedings, volume 28, pages 181–189.
[15] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit
Dhillon. 2016. PD-Sparse: A primal and dual sparse approach to extreme multiclass
and multilabel classification. In Maria Florina Balcan and Kilian Q. Weinberger, edi-
tors, Proceedings of The 33rd International Conference on Machine Learning. PMLR,
New York, New York, USA, volume 48 of Proceedings of Machine Learning Research,
pages 3069–3077.
[16] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. 2014. Large-
scale multi-label learning with missing labels. In International Conference on Machine
Learning (ICML). volume 32.
Acknowledgements

I would like to express my sincere gratitude to my advisors Prof. Saketha Nath and
Prof. Sunita Sarawagi for the continuous support of my M.Tech study and related
research, and for their patience, motivation, and immense knowledge. Their guidance helped
me throughout the research and the writing of this thesis. I could not have imagined having
better advisors and mentors for my M.Tech study.

Signature:
Date: June 2018

Akshat Jaiswal
163050069