
Design and Analysis of Consistent Algorithms

for Multiclass Learning Problems

A THESIS

SUBMITTED FOR THE DEGREE OF

Doctor of Philosophy

IN THE

Faculty of Engineering

BY

Harish Guruprasad Ramaswamy

Computer Science and Automation

Indian Institute of Science

Bangalore – 560 012 (INDIA)

June 2015

Dedicated to my parents and teachers


“Not much of a cheese shop really, is it?”

“Finest in the district, sir.”

“And what leads you to that conclusion?”

“Well, it’s so clean.”

“It’s certainly uncontaminated by cheese.”

- Monty Python’s Flying Circus.

Acknowledgements

I would like to express my sincere thanks to my advisor, Prof. Shivani Agarwal. With her systematic approach, she helped me focus on the important aspects of research life. Despite her busy schedule, she was always approachable and ready to offer insightful thoughts, which led to many interesting discussions, both technical and non-technical. I express my profound gratitude to Prof. Ambuj Tewari of the University of Michigan and Prof. Robert Williamson of the Australian National University for stimulating discussions and collaborations. I also thank Prof. Chiranjib Bhattacharyya and Prof. P.S. Sastry for their guidance and support.

I thank my lab members Arun and Harikrishna for many interesting discussions and collaborations. Many thanks are also due to my other present and past lab members Priyanka, Jay, Chandrahas, Rohit, Siddharth, Anirban, Saneem, Arpit and Aadirupa.

Mere words are not enough to express my gratitude and affection towards my many friends at IISc who made my life here memorable and eventful. In particular, I would like to thank Arun, Harikrishna, Raman, Achintya, Srinivasan, Chandru, Hariprasad, Madhavan, Madhusudhan, Abhinav and Ramnath.

I would also like to thank the Indian Institute of Science and Tata Consultancy Services for supporting me financially during my PhD. Special thanks to the Indo-US Virtual Institute of Mathematical and Statistical Sciences (VIMSS) for funding a short visit to the University of Michigan, which helped my research greatly.

Finally, I would like to thank my parents for their constant love and support.

Note: Chapter 9, on consistent algorithms for complex multiclass evaluation metrics, is joint work with Harikrishna Narasimhan. The description of this work in this thesis focuses on our contributions; other aspects of the work will be described in greater detail in Harikrishna Narasimhan's thesis.


Abstract

We consider the broad framework of supervised learning, where one gets examples of objects together with some labels (such as tissue samples labeled as cancerous or non-cancerous, or images of handwritten digits labeled with the correct digit in 0-9), and the goal is to learn a prediction model which, given a new object, makes an accurate prediction. The notion of accuracy depends on the learning problem under study and is measured by a performance measure of interest. A supervised learning algorithm is said to be ‘statistically consistent’ if it returns an ‘optimal’ prediction model with respect to the desired performance measure in the limit of infinite data. Statistical consistency is a fundamental notion in supervised machine learning, and the design of consistent algorithms for various learning problems is therefore an important question. While this question has been well studied for simple binary classification problems and some other specific learning problems, it remains open for general multiclass learning problems. We investigate several aspects of this question, as detailed below.

First, we develop an understanding of consistency for multiclass performance measures defined by a general loss matrix, for which convex surrogate risk minimization algorithms are widely used. Consistency of such algorithms hinges on the notion of ‘calibration’ of the surrogate loss with respect to the target loss matrix; we start by developing a general understanding of this notion, and give both necessary conditions and sufficient conditions for a surrogate loss to be calibrated with respect to a target loss matrix. We then define a fundamental quantity associated with any loss matrix, which we term the ‘convex calibration dimension’ of the loss matrix; this gives one measure of the intrinsic difficulty of designing convex calibrated surrogates for a given loss matrix. We derive lower bounds on the convex calibration dimension which lead to several new results on the non-existence of convex calibrated surrogates for various losses. For example, our results improve on recent results on the non-existence of low-dimensional convex calibrated surrogates for various subset ranking losses, such as the pairwise disagreement (PD) and mean average precision (MAP) losses. We also upper bound the convex calibration dimension of a loss matrix by its rank, by constructing an explicit, generic, least-squares-type convex calibrated surrogate whose dimension is at most the (linear algebraic) rank of the loss matrix. This yields low-dimensional convex calibrated surrogates, and therefore consistent learning algorithms, for a variety of structured prediction problems in which the associated loss has low rank, including for example the precision @ k and expected rank utility (ERU) losses used in subset ranking problems. For settings where achieving exact consistency is computationally difficult, as is the case with the PD and MAP losses in subset ranking, we also show how to extend these surrogates to give algorithms satisfying weaker notions of consistency, including both consistency over restricted sets of probability distributions and an approximate form of consistency over the full probability space.

Second, we consider the practically important problem of hierarchical classification, where the labels to be predicted are organized in a tree hierarchy. We design a new family of convex calibrated surrogate losses for the associated tree-distance loss; these surrogates improve on the generic least squares surrogate in terms of ease of optimization and representation of the solution, and some surrogates in the family also operate on a significantly lower-dimensional space than the rank of the tree-distance loss matrix. These surrogates, which we term the ‘cascade’ family of surrogates, rely crucially on a new understanding we develop of the problem of multiclass classification with an abstain option, for which we construct new convex calibrated surrogates that are of independent interest in themselves. The resulting hierarchical classification algorithms outperform the current state of the art in terms of both accuracy and running time.

Finally, we go beyond loss-based multiclass performance measures and consider multiclass learning problems with more complex performance measures that are nonlinear functions of the confusion matrix and cannot be expressed using loss matrices; these include, for example, the multiclass G-mean measure used in class imbalance settings and the micro F1 measure often used in information retrieval applications. We take an optimization viewpoint for such settings, and give a Frank-Wolfe-type algorithm that is provably consistent for any complex performance measure that is a convex function of the entries of the confusion matrix (this includes the G-mean, but not the micro F1). The resulting algorithms outperform the state-of-the-art SVMPerf algorithm in terms of both accuracy and running time.

In conclusion, in this thesis we have developed a deep understanding of, and fundamental results in, the theory of supervised multiclass learning. These insights have allowed us to develop computationally efficient and statistically consistent algorithms for a variety of multiclass learning problems of practical interest, in many cases significantly outperforming the state-of-the-art algorithms for these problems.

List of Publications based on this Thesis

• Harish G. Ramaswamy and Shivani Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, 2012.

• Harish G. Ramaswamy, Shivani Agarwal, and Ambuj Tewari. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. In Advances in Neural Information Processing Systems, 2013.

• Harish G. Ramaswamy, Balaji S. Babu, Shivani Agarwal, and Robert C. Williamson. On the consistency of output code based learning algorithms for multiclass learning problems. In Proceedings of International Conference on Learning Theory, 2014.

• Harish G. Ramaswamy, Shivani Agarwal, and Ambuj Tewari. Convex calibrated surrogates for hierarchical classification. In Proceedings of International Conference on Machine Learning, 2015.

• Harikrishna Narasimhan*, Harish G. Ramaswamy*, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In Proceedings of International Conference on Machine Learning, 2015.

• Harish G. Ramaswamy and Shivani Agarwal. Convex calibration dimension for general multiclass losses. Accepted for publication pending minor revision, Journal of Machine Learning Research, 2015.

Contents

Abstract
General Notational Conventions
List of Symbols

1 Introduction
  1.1 Supervised Machine Learning and Consistency
  1.2 Past Work on Consistency
  1.3 Main Contributions
    1.3.1 Consistency and Calibration
    1.3.2 Application to Hierarchical Classification
    1.3.3 Consistency for Complex Multiclass Evaluation Metrics

2 Background
  2.1 Chapter Organization
  2.2 Standard Supervised Learning
  2.3 Multiclass Losses
  2.4 Consistent Algorithms
  2.5 Surrogate Minimizing Algorithms
  2.6 Calibrated Surrogates and Excess Risk Bounds

Part I: Consistency and Calibration

3 Conditions for Calibration
  3.1 Chapter Organization
  3.2 Calibration
  3.3 Trigger Probabilities and Positive Normals
    3.3.1 Trigger Probabilities of a Loss Function
    3.3.2 Positive Normals of a Surrogate
  3.4 Conditions for Calibration
    3.4.1 Necessary Conditions for Calibration
    3.4.2 Sufficient Condition for Calibration

4 Convex Calibration Dimension
  4.1 Chapter Organization
  4.2 Upper Bounds on CC Dimension
  4.3 Lower Bounds on CC Dimension
  4.4 Tightness of Bounds
  4.5 Applications in Subset Ranking
    4.5.1 Precision @ q
    4.5.2 Normalized Discounted Cumulative Gain (NDCG)
    4.5.3 Pairwise Disagreement (PD)
    4.5.4 Mean Average Precision (MAP)

5 Generic Rank Dimensional Calibrated Surrogates
  5.1 Chapter Organization
  5.2 Strongly Proper Composite Losses
  5.3 Generic Rank-Dimensional Calibrated Surrogate
  5.4 Generalized Tsybakov Conditions
  5.5 Example Applications in Ranking and Multilabel Prediction
    5.5.1 Subset Ranking
    5.5.2 Multilabel Prediction

6 Weak Notions of Consistency
  6.1 Chapter Organization
  6.2 Consistency Under Noise Conditions
    6.2.1 Pairwise Disagreement
      6.2.1.1 DAG Based Surrogate
      6.2.1.2 Score-Based Surrogates
    6.2.2 Mean Average Precision
  6.3 Approximate Consistency

Part II: Application to Hierarchical Classification

7 Multiclass Classification with an Abstain Option
  7.1 Chapter Organization
  7.2 Background
  7.3 Excess Risk Bounds for the CS Surrogate
  7.4 Excess Risk Bounds for the OVA Surrogate
  7.5 The BEP Surrogate
  7.6 BEP Surrogate Optimization Algorithm
  7.7 Extensions to Other Abstain Costs
  7.8 Experimental Results
    7.8.1 Synthetic Data
    7.8.2 Real Data

8 Hierarchical Classification
  8.1 Chapter Organization
  8.2 Preliminaries
  8.3 Bayes Optimal Classifier for the Tree-Distance Loss
  8.4 Cascade Surrogate for Hierarchical Classification
  8.5 OVA-Cascade Algorithm
  8.6 Experiments
    8.6.1 Datasets
    8.6.2 Algorithms
    8.6.3 Discussion of Results

Part III: Complex Multiclass Evaluation Metrics

9 Consistent Algorithms for Complex Multiclass Penalties
  9.1 Chapter Organization
  9.2 Complex Multiclass Penalties
  9.3 Consistency via Optimization
  9.4 The BFW Algorithm for Convex Penalties

10 Conclusions and Future Directions
  10.1 Summary
  10.2 Future Directions
    10.2.1 Consistency and Calibration
    10.2.2 Application to Hierarchical Classification
    10.2.3 Multiclass Complex Evaluation Metrics
  10.3 Comments

A Convexity

Bibliography

List of Figures

2.1 Various loss functions used in examples
2.2 Excess risk bound
3.1 Trigger probability sets for various losses, with n = 3
3.2 The binary hinge loss and its positive normals
3.3 The absolute difference surrogate and its positive normal sets
3.4 The ε-insensitive absolute difference surrogate and its positive normal sets
3.5 Positive normal sets for the Crammer-Singer surrogate
3.6 Visual proof of Theorem 3.7
4.1 Illustration of the feasible subspace dimension ν_Q(p)
5.1 Illustration of the ℓ-Tsybakov noise condition
6.1 Dominant label noise condition
7.1 Trigger probability sets for the abstain(α) loss
7.2 The partition of R² induced by pred_BEP
7.3 CS, OVA and BEP algorithms' performance on synthetic data
8.1 An example hierarchy in hierarchical classification
8.2 Illustration of Bayes optimal prediction for the tree-distance loss
9.1 Set of feasible confusion matrices

List of Tables

5.1 Strongly proper composite losses
7.1 Details of datasets used
7.2 Error percentages for CS, OVA and BEP at various abstain rates
7.3 Time taken by the CS, OVA and BEP algorithms
8.1 Dataset statistics
8.2 Average tree-distance loss for various algorithms and datasets
8.3 Training times for various algorithms


General Notational Conventions

Random variables are represented by upper case letters, such as X, Y. Vectors are denoted by lower case bold letters (both English and Greek), such as ℓ, ψ, v, u; the scalar components of a vector are denoted by the corresponding non-bold letter with the index as a subscript, so v_i denotes the ith component of the vector v. Matrices are denoted by upper case bold letters, such as A, B, L; as in the vector case, L_{y,t} denotes the (y, t)th element of the matrix L. Sets are denoted by upper case English letters in calligraphic font, such as S, C.

1(predicate) denotes the indicator function of a predicate: it takes the value 1 if the predicate is true and 0 otherwise. The expectation of a random quantity is denoted E(·); the random variable over which the expectation is taken is given as a subscript when it is not clear from the context. The probability of a random event is denoted P(·), with the same subscript convention.

For any pair of vectors u, v ∈ R^d, the inner product uᵀv = Σ_{i=1}^d u_i v_i is denoted ⟨u, v⟩. For a vector v, the 1-norm is denoted ||v||_1, the 2-norm ||v||_2 (or simply ||v||), and the infinity norm ||v||_∞. For any matrix A (resp. vector u), its transpose is denoted Aᵀ (resp. uᵀ).

For any pair of matrices A, B ∈ R^{d×d}, the inner product Trace(AᵀB) = Σ_{i=1}^d Σ_{j=1}^d A_{i,j} B_{i,j} is denoted ⟨A, B⟩. For a matrix A, the vectorized 1-norm is denoted ||A||_1 and the vectorized infinity norm ||A||_∞. The operator norm of A, i.e. its largest singular value, is denoted ||A||, and the nuclear norm, the sum of its singular values, is denoted ||A||_*.

The convergence of a sequence of random variables V_m to a value v in probability is denoted V_m →_P v.
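As a quick illustration of these conventions, here is a short numpy sketch of our own (not part of the thesis) computing the inner products and norms just defined:

```python
import numpy as np

u = np.array([1.0, -2.0, 3.0])
v = np.array([0.5, 1.0, -1.0])

# <u, v> = u^T v = sum_i u_i v_i
inner_uv = u @ v

# Vector norms: ||v||_1, ||v||_2 (or simply ||v||), ||v||_inf
one_norm = np.linalg.norm(v, 1)
two_norm = np.linalg.norm(v, 2)
inf_norm = np.linalg.norm(v, np.inf)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

# <A, B> = Trace(A^T B) = sum_{i,j} A_ij * B_ij (entrywise inner product)
inner_AB = np.trace(A.T @ B)

# Vectorized 1-norm and infinity norm; operator norm; nuclear norm
vec_one = np.abs(A).sum()
vec_inf = np.abs(A).max()
op_norm = np.linalg.norm(A, 2)        # largest singular value
nuc_norm = np.linalg.norm(A, 'nuc')   # sum of singular values
```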


List of Symbols

X    Instance space
Y    Label space
n    |Y|
[n]    The set {1, 2, ..., n}
Ŷ    Prediction space
k    |Ŷ|
[k]    The set {1, 2, ..., k}
D    Distribution over X × Y
M    Size of the training sample
S    Training sample ((x_1, y_1), ..., (x_M, y_M))
Π_r    Set of all bijections from [r] to [r]
μ    Marginal distribution of D over X
Δ_n    n-dimensional probability simplex {p ∈ R^n_+ : Σ_{y=1}^n p_y = 1}
p(x)    Conditional probability vector in Δ_n induced by D conditioned on X = x
ℓ and variants    Loss function over Y × Ŷ
L and variants    Loss matrix in R^{n×k}_+
ℓ_t    Vector in R^n for t ∈ Ŷ; the tth column of the loss matrix L
ψ and variants    Surrogate loss Y × C → R_+ for some C ⊆ R^d and d ∈ Z_+
C    Surrogate space of ψ
ψ(u)    Vector in R^n for u ∈ C, equal to [ψ(1, u), ..., ψ(n, u)]ᵀ
(a)_+    max(a, 0) for a ∈ R
er^ℓ_D[h]    ℓ-risk of a classifier h
er^ψ_D[f]    ψ-risk of a function f : X → C
pred    Predictor mapping from C to Ŷ
reg^ℓ_D[h]    The ℓ-excess-risk, or ℓ-regret, of a classifier h : X → Ŷ
reg^ψ_D[f]    The ψ-excess-risk, or ψ-regret, of a function f : X → C
R_ψ    ψ(C) ⊆ R^n_+
conv(R)    Convex hull of a set R
S_ψ    conv(R_ψ)
Q^ℓ_t    Trigger probability set of loss ℓ for t ∈ Ŷ
N_ψ(z)    Set of positive normals to ψ at z ∈ S_ψ
CCdim(ℓ)    Convex calibration dimension of loss ℓ
ν_H(p)    Feasible subspace dimension of H ⊆ R^d (d ∈ N) at a point p ∈ H
null(A)    Null space of matrix A
aff(A)    Affine hull of a set A ⊆ R^d (d ∈ N)
dim(A)    Dimension of the vector subspace A
affdim(A)    Dimension of the affine hull of the columns (rows) of matrix A
nullity(A)    dim(null(A))
1_d    All-ones vector in R^d
I_d    Identity matrix in R^{d×d}
e^d_a    Vector in R^d with [e^d_a]_i = 1(a = i)
sm(v)    min_i v_i; the smallest value in vector v
ssm(v)    min_{i : v_i > sm(v)} v_i; the second smallest value in vector v
u_(i)    The ith largest element among the components of a vector u

Chapter 1

Introduction

1.1 Supervised Machine Learning and Consistency

Supervised machine learning is broadly concerned with learning input-output mappings from empirical data. As a simple example motivating the importance of supervised learning, consider classifying images of handwritten letters into one of the 26 letters of the alphabet. While it is not simple for a human being to tell a computer what properties of an image make it correspond to (say) the character ‘d’, it is easy to provide many labeled examples of each letter. A supervised machine learning algorithm uses such examples as training data and returns a ‘model’, whose performance is measured by an evaluation metric appropriate to the type of problem.

A fundamental question in supervised machine learning is that of asymptotic optimality, or consistency. Informally, the question of consistency is:

Does a machine learning algorithm give the ‘best’ model in the limit of infinite data?

Consistency is a natural requirement of any machine learning algorithm, and computationally efficient consistent algorithms are highly desirable. While there have been many studies of consistent algorithms for machine learning problems such as binary classification, multiclass classification and ranking, the current understanding is far from complete even for these problems, and is even more so for general machine learning problems. In this thesis, we lay the foundations of a unified framework for studying consistency, thereby generalizing many known past results for specific learning problems as well as developing several new results.

This thesis focuses on machine learning problems where the learned classifier is required to output one class label from a finite set of class labels. This is a very general setting and includes most standard machine learning problems, such as binary classification, multiclass classification, multilabel classification, label ranking and subset ranking, as special cases. For such problems, the ‘best’ classifier in the question of consistency is determined by an evaluation metric, which gives a machine learning problem its defining characteristic.

For most of this thesis we consider the case where the evaluation metric is given by a loss matrix. These are the most prevalent evaluation metrics in supervised learning, and include many standard metrics used in the problems mentioned above, for example the zero-one loss in multiclass classification, the Hamming loss in multilabel classification, and the NDCG loss in label ranking.
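To make the loss-matrix view of an evaluation metric concrete, here is a small illustrative sketch of our own (the thesis's formal definitions come later): the multiclass 0-1 loss written as a matrix, a Hamming loss matrix for labels identified with bit vectors, and the expected loss of a prediction under a label distribution.

```python
import numpy as np

# Multiclass 0-1 loss as an n x n matrix: L[y, t] = 1 if t != y, else 0.
def zero_one_loss_matrix(n):
    return 1.0 - np.eye(n)

# Hamming loss for multilabel problems with b binary tags: identify each
# label with its bit vector; L[y, t] = number of bits on which y and t differ.
def hamming_loss_matrix(b):
    n = 2 ** b
    bits = (np.arange(n)[:, None] >> np.arange(b)) & 1   # n x b bit table
    return np.abs(bits[:, None, :] - bits[None, :, :]).sum(axis=2).astype(float)

# Expected loss of predicting t when the true label Y ~ p: sum_y p[y] * L[y, t].
def expected_loss(L, p, t):
    return float(p @ L[:, t])

L01 = zero_one_loss_matrix(3)
p = np.array([0.5, 0.3, 0.2])
best_t = int(np.argmin([expected_loss(L01, p, t) for t in range(3)]))
```

For the 0-1 loss, the expected-loss-minimizing prediction is simply the most probable class under p; richer loss matrices shift this optimum, which is part of what makes designing consistent algorithms for general loss matrices nontrivial.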

The space of machine learning algorithms is vast, and characterizing and designing general algorithms is a rather difficult task. However, a large majority of machine learning algorithms fall into a broad category known as surrogate minimizing algorithms, in which the returned classifier is obtained by applying a predictor or decoder to the solution of an optimization problem whose objective is characterized by a surrogate loss. When the surrogate is convex, the resulting optimization problem is convex and can be solved efficiently. For example, the binary SVM is a surrogate minimizing algorithm which returns a classifier by applying the sign decoder/predictor to a minimizer of the convex hinge surrogate loss. Surrogate minimizing algorithms are characterized by the surrogate and the predictor; if such an algorithm is consistent for an evaluation metric given by a certain loss matrix, the surrogate is said to be calibrated with respect to that loss matrix. Most of this thesis focuses on such surrogates and predictors. In particular, we build a framework to study and design such surrogate-predictor pairs, and apply the results to several specific loss matrices that demonstrate the utility of the framework.
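As a toy instance of this template (our own sketch, not code from the thesis), one can minimize the average hinge surrogate of a linear score on ±1-labeled one-dimensional data by subgradient descent, and then apply the sign predictor to the learned score to obtain the classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data: x in R, label y in {-1, +1}, with y correlated with sign(x).
X = rng.normal(size=200)
Y = np.where(X + 0.3 * rng.normal(size=200) > 0, 1, -1)

# Hinge surrogate psi(y, f(x)) = max(0, 1 - y * f(x)) with linear f(x) = w*x + b.
def avg_hinge(w, b):
    return np.maximum(0.0, 1.0 - Y * (w * X + b)).mean()

# Subgradient descent on the average surrogate loss.
w, b = 0.0, 0.0
for _ in range(500):
    margin = Y * (w * X + b)
    active = margin < 1.0                 # points with a nonzero subgradient
    gw = -(Y * X * active).mean()
    gb = -(Y * active).mean()
    w, b = w - 0.1 * gw, b - 0.1 * gb

# The returned classifier applies the sign predictor to the learned score.
predict = lambda x: np.sign(w * x + b)
train_acc = (predict(X) == Y).mean()
```

The two ingredients named in the text are visible here: the convex surrogate being minimized (hinge) and the predictor (sign) mapping the optimized score back to a label.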


Towards the end of this thesis, we consider evaluation metrics more general than those based on loss matrices, such as the F-measure used in information retrieval and the harmonic-mean measure used in multiclass classification with class imbalance. We study and design consistent algorithms for a large family of such evaluation metrics as well.

1.2 Past Work on Consistency

The earliest known works on consistency of supervised machine learning algorithms concerned the binary (number of classes n = 2) classification problem with the classical nearest neighbour method. Cover and Hart [26] showed the approximate consistency of the 1-nearest neighbour method in binary classification, and Stone [96] showed the consistency of the k-nearest neighbours method (with k increasing with sample size). More recently, over the last decade, the topic of consistent surrogate minimizing algorithms has attracted great interest.

Initial work on consistency of surrogate minimizing algorithms focused largely on binary classification. For example, Steinwart [94] showed the consistency of support vector machines with universal kernels for the problem of binary classification; Jiang [52] and Lugosi and Vayatis [66] showed similar results for boosting methods. Bartlett et al. [7] and Zhang [115] studied the calibration of margin-based surrogates for binary classification. In particular, in their seminal work, Bartlett et al. [7] established that the property of ‘classification calibration’ of a surrogate loss is equivalent to its minimization yielding 0-1 consistency, and gave a simple necessary and sufficient condition for margin-based surrogates to be calibrated w.r.t. the binary 0-1 loss. More recently, Reid and Williamson [84] analyzed the calibration of a general family of surrogates, termed proper composite surrogates, for binary classification. Variants of standard 0-1 binary classification have also been studied; for example, Bartlett and Wegkamp [6], Grandvalet et al. [47], and Yuan and Wegkamp [114] studied consistency for the problem of binary classification with a reject option, and Scott [90] studied calibrated surrogates for cost-sensitive binary classification.

Over the years, there has been significant interest in extending the understanding of consistency and calibrated surrogates to various multiclass (number of classes n > 2) learning problems. Early work in this direction, pioneered by Zhang [116] and Tewari and Bartlett [100], considered mainly the multiclass classification problem with the 0-1 loss. They generalized the framework of Bartlett et al. [7] to this setting and used their results to study the calibration w.r.t. the 0-1 loss of various surrogates proposed for multiclass classification, such as those of Weston and Watkins [109], Crammer and Singer [27], and Lee et al. [64]. In particular, while the multiclass surrogate of Lee et al. [64] was shown to be calibrated w.r.t. the multiclass 0-1 loss, several other widely used multiclass surrogates were shown to be not calibrated w.r.t. it.

More recently, there has been much work on studying consistency and calibration for various other learning problems that also involve finite label and prediction spaces. For example, Gao and Zhou [43] studied consistency and calibration for multilabel prediction with the Hamming loss. Another prominent class of learning problems for which consistency and calibration have been studied recently is subset ranking, where instances contain queries together with sets of documents, and the goal is to learn a prediction model that, given such an instance, ranks the documents by relevance to the query. Various subset ranking losses have been investigated in recent years. Cossock and Zhang [25] studied subset ranking with the discounted cumulative gain (DCG) ranking loss, and gave a simple surrogate calibrated w.r.t. this loss; Ravikumar et al. [83] further studied subset ranking with the normalized DCG (NDCG) loss. Xia et al. [111] considered the 0-1 loss applied to permutations. Duchi et al. [34] focused on subset ranking with the pairwise disagreement (PD) loss, and showed that several popular convex score-based surrogates used for this problem are in fact not calibrated w.r.t. this loss; they also conjectured that such surrogates may not exist. Calauzenes et al. [17] showed conclusively that there do not exist any convex score-based surrogates that are calibrated w.r.t. the PD loss, or w.r.t. the mean average precision (MAP) or expected reciprocal rank (ERR) losses. In a more general study of subset ranking losses, Buffoni et al. [11] introduced the notion of ‘standardization’ and gave a way to construct convex calibrated score-based surrogates for subset ranking losses that can be standardized; they showed that while the DCG and NDCG losses can be standardized, the MAP and ERR losses cannot.


In the related but different context of instance ranking, several papers have effectively shown that one can obtain consistent algorithms for instance ranking by minimizing strictly proper composite surrogates [2, 22, 23, 60].

Steinwart [95] considered consistency and calibration in a very general setting. More recently, Pires et al. [77] used Steinwart's techniques to obtain surrogate regret bounds for certain surrogates w.r.t. general multiclass losses.

There has also been increasing interest in designing consistent algorithms for more complex evaluation metrics than the simple loss matrix based evaluation metrics. Ye et al.

[113] studied consistency for the binary F-measure. Menon et al. [69] analyzed the balanced error rate evaluation metric in binary classification, and showed that simple plug-in

methods based on empirical balancing are consistent. Koyejo et al. [61] and Narasimhan

et al. [71] considered consistency for more general complex evaluation metrics in the binary setting, and showed that simple conditional probability estimation techniques along

with an appropriate threshold selection strategy yield consistent algorithms.

1.3 Main Contributions

In this thesis, we provide a framework for analyzing and designing consistent algorithms

for general multiclass learning problems. Our main contributions can be divided into

three parts and are outlined below.

1.3.1 Consistency and Calibration

Consistency of surrogate minimizing algorithms w.r.t. a loss matrix essentially reduces

to calibration of the surrogate w.r.t. the loss matrix. In the first part of the thesis, we

give several results on calibration for a general learning problem given by an arbitrary

loss matrix. This is in contrast to most past work, which gives results on calibration for a particular learning problem/loss matrix. We also demonstrate the applicability of these

results by instantiating them to various specific loss matrices of practical interest. This

part of the thesis can be further divided into the following three sections.


Conditions for Calibration

The question

“When is a given surrogate calibrated w.r.t. a given loss matrix?”

has been studied for specific loss matrices, like the 0-1 loss in binary and multiclass

classification [7, 100] and the pairwise disagreement and NDCG loss in ranking [34, 83].

We answer this question for a general loss matrix, by giving necessary conditions and

sufficient conditions for calibration [79].

We define a property of the loss matrix known as trigger probability sets, which indicate the optimal prediction to make for a given instance. Analogous to the trigger probabilities

of a loss matrix, one can define positive normals [100] of a surrogate. We give necessary

conditions and sufficient conditions for calibration of the surrogate w.r.t. the loss matrix

based on the trigger probabilities of the loss matrix and positive normals of the surrogate.

This is covered in Chapter 3 of the thesis.

Convex Calibration Dimension

A natural question to ask is whether some learning problems are ‘easier’ than others; in other words,

What is the difficulty of attaining consistency (using surrogate minimizing algorithms)

for the learning problem given by loss matrix `?

We give an answer to this question by defining a quantity called the convex calibration

dimension, and demonstrate its implications in some practical applications [79].

The surrogate minimizing algorithm for any surrogate calibrated w.r.t. a given loss matrix

` yields a consistent algorithm, but for the surrogate minimization to be done efficiently

we need the surrogate to be convex. Also, a very basic measure of complexity of the

surrogate minimizing algorithm is given by what is called the dimension of the surrogate.


In particular, optimizing a surrogate with dimension d requires computing d real-valued functions over the instance space X. Hence, the smallest d such that there exists a convex `-calibrated surrogate with surrogate dimension d is a natural notion measuring

the intrinsic difficulty of designing convex `-calibrated surrogates. We call this the convex

calibration dimension of the loss matrix.

We give lower bounds for this quantity based on a geometric property of the trigger probability sets of the loss matrix, and an upper bound based on the linear algebraic rank

of the loss matrix. We apply these bounds to several label/subset ranking losses such

as normalized discounted cumulative gain (NDCG), mean average precision (MAP) and

pairwise disagreement (PD) and obtain a variety of interesting existence and impossibility

results.

This is covered in Chapter 4 of the thesis.

Generic Rank-Dimensional Calibrated Surrogates

A natural question that arises from the study of convex calibration dimension is:

Can one construct an explicit convex `-calibrated surrogate and predictor meeting the

rank upper bound on the convex calibration dimension of `?

We show that we can indeed do so, and give an excess risk bound relating the rate at

which the classifier approaches the best classifier to the rate at which the surrogate is

being optimized. Under an appropriate setting, the surrogate given takes the form of

a least-squares style surrogate, with the predictor simply corresponding to a discrete

optimization problem [80, 81].

We apply this surrogate and predictor to several ranking and multilabel prediction losses

which have large label and prediction spaces, but a much smaller rank. In some cases this

yields efficient surrogates and predictors, but in some cases like the PD loss and MAP loss

in ranking it gives an efficient surrogate but a complicated predictor, thus precluding an

overall efficient algorithm. In such cases, a natural question to consider is the following:


Can the notion of consistency be relaxed in some way to make the resulting algorithm

computationally efficient?

We answer the above question in the affirmative by considering two weak notions of

consistency namely consistency under noise conditions and approximate consistency, and

show that in many cases including the PD and MAP losses, one can get efficient surrogates

and predictors, if the requirements of consistency are relaxed to one of these weak notions

of consistency.

This is covered in Chapters 5 and 6 of the thesis.

1.3.2 Application to Hierarchical Classification

In the second part of the thesis, we consider the application of the framework of calibration

to a particular family of loss matrices that arise in the learning problem of hierarchical

classification. As an intermediate step to doing so, we study the problem of multiclass

classification with an abstain option, which is also of some independent interest.

Multiclass Classification with an Abstain Option

In some practical applications like medical diagnosis, the learning problem is essentially

classification, but with the added constraint that predictions be made only if the predictor

is confident. We call this problem multiclass classification with an abstain option. A

natural loss matrix for such a problem is the abstain loss, which is similar to the multiclass

0-1 loss, but has an additional option of abstaining from predicting any class, in which

case it incurs a fixed penalty. A natural question to ask here is the following:

Are there efficient convex calibrated surrogates for the problem of classification with an

abstain option where the performance is evaluated using the abstain loss?

We answer the above question affirmatively by constructing several convex calibrated

surrogates and predictors, leading to SVM-like training algorithms.


We show that some standard surrogates used in multiclass classification like the Crammer-Singer surrogate [27] and one-vs-all hinge surrogate [86] are calibrated w.r.t. the abstain

loss using a modified version of the argmax predictor. We also give a novel convex calibrated surrogate operating in log2(n) dimensions for the n-class problem, called the binary

encoded predictions surrogate. We demonstrate the efficacy of the resulting algorithms

on some benchmark multiclass datasets.

This is covered in Chapter 7 of the thesis.

Calibrated Surrogates for Tree Distance Loss

Hierarchical classification is an important learning problem in which there is a pre-defined hierarchy over the class labels; it has been the subject of many studies [5, 16, 46, 106].

A natural loss matrix in this case is simply based on the tree distance between the class

labels.

Despite the importance and popularity of hierarchical classification, the following question

has not been studied in past work.

Are there efficient convex calibrated surrogates for the problem of hierarchical

classification with the tree distance loss?

We answer this question positively [82], by constructing a family of efficient convex calibrated surrogates for the tree distance loss.

We show that the optimal classifier for the tree distance loss is the classifier which predicts

the deepest node whose sub-tree has a conditional probability greater than half. Based

on this observation, we show that consistency w.r.t. the tree distance loss in hierarchical classification can be achieved by reducing the problem to ‘depth of tree’ many sub-problems, in each of which one is required to solve a multiclass classification problem

with an abstain option.

Using the convex calibrated surrogates for the abstain loss constructed earlier as a black

box routine, we design new convex calibrated surrogates for the tree distance loss. One


such surrogate, whose surrogate minimization procedure simply requires solving multiple

binary SVM problems, also gives superior empirical performance on several benchmark

hierarchical classification datasets.

This is covered in Chapter 8 of the thesis.

1.3.3 Consistency for Complex Multiclass Evaluation Metrics

So far, we have considered learning problems with loss matrix based evaluation metrics

and consistent algorithms for such learning problems. In the third and final part of the

thesis, we consider learning problems with more complicated evaluation metrics like the

Fβ-measure in binary classification that cannot be expressed via a loss matrix.

The evaluation metrics we consider are based on a general penalty function operating on

the confusion matrix of a classifier. In particular, loss matrix based evaluation metrics

correspond to using a linear penalty function. For other penalty functions we get other

interesting evaluation metrics like the harmonic-mean measure, geometric mean measure,

and quadratic mean measure used in multiclass and binary problems with class imbalance; the Fβ measures used in information retrieval; and the min-max measure used in

hypothesis testing [56, 58, 63, 65, 98, 104, 108]. The notion of consistency is very much

relevant for such evaluation metrics as well.

A natural question then is the following:

Can one construct efficient consistent algorithms for such complex evaluation metrics

given by an arbitrary penalty function?

While this question has been studied for the special case of binary classification [61, 71], it

remains unanswered for multiclass problems. We answer this question in the affirmative

for a large family of such complex multiclass evaluation metrics [72], by constructing

consistent algorithms.

We make the crucial observation that finding the best classifier for such complex evaluation metrics (which is an infinite dimensional optimization problem) is equivalent to


optimizing the penalty function over the set of feasible confusion matrices (a finite dimensional optimization problem).

However, the set of feasible confusion matrices is a set for which membership and separation oracles are difficult to construct, but linear minimization oracles are easy to construct. Hence, standard optimization methods such as projected gradient descent are not practical, but the Frank-Wolfe algorithm is a viable option. We adapt the Frank-Wolfe algorithm for this problem, and show that the resulting algorithm is consistent for

complex evaluation metrics for which the corresponding penalty function is convex.
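The Frank-Wolfe idea can be sketched in a few lines. The following toy example is a hedged illustration, not the algorithm of Chapter 9: the probability simplex stands in for the set of feasible confusion matrices, and the quadratic penalty and target point are made up for this example.

```python
import numpy as np

# Hedged sketch of the Frank-Wolfe scheme: minimize a convex function over a
# convex set accessed only through a linear minimization oracle.
def lmo_simplex(g):
    # argmin over the simplex of <g, z> is a vertex: one-hot at the smallest coordinate
    z = np.zeros_like(g)
    z[np.argmin(g)] = 1.0
    return z

target = np.array([0.2, 0.3, 0.5])              # made-up optimum inside the set
f = lambda z: float(np.sum((z - target) ** 2))  # convex penalty (toy)
grad = lambda z: 2.0 * (z - target)

z = np.array([1.0, 0.0, 0.0])                   # any feasible starting point
for k in range(1, 2000):
    s = lmo_simplex(grad(z))                    # one oracle call per iteration
    gamma = 2.0 / (k + 2)                       # standard Frank-Wolfe step size
    z = (1 - gamma) * z + gamma * s             # convex combination stays feasible

assert f(z) < 0.01                              # iterate approaches the constrained optimum
```

Note that every iterate is a convex combination of feasible points, so feasibility is maintained without any projection step.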

This is covered in Chapter 9 of the thesis.

Chapter 2

Background

This chapter provides the necessary background and preliminaries on which the thesis is

based.

2.1 Chapter Organization

We briefly describe the standard supervised learning setup and give examples of several

supervised learning tasks in Section 2.2. We deal with evaluation metrics used in multiclass supervised learning, and give some example evaluation metrics appropriate for the

example supervised learning tasks, in Section 2.3. We introduce the crucial notion of

consistency in supervised machine learning algorithms in Section 2.4. We then describe

a popular class of supervised learning algorithms known as surrogate minimization algorithms in Section 2.5, and briefly analyse what it means for such algorithms to have the

property of consistency in Section 2.6.

2.2 Standard Supervised Learning

This section describes the standard multiclass supervised learning setting under which

the thesis operates.


Chapter 2. Background 13

There is an instance space X and a finite set of labels Y called the label space, and a distribution D over X × Y from which a set of training samples S = {(x1, y1), . . . , (xM , yM)} is drawn in an i.i.d. manner. One wishes to use these training samples to learn a function h from X to a finite prediction space Ŷ. In many cases the prediction space Ŷ is the same as the label space Y, but there are many cases where they are different as well. Let integers n and k be such that |Y| = n and |Ŷ| = k. Some examples are given below.

Example 2.1 (Tumour detection). Consider the task of tumor detection in MRI images,

where we have X as the set of all MRI images, and Y contains two elements denoting the

absence or presence of the tumor, typically denoted by +1 and −1. Each data point (x, y)

in the training set is such that x is an MRI image, and y takes one of two possible values

indicating whether there is a tumor or not in image x. In this problem we have Ŷ = Y,

and n = k = 2. The function to be learned simply predicts whether or not a tumor exists

in the given image. This type of learning problem is called a binary classification problem.

Example 2.2 (Document classification). Consider the task of classifying a newspaper

article into one of politics or sports or business. Here we have X as the set of all

documents, and Y as a three element set given by the three labels mentioned. Each data

point (x, y) in the training set is such that x is a document, and y is one of the three

labels indicating the class of document x. The prediction space Ŷ is the same as the label

space Y, and n = k = 3. This type of learning problem is called a multiclass classification

problem.

Example 2.3 (Movie rating prediction). Consider the task of predicting a movie rating

for a user from her history of ratings. We have X as the set of all movies, and Y contains

the possible ratings that can be given to a movie. Let the rating system be a 5 star system

in which case Y contains five elements from 1 star to 5 stars. Each data point (x, y) in

the training set is such that x is a movie, and y is the star rating given to the movie by

the user. The prediction space Ŷ is again the same as the label space Y, with n = k = 5.

Due to a natural ordering in the prediction and label spaces, this type of learning problem

is called an ordinal regression problem.

Example 2.4 (Medical diagnosis). Consider the problem of medical diagnosis where given

a collection of symptoms and test results (call it a case file) one has to diagnose the illness.

For simplicity assume the patient has only one of three possible conditions. Here we have


X as the set of all possible case files, and Y as the three element set representing the three

possible conditions. Each data point (x, y) in the training set is such that x is a case file

of a patient, and y is the true condition. In this case one might want a classifier that

gives one of the three diagnoses when it is confident and responds with a ‘don’t know’

when it is not confident. The right way to achieve this is to use a prediction space Ŷ that is different from the label space Y. The prediction space Ŷ contains the three elements in Y, and also a special symbol denoting an ‘abstain’ option, and hence n = 3, k = 4.

This type of learning problem is called a multiclass classification problem with an abstain

option.

Example 2.5 (Image tagging). Consider the problem of tagging images with one or more

tags from a fixed finite set, say, sky, road, tree, people, and water. Here we have

X as the set of all images, and Y as the set of all possible subsets of the 5 tags. Each

data point (x, y) in the training set is such that x is an image, and y is a 5-dimensional vector in {0, 1}^5 denoting the presence or absence of the appropriate tag. The prediction space Ŷ is the same as the label space Y, and hence we have n = k = 2^5 = 32. This type of problem is called a sequence prediction problem or a multi-label prediction problem.

Example 2.6 (Label ranking). Consider a problem where for a given document one has

to rank a fixed set of tags, say, politics, sport, business, science and culture,

according to relevance to the document, with each point in the training data containing

the set of relevant tags for each document. Here we have X as the set of all documents,

and Y as the set of all possible subsets of the 5 tags. As the problem requires us to rank

the tags, we have that the prediction space Ŷ is the set of all permutations of the 5 tags. Hence we have n = |Y| = 2^5 = 32 and k = |Ŷ| = 5! = 120. This is another example problem where the label space Y and the prediction space Ŷ are distinct. This type of

problem is called a label ranking problem.

2.3 Multiclass Losses

The key aspect of the machine learning problem of finding a classifier h : X→Ŷ is the performance measure used for evaluating the returned classifier h. This section gives details on how performance is evaluated in standard supervised learning problems.


The most prevalent way of evaluating performance in the standard supervised learning setting is via a loss function ` : Y × Ŷ→R+. The interpretation of the loss function is that `(y, t) gives the loss incurred by predicting t when the truth is y. Given a classifier h : X→Ŷ and a loss function `, the `-risk of the classifier h is simply the expected loss incurred on a new example (x, y) drawn from D:

er^`_D[h] = E_{(X,Y)∼D}[`(Y, h(X))] .
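To make the definition concrete, the following small illustration (with a made-up finite distribution over (x, y) pairs) computes the `-risk as a probability-weighted sum of losses:

```python
import numpy as np

# The l-risk as an explicit expectation over a made-up finite distribution D
# on (x, y) pairs, with the binary 0-1 loss.
L01 = 1.0 - np.eye(2)                           # rows: true y, columns: prediction t
D = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
h = {0: 0, 1: 1}                                # classifier: predict x's own index

risk = sum(prob * L01[y, h[x]] for (x, y), prob in D.items())
assert abs(risk - 0.3) < 1e-12                  # mass of the pairs on which h errs
```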

Most of this thesis will focus on such evaluation metrics.¹ The objective of a learning algorithm is simply to use the training set S to return a classifier h with a small `-risk.

Given below are some loss functions and the problems in which they are commonly used.

Example applications of these problems can be found in Examples 2.1–2.6.

Example 2.7 (Binary zero-one loss – Binary classification). Let Ŷ = Y with |Y| = 2. The problem of binary classification typically uses the simple binary 0-1 loss `0-1 : Y × Ŷ→R+ defined as

`0-1(y, t) = 1(y ≠ t) .

Example 2.8 (Multiclass zero-one loss – Multiclass classification). Let Ŷ = Y with |Y| = n, n > 2. The problem of multiclass classification typically uses a generalization of the binary 0-1 loss, `0-1 : Y × Ŷ→R+, defined as

`0-1(y, t) = 1(y ≠ t) .

Example 2.9 (Absolute difference loss – Ordinal regression). Let Ŷ = Y = {1, 2, . . . , n}. The problem of ordinal regression typically uses the absolute difference loss `abs : Y × Ŷ→R+ given by

`abs(y, t) = |y − t| .

Example 2.10 (Abstain loss – Multiclass classification with an abstain option). Let |Y| = n and Ŷ = Y ∪ {⊥}. The special symbol ⊥ denotes the option of the classifier abstaining from prediction. An appropriate evaluation metric here is the so-called abstain loss `? : Y × Ŷ→R+ defined as

`?(y, t) =  0  if y = t
            α  if t = ⊥
            1  otherwise ,

where α ∈ [0, 1] simply gives the cost of abstaining.

¹Chapter 9 considers a more general way of evaluating the performance of a classifier h, the details of which are given in the same chapter.
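As a quick illustration (a sketch with made-up conditional probabilities), the abstain loss can be written as an n × (n + 1) matrix, and the optimal prediction recovered by minimizing 〈p, `_t〉 over its columns:

```python
import numpy as np

# Sketch: the abstain loss as an n x (n+1) matrix; the last column is 'abstain'.
def abstain_loss_matrix(n, alpha):
    L = 1.0 - np.eye(n)                    # 0-1 loss on the n real classes
    return np.hstack([L, np.full((n, 1), alpha)])

L = abstain_loss_matrix(3, alpha=0.5)

# Confident conditional probability: predicting the likely class is optimal.
assert int(np.argmin(np.array([0.8, 0.1, 0.1]) @ L)) == 0

# No class is likely enough: abstaining (column index 3) is optimal.
assert int(np.argmin(np.array([0.4, 0.3, 0.3]) @ L)) == 3
```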

Example 2.11 (Hamming loss – Sequence prediction). Let Ŷ = Y = {0, 1}^r, where r ∈ Z+ is the number of elements in the sequence. The problem of sequence prediction typically uses the simple Hamming loss, which adds the losses over all the elements in the sequence. The Hamming loss `Ham : Y × Ŷ→R+ is given as

`Ham(y, t) = Σ_{i=1}^{r} 1(y_i ≠ t_i) .
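For instance, the full Hamming loss matrix for a given r can be generated mechanically; the r = 2 case reproduces the 4 × 4 matrix shown in Figure 2.1(e):

```python
import numpy as np
from itertools import product

# Sketch: generate the Hamming loss matrix over all 0/1 sequences of length r.
def hamming_loss_matrix(r):
    seqs = list(product([0, 1], repeat=r))   # label space {0,1}^r, in 00, 01, ... order
    L = np.zeros((len(seqs), len(seqs)))
    for i, y in enumerate(seqs):
        for j, t in enumerate(seqs):
            L[i, j] = sum(a != b for a, b in zip(y, t))
    return L

L = hamming_loss_matrix(2)
assert L.shape == (4, 4)
assert L[0, 3] == 2             # 00 vs 11: both positions differ
assert np.allclose(L, L.T)      # the Hamming loss is symmetric
```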

Example 2.12 (Precision@q loss – Label ranking). Let Y = {0, 1}^r and Ŷ = Πr, where r is the number of objects to be ranked and Πr is the set of all permutations over [r]. The problem of label ranking has many popular performance measures in practice. For the sake of illustration, we consider the Precision@q loss. Let 1 ≤ q ≤ r be an integer. The Precision@q loss `P@q : Y × Ŷ→R+ is given as

`P@q(y, σ) = 1 − (1/q) Σ_{i=1}^{q} y_{σ^{−1}(i)} ,

where σ(i) denotes the position of object i under permutation σ ∈ Πr.
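A small sketch of this loss in code (using a 0-indexed convention for positions, which is an implementation choice made here):

```python
import numpy as np

# Sketch of the Precision@q loss: sigma[i] = position of object i, so
# argsort(sigma)[j] = object occupying position j.
def precision_at_q_loss(y, sigma, q):
    top_q = np.argsort(sigma)[:q]             # objects in the first q positions
    return 1.0 - float(np.mean(y[top_q]))     # 1 - fraction of them that are relevant

y = np.array([0, 1, 1])                       # objects 1 and 2 are relevant
sigma = np.array([2, 0, 1])                   # object 0 is ranked last
assert precision_at_q_loss(y, sigma, q=2) == 0.0
assert precision_at_q_loss(np.array([1, 0, 0]), sigma, q=1) == 1.0
```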

The example loss functions in Examples 2.7–2.12 are illustrated in Figure 2.1.

As can be seen in the examples above, different machine learning problems and their corresponding loss functions use a variety of different finite label and prediction spaces Y and Ŷ. For simplicity, we shall use Y = [n] = {1, 2, . . . , n} and Ŷ = [k] = {1, 2, . . . , k} in our results unless explicitly mentioned otherwise. This does not affect the generality of these results, as any finite Y and Ŷ can be identified with [n] and [k] respectively. We will also often find it convenient to represent the loss function by a matrix L ∈ R^{n×k}_+, called the loss matrix, with L_{y,t} = `(y, t). As ` and L both represent the same object, we shall use the terms loss function and loss matrix interchangeably.


(a) `0-1:
       1  2
   1   0  1
   2   1  0

(b) `0-1:
       1  2  3
   1   0  1  1
   2   1  0  1
   3   1  1  0

(c) `abs:
       1  2  3
   1   0  1  2
   2   1  0  1
   3   2  1  0

(d) `?:
       1    2    3    ⊥
   1   0    1    1    1/2
   2   1    0    1    1/2
   3   1    1    0    1/2

(e) `Ham:
        00  01  10  11
   00    0   1   1   2
   01    1   0   2   1
   10    1   2   0   1
   11    2   1   1   0

(f) `P@q:
         123  132  213  231  312  321
   000    1    1    1    1    1    1
   001    1    1    1    0    1    0
   010    1    1    0    1    0    1
   011    1    1    0    0    0    0
   100    0    0    1    1    1    1
   101    0    0    1    0    1    0
   110    0    0    0    1    0    1
   111    0    0    0    0    0    0

Figure 2.1: Loss functions corresponding to Examples 2.7–2.12, with rows representing the class labels (first argument) and columns representing predictions (second argument). (a) Binary 0-1 loss. (b) 3-class 0-1 loss. (c) Absolute difference loss with n = 3. (d) Abstain loss with n = 3 and α = 1/2. (e) Hamming loss with sequence length r = 2, and hence n = 4. (f) Precision@q loss with r = 3 and q = 1.

2.4 Consistent Algorithms

Given a loss function ` : Y × Ŷ→R+, we seek a classifier with small `-risk. For any distribution D, the smallest possible `-risk over all classifiers is called the Bayes `-risk er^{`,*}_D:

er^{`,*}_D = inf_{h:X→Ŷ} er^`_D[h] .

One can easily show that there always exists a classifier which achieves the Bayes `-risk

– such a classifier is called an `-Bayes classifier. Before we show this, and construct

an `-Bayes classifier, we will define some useful quantities.

Let ∆n = {p ∈ R^n_+ : Σ_{y=1}^{n} p_y = 1} be the set of probability distributions over [n]. Let µ be the marginal of D over X. For any x ∈ X, let p(x) ∈ ∆n denote the conditional probability of Y given X = x. For each t ∈ Ŷ, let `_t ∈ R^n_+ be such that `_t = [`(1, t), . . . , `(n, t)]^⊤, i.e. `_t ∈ R^n_+ gives the t-th column of the loss matrix L.


We have that

er^{`,*}_D = inf_{h:X→Ŷ} er^`_D[h]
           = inf_{h:X→Ŷ} E_{(X,Y)∼D}[`(Y, h(X))]
           = E_{X∼µ} min_{t∈Ŷ} E_{Y∼p(X)}[`(Y, t)]
           = E_{X∼µ} min_{t∈Ŷ} 〈p(X), `_t〉 .

Thus, it immediately follows that any classifier h* such that h*(x) ∈ argmin_{t∈Ŷ} 〈p(x), `_t〉 for all x ∈ X is an `-Bayes classifier.
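Since the `-Bayes prediction at x is a finite minimization over the columns of the loss matrix, it is directly computable whenever p(x) is known; the following sketch (with made-up probabilities) illustrates this:

```python
import numpy as np

# Sketch: the l-Bayes prediction h*(x) = argmin_t <p(x), l_t>, computed by
# scanning the columns of the loss matrix L.
def bayes_predict(p, L):
    return int(np.argmin(p @ L))   # <p, l_t> for each column t

# 0-1 loss: the Bayes prediction is the most probable class.
L01 = 1.0 - np.eye(3)
assert bayes_predict(np.array([0.2, 0.5, 0.3]), L01) == 1

# Absolute difference loss: the Bayes prediction is a median-like class,
# which need not be a most probable class.
Labs = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
assert bayes_predict(np.array([0.45, 0.1, 0.45]), Labs) == 1
```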

An algorithm that takes a training sample S ∈ (X × Y)^M drawn i.i.d. from D and returns a classifier h_M (which is a random variable depending on S) is said to be consistent w.r.t. `, or simply `-consistent, if, as M approaches ∞,

er^`_D[h_M] →_P er^{`,*}_D ,

where →_P denotes convergence in probability.

Ideally one would like an algorithm to directly minimize the `-risk over the space of

classifiers, thus ensuring a consistent algorithm. There are two obstacles to doing so.

Firstly, the learning algorithm does not have access to the distribution D and has only

access to M samples drawn i.i.d. from D. However, this can be handled by viewing the

empirical distribution induced by S as the true distribution, and minimizing the `-risk

over an appropriate function class whose complexity increases with M – this is the well-known empirical risk minimization approach, which we call the `-ERM algorithm. Note that directly

minimizing the `-risk over the space of all classifiers for the empirical distribution would

result in overfitting for any finite M . The second obstacle is computational in nature.

Due to the intrinsically discrete nature of (any subset of) the space of classifiers from X to Ŷ, minimizing the empirical `-risk is in general a computationally hard problem.

Hence we need to look beyond simple algorithms that minimize the `-risk directly.


2.5 Surrogate Minimizing Algorithms

A learning algorithm is formally a mapping from the set of training samples ∪_{m=1}^∞ (X × Y)^m to the set of classifiers Ŷ^X. A large majority of popular algorithms for multiclass learning

problems are from a special class of learning algorithms known as surrogate minimizing

algorithms, which are characterized simply by a ‘surrogate loss’. This section gives details

on such algorithms.

Let C ⊆ R^d for some integer d ∈ Z+. Let ψ : Y × C→R+ be the surrogate loss. We will

refer to d as the surrogate dimension of ψ and C as the surrogate space of ψ.

In a similar fashion to the `-risk of a classifier h : X→Ŷ, the ψ-risk of a function f : X→C is defined as

er^ψ_D[f] = E_{(X,Y)∼D}[ψ(Y, f(X))] .

The smallest possible ψ-risk is called the Bayes ψ-risk er^{ψ,*}_D:

er^{ψ,*}_D = inf_{f:X→C} er^ψ_D[f]
           = inf_{f:X→C} E_{(X,Y)∼D}[ψ(Y, f(X))]
           = E_{X∼µ}[ inf_{u∈C} 〈p(X), ψ(u)〉 ] ,

where ψ(u) = [ψ(1, u), . . . , ψ(n, u)]^⊤. Viewing ψ as a function from C to R^n_+, one can construct two sets that are interesting and useful objects of study:

Rψ = ψ(C) ⊆ R^n_+
Sψ = conv(Rψ) ⊆ R^n_+ ,

where conv(R) denotes the convex hull of a set R.


Clearly, the Bayes ψ-risk can then also be written as

er^{ψ,*}_D = E_{X∼µ}[ inf_{z∈Rψ} 〈p(X), z〉 ] = E_{X∼µ}[ inf_{z∈Sψ} 〈p(X), z〉 ] .

The objective of a surrogate minimizing algorithm is to find a function f : X→C, whose

ψ-risk is as small as possible. Once again we face two issues – access to the distribution

D only through the samples S, and computational difficulties. The first difficulty can be

overcome as before by using the empirical distribution, leading to the empirical surrogate

risk minimization called the ψ-ERM or simply the surrogate ERM-algorithm. The second

issue can be overcome by designing ψ to be convex. We give details of both below.

Given a training sample S = {(x1, y1), . . . , (xM , yM)} and a class of functions F_M ⊆ {f : X→C}, the ψ-ERM algorithm simply returns f*_M given by

f*_M ∈ argmin_{f∈F_M} (1/M) Σ_{i=1}^{M} ψ(yi, f(xi)) .
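As a concrete, purely illustrative instance of ψ-ERM (not an algorithm from this thesis), the following sketch minimizes the convex logistic surrogate ψ(y, u) = log(1 + e^{−yu}) over linear scorers by gradient descent; the data, surrogate, and step size are all choices made for this example:

```python
import numpy as np

# Illustrative psi-ERM: the convex logistic surrogate, minimized over linear
# scorers f(x) = <w, x> by gradient descent on the empirical surrogate risk.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])           # synthetic labels in {-1, +1}

w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    # gradient of (1/M) * sum_i log(1 + exp(-y_i <w, x_i>))
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

train_err = float(np.mean(np.sign(X @ w) != y))   # close to 0 on this easy data
```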

One can show, using standard uniform convergence type arguments, that for an appropriate sequence of function classes F_M we have er^ψ_D[f*_M] →_P er^{ψ,*}_D; such an algorithm is called consistent w.r.t. ψ, or ψ-consistent.

Unlike the case of `-ERM, the ψ-ERM is a continuous optimization problem; therefore, if ψ is convex in its second argument,² then with appropriate function classes F_M the ψ-ERM algorithm simply requires a convex optimization problem to be solved, which can be done efficiently [10]. As an aside, we observe that the surrogate dimension d of ψ plays a crucial role in deciding the computational difficulty of the corresponding ψ-ERM. A surrogate minimizing algorithm using a surrogate with dimension d requires d functions from X to R to be learned, and hence both computational and memory requirements increase with d.

The result of a ψ-ERM algorithm is a function f* from X to C. However, the learning algorithm must return a function from X to Ŷ. This is addressed by simply using a predictor mapping pred : C→Ŷ, and returning the classifier given by pred ∘ f*. We give two simple examples below.

²We will sometimes omit the term ‘in its second argument’ and simply say ψ is a convex surrogate.

Example 2.13 (Binary SVM for binary classification). Let Ŷ = Y = {+1, −1}. The SVM (support vector machine) algorithm is a surrogate minimizing algorithm with the surrogate ψ_H : {+1, −1} × R→R+ being the so-called hinge loss:

ψ_H(+1, u) = (1 − u)+
ψ_H(−1, u) = (1 + u)+ ,

where (a)+ = max(a, 0). As can be seen, the surrogate space of ψ_H is C = R, and the surrogate dimension is d = 1. The surrogate-ERM in this case returns a function f* from X to R, the predictor pred of choice is the sign function, and thus the classifier returned by the SVM algorithm is simply sign ∘ f*.
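The hinge surrogate and sign predictor can be written out directly; a minimal sketch:

```python
import numpy as np

# The hinge surrogate of Example 2.13 and its sign predictor.
def psi_H(y, u):
    # (1 - y*u)_+ covers both cases: psi(+1, u) = (1-u)_+, psi(-1, u) = (1+u)_+
    return max(1.0 - y * u, 0.0)

assert psi_H(+1, 2.0) == 0.0    # confidently correct: no surrogate loss
assert psi_H(+1, 0.0) == 1.0    # on the decision boundary
assert psi_H(-1, 0.5) == 1.5    # wrong side: penalized linearly

pred = np.sign                  # the classifier returned is sign(f*(x))
assert pred(0.5) == 1.0 and pred(-2.0) == -1.0
```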

Example 2.14 (Crammer-Singer SVM for multiclass classification). Let Ŷ = Y = [n] with n > 2. The Crammer-Singer SVM [27] algorithm is a surrogate minimizing algorithm, with the surrogate being a generalization of the hinge loss. The surrogate ψ_CS : Y × R^n→R+ is given by

ψ_CS(y, u) = max_{i≠y} (1 + u_i − u_y)+ .

As can be seen, the surrogate space of ψ_CS is C = R^n, and the surrogate dimension is d = n. The surrogate-ERM in this case returns a function f* from X to R^n, the predictor pred of choice is the argmax function, and thus the classifier returned by the algorithm is simply argmax ∘ f*.
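A small numerical sketch of this surrogate and the argmax predictor (with the max taken over i ≠ y, so the surrogate is zero when class y wins by a margin of at least 1):

```python
import numpy as np

# Sketch of the Crammer-Singer surrogate, with the max taken over i != y.
def psi_CS(y, u):
    margins = 1.0 + np.asarray(u, dtype=float) - u[y]
    margins[y] = 0.0                       # exclude i = y from the max
    return max(float(margins.max()), 0.0)

u = np.array([3.0, 1.0, 0.5])
assert psi_CS(0, u) == 0.0                 # class 0 wins with margin > 1
assert psi_CS(1, u) == 3.0                 # 1 + u_0 - u_1
assert int(np.argmax(u)) == 0              # the argmax predictor
```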

2.6 Calibrated Surrogates and Excess Risk Bounds

This section lays the groundwork for answering the following crucial question –

What surrogate minimizing algorithms are consistent w.r.t. a given loss function ` ?

The surrogate minimizing algorithm is characterized by the surrogate ψ and the predictor

pred and does not depend on `. Hence the surrogate ψ and predictor pred must somehow


capture the crucial qualities of the loss function `. In particular, ψ and pred must be such that, for any sequence of vector functions f_M : X→C,

lim_{M→∞} er^ψ_D[f_M] = er^{ψ,*}_D   implies   lim_{M→∞} er^`_D[pred ∘ f_M] = er^{`,*}_D .

Such a pair (ψ, pred) is said to be calibrated w.r.t. ` or simply `-calibrated.

Sometimes it is more convenient to work with the ψ- and `-regrets, or excess risks, than with risks directly. The `-regret reg^`_D[h] of a classifier h : X→Ŷ, and the ψ-regret reg^ψ_D[f] of a function f : X→C, are defined as

reg^`_D[h] = er^`_D[h] − inf_{h′:X→Ŷ} er^`_D[h′]
reg^ψ_D[f] = er^ψ_D[f] − inf_{f′:X→C} er^ψ_D[f′] .

Another quantity of interest is the conditional regret, i.e. the regret for a prediction on a single instance given the conditional probability. The conditional `-regret reg^`_p(t) and conditional ψ-regret reg^ψ_p(u), for a conditional probability vector p ∈ ∆n, prediction t ∈ Ŷ and vector u ∈ C, are defined as

reg^`_p(t) = 〈p, `_t〉 − inf_{t′∈Ŷ} 〈p, `_{t′}〉
reg^ψ_p(u) = 〈p, ψ(u)〉 − inf_{u′∈C} 〈p, ψ(u′)〉 .

It can be easily seen that

reg^`_D[h] = E_{X∼µ} reg^`_{p(X)}(h(X))
reg^ψ_D[f] = E_{X∼µ} reg^ψ_{p(X)}(f(X)) .
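Conditional regrets are directly computable from p and the loss matrix; a minimal sketch with made-up probabilities:

```python
import numpy as np

# Sketch: conditional l-regret reg_p(t) = <p, l_t> - min over t' of <p, l_t'>.
def conditional_regret(p, L, t):
    risks = p @ L               # expected loss of each prediction under p
    return risks[t] - risks.min()

L01 = 1.0 - np.eye(3)
p = np.array([0.5, 0.3, 0.2])
assert conditional_regret(p, L01, 0) == 0.0               # the Bayes prediction
assert abs(conditional_regret(p, L01, 1) - 0.2) < 1e-12   # 0.7 - 0.5
```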

One can show that the surrogate and predictor (ψ, pred) are `-calibrated if and only if there exists a function ξ : R+→R+ such that ξ(0) = 0, ξ is continuous at 0, and for all f : X→C,

reg^`_D[pred ∘ f] ≤ ξ(reg^ψ_D[f]) .


[Figure: horizontal axis reg^ψ_D[f]; vertical axis reg^`_D[pred ∘ f], bounded above by the curve ξ(reg^ψ_D[f]).]

Figure 2.2: Example illustrating the feasible `-regret and ψ-regret values for a surrogate and predictor (ψ, pred) satisfying an excess risk bound.

Such bounds are called excess risk bounds, and an illustration is given in Figure 2.2.

If a (ψ, pred) satisfies such an excess risk bound, it immediately gives a way to convert

a ψ-consistent algorithm to an `-consistent algorithm. As noted in Section 2.5, ψ-ERM

algorithms (implemented with suitable function classes F_M) are ψ-consistent, and are efficiently implementable if ψ is convex. Thus, the major goal in most of this thesis will be

to construct convex calibrated surrogates for various loss matrices of interest; in fact we

will start by developing general tools that can be used to design such surrogates for any

loss matrix `.

Part I

Consistency and Calibration

Chapter 3

Conditions for Calibration

In this chapter we describe in detail the framework of calibration, and give general conditions for a surrogate loss to be calibrated w.r.t. a target loss. These results significantly

generalize previous results, which have focused on specific classes of loss matrices.

3.1 Chapter Organization

We begin by defining a general notion of calibration applicable to an arbitrary multiclass loss matrix in Section 3.2. We then define a crucial property of the loss matrix, its trigger probability sets, and a crucial property of the surrogate, its positive normals, in Section 3.3. We go on to give necessary conditions and sufficient conditions for a surrogate to be calibrated w.r.t. a loss matrix, based on the trigger probabilities of the loss matrix and the positive normals of the surrogate, in Section 3.4.

3.2 Calibration

In this section, we give a formal definition of calibration that generalizes the definitions

of Bartlett et al. [7], Tewari and Bartlett [100], Zhang [116].


Definition 3.1 (`-calibration). Let ` : Y × Y→R_+. Let ψ : Y × C→R_+ and pred : C→Y. (ψ, pred) is said to be `-calibrated if

∀p ∈ ∆n : inf_{u∈C : pred(u)∉argmin_t〈p,`t〉} 〈p, ψ(u)〉 > inf_{u∈C} 〈p, ψ(u)〉 .

Also, ψ is said to be `-calibrated, if there exists a pred : C→Y such that (ψ, pred) is

`-calibrated.
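For intuition, the condition in Definition 3.1 can be checked numerically in a simple binary case. The sketch below is illustrative only (not part of the thesis): it uses the binary hinge surrogate ψ(1, u) = (1 − u)_+, ψ(2, u) = (1 + u)_+ with pred(u) = 1 if u ≥ 0 and 2 otherwise, approximates the infima over C = R by a grid, and ignores the tie at p1 = 1/2 for simplicity.

```python
# Numerical illustration of Definition 3.1 for the binary hinge surrogate
# psi(1, u) = (1-u)_+, psi(2, u) = (1+u)_+ with pred(u) = 1 if u >= 0 else 2.
# For each p we compare the psi-risk inf over all u with the inf restricted to
# u whose prediction is suboptimal for the 0-1 loss; calibration requires a
# strictly positive gap.

def hinge_risk(p1, u):
    return p1 * max(1 - u, 0.0) + (1 - p1) * max(1 + u, 0.0)

def calibration_gap(p1, step=0.001):
    us = [-2 + step * i for i in range(int(4 / step) + 1)]
    full_inf = min(hinge_risk(p1, u) for u in us)
    best_t = 1 if p1 >= 0.5 else 2            # argmin_t <p, ell_t> for 0-1 loss
    bad = [u for u in us if (1 if u >= 0 else 2) != best_t]
    return min(hinge_risk(p1, u) for u in bad) - full_inf

# For p1 = 0.7 the unrestricted inf is 2*min(p1, 1-p1) = 0.6 (attained at u = 1),
# while predicting class 2 forces u < 0 and a strictly larger psi-risk.
```

The gap stays bounded away from 0 for p1 away from 1/2, which is exactly the content of the calibration condition for this pair (ψ, pred).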

Another equivalent definition of calibration, which is natural in some situations and generalizes the definition in Tewari and Bartlett [100], is given in the lemma below.

Lemma 3.1. Let ` : Y × Y→R_+. Let ψ : Y × C→R_+. Then ψ is `-calibrated iff there exists pred′ : Sψ→Y such that

∀p ∈ ∆n : inf_{z∈Sψ : pred′(z)∉argmin_t〈p,`t〉} 〈p, z〉 > inf_{z∈Sψ} 〈p, z〉 .

Proof. We will show that ∃ pred : C→Y satisfying the condition in Definition 3.1 if and

only if ∃ pred′ : Sψ→Y satisfying the stated condition.

(‘if’ direction) First, suppose ∃ pred′ : Sψ→Y such that

∀p ∈ ∆n : inf_{z∈Sψ : pred′(z)∉argmin_t〈p,`t〉} 〈p, z〉 > inf_{z∈Sψ} 〈p, z〉 .

Define pred : C→Y as follows:

pred(u) = pred′(ψ(u)) ∀u ∈ C .

Then for all p ∈ ∆n, we have

inf_{u∈C : pred(u)∉argmin_t〈p,`t〉} 〈p, ψ(u)〉 = inf_{z∈Rψ : pred′(z)∉argmin_t〈p,`t〉} 〈p, z〉

≥ inf_{z∈Sψ : pred′(z)∉argmin_t〈p,`t〉} 〈p, z〉

> inf_{z∈Sψ} 〈p, z〉

= inf_{u∈C} 〈p, ψ(u)〉 .

Thus ψ is `-calibrated.


(‘only if’ direction) Conversely, suppose ψ is `-calibrated, so that ∃ pred : C→Y such that

∀p ∈ ∆n : inf_{u∈C : pred(u)∉argmin_t〈p,`t〉} 〈p, ψ(u)〉 > inf_{u∈C} 〈p, ψ(u)〉 .

By Caratheodory's theorem (e.g. see [8]), every z ∈ Sψ can be expressed as a convex combination of at most n + 1 points in Rψ, i.e. for every z ∈ Sψ, ∃α ∈ ∆_{n+1}, u_1, . . . , u_{n+1} ∈ C such that z = ∑_{j=1}^{n+1} α_j ψ(u_j); w.l.o.g., we can assume α_1 ≥ 1/(n+1). For each z ∈ Sψ, arbitrarily fix one such convex combination, i.e. fix α^z ∈ ∆_{n+1} and u^z_1, . . . , u^z_{n+1} ∈ C with α^z_1 ≥ 1/(n+1) such that

z = ∑_{j=1}^{n+1} α^z_j ψ(u^z_j) .

Now, define pred′ : Sψ→Y as follows:

pred′(z) = pred(u^z_1) ∀z ∈ Sψ .

Then for any p ∈ ∆n, we have

inf_{z∈Sψ : pred′(z)∉argmin_t〈p,`t〉} 〈p, z〉 = inf_{z∈Sψ : pred(u^z_1)∉argmin_t〈p,`t〉} ∑_{j=1}^{n+1} α^z_j 〈p, ψ(u^z_j)〉

≥ inf_{α∈∆_{n+1}, u_1,...,u_{n+1}∈C : α_1≥1/(n+1), pred(u_1)∉argmin_t〈p,`t〉} ∑_{j=1}^{n+1} α_j 〈p, ψ(u_j)〉

≥ inf_{α∈∆_{n+1} : α_1≥1/(n+1)} ∑_{j=1}^{n+1} inf_{u_j∈C : pred(u_1)∉argmin_t〈p,`t〉} α_j 〈p, ψ(u_j)〉

≥ inf_{α_1∈[1/(n+1), 1]} [ α_1 inf_{u∈C : pred(u)∉argmin_t〈p,`t〉} 〈p, ψ(u)〉 + (1 − α_1) inf_{u∈C} 〈p, ψ(u)〉 ]

> inf_{u∈C} 〈p, ψ(u)〉

= inf_{z∈Sψ} 〈p, z〉 .

Thus pred′ satisfies the stated condition.

Both the above definitions can be shown to be equivalent to the one mentioned in Section 2.6, which essentially states that if ψ is `-calibrated, then a ψ-consistent algorithm can be converted to an `-consistent algorithm.

Theorem 3.2. Let ` : Y × Y→R_+. Let ψ : Y × C→R_+ and pred : C→Y. (ψ, pred) is `-calibrated iff for all distributions D on X × [n] and all sequences of (random) vector functions f_m : X→C, we have that

er^ψ_D[f_m] → er^{ψ,*}_D in probability implies er^`_D[pred f_m] → er^{`,*}_D in probability.

The proof is similar to that for the multiclass 0-1 loss given by Tewari and Bartlett [100]. Before we give the proof, we state two lemmas; the proof of the first can be found in Tewari and Bartlett [100], and the second follows directly from Lemma 3.1.

Lemma 3.3. The map p ↦ inf_{z∈Sψ} 〈p, z〉 is continuous over ∆n.

Lemma 3.4. Let ` : Y × Y→R_+. A surrogate ψ : Y × C→R_+ is `-calibrated if and only if there exists a function pred′ : Sψ→Y such that the following holds: for all p ∈ ∆n and all sequences z_m in Sψ such that lim_{m→∞} 〈p, z_m〉 = inf_{z∈Sψ} 〈p, z〉, we have 〈p, `_{pred′(z_m)}〉 = min_{t∈Y} 〈p, `t〉 for all large enough m.

Proof. (Proof of Theorem 3.2)

(‘only if’ direction)

Let (ψ, pred) be `-calibrated. Then by Lemma 3.1, ∃ pred′ : Sψ→Y such that

∀p ∈ ∆n : inf_{z∈Sψ : pred′(z)∉argmin_t〈p,`t〉} 〈p, z〉 > inf_{z∈Sψ} 〈p, z〉 .

Further, w.l.o.g. we may take pred′ such that pred(u) = pred′(ψ(u)) for all u ∈ C (for points z ∈ Rψ, one can choose the trivial convex combination z = ψ(u) in the construction of pred′ in Lemma 3.1).

Now, for each ε > 0, define

H(ε) = inf_{p∈∆n, z∈Sψ : 〈p,`_{pred′(z)}〉 − min_{t∈Y}〈p,`t〉 ≥ ε} ( 〈p, z〉 − inf_{z′∈Sψ} 〈p, z′〉 ) .


We claim that H(ε) > 0 ∀ε > 0. Assume for the sake of contradiction that ∃ε > 0 for which H(ε) = 0. Then there must exist a sequence (p_m, z_m) in ∆n × Sψ such that

〈p_m, `_{pred′(z_m)}〉 − min_{t∈Y} 〈p_m, `t〉 ≥ ε ∀m (3.1)

and

〈p_m, z_m〉 − inf_{z∈Sψ} 〈p_m, z〉 → 0 . (3.2)

Since the p_m come from a compact set, we can choose a convergent subsequence (which we still call p_m), say with limit p. Then by Lemma 3.3, we have inf_{z∈Sψ} 〈p_m, z〉 → inf_{z∈Sψ} 〈p, z〉, and therefore by Equation (3.2), we get

〈p_m, z_m〉 → inf_{z∈Sψ} 〈p, z〉 .

Now we show that z_m is a sequence such that 〈p, z_m〉 → inf_{z∈Sψ} 〈p, z〉. Without loss of generality, we assume that the first a coordinates of p are non-zero and the rest are zero. Hence the first a coordinates of z_m are bounded for sufficiently large m, and we have

lim sup_m 〈p, z_m〉 = lim sup_m ∑_{y=1}^{a} p_{m,y} z_{m,y} ≤ lim_{m→∞} 〈p_m, z_m〉 = inf_{z∈Sψ} 〈p, z〉 .

By Lemma 3.4, we therefore have 〈p, `_{pred′(z_m)}〉 = min_{t∈Y} 〈p, `t〉 for all large enough m, which contradicts Equation (3.1) as p_m converges to p. Thus we must have H(ε) > 0 ∀ε > 0. From Zhang [116], it then follows that there exists a concave, non-decreasing function ξ : R_+→R_+, continuous at 0 with ξ(0) = 0, satisfying the following for all u ∈ C, p ∈ ∆n:

reg^`_p(pred′(ψ(u))) = reg^`_p(pred(u)) ≤ ξ(reg^ψ_p(u)) .

By Jensen's inequality, we then have, for all f : X→C and all distributions D over X × Y,

reg^`_D[pred f] ≤ ξ(reg^ψ_D[f]) .

Thus any sequence of random vector functions f_m such that er^ψ_D[f_m] → er^{ψ,*}_D in probability satisfies er^`_D[pred f_m] → er^{`,*}_D in probability.


(‘if’ direction)

Conversely, suppose ψ is not `-calibrated. Consider any pred : C→Y. Then ∃p ∈ ∆n such that

inf_{u∈C : pred(u)∉argmin_t〈p,`t〉} 〈p, ψ(u)〉 = inf_{u∈C} 〈p, ψ(u)〉 .

In particular, this means there exists a sequence of points u_m in C such that

pred(u_m) ∉ argmin_t 〈p, `t〉 ∀m

and

〈p, ψ(u_m)〉 → inf_{u∈C} 〈p, ψ(u)〉 .

Now consider a data distribution D = µ × D_{Y|X} on X × [n], with µ a point mass at some x ∈ X and D_{Y|X=x} = p. Let f_m : X→C be any sequence of functions satisfying f_m(x) = u_m ∀m. Then we have

er^ψ_D[f_m] = 〈p, ψ(u_m)〉 ; er^{ψ,*}_D = inf_{u∈C} 〈p, ψ(u)〉

and

er^`_D[pred f_m] = 〈p, `_{pred(u_m)}〉 ; er^{`,*}_D = min_t 〈p, `t〉 .

This gives

er^ψ_D[f_m] → er^{ψ,*}_D

but

er^`_D[pred f_m] ↛ er^{`,*}_D .

This completes the proof.

We also have that calibration is equivalent to the existence of an excess risk bound; this is formalized in Proposition 3.5. In some of our results, we simply show calibration via either Definition 3.1 or Lemma 3.1. In other cases, it is possible to derive explicit excess risk bounds, and we do so whenever possible rather than simply showing calibration, as it gives a better understanding of the relation between the surrogate and the true loss.

Proposition 3.5. Let ` : Y × Y→R+. Let ψ : Y × C→R+ and pred : C→Y.


a. Let ξ : R_+→R_+ be a concave non-decreasing function such that ξ(0) = 0 and for all distributions D and all f : X→C we have

reg^`_D[pred f] ≤ ξ(reg^ψ_D[f]) .

Then (ψ, pred) is `-calibrated.

b. Let (ψ, pred) be `-calibrated. Then there exists a concave non-decreasing function ξ : R_+→R_+ such that ξ(0) = 0 and for all distributions D and all f : X→C we have

reg^`_D[pred f] ≤ ξ(reg^ψ_D[f]) .

Proof. Part a.

Fix p ∈ ∆n. Letting D be the distribution whose marginal over X is concentrated at a single point x, and whose conditional distribution satisfies p(x) = p, we have for all u ∈ C

reg^`_p(pred(u)) ≤ ξ(reg^ψ_p(u)) .

As the above holds for all p ∈ ∆n, we have that (ψ, pred) is `-calibrated.

Part b.

This follows from the ‘only if’ direction of the proof of Theorem 3.2.

3.3 Trigger Probabilities and Positive Normals

Our goal is to study conditions under which a surrogate loss ψ : Y×C→R+ is `-calibrated

for a target loss function ` : Y×Y→R+. To this end, we will now define certain properties

of both multiclass loss functions ` and multiclass surrogates ψ that will be useful in

relating the two. Specifically, we will define trigger probability sets associated with a

multiclass loss function `, and positive normal sets associated with a multiclass surrogate


ψ; in Section 3.4 we will use these to obtain both necessary and sufficient conditions for

calibration.

3.3.1 Trigger Probabilities of a Loss Function

Definition 3.2 (Trigger probability sets). Let ` : Y × Y→R_+. For each t ∈ Y, the trigger probability set of ` at t is defined as

Q`_t = { p ∈ ∆n : 〈p, `t − `t′〉 ≤ 0 ∀t′ ∈ Y } = { p ∈ ∆n : t ∈ argmin_{t′∈Y} 〈p, `t′〉 } .

In words, the trigger probability set Q`t is the set of class probability vectors for which

predicting t is optimal in terms of minimizing `-risk. Such sets have also been studied

by Lambert and Shoham [62] and O’Brien et al. [73] in a different context. Lambert

and Shoham [62] show that these sets form what is called a power diagram, which is

a generalization of the Voronoi diagram. Trigger probability sets for the 0-1, absolute

difference, and abstain loss matrices (described in Examples 2.8, 2.9 and 2.10) are calculated in Examples 3.1, 3.2 and 3.3, and are illustrated in Figure 3.1.
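Membership of a given p in a trigger probability set can be checked directly from Definition 3.2, since Y is finite. The following minimal sketch (an illustration, not from the thesis; predictions are indexed from 0) does exactly this, shown on the 3-class abstain loss whose last column is the abstain option with cost 1/2:

```python
# Trigger probability sets via Definition 3.2: p lies in Q_t iff t minimizes
# <p, ell_t'> over t'. The sets Q_t then cover the simplex, overlapping only
# on shared boundaries.

def risk(p, loss_col):
    return sum(pi * li for pi, li in zip(p, loss_col))

def in_trigger_set(p, loss_cols, t, tol=1e-12):
    """Membership of p in Q_t for the loss whose columns are loss_cols."""
    return risk(p, loss_cols[t]) <= min(risk(p, c) for c in loss_cols) + tol

# 3-class abstain loss: predict 1, 2, 3, or abstain at cost 1/2 (last column).
L_abstain = [[0, 1, 1], [1, 0, 1], [1, 1, 0], [0.5, 0.5, 0.5]]
```

For instance, p = (0.4, 0.35, 0.25) has max(p) < 1/2, so it lies in the abstain cell, while p = (0.6, 0.2, 0.2) lies in the cell of the first class; this matches the analytic descriptions derived in Example 3.3 below.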

Example 3.1 (Trigger probabilities for the multiclass zero-one loss). Consider the 3-class zero-one loss with the loss matrix as in Figure 2.1b. We have

`0-1_1 = (0, 1, 1)^⊤ ; `0-1_2 = (1, 0, 1)^⊤ ; `0-1_3 = (1, 1, 0)^⊤ .

Q`0-1_1 = { p ∈ ∆3 : 〈p, `1〉 ≤ 〈p, `2〉, 〈p, `1〉 ≤ 〈p, `3〉 }
= { p ∈ ∆3 : p2 + p3 ≤ p1 + p3, p2 + p3 ≤ p1 + p2 }
= { p ∈ ∆3 : p1 ≥ max(p2, p3) } .

By symmetry,

Q`0-1_2 = { p ∈ ∆3 : p2 ≥ max(p1, p3) } and Q`0-1_3 = { p ∈ ∆3 : p3 ≥ max(p1, p2) } .


See Figure 3.1a for an illustration of the trigger probabilities.

Example 3.2 (Trigger probabilities for the absolute difference loss). Consider the three-class absolute difference loss with the loss matrix as in Figure 2.1c. We have

`abs_1 = (0, 1, 2)^⊤ ; `abs_2 = (1, 0, 1)^⊤ ; `abs_3 = (2, 1, 0)^⊤ .

Q`abs_1 = { p ∈ ∆3 : 〈p, `1〉 ≤ 〈p, `2〉, 〈p, `1〉 ≤ 〈p, `3〉 }
= { p ∈ ∆3 : p2 + 2p3 ≤ p1 + p3, p2 + 2p3 ≤ 2p1 + p2 }
= { p ∈ ∆3 : p1 ≥ 1/2 } .

By symmetry,

Q`abs_3 = { p ∈ ∆3 : p3 ≥ 1/2 } .

Finally,

Q`abs_2 = { p ∈ ∆3 : 〈p, `2〉 ≤ 〈p, `1〉, 〈p, `2〉 ≤ 〈p, `3〉 }
= { p ∈ ∆3 : p1 + p3 ≤ p2 + 2p3, p1 + p3 ≤ 2p1 + p2 }
= { p ∈ ∆3 : p1 ≤ p2 + p3, p3 ≤ p1 + p2 }
= { p ∈ ∆3 : p1 ≤ 1/2, p3 ≤ 1/2 } .

See Figure 3.1b for an illustration of the trigger probabilities.

Example 3.3 (Trigger probabilities for the abstain loss). Consider the three-class abstain loss with the loss matrix as in Figure 2.1d. We have

`(?)_1 = (0, 1, 1)^⊤ ; `(?)_2 = (1, 0, 1)^⊤ ; `(?)_3 = (1, 1, 0)^⊤ ; `(?)_⊥ = (1/2, 1/2, 1/2)^⊤ .

Figure 3.1: Trigger probability sets for various losses, with n = 3: (a) zero-one loss `0-1, (b) absolute difference loss `abs, (c) abstain loss `(?). See Examples 3.1, 3.2 and 3.3 for details.

Q`(?)_1 = { p ∈ ∆3 : 〈p, `1〉 ≤ 〈p, `2〉, 〈p, `1〉 ≤ 〈p, `3〉, 〈p, `1〉 ≤ 〈p, `⊥〉 }
= { p ∈ ∆3 : p2 + p3 ≤ p1 + p3, p2 + p3 ≤ p1 + p2, p2 + p3 ≤ (p1 + p2 + p3)/2 }
= { p ∈ ∆3 : p2 ≤ p1, p3 ≤ p1, p2 + p3 ≤ 1/2 }
= { p ∈ ∆3 : p1 ≥ 1/2 } .

By symmetry,

Q`(?)_2 = { p ∈ ∆3 : p2 ≥ 1/2 } and Q`(?)_3 = { p ∈ ∆3 : p3 ≥ 1/2 } .

Finally,

Q`(?)_⊥ = { p ∈ ∆3 : 〈p, `⊥〉 ≤ 〈p, `1〉, 〈p, `⊥〉 ≤ 〈p, `2〉, 〈p, `⊥〉 ≤ 〈p, `3〉 }
= { p ∈ ∆3 : (p1 + p2 + p3)/2 ≤ min(p2 + p3, p1 + p3, p1 + p2) }
= { p ∈ ∆3 : 1/2 ≤ 1 − max(p1, p2, p3) }
= { p ∈ ∆3 : max(p1, p2, p3) ≤ 1/2 } .

See Figure 3.1c for an illustration of the trigger probabilities.

Figure 3.2: The hinge loss, and an illustration of its ‘image set’ Sψ along with the construction of positive normals at some points.

3.3.2 Positive Normals of a Surrogate

Definition 3.3 (Positive normal set at a point). Let ψ : Y × C→R_+. For each point z ∈ Sψ, the positive normal set of ψ at z is defined as¹

N ψ(z) = { p ∈ ∆n : 〈p, z − z′〉 ≤ 0 ∀z′ ∈ Sψ } = { p ∈ ∆n : 〈p, z〉 = inf_{z′∈Sψ} 〈p, z′〉 } .

For any sequence of points z_m in Sψ, the positive normal set of ψ at the sequence is defined as²

N ψ(z_m) = { p ∈ ∆n : lim_{m→∞} 〈p, z_m〉 = inf_{z′∈Sψ} 〈p, z′〉 } .

In words, the positive normal set N ψ(z) at a point z = ψ(u) ∈ Rψ is the set of class

probability vectors for which predicting u is optimal in terms of minimizing ψ-risk. Such

sets were also studied by Tewari and Bartlett [100]. The extension to sequences of points

in Sψ is needed for technical reasons in some of our proofs. Note that for N ψ(z_m) to be well-defined, the sequence z_m need not itself converge; however, if the sequence z_m does converge to some point z ∈ Sψ, then N ψ(z_m) = N ψ(z).

A simple example for illustrating the positive normals is given below.

¹ For points z in the interior of Sψ, N ψ(z) is empty.
² For sequences z_m for which lim_{m→∞}〈p, z_m〉 does not exist for any p, N ψ(z_m) is empty.


Example 3.4 (Positive normals of the binary hinge loss). Let Y = Y = {1, 2}. Consider the hinge loss ψ : Y × R→R_+ defined as

ψ(y, u) = 1(y = 1) · (1 − u)+ + 1(y = 2) · (1 + u)+ .

A graph of the hinge loss and an illustration of the construction of positive normals is given in Figure 3.2. In particular, setting z2 = (0, 2)^⊤ and z4 = (2, 0)^⊤, we have that

N ψ(z2) = { p ∈ ∆2 : p1 ≥ p2 }

N ψ(z4) = { p ∈ ∆2 : p2 ≥ p1 } .
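Definition 3.3 can be checked numerically in this example: p lies in N ψ(z) for z = ψ(u0) iff 〈p, z〉 attains the infimum of 〈p, ψ(u)〉 over u. The sketch below (illustrative only, not from the thesis) approximates that infimum by a grid over u:

```python
# Checking Definition 3.3 numerically for the binary hinge surrogate:
# p lies in N(z) for z = psi(u0) iff <p, z> equals inf_u <p, psi(u)>.

def psi(u):
    return (max(1 - u, 0.0), max(1 + u, 0.0))   # (psi(1, u), psi(2, u))

def inner(p, z):
    return p[0] * z[0] + p[1] * z[1]

def in_positive_normals(p, u0, step=0.001, tol=1e-6):
    us = [-3 + step * i for i in range(int(6 / step) + 1)]
    inf_risk = min(inner(p, psi(u)) for u in us)
    return inner(p, psi(u0)) <= inf_risk + tol

# z2 = psi(1) = (0, 2) is psi-risk optimal exactly when p1 >= p2,
# matching N(z2) in the example above.
```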

The trigger probabilities of a loss ` can be computed directly from the definition, owing to the finiteness of Y; this is not the case for the positive normals. Below, we give a method to compute the positive normals of certain types of surrogates at a given point.

Specifically, we give an explicit method for computing N ψ(z) for convex surrogate losses

ψ operating on a convex surrogate space C ⊆ Rd, at points z = ψ(u) ∈ Rψ for which

the subdifferential ∂ψy(u) for each y ∈ [n] can be described as the convex hull of a finite

number of points in Rd; this is particularly applicable for piecewise linear surrogates.

Lemma 3.6. Let C ⊆ R^d be a convex set and let ψ : Y × C→R_+ be convex. Let z = ψ(u) for some u ∈ C such that ∀y ∈ [n], the subdifferential of ψy at u can be written as

∂ψy(u) = conv({w^y_1, . . . , w^y_{s_y}})

for some s_y ∈ Z_+ and w^y_1, . . . , w^y_{s_y} ∈ R^d. Let s = ∑_{y=1}^n s_y, and let

A = [w^1_1 . . . w^1_{s_1}  w^2_1 . . . w^2_{s_2}  . . .  w^n_1 . . . w^n_{s_n}] ∈ R^{d×s} ; B = [b_{y,j}] ∈ R^{n×s} ,

where b_{y,j} is 1 if the j-th column of A came from {w^y_1, . . . , w^y_{s_y}} and 0 otherwise. Then

N ψ(z) = { p ∈ ∆n : p = Bq for some q ∈ null(A) ∩ ∆s } ,

where null(A) ⊆ R^s denotes the null space of the matrix A.


Proof. For all p ∈ R^n,

p ∈ N ψ(ψ(u)) ⟺ p ∈ ∆n, 〈p, ψ(u)〉 ≤ 〈p, z′〉 ∀z′ ∈ Sψ

⟺ p ∈ ∆n, 〈p, ψ(u)〉 ≤ 〈p, z′〉 ∀z′ ∈ Rψ

⟺ p ∈ ∆n, and the convex function φ(u′) = p^⊤ψ(u′) = ∑_{y=1}^n p_y ψy(u′) achieves its minimum at u′ = u

⟺ p ∈ ∆n, 0 ∈ ∑_{y=1}^n p_y ∂ψy(u)

⟺ p ∈ ∆n, 0 = ∑_{y=1}^n p_y ∑_{j=1}^{s_y} v^y_j w^y_j for some v^y ∈ ∆_{s_y}

⟺ p ∈ ∆n, 0 = ∑_{y=1}^n ∑_{j=1}^{s_y} q^y_j w^y_j for some q^y = p_y v^y, v^y ∈ ∆_{s_y}

⟺ p ∈ ∆n, Aq = 0 for some q = (p_1 v^1, . . . , p_n v^n)^⊤ ∈ ∆s, v^y ∈ ∆_{s_y}

⟺ p = Bq for some q ∈ null(A) ∩ ∆s .

In some of the steps above, we have used basic properties of convex functions that can

be found in Appendix A.

We give some examples for computation of the positive normals using Lemma 3.6 below.

Example 3.5 (Positive normal sets of the ‘absolute difference’ surrogate). Let Y = [3], and let C = R. Consider the ‘absolute difference’ surrogate ψabs : Y × R→R_+ defined as follows:

ψabs(y, u) = |u − y| ∀y ∈ [3], u ∈ R . (3.3)

Clearly, ψabs is convex (see Figure 3.3). Moreover, we have

Rψabs = ψabs(R) = { (|u − 1|, |u − 2|, |u − 3|)^⊤ : u ∈ R } ⊂ R^3_+ .


Now let u1 = 1, u2 = 2, and u3 = 3, and let

z1 = ψabs(u1) = ψabs(1) = (0, 1, 2)^⊤ ∈ Rψabs
z2 = ψabs(u2) = ψabs(2) = (1, 0, 1)^⊤ ∈ Rψabs
z3 = ψabs(u3) = ψabs(3) = (2, 1, 0)^⊤ ∈ Rψabs .

Let us consider computing the positive normal sets of ψabs at the 3 points z1, z2, z3 above.

To see that z1 satisfies the conditions of Lemma 3.6, note that

∂ψabs_1(u1) = ∂ψabs_1(1) = [−1, 1] = conv({−1, +1}) ;
∂ψabs_2(u1) = ∂ψabs_2(1) = {−1} = conv({−1}) ;
∂ψabs_3(u1) = ∂ψabs_3(1) = {−1} = conv({−1}) .

Therefore, we can use Lemma 3.6 to compute N ψabs(z1). Here s = 4, and

A = [+1  −1  −1  −1] ;

B = [1 1 0 0
     0 0 1 0
     0 0 0 1] .

This gives

N ψabs(z1) = { p ∈ ∆3 : p = (q1 + q2, q3, q4) for some q ∈ ∆4 with q1 − q2 − q3 − q4 = 0 }
= { p ∈ ∆3 : p = (q1 + q2, q3, q4) for some q ∈ ∆4 with q1 = 1/2 }
= { p ∈ ∆3 : p1 ≥ 1/2 } .

It is easy to see that z2 and z3 also satisfy the conditions of Lemma 3.6; similar computations then yield

N ψabs(z2) = { p ∈ ∆3 : p1 ≤ 1/2, p3 ≤ 1/2 }

N ψabs(z3) = { p ∈ ∆3 : p3 ≥ 1/2 } .

The positive normal sets above are shown in Figure 3.3.
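For a one-dimensional surrogate such as ψabs, the optimality condition 0 ∈ ∑_y p_y ∂ψy(u) appearing in the proof of Lemma 3.6 reduces to a simple interval test, since each subdifferential is an interval of R. The sketch below (an illustration, not from the thesis) uses this to check the sets computed in this example:

```python
# For a 1-dimensional convex surrogate, 0 in sum_y p_y dpsi_y(u) is an
# interval test: each subdifferential of psi_abs(y, u) = |u - y| is an
# interval [lo_y, hi_y], and p is in N(psi_abs(u)) iff
# sum_y p_y lo_y <= 0 <= sum_y p_y hi_y.

def subdiff_abs(y, u):
    """Subdifferential of u -> |u - y| as an interval (lo, hi)."""
    if u < y:
        return (-1.0, -1.0)
    if u > y:
        return (1.0, 1.0)
    return (-1.0, 1.0)

def in_normals_abs(p, u):
    lo = sum(py * subdiff_abs(y, u)[0] for y, py in enumerate(p, start=1))
    hi = sum(py * subdiff_abs(y, u)[1] for y, py in enumerate(p, start=1))
    return lo <= 0.0 <= hi

# Example 3.5: N(psi_abs(1)) should be exactly {p : p1 >= 1/2}.
```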

Figure 3.3: (a) The absolute difference surrogate ψabs : Y × R→R_+ (for n = 3), and (b) its positive normal sets at the 3 points zi = ψabs(ui) ∈ R^3_+ (i ∈ [3]) for u1 = 1, u2 = 2, u3 = 3. See Example 3.5 for details.

Example 3.6 (Positive normal sets of the ε-insensitive absolute difference surrogate). Let Y = [3], and let C = R. Let ε ∈ (0, 0.5), and consider the ε-insensitive absolute difference surrogate ψε : Y × R→R_+ defined as follows:

ψε(y, u) = (|u − y| − ε)+ ∀y ∈ [3], u ∈ R . (3.4)

For ε = 0, we have ψε = ψabs. Clearly, ψε is a convex function (see Figure 3.4). Moreover, we have

Rψε = ψε(R) = { ((|u − 1| − ε)+, (|u − 2| − ε)+, (|u − 3| − ε)+)^⊤ : u ∈ R } ⊂ R^3_+ .

For concreteness, we will take ε = 0.25 below, but similar computations hold ∀ε ∈ (0, 0.5). Let u1 = 1 + ε = 1.25, u2 = 2 − ε = 1.75, u3 = 2 + ε = 2.25, and u4 = 3 − ε = 2.75, and let

z1 = ψ0.25(u1) = ψ0.25(1.25) = (0, 0.5, 1.5)^⊤ ∈ Rψ0.25
z2 = ψ0.25(u2) = ψ0.25(1.75) = (0.5, 0, 1)^⊤ ∈ Rψ0.25
z3 = ψ0.25(u3) = ψ0.25(2.25) = (1, 0, 0.5)^⊤ ∈ Rψ0.25
z4 = ψ0.25(u4) = ψ0.25(2.75) = (1.5, 0.5, 0)^⊤ ∈ Rψ0.25 .


Let us consider computing the positive normal sets of ψ0.25 at the 4 points zi (i ∈ [4]) above. To see that z1 satisfies the conditions of Lemma 3.6, note that

∂ψ0.25_1(u1) = ∂ψ0.25_1(1.25) = [0, 1] = conv({0, 1}) ;
∂ψ0.25_2(u1) = ∂ψ0.25_2(1.25) = {−1} = conv({−1}) ;
∂ψ0.25_3(u1) = ∂ψ0.25_3(1.25) = {−1} = conv({−1}) .

Therefore, we can use Lemma 3.6 to compute N ψ0.25(z1). Here s = 4, and

A = [0  1  −1  −1] ;

B = [1 1 0 0
     0 0 1 0
     0 0 0 1] .

This gives

N ψ0.25(z1) = { p ∈ ∆3 : p = (q1 + q2, q3, q4) for some q ∈ ∆4 with q2 − q3 − q4 = 0 }
= { p ∈ ∆3 : p = (q1 + q2, q3, q4) for some q ∈ ∆4 with q1 + q2 ≥ q3 + q4 }
= { p ∈ ∆3 : p1 ≥ 1/2 } .

Similarly, to see that z2 satisfies the conditions of Lemma 3.6, note that

∂ψ0.25_1(u2) = ∂ψ0.25_1(1.75) = {1} = conv({1}) ;
∂ψ0.25_2(u2) = ∂ψ0.25_2(1.75) = [−1, 0] = conv({−1, 0}) ;
∂ψ0.25_3(u2) = ∂ψ0.25_3(1.75) = {−1} = conv({−1}) .

Again, we can use Lemma 3.6 to compute N ψ0.25(z2); here s = 4, and

A = [1  −1  0  −1] ;

B = [1 0 0 0
     0 1 1 0
     0 0 0 1] .

Figure 3.4: (a) The ε-insensitive absolute difference surrogate ψε : Y × R→R_+ for ε = 0.25 (and n = 3), and (b) its positive normal sets at the 4 points zi = ψε(ui) ∈ R^3_+ (i ∈ [4]) for u1 = 1.25, u2 = 1.75, u3 = 2.25, u4 = 2.75. See Example 3.6 for details.

This gives

N ψ0.25(z2) = { p ∈ ∆3 : p = (q1, q2 + q3, q4) for some q ∈ ∆4 with q1 − q2 − q4 = 0 }
= { p ∈ ∆3 : p1 ≥ p3, p1 ≤ 1/2 } .

Similar computations then yield

N ψ0.25(z3) = { p ∈ ∆3 : p1 ≤ p3, p3 ≤ 1/2 }

N ψ0.25(z4) = { p ∈ ∆3 : p3 ≥ 1/2 } .

The positive normal sets above are shown in Figure 3.4.
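As in Example 3.5, the one-dimensional structure of ψε makes the condition 0 ∈ ∑_y p_y ∂ψ^ε_y(u) an interval test. The following sketch (illustrative only, not from the thesis) encodes the subdifferentials of ψ0.25 directly and checks the set N ψ0.25(z2) computed above:

```python
# Interval test for the eps-insensitive surrogate psi_eps(y, u) = (|u-y|-eps)_+
# with eps = 0.25: p is in N(psi_eps(u)) iff 0 lies in sum_y p_y dpsi_eps_y(u).

EPS = 0.25

def subdiff_eps(y, u):
    """Subdifferential of u -> (|u - y| - EPS)_+ as an interval (lo, hi)."""
    if u < y - EPS:
        return (-1.0, -1.0)
    if u == y - EPS:
        return (-1.0, 0.0)
    if u < y + EPS:
        return (0.0, 0.0)       # flat region of width 2*EPS around y
    if u == y + EPS:
        return (0.0, 1.0)
    return (1.0, 1.0)

def in_normals_eps(p, u):
    lo = sum(py * subdiff_eps(y, u)[0] for y, py in enumerate(p, start=1))
    hi = sum(py * subdiff_eps(y, u)[1] for y, py in enumerate(p, start=1))
    return lo <= 0.0 <= hi

# Example 3.6: N(psi_eps(1.75)) should be {p : p1 >= p3, p1 <= 1/2}.
```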

Example 3.7 (Positive normals of the Crammer-Singer surrogate). Consider the Crammer-Singer surrogate introduced in Example 2.14, for n = 3. In this case, the surrogate ψCS : [3] × R^3→R_+ is given by

ψCS(1, u) = max(1 + u2 − u1, 1 + u3 − u1, 0)
ψCS(2, u) = max(1 + u1 − u2, 1 + u3 − u2, 0)
ψCS(3, u) = max(1 + u1 − u3, 1 + u2 − u3, 0) ∀u ∈ R^3 .


Clearly, ψCS is convex. Let u1 = (1, 0, 0)^⊤, u2 = (0, 1, 0)^⊤, u3 = (0, 0, 1)^⊤, u4 = (0, 0, 0)^⊤, and let

z1 = ψCS(u1) = (0, 2, 2)^⊤
z2 = ψCS(u2) = (2, 0, 2)^⊤
z3 = ψCS(u3) = (2, 2, 0)^⊤
z4 = ψCS(u4) = (1, 1, 1)^⊤ .

We apply Lemma 3.6 to compute the positive normal sets of ψCS at the 4 points z1, z2, z3, z4 above. In particular, to see that z4 satisfies the conditions of Lemma 3.6, note that by Danskin's theorem [8], we have

∂ψCS_1(u4) = conv({(−1, +1, 0)^⊤, (−1, 0, +1)^⊤}) ;
∂ψCS_2(u4) = conv({(+1, −1, 0)^⊤, (0, −1, +1)^⊤}) ;
∂ψCS_3(u4) = conv({(+1, 0, −1)^⊤, (0, +1, −1)^⊤}) .

We can therefore use Lemma 3.6 to compute N ψCS(z4). Here s = 6, and

A = [−1 −1  1  0  1  0
      1  0 −1 −1  0  1
      0  1  0  1 −1 −1] ;

B = [1 1 0 0 0 0
     0 0 1 1 0 0
     0 0 0 0 1 1] .

Figure 3.5: Positive normal sets for the Crammer-Singer surrogate ψCS for n = 3, at the 4 points zi = ψCS(ui) ∈ R^3_+ (i ∈ [4]) for u1 = (1, 0, 0)^⊤, u2 = (0, 1, 0)^⊤, u3 = (0, 0, 1)^⊤, and u4 = (0, 0, 0)^⊤. Details can be found in Example 3.7.

By Lemma 3.6 (and some algebra), this gives

N ψCS(z4) = { p ∈ ∆3 : p = (q1 + q2, q3 + q4, q5 + q6) for some q ∈ ∆6 with q1 + q2 = q3 + q5, q3 + q4 = q1 + q6, q5 + q6 = q2 + q4 }
= { p ∈ ∆3 : p1 ≤ 1/2, p2 ≤ 1/2, p3 ≤ 1/2 } .

It is easy to see that z1, z2, z3 also satisfy the conditions of Lemma 3.6; similar computations then yield

N ψCS(z1) = { p ∈ ∆3 : p1 ≥ 1/2 }

N ψCS(z2) = { p ∈ ∆3 : p2 ≥ 1/2 }

N ψCS(z3) = { p ∈ ∆3 : p3 ≥ 1/2 } .

An illustration of the computed positive normals is given in Figure 3.5.
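The description of N ψCS(z4) can also be verified by brute force: for p with max(p) ≤ 1/2, the ψCS-risk of u4 = (0, 0, 0), namely 〈p, (1, 1, 1)〉 = 1, should (approximately) equal the infimum over all u. The sketch below (illustrative only, not from the thesis; classes indexed from 0) grid-searches u, fixing u3 = 0 since ψCS is invariant to adding a constant to all coordinates of u:

```python
# Brute-force check of Example 3.7 for the n = 3 Crammer-Singer surrogate.

def psi_cs(y, u):
    """psi_CS(y, u) = max(max_{j != y} u_j + 1 - u_y, 0), classes 0-indexed."""
    others = [u[j] for j in range(3) if j != y]
    return max(max(others) + 1 - u[y], 0.0)

def min_cs_risk(p, lim=2.0, step=0.1):
    """Approximate inf_u <p, psi_CS(u)> over a grid, fixing u3 = 0."""
    grid = [round(-lim + step * i, 10) for i in range(int(2 * lim / step) + 1)]
    return min(
        sum(p[y] * psi_cs(y, (u1, u2, 0.0)) for y in range(3))
        for u1 in grid for u2 in grid
    )

# For max(p) <= 1/2 the infimum stays at 1, so u = (0,0,0) is psi-optimal;
# for max(p) > 1/2, a vector favoring that class does strictly better.
```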

3.4 Conditions for Calibration

In this section, we give both necessary conditions (Section 3.4.1) and sufficient conditions

(Section 3.4.2) for a surrogate ψ to be calibrated w.r.t. an arbitrary loss function `. Both

these conditions involve the trigger probability sets of the target loss ` and the positive

normal sets of the surrogate loss ψ.

Figure 3.6: Visual proof of Theorem 3.7.

3.4.1 Necessary Conditions for Calibration

We start by deriving necessary conditions for `-calibration of a surrogate loss ψ. Consider what happens if, for some point z ∈ Sψ, the positive normal set N ψ(z) has a non-empty intersection with the interiors of two trigger probability sets of `, say Q`_1 and Q`_2 (see Figure 3.6 for an illustration); this means ∃q1, q2 ∈ N ψ(z) with argmin_{t∈Y}〈q1, `t〉 = {1} and argmin_{t∈Y}〈q2, `t〉 = {2}. If ψ is `-calibrated, then by Lemma 3.1, ∃ pred′ : Sψ→Y such that

inf_{z′∈Sψ : pred′(z′)≠1} 〈q1, z′〉 = inf_{z′∈Sψ : pred′(z′)∉argmin_t〈q1,`t〉} 〈q1, z′〉 > inf_{z′∈Sψ} 〈q1, z′〉 = 〈q1, z〉

inf_{z′∈Sψ : pred′(z′)≠2} 〈q2, z′〉 = inf_{z′∈Sψ : pred′(z′)∉argmin_t〈q2,`t〉} 〈q2, z′〉 > inf_{z′∈Sψ} 〈q2, z′〉 = 〈q2, z〉 .

The first inequality above implies pred′(z) = 1; the second implies pred′(z) = 2, leading to a contradiction. This gives us the following necessary condition for `-calibration of ψ, which requires the positive normal sets of ψ at all points z ∈ Sψ to be ‘well-behaved’ w.r.t. ` in the sense of being contained within individual trigger probability sets of `, and which generalizes the ‘admissibility’ condition used for the 0-1 loss by Tewari and Bartlett [100]:

Theorem 3.7. Let ` : Y × Y→R+, and let ψ : Y × C→R+ be `-calibrated. Then for all

points z ∈ Sψ, there exists some t ∈ Y such that N ψ(z) ⊆ Q`t.

In fact, we have the following stronger necessary condition, which requires the positive

normal sets of ψ not only at all points z ∈ Sψ but also at all sequences zm in Sψ to be

contained within individual trigger probability sets of `.


Theorem 3.8. Let ` : Y × Y→R_+, and let ψ : Y × C→R_+ be `-calibrated. Then for all sequences z_m in Sψ, there exists some t ∈ Y such that N ψ(z_m) ⊆ Q`_t.

Proof. Assume for the sake of contradiction that there is some sequence z_m in Sψ for which N ψ(z_m) is not contained in Q`_t for any t ∈ Y. Then ∀t ∈ Y, ∃q_t ∈ N ψ(z_m) such that q_t ∉ Q`_t, i.e. such that t ∉ argmin_{t′}〈q_t, `t′〉. Now, since ψ is `-calibrated, by Lemma 3.4, there exists a function pred′ : Sψ→Y such that for all p ∈ N ψ(z_m), we have pred′(z_m) ∈ argmin_{t′}〈p, `t′〉 for all large enough m. In particular, for p = q_t, we get pred′(z_m) ∈ argmin_{t′}〈q_t, `t′〉 ultimately. Since this is true for each t ∈ Y, we get pred′(z_m) ∈ ∩_{t∈Y} argmin_{t′}〈q_t, `t′〉 ultimately. However, by the choice of the q_t, this intersection is empty, yielding a contradiction. This completes the proof.

Note that Theorem 3.8 includes Theorem 3.7 as a special case, since N ψ(z) = N ψ(z_m) for the constant sequence z_m = z ∀m. We stated Theorem 3.7 separately above since it has a simple, direct proof that helps build intuition.

Example 3.8 (The Crammer-Singer surrogate is not calibrated for the 0-1 loss). Looking at the positive normal sets of the Crammer-Singer surrogate ψCS (for n = 3) shown in Figure 3.5 and the trigger probability sets of the 0-1 loss `0-1 shown in Figure 3.1a, we see that N ψCS(z4) is not contained in any single trigger probability set of `0-1; therefore, applying Theorem 3.7, it is immediately clear that ψCS is not `0-1-calibrated (this was also established previously [100, 116]).
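The failure of the containment condition can be exhibited with two concrete witnesses, in the spirit of the visual argument above. The sketch below (illustrative only, not from the thesis; classes indexed from 0) picks two points of N ψCS(z4) = {p : max(p) ≤ 1/2} whose unique 0-1-optimal predictions differ, so N ψCS(z4) cannot sit inside any single trigger set of `0-1:

```python
# Theorem 3.7 in action: N(z4) = {p : max(p) <= 1/2} for the Crammer-Singer
# surrogate (Example 3.7) meets the interiors of two different trigger sets
# of the 0-1 loss, so it is not contained in any single Q_t, and psi_CS
# cannot be 0-1 calibrated.

def zero_one_argmin(p):
    """Optimal predictions for the 0-1 loss: the most probable classes."""
    m = max(p)
    return {t for t, pt in enumerate(p) if pt == m}

# Two witnesses in N(z4), optimal for different classes:
q1 = (0.5, 0.3, 0.2)   # in N(z4); class 1 is the unique 0-1-optimal prediction
q2 = (0.3, 0.5, 0.2)   # in N(z4); class 2 is the unique 0-1-optimal prediction
```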

3.4.2 Sufficient Condition for Calibration

We now give a sufficient condition for `-calibration of a surrogate loss ψ that will be helpful in showing calibration of various surrogates. In particular, we show that for a surrogate loss ψ to be `-calibrated, it suffices that the above containment of positive normal sets of ψ in trigger probability sets of ` hold at only a finite number of points in Sψ, as long as the corresponding positive normal sets jointly cover ∆n:

Theorem 3.9. Let ` : Y × Y→R_+ and ψ : Y × C→R_+. Suppose there exist r ∈ Z_+ and z1, . . . , zr ∈ Sψ such that ∪_{j=1}^r N ψ(zj) = ∆n and, for each j ∈ [r], ∃t ∈ Y such that N ψ(zj) ⊆ Q`_t. Then ψ is `-calibrated.


The proof uses the following technical lemma:

Lemma 3.10. Let ψ : Y × C→R_+. Suppose there exist r ∈ N and z1, . . . , zr ∈ Rψ such that ∪_{j=1}^r N ψ(zj) = ∆n. Then any element z ∈ Sψ can be written as z = z′ + z′′ for some z′ ∈ conv({z1, . . . , zr}) and z′′ ∈ R^n_+.

Proof. (Proof of Lemma 3.10)

Let S′ = { z′ + z′′ : z′ ∈ conv({z1, . . . , zr}), z′′ ∈ R^n_+ }, and suppose there exists a point z ∈ Sψ that cannot be decomposed as claimed, i.e. such that z ∉ S′. Then by the Hahn-Banach theorem (e.g. see [42], Corollary 3.10), there exists a hyperplane that strictly separates z from S′, i.e. ∃w ∈ R^n such that 〈w, z〉 < 〈w, a〉 ∀a ∈ S′. It is easy to see that w ∈ R^n_+ (since a negative component in w would allow us to choose an element a of S′ with arbitrarily small 〈w, a〉).

Now consider the vector q = w / ∑_{i=1}^n w_i ∈ ∆n. Since ∪_{j=1}^r N ψ(zj) = ∆n, ∃j ∈ [r] such that q ∈ N ψ(zj). By the definition of positive normals, this gives 〈q, zj〉 ≤ 〈q, z〉, and therefore 〈w, zj〉 ≤ 〈w, z〉. But this contradicts our construction of w (since zj ∈ S′). Thus every z ∈ Sψ must also be an element of S′.

Proof. (Proof of Theorem 3.9)

We will show `-calibration of ψ via Lemma 3.1. For each j ∈ [r], let

Tj = { t ∈ Y : N ψ(zj) ⊆ Q`_t } ;

by assumption, Tj ≠ ∅ ∀j ∈ [r]. By Lemma 3.10, for every z ∈ Sψ, ∃α ∈ ∆r, u ∈ R^n_+ such that z = ∑_{j=1}^r αj zj + u. For each z ∈ Sψ, arbitrarily fix one such pair, i.e. fix α^z ∈ ∆r and u^z ∈ R^n_+ such that

z = ∑_{j=1}^r α^z_j zj + u^z .

Now define pred′ : Sψ→Y as

pred′(z) = min{ t ∈ Y : ∃j ∈ [r] such that α^z_j ≥ 1/r and t ∈ Tj } .


We will show that ψ, along with pred′, satisfies the condition for `-calibration in Lemma 3.1. Fix any p ∈ ∆n. Let

Jp = { j ∈ [r] : p ∈ N ψ(zj) } ;

since ∆n = ∪_{j=1}^r N ψ(zj), we have Jp ≠ ∅. Clearly,

∀j ∈ Jp : 〈p, zj〉 = inf_{z∈Sψ} 〈p, z〉 (3.5)

∀j ∉ Jp : 〈p, zj〉 > inf_{z∈Sψ} 〈p, z〉 . (3.6)

Moreover, from the definition of Tj, we have

∀j ∈ Jp : t ∈ Tj ⟹ p ∈ Q`_t ⟹ t ∈ argmin_{t′}〈p, `t′〉 .

Thus we get

∀j ∈ Jp : Tj ⊆ argmin_{t′}〈p, `t′〉 .

Now, for any z ∈ Sψ for which pred′(z) ∉ argmin_{t′}〈p, `t′〉, we must have α^z_j ≥ 1/r for at least one j ∉ Jp (otherwise, we would have pred′(z) ∈ Tj for some j ∈ Jp, giving pred′(z) ∈ argmin_{t′}〈p, `t′〉, a contradiction). Thus we have

inf_{z∈Sψ : pred′(z)∉argmin_{t′}〈p,`t′〉} 〈p, z〉 = inf_{z∈Sψ : pred′(z)∉argmin_{t′}〈p,`t′〉} [ ∑_{j=1}^r α^z_j 〈p, zj〉 + 〈p, u^z〉 ]

≥ inf_{α∈∆r : αj≥1/r for some j∉Jp} ∑_{j=1}^r αj 〈p, zj〉

≥ min_{j∉Jp} inf_{αj∈[1/r, 1]} [ αj 〈p, zj〉 + (1 − αj) inf_{z∈Sψ} 〈p, z〉 ]

> inf_{z∈Sψ} 〈p, z〉 ,

where the last inequality follows from Equation (3.6). Since the above holds for all p ∈ ∆n, by Lemma 3.1, we have that ψ is `-calibrated.

Example 3.9 (The Crammer-Singer surrogate is calibrated for `(?) and `abs for n = 3). Inspecting the positive normal sets of the Crammer-Singer surrogate ψCS (for n = 3) in Figure 3.5 and the trigger probability sets of the abstain loss `(?) in Figure 3.1c, we see that N ψCS(zi) = Q`(?)_i ∀i ∈ [3], and N ψCS(z4) = Q`(?)_⊥. Therefore, by Theorem 3.9, the Crammer-Singer surrogate ψCS is `(?)-calibrated. Similarly, looking at the trigger probability sets of the absolute difference loss `abs in Figure 3.1b and again applying Theorem 3.9, we see that ψCS is also `abs-calibrated. Note, however, that for larger n, ψCS remains calibrated w.r.t. the abstain loss but not w.r.t. the absolute difference loss (details in Chapter 7).

Example 3.10 (The absolute difference surrogate is calibrated w.r.t. `abs). Inspecting the positive normal sets of the absolute difference surrogate ψabs (for n = 3) in Figure 3.3 and the trigger probability sets of the absolute difference loss `abs in Figure 3.1b, we have that N ψabs(zi) = Q`abs_i ∀i ∈ [3]. Hence by Theorem 3.9, ψabs is `abs-calibrated. This argument can be easily generalized to larger n as well.

Example 3.11 (The ε-insensitive absolute difference surrogate is calibrated w.r.t. `abs). Let ε ∈ (0, 0.5). Inspecting the positive normal sets of the ε-insensitive surrogate ψε (for n = 3) in Figure 3.4 and the trigger probability sets of the absolute difference loss `abs in Figure 3.1b, we have that

N ψε(z1) = Q`abs_1 ; N ψε(z4) = Q`abs_3

N ψε(z2) ⊆ Q`abs_2 ; N ψε(z3) ⊆ Q`abs_2 .

Also, clearly ∪_{i=1}^4 N ψε(zi) = ∆3. Hence by Theorem 3.9, ψε is `abs-calibrated. This argument can be easily generalized to larger n as well.

Chapter 4

Convex Calibration Dimension

In this chapter we discuss a fundamental quantity associated with a loss matrix, which we call the convex calibration dimension. To motivate this quantity, consider the absolute difference surrogate loss and the absolute difference loss matrix of Examples 3.5 and 3.10. It follows that the absolute difference surrogate loss, which has a surrogate dimension of 1, is calibrated w.r.t. the absolute difference loss matrix for any finite n.¹ This is a property of the absolute difference loss matrix, and it is not clear whether such ‘low-dimensional’ convex calibrated surrogates exist for other loss matrices.

This immediately raises the question:

What is the smallest surrogate dimension of a convex ℓ-calibrated surrogate?

This is a key question, as it captures all our requirements of a surrogate ψ: that it be ℓ-calibrated, convex, and have a small surrogate dimension d. This question is captured by our definition of the convex calibration dimension.

Definition 4.1 (Convex calibration dimension). Let ℓ : Y × Ŷ → R+. Define the convex calibration dimension (CC dimension) of ℓ as

CCdim(ℓ) = min{ d ∈ Z+ : ∃ a convex set C ⊆ R^d and a convex surrogate ψ : Y × C → R+ that is ℓ-calibrated } .

¹The examples discussed use n = 3, but the arguments can be generalized to any finite n easily.

From the above discussion, CCdim(ℓabs) = 1 for all n.

In this chapter, we will be interested in developing an understanding of the CC dimension for general loss matrices ℓ.

4.1 Chapter Organization

We analyze the CC dimension and give upper bounds (Section 4.2) and lower bounds

(Section 4.3) on this quantity. We show that the derived upper and lower bounds match

for certain types of loss matrices in Section 4.4. We then apply these results to certain

losses used in subset ranking and derive bounds on their CC dimension in Section 4.5,

thereby giving both existence and impossibility results on convex calibrated surrogates

for such losses.

4.2 Upper Bounds on CC Dimension

We start with a simple result which establishes that the CC dimension of any multiclass loss ℓ is finite, and in fact is strictly smaller than the number of class labels n.

Lemma 4.1. Let ℓ : Y × Ŷ → R+. Let C = { u ∈ R^{n−1}_+ : Σ_{j=1}^{n−1} u_j ≤ 1 }. Let ψ : Y × C → R+ be such that for all y ∈ Y,

ψ(y, u) = Σ_{j=1}^{n−1} (u_j − 1(y = j))² .

Then ψ is ℓ-calibrated. In particular, since ψ is convex, CCdim(ℓ) ≤ n − 1.

Proof. For each u ∈ C, define p_u = (u₁, ..., u_{n−1}, 1 − Σ_{j=1}^{n−1} u_j)ᵀ ∈ ∆n. Define pred : C → Ŷ as

pred(u) = min{ t ∈ Ŷ : p_u ∈ Q^ℓ_t } .

We will show that (ψ, pred) is ℓ-calibrated.


Fix p ∈ ∆n. It can be seen that

〈p, ψ(u)〉 = Σ_{j=1}^{n−1} ( p_j (u_j − 1)² + (1 − p_j) u_j² ) .

Minimizing the above expression over u yields the unique minimizer u* = (p₁, ..., p_{n−1})ᵀ ∈ C. Now, for each t ∈ Ŷ we have

reg^ℓ_p(t) = 〈p, ℓ_t〉 − min_{t′∈Ŷ} 〈p, ℓ_{t′}〉 .

Clearly, reg^ℓ_p(t) = 0 ⟺ p ∈ Q^ℓ_t. Note also that p_{u*} = p, and therefore reg^ℓ_p(pred(u*)) = 0. Let ε > 0 be such that

ε = min_{t∈[k] : p∉Q^ℓ_t} reg^ℓ_p(t) > 0 .

Then we have

inf_{u∈C : pred(u)∉argmin_t 〈p,ℓ_t〉} 〈p, ψ(u)〉 = inf_{u∈C : reg^ℓ_p(pred(u)) ≥ ε} 〈p, ψ(u)〉

  = inf_{u∈C : reg^ℓ_p(pred(u)) ≥ reg^ℓ_p(pred(u*)) + ε} 〈p, ψ(u)〉 .

Now, we claim that the mapping u ↦ reg^ℓ_p(pred(u)) is continuous at u = u*. To see this, suppose the sequence (u_m) converges to u*. Then p_{u_m} converges to p_{u*} = p, and therefore for each t ∈ [k], 〈p_{u_m}, ℓ_t〉 converges to 〈p, ℓ_t〉. Since by definition of pred we have pred(u_m) ∈ argmin_t 〈p_{u_m}, ℓ_t〉 for all m, this implies that for all large enough m, pred(u_m) ∈ argmin_t 〈p, ℓ_t〉. Thus for all large enough m, reg^ℓ_p(pred(u_m)) = 0; i.e. the sequence reg^ℓ_p(pred(u_m)) converges to reg^ℓ_p(pred(u*)) = 0, yielding continuity at u*. In particular, this implies ∃δ > 0 such that

‖u − u*‖ < δ ⟹ reg^ℓ_p(pred(u)) − reg^ℓ_p(pred(u*)) < ε .

This gives

inf_{u∈C : reg^ℓ_p(pred(u)) ≥ reg^ℓ_p(pred(u*)) + ε} 〈p, ψ(u)〉 ≥ inf_{u∈C : ‖u−u*‖ ≥ δ} 〈p, ψ(u)〉

  > inf_{u∈C} 〈p, ψ(u)〉 ,

where the last inequality holds since u* is the unique minimizer of 〈p, ψ(u)〉. The above sequence of inequalities gives us

inf_{u∈C : pred(u)∉argmin_t 〈p,ℓ_t〉} 〈p, ψ(u)〉 > inf_{u∈C} 〈p, ψ(u)〉 .

Since this holds for all p ∈ ∆n, we have that ψ is ℓ-calibrated.

It may appear surprising that the convex surrogate ψ in the above lemma, operating on a surrogate space C ⊂ R^{n−1}, is ℓ-calibrated for all multiclass losses ℓ on n classes. However, this makes intuitive sense: in principle, for any multiclass problem, if one can estimate the conditional probabilities of the n classes accurately (which requires estimating n − 1 real-valued functions on X), then one can predict a target label that minimizes the expected loss according to these probabilities. Minimizing the above surrogate effectively corresponds to such class probability estimation. Indeed, the above lemma can be shown to hold for any surrogate that is a strictly proper composite multiclass loss [103].
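As a concrete illustration of this probability-estimation view, the following sketch (our own numerical check, not code from the thesis; the 3-class distribution p is arbitrary) evaluates the squared surrogate of Lemma 4.1 and verifies that the conditional risk 〈p, ψ(u)〉 is minimized at u* = (p₁, ..., p_{n−1}):

```python
import numpy as np

def psi(y, u):
    """Squared surrogate of Lemma 4.1: sum_j (u_j - 1(y = j))^2, with u in R^{n-1}.
    Labels are 0-indexed here; the last label maps to the zero vector."""
    e = np.zeros(len(u))
    if y < len(u):
        e[y] = 1.0
    return float(np.sum((u - e) ** 2))

def risk(p, u):
    """Conditional surrogate risk <p, psi(u)>."""
    return sum(p[y] * psi(y, u) for y in range(len(p)))

p = np.array([0.5, 0.3, 0.2])   # n = 3 class-probability vector (arbitrary choice)
u_star = p[:-1]                 # claimed unique minimizer (p_1, ..., p_{n-1})

# closed form from the proof: sum_j [ p_j (u_j - 1)^2 + (1 - p_j) u_j^2 ]
closed = sum(p[j] * (u_star[j] - 1) ** 2 + (1 - p[j]) * u_star[j] ** 2
             for j in range(2))
assert abs(risk(p, u_star) - closed) < 1e-12

# random perturbations around u_star only increase the risk
rng = np.random.default_rng(0)
assert all(risk(p, u_star + 0.1 * rng.standard_normal(2)) > risk(p, u_star)
           for _ in range(200))
```

The closed-form expression matches the one derived in the proof of Lemma 4.1, and the perturbation check reflects the uniqueness of the minimizer.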

In practice, when the number of class labels n is large (such as in a multilabel prediction

task, where n is exponential in the number of tags), the above result is not very helpful;

in such cases, it is of interest to develop algorithms operating on a lower-dimensional surrogate prediction space. Next we give a different upper bound on the CC dimension that depends on the loss ℓ and, for certain losses, can be significantly tighter than the general bound above.

Theorem 4.2. Let ℓ : Y × Ŷ → R+. Then

CCdim(ℓ) ≤ affdim(L) ,

where affdim(L) denotes the dimension of the vector space parallel to the affine hull of the column (or row) vectors of L.

Proof. Let affdim(L) = d. We will construct a convex ℓ-calibrated surrogate loss ψ with surrogate prediction space C ⊆ R^d.


Let V ⊆ R^n denote the (d-dimensional) subspace parallel to the affine hull of the column vectors of L, and let r ∈ R^n be the corresponding translation vector, so that

V = aff({ℓ₁, ..., ℓ_k}) + r ,

where aff(A) denotes the affine hull of the set A. Let v₁, ..., v_d ∈ V be d linearly independent vectors in V. Let e^d₁, ..., e^d_d denote the standard basis in R^d, and define a linear function ψ̃ : R^d → R^n by

ψ̃(e^d_j) = v_j ∀j ∈ [d] .

Then for each v ∈ V, there exists a unique vector u ∈ R^d such that ψ̃(u) = v. In particular, since ℓ_t + r ∈ V for all t ∈ [k], there exist unique vectors u₁, ..., u_k ∈ R^d such that ψ̃(u_t) = ℓ_t + r for each t ∈ [k]. Let C = conv({u₁, ..., u_k}) ⊆ R^d, and define ψ : C → R^n_+ as

ψ(u) = ψ̃(u) − r ∀u ∈ C .

To see that ψ(u) ∈ R^n_+ ∀u ∈ C, note that for any u ∈ C, ∃α ∈ ∆k such that u = Σ_{t=1}^k α_t u_t, which gives ψ(u) = ψ̃(u) − r = (Σ_{t=1}^k α_t ψ̃(u_t)) − r = (Σ_{t=1}^k α_t (ℓ_t + r)) − r = Σ_{t=1}^k α_t ℓ_t (and ℓ_t ∈ R^n_+ ∀t ∈ [k]).

Let ψ : [n] × C → R+ be the surrogate associated with this vector-valued map, i.e. ψ(y, u) = ψ_y(u). The surrogate ψ is clearly convex. To show that ψ is ℓ-calibrated, we will use Theorem 3.9. Specifically, consider the k points z_t = ψ(u_t) = ℓ_t ∈ Rψ for t ∈ [k]. By definition of ψ, we have Sψ = conv(ψ(C)) = conv({ℓ₁, ..., ℓ_k}); from the definitions of positive normals and trigger probabilities, it then follows that N_ψ(z_t) = N_ψ(ℓ_t) = Q^ℓ_t for all t ∈ [k]. Thus by Theorem 3.9, ψ is ℓ-calibrated.

Since affdim(L) is equal to either rank(L) or rank(L)− 1, this immediately gives us the

following corollary:

Corollary 4.3. Let ℓ : Y × Ŷ → R+. Then CCdim(ℓ) ≤ rank(L).


Example 4.1 (CC dimension of Hamming loss). Let r ∈ N. Let Y = Ŷ = {0, 1}^r. The Hamming loss ℓHam as defined in Example 2.11 can be expressed as

ℓHam(y, t) = Σ_{i=1}^r 1(y_i ≠ t_i)
           = Σ_{i=1}^r (y_i + t_i − 2 y_i t_i)
           = Σ_{i=1}^r t_i + Σ_{i=1}^r y_i (1 − 2 t_i) .

Thus the loss matrix for the Hamming loss can be expressed as the sum of r rank-1 matrices and a matrix that depends only on the column (i.e. on t). Clearly, the affine dimension of such a matrix is at most r. Hence we have from Theorem 4.2 that CCdim(ℓHam) ≤ r, which is significantly better than the bound of 2^r − 1 obtained from Lemma 4.1.
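This bound is easy to check numerically; the sketch below (our own illustration, with r = 4 chosen arbitrarily) builds the full 2^r × 2^r Hamming loss matrix and computes the affine dimension of its columns:

```python
import itertools
import numpy as np

# Build the Hamming loss matrix over Y = Y_hat = {0,1}^r and check
# affdim(L) <= r (illustrative sketch; r = 4 is arbitrary).
r = 4
vecs = np.array(list(itertools.product([0, 1], repeat=r)))      # 2^r binary vectors
L = (vecs[:, None, :] != vecs[None, :, :]).sum(axis=2)          # L[y, t] = Hamming(y, t)

# affine dimension of the columns = rank of the column differences
affdim = np.linalg.matrix_rank(L - L[:, [0]])
print(affdim, r, L.shape)   # affdim = 4 = r, far below n - 1 = 15
```

Here the 16 × 16 matrix has affine dimension exactly r = 4, matching the analysis above.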

Theorem 4.2 guarantees the existence of a convex ℓ-calibrated surrogate of dimension affdim(L). Indeed, the proof of Theorem 4.2 constructs such a surrogate. However, this constructed surrogate is undesirable, as the surrogate space C is an awkward subset of R^{affdim(L)} and the predictor mapping pred is not given explicitly. We rectify both these shortcomings in Chapter 5, and give many other example loss matrices used in ranking and multi-label prediction where the size of the loss matrix is exponential in its rank.

4.3 Lower Bounds on CC Dimension

In this section we give a lower bound on the CC dimension of a loss ℓ and illustrate it by using it to calculate the CC dimension of the 0-1 loss. We will need the following definition:

Definition 4.2. The feasible subspace dimension of a convex set Q ⊆ R^n at a point p ∈ Q, denoted by νQ(p), is defined as the dimension of the subspace FQ(p) ∩ (−FQ(p)), where FQ(p) is the cone of feasible directions of Q at p.²

²For a set Q ⊆ R^n and point p ∈ Q, the cone of feasible directions of Q at p is defined as FQ(p) = { v ∈ R^n : ∃ε₀ > 0 such that p + εv ∈ Q ∀ε ∈ (0, ε₀) }.

[Figure 4.1: Illustration of the feasible subspace dimension νQ(p) at points p₁, p₂, p₃ of a 2-dimensional convex set Q. Panels: (a) the convex set Q; (b) dim(FQ(p₁) ∩ (−FQ(p₁))) = 2; (c) dim(FQ(p₂) ∩ (−FQ(p₂))) = 1; (d) dim(FQ(p₃) ∩ (−FQ(p₃))) = 0. Clearly νQ(p₁) = 2, νQ(p₂) = 1, νQ(p₃) = 0.]

Essentially, the feasible subspace dimension of a convex set Q at a point p is simply the dimension of the smallest face of Q containing p. An illustration of the feasible subspace dimension is given in Figure 4.1.

Both the proof of the lower bound and its applications make use of the following lemma,

which gives a method to calculate the feasible subspace dimension for certain convex sets

Q and points p ∈ Q:

Lemma 4.4. Let Q = { q ∈ R^n : A₁q ≤ b₁, A₂q ≤ b₂, A₃q = b₃ }. Let p ∈ Q be such that A₁p = b₁ and A₂p < b₂. Then

νQ(p) = nullity( [ A₁ ; A₃ ] ) ,

where [A₁ ; A₃] denotes the matrix obtained by stacking A₁ on top of A₃, and nullity(A) = dim(null(A)) is the dimension of the null space of A.


Proof. We will show that FQ(p) ∩ (−FQ(p)) = null([A₁ ; A₃]), from which the lemma follows.

First, let v ∈ null([A₁ ; A₃]). Then for ε > 0, we have

A₁(p + εv) = A₁p + εA₁v = A₁p + 0 = b₁ ,
A₂(p + εv) < b₂ for small enough ε, since A₂p < b₂ ,
A₃(p + εv) = A₃p + εA₃v = A₃p + 0 = b₃ .

Thus v ∈ FQ(p). Similarly, we can show −v ∈ FQ(p). Thus v ∈ FQ(p) ∩ (−FQ(p)), giving null([A₁ ; A₃]) ⊆ FQ(p) ∩ (−FQ(p)).

Now let v ∈ FQ(p) ∩ (−FQ(p)). Then for small enough ε > 0, we have both A₁(p + εv) ≤ b₁ and A₁(p − εv) ≤ b₁. Since A₁p = b₁, this gives A₁v = 0. Similarly, for small enough ε > 0, we have A₃(p + εv) = b₃; since A₃p = b₃, this gives A₃v = 0. Thus [A₁ ; A₃]v = 0, giving FQ(p) ∩ (−FQ(p)) ⊆ null([A₁ ; A₃]).

The following result gives a lower bound on the CC dimension of a loss ℓ in terms of the feasible subspace dimension of the trigger probability sets Q^ℓ_t at points p ∈ Q^ℓ_t:

Theorem 4.5. Let ℓ : Y × Ŷ → R+. Let p ∈ ∆n and t ∈ argmin_{t′} 〈p, ℓ_{t′}〉 (equivalently, let p ∈ Q^ℓ_t). Then

CCdim(ℓ) ≥ ‖p‖₀ − ν_{Q^ℓ_t}(p) − 1 .

The proof will require the lemma below, which relates the feasible subspace dimensions of

different trigger probability sets at points in their intersection; we will also make critical

use of the notion of ε-subdifferentials of convex functions [8], the main properties of which

are given in Appendix A.


Lemma 4.6. Let ℓ : Y × Ŷ → R+. Let p ∈ relint(∆n).³ Then for any t₁, t₂ ∈ argmin_{t′} 〈p, ℓ_{t′}〉 (i.e. such that p ∈ Q^ℓ_{t₁} ∩ Q^ℓ_{t₂}),

ν_{Q^ℓ_{t₁}}(p) = ν_{Q^ℓ_{t₂}}(p) .

Proof (of Lemma 4.6). Let t₁, t₂ ∈ argmin_{t′} 〈p, ℓ_{t′}〉 (i.e. p ∈ Q^ℓ_{t₁} ∩ Q^ℓ_{t₂}). Now,

Q^ℓ_{t₁} = { q ∈ R^n : −q ≤ 0, 1ₙᵀq = 1, (ℓ_{t₁} − ℓ_t)ᵀq ≤ 0 ∀t ∈ Ŷ } ,

where 1ₙ is the n-dimensional all-ones vector. Moreover, we have −p < 0, and pᵀ(ℓ_{t₁} − ℓ_t) = 0 iff p ∈ Q^ℓ_t. Let { t ∈ Ŷ : p ∈ Q^ℓ_t } = { t₁, ..., t_r } for some r ∈ [k]. Then by Lemma 4.4, we have

ν_{Q^ℓ_{t₁}}(p) = nullity(A₁) ,

where A₁ ∈ R^{(r+1)×n} is the matrix containing the r rows (ℓ_{t₁} − ℓ_{t_j})ᵀ, j ∈ [r], and the all-ones row. Similarly, we get

ν_{Q^ℓ_{t₂}}(p) = nullity(A₂) ,

where A₂ ∈ R^{(r+1)×n} is the matrix containing the r rows (ℓ_{t₂} − ℓ_{t_j})ᵀ, j ∈ [r], and the all-ones row. It can be seen that the subspaces spanned by the first r rows of A₁ and of A₂ are both equal to the subspace parallel to the affine hull of ℓ_{t₁}, ..., ℓ_{t_r}. Thus A₁ and A₂ have the same row space, hence the same null space and nullity, and therefore ν_{Q^ℓ_{t₁}}(p) = ν_{Q^ℓ_{t₂}}(p).

Proof (of Theorem 4.5). Let d ∈ Z+ be such that there exist a convex set C ⊆ R^d and a convex surrogate loss ψ : Y × C → R+ such that ψ is ℓ-calibrated. We will show that d ≥ ‖p‖₀ − ν_{Q^ℓ_t}(p) − 1. We consider two cases:

Case 1: p ∈ relint(∆n).

³relint(A) is the relative interior of the set A.


In this case ‖p‖₀ = n. We will show that there exist H ⊆ ∆n and t₀ ∈ Ŷ satisfying the following two conditions:

(a) νH(p) = n − d − 1 ; and
(b) H ⊆ Q^ℓ_{t₀}.

This will give

ν_{Q^ℓ_{t₀}}(p) ≥ νH(p) = n − d − 1 .

Clearly, condition (b) above implies p ∈ Q^ℓ_{t₀}. By Lemma 4.6, we then have that

ν_{Q^ℓ_t}(p) = ν_{Q^ℓ_{t₀}}(p) ≥ n − d − 1 ,

thus proving the claim.

The procedure to construct H and t₀ follows. Let (u_m) be a sequence in C such that

〈p, ψ(u_m)〉 = inf_{u∈C} 〈p, ψ(u)〉 + ε_m

for some sequence ε_m ↓ 0. Denote the sequence ψ(u_m) in Sψ by (z_m), and note that the sequence (z_m) is bounded. Hence,

0 ∈ ∂_{ε_m}(〈p, ψ(u_m)〉) ⊆ ∂_{ε_m}(p₁ψ₁(u_m)) + ... + ∂_{ε_m}(p_nψ_n(u_m)) .

Thus, for all y ∈ [n] there exists a w_{m,y} ∈ ∂_{(ε_m/p_y)}(ψ_y(u_m)) such that

Σ_{y∈[n]} p_y w_{m,y} = A_m p = 0 ,

where

A_m = [ w_{m,1} ... w_{m,n} ] ∈ R^{d×n} .


Now, define the set H_m ⊆ ∆n as

H_m = { q ∈ ∆n : A_m q = 0 } = { q ∈ R^n : A_m q = 0, 1ₙᵀq = 1, q ≥ 0 } .

Note that p ∈ H_m for all m. Let q_m ∈ H_m; then

0 = Σ_{y∈[n]} q_{m,y} w_{m,y}
  ∈ Σ_{y∈[n]} q_{m,y} ∂_{(ε_m/p_y)}(ψ_y(u_m))
  = Σ_{y∈[n]} ∂_{(ε_m q_{m,y}/p_y)}(q_{m,y} ψ_y(u_m))
  ⊆ Σ_{y∈[n]} ∂_{ε*_m}(q_{m,y} ψ_y(u_m))
  ⊆ ∂_{n ε*_m}(〈q_m, ψ(u_m)〉) ,

where ε*_m = ε_m · max_y (1/p_y).

Therefore, for all m ∈ N, we have

〈q_m, z_m〉 = 〈q_m, ψ(u_m)〉 ≤ inf_{u∈C} 〈q_m, ψ(u)〉 + n ε*_m = inf_{z∈Sψ} 〈q_m, z〉 + n ε*_m . (4.1)

As m approaches ∞, the above inequality becomes

lim_{m→∞} 〈q_m, z_m〉 ≤ lim_{m→∞} inf_{z∈Sψ} 〈q_m, z〉 . (4.2)

Suppose the sequence (q_m) has a limit q. Since (z_m) is a bounded sequence, we then have

lim_{m→∞} 〈q_m, z_m〉 = lim_{m→∞} 〈(q_m − q), z_m〉 + lim_{m→∞} 〈q, z_m〉 = lim_{m→∞} 〈q, z_m〉 . (4.3)

Also, from Lemma 3.3, the mapping p ↦ inf_{z∈Sψ} 〈p, z〉 is continuous over its domain. Thus,

lim_{m→∞} inf_{z∈Sψ} 〈q_m, z〉 = inf_{z∈Sψ} 〈q, z〉 . (4.4)


Putting Equations (4.2), (4.3) and (4.4) together, we get

lim_{m→∞} 〈q, z_m〉 = inf_{z∈Sψ} 〈q, z〉 . (4.5)

Thus we have that any q ∈ ∆n which is the limit of a sequence of points (q_m) with q_m ∈ H_m is such that q ∈ N_ψ((z_m)).

We will construct a set H all of whose elements can be expressed as limits of sequences of points (q_m) with q_m ∈ H_m. It can be seen that such a set will satisfy condition (b) stated at the beginning of the proof, i.e. H ⊆ Q^ℓ_{t₀}, because H ⊆ N_ψ((z_m)) and by Theorem 3.8 we have that

N_ψ((z_m)) ⊆ Q^ℓ_{t₀} for some t₀ ∈ Ŷ .

If we can ensure that such a set H also satisfies condition (a), i.e. νH(p) = n − d − 1, we are done. The construction of H follows.

Let x_{m,1}, ..., x_{m,n−d−1} be an orthonormal set of vectors in R^n such that

H_m ⊇ (span({x_{m,1}, ..., x_{m,n−d−1}}) + p) ∩ ∆n .

Such a sequence always exists by the construction of H_m. As (x_{m,1}, ..., x_{m,n−d−1}) takes values in a bounded subset of R^{n(n−d−1)} for all m ∈ N, there is a subsequence converging to some (x₁, ..., x_{n−d−1}); we restrict our attention to this subsequence. It can be seen that x₁, ..., x_{n−d−1} also form n − d − 1 orthonormal vectors. Let

H = (span({x₁, ..., x_{n−d−1}}) + p) ∩ ∆n .

It can be seen that this set H ⊆ ∆n is such that every element q ∈ H can be expressed as the limit of some sequence (q_m) with q_m ∈ H_m. Also, as p ∈ relint(∆n), it follows directly from the definition of the feasible subspace dimension ν that νH(p) = n − d − 1.

Case 2: p ∉ relint(∆n).


For each b ∈ {0, 1}ⁿ \ {0}, define

P_b = { q ∈ ∆n : q_y > 0 ⟺ b_y = 1 } .

Clearly, the collection { P_b : b ∈ {0, 1}ⁿ \ {0} } forms a partition of ∆n. Moreover, for b = 1ₙ (the all-ones vector), we have

P_{1ₙ} = { q ∈ ∆n : q_y > 0 ∀y ∈ [n] } = relint(∆n) .

Therefore, we have p ∈ P_b for some b ∈ {0, 1}ⁿ \ {0, 1ₙ}, with ‖p‖₀ = ‖b‖₀. Now, define ψᵇ : C → R^{‖b‖₀}_+, Lᵇ ∈ R^{‖b‖₀×k}_+, and pᵇ ∈ ∆_{‖b‖₀} as the projections of ψ, L and p onto the ‖b‖₀ coordinates {y : b_y = 1}, so that ψᵇ(u) contains the elements of ψ(u) corresponding to the coordinates {y : b_y = 1}, the columns ℓᵇ_t of Lᵇ contain the elements of the columns ℓ_t of L corresponding to the same coordinates, and pᵇ contains the strictly positive elements of p. Since ψ is ℓ-calibrated, we have that ψᵇ is ℓᵇ-calibrated. Moreover, by construction, we have pᵇ ∈ relint(∆_{‖b‖₀}). Therefore, by Case 1 above, we have

d ≥ ‖b‖₀ − ν_{Q^{ℓᵇ}_t}(pᵇ) − 1 .

The claim follows since ν_{Q^{ℓᵇ}_t}(pᵇ) ≤ ν_{Q^ℓ_t}(p).

The above lower bound allows us to calculate precisely the CC dimension of the 0-1 loss:

Example 4.2 (CC dimension of 0-1 loss). Let Y = Ŷ = [n], and consider the 0-1 loss ℓ0-1 : Y × Ŷ → R+ as defined in Example 2.8. Let Q^{0-1}_t denote the trigger probability set associated with t ∈ Ŷ. Take p = (1/n, ..., 1/n)ᵀ ∈ ∆n. Then p ∈ Q^{0-1}_t for all t ∈ Ŷ (see Figure 3.1a); in particular, we have p ∈ Q^{0-1}_1. Now, Q^{0-1}_1 can be written as

Q^{0-1}_1 = { q ∈ ∆n : q₁ ≥ q_y ∀y ∈ {2, ..., n} }
          = { q ∈ R^n : [−1_{n−1} | I_{n−1}] q ≤ 0, −q ≤ 0, 1ₙᵀq = 1 } ,

where I_{n−1} denotes the (n − 1) × (n − 1) identity matrix. Moreover, we have [−1_{n−1} | I_{n−1}] p = 0 and −p < 0. Therefore, by Lemma 4.4, we have

ν_{Q^{0-1}_1}(p) = nullity( [ −1_{n−1}  I_{n−1} ; 1ₙᵀ ] )
               = nullity of the n × n matrix

⎡ −1  1  0  ⋯  0 ⎤
⎢ −1  0  1  ⋯  0 ⎥
⎢  ⋮             ⎥
⎢ −1  0  0  ⋯  1 ⎥
⎣  1  1  1  ⋯  1 ⎦ ,

which equals 0. Moreover, ‖p‖₀ = n. Thus by Theorem 4.5, we have CCdim(ℓ0-1) ≥ n − 1. Combined with the upper bound of Lemma 4.1, this gives CCdim(ℓ0-1) = n − 1.
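The nullity computation in this example is straightforward to reproduce; the following sketch (our own check, with n = 5 arbitrary) stacks the active-constraint rows of Lemma 4.4 for Q^{0-1}_1 at the uniform p and confirms that the null space is trivial:

```python
import numpy as np

n = 5
# active inequality block [-1_{n-1} | I_{n-1}] and the equality row 1_n^T
A1 = np.hstack([-np.ones((n - 1, 1)), np.eye(n - 1)])
A = np.vstack([A1, np.ones((1, n))])

nullity = n - np.linalg.matrix_rank(A)
print(nullity)   # 0, so nu_{Q_1}(p) = 0 and CCdim(l_0-1) >= n - 1
```

The all-ones row is not in the row space of the first n − 1 rows, so the stacked matrix has full rank n and nullity 0, as computed in the example.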

4.4 Tightness of Bounds

The upper and lower bounds above are not necessarily tight in general. For example, for the n-class absolute difference loss of Example 2.9 used in ordinal regression, we know that CCdim(ℓabs) = 1; however, the upper bound of Theorem 4.2 only gives CCdim(ℓabs) ≤ n − 1. Similarly, for the n-class abstain loss of Example 2.10, it can be shown that CCdim(ℓ(?)) ≤ ⌈log₂(n)⌉ (in fact we conjecture CCdim(ℓ(?)) = ⌈log₂(n)⌉), whereas the upper bound of Theorem 4.2 gives CCdim(ℓ(?)) ≤ n, and the lower bound of Theorem 4.5 yields only CCdim(ℓ(?)) ≥ 1. However, as we show below, for certain losses ℓ, the bounds of Theorems 4.2 and 4.5 are in fact tight (up to an additive constant of 1).

Theorem 4.7. Let ℓ : Y × Ŷ → R+. If ∃p ∈ relint(∆n), c ∈ R+ such that 〈p, ℓ_t〉 = c ∀t ∈ Ŷ, then

CCdim(ℓ) ≥ affdim(L) − 1 .

Proof. Since 〈p, ℓ_t〉 = c ∀t ∈ Ŷ, we have p ∈ Q^ℓ_t ∀t ∈ Ŷ. In particular, we have p ∈ Q^ℓ_1. Now,

Q^ℓ_1 = { q ∈ R^n : [ (ℓ₂ − ℓ₁)ᵀ ; ⋯ ; (ℓ_k − ℓ₁)ᵀ ] q ≥ 0, −q ≤ 0, 1ₙᵀq = 1 } .

Moreover, [ (ℓ₂ − ℓ₁)ᵀ ; ⋯ ; (ℓ_k − ℓ₁)ᵀ ] p = 0 and −p < 0. Therefore, by Lemma 4.4, we have

ν_{Q^ℓ_1}(p) = nullity( [ (ℓ₂ − ℓ₁)ᵀ ; ⋯ ; (ℓ_k − ℓ₁)ᵀ ; 1ₙᵀ ] )
            = n − rank( [ (ℓ₂ − ℓ₁)ᵀ ; ⋯ ; (ℓ_k − ℓ₁)ᵀ ; 1ₙᵀ ] )
            ≤ n − affdim(L) .

Since p ∈ relint(∆n), we have ‖p‖₀ = n; applying Theorem 4.5 at p, we immediately get

CCdim(ℓ) ≥ affdim(L) − 1 .

Thus, for a certain family of losses, we have a lower bound on the CC dimension that matches the upper bound of Theorem 4.2 up to an additive constant of 1.

A particularly useful application of Theorem 4.7 is to loss matrices L whose columns ℓ_t can be obtained from one another by permuting entries:

Corollary 4.8. Let L ∈ R^{n×k}_+ be such that all columns of L can be obtained from one another by permuting entries, i.e. ∀t₁, t₂ ∈ Ŷ, ∃σ ∈ Πn such that ℓ_{y,t₂} = ℓ_{σ(y),t₁} ∀y ∈ Y. Then

CCdim(ℓ) ≥ affdim(L) − 1 .

Proof. Let p = (1/n, ..., 1/n)ᵀ ∈ relint(∆n), and let c = ‖ℓ₁‖₁ / n. Then, under the given condition, 〈p, ℓ_t〉 = c ∀t ∈ Ŷ. The result then follows from Theorem 4.7.

4.5 Applications in Subset Ranking

We now consider applications of the CC dimension framework in analyzing various subset ranking problems, where each instance x ∈ X consists of a query together with a set of r documents (for simplicity, r ∈ N here is fixed), and the goal is to learn a prediction model which, given such an instance, predicts a ranking (permutation) of the r documents [25]. We consider four popular losses used for subset ranking: the precision@q (P@q) loss, the normalized discounted cumulative gain (NDCG) loss, the pairwise disagreement (PD) loss, and the mean average precision (MAP) loss.⁴ Each of these subset ranking losses can be viewed as a specific type of multiclass loss acting on a certain label space Y and prediction space Ŷ. In particular, for the P@q loss and the MAP loss, Y contains r-dimensional binary relevance vectors in {0, 1}^r; for the NDCG loss, Y contains r-dimensional multi-valued relevance vectors; for the PD loss, Y contains directed acyclic graphs on r nodes. In each case, the prediction space Ŷ is the set of permutations of r objects: Ŷ = Πr. As a convention, a permutation σ : [r] → [r] in Πr represents the ranking in which object i is ranked in position σ(i).

We study the convex calibration dimension of the above losses in this section. Specifically, we show that the CC dimensions of the NDCG and Precision@q losses are upper bounded by r (Sections 4.5.1 and 4.5.2), and that the CC dimensions of both the PD and MAP losses are lower- and upper-bounded by quadratic functions of r (Sections 4.5.3 and 4.5.4). Our result on the CC dimension of the NDCG loss agrees with previous results in the literature showing the existence of r-dimensional convex calibrated surrogates for NDCG [11, 83]; our results on the CC dimensions of the PD and MAP losses strengthen previous results of Calauzenes et al. [17], who showed the non-existence of r-dimensional convex calibrated surrogates (with a fixed argsort predictor⁵) for PD and MAP.

4.5.1 Precision @ q

Precision@q (see Example 2.12) is a standard evaluation metric used in ranking tasks [68]. Here the label space is Y = {0, 1}^r and the prediction space is Ŷ = Πr. The loss on

⁴Note that P@q, NDCG and MAP are generally expressed as gains, where a higher value corresponds to better performance; we can express them as non-negative losses by subtracting them from a suitable constant.

⁵The argsort predictor argsort : R^r → Πr is such that σ = argsort(u) ‘sorts’ the vector u in descending order, i.e. u_{σ⁻¹(1)}, u_{σ⁻¹(2)}, ..., u_{σ⁻¹(r)} is a decreasing sequence.


predicting a permutation σ ∈ Πr when the true label is y ∈ {0, 1}^r is given by

ℓP@q(y, σ) = 1 − (1/q) Σ_{i=1}^q y_{σ⁻¹(i)} .

Clearly, affdim(LP@q) ≤ r, and therefore by Theorem 4.2, we have

CCdim(ℓP@q) ≤ r .

Thus there exist r-dimensional convex ℓP@q-calibrated surrogates; we give such surrogates along with predictors in Chapter 5.
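As a sanity check (our own illustration, with r = 4 and q = 2 chosen arbitrarily), one can enumerate the full P@q loss matrix and verify affdim(LP@q) ≤ r numerically:

```python
import itertools
import numpy as np

r, q = 4, 2
labels = list(itertools.product([0, 1], repeat=r))
perms = list(itertools.permutations(range(1, r + 1)))   # s[i] = 1-indexed position of object i

def p_at_q_loss(y, sigma):
    # sigma_inv[pos] = index of the object ranked at position pos
    sigma_inv = {pos: i for i, pos in enumerate(sigma)}
    return 1.0 - sum(y[sigma_inv[pos]] for pos in range(1, q + 1)) / q

L = np.array([[p_at_q_loss(y, s) for s in perms] for y in labels])
affdim = np.linalg.matrix_rank(L - L[:, [0]])
print(affdim)   # at most r
assert affdim <= r
```

Here the loss is linear in the label vector y, which is why the column differences of the 2^r × r! matrix span a space of dimension at most r.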

4.5.2 Normalized Discounted Cumulative Gain (NDCG)

The NDCG loss is widely used in information retrieval applications [51]. Here Y is the set of r-dimensional relevance vectors with (say) s relevance levels, Y = {0, 1, ..., s−1}^r, and Ŷ is the set of permutations of r objects, Ŷ = Πr (thus, here n = |Y| = s^r and k = |Ŷ| = r!). The loss on predicting a permutation σ ∈ Πr when the true label is y ∈ {0, 1, ..., s−1}^r is given by

ℓNDCG(y, σ) = 1 − (1/z(y)) Σ_{i=1}^r (2^{y_i} − 1) / log₂(σ(i) + 1) ,

where z(y) is a normalizer that ensures the loss is non-negative and depends only on y. The NDCG loss can therefore be viewed as a multiclass loss matrix LNDCG ∈ R^{s^r × r!}_+. Clearly, affdim(LNDCG) ≤ r, and therefore by Theorem 4.2, we have

CCdim(ℓNDCG) ≤ r .

Indeed, previous results in the literature [11, 83] have shown the existence of r-dimensional convex calibrated surrogates for NDCG; we also give such surrogates along with predictors in Chapter 5.
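For concreteness, the NDCG loss can be computed as follows (our own sketch; we take z(y) to be the DCG of an ideal ranking of y, a standard choice under which the minimal loss is 0):

```python
import math

def dcg(y, sigma):
    """sum_i (2^{y_i} - 1) / log2(sigma(i) + 1); sigma[i] = 1-indexed position of doc i."""
    return sum((2 ** y[i] - 1) / math.log2(sigma[i] + 1) for i in range(len(y)))

def ndcg_loss(y, sigma):
    # ideal ranking: sort documents by decreasing relevance
    order = sorted(range(len(y)), key=lambda i: -y[i])
    ideal = [0] * len(y)
    for pos, doc in enumerate(order):
        ideal[doc] = pos + 1
    return 1.0 - dcg(y, sigma) / dcg(y, ideal)

y = (2, 0, 1)                      # s = 3 relevance levels, r = 3 documents
print(ndcg_loss(y, (1, 3, 2)))     # an ideal ranking: 0.0
print(ndcg_loss(y, (3, 2, 1)))     # a bad ranking: positive loss
```

Note that the normalizer depends only on y, consistent with the affine-dimension argument above.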


4.5.3 Pairwise Disagreement (PD)

The pairwise disagreement loss is a natural loss for ranking, based on how many pairs of documents are ordered incorrectly, and is one of the prevalent criteria for evaluating rankings [24, 28, 37]. In its most general version, the label space Y is the set of all directed acyclic graphs (DAGs) on r vertices, which we shall denote as Gr; for each directed edge (i, j) in a graph G ∈ Gr associated with an instance x ∈ X, the i-th document in the document set in x is preferred over the j-th document. The prediction space Ŷ is again the set of permutations of r objects, Ŷ = Πr. The loss on predicting a permutation σ ∈ Πr when the true label is G ∈ Gr is given by

ℓPD(G, σ) = Σ_{(i,j)∈G} 1(σ(i) > σ(j))

          = Σ_{i=1}^r Σ_{j=1}^r 1((i, j) ∈ G) · 1(σ(i) > σ(j))

          = Σ_{i=1}^r Σ_{j=1}^{i−1} [ 1((i, j) ∈ G) · 1(σ(i) > σ(j)) + 1((j, i) ∈ G) · (1 − 1(σ(i) > σ(j))) ]

          = Σ_{i=1}^r Σ_{j=1}^{i−1} ( 1((i, j) ∈ G) − 1((j, i) ∈ G) ) · 1(σ(i) > σ(j)) + Σ_{i=1}^r Σ_{j=1}^{i−1} 1((j, i) ∈ G) .

The PD loss can be viewed as a multiclass loss matrix LPD ∈ R^{|Gr| × r!}_+. Note that the second term in the sum above depends only on the label G; removing this term amounts to simply subtracting a fixed vector from each column of the loss matrix, and the remaining first term is a sum of r(r−1)/2 rank-1 matrices. Hence LPD clearly satisfies

affdim(LPD) ≤ r(r − 1)/2 .

Therefore, by Theorem 4.2, we have

CCdim(ℓPD) ≤ r(r − 1)/2 .

In fact, one can also give tight lower bounds for CCdim(ℓPD) using the following proposition.

Proposition 4.9. rank(LPD) ≥ r(r − 1)/2.


Proof. Consider the (r(r−1)/2) × r! sub-matrix of the loss matrix LPD with rows corresponding to graphs consisting of single directed edges (i, j) with i < j. Let us denote this matrix by L̄PD, and its entry corresponding to the graph with the single directed edge (i, j) and permutation σ by L̄PD((i, j), σ). We show that the rank of L̄PD is at least r(r−1)/2 by showing that the rows of L̄PD are linearly independent.

To see this, assume the contrary: some row of L̄PD, say the one corresponding to the edge (a, b) with a, b ∈ [r], a < b, can be written as a linear combination of the other rows, i.e.

L̄PD((a, b), σ) = Σ_{i=1}^r Σ_{j=i+1}^r c_{(i,j)} L̄PD((i, j), σ) ∀σ ∈ Πr , (4.6)

for some coefficients c_{(i,j)} ∈ R with c_{(a,b)} = 0.

Consider two permutations σ₁, σ₂ ∈ Πr that place a and b in adjacent positions and differ only by swapping them, i.e.

σ₁(b) = σ₁(a) + 1, σ₂(a) = σ₁(b), σ₂(b) = σ₁(a), and σ₁(i) = σ₂(i) ∀i ∈ [r] \ {a, b} .

Since a and b occupy adjacent positions, for the columns corresponding to these two permutations all entries in rows other than (a, b) are identical, but the entries in row (a, b) differ, i.e.

L̄PD((i, j), σ₁) = L̄PD((i, j), σ₂) ∀i, j ∈ [r], i < j, (i, j) ≠ (a, b) , (4.7)
L̄PD((a, b), σ₁) ≠ L̄PD((a, b), σ₂) . (4.8)

Applying Equation (4.6) for σ = σ₁ and σ = σ₂ along with Equation (4.7), we get

L̄PD((a, b), σ₁) = L̄PD((a, b), σ₂) ,

which contradicts Equation (4.8). Thus the rows of L̄PD are linearly independent. This gives us

rank(LPD) ≥ r(r − 1)/2 .
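Proposition 4.9 can also be verified numerically for small r (our own check; r = 4 is arbitrary) by building the single-edge sub-matrix directly:

```python
import itertools
import numpy as np

r = 4
perms = list(itertools.permutations(range(1, r + 1)))    # s[i] = 1-indexed position of object i
edges = [(i, j) for i in range(r) for j in range(i + 1, r)]

# row (i,j), column sigma: 1(sigma(i) > sigma(j))
M = np.array([[1.0 if s[i] > s[j] else 0.0 for s in perms] for (i, j) in edges])
print(np.linalg.matrix_rank(M))   # r(r-1)/2 = 6: the rows are linearly independent
```

The computed rank equals the number of rows, r(r−1)/2, matching the linear-independence argument in the proof.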


Moreover, it is easy to see that the columns of LPD can all be obtained from one another by permuting entries. Therefore, by Corollary 4.8, we also have

CCdim(ℓPD) ≥ r(r − 1)/2 − 2 .

This strengthens previous results of Duchi et al. [34] and Calauzenes et al. [17]. In

particular, Duchi et al. [34] showed that certain popular r-dimensional convex surrogates

are not calibrated for the PD loss, and conjectured that such convex calibrated surrogates

(in r dimensions) do not exist; Calauzenes et al. [17] showed that indeed there do not

exist any r-dimensional convex surrogates along with argsort as the predictor that are

calibrated w.r.t. the PD loss. The above result allows us to go further and conclude that

in fact, one cannot design convex calibrated surrogates for the PD loss in any prediction

space of less than r(r−1)/2 − 2 dimensions (regardless of the predictor used).

Many popular ranking algorithms, such as RankBoost [37], RankNet [13], LambdaRank [14] and RankSVM [48, 54], are surrogate-minimizing algorithms that minimize an r-dimensional convex surrogate and simply use argsort to predict a ranking from the r-dimensional score vector. The above negative result immediately tells us that none of these surrogates is calibrated w.r.t. the PD loss. Hence, if the PD loss is the loss of interest and no distributional assumptions can be made, completely new surrogates (and algorithms) are needed to achieve consistency.

On the other hand, the upper bound also indicates that there do exist convex calibrated surrogates for the PD loss in r(r−1)/2 dimensions; we give such surrogates along with predictors in Chapter 5.

4.5.4 Mean Average Precision (MAP)

The Mean Average Precision (MAP) is another widely used evaluation metric in ranking [68]. Here the label space Y is the set of all non-zero r-dimensional binary relevance vectors, Y = {0, 1}^r \ {0}, and the prediction space Ŷ is again the set of permutations of r objects, Ŷ = Πr. The loss on predicting a permutation σ ∈ Πr when the true label is y ∈ {0, 1}^r \ {0} is given by

ℓMAP(y, σ) = 1 − (1/‖y‖₁) Σ_{i : y_i = 1} (1/σ(i)) Σ_{j=1}^{σ(i)} y_{σ⁻¹(j)}

           = 1 − (1/‖y‖₁) Σ_{i=1}^r Σ_{j=1}^i y_{σ⁻¹(i)} y_{σ⁻¹(j)} / i

           = 1 − (1/‖y‖₁) Σ_{i=1}^r Σ_{j=1}^i y_i y_j / max(σ(i), σ(j)) .   (4.9)

Thus the MAP loss can be viewed as a multiclass loss matrix LMAP ∈ R^{(2^r − 1) × r!}_+. Clearly,

affdim(LMAP) ≤ r(r + 1)/2 ,

and therefore by Theorem 4.2, we have

CCdim(ℓMAP) ≤ r(r + 1)/2 .

One can also show the following lower bound on the rank of LMAP:

Proposition 4.10. rank(LMAP) ≥ r(r − 1)/2 − 2.

Proof. By Equation (4.9), the (2^r − 1) × r! loss matrix LMAP = L_r can be written as

L_r = 1_{2^r − 1} 1_{r!}ᵀ − A_r B_r ,

where for any a ∈ N, 1_a ∈ R^a is the all-ones vector, A_r is a (2^r − 1) × r(r+1)/2 matrix, and B_r is an r(r+1)/2 × r! matrix.

Let us index the rows of A_r by labels y ∈ Y = {0, 1}^r \ {0_r} and its columns by pairs (i, j) with i, j ∈ [r], i ≤ j. Similarly, index the rows of B_r by pairs (i, j) with i, j ∈ [r], i ≤ j, and its columns by permutations σ ∈ Ŷ = Πr. The entries of A_r are given by

A_r(y, (i, j)) = y_i y_j / Σ_{γ=1}^r y_γ ,

and the entries of B_r are given by

B_r((i, j), σ) = 1 / max(σ(i), σ(j)) .

By Lemmas 4.11 and 4.12 below, we have

rank(A_r) ≥ r(r + 1)/2 − 1  and  rank(B_r) ≥ r(r − 1)/2 .

Hence

rank(L_r) = rank(1_{2^r − 1} 1_{r!}ᵀ − A_r B_r)
          ≥ rank(A_r B_r) − 1
          ≥ rank(B_r) − 2
          ≥ r(r − 1)/2 − 2 ,

where the next-to-last inequality follows from the observation that A_r is away from full (column) rank by at most 1.

Lemma 4.11. rank(A_r) ≥ r(r + 1)/2 − 1.

Proof. Consider the set of 2^r-dimensional vectors

{ v_α ∈ R^{2^r} : v_α(y) = ∏_{i∈α} y_i, α ⊆ [r] } ,

where v_α(y) denotes the element of v_α indexed by y ∈ {0, 1}^r. It is easy to see that this set of vectors forms a basis of R^{2^r}. A_r can be constructed by putting alongside r(r+1)/2 distinct elements from this set as column vectors, deleting the row corresponding to y = (0, ..., 0), and dividing the elements of every row corresponding to y by Σ_{γ=1}^r y_γ. Thus

rank(A_r) ≥ r(r + 1)/2 − 1 .

Lemma 4.12. rank(B_r) ≥ r(r − 1)/2.

Proof. It can be seen that Br−1 appears as a sub-matrix in Br by taking all the rows

(i, j) such that i, j ∈ [r], i ≤ j and j 6= r and all the columns σ such that σ(r) = r.

The matrix Br can be decomposed as

Br =

Br−1 D

C E

.

The details of this sub-division can be summarized as follows:

Υ = σ ∈ Πr : σ(r) = r Ω = σ ∈ Πr : σ(r) < r

Γ = (i, j) ∈ [r]× [r] : i ≤ j < r Br−1 D

Λ = (i, j) ∈ [r]× [r] : i ≤ j = r C E

Any entry in the matrix C has the form 1max(σ(i),σ(j))

with i ≤ j = r and σ such that

σ(r) = r. Thus all entries in C are the same and equal to 1r

and hence, the rows of C

span only an 1-dimensional space.

We now show that there are r − 1 linearly independent columns in E. Consider any permutations σ_1, σ_2, . . . , σ_{r−1} in the set Ω such that σ_j(j) = r and σ_j(r) = r − 1. Such permutations clearly exist. The sub-matrix of E of size r × (r − 1) corresponding to these columns is given by

           σ_1        σ_2        . . .  σ_{r−1}
(1, r)     1/r        1/(r−1)    . . .  1/(r−1)
(2, r)     1/(r−1)    1/r        . . .  1/(r−1)
  ⋮          ⋮          ⋮          ⋱       ⋮
(r−1, r)   1/(r−1)    1/(r−1)    . . .  1/r
(r, r)     1/(r−1)    1/(r−1)    . . .  1/(r−1)

where σ_j denotes any permutation with σ_j(j) = r and σ_j(r) = r − 1. In other words, excluding the last row of the above sub-matrix, one gets a square matrix with diagonal entries equal to 1/r and off-diagonal entries equal to 1/(r−1). The last row is the constant vector, with all entries taking the value 1/(r−1). Clearly, this sub-matrix has rank r − 1.

Also, note that the span of the r − 1 column vectors of this sub-matrix does not intersect the column space of C non-trivially, i.e. it does not contain the all-ones vector. This implies that the columns of B_r given by the permutations σ_1, σ_2, . . . , σ_{r−1} yielding the linearly independent columns of E, together with the columns of B_r given by permutations yielding linearly independent columns of B_{r−1}, are linearly independent. Thus

rank(B_r) ≥ rank(B_{r−1}) + r − 1 .

Trivially, rank(B_1) ≥ 0. Thus

rank(B_r) ≥ r(r − 1)/2 .

Again, it is easy to see that the columns of L_MAP can all be obtained from one another by permuting entries, and therefore by Corollary 4.8 we have

CCdim(ℓ_MAP) ≥ r(r − 1)/2 − 4 .

This again strengthens a previous result of Calauzenes et al. [17], who showed that there do not exist any r-dimensional convex surrogates that use argsort as the predictor and are calibrated for the MAP loss. As with the PD loss, the above result allows us to go further and conclude that, in fact, one cannot design convex calibrated surrogates for the MAP loss in any prediction space of less than r(r − 1)/2 − 4 dimensions (regardless of the predictor used).

Once again, the upper bound indicates that there do exist convex calibrated surrogates for the MAP loss in r(r + 1)/2 dimensions; we give such surrogates along with predictors in Chapter 5.

Chapter 5

Generic Rank Dimensional

Calibrated Surrogates

In Chapter 4 we saw that for every loss matrix `, there exists a convex surrogate with

surrogate dimension affdim(L) that is `-calibrated. Even though the proof of existence

was constructive in nature, the surrogate constructed was not practical due to the very

complicated nature of its domain C, and also due to there being no explicit construction

of the predictor pred. We rectify these shortcomings in this chapter by constructing a

simple surrogate and an explicit predictor.

5.1 Chapter Organization

We begin by studying a class of smooth surrogates known as proper losses that are used

for probability estimation in Section 5.2. We then use these proper losses to construct

generic `-calibrated surrogates ψ with surrogate dimension at most affdim(L), and go on

to show excess risk bounds relating the surrogate ψ and loss ` in Section 5.3. We also

give generalizations of the Tsybakov conditions [102] to the general multiclass problem

given by loss matrix `, and show that one can get better excess risk bounds when the

distribution satisfies such a condition in Section 5.4. Finally, we give several examples of

loss matrices used in ranking and multilabel prediction, which have huge (combinatorial)


Chapter 5. Generic rank-dimensional calibrated surrogates 75

prediction and label spaces but small rank, and give specific instantiations of the generic

calibrated surrogate for these losses in Section 5.5.

5.2 Strongly Proper Composite Losses

Proper losses are a classic tool for binary probability estimation in statistics [88, 89, 91],

and have gained significance in machine learning as a very powerful tool in recent years

[1, 12, 84, 85, 103].

Definition 5.1 (Proper loss). A binary surrogate loss φ : {1, 2} × [0, 1]→R_+ is called a proper loss if for all p ∈ ∆_2

⟨p, φ(p_1)⟩ = inf_{u∈[0,1]} ⟨p, φ(u)⟩ .

Equivalently, for all p ∈ ∆_2

reg^φ_p(p_1) = 0 .

Proper losses with the additional property of having unique minimizers are called strictly proper losses.

Definition 5.2 (Strictly proper loss). A binary surrogate loss φ : {1, 2} × [0, 1]→R_+ is called a strictly proper loss if for all p ∈ ∆_2 and η ∈ [0, 1] with η ≠ p_1,

⟨p, φ(p_1)⟩ < ⟨p, φ(η)⟩ .

Equivalently, for all p ∈ ∆_2

reg^φ_p(p_1) = 0 and reg^φ_p(η) > 0 ∀ η ≠ p_1 .

Another interesting subclass of proper losses, called strongly proper losses, was defined by Agarwal [1] and will serve a crucial purpose in our exposition.

Table 5.1: Example strongly proper composite losses ρ together with the constituent proper loss φ, link function λ, and strong properness parameter γ, from Agarwal [1].

Loss        | V      | ρ(1, v)        | ρ(2, v)       | φ(1, η)      | φ(2, η)      | λ(η)                | γ
Exponential | R      | e^{−v}         | e^{v}         | √((1−η)/η)   | √(η/(1−η))   | (1/2) ln(η/(1−η))   | 4
Logistic    | R      | ln(1 + e^{−v}) | ln(1 + e^{v}) | − ln(η)      | − ln(1 − η)  | ln(η/(1−η))         | 4
Squared     | [0, 1] | (1 − v)^2      | v^2           | (1 − η)^2    | η^2          | η                   | 2

Definition 5.3 (Strongly proper loss). A binary surrogate loss φ : {1, 2} × [0, 1]→R_+ is called a γ-strongly proper loss for some γ ∈ R_+ if for all p ∈ ∆_2 and η ∈ [0, 1]

reg^φ_p(η) = ⟨p, φ(η)⟩ − ⟨p, φ(p_1)⟩ ≥ (γ/2) (η − p_1)^2 .

Proper losses are defined only over [0, 1], but in practice it is convenient to optimize over functions taking values on the entire real line. This is generally taken care of by using link functions.

Definition 5.4 (Proper composite loss). Let V ⊆ R. A binary surrogate loss ρ : {1, 2} × V→R_+ is called a proper composite loss if there exists a proper loss φ : {1, 2} × [0, 1]→R_+ and an invertible link function λ : [0, 1]→V such that for all y ∈ {1, 2} and v ∈ V,

ρ(y, v) = φ(y, λ^{−1}(v)) .

Definition 5.5 (Strongly proper composite loss). Let V ⊆ R. A binary surrogate loss ρ : {1, 2} × V→R_+ is called a γ-strongly proper composite loss if there exists a γ-strongly proper loss φ : {1, 2} × [0, 1]→R_+ and an invertible link function λ : [0, 1]→V such that for all y ∈ {1, 2} and v ∈ V,

ρ(y, v) = φ(y, λ^{−1}(v)) .

Interestingly, many binary surrogates used in practice are strongly proper composite losses. These surrogates1 are summarized in Table 5.1, which is taken from Agarwal [1].

1We use a scaled and shifted version of the standard squared loss for convenience.
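As an illustrative aside (not part of the thesis), the 4-strong properness of the logistic loss in Table 5.1 can be checked numerically: the excess conditional risk of its constituent proper loss φ is the binary KL divergence, which by Pinsker's inequality dominates 2(η − p_1)^2. A minimal sketch:

```python
import numpy as np

def phi_logistic(y, eta):
    # Constituent proper loss of the logistic composite loss (Table 5.1).
    eta = np.clip(eta, 1e-12, 1 - 1e-12)
    return -np.log(eta) if y == 1 else -np.log(1 - eta)

def reg_phi(p1, eta):
    # Excess conditional risk: <p, phi(eta)> - <p, phi(p1)> for p = (p1, 1-p1).
    risk = lambda q: p1 * phi_logistic(1, q) + (1 - p1) * phi_logistic(2, q)
    return risk(eta) - risk(p1)

gamma = 4.0  # strong properness parameter of the logistic loss (Table 5.1)
for p1 in np.linspace(0.05, 0.95, 19):
    for eta in np.linspace(0.01, 0.99, 99):
        # Definition 5.3 with gamma = 4, up to floating-point tolerance.
        assert reg_phi(p1, eta) >= (gamma / 2) * (eta - p1) ** 2 - 1e-9
```

The same grid check with γ = 2 works for the (scaled) squared loss, whose excess risk is exactly (η − p_1)^2.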


5.3 Generic Rank-Dimensional Calibrated Surrogate

We now give a calibrated surrogate for any given loss matrix ℓ, with surrogate dimension affdim(L), that uses strongly proper composite losses as a building block. Moreover, the resulting surrogate is easily shown to be convex if an appropriate strongly proper composite loss is used as the building block. If the loss ℓ is such that affdim(L) = d, then clearly there exist matrices A ∈ [0, 1]^{d×n}, B ∈ R^{d×k} and vectors c ∈ R^n, c̃ ∈ R^k such that

L = A^⊤B + c 1_k^⊤ + 1_n c̃^⊤ .

We make use of such a decomposition in our construction of the calibrated surrogate and predictor.

Theorem 5.1. Let ℓ : Y × Y→R_+. Suppose there exist d ∈ N, vectors a_1, a_2, . . . , a_n ∈ [0, 1]^d and b_1, b_2, . . . , b_k ∈ R^d, and scalars c_1, c_2, . . . , c_n, c̃_1, c̃_2, . . . , c̃_k ∈ R such that

ℓ(y, t) = ⟨a_y, b_t⟩ + c_y + c̃_t .

Let V ⊆ R and let ρ : {1, 2} × V→R_+ be a γ-strongly proper composite loss for some γ > 0 with link function λ : [0, 1]→V. Let the surrogate ψ : Y × V^d→R_+ be given by

ψ(y, u) = ∑_{i=1}^d [ a_{y,i} ρ(1, u_i) + (1 − a_{y,i}) ρ(2, u_i) ]

and let the predictor pred : V^d→Y be such that

pred(u) ∈ argmin_{t∈Y} ⟨λ^{−1}(u), b_t⟩ + c̃_t ,

where [λ^{−1}(u)]_i = λ^{−1}(u_i). Then for all distributions D and functions f : X→V^d,

reg^ℓ_D[pred ∘ f] ≤ 2 max_t ‖b_t‖ √( (2/γ) reg^ψ_D[f] ) .

In particular, (ψ, pred) is ℓ-calibrated.


Proof. Let the matrix A ∈ [0, 1]^{d×n} be given by A = [a_1, a_2, . . . , a_n]. Let the vectors α_1, α_2, . . . , α_d ∈ [0, 1]^n be the column vectors of the matrix A^⊤.

Let p ∈ ∆_n and u ∈ V^d. Then

reg^ℓ_p(pred(u)) = ⟨p, ℓ_{pred(u)}⟩ − min_{t∈Y} ⟨p, ℓ_t⟩
  = ( ⟨Ap, b_{pred(u)}⟩ + c̃_{pred(u)} ) − min_{t∈Y} ( ⟨Ap, b_t⟩ + c̃_t )
  = max_{t∈Y} [ ⟨Ap, b_{pred(u)} − b_t⟩ + c̃_{pred(u)} − c̃_t ]
  = max_{t∈Y} [ ⟨Ap − λ^{−1}(u), b_{pred(u)} − b_t⟩ + ⟨λ^{−1}(u), b_{pred(u)} − b_t⟩ + c̃_{pred(u)} − c̃_t ]
  ≤ max_{t∈Y} ⟨Ap − λ^{−1}(u), b_{pred(u)} − b_t⟩
  ≤ ‖Ap − λ^{−1}(u)‖ · max_{t∈Y} ‖b_{pred(u)} − b_t‖
  ≤ 2 max_t ‖b_t‖ · ‖Ap − λ^{−1}(u)‖ ,        (5.1)

where the first inequality follows from the definition of pred(u).

Let φ : {1, 2} × [0, 1]→R_+ be the constituent γ-strongly proper loss for ρ.

For all i ∈ [d] define q_i ∈ ∆_2 as q_i = [⟨α_i, p⟩, 1 − ⟨α_i, p⟩]^⊤. We have

reg^ψ_p(u) = ⟨p, ψ(u)⟩ − inf_{u′∈V^d} ⟨p, ψ(u′)⟩
  = ∑_{y∈Y} p_y ( ∑_{i=1}^d a_{y,i} ρ(1, u_i) + (1 − a_{y,i}) ρ(2, u_i) ) − inf_{u′∈V^d} ⟨p, ψ(u′)⟩
  = ∑_{i=1}^d ( ⟨α_i, p⟩ ρ(1, u_i) + (1 − ⟨α_i, p⟩) ρ(2, u_i) ) − inf_{u′∈V^d} ⟨p, ψ(u′)⟩
  = ∑_{i=1}^d reg^ρ_{q_i}(u_i)
  = ∑_{i=1}^d reg^φ_{q_i}(λ^{−1}(u_i))
  ≥ ∑_{i=1}^d (γ/2) ( λ^{−1}(u_i) − ⟨α_i, p⟩ )^2
  = (γ/2) ‖Ap − λ^{−1}(u)‖^2 .        (5.2)


Putting Equations (5.1) and (5.2) together we get

reg^ℓ_p(pred(u)) ≤ 2 max_t ‖b_t‖ √( (2/γ) reg^ψ_p(u) ) .

Setting p = p(X), taking expectation over the instance X, and applying Jensen's inequality completes the proof.

If the strongly proper composite loss ρ is convex in its second argument,2 the surrogate ψ given by Theorem 5.1 is convex. Thus we have a convex ℓ-calibrated surrogate with surrogate dimension matching the upper bound of Theorem 4.2. We now give some example instantiations of the above theorem using the strongly proper composite losses from Table 5.1.

Example 5.1 (Logistic surrogate). Let the loss ℓ be as in Theorem 5.1. Let V = R and let the strongly proper composite loss ρ : {1, 2} × R→R_+ be the logistic loss from Table 5.1, given by

ρ(1, v) = ln(1 + e^{−v})
ρ(2, v) = ln(1 + e^{v}) .

The link function λ : [0, 1]→R and the inverse link function λ^{−1} : R→[0, 1] are given by

λ(η) = ln( η / (1 − η) )        λ^{−1}(v) = 1 / (1 + e^{−v}) .

Using the above ρ as the strongly proper composite loss, the ℓ-calibrated surrogate and predictor from Theorem 5.1, ψ : Y × R^d→R_+ and pred : R^d→Y, are such that

ψ(y, u) = ∑_{i=1}^d a_{y,i} ln(1 + e^{−u_i}) + (1 − a_{y,i}) ln(1 + e^{u_i})

pred(u) ∈ argmin_{t∈Y} ⟨λ^{−1}(u), b_t⟩ + c̃_t .

2All the strongly proper composite losses in Table 5.1 are convex in their second argument.


As ρ is a convex binary surrogate and the coefficients a_{y,i} ∈ [0, 1], the surrogate ψ is convex.
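For concreteness, the logistic surrogate and predictor above can be sketched as follows (a sketch with hypothetical array names: a_y is the vector a_y, the columns of B are the b_t, and c_tilde collects the c̃_t of the assumed decomposition):

```python
import numpy as np

def psi_logistic(a_y, u):
    # Example 5.1 surrogate: sum_i a_{y,i} ln(1+e^{-u_i}) + (1-a_{y,i}) ln(1+e^{u_i}).
    return float(np.sum(a_y * np.log1p(np.exp(-u)) + (1 - a_y) * np.log1p(np.exp(u))))

def pred_logistic(u, B, c_tilde):
    # Predictor: argmin_t <lambda^{-1}(u), b_t> + c_tilde_t, with the logistic
    # inverse link lambda^{-1}(v) = 1/(1+e^{-v}) applied componentwise.
    q = 1.0 / (1.0 + np.exp(-u))
    return int(np.argmin(B.T @ q + c_tilde))
```

Here `pred_logistic` returns the index t of the minimizing column; enumerating all k columns is of course only feasible when the prediction space is small or has extra structure.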

Example 5.2 (Least squares surrogate). Let the loss ℓ be as in Theorem 5.1. Let V = [0, 1] and let the strongly proper composite loss ρ : {1, 2} × [0, 1]→R_+ be the squared loss from Table 5.1, given by

ρ(1, v) = (v − 1)^2
ρ(2, v) = v^2 .

The link function λ : [0, 1]→[0, 1] and the inverse link function λ^{−1} : [0, 1]→[0, 1] are both simply the identity function. Using the above ρ as the strongly proper composite loss, the ℓ-calibrated surrogate and predictor from Theorem 5.1, ψ : Y × [0, 1]^d→R_+ and pred : [0, 1]^d→Y, are such that

ψ(y, u) = ∑_{i=1}^d a_{y,i} (u_i − 1)^2 + (1 − a_{y,i}) u_i^2
        = ∑_{i=1}^d (u_i − a_{y,i})^2 + a_{y,i} − a_{y,i}^2

pred(u) ∈ argmin_{t∈Y} ⟨u, b_t⟩ + c̃_t .

As can be seen from the second line above, the surrogate can be simplified by discarding the constant term ∑_{i=1}^d ( a_{y,i} − a_{y,i}^2 ). It is also clear that ψ is a convex surrogate.

One can use a natural extension of the squared loss of Table 5.1 to values outside [0, 1] that also retains convexity, to get the following surrogate, which operates over R^d instead of [0, 1]^d.

Example 5.3 (Extended least squares surrogate). Let ψ : Y × R^d→R_+ and pred : R^d→Y be such that

ψ(y, u) = ∑_{i=1}^d (u_i − a_{y,i})^2

pred(u) ∈ argmin_{t∈Y} ⟨clip(u), b_t⟩ + c̃_t ,


where clip(u) for any u ∈ R^d ‘clips’ the components of u to [0, 1]. One can show that the same excess risk bound as for the least squares surrogate holds in this case as well, i.e. for all distributions D and functions f : X→R^d,

reg^ℓ_D[pred ∘ f] ≤ 2 max_t ‖b_t‖ √( (2/γ) reg^ψ_D[f] ) ,

where γ = 2 is the strong properness parameter of the squared loss. For any f : X→[0, 1]^d the above bound follows from Theorem 5.1. For any other f : X→R^d it follows from the observations below:

reg^ℓ_D[pred ∘ f] = reg^ℓ_D[pred ∘ clip ∘ f]
reg^ψ_D[f] ≥ reg^ψ_D[clip ∘ f] .

Note that any loss matrix L can be written as L = I_n L, and applying the above surrogate with this ‘decomposition’ (i.e. A = I_n, B = L) and d = n gives exactly the same surrogate and predictor as in Lemma 4.1; hence the proof of this excess risk bound can be considered an alternate proof of Lemma 4.1.
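As a quick illustration (a sketch, not from the thesis), plugging the trivial decomposition A = I_n, B = L into the extended least squares surrogate for the 3-class 0-1 loss recovers plug-in classification: the predictor simply picks the class with the largest clipped score.

```python
import numpy as np

n = 3
L = np.ones((n, n)) - np.eye(n)   # 0-1 loss matrix; trivial decomposition A = I_n, B = L
a = np.eye(n)                     # a_y = e_y, the y-th standard basis vector

def psi(y, u):
    # Extended least squares surrogate: sum_i (u_i - a_{y,i})^2 over u in R^n.
    return float(np.sum((u - a[y]) ** 2))

def pred(u):
    # argmin_t <clip(u), b_t>; since <q, L e_t> = sum(q) - q_t, this is argmax of q.
    q = np.clip(u, 0.0, 1.0)
    return int(np.argmin(L.T @ q))
```

With scores approximating class probabilities, `pred` returns the most probable class, as a consistent 0-1 classifier should.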

The squared loss of Table 5.1 can be extended to values outside [0, 1] in another natural

way that retains convexity, giving us another variant of the least squares surrogate which

operates over Rd instead of [0, 1]d. This particular modification is sometimes called the

hinge squared loss.

Example 5.4 (Modified least squares surrogate). Let ψ : Y × R^d→R_+ and pred : R^d→Y be such that

ψ(y, u) = ∑_{i=1}^d a_{y,i} ((1 − u_i)_+)^2 + (1 − a_{y,i}) ((1 + u_i)_+)^2

pred(u) ∈ argmin_{t∈Y} (1/2) ⟨clip(u) + 1_d, b_t⟩ + c̃_t ,

where clip(u) for any u ∈ R^d ‘clips’ the components of u to [−1, 1]. In exactly the same way as for the extended least squares surrogate, one can show that the same excess risk bound as for the least squares surrogate holds in this case as well.


5.4 Generalized Tsybakov Conditions

Tsybakov [102] proposed conditions on the distribution of binary data limiting the amount of ‘noise’ in the data. More specifically, the conditional probability p(X) is constrained to not have too much mass near the most noisy value of 1/2. Bartlett et al. [7] showed that under these conditions one can get better excess risk bounds for many binary classification surrogates, such as the logistic loss and exponential loss. Chen and Sun [20] gave a noise condition generalizing the Tsybakov conditions to the case of the multiclass zero-one loss and showed better excess risk bounds for certain surrogates under those conditions. We generalize these conditions further to any general multiclass loss, and show that one can get better excess risk bounds for many types of calibrated surrogates, including those defined in Theorem 5.1, under these conditions.

We will first define certain quantities that will serve a crucial purpose.

Definition 5.6. For any vector v ∈ R^a, define the smallest value sm(v) and second smallest value ssm(v) as

sm(v) = min_{i∈[a]} v_i
ssm(v) = min_{i∈[a]: v_i > sm(v)} v_i .

If all the elements of the vector v are identical, then ssm(v) takes the value +∞.
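In code, sm and ssm are one-liners (a direct transcription of Definition 5.6):

```python
def sm(v):
    # Smallest value of the vector v.
    return min(v)

def ssm(v):
    # Smallest value strictly larger than sm(v); +infinity if all entries are equal.
    larger = [x for x in v if x > min(v)]
    return min(larger) if larger else float("inf")
```

Note that ssm picks the second smallest *distinct* value: ssm([1, 1, 2]) is 2, not 1.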

We now give our generalization of the Tsybakov conditions to general losses ℓ. Informally, it says that the fraction of instances with ‘difficult to decide’ conditional probabilities is small. The difficult to decide conditional probabilities are exactly those p ∈ ∆_n for which the smallest expected risk sm(L^⊤p) and the second smallest expected risk ssm(L^⊤p) are close.

Definition 5.7. Let ℓ : Y × Y→R_+. Let D be a distribution over X × Y with marginal μ over X , and with the distribution over Y conditioned on X = x given by p(x). Then D is said to satisfy the ℓ-Tsybakov condition with exponent α ∈ [0,∞) and constant c > 0 if for all s ≥ 0

P_{X∼μ}( ssm(L^⊤p(X)) − sm(L^⊤p(X)) ≤ s ) ≤ c s^α .

Figure 5.1: The trigger probabilities for the 3-class zero-one loss, with the ‘difficult to decide’ probabilities shaded in darker colors.

Note that any distribution D satisfies the ℓ-Tsybakov condition with exponent α = 0. Higher values of α correspond to stricter conditions. An illustration of the region {p ∈ ∆_n : ssm(L^⊤p) − sm(L^⊤p) ≤ s} for the 3-class zero-one loss can be seen in Figure 5.1.

We will need the following lemma, which bounds the probability of a classifier h being ‘wrong’ by a function of reg^ℓ_D[h].

Lemma 5.2. Let ℓ : Y × Y→R_+. Let the distribution D over X × Y satisfy the ℓ-Tsybakov condition with exponent α and constant c. Then for all h : X→Y we have

P( h(X) ∉ argmin_t ⟨p(X), ℓ_t⟩ ) ≤ 2 c^{1/(1+α)} ( reg^ℓ_D[h] )^{α/(1+α)} .

Proof. Let h* ∈ argmin_h er^ℓ_D[h]. Fix h : X→Y. We have

reg^ℓ_D[h] = er^ℓ_D[h] − er^ℓ_D[h*]
  = ∫_{x∈X} ( [L^⊤p(x)]_{h(x)} − [L^⊤p(x)]_{h*(x)} ) dμ
  = ∫_{x∈X} ( [L^⊤p(x)]_{h(x)} − sm(L^⊤p(x)) ) dμ
  = ∫_{x∈X : h(x)∉argmin_t⟨p(x),ℓ_t⟩} ( [L^⊤p(x)]_{h(x)} − sm(L^⊤p(x)) ) dμ
  ≥ ∫_{x∈X : h(x)∉argmin_t⟨p(x),ℓ_t⟩} ( ssm(L^⊤p(x)) − sm(L^⊤p(x)) ) dμ .        (5.3)


Let s ∈ R_+. Define sets X_1, X_2 ⊆ X as

X_1 = {x ∈ X : h(x) ∉ argmin_t ⟨p(x), ℓ_t⟩}
X_2 = {x ∈ X : ssm(L^⊤p(x)) − sm(L^⊤p(x)) > s} .

By virtue of the ℓ-Tsybakov condition we have P(X ∈ X_2) ≥ 1 − c s^α, and hence

P(X ∈ X_1 ∩ X_2) ≥ P(X ∈ X_1) − P(X ∉ X_2) ≥ P(X ∈ X_1) − c s^α .

Putting this together with Equation (5.3), we have

reg^ℓ_D[h] ≥ ∫_{x∈X_1} ( ssm(L^⊤p(x)) − sm(L^⊤p(x)) ) dμ
  ≥ ∫_{x∈X_1∩X_2} ( ssm(L^⊤p(x)) − sm(L^⊤p(x)) ) dμ
  ≥ ∫_{x∈X_1∩X_2} s dμ
  = s P(X ∈ X_1 ∩ X_2)
  ≥ s P(X ∈ X_1) − c s^{α+1} .

Setting s = ( reg^ℓ_D[h] / c )^{1/(1+α)} and rearranging terms, we have

P(X ∈ X_1) = P( h(X) ∉ argmin_t ⟨p(X), ℓ_t⟩ ) ≤ 2 c^{1/(1+α)} ( reg^ℓ_D[h] )^{α/(1+α)} .

The following theorem is the main result of this section. We show that if the distribution D satisfies the ℓ-Tsybakov condition with noise exponent α, then a surrogate satisfying an excess risk bound of the form reg^ℓ_D[pred ∘ f] ≤ d ( reg^ψ_D[f] )^β for some β ∈ (0, 1] also satisfies a better excess risk bound reg^ℓ_D[pred ∘ f] ≤ d′ ( reg^ψ_D[f] )^{β′} for some β′ ∈ [β, 1] depending on the noise exponent α. In particular, if α = 0 then β′ = β, and if α = ∞ then β′ = 1.


Theorem 5.3. Let ℓ : Y × Y→R_+. Let the surrogate ψ : Y × C→R_+ and predictor pred : C→Y be such that for all distributions D over X × Y and functions f : X→C,

reg^ℓ_D[pred ∘ f] ≤ d ( reg^ψ_D[f] )^β ,        (5.4)

for some β ∈ (0, 1], d ∈ R_+. Further, let the distribution D over X × Y satisfy the ℓ-Tsybakov condition with exponent α and constant c. Then for all functions f : X→C we have

reg^ℓ_D[pred ∘ f] ≤ d′ ( reg^ψ_D[f] )^{(β+αβ)/(1+αβ)} ,        (5.5)

where d′ = ( 2^{(2−β)(1+α)} c^{1−β} d^{1+α} )^{1/(1+αβ)}.

Proof. Let f : X→C and let h = pred ∘ f. Let s ∈ R_+, and define the sets X_1, X_2 ⊆ X as follows:

X_1 = {x ∈ X : 0 < reg^ℓ_{p(x)}(h(x)) < s}
X_2 = {x ∈ X : reg^ℓ_{p(x)}(h(x)) ≥ s} .

We have

reg^ℓ_D[h] = ∫_{x∈X_1} reg^ℓ_{p(x)}(h(x)) dμ + ∫_{x∈X_2} reg^ℓ_{p(x)}(h(x)) dμ
  ≤ s P(X ∈ X_1) + ∫_{x∈X_2} reg^ℓ_{p(x)}(h(x)) dμ
  ≤ s P(X ∈ X_1) + s^{1−1/β} ∫_{x∈X_2} ( reg^ℓ_{p(x)}(h(x)) )^{1/β} dμ
  ≤ s P( h(X) ∉ argmin_t ⟨p(X), ℓ_t⟩ ) + s^{1−1/β} ∫_{x∈X_2} d^{1/β} reg^ψ_{p(x)}(f(x)) dμ
  ≤ s P( h(X) ∉ argmin_t ⟨p(X), ℓ_t⟩ ) + s^{1−1/β} d^{1/β} reg^ψ_D[f]
  ≤ 2 s c^{1/(1+α)} ( reg^ℓ_D[h] )^{α/(1+α)} + s^{1−1/β} d^{1/β} reg^ψ_D[f] .

The third step in the argument above follows by multiplying the second term by ( reg^ℓ_{p(x)}(h(x)) / s )^{1/β−1}, which is at least 1 for all x ∈ X_2. The fourth step follows by applying the assumed excess risk bound in Equation (5.4). The last step follows from Lemma 5.2.


Setting s = ( 2 c^{1/(1+α)} ( reg^ℓ_D[h] )^{α/(1+α)} )^{−β} ( d^{1/β} reg^ψ_D[f] )^{β} and rearranging terms, we have

reg^ℓ_D[h] ≤ ( 2^{(2−β)(1+α)} c^{1−β} d^{1+α} )^{1/(1+αβ)} ( reg^ψ_D[f] )^{(β+αβ)/(1+αβ)} .

Clearly, the above theorem can be applied to the surrogate and predictor constructed in Theorem 5.1, with β = 1/2.

5.5 Example Applications in Ranking and Multilabel

Prediction

Many practical problems with combinatorial sized label and prediction spaces, especially

in ranking and multilabel prediction, have a low-rank loss matrix and are hence amenable

to construction of efficient convex calibrated surrogates via Theorem 5.1. We give many

such examples in this section.

For the purpose of illustration, when applying Theorem 5.1, we shall use the squared

loss as the strongly proper composite loss and use the extension given in Example 5.3 to

extend the domain.

5.5.1 Subset Ranking

In Section 4.5, we analyzed several loss functions used in subset ranking and derived

bounds on their CC dimension. In this section, we apply Theorem 5.1 to these losses and

construct explicit convex surrogates and predictors calibrated with them.

Example 5.5 (Precision@q – Section 4.5.1). Let Y = {0, 1}^r and Y = Π_r. The Precision@q loss ℓ_{P@q} : Y × Y→R_+ is given as

ℓ_{P@q}(y, σ) = 1 − (1/q) ∑_{i=1}^q y_{σ^{−1}(i)} = 1 − (1/q) ∑_{i=1}^r y_i 1(σ(i) ≤ q) .


Let ψ_{P@q} : Y × R^r→R_+ and pred_{P@q} : R^r→Y be such that

ψ_{P@q}(y, u) = ∑_{i=1}^r (u_i − y_i)^2

pred_{P@q}(u) ∈ argmax_{σ∈Π_r} ∑_{i=1}^r u_i 1(σ(i) ≤ q) .

From Theorem 5.1, we have that (ψ_{P@q}, pred_{P@q}) is ℓ_{P@q}-calibrated. Also, it can easily be seen that pred_{P@q}(u) can be implemented efficiently by sorting the r objects in descending order of their scores u_i.

Note that the popular winner-take-all (WTA) loss, which assigns a loss of 0 if the top-ranked item is relevant (i.e. if y_{σ^{−1}(1)} = 1) and 1 otherwise, is simply a special case of the Precision@q loss with q = 1; therefore the above construction also yields a calibrated surrogate for the WTA loss.
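The sorting implementation of the predictor can be sketched as follows (a direct transcription, with σ represented as an array sigma where sigma[i] is the rank of object i):

```python
import numpy as np

def pred_prec_at_q(u):
    # Sort objects by descending score; the top-scoring object gets rank 1.
    # Any such permutation maximizes sum_i u_i * 1(sigma(i) <= q) for every q,
    # since it places the q largest scores in the top q positions.
    u = np.asarray(u, dtype=float)
    order = np.argsort(-u)                    # object indices, best score first
    sigma = np.empty(len(u), dtype=int)
    sigma[order] = np.arange(1, len(u) + 1)   # sigma[i] = rank assigned to object i
    return sigma
```

For example, scores [0.1, 0.9, 0.5] yield ranks [3, 1, 2]: the highest-scoring object is ranked first.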

Example 5.6 (NDCG – Section 4.5.2). Let Y = {0, 1, . . . , s − 1}^r and Y = Π_r. The loss on predicting a permutation σ ∈ Π_r when the true label is y ∈ {0, 1, . . . , s − 1}^r is given by

ℓ_NDCG(y, σ) = 1 − (1/z(y)) ∑_{i=1}^r (2^{y_i} − 1) / log_2(σ(i) + 1) ,

where z(y) is a normalizer that ensures the loss is non-negative and depends only on y.

Let ψ_NDCG : Y × R^r→R_+ and pred_NDCG : R^r→Y be such that

ψ_NDCG(y, u) = ∑_{i=1}^r ( u_i − (2^{y_i} − 1)/z(y) )^2

pred_NDCG(u) ∈ argmax_{σ∈Π_r} ∑_{i=1}^r u_i / log_2(σ(i) + 1) .

From Theorem 5.1, we have that (ψ_NDCG, pred_NDCG) is ℓ_NDCG-calibrated. Also, it can easily be seen that pred_NDCG(u) can be implemented efficiently by sorting the r objects in descending order of their scores u_i. Surrogates similar to these were proposed and proven to be calibrated w.r.t. the NDCG loss by Buffoni et al. [11], Cossock and Zhang [25], and Ravikumar et al. [83].
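One can sanity-check the sort-based predictor against brute-force enumeration of Π_r for tiny r (a sketch; here z is passed in as a fixed positive normalizer rather than computed from y):

```python
import itertools
import numpy as np

def ndcg_loss(y, sigma, z):
    # 1 - (1/z) * sum_i (2^{y_i} - 1) / log2(sigma(i) + 1), with sigma[i] the rank of i.
    y = np.asarray(y, dtype=float)
    return 1.0 - float(np.sum((2.0 ** y - 1.0) / np.log2(np.asarray(sigma) + 1.0))) / z

def best_permutation_bruteforce(y, z, r):
    # Exhaustive argmin over all r! rank assignments (feasible only for tiny r).
    return min(itertools.permutations(range(1, r + 1)),
               key=lambda s: ndcg_loss(y, s, z))
```

For y = [0, 2, 1] the brute-force minimizer assigns rank 1 to the most relevant object and rank 2 to the next, matching the sort-by-score rule.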


Example 5.7 (Mean Average Precision – Section 4.5.4). Let Y = {0, 1}^r \ {0} and Y = Π_r. The loss on predicting a permutation σ ∈ Π_r when the true label is y ∈ Y is given by

ℓ_MAP(y, σ) = 1 − (1/‖y‖_1) ∑_{i=1}^r ∑_{j=1}^i y_i y_j / max(σ(i), σ(j)) .

Let ψ_MAP : Y × R^{r(r+1)/2}→R_+ and pred_MAP : R^{r(r+1)/2}→Y be such that

ψ_MAP(y, u) = ∑_{i=1}^r ∑_{j=1}^i ( u_{i,j} − y_i y_j / ‖y‖_1 )^2

pred_MAP(u) ∈ argmax_{σ∈Π_r} ∑_{i=1}^r ∑_{j=1}^i u_{i,j} / max(σ(i), σ(j)) .

From Theorem 5.1, we have that (ψ_MAP, pred_MAP) is ℓ_MAP-calibrated. Unfortunately, however, the predictor pred_MAP cannot be simplified and takes time exponential in r to compute. We give ways to tackle this problem in Chapter 6.

Example 5.8 (Pairwise Disagreement – Section 4.5.3). Let Y = G_r, the set of all directed acyclic graphs (DAGs) on r vertices, and let Y = Π_r. The loss on predicting a permutation σ ∈ Π_r when the true label is G ∈ G_r is given by

ℓ_PD(G, σ) = ∑_{i=1}^r ∑_{j=1, j≠i}^r G_{i,j} · 1( σ(i) > σ(j) ) ,

where G_{i,j} = 1( (i, j) ∈ G ). Let ψ_PD : Y × R^{r^2}→R_+ and pred_PD : R^{r^2}→Y be such that

ψ_PD(G, u) = ∑_{i=1}^r ∑_{j=1}^r (u_{i,j} − G_{i,j})^2

pred_PD(u) ∈ argmin_{σ∈Π_r} ∑_{i=1}^r ∑_{j=1}^r u_{i,j} · 1( σ(i) > σ(j) ) .

From Theorem 5.1, we have that (ψ_PD, pred_PD) is ℓ_PD-calibrated.3 However, unfortunately, computing the predictor pred_PD corresponds to the NP-hard problem of feedback

3One can easily construct an r(r − 1)/2-dimensional surrogate that does just as well. We use an r^2-dimensional surrogate for convenience.


arc set, and hence takes time exponential in r to compute. We give ways to tackle this

problem in Chapter 6.
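For very small r, the predictor can still be computed exactly by enumerating Π_r (a sketch; U[i][j] is the score estimate u_{i,j}):

```python
import itertools

def pred_pd_bruteforce(U):
    # argmin over permutations sigma of sum_{i != j} U[i][j] * 1(sigma(i) > sigma(j)).
    # This is exact minimum weighted feedback arc set by enumeration: O(r! r^2) time.
    r = len(U)
    def cost(sigma):
        return sum(U[i][j] for i in range(r) for j in range(r)
                   if i != j and sigma[i] > sigma[j])
    return min(itertools.permutations(range(1, r + 1)), key=cost)
```

Beyond r around 10 or so this is hopeless, which is precisely the motivation for the weaker notions of consistency developed in Chapter 6.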

5.5.2 Multilabel Prediction

In this section we consider the standard micro F-measure used in multilabel prediction, and also propose two new losses that are appropriate for the problem of multilabel prediction with r tags and a graph structure over the r tags representing the similarity of the tags with one another. We then observe that even though the size of the loss matrices is exponential in the number of tags r, the rank of all three loss matrices is either linear or quadratic in r, and they are hence amenable to the construction of efficient convex calibrated surrogates via Theorem 5.1.

Example 5.9 (Micro F-measure in multilabel classification). Consider the popular micro F-measure used in multilabel classification. Let Y = Y = {0, 1}^r. The loss ℓ_F : Y × Y→R_+ is given as

ℓ_F(y, t) = 1 − ( 2 ∑_{i=1}^r y_i t_i ) / ( ‖y‖_1 + ‖t‖_1 ) = 1 − ∑_{j=1}^r ∑_{i=1}^r ( 2 y_i · 1(‖y‖_1 = j) · t_i ) / ( j + ‖t‖_1 ) .

Clearly the loss matrix has rank at most r^2. Consider the surrogate ψ_F : Y × R^{r^2}→R_+ and predictor pred_F : R^{r^2}→Y defined as

ψ_F(y, u) = ∑_{i=1}^r ∑_{j=1}^r ( u_{i,j} − 1(‖y‖_1 = j) · y_i )^2

pred_F(u) ∈ argmax_{t∈Y} ∑_{j=1}^r ∑_{i=1}^r ( 2 u_{i,j} · t_i ) / ( j + ‖t‖_1 ) .

From Theorem 5.1 we have that (ψF , predF ) is `F -calibrated. A similar result was given

by Dembczynski et al. [30, 31], along with an efficient (polynomial time in r) method to

compute predF .

Example 5.10 (Graph-based multilabel classification - penalized selection). Consider a

multilabel classification problem, where there is a set of r possible tags. Both labels and


predictions are binary vectors, indicating which of the tags are ‘present’ in any given instance (which could, for example, be an image or a document): Y = Y = {0, 1}^r. The Hamming loss in Example 2.11 can be used here, but it treats all tags in the same manner. Here we consider a general graph-based version of the problem where there is an undirected graph G = ([r], E) over the set of tags [r], with an edge between i and j indicating that tags i and j are ‘similar’. Let d_G : [r] × [r]→R_+ denote the shortest path metric in G; then the penalized selection loss ℓ_PS : {0, 1}^r × {0, 1}^r→R_+ we consider can be defined as follows:

ℓ_PS(y, t) = ∑_{i=1}^r y_i min_{j: t_j=1} d_G(i, j) + ∑_{i=1}^r t_i min_{j: y_j=1} d_G(i, j) .

The first term penalizes cases where a tag i present in y is far from all tags predicted in

t; the second term penalizes cases where a tag i predicted in t is far from all tags present

in y. When G is the complete graph, the above loss becomes equal to the Hamming loss;

for more general graphs G, one gets a loss that penalizes mistakes based on the structure

of G. Also it can be easily seen that the loss matrix for this loss has rank at most 2r.

Let the surrogate ψ_PS : Y × R^{2r}→R_+ and predictor pred_PS : R^{2r}→Y be such that

ψ_PS(y, u) = ∑_{i=1}^r (u_i − y_i)^2 + ∑_{i=1}^r ( u_{r+i} − min_{j: y_j=1} d_G(i, j) )^2

pred_PS(u) ∈ argmin_{t∈{0,1}^r} ∑_{i=1}^r ( u_i · min_{j: t_j=1} d_G(i, j) + u_{r+i} · t_i ) .

We have from Theorem 5.1, that (ψPS, predPS) is `PS-calibrated. However, computing

predPS(u) exactly amounts to solving an uncapacitated facility location (UFL) problem,

which in general is NP-hard.4 Once again, we give ways to overcome this problem in

Chapter 6.

Example 5.11 (Graph-based multilabel classification - budgeted selection). Consider a

multilabel classification problem, where there is a set of r possible ‘tags’. The labels are

binary vectors indicating which of these tags are ‘present’ in any given instance (which

could, for example, be an image or a document), Y = {0, 1}^r. Further, we have a fixed

4We note that efficient algorithms for UFL exist in the special case when G is a tree [99].


budget of selecting at most p tags: Y = {y ∈ {0, 1}^r : ∑_i y_i ≤ p}. Here we consider a general graph-based version of the problem where there is an undirected graph G = ([r], E) over the set of tags [r], with an edge between i and j indicating that tags i and j are ‘similar’. Let d_G : [r] × [r]→R_+ denote the shortest path metric in G; then the budgeted selection loss ℓ_BS : {0, 1}^r × Y→R_+ we consider can be defined as follows:

ℓ_BS(y, t) = ∑_{i=1}^r y_i min_{j: t_j=1} d_G(i, j) .

This loss simply penalizes cases where a tag i present in y is far from all tags predicted

in t. It can also be easily seen that the loss matrix here has a rank of at most r.

Let the surrogate ψ_BS : {0, 1}^r × R^r→R_+ and predictor pred_BS : R^r→Y be such that

ψ_BS(y, u) = ∑_{i=1}^r (u_i − y_i)^2

pred_BS(u) ∈ argmin_{t∈Y} ∑_{i=1}^r u_i · min_{j: t_j=1} d_G(i, j) .

From Theorem 5.1, we have that (ψ_BS, pred_BS) is ℓ_BS-calibrated. However, computing pred_BS(u) exactly amounts to solving a p-median problem, which in general is NP-hard.5 This problem is addressed in Chapter 6.

5We note that efficient algorithms for p-median exist in the special case when G is a tree [99].
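For tiny r, pred_BS can likewise be computed exactly by enumerating all tag subsets of size at most p (a sketch; d_G is a precomputed shortest-path distance matrix, and the all-zeros prediction is skipped since the inner minimum would then be over an empty set):

```python
import itertools

def pred_bs_bruteforce(u, d_G, p):
    # Exact p-median by enumeration: argmin over nonempty subsets t with |t| <= p
    # of sum_i u[i] * min_{j in t} d_G[i][j].  Exponential in r.
    r = len(u)
    best, best_cost = None, float("inf")
    for size in range(1, p + 1):
        for t in itertools.combinations(range(r), size):
            cost = sum(u[i] * min(d_G[i][j] for j in t) for i in range(r))
            if cost < best_cost:
                best, best_cost = set(t), cost
    return best
```

On a 3-tag path graph with uniform scores and p = 1, the predictor picks the middle tag, the 1-median of the path.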

Chapter 6

Weak Notions of Consistency

Consistency is a very desirable property in a supervised learning algorithm, but in many

cases it can be a very difficult goal to achieve. For example, consider the calibrated sur-

rogate and predictor given in Example 5.8 for the pairwise disagreement loss in ranking.

While the complexity of the surrogate is reasonably small, having only a quadratic de-

pendence on the number of objects to be ranked r, computing the predictor requires time

exponential in r, which is impractical even when r is small. The case of Examples 5.7,

5.10 and 5.11 is also similar. In such cases, one might be willing to relax the requirement

of consistency in exchange for a more efficient algorithm in both training and prediction.

In this chapter we give two such weak notions of consistency and give several example

problems, where there is no known approach to construct efficient consistent algorithms,

but it is easy to construct efficient learning algorithms satisfying these weak notions of

consistency.

6.1 Chapter Organization

First, we consider a weak notion of consistency that we call consistency under noise conditions in Section 6.2, and give example problems where this can be achieved by efficient algorithms but the standard notion of consistency is hard to achieve. We then


Chapter 6. Weak Notions of Consistency 93

consider another weak notion of consistency known as approximate consistency in Section

6.3, and give some example problems where this notion is an apt choice.

6.2 Consistency Under Noise Conditions

For an algorithm returning a classifier h_M : X→Y on being given a training sample of size M, consistency requires that er^ℓ_D[h_M] −→_P er^{ℓ,*}_D for all distributions D. Hence, a natural way of weakening consistency is to require the above convergence to hold only for distributions D over X × Y satisfying certain conditions. Such conditions are called noise conditions because they essentially ensure that the ‘noise’ in the distribution D is not too wild.1 In particular, we will be interested in noise conditions that only restrict the values taken by the conditional probability vector p(x), and can be represented by a set P ⊆ ∆_n. More precisely, the set of ‘allowed’ distributions D of the random variable (X, Y) is the set of distributions whose conditional probability vectors are such that p(X) ∈ P with probability 1 over X.

An example of such a noise condition is given below.

Example 6.1 (Dominant label condition). Let Y = [n]. A prevalent noise condition in multiclass problems is the dominant label condition, where it is assumed that for every instance x ∈ X , the conditional probability vector p(x) is such that there exists a class with probability at least 1/2. The set P in this case is simply

P = {p ∈ ∆_n : max_{y∈[n]} p_y ≥ 1/2} .

An illustration of the above condition for n = 3 is given in Figure 6.1. It is known [86, 116] that under the dominant label condition, both the one-vs-all hinge loss and the Crammer-Singer surrogate are calibrated w.r.t. the 0-1 loss.

Similar to the `-calibrated surrogates and predictors in Definition 3.1, one can define the

notion of (`,P)-calibration which is the requirement to derive surrogate minimization

1The type of noise conditions considered in this chapter are distinct from the Tsybakov type noiseconditions in Chapter 5.

Figure 6.1: The dominant label noise condition for n = 3. The ‘allowed’ probabilities are shaded green.

algorithms that are consistent under a noise condition. This is captured by the following definition and theorem, whose proof exactly follows that of Theorem 3.2.

Definition 6.1 ($(\ell,\mathcal{P})$-calibration). Let $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$. Let $\mathcal{P} \subseteq \Delta_n$. Let $\psi : \mathcal{Y} \times \mathcal{C} \to \mathbb{R}_+$ and $\mathrm{pred} : \mathcal{C} \to \hat{\mathcal{Y}}$. $(\psi, \mathrm{pred})$ is said to be $(\ell, \mathcal{P})$-calibrated, or calibrated w.r.t. $\ell$ over $\mathcal{P}$, if
$$\forall p \in \mathcal{P} : \inf_{\mathbf{u} \in \mathcal{C} : \mathrm{pred}(\mathbf{u}) \notin \operatorname{argmin}_t \langle p, \ell_t \rangle} \langle p, \psi(\mathbf{u}) \rangle > \inf_{\mathbf{u} \in \mathcal{C}} \langle p, \psi(\mathbf{u}) \rangle\,.$$

Also, $\psi$ is said to be $(\ell, \mathcal{P})$-calibrated if there exists a $\mathrm{pred} : \mathcal{C} \to \hat{\mathcal{Y}}$ such that $(\psi, \mathrm{pred})$ is $(\ell, \mathcal{P})$-calibrated.

Theorem 6.1. Let $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$. Let $\mathcal{P} \subseteq \Delta_n$. Let $\psi : \mathcal{Y} \times \mathcal{C} \to \mathbb{R}_+$ and $\mathrm{pred} : \mathcal{C} \to \hat{\mathcal{Y}}$. $(\psi, \mathrm{pred})$ is $(\ell, \mathcal{P})$-calibrated iff for all distributions $D$ on $\mathcal{X} \times [n]$ such that $p(X) \in \mathcal{P}$ with probability 1, and all sequences of (random) vector functions $f_m : \mathcal{X} \to \mathcal{C}$, we have that
$$\mathrm{er}^{\psi}_D[f_m] \xrightarrow{P} \mathrm{er}^{\psi,*}_D \quad \text{implies} \quad \mathrm{er}^{\ell}_D[\mathrm{pred} \circ f_m] \xrightarrow{P} \mathrm{er}^{\ell,*}_D\,.$$

We now consider the problem of ranking under the evaluation metrics of pairwise disagreement and mean average precision, as in Examples 5.8 and 5.7. For both these evaluation

metrics we gave an efficient convex calibrated surrogate, but the predictor was shown to

be hard to compute. Below, we give alternate surrogates and efficient predictors for both

these problems, which achieve calibration under appropriate noise conditions. For the

sake of simplicity we only give results for the least squares type surrogate discussed in Examples 5.8 and 5.7; these results can be easily extended to the other strongly proper composite losses given in Table 5.1.

6.2.1 Pairwise Disagreement

The pairwise disagreement loss is a popular evaluation metric used in subset ranking.

We repeat the details of the pairwise disagreement surrogate in Example 5.8 here for

convenience.

Let $\mathcal{Y} = \mathcal{G}_r$, the set of all directed acyclic graphs (DAGs) on $r$ vertices, and let $\hat{\mathcal{Y}} = \Pi_r$. The loss on predicting a permutation $\sigma \in \Pi_r$ when the true label is $Y \in \mathcal{G}_r$ is given by
$$\ell^{\mathrm{PD}}(Y, \sigma) = \sum_{i=1}^{r} \sum_{j=1, j \neq i}^{r} Y_{i,j} \cdot \mathbf{1}\big(\sigma(i) > \sigma(j)\big)\,,$$
where $Y_{i,j} = \mathbf{1}\big((i,j) \in Y\big)$. Let $\psi^{\mathrm{PD}} : \mathcal{Y} \times \mathbb{R}^{r^2} \to \mathbb{R}_+$ and $\mathrm{pred}^{\mathrm{PD}} : \mathbb{R}^{r^2} \to \hat{\mathcal{Y}}$ be given as
\begin{align*}
\psi^{\mathrm{PD}}(Y, \mathbf{u}) &= \sum_{i=1}^{r} \sum_{j=1}^{r} (u_{i,j} - Y_{i,j})^2 \\
\mathrm{pred}^{\mathrm{PD}}(\mathbf{u}) &\in \operatorname{argmin}_{\sigma \in \Pi_r} \sum_{i=1}^{r} \sum_{j=1}^{r} u_{i,j} \cdot \mathbf{1}\big(\sigma(i) > \sigma(j)\big)\,.
\end{align*}

From Theorem 5.1, we have that $(\psi^{\mathrm{PD}}, \mathrm{pred}^{\mathrm{PD}})$ is $\ell^{\mathrm{PD}}$-calibrated. But, as mentioned in Example 5.8, computing the predictor requires time super-polynomial² in the number of objects per instance, $r$.

Below we give two sets of results. Firstly, we consider a predictor $\overline{\mathrm{pred}}^{\mathrm{PD}}$, which is a simple-to-implement version of $\mathrm{pred}^{\mathrm{PD}}$ above, and show that $(\psi^{\mathrm{PD}}, \overline{\mathrm{pred}}^{\mathrm{PD}})$ is $(\ell^{\mathrm{PD}}, \mathcal{P}_{\mathrm{DAG}})$-calibrated for a noise condition $\mathcal{P}_{\mathrm{DAG}}$. Secondly, we give a family of score-based ($r$-dimensional) surrogates, used along with the argsort predictor, that are calibrated w.r.t. the $\ell^{\mathrm{PD}}$ loss under different conditions on the probability distribution, all of which are more restrictive than $\mathcal{P}_{\mathrm{DAG}}$. This illustrates an interesting trade-off: if one is willing to settle for consistency under more restrictive noise conditions, then it is possible to make

²Assuming $P \neq NP$.


both the surrogate optimization in the training phase, and predictor computation in the

prediction phase, computationally easier. These score-based surrogates and conditions

generalize the surrogate and noise condition of Duchi et al. [34].

6.2.1.1 DAG Based Surrogate

Observing the expression for $\mathrm{pred}^{\mathrm{PD}}(\mathbf{u})$ carefully, one can see that the main reason for the computational difficulty is that the graph given by $\mathbf{u}$ may not be acyclic, in which case the problem becomes equivalent to the minimum feedback arc set problem. If the directed graph $Y$ given by $\mathbf{u}$ is indeed acyclic, then any permutation given by a topological sorted order of $Y$, denoted by $\mathrm{topsort}(Y)$, would satisfy the requirement of $\mathrm{pred}^{\mathrm{PD}}(\mathbf{u})$. Hence, if one can ensure a noise condition such that it does not matter what the predictor does on inputs $\mathbf{u}$ corresponding to cyclic graphs, then we immediately get an efficient predictor.

Consider the predictor $\overline{\mathrm{pred}}^{\mathrm{PD}} : \mathbb{R}^{r^2} \to \hat{\mathcal{Y}}$ that is described by Algorithm 1 below:

Algorithm 1 $\overline{\mathrm{pred}}^{\mathrm{PD}}$
Input: $\mathbf{u} \in \mathbb{R}^{r^2}$
Output: Permutation $\sigma \in \Pi_r$
Construct a directed graph $Y$ over $[r]$ with edge $(i,j)$ having weight $(u_{i,j} - u_{j,i})_+$.
while $Y$ has cycles
    Delete the edge of $Y$ with minimum weight
end while
return $\mathrm{topsort}(Y)$
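As a concrete illustration (not part of the thesis), a minimal Python sketch of Algorithm 1, assuming the scores are given as an $r \times r$ matrix `u`; for simplicity it keeps only edges with strictly positive weight:

```python
def pred_pd_bar(u):
    # Build the graph with edge weights (u[i][j] - u[j][i])_+ (only strictly
    # positive edges are kept, a simplification), delete minimum-weight edges
    # until the graph is acyclic, then return a topological order.
    r = len(u)
    edges = {(i, j): u[i][j] - u[j][i]
             for i in range(r) for j in range(r)
             if i != j and u[i][j] - u[j][i] > 0}

    def topsort(edge_set):
        # Kahn's algorithm; returns None if the graph still has a cycle.
        indeg = [0] * r
        for (_, j) in edge_set:
            indeg[j] += 1
        queue = [v for v in range(r) if indeg[v] == 0]
        order = []
        while queue:
            v = queue.pop()
            order.append(v)
            for (a, b) in edge_set:
                if a == v:
                    indeg[b] -= 1
                    if indeg[b] == 0:
                        queue.append(b)
        return order if len(order) == r else None

    order = topsort(set(edges))
    while order is None:                     # Y has cycles
        weakest = min(edges, key=edges.get)  # edge of minimum weight
        del edges[weakest]
        order = topsort(set(edges))
    sigma = [0] * r                          # sigma[i] = position of item i
    for pos, v in enumerate(order):
        sigma[v] = pos + 1
    return sigma
```

Note that deleting the minimum-weight edge need not break the cycle containing it, so the loop may iterate several times before the graph becomes acyclic.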

Let $\Delta_{\mathcal{Y}}$ be the set of all distributions over the set of DAGs $\mathcal{Y}$. For each $p \in \Delta_{\mathcal{Y}}$, define $E_p = \{(i,j) \in [r] \times [r] : \mathbf{E}_{Y \sim p}[Y_{i,j}] > \mathbf{E}_{Y \sim p}[Y_{j,i}]\}$, and define $\mathcal{P}_{\mathrm{DAG}} \subset \Delta_{\mathcal{Y}}$ as follows:
$$\mathcal{P}_{\mathrm{DAG}} = \big\{ p \in \Delta_{\mathcal{Y}} : ([r], E_p) \text{ is a DAG} \big\}\,.$$

Then we have the following result:

Theorem 6.2. $(\psi^{\mathrm{PD}}, \overline{\mathrm{pred}}^{\mathrm{PD}})$ is $(\ell^{\mathrm{PD}}, \mathcal{P}_{\mathrm{DAG}})$-calibrated.


Proof. Let $p \in \mathcal{P}_{\mathrm{DAG}}$. Define $\mathbf{u}^p \in \mathbb{R}^{r^2}$ such that for all $i, j \in [r]$
$$u^p_{i,j} = \mathbf{E}_{Y \sim p}[Y_{i,j}] = \sum_{y \in \mathcal{Y}} p_y\, y_{i,j}\,.$$

It is easy to see that $\mathbf{u}^p$ is the unique minimizer of $\langle p, \psi^{\mathrm{PD}}(\mathbf{u}) \rangle$ over $\mathbf{u} \in \mathbb{R}^{r^2}$.

Define the following sets:
\begin{align*}
\Pi^*(p) &= \operatorname{argmin}_{\sigma \in \Pi_r} \langle p, \ell^{\mathrm{PD}}_\sigma \rangle = \operatorname{argmin}_{\sigma \in \Pi_r} \sum_{i=1}^{r} \sum_{j=1}^{r} u^p_{i,j} \cdot \mathbf{1}(\sigma(i) > \sigma(j)) \\
\Pi(p) &= \big\{ \sigma \in \Pi_r : \sigma \text{ corresponds to a topological sorted order returned by } \overline{\mathrm{pred}}^{\mathrm{PD}}(\mathbf{u}^p) \big\}\,.
\end{align*}

We claim that $\Pi(p) \subseteq \Pi^*(p)$. To see this, let $\sigma \in \Pi(p)$. Since $p \in \mathcal{P}_{\mathrm{DAG}}$, we have that the graph with edge weights $(u^p_{i,j} - u^p_{j,i})_+$ formed by $\overline{\mathrm{pred}}^{\mathrm{PD}}(\mathbf{u}^p)$ is a DAG, and therefore $\sigma$ must agree with the edges in this graph, i.e.
\begin{align*}
u^p_{i,j} > u^p_{j,i} &\implies \sigma(i) < \sigma(j)\,, \\
u^p_{i,j} < u^p_{j,i} &\implies \sigma(i) > \sigma(j)\,.
\end{align*}
This clearly gives $\sigma \in \Pi^*(p)$. Thus $\Pi(p) \subseteq \Pi^*(p)$.

Now, let
$$A(p) = \big\{ \mathbf{u} \in \mathbb{R}^{r^2} : \overline{\mathrm{pred}}^{\mathrm{PD}}(\mathbf{u}) \notin \operatorname{argmin}_\sigma \langle p, \ell^{\mathrm{PD}}_\sigma \rangle \big\} = \big\{ \mathbf{u} \in \mathbb{R}^{r^2} : \overline{\mathrm{pred}}^{\mathrm{PD}}(\mathbf{u}) \notin \Pi^*(p) \big\}\,.$$

To show that $(\psi^{\mathrm{PD}}, \overline{\mathrm{pred}}^{\mathrm{PD}})$ is $(\ell^{\mathrm{PD}}, \mathcal{P}_{\mathrm{DAG}})$-calibrated, one simply needs to show:
$$\inf_{\mathbf{u} \in A(p)} \langle p, \psi^{\mathrm{PD}}(\mathbf{u}) \rangle > \inf_{\mathbf{u} \in \mathbb{R}^{r^2}} \langle p, \psi^{\mathrm{PD}}(\mathbf{u}) \rangle\,.$$

We do so by showing that any sequence $\mathbf{u}^m$ in $\mathbb{R}^{r^2}$ converging to $\mathbf{u}^p$ must eventually lie outside $A(p)$, i.e. that any such sequence must eventually have $\overline{\mathrm{pred}}^{\mathrm{PD}}(\mathbf{u}^m) \in \Pi^*(p)$; the result will then follow from the fact that $\mathbf{u}^p$ is the unique minimizer of $\langle p, \psi^{\mathrm{PD}}(\cdot) \rangle$.


Let $\mathbf{u}^m$ be any sequence in $\mathbb{R}^{r^2}$ converging to $\mathbf{u}^p$. Let
$$\varepsilon = \min_{i \neq j} \big\{ u^p_{i,j} - u^p_{j,i} : u^p_{i,j} - u^p_{j,i} > 0 \big\}\,.$$

Then for large enough $m$, we must have the following (by convergence of $\mathbf{u}^m$ to $\mathbf{u}^p$):
\begin{align*}
u^p_{i,j} - u^p_{j,i} > 0 &\implies u^m_{i,j} - u^m_{j,i} \geq \varepsilon/2\,, \\
u^p_{i,j} - u^p_{j,i} = 0 &\implies u^m_{i,j} - u^m_{j,i} \leq \varepsilon/4\,.
\end{align*}

Thus, for large enough $m$, the directed graph induced by $\mathbf{u}^m$ contains the DAG induced by $\mathbf{u}^p$, and any edge $(i,j)$ such that $u^p_{i,j} - u^p_{j,i} > 0$ will not be deleted by the algorithm when $\overline{\mathrm{pred}}^{\mathrm{PD}}(\mathbf{u}^m)$ is evaluated. Thus, for large enough $m$, we have $\overline{\mathrm{pred}}^{\mathrm{PD}}(\mathbf{u}^m) \in \Pi(p) \subseteq \Pi^*(p)$.

Since the above holds for all $p \in \mathcal{P}_{\mathrm{DAG}}$, we have that $(\psi^{\mathrm{PD}}, \overline{\mathrm{pred}}^{\mathrm{PD}})$ is $(\ell^{\mathrm{PD}}, \mathcal{P}_{\mathrm{DAG}})$-calibrated.

6.2.1.2 Score-Based Surrogates

The following theorem gives a family of score-based surrogates, parameterized by functions $\alpha : \mathcal{Y} \to \mathbb{R}^r$, that are calibrated w.r.t. $\ell^{\mathrm{PD}}$ under different conditions on the probability distribution.

Theorem 6.3. Let $\alpha : \mathcal{Y} \to \mathbb{R}^r$ be any function that maps DAGs $Y \in \mathcal{Y}$ to score vectors $\alpha(Y) \in \mathbb{R}^r$. Let $\psi^\alpha : \mathcal{Y} \times \mathbb{R}^r \to \mathbb{R}_+$, $\mathrm{pred} : \mathbb{R}^r \to \Pi_r$ and $\mathcal{P}_\alpha \subset \Delta_{\mathcal{Y}}$ be such that
\begin{align*}
\psi^\alpha(Y, \mathbf{u}) &= \sum_{i=1}^{r} \big( u_i - \alpha_i(Y) \big)^2 \\
\mathrm{pred}(\mathbf{u}) &\in \big\{ \sigma \in \Pi_r : u_i > u_j \implies \sigma(i) < \sigma(j) \big\} = \operatorname{argsort}(\mathbf{u}) \\
\mathcal{P}_\alpha &= \big\{ p \in \Delta_{\mathcal{Y}} : \mathbf{E}_{Y \sim p}[Y_{i,j}] > \mathbf{E}_{Y \sim p}[Y_{j,i}] \implies \mathbf{E}_{Y \sim p}[\alpha_i(Y)] > \mathbf{E}_{Y \sim p}[\alpha_j(Y)] \big\}\,.
\end{align*}
Then $(\psi^\alpha, \mathrm{pred})$ is $(\ell^{\mathrm{PD}}, \mathcal{P}_\alpha)$-calibrated.


Proof. Let $p \in \mathcal{P}_\alpha$. Define $\mathbf{u}^p \in \mathbb{R}^r$ as
$$\mathbf{u}^p = \mathbf{E}_{Y \sim p}[\alpha(Y)] = \sum_{y \in \mathcal{Y}} p_y\, \alpha(y)\,.$$

It is easy to see that $\mathbf{u}^p$ is the unique minimizer of $\langle p, \psi^\alpha(\mathbf{u}) \rangle$ over $\mathbf{u} \in \mathbb{R}^r$. Also define $\mathbf{y}^p \in \mathbb{R}^{r^2}$ such that for all $i, j \in [r]$
$$y^p_{i,j} = \mathbf{E}_{Y \sim p}[Y_{i,j}] = \sum_{y \in \mathcal{Y}} p_y\, y_{i,j}\,.$$

Define the following sets:
\begin{align*}
\Pi^*(p) &= \operatorname{argmin}_{\sigma \in \Pi_r} \langle p, \ell^{\mathrm{PD}}_\sigma \rangle = \operatorname{argmin}_{\sigma \in \Pi_r} \sum_{i=1}^{r} \sum_{j=1}^{r} y^p_{i,j} \cdot \mathbf{1}(\sigma(i) > \sigma(j)) \\
\Pi(p) &= \big\{ \sigma \in \Pi_r : u^p_i > u^p_j \implies \sigma(i) < \sigma(j) \big\}\,.
\end{align*}

We claim that $\Pi(p) \subseteq \Pi^*(p)$. To see this, let $\sigma \in \Pi(p)$. Since $p \in \mathcal{P}_\alpha$, we have
\begin{align*}
y^p_{i,j} > y^p_{j,i} &\implies u^p_i > u^p_j \implies \sigma(i) < \sigma(j)\,, \\
y^p_{i,j} < y^p_{j,i} &\implies u^p_i < u^p_j \implies \sigma(i) > \sigma(j)\,.
\end{align*}
This clearly gives $\sigma \in \Pi^*(p)$. Thus $\Pi(p) \subseteq \Pi^*(p)$.

By the definition of $\mathrm{pred}$ and $\Pi(p)$, we also have that $\exists \varepsilon > 0$ such that for any $\mathbf{u} \in \mathbb{R}^r$,
$$\|\mathbf{u} - \mathbf{u}^p\| < \varepsilon \implies \mathrm{pred}(\mathbf{u}) \in \Pi(p)\,.$$


Thus, we have
\begin{align*}
\inf_{\mathbf{u} \in \mathbb{R}^r : \mathrm{pred}(\mathbf{u}) \notin \operatorname{argmin}_\sigma \langle p, \ell^{\mathrm{PD}}_\sigma \rangle} \langle p, \psi^\alpha(\mathbf{u}) \rangle &= \inf_{\mathbf{u} \in \mathbb{R}^r : \mathrm{pred}(\mathbf{u}) \notin \Pi^*(p)} \langle p, \psi^\alpha(\mathbf{u}) \rangle \\
&\geq \inf_{\mathbf{u} \in \mathbb{R}^r : \mathrm{pred}(\mathbf{u}) \notin \Pi(p)} \langle p, \psi^\alpha(\mathbf{u}) \rangle \\
&\geq \inf_{\mathbf{u} \in \mathbb{R}^r : \|\mathbf{u} - \mathbf{u}^p\| \geq \varepsilon} \langle p, \psi^\alpha(\mathbf{u}) \rangle \\
&> \inf_{\mathbf{u} \in \mathbb{R}^r} \langle p, \psi^\alpha(\mathbf{u}) \rangle\,,
\end{align*}
where the last inequality follows because $\langle p, \psi^\alpha(\mathbf{u}) \rangle$ has a unique minimizer $\mathbf{u}^p$.

Since the above holds for all $p \in \mathcal{P}_\alpha$, we have that $(\psi^\alpha, \mathrm{pred})$ is $(\ell^{\mathrm{PD}}, \mathcal{P}_\alpha)$-calibrated.

The noise conditions $\mathcal{P}_\alpha$ state that the expected value of the function $\alpha$ must decide the ‘right’ ordering. It can be seen that for any $\alpha : \mathcal{Y} \to \mathbb{R}^r$, the noise condition satisfies $\mathcal{P}_\alpha \subsetneq \mathcal{P}_{\mathrm{DAG}}$, and thus the noise conditions for such score-based surrogates are strictly more restrictive than those for the $r^2$-dimensional surrogates discussed in Section 6.2.1.1.

We note that the surrogate given by Duchi et al. [34] can be written in our notation as
$$\psi^{\mathrm{DMJ}}(Y, \mathbf{u}) = \sum_{i=1}^{r} \sum_{j \neq i} Y_{i,j} \cdot (u_j - u_i) + \nu \sum_{i=1}^{r} \lambda(u_i)\,,$$
where $\lambda$ is a strictly convex and 1-coercive function and $\nu > 0$. Taking $\lambda(z) = z^2$ and $\nu = \frac{1}{2}$ gives a special case of the family of score-based surrogates in Theorem 6.3 above, obtained by taking $\alpha$ such that for all $i \in [r]$
$$\alpha_i(Y) = \sum_{j \neq i} \big( Y_{i,j} - Y_{j,i} \big)\,.$$

Indeed, the set of noise conditions under which the surrogate $\psi^{\mathrm{DMJ}}$ is shown to be calibrated w.r.t. $\ell^{\mathrm{PD}}$ in Duchi et al. [34] is exactly the set $\mathcal{P}_\alpha$ above with this choice of $\alpha$. We also note that $\alpha$ can be viewed as a ‘standardization function’ [11] for the PD loss over $\mathcal{P}_\alpha$.
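To make the score-based construction concrete, here is a small Python sketch (an illustration, not code from the thesis) of this choice of $\alpha$ together with the argsort predictor; the input format (labels as 0/1 adjacency matrices given as lists of lists) is an assumption:

```python
def alpha(Y):
    # alpha_i(Y) = sum_{j != i} (Y_ij - Y_ji): out-weight minus in-weight of i.
    r = len(Y)
    return [sum(Y[i][j] - Y[j][i] for j in range(r) if j != i) for i in range(r)]

def fit_scores(labels):
    # The minimizer of the expected squared surrogate is the mean of alpha(Y).
    r = len(labels[0])
    avg = [0.0] * r
    for Y in labels:
        a = alpha(Y)
        for i in range(r):
            avg[i] += a[i] / len(labels)
    return avg

def pred_argsort(u):
    # argsort predictor: sigma(i) < sigma(j) whenever u_i > u_j.
    order = sorted(range(len(u)), key=lambda i: -u[i])
    sigma = [0] * len(u)
    for pos, i in enumerate(order):
        sigma[i] = pos + 1
    return sigma
```

Both training-time score estimation and prediction here take time polynomial in $r$, illustrating the computational gain bought by the stronger noise condition $\mathcal{P}_\alpha$.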


6.2.2 Mean Average Precision

The mean average precision is another popular evaluation metric in ranking. We repeat

the details from Example 5.7 here for convenience.

Let $\mathcal{Y} = \{0,1\}^r \setminus \{\mathbf{0}\}$ and $\hat{\mathcal{Y}} = \Pi_r$. The loss on predicting a permutation $\sigma \in \Pi_r$ when the true label is $y \in \mathcal{Y}$ is given by
$$\ell^{\mathrm{MAP}}(y, \sigma) = 1 - \frac{1}{\|y\|_1} \sum_{i=1}^{r} \sum_{j=1}^{i} \frac{y_i\, y_j}{\max(\sigma(i), \sigma(j))}\,.$$

Let $\psi^{\mathrm{MAP}} : \mathcal{Y} \times \mathbb{R}^{r(r+1)/2} \to \mathbb{R}_+$ and $\mathrm{pred}^{\mathrm{MAP}} : \mathbb{R}^{r(r+1)/2} \to \hat{\mathcal{Y}}$ be such that
\begin{align*}
\psi^{\mathrm{MAP}}(y, \mathbf{u}) &= \sum_{i=1}^{r} \sum_{j=1}^{i} \Big( u_{i,j} - \frac{y_i y_j}{\|y\|_1} \Big)^2 \\
\mathrm{pred}^{\mathrm{MAP}}(\mathbf{u}) &\in \operatorname{argmax}_{\sigma \in \Pi_r} \sum_{i=1}^{r} \sum_{j=1}^{i} u_{i,j} \cdot \frac{1}{\max(\sigma(i), \sigma(j))}\,.
\end{align*}

From Theorem 5.1, we have that $(\psi^{\mathrm{MAP}}, \mathrm{pred}^{\mathrm{MAP}})$ is $\ell^{\mathrm{MAP}}$-calibrated. But, as mentioned in Example 5.7, computing the predictor is hard.

Below, we describe an alternate mapping in place of $\mathrm{pred}^{\mathrm{MAP}}$ which can be computed efficiently, and show that under certain conditions on the probability distribution, the surrogate $\psi^{\mathrm{MAP}}$ together with this mapping is still calibrated for $\ell^{\mathrm{MAP}}$.

Specifically, define $\overline{\mathrm{pred}}^{\mathrm{MAP}} : \mathbb{R}^{r(r+1)/2} \to \hat{\mathcal{Y}}$ as follows:
$$\overline{\mathrm{pred}}^{\mathrm{MAP}}(\mathbf{u}) \in \big\{ \sigma \in \Pi_r : u_{i,i} > u_{j,j} \implies \sigma(i) < \sigma(j) \big\}\,.$$

Clearly, $\overline{\mathrm{pred}}^{\mathrm{MAP}}(\mathbf{u})$ can be implemented efficiently by simply sorting the ‘diagonal’ elements $u_{i,i}$ for $i \in [r]$. Also, let $\Delta_{\mathcal{Y}}$ denote the probability simplex over $\mathcal{Y}$, and for each $p \in \Delta_{\mathcal{Y}}$, define $\mathbf{u}^p \in \mathbb{R}^{r(r+1)/2}$ as follows:
$$u^p_{i,j} = \mathbf{E}_{Y \sim p}\Big[ \frac{Y_i Y_j}{\|Y\|_1} \Big] = \sum_{y \in \mathcal{Y}} p_y \Big( \frac{y_i y_j}{\|y\|_1} \Big) \quad \forall i, j \in [r] : i \geq j\,.$$
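For illustration, a minimal Python sketch (not from the thesis) of this efficient MAP predictor, assuming the $r(r+1)/2$ surrogate outputs are stored as a symmetric $r \times r$ matrix so that the ‘diagonal’ entries are `u[i][i]`:

```python
def pred_map_bar(u):
    # Rank items by the diagonal entries u[i][i] (estimates of E[Y_i/||Y||_1]);
    # ties are broken arbitrarily by the sort.
    r = len(u)
    order = sorted(range(r), key=lambda i: -u[i][i])
    sigma = [0] * r
    for pos, i in enumerate(order):
        sigma[i] = pos + 1
    return sigma
```

This runs in $O(r \log r)$ time, in contrast to the hard combinatorial optimization required by the exact predictor.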


Now define $\mathcal{P}_{\mathrm{MAP}} \subset \Delta_{\mathcal{Y}}$ as follows:
$$\mathcal{P}_{\mathrm{MAP}} = \Big\{ p \in \Delta_{\mathcal{Y}} : u^p_{i,i} \geq u^p_{j,j} \implies u^p_{i,i} \geq u^p_{j,j} + \sum_{\gamma \in [r] \setminus \{i,j\}} \big( u^p_{j,\gamma} - u^p_{i,\gamma} \big)_+ \Big\}\,,$$
where we set $u^p_{i,j} = u^p_{j,i}$ for $i < j$. Then we have the following result:

Theorem 6.4. $(\psi^{\mathrm{MAP}}, \overline{\mathrm{pred}}^{\mathrm{MAP}})$ is $(\ell^{\mathrm{MAP}}, \mathcal{P}_{\mathrm{MAP}})$-calibrated.

Proof. Let $p \in \mathcal{P}_{\mathrm{MAP}}$. It is easy to see that $\mathbf{u}^p \in \mathbb{R}^{r(r+1)/2}$ is the unique minimizer of $\langle p, \psi^{\mathrm{MAP}}(\mathbf{u}) \rangle$ over $\mathbf{u} \in \mathbb{R}^{r(r+1)/2}$.

We have from the definition of the MAP loss,
$$\langle p, \ell^{\mathrm{MAP}}_\sigma \rangle = 1 - \sum_{i=1}^{r} \sum_{j=1}^{i} u^p_{i,j}\, \frac{1}{\max(\sigma(i), \sigma(j))} = 1 - \sum_{i=1}^{r} \frac{1}{i} \sum_{j=1}^{i} u^p_{\sigma^{-1}(i), \sigma^{-1}(j)}\,. \tag{6.1}$$

Now define the following sets:
\begin{align*}
\Pi^*(p) &= \operatorname{argmin}_{\sigma \in \Pi_r} \langle p, \ell^{\mathrm{MAP}}_\sigma \rangle \\
\Pi(p) &= \big\{ \sigma \in \Pi_r : u^p_{i,i} > u^p_{j,j} \implies \sigma(i) < \sigma(j) \big\}\,.
\end{align*}

From Lemma 6.5 below, we have that Π(p) ⊆ Π∗(p).

By the definition of $\overline{\mathrm{pred}}^{\mathrm{MAP}}$ and $\Pi(p)$, we also have that $\exists \varepsilon > 0$ such that for any $\mathbf{u} \in \mathbb{R}^{r(r+1)/2}$,
$$\|\mathbf{u} - \mathbf{u}^p\| < \varepsilon \implies \overline{\mathrm{pred}}^{\mathrm{MAP}}(\mathbf{u}) \in \Pi(p)\,.$$


Thus, we have
\begin{align*}
\inf_{\mathbf{u} \in \mathbb{R}^{r(r+1)/2} : \overline{\mathrm{pred}}^{\mathrm{MAP}}(\mathbf{u}) \notin \operatorname{argmin}_\sigma \langle p, \ell^{\mathrm{MAP}}_\sigma \rangle} \langle p, \psi^{\mathrm{MAP}}(\mathbf{u}) \rangle &= \inf_{\mathbf{u} \in \mathbb{R}^{r(r+1)/2} : \overline{\mathrm{pred}}^{\mathrm{MAP}}(\mathbf{u}) \notin \Pi^*(p)} \langle p, \psi^{\mathrm{MAP}}(\mathbf{u}) \rangle \\
&\geq \inf_{\mathbf{u} \in \mathbb{R}^{r(r+1)/2} : \overline{\mathrm{pred}}^{\mathrm{MAP}}(\mathbf{u}) \notin \Pi(p)} \langle p, \psi^{\mathrm{MAP}}(\mathbf{u}) \rangle \\
&\geq \inf_{\mathbf{u} \in \mathbb{R}^{r(r+1)/2} : \|\mathbf{u} - \mathbf{u}^p\| \geq \varepsilon} \langle p, \psi^{\mathrm{MAP}}(\mathbf{u}) \rangle \\
&> \inf_{\mathbf{u} \in \mathbb{R}^{r(r+1)/2}} \langle p, \psi^{\mathrm{MAP}}(\mathbf{u}) \rangle\,,
\end{align*}
where the last inequality follows because $\langle p, \psi^{\mathrm{MAP}}(\mathbf{u}) \rangle$ has a unique minimizer $\mathbf{u}^p$.

Since the above holds for all $p \in \mathcal{P}_{\mathrm{MAP}}$, we have that $(\psi^{\mathrm{MAP}}, \overline{\mathrm{pred}}^{\mathrm{MAP}})$ is $(\ell^{\mathrm{MAP}}, \mathcal{P}_{\mathrm{MAP}})$-calibrated.

The proof of Theorem 6.4 makes use of the following technical lemma:

Lemma 6.5. Let $p \in \mathcal{P}_{\mathrm{MAP}}$. Let the sets $\Pi^*(p)$ and $\Pi(p)$ be defined as in the proof of Theorem 6.4 above. Then $\Pi(p) \subseteq \Pi^*(p)$.

Proof of Lemma 6.5. We first observe that all permutations $\sigma \in \Pi(p)$ have the same value of $\langle p, \ell^{\mathrm{MAP}}_\sigma \rangle$. To see this, note that permutations in $\Pi(p)$ differ only in the positions they assign to elements $i, j \in [r]$ with $u^p_{i,i} = u^p_{j,j}$. But since $p \in \mathcal{P}_{\mathrm{MAP}}$, we have that if $u^p_{i,i} = u^p_{j,j}$, then $u^p_{i,\gamma} = u^p_{j,\gamma}$ for all $\gamma \in [r] \setminus \{i,j\}$. Thus, from the form of $\langle p, \ell^{\mathrm{MAP}}_\sigma \rangle$, we can see that if $u^p_{i,i} = u^p_{j,j}$, interchanging the positions of $i$ and $j$ in a permutation $\sigma$ does not change the value of $\langle p, \ell^{\mathrm{MAP}}_\sigma \rangle$. This establishes that all permutations $\sigma \in \Pi(p)$ have the same value of $\langle p, \ell^{\mathrm{MAP}}_\sigma \rangle$.

We will show below that there exists a permutation $\sigma^* \in \Pi(p) \cap \Pi^*(p)$. This will give that $\sigma^* \in \Pi(p)$ and $\langle p, \ell^{\mathrm{MAP}}_{\sigma^*} \rangle = \min_\sigma \langle p, \ell^{\mathrm{MAP}}_\sigma \rangle$; by the above observation, we will then have that $\langle p, \ell^{\mathrm{MAP}}_{\sigma'} \rangle = \min_\sigma \langle p, \ell^{\mathrm{MAP}}_\sigma \rangle$ for all $\sigma' \in \Pi(p)$, i.e. that $\Pi(p) \subseteq \Pi^*(p)$.

In order to show the existence of a permutation $\sigma^* \in \Pi(p) \cap \Pi^*(p)$, we will start with an arbitrary element $\sigma_0 \in \Pi^*(p)$, and will construct a sequence of permutations $\sigma_1, \sigma_2, \ldots, \sigma_m = \sigma^*$ by transposing one adjacent pair at a time, such that all elements in the sequence remain in $\Pi^*(p)$, and the final permutation $\sigma_m$ is also in $\Pi(p)$.


Let $\sigma_0 \in \Pi^*(p)$. If $\sigma_0 \in \Pi(p)$, we are done, so let us assume $\sigma_0 \notin \Pi(p)$. Thus there must exist an adjacent pair of elements in $\sigma_0$ that are not ordered according to the scores $u^p_{i,i}$, i.e. there must exist $a, b, c \in [r]$ such that
$$\sigma_0(a) = c, \quad \sigma_0(b) = c + 1, \quad \text{and} \quad u^p_{a,a} < u^p_{b,b}\,.$$

We will show that σ1 ∈ Π∗(p). For convenience let us denote (σ0)−1 as π0 and (σ1)−1 as

π1. Note that

π0(c) = π1(c+ 1) = a

π0(c+ 1) = π1(c) = b

π0(i) = π1(i) ∀i ∈ [r] \ c, c+ 1 .

From the expression for $\langle p, \ell^{\mathrm{MAP}}_\sigma \rangle$ in Equation (6.1) in the proof of Theorem 6.4 above, we have
\begin{align*}
&\langle p, \ell^{\mathrm{MAP}}_{\sigma_0} \rangle - \langle p, \ell^{\mathrm{MAP}}_{\sigma_1} \rangle \\
&\quad= \frac{1}{c} \Big( \sum_{j=1}^{c} \big( u^p_{\pi_1(c), \pi_1(j)} - u^p_{\pi_0(c), \pi_0(j)} \big) \Big) + \frac{1}{c+1} \Big( \sum_{j=1}^{c+1} \big( u^p_{\pi_1(c+1), \pi_1(j)} - u^p_{\pi_0(c+1), \pi_0(j)} \big) \Big) \\
&\quad= \frac{1}{c} \Big( \sum_{j=1}^{c} \big( u^p_{b, \pi_1(j)} - u^p_{a, \pi_0(j)} \big) \Big) + \frac{1}{c+1} \Big( \sum_{j=1}^{c+1} \big( u^p_{a, \pi_1(j)} - u^p_{b, \pi_0(j)} \big) \Big) \\
&\quad= \Big( \frac{1}{c} - \frac{1}{c+1} \Big) \sum_{j=1}^{c-1} \big( u^p_{b, \pi_1(j)} - u^p_{a, \pi_1(j)} \big) + \frac{1}{c} \big( u^p_{b,b} - u^p_{a,a} \big) + \frac{1}{c+1} \big( u^p_{a,b} + u^p_{a,a} - u^p_{b,a} - u^p_{b,b} \big) \\
&\quad= \Big( \frac{1}{c} - \frac{1}{c+1} \Big) \Big( \sum_{j=1}^{c-1} \big( u^p_{b, \pi_1(j)} - u^p_{a, \pi_1(j)} \big) + u^p_{b,b} - u^p_{a,a} \Big) \\
&\quad= \Big( \frac{1}{c} - \frac{1}{c+1} \Big) \Big( u^p_{b,b} - \Big( u^p_{a,a} + \sum_{j=1}^{c-1} \big( u^p_{a, \pi_1(j)} - u^p_{b, \pi_1(j)} \big) \Big) \Big) \\
&\quad\geq \Big( \frac{1}{c} - \frac{1}{c+1} \Big) \Big( u^p_{b,b} - \Big( u^p_{a,a} + \sum_{j \in [r],\, j \notin \{c, c+1\}} \big( u^p_{a, \pi_1(j)} - u^p_{b, \pi_1(j)} \big)_+ \Big) \Big) \geq 0\,,
\end{align*}


where the last inequality above follows since $p \in \mathcal{P}_{\mathrm{MAP}}$. This gives $\sigma_1 \in \Pi^*(p)$. Moreover, the number of pairs in $\sigma_1$ that disagree with the ordering according to $u^p_{i,i}$ is one less than that in $\sigma_0$. Since there can be at most $\binom{r}{2}$ such pairs in $\sigma_0$ to start with, by repeating the above process, we will eventually end up with a permutation $\sigma_m \in \Pi(p) \cap \Pi^*(p)$ (with $m \leq \binom{r}{2}$). The claim follows.

The ideal predictor $\mathrm{pred}^{\mathrm{MAP}}$ uses the entire matrix $\mathbf{u}$, but the predictor $\overline{\mathrm{pred}}^{\mathrm{MAP}}$ uses only the diagonal elements. The noise condition $\mathcal{P}_{\mathrm{MAP}}$ can be viewed as basically requiring that the diagonal elements dominate and enforce a clear ordering by themselves.

In fact, since the mapping $\overline{\mathrm{pred}}^{\mathrm{MAP}}$ depends on only the diagonal elements of $\mathbf{u}$, we can equivalently define an $r$-dimensional surrogate that is calibrated w.r.t. $\ell^{\mathrm{MAP}}$ over $\mathcal{P}_{\mathrm{MAP}}$. Specifically, we have the following immediate corollary:

Corollary 6.6. Let $\psi^{\mathrm{MAP}} : \{0,1\}^r \times \mathbb{R}^r \to \mathbb{R}_+$ and $\mathrm{pred} : \mathbb{R}^r \to \Pi_r$ be such that
\begin{align*}
\psi^{\mathrm{MAP}}(y, \mathbf{u}) &= \sum_{i=1}^{r} \Big( u_i - \frac{y_i}{\|y\|_1} \Big)^2 \\
\mathrm{pred}(\mathbf{u}) &\in \big\{ \sigma \in \Pi_r : u_i > u_j \implies \sigma(i) < \sigma(j) \big\} = \operatorname{argsort}(\mathbf{u})\,.
\end{align*}
Then $(\psi^{\mathrm{MAP}}, \mathrm{pred})$ is $(\ell^{\mathrm{MAP}}, \mathcal{P}_{\mathrm{MAP}})$-calibrated.

Looking at the form of $\psi^{\mathrm{MAP}}$ and $\mathrm{pred}$, we can see that the function $s : \mathcal{Y} \to \mathbb{R}^r$ defined as $s_i(y) = y_i / \|y\|_1$ is a ‘standardization function’ for the MAP loss over $\mathcal{P}_{\mathrm{MAP}}$, and therefore it follows that any ‘order-preserving surrogate’ with this standardization function is also calibrated with the MAP loss over $\mathcal{P}_{\mathrm{MAP}}$ [11].

6.3 Approximate Consistency

Another approach to weakening the consistency requirement, distinct from assuming noise

conditions, is that of approximate consistency.

An algorithm returning a classifier $h_M : \mathcal{X} \to \hat{\mathcal{Y}}$ on being given a training sample of size $M$ is said to be $\theta$-approximately consistent if for any $\delta, \varepsilon > 0$, we have that for large enough $M$, the following holds with probability $1 - \delta$:
$$\mathrm{er}^{\ell}_D[h_M] \leq \theta \cdot \mathrm{er}^{\ell,*}_D + \varepsilon\,.$$

For example, the 1-nearest neighbor algorithm is known to be 2-approximately consistent

for the zero-one loss used in classification [26].

We define the notion of approximate calibration of a surrogate, which implies approximate

consistency of the corresponding surrogate minimization algorithm.

Definition 6.2 (Approximate calibration). Let $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$ be a loss function. Let $\psi : \mathcal{Y} \times \mathcal{C} \to \mathbb{R}_+$ and $\mathrm{pred} : \mathcal{C} \to \hat{\mathcal{Y}}$. Let $\theta \geq 1$. We will say $(\psi, \mathrm{pred})$ is $\theta$-approximately calibrated w.r.t. $\ell$ (or simply $\theta$-approximately $\ell$-calibrated) if
$$\forall p \in \Delta_n : \inf_{\mathbf{u} \in \mathcal{C} : \langle p, \ell_{\mathrm{pred}(\mathbf{u})} \rangle > \theta \min_t \langle p, \ell_t \rangle} \langle p, \psi(\mathbf{u}) \rangle > \inf_{\mathbf{u} \in \mathcal{C}} \langle p, \psi(\mathbf{u}) \rangle\,.$$

The following result shows that if $(\psi, \mathrm{pred})$ is $\theta$-approximately $\ell$-calibrated, then any algorithm that is consistent w.r.t. $\psi$, and maps back its predictions in $\mathcal{C}$ to predictions in $\hat{\mathcal{Y}}$ via the mapping $\mathrm{pred} : \mathcal{C} \to \hat{\mathcal{Y}}$, is also $\theta$-approximately consistent w.r.t. $\ell$:

Theorem 6.7. Let $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$ be a loss function. Let $\psi : \mathcal{Y} \times \mathcal{C} \to \mathbb{R}_+$ and $\mathrm{pred} : \mathcal{C} \to \hat{\mathcal{Y}}$. Let $\theta > 1$. If $(\psi, \mathrm{pred})$ is $\theta$-approximately $\ell$-calibrated, then for all distributions $D$ on $\mathcal{X} \times \mathcal{Y}$ and all sequences of (random) vector functions $f_m : \mathcal{X} \to \mathcal{C}$ (depending on $(X_1, Y_1), \ldots, (X_m, Y_m)$), as $m \to \infty$,
$$\mathrm{er}^{\psi}_D[f_m] \xrightarrow{P} \mathrm{er}^{\psi,*}_D \quad \text{implies} \quad \forall \varepsilon > 0 : \mathbf{P}\big( \mathrm{er}^{\ell}_D[\mathrm{pred} \circ f_m] \geq \theta \cdot \mathrm{er}^{\ell,*}_D + \varepsilon \big) \to 0\,.$$

Proof. The proof is similar to the proof of Theorem 3.2; we give an outline here for

completeness.

For each $p \in \Delta_n$, define $H^p_\theta : \mathbb{R}_+ \to \mathbb{R}_+$ as follows:
$$H^p_\theta(\varepsilon') = \inf_{\mathbf{u} \in \mathbb{R}^d} \Big\{ \langle p, \psi(\mathbf{u}) \rangle - \inf_{\mathbf{u}' \in \mathbb{R}^d} \langle p, \psi(\mathbf{u}') \rangle \, : \, \langle p, \ell_{\mathrm{pred}(\mathbf{u})} \rangle - \theta \min_t \langle p, \ell_t \rangle \geq \varepsilon' \Big\}\,.$$


Since $(\psi, \mathrm{pred})$ is $\theta$-approximately $\ell$-calibrated, we have $H^p_\theta(\varepsilon') > 0$ for all $\varepsilon' > 0$ and $p \in \Delta_n$. Now, define $H_\theta : \mathbb{R}_+ \to \mathbb{R}_+$ as follows:
$$H_\theta(\varepsilon') = \inf_{p \in \Delta_n,\, \mathbf{u} \in \mathbb{R}^d} \Big\{ \langle p, \psi(\mathbf{u}) \rangle - \inf_{\mathbf{u}' \in \mathbb{R}^d} \langle p, \psi(\mathbf{u}') \rangle \, : \, \langle p, \ell_{\mathrm{pred}(\mathbf{u})} \rangle - \theta \min_t \langle p, \ell_t \rangle \geq \varepsilon' \Big\}\,.$$

The main step in the proof involves showing that we also have $H_\theta(\varepsilon') > 0$ for all $\varepsilon' > 0$; this can be established using arguments similar to those in the proof of Theorem 3.2. It then follows that there exists a concave non-decreasing function $\xi : \mathbb{R}_+ \to \mathbb{R}_+$ with $\xi(0) = 0$ and continuous at 0, such that for all distributions $D$ and functions $f : \mathcal{X} \to \mathbb{R}^d$,
$$\mathrm{er}^{\ell}_D[\mathrm{pred} \circ f] - \theta \cdot \mathrm{er}^{\ell,*}_D \leq \xi\big( \mathrm{er}^{\psi}_D[f] - \mathrm{er}^{\psi,*}_D \big)\,.$$

The claim follows.

Below, we give two generic ways of constructing approximately calibrated surrogates and

predictors. The first way is to simply construct an exactly calibrated surrogate and

predictor for another loss matrix that closely approximates the loss of interest.

Theorem 6.8. Let $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$ and $\tilde{\ell} : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$ be loss functions such that $\exists c_1, c_2 > 0$, $c_2 \geq c_1$, such that
$$c_1 \tilde{\ell}(y, t) \leq \ell(y, t) \leq c_2 \tilde{\ell}(y, t) \quad \forall y \in \mathcal{Y},\ t \in \hat{\mathcal{Y}}\,.$$
Let $\psi : \mathcal{Y} \times \mathbb{R}^d \to \mathbb{R}_+$ and $\mathrm{pred} : \mathbb{R}^d \to \hat{\mathcal{Y}}$ be such that $(\psi, \mathrm{pred})$ is $\tilde{\ell}$-calibrated. Then $(\psi, \mathrm{pred})$ is $\big(\frac{c_2}{c_1}\big)$-approximately $\ell$-calibrated.

Proof. Let $p \in \Delta_n$. Let $\mathbf{u}^m$ be any sequence in $\mathbb{R}^d$ such that $\langle p, \psi(\mathbf{u}^m) \rangle$ converges to $\inf_{\mathbf{u} \in \mathbb{R}^d} \langle p, \psi(\mathbf{u}) \rangle$. Then since $(\psi, \mathrm{pred})$ is $\tilde{\ell}$-calibrated, we have that for large enough $m$,
$$\langle p, \tilde{\ell}_{t_m} \rangle = \min_{t \in [k]} \langle p, \tilde{\ell}_t \rangle\,,$$
where $t_m = \mathrm{pred}(\mathbf{u}^m)$. Now, for any $t^*$ satisfying $\langle p, \tilde{\ell}_{t^*} \rangle = \min_{t \in [k]} \langle p, \tilde{\ell}_t \rangle$, we have $\forall t \in [k]$:
$$\langle p, \ell_{t^*} \rangle \leq c_2 \langle p, \tilde{\ell}_{t^*} \rangle \leq c_2 \langle p, \tilde{\ell}_t \rangle \leq \Big( \frac{c_2}{c_1} \Big) \langle p, \ell_t \rangle\,.$$


This gives that for large enough $m$, $\langle p, \ell_{t_m} \rangle \leq \big(\frac{c_2}{c_1}\big) \min_{t \in [k]} \langle p, \ell_t \rangle$. Thus we have that $(\psi, \mathrm{pred})$ is $\big(\frac{c_2}{c_1}\big)$-approximately $\ell$-calibrated.

The second way of constructing approximately calibrated surrogates and predictors is based on the generic loss-matrix-rank-dimensional surrogate and predictor from Theorem 5.1. Recall that the predictor in that case was expressed as a discrete optimization problem. Simply replacing it by a predictor that solves the discrete optimization problem only to within a factor $\theta$ of the best solution gives a $\theta$-approximately calibrated surrogate and predictor.

Theorem 6.9. Let $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$. Let $\theta > 1$. Suppose there exist $d \in \mathbb{N}$, vectors $a_1, a_2, \ldots, a_n \in [0,1]^d$ and $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_k \in \mathbb{R}^d$, and scalars $c_1, c_2, \ldots, c_n, \tilde{c}_1, \tilde{c}_2, \ldots, \tilde{c}_k \in \mathbb{R}$ such that
$$\ell(y, t) = \langle a_y, \mathbf{b}_t \rangle + c_y + \tilde{c}_t\,.$$
Let $\mathcal{V} \subseteq \mathbb{R}$ and let $\rho : \{1,2\} \times \mathcal{V} \to \mathbb{R}_+$ be a $\gamma$-strongly proper loss for some $\gamma > 0$ with a link function $\lambda : [0,1] \to \mathcal{V}$. Let the surrogate $\psi : \mathcal{Y} \times \mathcal{V}^d \to \mathbb{R}_+$ be given by
$$\psi(y, \mathbf{u}) = \sum_{i=1}^{d} a_{y,i}\, \rho(1, u_i) + (1 - a_{y,i})\, \rho(2, u_i)$$
and let the predictor $\mathrm{pred} : \mathcal{V}^d \to \hat{\mathcal{Y}}$ be such that
$$\langle \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \tilde{c}_{\mathrm{pred}(\mathbf{u})} \leq \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \big( \langle \lambda^{-1}(\mathbf{u}), \mathbf{b}_t \rangle + \tilde{c}_t \big)\,,$$
where $[\lambda^{-1}(\mathbf{u})]_i = \lambda^{-1}(u_i)$. Then for all distributions $D$ and functions $f : \mathcal{X} \to \mathcal{V}^d$,
$$\mathrm{er}^{\ell}_D[\mathrm{pred} \circ f] \leq (\theta + 1) \max_t \|\mathbf{b}_t\| \sqrt{\frac{2}{\gamma}\, \mathrm{reg}^{\psi}_D[f]} + \theta \min_{h : \mathcal{X} \to \hat{\mathcal{Y}}} \mathrm{er}^{\ell}_D[h]\,.$$
In particular, $(\psi, \mathrm{pred})$ is $\theta$-approximately $\ell$-calibrated.

Proof. Let the matrix $A \in [0,1]^{d \times n}$ be given by $A = [a_1, a_2, \ldots, a_n]$. Fix $p \in \Delta_n$, $\mathbf{u} \in \mathcal{V}^d$. Let $t^* \in \hat{\mathcal{Y}}$ be such that
$$\langle \lambda^{-1}(\mathbf{u}) - Ap, \mathbf{b}_{t^*} \rangle = \max_{t \in \hat{\mathcal{Y}}} \big( \langle \lambda^{-1}(\mathbf{u}) - Ap, \mathbf{b}_t \rangle \big)\,.$$


We then have that
\begin{align*}
\mathrm{er}^{\ell}_p(\mathrm{pred}(\mathbf{u}))
&= \langle p, \ell_{\mathrm{pred}(\mathbf{u})} \rangle = \langle Ap, \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \tilde{c}_{\mathrm{pred}(\mathbf{u})} + \langle p, c \rangle \\
&= \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \langle \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \tilde{c}_{\mathrm{pred}(\mathbf{u})} + \langle p, c \rangle \\
&\leq \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \big( \langle \lambda^{-1}(\mathbf{u}), \mathbf{b}_t \rangle + \tilde{c}_t \big) + \langle p, c \rangle \\
&= \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \big( \langle \lambda^{-1}(\mathbf{u}) - Ap, \mathbf{b}_t \rangle + \langle Ap, \mathbf{b}_t \rangle + \tilde{c}_t \big) + \langle p, c \rangle \\
&\leq \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \theta \cdot \max_{t \in \hat{\mathcal{Y}}} \big( \langle \lambda^{-1}(\mathbf{u}) - Ap, \mathbf{b}_t \rangle \big) + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \big( \langle Ap, \mathbf{b}_t \rangle + \tilde{c}_t \big) + \langle p, c \rangle \\
&= \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} \rangle + \theta \cdot \langle \lambda^{-1}(\mathbf{u}) - Ap, \mathbf{b}_{t^*} \rangle + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \big( \langle Ap, \mathbf{b}_t \rangle + \tilde{c}_t \big) + \langle p, c \rangle \\
&= \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} - \theta \mathbf{b}_{t^*} \rangle + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \big( \langle Ap, \mathbf{b}_t \rangle + \tilde{c}_t \big) + \langle p, c \rangle \\
&\leq \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} - \theta \mathbf{b}_{t^*} \rangle + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \big( \langle Ap, \mathbf{b}_t \rangle + \tilde{c}_t + \langle p, c \rangle \big) \\
&= \langle Ap - \lambda^{-1}(\mathbf{u}), \mathbf{b}_{\mathrm{pred}(\mathbf{u})} - \theta \mathbf{b}_{t^*} \rangle + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \mathrm{er}^{\ell}_p(t) \\
&\leq (\theta + 1) \cdot \|Ap - \lambda^{-1}(\mathbf{u})\| \cdot \max_t \|\mathbf{b}_t\| + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \mathrm{er}^{\ell}_p(t)\,. \tag{6.2}
\end{align*}

The last inequality above follows from Cauchy-Schwarz. Also, from Equation (5.2) in the proof of Theorem 5.1, we have
$$\mathrm{reg}^{\psi}_p(\mathbf{u}) \geq \frac{\gamma}{2} \|Ap - \lambda^{-1}(\mathbf{u})\|^2\,. \tag{6.3}$$

Putting Equations (6.2) and (6.3) together we have
$$\mathrm{er}^{\ell}_p(\mathrm{pred}(\mathbf{u})) \leq (\theta + 1) \cdot \max_t \|\mathbf{b}_t\| \cdot \sqrt{\frac{2}{\gamma}\, \mathrm{reg}^{\psi}_p(\mathbf{u})} + \theta \cdot \min_{t \in \hat{\mathcal{Y}}} \mathrm{er}^{\ell}_p(t)\,.$$

The theorem follows from taking expectations and applying Jensen’s inequality to the

square root function.

We now give some example applications where approximate consistency can be achieved

using the above theorem by efficient training and prediction algorithms, but achieving

exact consistency is hard. Once again, we use the squared loss as the proper loss $\rho$ for illustration; any other strongly proper convex surrogate, like the logistic or exponential loss in Table 5.1, can be used as well.

Example 6.2 (Pairwise disagreement on permutations). Consider the pairwise disagreement loss discussed in Example 5.8. We constructed a calibrated surrogate and predictor for the PD loss; however, computing the predictor is equivalent to solving the feedback arc set problem, which is not only NP-hard, but even hard to approximate.

Now, consider a variant of the PD loss where the label space $\mathcal{Y} = \hat{\mathcal{Y}} = \Pi_r$, the set of permutations on $[r]$. In this case, one can express the PD loss on predicting a permutation $\sigma$, when the true permutation is $y$, as
$$\ell^{\mathrm{PD}}(y, \sigma) = \sum_{i=1}^{r} \sum_{j=1}^{i-1} \mathbf{1}(y(i) < y(j)) \cdot \mathbf{1}(\sigma(i) > \sigma(j)) + \mathbf{1}(y(i) > y(j)) \cdot \mathbf{1}(\sigma(i) < \sigma(j))\,.$$

Let $\psi^{\mathrm{PD}} : \mathcal{Y} \times \mathbb{R}^{r(r-1)/2} \to \mathbb{R}_+$ and $\mathrm{pred}^{\mathrm{PD}} : \mathbb{R}^{r(r-1)/2} \to \hat{\mathcal{Y}}$ be such that
\begin{align*}
\psi^{\mathrm{PD}}(y, \mathbf{u}) &= \sum_{i=1}^{r} \sum_{j=1}^{i-1} \big( u_{i,j} - \mathbf{1}(y(i) < y(j)) \big)^2 \\
\mathrm{pred}^{\mathrm{PD}}(\mathbf{u}) &\in \operatorname{argmin}_{\sigma \in \Pi_r} \sum_{i=1}^{r} \sum_{j=1}^{i-1} u_{i,j} \cdot \mathbf{1}(\sigma(i) > \sigma(j)) + (1 - u_{i,j}) \cdot \mathbf{1}(\sigma(i) < \sigma(j))\,.
\end{align*}

From Theorem 5.1, we have that $(\psi^{\mathrm{PD}}, \mathrm{pred}^{\mathrm{PD}})$ is $\ell^{\mathrm{PD}}$-calibrated. While solving the optimization problem posed by $\mathrm{pred}^{\mathrm{PD}}$ is still NP-hard, due to the sum-to-one structure of the opposite edge weights $u_{i,j}$ and $1 - u_{i,j}$, efficient constant-factor approximation algorithms exist. For example, the LP-based rounding procedure of Ailon et al. [3] achieves (in expectation) a 2.5-factor approximation; by Theorem 6.9, this yields an efficient 2.5-approximately calibrated surrogate and predictor.³
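As an illustration of such approximate predictors (not code from the thesis), the following Python sketch implements a simple comparison-based QuickSort variant in the spirit of Ailon et al., a simpler cousin of their LP-rounding procedure with a weaker constant-factor guarantee; the pairwise-score input format, with `u[i][j]` the estimated probability that `i` precedes `j`, is an assumption:

```python
import random

def kwiksort(items, u):
    # Recursive pivot-based ranking: place i before the pivot when the
    # estimated probability u[i][pivot] that i precedes the pivot exceeds 1/2.
    if len(items) <= 1:
        return list(items)
    pivot = random.choice(items)
    before = [i for i in items if i != pivot and u[i][pivot] > 0.5]
    after = [i for i in items if i != pivot and u[i][pivot] <= 0.5]
    return kwiksort(before, u) + [pivot] + kwiksort(after, u)
```

When the pairwise estimates are consistent with a single total order, any pivot choice recovers that order exactly.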

Example 6.3 (Graph-based multilabel classification - penalized selection). Consider the problem of graph-based multilabel prediction with penalized selection, considered in Example 5.10. Let $\mathcal{Y} = \hat{\mathcal{Y}} = \{0,1\}^r$. Let $G$ be a graph over $[r]$ and $d_G$ be the shortest distance

³One can also in principle use the PTAS of Kenyon-Mathieu and Schudy [57] for this problem to get a better approximation factor, although this is more complicated to implement.


metric induced by it. The loss $\ell^{\mathrm{PS}} : \{0,1\}^r \times \{0,1\}^r \to \mathbb{R}_+$ is given as
$$\ell^{\mathrm{PS}}(y, t) = \sum_{i=1}^{r} y_i \min_{j : t_j = 1} d_G(i, j) + \sum_{i=1}^{r} t_i \min_{j : y_j = 1} d_G(i, j)$$

Let the surrogate $\psi^{\mathrm{PS}} : \mathcal{Y} \times \mathbb{R}^{2r} \to \mathbb{R}_+$ and $\mathrm{pred}^{\mathrm{PS}} : \mathbb{R}^{2r} \to \hat{\mathcal{Y}}$ be such that
\begin{align*}
\psi^{\mathrm{PS}}(y, \mathbf{u}) &= \sum_{i=1}^{r} (u_i - y_i)^2 + \sum_{i=1}^{r} \Big( u_{r+i} - \min_{j : y_j = 1} d_G(i, j) \Big)^2 \\
\mathrm{pred}^{\mathrm{PS}}(\mathbf{u}) &\in \operatorname{argmin}_{t \in \{0,1\}^r} \sum_{i=1}^{r} \Big( u_i \cdot \min_{j : t_j = 1} d_G(i, j) + u_{r+i} \cdot t_i \Big)\,.
\end{align*}

We have from Theorem 5.1 that $(\psi^{\mathrm{PS}}, \mathrm{pred}^{\mathrm{PS}})$ is $\ell^{\mathrm{PS}}$-calibrated. However, computing $\mathrm{pred}^{\mathrm{PS}}(\mathbf{u})$ exactly amounts to solving the NP-hard uncapacitated facility location (UFL) problem. Fortunately, the UFL problem admits efficient constant-factor approximation algorithms; for example, a simple greedy-type algorithm achieves a factor of 1.61 [50]. By Theorem 6.9, this yields an efficient 1.61-approximately calibrated surrogate and predictor.
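To make the objective concrete, the following Python sketch (an illustration, not from the thesis) evaluates $\mathrm{pred}^{\mathrm{PS}}$ by brute force over all $2^r - 1$ nonempty subsets; this is only feasible for small $r$, and in practice one would substitute a constant-factor UFL approximation as noted above. The layout of the inputs (`u` of length $2r$, `d` an $r \times r$ distance matrix) is an assumption:

```python
from itertools import product

def pred_ps_bruteforce(u, d):
    # Minimize sum_i u[i]*min_{j: t_j=1} d(i,j) + sum_i u[r+i]*t_i over t.
    r = len(d)
    best, best_val = None, float("inf")
    for t in product([0, 1], repeat=r):
        if not any(t):          # skip the empty selection (min over j undefined)
            continue
        val = (sum(u[i] * min(d[i][j] for j in range(r) if t[j]) for i in range(r))
               + sum(u[r + i] * t[i] for i in range(r)))
        if val < best_val:
            best, best_val = list(t), val
    return best
```

Here the first $r$ entries of `u` play the role of demand weights and the last $r$ entries play the role of facility-opening costs, which is exactly the UFL structure.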

Example 6.4 (Graph-based multilabel classification - budgeted selection). Consider the problem of graph-based multilabel prediction with budgeted selection, considered in Example 5.11. Let $\mathcal{Y} = \{0,1\}^r$, $p \in [r]$ and $\hat{\mathcal{Y}} = \{ y \in \{0,1\}^r : \sum_i y_i \leq p \}$. Let $G$ be a graph over $[r]$ and $d_G$ be the shortest distance metric induced by it. The loss $\ell^{\mathrm{BS}} : \{0,1\}^r \times \hat{\mathcal{Y}} \to \mathbb{R}_+$ is given as
$$\ell^{\mathrm{BS}}(y, t) = \sum_{i=1}^{r} y_i \min_{j : t_j = 1} d_G(i, j)$$

Let the surrogate $\psi^{\mathrm{BS}} : \{0,1\}^r \times \mathbb{R}^r \to \mathbb{R}_+$ and predictor $\mathrm{pred}^{\mathrm{BS}} : \mathbb{R}^r \to \hat{\mathcal{Y}}$ be such that
\begin{align*}
\psi^{\mathrm{BS}}(y, \mathbf{u}) &= \sum_{i=1}^{r} (u_i - y_i)^2 \\
\mathrm{pred}^{\mathrm{BS}}(\mathbf{u}) &\in \operatorname{argmin}_{t \in \hat{\mathcal{Y}}} \sum_{i=1}^{r} \Big( u_i \cdot \min_{j : t_j = 1} d_G(i, j) \Big)\,.
\end{align*}

We have that $(\psi^{\mathrm{BS}}, \mathrm{pred}^{\mathrm{BS}})$ is $\ell^{\mathrm{BS}}$-calibrated. However, computing $\mathrm{pred}^{\mathrm{BS}}(\mathbf{u})$ exactly amounts to solving the NP-hard $p$-median problem. Fortunately, the $p$-median problem admits efficient constant-factor approximation algorithms; for example, the simple local-search-type algorithm of Arya et al. [4] achieves an approximation factor of 4. By Theorem 6.9, this yields an efficient 4-approximately calibrated surrogate and predictor.

Part II

Application to

Hierarchical Classification

Chapter 7

Multiclass Classification with an

Abstain Option

In many applications like medical diagnosis, classification (binary or multiclass) is the ultimate objective, but there is an additional requirement of having to make confident decisions. In the event that a confident decision cannot be made, it is better to not take any decision at all rather than make a wrong decision. For instance, in medical diagnosis, if one is unable to confidently predict the condition affecting the patient using the current symptoms and test reports, it is better to order further tests and postpone the decision point, rather than making a prediction right away.

In other words, the prediction space $\hat{\mathcal{Y}}$ contains the elements in $\mathcal{Y}$ so as to constitute a classification problem, but it also contains a special element (which we denote by $\perp$) indicating the abstain option. Such problems are called multiclass (or binary) classification with an abstain option, abbreviated for convenience as MCAO (or BCAO). This particular type of supervised learning problem was also discussed in Example 2.4.

The problem of binary classification with an abstain option has also been called ‘classification with a reject option’ and has been the object of study of many papers [6, 38–41, 44, 47, 114]. In particular, Bartlett and Wegkamp [6], Grandvalet et al. [47], and Yuan and Wegkamp [114] assign a loss matrix of size $2 \times 3$ to this problem, and study consistent surrogate minimization algorithms for this loss matrix.


Chapter 7. Abstain Loss 115

The loss function $\ell^\alpha : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}_+$ used by Bartlett and Wegkamp [6] and Yuan and Wegkamp [114], which we call the abstain($\alpha$) loss, for some $\mathcal{Y}$ with cardinality 2, $\hat{\mathcal{Y}} = \mathcal{Y} \cup \{\perp\}$ and $\alpha \in [0,1]$, is given as
$$\ell^\alpha(y, t) = \begin{cases} 0 & \text{if } y = t \\ \alpha & \text{if } t = \perp \\ 1 & \text{otherwise} \end{cases} \tag{7.1}$$

Here $\alpha \in [0,1]$ denotes the cost of abstaining. In the binary case, we have that for any $\alpha > \frac{1}{2}$ it is never optimal to abstain, and hence the only interesting range of values for $\alpha$ is $\big[0, \frac{1}{2}\big]$.
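A direct Python transcription of the abstain($\alpha$) loss in Equation (7.1), representing $\perp$ by `None` (an implementation choice, not from the thesis):

```python
def abstain_loss(y, t, alpha):
    # Equation (7.1): 0 on a correct prediction, alpha on abstention
    # (t is None, standing in for the abstain symbol), 1 on a mistake.
    if t is None:
        return alpha
    return 0.0 if y == t else 1.0
```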

Yuan and Wegkamp [114] show that many standard smooth convex surrogates used in

binary classification, like the logistic surrogate, squared loss surrogate and exponential

loss surrogate are calibrated w.r.t. the above loss. Bartlett and Wegkamp [6] show that

the hinge loss is not calibrated with the above loss, but can be made calibrated with

a simple modification. The suggested modification is simply to use a double hinge loss with three linear segments instead of the two segments in the standard hinge loss, where the ratio of the slopes of the two non-flat segments depends on the cost of abstaining $\alpha$. Grandvalet et al.

[47] consider a slightly more general loss function than the above and derive a calibrated

double hinge surrogate loss for the same.

In many practical applications, one might require the abstain option with multiple classes,

i.e. to solve the MCAO problem, and it can be easily seen from Equation (7.1) that the

abstain(α) loss can just as easily be applied to the multiclass case with |Y| = n, as

in Example 2.10. While there have been some empirical and heuristic results for the

MCAO problem [93, 110, 117] there has been little theoretical analysis of the problem.

In particular, no known result exists on convex calibrated surrogates for such a loss with

n > 2, and none of the results for the binary case given above can be extended in a simple

way for n > 2.

The generic calibration result of Theorem 5.1 can be used to derive an $(n-1)$-dimensional smooth convex calibrated surrogate for the abstain($\alpha$) loss for any $\alpha \in \big[0, \frac{n-1}{n}\big]$. However, such a smooth surrogate essentially estimates the entire conditional probability vector and does much more than what is necessary to solve this problem. On the other hand,

consistent piecewise linear surrogate minimizing algorithms do only what is needed and

can be expected to be more successful. For example, the squared loss, the logistic loss and

hinge loss surrogates are all calibrated w.r.t. standard 0-1 loss in binary classification, but

the support vector machine (which minimizes the piecewise linear hinge loss surrogate) is

arguably the most widely used method in binary classification. Piecewise linear surrogates

have other advantages like easier optimization and sparsity (in the dual) as well. This

motivates the question –

Are there natural piecewise linear convex calibrated surrogates for the $n$-class abstain($\alpha$) loss generalizing those of Bartlett and Wegkamp [6] and Grandvalet et al. [47]?

Another interesting object of study is the convex calibration dimension of the n-class

abstain(α) loss, which motivates the following question –

Is $\mathrm{CCdim}(\ell^\alpha)$ significantly less than $n - 1 = \mathrm{CCdim}(\ell^{\text{0-1}})$? If so, what surrogate achieves it?

In this chapter, we give positive answers to both questions. We construct three convex calibrated piecewise-linear surrogates for the general $n$-class abstain($\alpha$) loss, for $\alpha \in \big[0, \frac{1}{2}\big]$, all of which reduce to the double hinge surrogate of Bartlett and Wegkamp [6] when $n = 2$, thus answering the first question. One of our constructed surrogates has a surrogate dimension of $\lceil \log_2(n) \rceil$, which answers the second question.

7.1 Chapter Organization

Firstly, we give some more detailed background on the abstain loss and discuss the effect of $\alpha$, the cost of abstaining, in Section 7.2. We then show that the Crammer-Singer (CS) surrogate [27] and the one-vs-all hinge (OVA) surrogate [86] are calibrated w.r.t. the abstain($\frac{1}{2}$) loss in Sections 7.3 and 7.4, and also give excess risk bounds relating these surrogates and the abstain($\frac{1}{2}$) loss. We also design a new convex surrogate with surrogate dimension $\lceil \log_2(n) \rceil$, called the binary encoded predictions (BEP) surrogate, show that it is calibrated with the abstain($\frac{1}{2}$) loss, and give excess risk bounds relating this surrogate and the abstain($\frac{1}{2}$) loss in Section 7.5. We give the details of a dual block coordinate ascent algorithm for minimizing the BEP surrogate in Section 7.6. We show how the CS, OVA and BEP surrogates can be modified to be calibrated w.r.t. the abstain($\alpha$) loss for any $\alpha \in \big[0, \frac{1}{2}\big]$ in Section 7.7, and give experimental results for all three algorithms on synthetic and real datasets in Section 7.8.

7.2 Background

In the rest of the chapter we shall fix Y = [n] for some n > 1 and Ȳ = Y ∪ {⊥}. For the n-class abstain loss ℓα : Y × Ȳ→R+ defined as in Equation (7.1), the Bayes optimal classifier h∗α : X→Ȳ is given by

h∗α(x) = argmax_{y∈[n]} p_y(x)   if max_{y∈[n]} p_y(x) ≥ 1 − α
       = ⊥                       otherwise.        (7.2)

The above can be seen as a natural extension of Chow's rule [21] for the binary case. It can also be seen that the interesting range of values for α is [0, (n−1)/n], as for all α > (n−1)/n the Bayes optimal classifier for the abstain(α) loss never abstains.
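In code, the Bayes optimal rule (7.2) is just a thresholded argmax. The sketch below (hypothetical function name, 1-indexed classes, with None standing in for ⊥) is an illustration only:

```python
def chow_rule(p, alpha):
    """Bayes optimal prediction for the abstain(alpha) loss (Eq. 7.2).

    p     : list of conditional class probabilities p_1(x), ..., p_n(x)
    alpha : cost of abstaining, in [0, (n-1)/n]
    Returns a class index in 1..n, or None to denote abstention.
    """
    y_max = max(range(len(p)), key=lambda i: p[i])  # argmax_y p_y(x)
    if p[y_max] >= 1 - alpha:
        return y_max + 1   # predict the most probable class
    return None            # abstain: no class is confident enough
```

For example, with p(x) = (0.5, 0.3, 0.2) the rule abstains at α = 0.2 (since 0.5 < 0.8) but predicts class 1 at α = 0.5.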

For small α, the classifier h∗α acts as a high-confidence classifier and would be useful in applications like medical diagnosis. For example, if one wishes to learn a classifier for diagnosing an illness with 80% confidence, and recommend further medical tests if that is not possible, the ideal classifier would be h∗0.2, which is the minimizer of the abstain(0.2) loss. If α = 1/2, the Bayes classifier h∗α has a very appealing structure – a class y ∈ [n] is predicted only if the class y has a simple majority. The abstain(α) loss is also useful in applications where a 'greater than 1−α conditional probability detector' can be used as a black box. As we shall see in Chapter 8, a greater than 1/2 conditional probability detector plays a crucial role in the Bayes optimal classifier for a hierarchical classification problem, and hence any surrogate calibrated w.r.t. the abstain(1/2) loss can be used as a component to derive calibrated surrogates for hierarchical classification.

[Figure 7.1: Trigger probability sets for the abstain(α) loss, shown on the probability simplex ∆3 with regions Q1, Q2, Q3 and Q⊥, for (a) α = 1/3, (b) α = 1/2, (c) α = 3/5.]

The generic calibrated surrogate of Theorem 5.1 can be made to be calibrated with any α ∈ [0, (n−1)/n]. However, as we shall see, the CS, OVA and BEP surrogates are only calibrated w.r.t. the abstain(1/2) loss. For any fixed α ∈ [0, 1/2], all three surrogates can be modified to be calibrated w.r.t. the abstain(α) loss, but they cannot be modified to be calibrated with the abstain(α) loss for α ∈ (1/2, (n−1)/n]. While this is slightly restrictive, we argue that MCAO problems are typically applicable in situations where high confidence (>50%) in the decisions is required, and these correspond exactly to α ∈ [0, 1/2].

We also suspect that abstain(α) problems with α > 1/2 are fundamentally more difficult than those with α ≤ 1/2, for the reason that evaluating the Bayes classifier h∗α(x) can be done for α ≤ 1/2 without finding the maximum conditional probability – just check whether any class has conditional probability greater than 1 − α, as there can be at most one such class. This is also evidenced by the more complicated trigger probability sets for the abstain(α) loss with α > 1/2, as shown in Figure 7.1.

7.3 Excess Risk Bounds for the CS Surrogate

In this section we give an excess risk bound relating the abstain(1/2) loss ℓ1/2 with the Crammer-Singer surrogate ψCS [27], thereby showing that the CS surrogate is calibrated w.r.t. the abstain(1/2) loss.

Define the surrogate ψCS : Y × Rn→R+ and predictor predCSτ : Rn→Y ∪ {⊥} as

ψCS(y, u) = (max_{j≠y} u_j − u_y + 1)_+

predCSτ(u) = argmax_{i∈[n]} u_i   if u_(1) − u_(2) > τ
           = ⊥                    otherwise,

where u_(i) denotes the ith largest component of u (i.e. the ith element when the components of u are sorted in descending order) and τ ∈ (0, 1) is a threshold parameter. Informally, this predictor chooses the class i with the highest score u_i if it is the clear maximum, and abstains otherwise.
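A minimal sketch of the surrogate and predictor above (0-indexed labels, None standing in for ⊥; an illustration, not the thesis's implementation):

```python
def psi_cs(y, u):
    """Crammer-Singer surrogate: (max_{j != y} u_j - u_y + 1)_+ (y is 0-indexed)."""
    m = max(u[j] for j in range(len(u)) if j != y)
    return max(m - u[y] + 1.0, 0.0)

def pred_cs(u, tau):
    """Predict argmax_i u_i if u_(1) - u_(2) > tau, else abstain (None)."""
    s = sorted(u, reverse=True)          # s[0] = u_(1), s[1] = u_(2)
    if s[0] - s[1] > tau:
        return max(range(len(u)), key=lambda i: u[i])
    return None
```

With u = (2.0, 0.5, 0.1) and τ = 0.5, the gap u_(1) − u_(2) = 1.5 exceeds τ, so class 0 is predicted; with u = (1.0, 0.9, 0.1) the predictor abstains.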

We now give the excess risk bound relating ψCS and ℓ1/2, which implies that (ψCS, predCSτ) is ℓ1/2-calibrated.

Theorem 7.1. Let τ ∈ (0, 1) and α = 1/2. Then for all f : X→Rn

reg^{ℓα}_D[predCSτ ◦ f] ≤ reg^{ψCS}_D[f] / (2 min(τ, 1 − τ)).

The following lemma gives some straightforward-to-prove (in)equalities satisfied by the Crammer-Singer surrogate; they will play an important role in the proof of the theorem above.

Lemma 7.2. For all y ∈ [n] and p ∈ ∆n,

⟨p, ψCS(e^n_y)⟩ = 2(1 − p_y),    (7.3)
⟨p, ψCS(0)⟩ = 1,                 (7.4)

and for all u ∈ Rn, y ∈ argmax_i u_i, y′ ∉ argmax_i u_i,

ψCS(y, u) ≥ u_(2) − u_(1) + 1,    (7.5)
ψCS(y′, u) ≥ u_(1) − u_(2) + 1,   (7.6)

where e^n_y is the vector in Rn with 1 in the yth position and 0 everywhere else.
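For instance, Equation (7.3) can be checked numerically on a small example (self-contained sketch, 0-indexed labels):

```python
def psi_cs(y, u):
    """Crammer-Singer surrogate (0-indexed y): (max_{j != y} u_j - u_y + 1)_+."""
    return max(max(u[j] for j in range(len(u)) if j != y) - u[y] + 1.0, 0.0)

p = [0.5, 0.3, 0.2]
for y in range(3):
    e_y = [1.0 if i == y else 0.0 for i in range(3)]
    lhs = sum(p[t] * psi_cs(t, e_y) for t in range(3))  # <p, psi_CS(e_y)>
    assert abs(lhs - 2 * (1 - p[y])) < 1e-9             # Eq. (7.3)
```

Here ψCS(y, e_y) = 0 and ψCS(y′, e_y) = 2 for y′ ≠ y, so the inner product is 2(1 − p_y).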

Proof. (Proof of Theorem 7.1)

We will show that for all p ∈ ∆n and all u ∈ Rn,

reg^{ψCS}_p(u) ≥ 2 min(τ, 1 − τ) · reg^{ℓα}_p(predCSτ(u)).    (7.7)

The theorem then follows from linearity of expectation.

Define the sets U^τ_1, . . . , U^τ_n, U^τ_⊥ such that for any t ∈ Y ∪ {⊥},

U^τ_t = {u ∈ Rn : predCSτ(u) = t}.

These sets evaluate to

U^τ_t = {u ∈ Rn : u_t > u_j + τ for all j ≠ t},   t ∈ [n]
U^τ_⊥ = {u ∈ Rn : u_(1) ≤ u_(2) + τ}.

Case 1: p_y ≥ 1/2 for some y ∈ [n].

We have that y ∈ argmin_t ⟨p, ℓα_t⟩.

Case 1a: u ∈ U^τ_y.

The RHS of Equation (7.7) is zero, and hence the inequality becomes trivial.

Case 1b: u ∈ U^τ_⊥.

We have that u_(1) − u_(2) ≤ τ. Let q = Σ_{i ∈ argmax_j u_j} p_i. We then have

reg^{ψCS}_p(u) ≥ ⟨p, ψCS(u)⟩ − ⟨p, ψCS(e^n_y)⟩
  (7.3)=  Σ_{i : u_i = u_(1)} p_i ψCS(i, u) + Σ_{i : u_i < u_(1)} p_i ψCS(i, u) − 2(1 − p_y)
  (7.5),(7.6)≥  (2q − 1)(u_(2) − u_(1)) − 1 + 2p_y
  ≥ (2p_y − 1)(u_(2) − u_(1)) − 1 + 2p_y
  ≥ (2p_y − 1)(1 − τ).    (7.8)

The next-to-last inequality is due to the observation that if q > p_y then u_(1) = u_(2).

We also have that

reg^{ℓα}_p(predCSτ(u)) = ⟨p, ℓα_⊥⟩ − ⟨p, ℓα_y⟩ = p_y − 1/2.    (7.9)

From Equations (7.8) and (7.9) we have

reg^{ψCS}_p(u) ≥ 2(1 − τ) · reg^{ℓα}_p(predCSτ(u)).    (7.10)

Case 1c: u ∈ Rn \ (U^τ_y ∪ U^τ_⊥).

We have predCSτ(u) = y′ ≠ y. Also p_{y′} ≤ 1 − p_y ≤ 1/2 and u_(1) = u_{y′} > u_(2) + τ.

reg^{ψCS}_p(u) ≥ ⟨p, ψCS(u)⟩ − ⟨p, ψCS(e^n_y)⟩
  (7.3)=  (Σ_{i ≠ y′} p_i ψCS(i, u) + p_{y′} ψCS(y′, u)) − 2(1 − p_y)
  (7.5),(7.6)≥  (1 − 2p_{y′})(u_{y′} − u_(2)) − 1 + 2p_y
  ≥ 2τ(p_y − p_{y′})    (From Case 1c).    (7.11)

We also have that

reg^{ℓα}_p(predCSτ(u)) = ⟨p, ℓα_{y′}⟩ − ⟨p, ℓα_y⟩ = p_y − p_{y′}.    (7.12)

From Equations (7.11) and (7.12) we have

reg^{ψCS}_p(u) ≥ 2τ · reg^{ℓα}_p(predCSτ(u)).    (7.13)

Case 2: p_{y′} < 1/2 for all y′ ∈ [n].

We have that ⊥ ∈ argmin_t ⟨p, ℓα_t⟩.

Case 2a: u ∈ U^τ_⊥ (i.e. predCSτ(u) = ⊥).

The RHS of Equation (7.7) is zero, and hence the inequality becomes trivial.

Case 2b: u ∈ Rn \ U^τ_⊥ (i.e. predCSτ(u) ≠ ⊥).

Let predCSτ(u) = argmax_i u_i = y. We have that u_(1) = u_y > u_(2) + τ and p_y < 1/2.

reg^{ψCS}_p(u) ≥ ⟨p, ψCS(u)⟩ − ⟨p, ψCS(0)⟩
  (7.4)=  (Σ_{i ≠ y} p_i ψCS(i, u) + p_y ψCS(y, u)) − 1
  (7.5),(7.6)≥  (1 − 2p_y)(u_(1) − u_(2))
  ≥ (1 − 2p_y) τ    (From Case 2b).    (7.14)

We also have that

reg^{ℓα}_p(predCSτ(u)) = ⟨p, ℓα_y⟩ − ⟨p, ℓα_⊥⟩ = 1/2 − p_y.    (7.15)

From Equations (7.14) and (7.15) we have

reg^{ψCS}_p(u) ≥ 2τ · reg^{ℓα}_p(predCSτ(u)).    (7.16)

Equation (7.7), and hence the theorem, follows from Equations (7.10), (7.13) and (7.16).

7.4 Excess Risk Bounds for the OVA Surrogate

In this section we give an excess risk bound relating the abstain(1/2) loss ℓ1/2 with the one-vs-all hinge surrogate [86], thereby showing that it is calibrated w.r.t. the abstain(1/2) loss.

The surrogate ψOVA : Y × Rn→R+ and predictor predOVAτ : Rn→Y ∪ {⊥} are defined as

ψOVA(y, u) = Σ_{i=1}^{n} [ 1(y = i)(1 − u_i)_+ + 1(y ≠ i)(1 + u_i)_+ ]

predOVAτ(u) = argmax_{i∈[n]} u_i   if max_j u_j > τ
            = ⊥                    otherwise,

where τ ∈ (−1, 1) is a threshold parameter, and ties are broken arbitrarily, say, in favor of the label y with the smaller index. Informally, this predictor chooses the class i with the highest score u_i if at least one of the classes has a large score, and abstains otherwise.
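A minimal sketch of the OVA surrogate and predictor (0-indexed labels, None standing in for ⊥; an illustration, not the thesis's implementation):

```python
def psi_ova(y, u):
    """One-vs-all hinge surrogate: hinge on u_y, reverse hinge on all u_i, i != y."""
    return sum(max(1 - u[i], 0.0) if i == y else max(1 + u[i], 0.0)
               for i in range(len(u)))

def pred_ova(u, tau):
    """Predict argmax_i u_i if max_j u_j > tau, else abstain (None)."""
    if max(u) > tau:
        return max(range(len(u)), key=lambda i: u[i])
    return None
```

For u = (1, −1, −1), which is the ideal score vector for class 0, ψOVA(0, u) = 0, and predOVAτ predicts class 0 for any τ < 1; if all scores lie below τ the predictor abstains.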

We now give the excess risk bound relating ψOVA and ℓ1/2.

Theorem 7.3. Let τ ∈ (−1, 1) and α = 1/2. Then for all f : X→Rn

reg^{ℓα}_D[predOVAτ ◦ f] ≤ reg^{ψOVA}_D[f] / (2(1 − |τ|)).

The following lemma gives some straightforward-to-prove (in)equalities satisfied by the OVA hinge surrogate; they will play a crucial role in the proof of the theorem above.

Lemma 7.4. For all y ∈ [n], p ∈ ∆n and u ∈ Rn,

⟨p, ψOVA(2·e^n_y − 1_n)⟩ = 4(1 − p_y),    (7.17)
⟨p, ψOVA(−1_n)⟩ = 2,                      (7.18)
ψOVA(y, u) ≥ Σ_{j∈[n]} u_j − 2u_y + n,    (7.19)

where e^n_y is the vector in Rn with 1 in the yth position and 0 everywhere else.

Proof. (Proof of Theorem 7.3)

We will show that for all p ∈ ∆n and all u ∈ [−1, ∞)^n,

reg^{ψOVA}_p(u) ≥ 2(1 − |τ|) · reg^{ℓα}_p(predOVAτ(u)).    (7.20)

The theorem follows from the observation that for all u ∈ Rn, clipping the components of u below −1 to −1 does not increase ψOVA(y, u) for any y and also does not change predOVAτ(u).

Define the sets U^τ_1, . . . , U^τ_n, U^τ_⊥ such that for any t ∈ Y ∪ {⊥},

U^τ_t = {u ∈ Rn : predOVAτ(u) = t}.

These evaluate to

U^τ_t = {u ∈ Rn : u_t > τ, t = argmax_{i∈[n]} u_i},   t ∈ [n]
U^τ_⊥ = {u ∈ Rn : u_j ≤ τ for all j ∈ [n]}.

Case 1: p_y ≥ 1/2 for some y ∈ [n].

We have that y ∈ argmin_t ⟨p, ℓα_t⟩.

Case 1a: u ∈ [−1, ∞)^n ∩ U^τ_y.

The RHS of Equation (7.20) is zero, and hence the inequality becomes trivial.

Case 1b: u ∈ [−1, ∞)^n ∩ U^τ_⊥.

We have that max_j u_j ≤ τ.

reg^{ψOVA}_p(u) ≥ ⟨p, ψOVA(u)⟩ − ⟨p, ψOVA(2e^n_y − 1_n)⟩
  (7.17)=  ⟨p, ψOVA(u)⟩ − 4(1 − p_y)
  (7.19)≥  Σ_{i=1}^{n} (1 − 2p_i)u_i + n − 4(1 − p_y)
  ≥ Σ_{i∈[n]\{y}} (1 − 2p_i)u_i + (2p_y − 1)(−τ) + n − 4(1 − p_y)
  ≥ Σ_{i∈[n]} (2p_i − 1) + (2p_y − 1)(−τ − 1) + n − 4(1 − p_y)
  = (2p_y − 1)(1 − τ).    (7.21)

We also have

reg^{ℓα}_p(predOVAτ(u)) = ⟨p, ℓα_⊥⟩ − ⟨p, ℓα_y⟩ = p_y − 1/2.    (7.22)

From Equations (7.21) and (7.22) we have for all u ∈ [−1, ∞)^n ∩ U^τ_⊥

reg^{ψOVA}_p(u) ≥ 2(1 − τ) · reg^{ℓα}_p(predOVAτ(u)).    (7.23)

Case 1c: u ∈ [−1, ∞)^n \ (U^τ_y ∪ U^τ_⊥).

We have predOVAτ(u) = y′ ≠ y. Also p_{y′} ≤ 1/2; u_{y′} > τ and u_{y′} ≥ u_y.

reg^{ψOVA}_p(u) ≥ ⟨p, ψOVA(u)⟩ − ⟨p, ψOVA(2·e^n_y − 1_n)⟩
  (7.17)=  ⟨p, ψOVA(u)⟩ − 4(1 − p_y)
  (7.19)≥  (Σ_{i=1}^{n} (1 − 2p_i)u_i + n) − 4(1 − p_y)
  ≥ (Σ_{i∈[n]\{y′}} (1 − 2p_i)u_i + (1 − 2p_{y′})τ + n) − 4(1 − p_y)
  ≥ (Σ_{i∈[n]} (2p_i − 1) + (1 − 2p_{y′})(τ + 1) + n) − 4(1 − p_y)
  ≥ 2(1 + τ)(p_y − p_{y′}).    (7.24)

We also have that

reg^{ℓα}_p(predOVAτ(u)) = ⟨p, ℓα_{y′}⟩ − ⟨p, ℓα_y⟩ = p_y − p_{y′}.    (7.25)

From Equations (7.24) and (7.25) we have for all u ∈ [−1, ∞)^n \ (U^τ_y ∪ U^τ_⊥)

reg^{ψOVA}_p(u) ≥ 2(1 + τ) · reg^{ℓα}_p(predOVAτ(u)).    (7.26)

Case 2: p_{y′} < 1/2 for all y′ ∈ [n].

We have that ⊥ ∈ argmin_t ⟨p, ℓα_t⟩.

Case 2a: u ∈ U^τ_⊥.

The RHS of Equation (7.20) is zero, and hence the inequality becomes trivial.

Case 2b: u ∈ [−1, ∞)^n \ U^τ_⊥.

Let predOVAτ(u) = argmax_i u_i = y. We have that u_y > τ and p_y < 1/2.

reg^{ψOVA}_p(u) ≥ ⟨p, ψOVA(u)⟩ − ⟨p, ψOVA(−1_n)⟩
  (7.18)=  ⟨p, ψOVA(u)⟩ − 2
  (7.19)≥  (Σ_{i=1}^{n} (1 − 2p_i)u_i + n) − 2
  ≥ (Σ_{i∈[n]\{y}} (1 − 2p_i)u_i + (1 − 2p_y)τ + n) − 2
  ≥ (Σ_{i∈[n]} (2p_i − 1) + (1 − 2p_y)(τ + 1) + n) − 2
  = (1 − 2p_y)(τ + 1).    (7.27)

We also have that

reg^{ℓα}_p(predOVAτ(u)) = ⟨p, ℓα_y⟩ − ⟨p, ℓα_⊥⟩ = 1/2 − p_y.    (7.28)

From Equations (7.27) and (7.28) we have for all u ∈ [−1, ∞)^n \ U^τ_⊥

reg^{ψOVA}_p(u) ≥ 2(1 + τ) · reg^{ℓα}_p(predOVAτ(u)).    (7.29)

Equation (7.20), and hence the theorem, follows from Equations (7.23), (7.26) and (7.29).

Remark: It has been pointed out previously by Rifkin and Klautau [86] and Zhang [116] that if the data distribution D is such that max_y p_y(x) > 0.5 for all x ∈ X, the Crammer-Singer surrogate ψCS and the one-vs-all hinge surrogate ψOVA are calibrated w.r.t. the multiclass 0-1 loss when used with the standard argmax predictor. Theorems 7.1 and 7.3 imply the above observation. However, they also give more – in the case that the distribution does not satisfy the dominant class assumption, the model learned by using the surrogate and predictor pair (ψCS, predCSτ) or (ψOVA, predOVAτ) asymptotically still gives the right answer for instances having a dominant class, and fails in a graceful manner by abstaining for instances that do not have a dominant class.

7.5 The BEP Surrogate

The Crammer-Singer surrogate and the one-vs-all hinge surrogate have a surrogate dimension of n. Thus any algorithm that minimizes these surrogates must learn n real-valued functions over the instance space. In this section, we construct a ⌈log2(n)⌉-dimensional convex surrogate, which we call the binary encoded predictions (BEP) surrogate, and give an excess risk bound relating this surrogate and the abstain(1/2) loss. In particular these results show that the BEP surrogate is calibrated w.r.t. the abstain(1/2) loss; this in turn implies that CCdim(ℓ1/2) ≤ ⌈log2(n)⌉.

For the purpose of simplicity let us assume n = 2^d for some d ∈ N. (If n is not a power of 2, just add enough dummy classes that never occur.) Let b : [n]→{+1, −1}^d be any one-one and onto mapping, with inverse mapping b^{−1} : {+1, −1}^d→[n]. For any j ∈ [d], let b_j : [n]→{+1, −1} be the jth component of b. Define the BEP surrogate ψBEP : Y × R^d→R+ and its corresponding predictor predBEPτ : R^d→Y ∪ {⊥} as

ψBEP(y, u) = (max_{j∈[d]} b_j(y) · u_j + 1)_+

predBEPτ(u) = ⊥                   if min_{i∈[d]} |u_i| ≤ τ
            = b^{−1}(sign(−u))    otherwise,

where sign(u) is the vector of signs of the components of u, with sign(0) = 1, and τ ∈ (0, 1) is a threshold parameter.

Below we give an example construction of the BEP surrogate and predictor for a fixed n, and also illustrate the predictor.

Example 7.1 (BEP surrogate for n = 4). Consider how the surrogate and predictor look for the case of n = 4 and τ = 1/2. We have d = 2. Let us fix the mapping b such that b(y) is the standard d-bit binary representation of (y − 1), with −1 in the place of 0. Then we have

ψBEP(1, u) = (max(−u_1, −u_2) + 1)_+
ψBEP(2, u) = (max(−u_1, u_2) + 1)_+
ψBEP(3, u) = (max(u_1, −u_2) + 1)_+
ψBEP(4, u) = (max(u_1, u_2) + 1)_+

predBEP_{1/2}(u) = 1   if u_1 > 1/2, u_2 > 1/2
                 = 2   if u_1 > 1/2, u_2 < −1/2
                 = 3   if u_1 < −1/2, u_2 > 1/2
                 = 4   if u_1 < −1/2, u_2 < −1/2
                 = ⊥   otherwise

An illustration of the predictor above is given in Figure 7.2.

[Figure 7.2: The partition of R^2 induced by predBEP_{1/2}, with regions U_1, U_2, U_3, U_4 in the four quadrants away from the axes and U_⊥ forming a band of half-width 1/2 around the axes.]
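The encoding, surrogate and decoding predictor of Example 7.1 can be sketched as follows (None standing in for ⊥; function names are hypothetical):

```python
def bep_code(y, d):
    """b(y): d-bit binary representation of (y - 1), with -1 in place of 0."""
    bits = format(y - 1, '0{}b'.format(d))
    return [1 if c == '1' else -1 for c in bits]

def psi_bep(y, u):
    """BEP surrogate: (max_j b_j(y) * u_j + 1)_+."""
    b = bep_code(y, len(u))
    return max(max(bj * uj for bj, uj in zip(b, u)) + 1.0, 0.0)

def pred_bep(u, tau):
    """Abstain if min_j |u_j| <= tau, else decode b^{-1}(sign(-u))."""
    if min(abs(x) for x in u) <= tau:
        return None
    # sign(0) = 1 never arises here since every |u_j| > tau > 0
    bits = ''.join('1' if -x > 0 else '0' for x in u)
    return int(bits, 2) + 1
```

For d = 2 and τ = 1/2, pred_bep recovers exactly the five-way case split of Example 7.1: e.g. u = (1, −1) decodes to class 2, while u = (0.2, 1) falls in the abstain band.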

We now give the excess risk bound relating the BEP surrogate and the abstain(1/2) loss.

Theorem 7.5. Let τ ∈ (0, 1) and α = 1/2. Let n = 2^d. Then for all f : X→R^d

reg^{ℓα}_D[predBEPτ ◦ f] ≤ reg^{ψBEP}_D[f] / (2 min(τ, 1 − τ)).

We will require the following technical lemma, which is straightforward to prove.

Lemma 7.6. For all y, y′ ∈ [n], p ∈ ∆n and u ∈ R^d with y′ ≠ b^{−1}(sign(−u)),

⟨p, ψBEP(−b(y))⟩ = 2(1 − p_y),    (7.30)
⟨p, ψBEP(0)⟩ = 1,                 (7.31)
ψBEP(b^{−1}(sign(−u)), u) ≥ −min_j |u_j| + 1,    (7.32)
ψBEP(y′, u) ≥ min_j |u_j| + 1.    (7.33)

Proof. (Proof of Theorem 7.5)

We will show that for all p ∈ ∆n and all u ∈ R^d,

reg^{ψBEP}_p(u) ≥ 2 min(τ, 1 − τ) · reg^{ℓα}_p(predBEPτ(u)).    (7.34)

The theorem follows by linearity of expectation.

Define the sets U^τ_1, . . . , U^τ_n, U^τ_⊥, where for any t ∈ Y ∪ {⊥},

U^τ_t = {u ∈ R^d : predBEPτ(u) = t}.

These evaluate to

U^τ_t = {u ∈ R^d : max_{j∈[d]} b_j(t) · u_j < −τ},   t ∈ [n]
U^τ_⊥ = {u ∈ R^d : min_{j∈[d]} |u_j| ≤ τ}.

Case 1: p_y ≥ 1/2 for some y ∈ [n].

We have that y ∈ argmin_t ⟨p, ℓα_t⟩.

Case 1a: u ∈ U^τ_y (i.e. predBEPτ(u) = y).

The RHS of Equation (7.34) is zero, and hence the inequality becomes trivial.

Case 1b: u ∈ U^τ_⊥ (i.e. predBEPτ(u) = ⊥).

Let y′ = b^{−1}(sign(−u)). We have min_j |u_j| ≤ τ.

reg^{ψBEP}_p(u) ≥ ⟨p, ψBEP(u)⟩ − ⟨p, ψBEP(−b(y))⟩
  (7.30)=  (p_{y′} ψBEP(y′, u) + Σ_{i∈[n]\{y′}} p_i ψBEP(i, u)) − 2(1 − p_y)
  (7.32),(7.33)≥  (p_{y′}(−min_{j∈[d]} |u_j|) + (1 − p_{y′})(min_{j∈[d]} |u_j|) + 1) − 2(1 − p_y)
  = (2p_{y′} − 1)(−min_{j∈[d]} |u_j|) + 1 − 2(1 − p_y)
  ≥ (2p_y − 1)(−min_{j∈[d]} |u_j|) + 1 − 2(1 − p_y)
  ≥ (2p_y − 1)(−τ) + 1 − 2(1 − p_y)
  = (2p_y − 1)(1 − τ).    (7.35)

We also have that

reg^{ℓα}_p(predBEPτ(u)) = ⟨p, ℓα_⊥⟩ − ⟨p, ℓα_y⟩ = p_y − 1/2.    (7.36)

From Equations (7.35) and (7.36) we have that

reg^{ψBEP}_p(u) ≥ 2(1 − τ) · reg^{ℓα}_p(predBEPτ(u)).    (7.37)

Case 1c: u ∈ R^d \ (U^τ_y ∪ U^τ_⊥).

Let b^{−1}(sign(−u)) = predBEPτ(u) = y′ for some y′ ≠ y. We have p_{y′} ≤ 1 − p_y ≤ 1/2 and min_j |u_j| > τ, and

reg^{ψBEP}_p(u) ≥ ⟨p, ψBEP(u)⟩ − ⟨p, ψBEP(−b(y))⟩
  (7.30)=  (p_{y′} ψBEP(y′, u) + Σ_{i∈[n]\{y′}} p_i ψBEP(i, u)) − 2(1 − p_y)
  (7.32),(7.33)≥  p_{y′}(−min_j |u_j|) + (1 − p_{y′})(min_j |u_j|) + 1 − 2(1 − p_y)
  ≥ τ(1 − 2p_{y′}) + 1 − 2(1 − p_y)    (From Case 1c)
  = τ(1 − 2p_{y′}) + 2p_y − 1
  ≥ τ(1 − 2p_{y′}) + τ(2p_y − 1)
  = 2τ(p_y − p_{y′}).    (7.38)

We also have that

reg^{ℓα}_p(predBEPτ(u)) = ⟨p, ℓα_{y′}⟩ − ⟨p, ℓα_y⟩ = p_y − p_{y′}.    (7.39)

From Equations (7.38) and (7.39) we have that

reg^{ψBEP}_p(u) ≥ 2τ · reg^{ℓα}_p(predBEPτ(u)).    (7.40)

Case 2: p_y < 1/2 for all y ∈ [n].

We have that ⊥ ∈ argmin_t ⟨p, ℓα_t⟩.

Case 2a: u ∈ U^τ_⊥.

The RHS of Equation (7.34) is zero, and hence the inequality becomes trivial.

Case 2b: u ∈ R^d \ U^τ_⊥.

Let b^{−1}(sign(−u)) = y′ = predBEPτ(u) for some y′ ∈ [n]. We have p_{y′} < 1/2 and min_j |u_j| > τ.

reg^{ψBEP}_p(u) ≥ ⟨p, ψBEP(u)⟩ − ⟨p, ψBEP(0)⟩
  (7.31)=  (p_{y′} ψBEP(y′, u) + Σ_{i∈[n]\{y′}} p_i ψBEP(i, u)) − 1
  (7.32),(7.33)≥  −p_{y′} min_j |u_j| + (1 − p_{y′}) min_j |u_j|
  ≥ (1 − 2p_{y′}) τ.    (7.41)

We also have that

reg^{ℓα}_p(predBEPτ(u)) = ⟨p, ℓα_{y′}⟩ − ⟨p, ℓα_⊥⟩ = 1/2 − p_{y′}.    (7.42)

From Equations (7.41) and (7.42) we have that

reg^{ψBEP}_p(u) ≥ 2τ · reg^{ℓα}_p(predBEPτ(u)).    (7.43)

Equation (7.34), and hence the theorem, follows from Equations (7.37), (7.40) and (7.43).

7.6 BEP Surrogate Optimization Algorithm

In this section, we frame the problem of finding the linear (vector-valued) function that minimizes the BEP surrogate loss over a training set {(x_i, y_i)}_{i=1}^{M}, with x_i ∈ R^a and y_i ∈ [n], as a convex optimization problem. Once again, for simplicity we assume that the size of the label space is n = 2^d for some d ∈ Z+. The primal and dual of the resulting optimization problem with a norm-squared regularizer are given below:

Primal problem:

min_{w_1,...,w_d, ξ_1,...,ξ_M}  Σ_{i=1}^{M} ξ_i + (λ/2) Σ_{j=1}^{d} ||w_j||^2
such that ∀i ∈ [M], j ∈ [d]:
  ξ_i ≥ b_j(y_i) w_j^T x_i + 1
  ξ_i ≥ 0

Dual problem:

max_α  − Σ_{i=1}^{M} α_{i,0} − (1/(2λ)) Σ_{i=1}^{M} Σ_{i′=1}^{M} ⟨x_i, x_{i′}⟩ μ_{i,i′}(α)
such that ∀i ∈ [M], j ∈ {0} ∪ [d]:
  α_{i,j} ≥ 0;   Σ_{j′=0}^{d} α_{i,j′} = 1,

where μ_{i,i′}(α) = Σ_{j=1}^{d} b_j(y_i) b_j(y_{i′}) α_{i,j} α_{i′,j}.

We optimize the dual as it can be easily extended to work with kernels. The structure of the constraints in the dual lends itself easily to a block co-ordinate ascent algorithm, where in each iteration we optimize over {α_{i,j} : j ∈ {0, . . . , d}} and fix every other variable. Such methods have recently been proven to have an exponential convergence rate for SVM-type problems [107], and we expect results of that type to apply to our problem as well.

The problem to be solved at every iteration reduces to an ℓ2 projection of a vector g_i ∈ R^d onto the set S_i = {g ∈ R^d : ⟨g, b(y_i)⟩ ≤ 1}. The projection problem is a simple variant of projecting a vector onto the ℓ1 ball of radius 1, which can be solved efficiently in O(d) time [33]. The vector g_i is such that for any j ∈ [d],

g_{ij} = (λ / ⟨x_i, x_i⟩) ( b_j(y_i) − (1/λ) Σ_{i′=1; i′≠i}^{M} ⟨x_i, x_{i′}⟩ α_{i′,j} b_j(y_{i′}) ).
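For reference, the underlying primitive – Euclidean projection onto an ℓ1 ball of radius 1 – can be sketched with the standard sorting-based method. This sketch runs in O(d log d) rather than the O(d) expected time of [33] (which replaces the sort with randomized pivoting), and it illustrates the primitive only, not the thesis's exact per-block subproblem:

```python
def project_simplex(v, z=1.0):
    """Euclidean projection of v onto the simplex {w : w >= 0, sum(w) = z}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - z) / i
        if ui - t > 0:       # valid pivots form a prefix; keep the last one
            theta = t
    return [max(x - theta, 0.0) for x in v]

def project_l1_ball(v, z=1.0):
    """Euclidean projection of v onto the l1 ball {w : ||w||_1 <= z}."""
    if sum(abs(x) for x in v) <= z:
        return list(v)       # already inside the ball
    w = project_simplex([abs(x) for x in v], z)
    return [wi if x >= 0 else -wi for x, wi in zip(v, w)]
```

The reduction used here is standard: project the vector of absolute values onto the simplex, then restore the original signs.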

7.7 Extensions to Other Abstain Costs

The excess risk bounds derived for the CS, OVA and BEP surrogates apply only to the abstain(1/2) loss. But it is possible to derive such excess risk bounds for abstain(α) with α ∈ [0, 1/2] via slight modifications to the CS, OVA and BEP surrogates.

Define ψ^{CS,α} : Y × R^n→R+, ψ^{OVA,α} : Y × R^n→R+ and ψ^{BEP,α} : Y × R^d→R+, with n = 2^d, as

ψ^{CS,α}(y, u) = 2 · max( α max_{j≠y} γ(u_j − u_y), (1 − α) max_{j≠y} γ(u_j − u_y) ) + 2α

ψ^{OVA,α}(y, u) = 2 · Σ_{i=1}^{n} ( 1(y = i) α (1 − u_i)_+ + 1(y ≠ i)(1 − α)(1 + u_i)_+ )

ψ^{BEP,α}(y, u) = 2 · max( α max_{j∈[d]} γ(b_j(y) u_j), (1 − α) max_{j∈[d]} γ(b_j(y) u_j) ) + 2α,

where γ(a) = max(a, −1) and b : [n]→{−1, 1}^d is any bijection. Note that ψ^{CS,1/2} = ψ^{CS}, ψ^{OVA,1/2} = ψ^{OVA} and ψ^{BEP,1/2} = ψ^{BEP}. Also note that all three surrogates above are convex, as each can be expressed as a sum or point-wise maximum of convex functions.
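As a quick sanity check of the reduction ψ^{BEP,1/2} = ψ^{BEP}, the identity can be verified numerically (hypothetical function names; b is the coding of Example 7.1):

```python
def gamma(a):
    """gamma(a) = max(a, -1)."""
    return max(a, -1.0)

def psi_bep(y, u, b):
    """Original BEP surrogate: (max_j b_j(y) u_j + 1)_+."""
    return max(max(bj * uj for bj, uj in zip(b[y], u)) + 1.0, 0.0)

def psi_bep_alpha(y, u, b, alpha):
    """Generalized BEP surrogate for abstain(alpha), alpha in [0, 1/2]."""
    m = max(gamma(bj * uj) for bj, uj in zip(b[y], u))
    return 2.0 * max(alpha * m, (1.0 - alpha) * m) + 2.0 * alpha

# The coding b of Example 7.1 (n = 4, d = 2)
b = {1: [-1, -1], 2: [-1, 1], 3: [1, -1], 4: [1, 1]}
```

At α = 1/2 the expression collapses to max(m, −m)·... more precisely to m + 1 with m = max(max_j b_j(y)u_j, −1), which equals (max_j b_j(y)u_j + 1)_+, i.e. the original ψ^{BEP}.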

One can show the following theorem, which is a generalization of Theorems 7.1, 7.3 and 7.5; the proof proceeds along the same lines as the proofs of those theorems.

Theorem 7.7. Let n ∈ N, τ ∈ (0, 1), τ′ ∈ (−1, 1) and α ∈ [0, 1/2]. Let n = 2^d. Then for all f : X→R^d and g : X→R^n

reg^{ℓα}_D[predCSτ ◦ g] ≤ (1 / (2 min(τ, 1 − τ))) · reg^{ψCS,α}_D[g]

reg^{ℓα}_D[predOVA_{τ′} ◦ g] ≤ (1 / (2(1 − |τ′|))) · reg^{ψOVA,α}_D[g]

reg^{ℓα}_D[predBEPτ ◦ f] ≤ (1 / (2 min(τ, 1 − τ))) · reg^{ψBEP,α}_D[f].

Remark: When n = 2, the CS, OVA and BEP surrogates all reduce to the hinge loss, and α is restricted to be at most 1/2 to ensure the relevance of the abstain option. Applying the above extension for α ≤ 1/2 to the hinge loss, we get the double hinge loss of Bartlett and Wegkamp [6].

7.8 Experimental Results

In this section we give our experimental results for the proposed algorithms on both synthetic and real datasets. The objective of the synthetic data experiments is to illustrate the consistency of the three proposed algorithms for the abstain loss. The objective of the experiments on real data is to illustrate that one can achieve lower error rates on multiclass datasets if the classifier is allowed to abstain, and also to show that the BEP algorithm has performance competitive with the other two algorithms, CS and OVA.

7.8.1 Synthetic Data

We optimize the CS, OVA and BEP surrogates over appropriate kernel spaces on a 2-dimensional 8-class synthetic data set, and show that the abstain(1/2) loss incurred by the trained model for all three algorithms approaches the Bayes optimal under various thresholds.

The dataset we use was generated as follows. We randomly sample 8 prototype vectors v_1, . . . , v_8 ∈ R^2, with each v_y drawn independently from a zero-mean unit-variance 2D Gaussian distribution N(0, I_2). These 8 prototype vectors correspond to the 8 classes. Each example (x, y) is generated by first picking y from one of the 8 classes uniformly at random; the instance x is then set as x = v_y + 0.65 · u, where u is independently drawn from N(0, I_2). We generated 12800 such (x, y) pairs for training, and another 10000 instances for testing.
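The generation process above can be sketched as follows (a pure-Python sketch with hypothetical names; the actual experiments presumably used standard numerical tooling):

```python
import random

def make_synthetic(n_classes=8, n_samples=12800, noise=0.65, seed=0):
    """2D synthetic data: Gaussian prototypes, one per class; each example
    is a uniformly drawn class label plus scaled standard-normal noise."""
    rng = random.Random(seed)
    # Prototypes v_1, ..., v_8 ~ N(0, I_2), independently
    protos = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_classes)]
    data = []
    for _ in range(n_samples):
        y = rng.randrange(n_classes)                  # uniform class label
        x = (protos[y][0] + noise * rng.gauss(0, 1),  # x = v_y + 0.65 * u
             protos[y][1] + noise * rng.gauss(0, 1))
        data.append((x, y))
    return protos, data
```

Calling make_synthetic() with the defaults yields the 12800-example training set; a second call with a fresh seed and n_samples=10000 yields the test set.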

The CS, OVA and BEP surrogates were all optimized over a reproducing kernel Hilbert space (RKHS) with a Gaussian kernel and the standard norm-squared regularizer. The kernel width parameter and the regularization parameter were chosen by grid search using a separate validation set.²

[Figure 7.3: Performance of the CS, OVA and BEP surrogate minimizing algorithms for various thresholds τ as a function of training size. Panels (a) CS, (b) OVA and (c) BEP plot the expected abstain loss against training size (10² to 10⁴) for several values of τ, together with the Bayes risk.]

As indicated by Figure 7.3, the expected abstain risk incurred by the trained model approaches the Bayes risk with increasing training data for all three algorithms and intermediate τ values. The excess risk bounds in Theorems 7.1, 7.3 and 7.5 break down when the threshold parameter τ ∈ {0, 1} for the CS and BEP surrogates, and when τ ∈ {−1, 1} for the OVA surrogate. This is supported by the observation that in Figure 7.3 the curves corresponding to these thresholds perform poorly. In particular, using τ = 0 for the CS and BEP algorithms implies that the resulting algorithms never abstain.

Though all three surrogate minimizing algorithms we consider are consistent w.r.t. the abstain(1/2) loss, we find that the BEP and OVA algorithms use less computation time and fewer samples than the CS algorithm to attain the same error. However, the BEP surrogate performs poorly when optimized over a linear function class (experiments not shown here), due to its much more restricted representation power.

7.8.2 Real Data

We ran experiments on real multiclass datasets from the UCI repository, the details of which are given in Table 7.1. For each of these datasets, if a train/test split was not indicated in the dataset, we made one ourselves by splitting at random.

²We used Joachims' SVM-light package [55] for the OVA and CS algorithms.

Table 7.1: Details of datasets used.

            # Train   # Test    # Feat   # Class
satimage      4,435    2,000        36         6
yeast         1,000      484         8        10
letter       16,000    4,000        16        26
vehicle         700      146        18         4
image         2,000      310        19         7
covertype    15,120  565,892        54         7

Table 7.2: Error percentages of the three algorithms when the abstain percentage is fixed at 0%, 20% and 40%.

Abstain:        0%                   20%                  40%
            CS     OVA    BEP    CS     OVA    BEP    CS     OVA    BEP
satimage    10.25  8.3    8.15   5.6    2.5    2.4    2.9    0.9    0.6
yeast       44.4   38.8   42.7   34.5   26     29.7   24     17     19.8
letter      4.8    2.8    4.6    1.4    0.1    0.6    0.4    0      0.1
vehicle     31.5   17.1   20.5   24.6   8.2    13     16.4   5.5    6.1
image       5.8    5.1    4.2    2.2    1.6    1.6    0.6    0.6    0.3
covertype   32.2   28.1   29.4   23.6   19.3   20.4   16.3   11.7   12.8

All three algorithms (CS, OVA and BEP) were optimized over an RKHS with a Gaussian kernel and the standard norm-squared regularizer. The kernel width and regularization parameters were chosen through validation – 10-fold cross-validation in the case of the satimage, yeast, vehicle and image datasets, and a 75-25 split of the train set into train and validation sets for the letter and covertype datasets. For simplicity we set τ = 0 (or τ = −1 for OVA) during the validation phase.

The results of the experiments with the CS, OVA and BEP algorithms are given in Table 7.2. The abstain rate is fixed at a given level by choosing the threshold τ for each algorithm and dataset appropriately. As can be seen from the table, the BEP algorithm's performance is comparable to that of the OVA algorithm, and is better than that of the CS algorithm. However, Table 7.3, which gives the training times for the algorithms, reveals that the BEP algorithm runs the fastest, thus making it a good option for large datasets. The main reason for the observed speedup of the BEP is that it learns only ⌈log2(n)⌉ functions for an n-class problem, and hence the speedup factor of the BEP over the OVA would

Table 7.3: Time taken for learning the final model and making predictions on the test set (does not include validation time).

            CS       OVA      BEP
satimage    2153s    76s      44s
yeast       5s       7s       2s
letter      9608s    1055s    313s
vehicle     3s       3s       1s
image       222s     16s      6s
covertype   47974s   23709s   6786s

potentially be better for larger n.

Chapter 8

Hierarchical Classification

In many practical applications of the multiclass classification problem, the class labels live in a pre-defined hierarchy. For example, in document classification the class labels are topics and they form topic hierarchies, while in computational biology the class labels are protein families, which are also best organized in a hierarchy. See Figure 8.1 for an example hierarchy used in mood classification of speech. Such problems are commonly known in the machine learning literature as hierarchical classification.

Hierarchical classification problems are of great practical importance and have been the

subject of many studies [5, 16, 18, 19, 29, 45, 46, 87, 97, 105, 106]. For a detailed review

and more references we refer the reader to a survey on hierarchical classification by Silla

Jr. and Freitas [92].

The label hierarchy has been incorporated into the problem in various ways by different approaches. Our approach, based on statistical decision theory, incorporates the hierarchy via the evaluation metric or loss matrix; this is one of the most popular and technically appealing approaches. Assuming that the class labels are single nodes in a tree, a

very natural evaluation metric is the tree-distance loss, which simply penalizes predictions according to the tree-distance between the prediction and the truth. This is a popular evaluation metric in hierarchical classification, and there have been several algorithmic and empirical studies on hierarchical classification using this metric [16, 29, 97].


Chapter 8. Hierarchical Classification 139

[Figure 8.1: Speech-based mood classification hierarchy in the Berlin dataset [15] used by Xiao et al. [112], with leaf classes Anger, Gladness, Fear, Neutral, Sadness and Boredom organized under the internal nodes Active, Non-Active, Median and Passive below the root node Speech.]

In this chapter, we give a theoretical analysis of the hierarchical classification problem that leads us to the construction of convex calibrated surrogates for the tree-distance loss, which in turn yields practical algorithms for hierarchical classification that outperform previous baselines.

In our results, we show that the Bayes optimal classifier for the tree-distance loss classifies an instance according to the deepest node in the hierarchy such that the total conditional probability of the subtree rooted at that node is greater than 1/2. We use the observation from Equation (7.2) that the Bayes classifier for the abstain(1/2) loss is a greater than 1/2 conditional probability detector, and construct a convex surrogate calibrated with the tree-distance loss that uses any convex surrogate calibrated w.r.t. the abstain(1/2) loss as a component. The surrogate minimizing algorithm corresponding to one such surrogate can be implemented using just binary SVM solvers as sub-routines, and outperforms other standard algorithms on benchmark hierarchical classification datasets.

8.1 Chapter Organization

We begin by giving some preliminaries and notation in Section 8.2. We then characterize the Bayes optimal classifier for the tree-distance loss in Section 8.3. We then reduce the hierarchical classification problem to the problem of multiclass classification with an abstain option (MCAO), and give a template for designing convex calibrated surrogates for hierarchical classification based on this reduction in Section 8.4. We detail one particular instantiation of the template, called the OVA-cascade, in Section 8.5, and conclude in Section 8.6 by giving the results of running our algorithm on some benchmark hierarchical classification datasets.

8.2 Preliminaries

In this section we define some useful objects based on the graphical structure of the tree under study.

We let Y = Ȳ = [n] in this chapter. Let H = ([n], E, W) be a connected tree over [n], with edge set E, and positive, finite edge lengths for the edges in E given by W. Let the root node be r ∈ [n]. Let the tree-distance loss function ℓH : [n] × [n]→R+ be

ℓH(y, y′) = path length in H between y and y′.

All objects defined below depend on the tree H, but we suppress this dependence in the notation to avoid clutter. For every y ∈ [n] define the descendants D(y), parent P(y), children C(y) and ancestors U(y) as follows:

D(y) = set of descendants of y, including y
P(y) = parent of y
C(y) = set of children of y
U(y) = set of ancestors of y, not including y

For all y ∈ [n], define the level of y, denoted lev(y), and the mapping S_y : ∆n→[0, 1] as follows:

lev(y) = |U(y)|
S_y(p) = Σ_{i∈D(y)} p_i

Chapter 8. Hierarchical Classification 141

Let the depth of the tree be s = maxy∈[n] lev(y). Define the sets N=j, N≤j for 0 ≤ j ≤ s

as:

N=j = y ∈ [n] : lev(y) = j

N≤j = y ∈ [n] : lev(y) ≤ j .

Define scalars αj, βj for 0 ≤ j ≤ s as

αj = maxy,y′∈N=j

`H(y, y′)

βj = maxy∈N=j

`H(y, P (y)) .

By reordering the classes we ensure that lev is a non-decreasing function of the class index, and hence we always have that N_{≤j} = [n_j] for some integers n_j, and r = 1.

For integers 0 ≤ j ≤ s define the function anc_j : [n] → [n_j] such that for all y ∈ [n],

$$\mathrm{anc}_j(y) = \begin{cases} y & \text{if } \mathrm{lev}(y) \le j \\ \text{ancestor of } y \text{ at level } j & \text{otherwise.} \end{cases}$$

For integers 0 ≤ j ≤ s define the vector function a^j : Δ_n → Δ_{n_j} with components a^j_1, …, a^j_{n_j} such that for all y ∈ [n_j],

$$a^j_y(p) = \sum_{i \in [n] : \mathrm{anc}_j(i) = y} p_i = \begin{cases} p_y & \text{if } \mathrm{lev}(y) < j \\ S_y(p) & \text{if } \mathrm{lev}(y) = j. \end{cases}$$

Note that in all the above definitions, the only terms that depend on the edge lengths W

are the scalars αj and βj.


[Figure 8.2: An example tree and an associated conditional probability vector p(x) for some instance x, along with S(p(x)). The tree has root 1 with children 2 and 3; node 2 has children 4 and 5; node 3 has children 6 and 7; node 6 has children 8 and 9; node 7 has children 10 and 11. For example, p_7 = 0.2 and S_7(p) = 0.6. The Bayes optimal prediction (node 7) is shaded here.]

8.3 Bayes Optimal Classifier for the Tree-Distance

Loss

In this section we characterize the Bayes optimal classifier minimizing the expected tree-distance loss. We show that such a predictor can be viewed as a 'greater than 1/2 conditional probability subtree detector'. We then design a scheme for computing this prediction based on this observation.

Figure 8.2 gives an illustration for Theorem 8.1 below, characterizing the Bayes optimal

classifier for the tree-distance loss.

Theorem 8.1. Let H = ([n], E, W) and let ℓ^H : [n] × [n] → R₊ be the tree-distance loss for the tree H. Then there exists a g* : X → [n] such that for all x ∈ X the following holds:

(a) S_{g*(x)}(p(x)) ≥ 1/2

(b) S_y(p(x)) ≤ 1/2, ∀y ∈ C(g*(x)).

Also, g* is a Bayes optimal classifier for the tree-distance loss, i.e.

$$\mathrm{reg}^{\ell^H}_D[g^*] = 0.$$


Proof. We shall simply show that for all p ∈ Δ_n, there exists a y* ∈ [n] such that

$$S_{y^*}(p) \ge \tfrac{1}{2} \qquad (8.1)$$

$$S_y(p) \le \tfrac{1}{2}, \quad \forall y \in C(y^*), \qquad (8.2)$$

and is such that

$$\langle p, \ell^H_{y^*} \rangle = \min_{y \in [n]} \langle p, \ell^H_y \rangle.$$

This would imply reg^{ℓ^H}_p(y*) = 0. The theorem then simply follows from linearity of expectation.

Let p ∈ ∆n. We construct a y∗ ∈ [n] satisfying Equations (8.1) and (8.2) in the following

way. We start at the root node, which always satisfies Equation (8.1), and keep on moving

to the child of the current node that satisfies Equation (8.1), and terminate when we reach

a leaf node, or a node where all of its children fail Equation (8.1). Clearly the resulting

node, y∗, satisfies both Equations (8.1) and (8.2).

Now we show that y* indeed minimizes ⟨p, ℓ^H_y⟩ over y ∈ [n]. Let y' ∈ argmin_t ⟨p, ℓ^H_t⟩. If y' = y* we are done; hence assume y' ≠ y*.

Case 1: y' ∉ D(y*)

$$\begin{aligned}
\langle p, \ell^H_{y'}\rangle - \langle p, \ell^H_{y^*}\rangle
&= \sum_{y \in D(y^*)} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) + \sum_{y \in [n]\setminus D(y^*)} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) \\
&= \sum_{y \in D(y^*)} p_y\,\ell^H(y^*,y') + \sum_{y \in [n]\setminus D(y^*)} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) \\
&\ge \sum_{y \in D(y^*)} p_y\,\ell^H(y^*,y') + \sum_{y \in [n]\setminus D(y^*)} p_y\big(-\ell^H(y',y^*)\big) \\
&= \ell^H(y',y^*)\big(2S_{y^*}(p) - 1\big) \\
&\ge 0
\end{aligned}$$

Case 2: y' ∈ D(y*) \ C(y*)

Let ỹ be the child of y* that is an ancestor of y'. Hence we have S_{ỹ}(p) ≤ 1/2.

$$\begin{aligned}
\langle p, \ell^H_{y'}\rangle - \langle p, \ell^H_{y^*}\rangle
&= \sum_{y \in D(\tilde{y})} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) + \sum_{y \in [n]\setminus D(\tilde{y})} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) \\
&= \sum_{y \in D(\tilde{y})} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) + \sum_{y \in [n]\setminus D(\tilde{y})} p_y\,\ell^H(y^*,y') \\
&\ge \sum_{y \in D(\tilde{y})} p_y\big(-\ell^H(y^*,y')\big) + \sum_{y \in [n]\setminus D(\tilde{y})} p_y\,\ell^H(y^*,y') \\
&= \ell^H(y',y^*)\big(1 - 2S_{\tilde{y}}(p)\big) \\
&\ge 0
\end{aligned}$$

Case 3: y' ∈ C(y*)

$$\begin{aligned}
\langle p, \ell^H_{y'}\rangle - \langle p, \ell^H_{y^*}\rangle
&= \sum_{y \in D(y')} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) + \sum_{y \in [n]\setminus D(y')} p_y\big(\ell^H(y,y') - \ell^H(y,y^*)\big) \\
&= \sum_{y \in D(y')} p_y\big(-\ell^H(y',y^*)\big) + \sum_{y \in [n]\setminus D(y')} p_y\,\ell^H(y',y^*) \\
&= \ell^H(y',y^*)\big(1 - 2S_{y'}(p)\big) \\
&\ge 0
\end{aligned}$$

Putting all three cases together, we have

$$\langle p, \ell^H_{y^*}\rangle \le \langle p, \ell^H_{y'}\rangle = \min_{y \in [n]} \langle p, \ell^H_y\rangle. \qquad \square$$

For any instance x with conditional probability p ∈ Δ_n, Theorem 8.1 says that predicting the y ∈ [n] that has the largest level among those with S_y(p) ≥ 1/2 is optimal. Surprisingly, this does not depend on the edge lengths W.

Theorem 8.1 suggests the following scheme to find the optimal prediction for a given instance with conditional probability p:

1. For each j ∈ {1, 2, …, s}, create a multiclass problem instance with the classes being elements of N_{≤j} = [n_j], and the probability associated with each class y ∈ N_{≤j} equal to a^j_y(p), i.e. p_y if lev(y) < j and S_y(p) if lev(y) = j.

2. For each multiclass problem j ∈ {1, 2, …, s}, if there exists a class with probability mass at least 1/2, assign it to v*_j; otherwise let v*_j = ⊥.

3. Find the largest j such that v*_j ≠ ⊥ and return the corresponding v*_j, or return the root 1 if v*_j = ⊥ for all j ∈ [s].

We will illustrate the above procedure for the example in Figure 8.2.

Example 8.1. From Figure 8.2 we have that s = 3. The three induced multiclass problems are given below.

1. n_1 = 3, and the class probabilities are given as (1/10)[0, 3, 7]. Clearly, v*_1 = 3.

2. n_2 = 7, and the class probabilities are given as (1/10)[0, 2, 0, 1, 0, 1, 6]. Clearly, v*_2 = 7.

3. n_3 = 11, and the class probabilities are given as (1/10)[0, 2, 0, 1, 0, 0, 2, 1, 0, 2, 2]. Clearly, v*_3 = ⊥.

And hence the largest j such that v*_j ≠ ⊥ is 2, and the scheme returns v*_2 = 7.
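The scheme of Section 8.3 can also be sketched directly in code. The following is an illustrative implementation (the names and tree encoding are our own, not the thesis's), run on the tree and probability vector of Figure 8.2; it walks down the tree as in the proof of Theorem 8.1, which is equivalent to the three-step scheme above.

```python
# Illustrative sketch of the Bayes optimal prediction for the tree-distance
# loss, on the example tree of Figure 8.2. Nodes are 1-based; root is 1.

children = {1: [2, 3], 2: [4, 5], 3: [6, 7],
            6: [8, 9], 7: [10, 11]}

def subtree_prob(y, p):
    """S_y(p): total conditional probability mass in the subtree rooted at y."""
    return p[y] + sum(subtree_prob(c, p) for c in children.get(y, []))

def bayes_optimal(p, root=1):
    """Walk down from the root, moving to a child whose subtree mass is
    >= 1/2; stop at a leaf or when all children fall below 1/2."""
    y = root
    while True:
        heavy = [c for c in children.get(y, []) if subtree_prob(c, p) >= 0.5]
        if not heavy:
            return y
        y = heavy[0]  # at most one child can exceed 1/2 (ties only at exactly 1/2)

# Conditional probabilities from Figure 8.2
p = {1: 0, 2: 0.2, 3: 0, 4: 0.1, 5: 0, 6: 0, 7: 0.2,
     8: 0.1, 9: 0, 10: 0.2, 11: 0.2}

print(bayes_optimal(p))  # -> 7, matching Example 8.1
```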

The reason such a scheme is of interest to us is that the second step above exactly corresponds to the Bayes optimal classifier for the abstain(1/2) loss employed in the problem of multiclass classification with an abstain option, as given in Chapter 7, Equation (7.2).

8.4 Cascade Surrogate for Hierarchical Classification

In this section we construct a template surrogate ψ_C and template predictor pred_C based on the scheme in Section 8.3, constituted of simpler surrogates ψ_j and predictors pred_j. We then give an excess risk bound relating ψ_C and ℓ^H, assuming the existence of excess risk bounds relating the component surrogates ψ_j and the abstain(1/2) loss. We shall denote the abstain(1/2) loss with n_j classes as ℓ^{?,n_j} in this chapter.

For all j ∈ {1, 2, …, s}, let the surrogate ψ_j : [n_j] × R^{d_j} → R₊ and predictor pred_j : R^{d_j} → ([n_j] ∪ {⊥}) be such that they are calibrated w.r.t. ℓ^{?,n_j}, for some integers d_j. Let d = Σ_{j=1}^s d_j. Let any u ∈ R^d be decomposed as u = [u_1ᵀ, …, u_sᵀ]ᵀ, with each u_j ∈ R^{d_j}.

The template surrogate, which we call the cascade surrogate ψ_C : [n] × R^d → R₊, is defined in terms of its constituent surrogates as follows:

$$\psi_C(y, u) = \sum_{j=1}^{s} \psi_j(\mathrm{anc}_j(y), u_j). \qquad (8.3)$$

The template predictor pred_C is defined via the functions pred_C^j : R^{d_1} × … × R^{d_j} → [n_j], which are defined recursively as follows:

$$\mathrm{pred}_C^j(u_1, \ldots, u_j) = \begin{cases} \mathrm{pred}_j(u_j) & \text{if } \mathrm{pred}_j(u_j) \ne \bot \\ \mathrm{pred}_C^{j-1}(u_1, \ldots, u_{j-1}) & \text{otherwise.} \end{cases} \qquad (8.4)$$

The function pred_C^0 takes no arguments and simply returns 1 (the root node). Occasionally we abuse notation by representing pred_C^j(u_1, …, u_j) simply as pred_C^j(u).

The template predictor pred_C : R^d → [n] is simply defined as pred_C(u) = pred_C^s(u_1, …, u_s).
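As an illustration (the names here are our own, not the thesis's), the recursion in Equation (8.4) unrolls into a scan from the deepest level upwards, returning the first non-abstaining component prediction:

```python
# Illustrative sketch of the cascade predictor of Equation (8.4).
# component_preds[j-1] plays the role of pred_j: it maps the score block
# u_j to a node in [n_j], or to None (standing in for the abstain symbol).

def cascade_predict(u_blocks, component_preds):
    """u_blocks = [u_1, ..., u_s]; scan levels from deepest to shallowest,
    returning the first non-abstaining component prediction, else the root."""
    for u_j, pred_j in zip(reversed(u_blocks), reversed(component_preds)):
        y = pred_j(u_j)
        if y is not None:      # pred_j did not abstain
            return y
    return 1                   # pred_C^0: the root node

# Toy usage with s = 2: level 2 abstains, level 1 predicts node 3.
preds = [lambda u: 3, lambda u: None]
print(cascade_predict([None, None], preds))  # -> 3
```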

The lemma below captures the essence of the reduction from the problem of hierarchical classification to the problem of multiclass classification with an abstain option.

Lemma 8.2. For all p ∈ Δ_n, u ∈ R^d we have

$$\mathrm{reg}^{\ell^H}_p(\mathrm{pred}_C(u)) \le \sum_{j=1}^{s} \gamma_j(u_j) \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)),$$

where

$$\gamma_j(u_j) = \begin{cases} 2\alpha_j & \text{if } \mathrm{pred}_j(u_j) \ne \bot \\ 2\beta_j & \text{if } \mathrm{pred}_j(u_j) = \bot. \end{cases}$$

Proof. For all 0 ≤ j ≤ s, define `H,j : [nj] × [nj]→R+ as simply the restriction of `H to

[nj]× [nj].


For all j ∈ [s], we will first prove a bound relating the tree-distance regret at level j with the tree-distance regret at level j − 1 and the abstain loss regret at level j, as follows:

$$\mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(\mathrm{pred}_C^j(u)) \le \mathrm{reg}^{\ell^{H,j-1}}_{a^{j-1}(p)}(\mathrm{pred}_C^{j-1}(u)) + \gamma_j(u_j) \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)).$$

The lemma would then simply follow from applying this bound recursively and observing that reg^{ℓ^{H,0}}_{a^0(p)}(pred_C^0(u)) = 0.

One observation about the tree-distance loss that will often be of use in the proof is that for all non-root y ∈ [n],

$$\ell^H(y, y') - \ell^H(P(y), y') = \begin{cases} -\ell^H(y, P(y)) & \text{if } y' \in D(y) \\ \ell^H(y, P(y)) & \text{otherwise.} \end{cases}$$

The details of the proof follow. Fix j ∈ [s], u ∈ R^d, p ∈ Δ_n. Let

$$y^*_j \in \operatorname{argmin}_{y \in [n_j]} \mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(y).$$

Case 1: pred_j(u_j) ≠ ⊥

$$\begin{aligned}
\mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(\mathrm{pred}_C^j(u)) &= \sum_{y=1}^{n_j} a^j_y(p)\big(\ell^H(y, \mathrm{pred}_C^j(u)) - \ell^H(y, y^*_j)\big) \\
&\le \ell^H(y^*_j, \mathrm{pred}_C^j(u))\big(1 - 2a^j_{\mathrm{pred}_C^j(u)}(p)\big) \qquad (8.5)
\end{aligned}$$

We also have

$$\begin{aligned}
\mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)) &= 1 - a^j_{\mathrm{pred}_j(u_j)}(p) - \min_{y \in [n_j] \cup \{\bot\}} \langle a^j(p), \ell^{?,n_j}_y \rangle \\
&\ge 1 - a^j_{\mathrm{pred}_j(u_j)}(p) - \langle a^j(p), \ell^{?,n_j}_\bot \rangle \\
&= \tfrac{1}{2} - a^j_{\mathrm{pred}_j(u_j)}(p) \\
&= \tfrac{1}{2} - a^j_{\mathrm{pred}_C^j(u)}(p). \qquad (8.6)
\end{aligned}$$

The last equality above follows because if pred_j(u_j) ≠ ⊥, then pred_C^j(u) = pred_j(u_j).


Putting Equations (8.5) and (8.6) together, we get

$$\begin{aligned}
\mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(\mathrm{pred}_C^j(u)) &\le 2\,\ell^H(y^*_j, \mathrm{pred}_C^j(u)) \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)) \\
&\le 2\alpha_j \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)) \qquad (8.7)
\end{aligned}$$

Case 2: pred_j(u_j) = ⊥

In this case pred_C^j(u) = pred_C^{j-1}(u), and hence lev(pred_C^j(u)) ≤ j − 1.

We now have

$$\begin{aligned}
\langle a^j(p), \ell^{H,j}_{\mathrm{pred}_C^j(u)} \rangle - \langle a^{j-1}(p), \ell^{H,j-1}_{\mathrm{pred}_C^{j-1}(u)} \rangle
&= \langle a^j(p), \ell^{H,j}_{\mathrm{pred}_C^j(u)} \rangle - \langle a^{j-1}(p), \ell^{H,j-1}_{\mathrm{pred}_C^j(u)} \rangle \\
&= \sum_{y \in N_{=j}} S_y(p)\big(\ell^H(y, \mathrm{pred}_C^j(u)) - \ell^H(P(y), \mathrm{pred}_C^j(u))\big) \\
&= \sum_{y \in N_{=j}} S_y(p)\,\ell^H(y, P(y)) \qquad (8.8)
\end{aligned}$$

For ease of analysis, we divide case 2 further into two sub-cases.

Case 2a: lev(y*_j) < j

$$\begin{aligned}
\langle a^{j-1}(p), \ell^{H,j-1}_{y^*_{j-1}} \rangle - \langle a^j(p), \ell^{H,j}_{y^*_j} \rangle
&= \langle a^{j-1}(p), \ell^{H,j-1}_{y^*_{j-1}} \rangle - \langle a^{j-1}(p), \ell^{H,j-1}_{y^*_j} \rangle + \langle a^{j-1}(p), \ell^{H,j-1}_{y^*_j} \rangle - \langle a^j(p), \ell^{H,j}_{y^*_j} \rangle \\
&\le \langle a^{j-1}(p), \ell^{H,j-1}_{y^*_j} \rangle - \langle a^j(p), \ell^{H,j}_{y^*_j} \rangle \\
&= \sum_{y \in N_{=j}} S_y(p)\big(\ell^H(P(y), y^*_j) - \ell^H(y, y^*_j)\big) \\
&= \sum_{y \in N_{=j}} S_y(p)\big(-\ell^H(y, P(y))\big) \qquad (8.9)
\end{aligned}$$

Adding Equations (8.8) and (8.9), we get

$$\mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(\mathrm{pred}_C^j(u)) \le \mathrm{reg}^{\ell^{H,j-1}}_{a^{j-1}(p)}(\mathrm{pred}_C^{j-1}(u)) \qquad (8.10)$$

Case 2b: lev(y*_j) = j

For any integers a, b with a ≤ b and vector v ∈ R^b, let v|_{1:a} ∈ R^a be the vector given by the first a components of v.

$$\begin{aligned}
\langle a^{j-1}(p), \ell^{H,j-1}_{y^*_{j-1}} \rangle - \langle a^{j-1}(p), \ell^H_{y^*_j}\big|_{1:n_{j-1}} \rangle
&\le \langle a^{j-1}(p), \ell^{H,j-1}_{P(y^*_j)} \rangle - \langle a^{j-1}(p), \ell^H_{y^*_j}\big|_{1:n_{j-1}} \rangle \\
&= \sum_{y \in N_{\le j-1}} a^{j-1}_y(p)\big(\ell^H(y, P(y^*_j)) - \ell^H(y, y^*_j)\big) \\
&= \sum_{y \in N_{\le j-1}} a^{j-1}_y(p)\big(-\ell^H(y^*_j, P(y^*_j))\big) \\
&= -\ell^H(y^*_j, P(y^*_j)) \qquad (8.11)
\end{aligned}$$

Also,

$$\begin{aligned}
\langle a^{j-1}(p), \ell^H_{y^*_j}\big|_{1:n_{j-1}} \rangle - \langle a^j(p), \ell^{H,j}_{y^*_j} \rangle
&= \sum_{y \in N_{=j}} S_y(p)\big(\ell^H(P(y), y^*_j) - \ell^H(y, y^*_j)\big) \\
&= \sum_{y \in N_{=j} \setminus \{y^*_j\}} S_y(p)\big(-\ell^H(y, P(y))\big) + S_{y^*_j}(p)\,\ell^H(y^*_j, P(y^*_j)). \qquad (8.12)
\end{aligned}$$

Adding Equations (8.8), (8.11) and (8.12), we get

$$\begin{aligned}
\mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(\mathrm{pred}_C^j(u)) &\le \mathrm{reg}^{\ell^{H,j-1}}_{a^{j-1}(p)}(\mathrm{pred}_C^{j-1}(u)) + (2S_{y^*_j}(p) - 1) \cdot \ell^H(y^*_j, P(y^*_j)) \\
&\le \mathrm{reg}^{\ell^{H,j-1}}_{a^{j-1}(p)}(\mathrm{pred}_C^{j-1}(u)) + (2S_{y^*_j}(p) - 1) \cdot \beta_j. \qquad (8.13)
\end{aligned}$$

Inequality (8.13) follows because, by the definition of y*_j and Theorem 8.1, we have that S_{y*_j}(p) ≥ 1/2.


Also, we have that

$$\begin{aligned}
\mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)) &= \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\bot) \\
&= \tfrac{1}{2} - \min_{y \in [n_j] \cup \{\bot\}} \langle a^j(p), \ell^{?,n_j}_y \rangle \\
&\ge \tfrac{1}{2} - \langle a^j(p), \ell^{?,n_j}_{y^*_j} \rangle \\
&= \tfrac{1}{2} - (1 - S_{y^*_j}(p)) \\
&= S_{y^*_j}(p) - \tfrac{1}{2}. \qquad (8.14)
\end{aligned}$$

Putting Equations (8.13) and (8.14) together, we have that

$$\mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(\mathrm{pred}_C^j(u)) \le \mathrm{reg}^{\ell^{H,j-1}}_{a^{j-1}(p)}(\mathrm{pred}_C^{j-1}(u)) + 2\beta_j \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)). \qquad (8.15)$$

Putting together the results for case 1, case 2a and case 2b, from Equations (8.7), (8.10) and (8.15) respectively, we have

$$\mathrm{reg}^{\ell^{H,j}}_{a^j(p)}(\mathrm{pred}_C^j(u)) \le \mathrm{reg}^{\ell^{H,j-1}}_{a^{j-1}(p)}(\mathrm{pred}_C^{j-1}(u)) + \gamma_j(u_j) \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)). \qquad \square$$

Lemma 8.2 bounds the ℓ^H regret on a distribution D by a weighted sum of abstain loss regrets, each over a modified distribution derived from D. Each component of the surrogate ψ_C is designed exactly to minimize the abstain loss for the corresponding modified distribution. Assuming an excess risk bound relating ψ_j and ℓ^{?,n_j} for all j ∈ [s], one can easily derive an excess risk bound relating ℓ^H and ψ_C. This is done in the theorem below.

Theorem 8.3. For all j ∈ [s], let ψ_j : [n_j] × R^{d_j} → R₊ and pred_j : R^{d_j} → ([n_j] ∪ {⊥}) be such that for all f_j : X → R^{d_j} and all distributions D over X × [n_j] we have

$$\mathrm{reg}^{\ell^{?,n_j}}_D[\mathrm{pred}_j \circ f_j] \le C \cdot \mathrm{reg}^{\psi_j}_D[f_j],$$

for some constant C > 0. Then for all f : X → R^d and distributions D over X × [n],

$$\mathrm{reg}^{\ell^H}_D[\mathrm{pred}_C \circ f] \le 2\alpha_s C \cdot \mathrm{reg}^{\psi_C}_D[f].$$

Proof. Fix u ∈ R^d, p ∈ Δ_n. From Lemma 8.2 and the observation that γ_j(u_j) ≤ 2α_j, we have that

$$\begin{aligned}
\mathrm{reg}^{\ell^H}_p(\mathrm{pred}_C(u)) &\le \sum_{j=1}^{s} 2\alpha_j \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)) \\
&\le 2\alpha_s \cdot \sum_{j=1}^{s} \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)) \\
&\le 2\alpha_s C \cdot \sum_{j=1}^{s} \mathrm{reg}^{\psi_j}_{a^j(p)}(u_j) \\
&= 2\alpha_s C \cdot \mathrm{reg}^{\psi_C}_p(u).
\end{aligned}$$

The proof now simply follows from linearity of expectation.

Thus, one just needs to plug in appropriate convex surrogates ψ_j and predictors pred_j to get concrete calibrated surrogates for hierarchical classification. The results of Chapter 7 immediately give three such surrogates, by setting the component surrogates ψ_j and predictors pred_j to be the CS, OVA or BEP surrogates and predictors. We call the resulting algorithms (surrogates and predictors) the CS-cascade, OVA-cascade and BEP-cascade.

An interesting consequence of the BEP-cascade being calibrated w.r.t. ℓ^H is that

$$\mathrm{CCdim}(\ell^H) \le s \cdot \lceil \log_2(n) \rceil,$$

which for balanced trees is of the order of (log n)². Disappointingly though, the BEP-cascade algorithm does not work as well as OVA-cascade and CS-cascade, with parametric function classes like the linear function class, on real data sets.


Algorithm 2 OVA-Cascade Training

Input: S = ((x_1, y_1), …, (x_M, y_M)) ∈ (X × [n])^M, H = ([n], E).
Parameters: Regularization parameter C > 0
for i = 1 : n
    Let t_j = 2 · 1(y_j ∈ D(i)) − 1, ∀j ∈ [M]
    T_i = ((x_1, t_1), …, (x_M, t_M)) ∈ (X × {+1, −1})^M
    f_i = SVM-Train(T_i, C)
    Let t'_j = 2 · 1(y_j = i) − 1, ∀j ∈ [M]
    T'_i = ((x_1, t'_1), …, (x_M, t'_M)) ∈ (X × {+1, −1})^M
    f'_i = SVM-Train(T'_i, C)
end for

8.5 OVA-Cascade Algorithm

In this section, we will consider the OVA cascade in more detail, as the problem of opti-

mizing the OVA-cascade surrogate is very amenable to being broken down into multiple

separate binary SVM problems, and thus can be easily parallelized. Also, there are nu-

merous efficient solvers for binary SVM, removing the need to tailor a generic convex

optimization algorithm to suit our purpose.

To make the dependence on the number of classes more explicit, we shall use ψ^{OVA,n} and pred^{OVA,n}_τ to denote the OVA surrogate and predictor with n classes and threshold τ, as in Section 7.4.

Let ψ_j = ψ^{OVA,n_j} and pred_j = pred^{OVA,n_j}_{τ_j} for some τ_j ∈ (−1, 1). In this case we have d_j = n_j. In the surrogate-minimizing algorithm for OVA-cascade, one solves s one-vs-all SVM problems. Problem j has n_j classes, with the classes corresponding to the n_{j−1} nodes in the hierarchy at level less than j, and n_j − n_{j−1} 'super-nodes' at level j, each of which also absorbs its descendant nodes. The resulting training and prediction algorithms can thus be simplified; they are presented in Algorithms 2 and 3. The training phase requires an SVM optimization sub-routine, SVM-Train, which takes in a binary dataset and a regularization parameter C and returns a real-valued function over the instance space minimizing the regularized hinge loss over an appropriate function space.

Theorems 7.3 and 8.3 immediately give the following excess risk bound for OVA-cascade.


Algorithm 3 OVA-Cascade Prediction

Input: x ∈ X, H = ([n], E), trained models f_i, f'_i for all i ∈ [n]
Parameters: Scalars τ_1, …, τ_s in (−1, 1)
for j = s down to 1
    Construct u ∈ R^{n_j} such that, for all i ∈ [n_j],
        u_i = f_i(x) if lev(i) = j
        u_i = f'_i(x) if lev(i) < j
    if max_i u_i > τ_j
        return argmax_i u_i
    end if
end for
return 1
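The prediction loop of Algorithm 3 can be sketched as follows. This is an illustrative implementation, not the thesis's code: the score functions f_i, f'_i are assumed to be given as Python callables, and None-free dicts `f`, `f_prime`, `lev`, `tau` encode the trained models, levels and thresholds.

```python
# Illustrative sketch of OVA-cascade prediction (Algorithm 3), given
# per-node scorers f (level-j "super-node" scores) and f_prime
# (exact-node scores) as callables.

def ova_cascade_predict(x, f, f_prime, lev, n_levels, tau):
    """f, f_prime: dicts node -> callable score; lev: dict node -> level;
    tau: dict level -> threshold in (-1, 1). Returns a node in [n]."""
    for j in range(n_levels, 0, -1):
        # Nodes visible at level j: all nodes of level <= j.
        nodes = [i for i in lev if lev[i] <= j]
        # Super-node score at level j, exact-node score above it.
        u = {i: (f[i](x) if lev[i] == j else f_prime[i](x)) for i in nodes}
        best = max(u, key=u.get)
        if u[best] > tau[j]:      # confident enough: predict at this level
            return best
    return 1                      # fall back to the root

# Toy usage: depth-1 hierarchy, root 1 with leaves 2 and 3.
lev = {1: 0, 2: 1, 3: 1}
f = {1: lambda x: 0.0, 2: lambda x: 0.9, 3: lambda x: -0.5}
f_prime = {1: lambda x: -1.0, 2: lambda x: -1.0, 3: lambda x: -1.0}
tau = {1: 0.0}
print(ova_cascade_predict(None, f, f_prime, lev, 1, tau))  # -> 2
```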

Corollary 8.4. For 1 ≤ j ≤ s, let τ_j ∈ (−1, 1). Let the component surrogates and predictors of ψ_C and pred_C be ψ_j = ψ^{OVA,n_j} and pred_j = pred^{OVA,n_j}_{τ_j}. Then, for all distributions D and functions f : X → R^d,

$$\mathrm{reg}^{\ell^H}_D[\mathrm{pred}_C \circ f] \le \frac{\alpha_s}{1 - \max_j |\tau_j|} \cdot \mathrm{reg}^{\psi_C}_D[f].$$

To get the best bound from Corollary 8.4, one must set τj = 0 for all j ∈ [s]. However,

using a slightly more intricate version of Theorem 7.3 and the full extent of Lemma 8.2,

one can give a better upper bound for the `H-regret than in Corollary 8.4, and this tighter

upper bound is minimized for a different τj. This observation is captured by Theorem

8.5 below.

Theorem 8.5. For 1 ≤ j ≤ s, let τ_j = (α_j − β_j)/(α_j + β_j). Let the component surrogates and predictors of ψ_C and pred_C be ψ_j = ψ^{OVA,n_j} and pred_j = pred^{OVA,n_j}_{τ_j}. Then, for all distributions D and functions f : X → R^d,

$$\mathrm{reg}^{\ell^H}_D[\mathrm{pred}_C \circ f] \le \frac{1}{2} \max_{j \in [s]} (\alpha_j + \beta_j) \cdot \mathrm{reg}^{\psi_C}_D[f].$$

To prove the above theorem we first give the more intricate version of Theorem 7.3, whose

proof still follows from the proof of Theorem 7.3.


Lemma 8.6. Let τ ∈ (−1, 1). Then for all u ∈ R^n, p ∈ Δ_n,

$$\mathrm{reg}^{\ell^{?,n}}_p(\mathrm{pred}^{OVA,n}_\tau(u)) \le \left( \frac{\mathbf{1}(\mathrm{pred}^{OVA,n}_\tau(u) = \bot)}{2(1-\tau)} + \frac{\mathbf{1}(\mathrm{pred}^{OVA,n}_\tau(u) \ne \bot)}{2(1+\tau)} \right) \cdot \mathrm{reg}^{\psi^{OVA,n}}_p(u).$$

Proof (of Theorem 8.5). Let u ∈ R^d, p ∈ Δ_n. From Lemmas 8.2 and 8.6, we have that

$$\begin{aligned}
\mathrm{reg}^{\ell^H}_p(\mathrm{pred}_C(u))
&\le \sum_{j=1}^{s} \gamma_j(u_j) \cdot \mathrm{reg}^{\ell^{?,n_j}}_{a^j(p)}(\mathrm{pred}_j(u_j)) \\
&\le \sum_{j=1}^{s} \gamma_j(u_j) \left( \frac{\mathbf{1}(\mathrm{pred}^{OVA,n_j}_{\tau_j}(u_j) = \bot)}{2(1-\tau_j)} + \frac{\mathbf{1}(\mathrm{pred}^{OVA,n_j}_{\tau_j}(u_j) \ne \bot)}{2(1+\tau_j)} \right) \cdot \mathrm{reg}^{\psi^{OVA,n_j}}_{a^j(p)}(u_j) \\
&= \sum_{j=1}^{s} \left( \frac{\beta_j \cdot \mathbf{1}(\mathrm{pred}^{OVA,n_j}_{\tau_j}(u_j) = \bot)}{1-\tau_j} + \frac{\alpha_j \cdot \mathbf{1}(\mathrm{pred}^{OVA,n_j}_{\tau_j}(u_j) \ne \bot)}{1+\tau_j} \right) \cdot \mathrm{reg}^{\psi^{OVA,n_j}}_{a^j(p)}(u_j)
\end{aligned}$$

For each j ∈ [s], the coefficients of both terms within parentheses (i.e. α_j/(1+τ_j) and β_j/(1−τ_j)) both evaluate to (α_j + β_j)/2 when the thresholds are set as τ_j = (α_j − β_j)/(α_j + β_j). In fact, it can easily be seen that this value of τ_j minimizes the worst-case coefficient of reg^{ψ^{OVA,n_j}}_{a^j(p)}(u_j) in the bound. Thus, we have

$$\begin{aligned}
\mathrm{reg}^{\ell^H}_p(\mathrm{pred}_C(u)) &\le \sum_{j=1}^{s} \frac{1}{2}(\alpha_j + \beta_j) \cdot \mathrm{reg}^{\psi^{OVA,n_j}}_{a^j(p)}(u_j) \\
&\le \frac{1}{2} \max_{j \in [s]} (\alpha_j + \beta_j) \cdot \sum_{j=1}^{s} \mathrm{reg}^{\psi^{OVA,n_j}}_{a^j(p)}(u_j) \\
&= \frac{1}{2} \max_{j \in [s]} (\alpha_j + \beta_j) \cdot \mathrm{reg}^{\psi_C}_p(u).
\end{aligned}$$

The theorem now follows from linearity of expectation.

One can clearly see the effect of the improved bounds given by setting τ_j as in Theorem 8.5 for the unweighted hierarchy, in which case α_j = 2j and β_j = 1.

Corollary 8.7. Let the hierarchy H be an unweighted tree with all edges having length 1. Let the component surrogates and predictors of ψ_C and pred_C be ψ_j = ψ^{OVA,n_j} and pred_j = pred^{OVA,n_j}_{τ_j}.

a. For all j ∈ [s] let τ_j = 0. Then, for all distributions D and functions f : X → R^d,

$$\mathrm{reg}^{\ell^H}_D[\mathrm{pred}_C \circ f] \le 2s \cdot \mathrm{reg}^{\psi_C}_D[f].$$

b. For all j ∈ [s] let τ_j = (2j−1)/(2j+1). Then, for all distributions D and functions f : X → R^d,

$$\mathrm{reg}^{\ell^H}_D[\mathrm{pred}_C \circ f] \le \left(s + \frac{1}{2}\right) \cdot \mathrm{reg}^{\psi_C}_D[f].$$

Thus setting τ_j = (2j−1)/(2j+1) gives almost a factor 2 improvement over setting τ_j = 0. This threshold setting is also intuitively satisfying: it says to use a higher threshold and predict conservatively (abstain more often) at deeper levels, and to use a lower threshold and predict aggressively at levels nearer the root. In practice, the optimal thresholds are distribution dependent and are best obtained via cross-validation.
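To make the threshold choice concrete, the following sketch (illustrative, with our own function names) computes the Theorem 8.5 thresholds τ_j = (α_j − β_j)/(α_j + β_j) and the two bound constants of Corollary 8.7 for an unweighted hierarchy of depth s, where α_j = 2j and β_j = 1:

```python
# Illustrative computation of the OVA-cascade thresholds of Theorem 8.5
# for an unweighted hierarchy (alpha_j = 2j, beta_j = 1), and the bound
# constants of Corollary 8.7.

def cascade_thresholds(s):
    """tau_j = (alpha_j - beta_j) / (alpha_j + beta_j) = (2j-1)/(2j+1)."""
    return [(2 * j - 1) / (2 * j + 1) for j in range(1, s + 1)]

def bound_constants(s):
    """(constant with tau_j = 0, constant with the optimized tau_j)."""
    alpha = [2 * j for j in range(1, s + 1)]
    beta = [1] * s
    naive = max(alpha)                                     # equals 2s
    tuned = max((a + b) / 2 for a, b in zip(alpha, beta))  # equals s + 1/2
    return naive, tuned

print(cascade_thresholds(3))  # thresholds increase with depth j
print(bound_constants(3))     # -> (6, 3.5)
```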

8.6 Experiments

We run our cascade surrogate based algorithm for hierarchical classification on some stan-

dard document classification tasks with a class hierarchy and compare the results against

other standard algorithms. We use the unweighted tree-distance loss as the evaluation

metric. The details of the datasets and the algorithms are given below.

8.6.1 Datasets

We used several standard multiclass document classification datasets, all of which have

one class label per example. The basic statistics of the datasets are given in Table 8.1.

• CLEF [32] Hierarchical collection of medical images.

• IPC 1 Patents organized according to the International Patent Classification Hier-

archy.

1http://www.wipo.int/classifications/ipc/en/support/


Table 8.1: Dataset Statistics.

Dataset      #Train   #Validation  #Test   #Labels  #Leaf-Labels  Depth  #Features
CLEF         9,000    1,000        1,006   97       63            3      89
LSHTC-small  4,463    1,860        1,858   2,388    1,139         5      51,033
IPC          35,000   11,324       28,926  553      451           3      541,869
DMOZ-2010    80,000   13,805       34,905  17,222   12,294        5      381,580
DMOZ-2012    250,000  50,000       83,408  13,347   11,947        5      348,548

Table 8.2: Average tree-distance loss on the test set for various algorithms and datasets. Runs that failed due to memory issues are denoted as '-'.

Dataset      Root  OVA   HSVM-margin  HSVM-slack  CS-Cascade  OVA-Cascade  Plug-in
CLEF         3.00  1.10  0.98         1.00        0.91        0.95         0.97
LSHTC-small  4.77  4.12  3.47         3.54        3.20        3.19         3.26
IPC          2.97  2.29  -            -           -           2.06         2.05
DMOZ-2010    4.65  3.96  -            -           -           3.12         3.16
DMOZ-2012    4.75  2.83  -            -           -           2.46         2.48

Table 8.3: Training times (not including validation) in hours (h) or seconds (s). Runs that failed due to memory issues are denoted as '-'.

Dataset      Root  OVA     HSVM-margin  HSVM-slack  CS-Cascade  OVA-Cascade  Plug-in
CLEF         0 s   35 s    50 s         45 s        20 s        50 s         66 s
LSHTC-small  0 h   0.24 h  2.1 h        1.8 h       1.7 h       0.3 h        0.5 h
IPC          0 h   2.6 h   -            -           -           2.9 h        4.2 h
DMOZ-2010    0 h   36 h    -            -           -           59 h         146 h
DMOZ-2012    0 h   201 h   -            -           -           220 h        361 h

• LSHTC-small, DMOZ-2010 and DMOZ-2012 2 Web-pages from the LSHTC

(Large-Scale Hierarchical Text Classification) challenges during 2010-12.

We used the standard train-test splits wherever available and possible. For the DMOZ-

2010 and 2012 datasets however, we created our own train-test splits because the given

test sets do not contain class labels and the oracle for evaluating submissions does not

accept interior nodes as predictions.

2http://lshtc.iit.demokritos.gr/node/3


8.6.2 Algorithms

We run a variety of algorithms on the above datasets. The details of the algorithms are

given below.

Root: This is a simple baseline method where the returned classifier always predicts

the root of the hierarchy.

OVA: This is the standard one-vs-all SVM algorithm which completely ignores the

hierarchy information and treats the problem as one of standard multiclass classification.

HSVM-margin and HSVM-slack: These are Struct-SVM-like [101] algorithms for the tree-distance loss, as proposed in Cai and Hofmann [16]. HSVM-margin uses margin rescaling, while HSVM-slack uses slack rescaling; both are considered among the state-of-the-art algorithms for hierarchical classification.

OVA-Cascade: This is the algorithm in which we minimize the surrogate ψC with the

component surrogates being ψj = ψOVA,nj and is detailed as Algorithms 2 and 3. All

the datasets in Table 8.1 have the property that all instances are associated only with a

leaf-label (note however that we can still predict interior nodes), and hence the step of

computing f ′i in Algorithm 2 can be skipped, and f ′i can be set to be identically equal to

negative infinity for all i ∈ [n]. Note that, in this case, the training phase is very similar

to the ‘less-inclusive policy’ using the ‘local node approach’ [92]. We use LIBLINEAR

[35] for the SVM-train subroutine and use the simple linear kernel. The regularization

parameter C is chosen via a separate validation set. The thresholds τj for j ∈ [s] are also

chosen via a coarse grid search using the validation set.

Plug-in classifier: This algorithm is based on estimating the conditional probabilities

using a logistic loss. Specifically, it estimates Sy(p) for all non-root nodes y. This is

done by creating a binary dataset for each y, in which instances whose labels are descendants of y are positive and the rest are negative, and running a logistic regression algorithm on this dataset. The final predictor is based directly on Theorem 8.1: it chooses the deepest node y such that the estimated value of S_y(p) is greater than 1/2.


CS-Cascade: This algorithm also minimizes the cascade surrogate ψC, but with the

component surrogates ψj being the Crammer-Singer surrogate. Using the results from

Chapter 7, one can derive excess risk transforms for the resulting cascade surrogate as

well. As all instances have labels which are leaf nodes, the s subproblems all turn out to be

multiclass learning problems with nj classes for each of which we use the Crammer-Singer

algorithm. We optimize the Crammer-Singer surrogate over the standard multiclass lin-

ear function class using the LIBLINEAR [35] software. Once again, we use the same

regularization parameter C for all the s problems, which we choose using the validation

set. We also use a threshold vector tuned on the validation set over a coarse grid.

The three algorithms OVA-cascade, Plug-in and CS-cascade are all motivated by our analysis, and would form consistent algorithms for the tree-distance loss if used with an appropriate function class.

8.6.3 Discussion of Results

Table 8.2 gives the average tree-distance loss incurred by various algorithms on some

standard datasets, and Table 8.3 gives the times taken for running these algorithms on a

4-core CPU.3 Some of the algorithms like HSVM, and CS-cascade could not be run on the

larger datasets due to memory issues. On the smaller datasets of CLEF and LSHTC-small

where all the algorithms could be run, the algorithms motivated by our analysis – OVA-

cascade, Plug-in and CS-cascade – perform the best. On the bigger datasets, only the

OVA-cascade, plug-in and the flat OVA algorithms could be run, and both OVA-cascade

and Plug-in perform significantly better than the flat OVA. While both OVA-cascade

and Plug-in give comparable error performance, the OVA-cascade takes only about half as much time as the Plug-in and is hence preferable.

3 HSVM and CS-cascade effectively use only a single core due to lack of parallelization.

Part III

Complex Multiclass

Evaluation Metrics

Chapter 9

Consistent Algorithms for Complex

Multiclass Penalties

In many practical applications of machine learning, the evaluation metric used to measure the performance of a classifier takes a complex form, and is not simply the expectation or sum of a loss on individual examples. Indeed, this is the case with the G-mean, H-mean and Q-mean losses used in class imbalance settings [56, 58, 63, 98, 108], the micro and macro F1 measures used in information retrieval (IR) applications [65], the min-max measure used in detection theory [104], and many others. The loss-matrix based evaluation metrics, which we have studied in the previous chapters, are simply linear functions of the confusion matrix of a classifier, while these complex evaluation metrics are defined by more general functions of the confusion matrix.

There has been much interest in designing algorithms for such complex evaluation metrics.

A prominent example is the SVMperf algorithm [53], which was developed primarily for

the binary setting; other examples include algorithms for the binary F1-measure and

its multiclass and multilabel variants [30, 31, 70, 74–76, 113]. More recently, there has

been increasing interest in designing consistent algorithms for complex evaluation metrics;

however, most of this work has focused on the binary case [61, 69, 71, 113].

Chapter 9. Complex Multiclass Penalties 161

In the case of loss-matrix based evaluation metrics, one can always design consistent algorithms by minimizing the n-dimensional convex calibrated conditional probability estimation surrogate of Theorem 5.1 or Lemma 4.1. But for these complex evaluation metrics, these methods do not apply.

In this chapter we address the problem of designing consistent algorithms for such complex evaluation metrics. Our approach involves viewing the learning problem as an optimization problem over the set of feasible confusion matrices, and solving this optimization problem (approximately, based on the training sample) using an appropriate optimization method. In particular, we design an algorithm that we call the Bayes-Frank-Wolfe algorithm, and show that the resulting algorithm is consistent for convex evaluation metrics, a family which includes many complex evaluation metrics used in practice, as seen in Table 9.1.

9.1 Chapter Organization

We begin by formally defining the notion of a complex multiclass penalty, and give several

examples in Section 9.2. We then study the notion of consistency for such complex

penalties, and make some observations linking it to optimization in Section 9.3. Building

on the connection to optimization, we give the Bayes-Frank-Wolfe algorithm in Section

9.4, which is based on the Frank-Wolfe algorithm in optimization, and show that it forms

a consistent learning algorithm for a large family of such complex penalties.

9.2 Complex Multiclass Penalties

The problem is set in the standard supervised learning setting, as detailed in Section 2.2, with one major difference. Instead of considering just deterministic classifiers h : X → Y, we consider randomized classifiers, which for an input x ∈ X predict a random label in Y. We represent such a classifier by a function h : X → Δ_Y, with the predicted label h̃(x) for input x drawn as h̃(x) ∼ h(x). This is a strictly more general setting, and we show that this generality is indeed required for complex losses. For the sake of simplicity, we set Y = Ŷ = [n] in this chapter; the results can be trivially extended to any finite label space Y and prediction space Ŷ.


The evaluation metrics in this chapter depend on the confusion matrix of a classifier. The confusion matrix of a randomized classifier h : X → Δ_n w.r.t. a distribution D is denoted by C^D[h] ∈ [0, 1]^{n×n}, and has entries given by

$$C^D_{i,j}[h] = \mathbf{P}\big(Y = i,\ \tilde{h}(X) = j\big),$$

where the probability is over the randomness in X, Y and in the classifier's prediction h̃(X) ∼ h(X).

Clearly,

$$\sum_{j=1}^{n} C^D_{i,j}[h] = \mathbf{P}(Y = i), \qquad \sum_{i=1}^{n} C^D_{i,j}[h] = \mathbf{E}\big[h_j(X)\big] = \mathbf{P}(\tilde{h}(X) = j).$$

For a given distribution D, we denote the set of all matrices satisfying the first constraint as C_D, i.e.

$$\mathcal{C}_D = \Big\{ C \in [0,1]^{n \times n} : \sum_{j=1}^{n} C_{i,j} = \mathbf{P}_{(X,Y)\sim D}(Y = i),\ \forall i \in [n] \Big\}.$$

We will be interested in general, complex evaluation metrics that can be expressed as an arbitrary function of the entries of the confusion matrix C^D[h]. For any continuous penalty function ϕ : [0,1]^{n×n} → R₊, define the ϕ-risk of h w.r.t. D as follows:

$$\mathcal{L}^\varphi_D[h] = \varphi(C^D[h]).$$

The ϕ-risk defined above is exactly analogous to the ℓ-risk defined for a loss matrix ℓ in Chapter 2.
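As a concrete illustration (our own sketch, not from the thesis), the confusion matrix of a randomized classifier can be estimated from a labeled sample by averaging the classifier's output distributions, with no need to sample the classifier's internal randomness:

```python
# Illustrative sketch: empirical confusion matrix of a randomized
# classifier h : X -> Delta_n, estimated from a labeled sample.
# Here h(x) returns a length-n list of probabilities over predicted labels.

def empirical_confusion(sample, h, n):
    """sample: list of (x, y) pairs with y in {0, ..., n-1}.
    Returns C with C[i][j] ~= P(Y = i, prediction = j), averaging over
    the classifier's internal randomness exactly."""
    m = len(sample)
    C = [[0.0] * n for _ in range(n)]
    for x, y in sample:
        q = h(x)                      # distribution over predicted labels
        for j in range(n):
            C[y][j] += q[j] / m
    return C

# Toy example with n = 2: a classifier that predicts label 1 w.p. 0.8.
h = lambda x: [0.2, 0.8]
C = empirical_confusion([(0, 0), (0, 1), (0, 1), (0, 1)], h, 2)
# Row sums recover the class proportions P(Y = i): here [0.25, 0.75].
```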

As the following examples show, this formulation captures both common loss-matrix based evaluation metrics studied in previous chapters, which are effectively linear functions of the entries of the confusion matrix, and more complex evaluation metrics such as the G-mean, micro F1-measure, and several others.¹

Example 9.1 (Loss-matrix based evaluation metrics). Consider a multiclass loss ℓ : Y × Ŷ → R₊ with loss matrix L ∈ R₊^{n×n}, as in previous chapters. Let the penalty function ϕ : [0,1]^{n×n} → R₊ be such that

$$\varphi(C) = \langle L, C \rangle.$$

Then the ϕ-risk is such that for any classifier h : X → Δ_n,

$$\begin{aligned}
\mathcal{L}^\varphi_D[h] = \langle L, C^D[h] \rangle
&= \sum_{i=1}^{n} \sum_{j=1}^{n} L_{i,j}\, \mathbf{E}\big[\mathbf{1}(Y = i)\,\mathbf{1}(\tilde{h}(X) = j)\big] \\
&= \mathbf{E}\big[L_{Y, \tilde{h}(X)}\big] = \mathbf{E}_{\tilde h}\big[\mathrm{er}^\ell_D[\tilde h]\big],
\end{aligned}$$

where once again h̃(x) ∼ h(x) for all x ∈ X.

Hence we have that the loss-matrix based evaluation metrics correspond to using penalties that are linear functions of the confusion matrix.

Example 9.2 (Binary evaluation metrics). In the binary setting, where n = 2 and the labels are often indexed as Y = {−1, 1}, the confusion matrix of a classifier contains the proportions of true negatives (C_{−1,−1} = TN), false positives (C_{−1,1} = FP), false negatives (C_{1,−1} = FN), and true positives (C_{1,1} = TP). Our framework thus includes any evaluation metric expressed in terms of these 4 quantities in binary classification. Examples include the F_β-measure (β > 0), given by the penalty function

$$\varphi_{F_\beta}(C) = 1 - \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}},$$

all 'ratio-of-linear' binary evaluation metrics [61], and more generally, all 'non-decomposable' binary evaluation metrics [71].

¹Many of these evaluation metrics are given as gains in their original form, and hence we subtract them from a constant to consider them as penalties.


Example 9.3 (A-mean evaluation metric). The arithmetic mean evaluation metric is a simple evaluation metric that balances the errors from all classes [69], and is given by the penalty function

$$\varphi_{AM}(C) = \sum_{i=1}^{n} \frac{\big(\sum_{j=1}^{n} C_{i,j}\big) - C_{i,i}}{\sum_{j=1}^{n} C_{i,j}}.$$

Example 9.4 (G-mean evaluation metric). The geometric mean evaluation metric is used to evaluate both binary and multiclass classifiers in settings with class imbalance [98, 108], and is given by the penalty function

    ϕ_GM(C) = 1 − ( Π_{i=1}^n C_{i,i} / (Σ_{j=1}^n C_{i,j}) )^{1/n} .

Example 9.5 (H-mean evaluation metric). The harmonic mean evaluation metric is designed for situations with class imbalance [56], and is given by the penalty function

    ϕ_HM(C) = 1 − n ( Σ_{i=1}^n (Σ_{j=1}^n C_{i,j}) / C_{i,i} )^{−1} .

Example 9.6 (Q-mean evaluation metric). The Q-mean evaluation metric [63] is another evaluation metric designed for problems with class imbalance and is given by the penalty function

    ϕ_QM(C) = √( (1/n) Σ_{i=1}^n ( 1 − C_{i,i} / (Σ_{j=1}^n C_{i,j}) )² ) .

Other examples of complex evaluation metrics include the macro F1-measure [65], the

spectral norm measure [59, 67, 78], and the min-max measure in detection theory [104];

see Table 9.1.
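The four penalty functions of Examples 9.3-9.6 translate directly into code; the sketch below (numpy, with a hypothetical confusion matrix) is one way to compute them.

```python
import numpy as np

def am(C):  # A-mean penalty: sum over classes of per-class error rates
    row = C.sum(axis=1)
    return ((row - np.diag(C)) / row).sum()

def gm(C):  # G-mean penalty: 1 - geometric mean of the per-class recalls
    rec = np.diag(C) / C.sum(axis=1)
    return 1.0 - rec.prod() ** (1.0 / len(C))

def hm(C):  # H-mean penalty: 1 - harmonic mean of the per-class recalls
    rec = np.diag(C) / C.sum(axis=1)
    return 1.0 - len(C) / (1.0 / rec).sum()

def qm(C):  # Q-mean penalty: root-mean-square of the per-class error rates
    rec = np.diag(C) / C.sum(axis=1)
    return np.sqrt(((1.0 - rec) ** 2).mean())

# A hypothetical binary confusion matrix with some off-diagonal mass.
C = np.array([[0.30, 0.10],
              [0.05, 0.55]])
for phi in (am, gm, hm, qm):
    assert phi(C) > 0.0                              # imperfect => positive penalty
assert np.isclose(gm(np.diag([0.4, 0.6])), 0.0)      # perfect classifier => zero penalty
```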

The ultimate goal in the learning problem with a complex penalty ϕ : [0, 1]n×n→R+, is

to find a classifier that achieves as small a ϕ-risk as possible. In this context one can

define the minimum ϕ-risk, ϕ-regret and consistency w.r.t. ϕ, in a manner similar to

such notions defined for a loss matrix ` in Chapter 2.

The minimum ϕ-risk L^{ϕ,*}_D is defined as

    L^{ϕ,*}_D = inf_{h : X→∆n} L^ϕ_D[h] ,


Table 9.1: Examples of complex multiclass evaluation metrics.

    Evaluation metric   ϕ(C)                                                                    Convex over C_D?
    A-mean              Σ_{i=1}^n ((Σ_{j=1}^n C_{i,j}) − C_{i,i}) / (Σ_{j=1}^n C_{i,j})         Yes
    G-mean              1 − (Π_{i=1}^n C_{i,i} / (Σ_{j=1}^n C_{i,j}))^{1/n}                     Yes
    H-mean              1 − n (Σ_{i=1}^n (Σ_{j=1}^n C_{i,j}) / C_{i,i})^{−1}                    Yes
    Q-mean              √( (1/n) Σ_{i=1}^n (1 − C_{i,i}/(Σ_{j=1}^n C_{i,j}))² )                 Yes
    Micro F1            1 − (2 Σ_{i=2}^n C_{i,i}) / (2 − Σ_{i=1}^n C_{1,i} − Σ_{i=1}^n C_{i,1}) No
    Macro F1            1 − (1/n) Σ_{i=1}^n (2 C_{i,i}) / (Σ_{j=1}^n C_{i,j} + Σ_{j=1}^n C_{j,i})  No
    Spectral norm       1 − ‖C̃‖_*  (where C̃ is obtained from C by normalizing                  Yes
                        rows to sum to 1 and setting diagonal entries to 0)
    Min-max             1 − min_{i∈[n]} C_{i,i} / (Σ_{j=1}^n C_{i,j})                           Yes

and the ϕ-regret of a classifier h : X→∆n is given as

    reg^ϕ_D[h] = L^ϕ_D[h] − L^{ϕ,*}_D .

Definition 9.1 (ϕ-consistency). An algorithm that takes a training sample S ∈ (X × Y)^M drawn i.i.d. from D and returns a classifier h_M (which is a random variable depending on S) is said to be consistent w.r.t. ϕ, or simply ϕ-consistent, if as M approaches ∞,

    L^ϕ_D[h_M] →P L^{ϕ,*}_D .

In developing our algorithms, we will find it useful to also define the empirical confusion matrix of a classifier h : X→∆n w.r.t. a sample S, denoted C^S[h] ∈ [0, 1]^{n×n}, as

    C^S_{i,j}[h] = (1/M) E_h̃[ Σ_{k=1}^M 1(y_k = i, h̃(x_k) = j) ],


where h̃(x) ∼ h(x) for all x ∈ X.
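The empirical confusion matrix above can be computed directly from a sample; the following sketch (numpy) represents a randomized classifier h by a matrix H whose k-th row is the distribution h(x_k), an encoding chosen here for illustration.

```python
import numpy as np

def empirical_confusion(y, H, n):
    """C^S[h] for a randomized classifier h, with H[k] = h(x_k) in Delta_n:

        C^S_{i,j}[h] = (1/M) E[ sum_k 1(y_k = i, htilde(x_k) = j) ]
                     = (1/M) sum_k 1(y_k = i) * H[k, j].
    """
    M = len(y)
    C = np.zeros((n, n))
    for i in range(n):
        C[i] = H[y == i].sum(axis=0) / M
    return C

# Tiny hypothetical sample, evaluated with the uniformly random classifier.
y = np.array([0, 0, 1, 2])
H = np.full((4, 3), 1.0 / 3)
C = empirical_confusion(y, H, 3)
assert np.allclose(C.sum(axis=1), [0.5, 0.25, 0.25])  # rows sum to class frequencies
assert np.isclose(C.sum(), 1.0)
```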

We will also find it convenient to define the following objects. The notation argmin*_{i∈[n]} will denote ties being broken in favor of the larger number. The function rand maps elements in [n] to ∆n, such that rand(y) = e^n_y for all y ∈ [n]. Also, for any µ : X→∆n, define the following set of multiclass classifiers:

    H_µ = { h : X→[n] : h(x) = argmin*_{j∈[n]} ⟨µ(x), ℓ_j⟩ for some L ∈ [0, 1]^{n×n} },

where ℓ_1, . . . , ℓ_n are the columns of L.

9.3 Consistency via Optimization

In this section we take an optimization viewpoint for deriving consistent algorithms. For convenience, we shall assume in this section that we have access to the entire distribution D, and not just a finite sample S drawn from it.

In order to understand optimal classifiers for an arbitrary penalty ϕ, we define the set of

feasible confusion matrices, which will play a crucial role in this chapter.

Definition 9.2 (Feasible confusion matrices). Define the set of feasible confusion matrices w.r.t. D as the set of all confusion matrices achieved by some randomized classifier:

    C_D = { C^D[h] : h : X→∆n } .

Clearly, C_D ⊆ [0, 1]^{n×n}; moreover, since each row of a confusion matrix sums to the corresponding class prior, C_D is at most an (n² − n)-dimensional set. Also, it is easy to show that C_D is a convex set.

Proposition 9.1. C_D is a convex set.

It can be easily seen that the minimum ϕ-risk is the minimum value of ϕ over all feasible confusion matrices, i.e.

    L^{ϕ,*}_D = inf_{C ∈ C_D} ϕ(C) .        (9.1)


Figure 9.1: A schematic illustration of feasible TP and TN values in binary classification for some distribution D, with the Pareto optimal frontier highlighted in red.

Equation (9.1) is the basic foundation on which the rest of the chapter is built. It converts the problem of finding the best classifier, which is an infinite-dimensional optimization problem, into a finite-dimensional optimization problem, and allows the usage of standard optimization tools to derive consistent algorithms.

In the case of binary classification, one can actually implement the above optimization efficiently, for the following reason. The set C_D is just a 2-dimensional set and can simply be expressed as the set of feasible true positive (TP) and true negative (TN) values. An illustration is given in Figure 9.1. One can then argue that any reasonable penalty ϕ must be monotone decreasing in both TP and TN, and hence the optimal confusion matrix must lie on the ‘Pareto optimal frontier’ (POF), which is just a one-dimensional manifold. One can easily show that each point on the POF is simply the confusion matrix of the classifier obtained by thresholding the conditional probability p(x) = P(Y = 1|X = x) at an appropriate level. Therefore an approximate brute-force optimization approach for solving Equation (9.1) can be implemented just by trying all possible thresholds in [0, 1] with discretization.
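The approximate brute-force procedure just described (discretize [0, 1] and threshold the estimated conditional probability) can be sketched as follows; the grid size, the synthetic data, and the particular penalty are illustrative choices made here.

```python
import numpy as np

def best_threshold(p1, y, phi, grid_size=200):
    """Approximate brute-force minimization of a binary penalty phi(C) over the
    POF, by thresholding an estimate p1(x) of P(Y = 1 | x) on a finite grid.
    Labels y take values in {-1, +1}; returns (threshold, penalty)."""
    best_tau, best_val = None, np.inf
    for tau in np.linspace(0.0, 1.0, grid_size):
        yhat = np.where(p1 >= tau, 1, -1)
        C = np.array([[np.mean((y == -1) & (yhat == -1)),   # TN
                       np.mean((y == -1) & (yhat == +1))],  # FP
                      [np.mean((y == +1) & (yhat == -1)),   # FN
                       np.mean((y == +1) & (yhat == +1))]]) # TP
        if phi(C) < best_val:
            best_tau, best_val = tau, phi(C)
    return best_tau, best_val

# Hypothetical well-separated data: some threshold separates perfectly, so the
# 0-1 penalty (FP + FN) found by the sweep is exactly zero.
rng = np.random.default_rng(1)
p1 = np.concatenate([rng.uniform(0.00, 0.45, 500),
                     rng.uniform(0.55, 1.00, 500)])
y = np.concatenate([-np.ones(500, dtype=int), np.ones(500, dtype=int)])
phi01 = lambda C: C[0, 1] + C[1, 0]
tau, val = best_threshold(p1, y, phi01)
assert val == 0.0
```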

In the case of multiclass classification, the above approach is not feasible as the dimen-

sionality of the POF grows at least linearly with the number of classes n, and hence

the number of classifiers to try grows exponentially with n. Hence, we need a more

sophisticated approach than brute force optimization.


From Proposition 9.1, we immediately have that if ϕ is a convex function over C_D, the resulting optimization problem is convex, and hence one can expect efficient algorithms with a guarantee. Indeed, many complex penalties used in practice are convex over C_D, as indicated in Table 9.1.

Even assuming the convexity of C_D and ϕ, we have a key issue precluding the usage of standard optimization algorithms like gradient descent: the intractability of the set C_D. Given a matrix C ∈ [0, 1]^{n×n}, it is not possible to immediately say whether C ∈ C_D. However, thanks to the observation that linear penalties are equivalent to loss-matrix based evaluation metrics, and the observation that for any loss matrix L we have

    ⟨L, C^D[h]⟩ = E_X ⟨p(X), L h(X)⟩ ,

we immediately have the following.

Proposition 9.2. Let L ∈ R^{n×n}_+ be a loss matrix, with columns ℓ_1, . . . , ℓ_n. Then any (deterministic) classifier h* : X→[n] satisfying

    h*(x) ∈ argmin_{j∈[n]} ⟨p(x), ℓ_j⟩

is such that

    ⟨L, C^D[h*]⟩ = inf_{h : X→∆n} ⟨L, C^D[h]⟩ = inf_{C ∈ C_D} ⟨L, C⟩ .

In other words, while checking whether a given confusion matrix C lies in C_D is difficult, implementing a linear minimization oracle over C_D is simple. This observation, along with the observation that the optimization problem given by Equation (9.1) is convex for a convex penalty ϕ, immediately suggests the Frank-Wolfe, or conditional gradient, algorithm [36, 49] as an apt choice in this situation.
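Proposition 9.2 says the linear minimization oracle over C_D is just a plug-in classifier. A small numerical sanity check of this (numpy; the finite instance space and the random loss matrix are assumptions made here for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 4, 50

# Hypothetical finite instance space: row k of P is the conditional p(x_k).
P = rng.dirichlet(np.ones(n), size=M)
L = rng.uniform(0.0, 1.0, (n, n))
np.fill_diagonal(L, 0.0)

def risk(h):
    """<L, C^D[h]> = E_X <p(X), l_{h(X)}> for a deterministic h given as labels."""
    return np.mean([P[k] @ L[:, h[k]] for k in range(M)])

# Linear minimization oracle: the plug-in classifier h*(x) in argmin_j <p(x), l_j>.
h_star = np.argmin(P @ L, axis=1)

# No deterministic classifier achieves a smaller linear risk.
for _ in range(100):
    h = rng.integers(0, n, size=M)
    assert risk(h_star) <= risk(h) + 1e-12
```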

9.4 The BFW Algorithm for Convex Penalties

In this section, we give the details of the Bayes-Frank-Wolfe (BFW) algorithm (see Algorithm 4), which uses finite samples and the Frank-Wolfe algorithm to solve Equation (9.1). Further, we give a proof showing the asymptotic consistency of such an algorithm for convex penalties.

An ideal version of the BFW algorithm for exactly minimizing ϕ(C) over C ∈ C_D, having access to the entire distribution D, would maintain iterates h_t : X→∆n and C_t ∈ C_D such that C_t = C^D[h_t], compute L_t = ∇ϕ(C_{t−1}), solve the resulting linear minimization problem min_{C∈C_D} ⟨L_t, C⟩, and update C_t and h_t. By standard Frank-Wolfe convergence arguments [49] we have that

    ϕ(C^D[h_t]) = ϕ(C_t) approaches inf_{C∈C_D} ϕ(C) = inf_{h:X→∆n} ϕ(C^D[h]) .

However, as we only have access to a finite sample S, the above ideal quantities (h_t, C_t, L_t) are replaced by sample-dependent quantities (h_t, Ĉ_t, L̂_t). The proof technique for showing the consistency of such an algorithm proceeds as though we implicitly maintain the ideal quantities, and all errors incurred in maintaining the sample-dependent quantities are absorbed into an additive approximation factor for solving the linear minimization problems min_{C∈C_D} ⟨L_t, C⟩. The proof of convergence for the Frank-Wolfe algorithm also holds under such approximations [49].

The final (randomized) classifier output by the algorithm is a convex combination of the classifiers learned across all the iterations.

We now show the consistency of the BFW algorithm for convex and smooth penalties.

Theorem 9.3 (ϕ-regret of BFW). Let ϕ : [0, 1]^{n×n}→R+ be convex over C_D, and L-Lipschitz and β-smooth w.r.t. the ℓ₁ norm. Let S ∈ (X × [n])^M be drawn i.i.d. from D. Let p̂ : X→∆n be the CPE model learned in Algorithm 4 and h^BFW_S : X→∆n the classifier returned after κM iterations. Let δ ∈ (0, 1]. Then with probability ≥ 1 − δ (over S ∼ D^M),

    L^ϕ_D[h^BFW_S] − L^{ϕ,*}_D ≤ 4L E_X[ ‖p̂(X) − p(X)‖₁ ] + 4√2 β n² C √( (n² log(n) log(M) + log(n²/δ)) / M ) + 8β/(κM + 2),

where C > 0 is a distribution-independent constant.

The proof of the above theorem proceeds via a sequence of lemmas.


Algorithm 4 Bayes-Frank-Wolfe Algorithm

 1: Input: ϕ : [0, 1]^{n×n}→R+, S = ((x₁, y₁), . . . , (x_M, y_M)) ∈ (X × [n])^M
 2: Parameter: κ ∈ N
 3: Split S into S₁ and S₂ with sizes ⌈M/2⌉ and ⌊M/2⌋
 4: Let p̂ : X→∆n be given as p̂ = CPE(S₁) for some class probability estimator CPE
 5: Initialize: h₀ : X→∆n, Ĉ₀ = C^{S₂}[h₀]
 6: For t = 1 to T = κM do
 7:   Let L̂_t = ∇ϕ(Ĉ_{t−1}), with columns ℓ̂_{t,1}, . . . , ℓ̂_{t,n}
 8:   Obtain g_t : x ↦ rand( argmin*_{j∈[n]} ⟨p̂(x), ℓ̂_{t,j}⟩ )
 9:   Let Γ̂_t = C^{S₂}[g_t]
10:   Let h_t = (1 − 2/(t+1)) · h_{t−1} + (2/(t+1)) · g_t
11:   Let Ĉ_t = (1 − 2/(t+1)) · Ĉ_{t−1} + (2/(t+1)) · Γ̂_t
12: end For
13: Output: h^BFW_S = h_T : X→∆n
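A minimal sketch of Algorithm 4 in code, assuming the CPE step (lines 3-4) has already produced probability estimates P_hat; the representation of randomized classifiers as row-stochastic matrices, and the tie-breaking of np.argmin (smallest index, unlike argmin* in the text), are simplifications made here.

```python
import numpy as np

def confusion(y, H, n):
    # Empirical confusion matrix of a randomized classifier given row-wise as H.
    M = len(y)
    C = np.zeros((n, n))
    for i in range(n):
        C[i] = H[y == i].sum(axis=0) / M
    return C

def bfw(phi, grad_phi, P_hat, y, T):
    """Sketch of the Bayes-Frank-Wolfe iterations (lines 5-13 of Algorithm 4)."""
    M, n = P_hat.shape
    H = np.zeros((M, n)); H[:, 0] = 1.0         # line 5: h_0 always predicts class 0
    C = confusion(y, H, n)
    for t in range(1, T + 1):
        Lt = grad_phi(C)                         # line 7: gradient as a loss matrix
        g = np.argmin(P_hat @ Lt, axis=1)        # line 8: plug-in linear minimizer
        G = np.eye(n)[g]
        Gamma = confusion(y, G, n)               # line 9
        H = (1 - 2 / (t + 1)) * H + (2 / (t + 1)) * G        # line 10
        C = (1 - 2 / (t + 1)) * C + (2 / (t + 1)) * Gamma    # line 11
    return H, C

# Sanity check with a linear penalty phi(C) = <L01, C> (constant gradient) and
# exact one-hot probability estimates: the iterates recover the Bayes classifier,
# driving the 0-1 penalty to its minimum (zero here).
y = np.array([0] * 5 + [1] * 3 + [2] * 2)
P_hat = np.eye(3)[y]
L01 = 1.0 - np.eye(3)
H, C = bfw(lambda C: (L01 * C).sum(), lambda C: L01, P_hat, y, T=50)
assert (L01 * C).sum() < 1e-9
assert np.allclose(H.sum(axis=1), 1.0)           # rows of H stay in Delta_n
```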

Lemma 9.4 (Frank-Wolfe lemma). Let ϕ : [0, 1]^{n×n}→R+ be convex over C_D, and β-smooth w.r.t. the ℓ₁ norm. Let h₀, h₁, . . . , h_T and g₁, g₂, . . . , g_T be classifiers from X→∆n such that for all t ∈ {1, . . . , T},

    C^D[h_t] = (1 − 2/(t+1)) C^D[h_{t−1}] + (2/(t+1)) C^D[g_t]        (9.2)

    ⟨∇ϕ(C^D[h_{t−1}]), C^D[g_t]⟩ ≤ inf_{g:X→∆n} ⟨∇ϕ(C^D[h_{t−1}]), C^D[g]⟩ + ε        (9.3)

Then,

    L^ϕ_D[h_T] − L^{ϕ,*}_D ≤ 2ε + 8β/(T + 2) .

Proof. Let C_ϕ be the curvature constant as defined in Jaggi [49]:

    C_ϕ = sup_{C₁,C₂∈C_D, γ∈[0,1]} (2/γ²) ( ϕ(C₁ + γ(C₂ − C₁)) − ϕ(C₁) − γ⟨C₂ − C₁, ∇ϕ(C₁)⟩ )
        ≤ sup_{C₁,C₂∈C_D, γ∈[0,1]} (2/γ²) ( (β/2) γ² ‖C₁ − C₂‖₁² )
        ≤ 4β .


The second inequality above follows from the β-smoothness of ϕ, and the last from the fact that all confusion matrices lie in an ℓ₁-ball of diameter 2. Define a constant δ_apx such that

    δ_apx = (T + 1) ε / C_ϕ .

We then have that for all t ∈ [T],

    ⟨∇ϕ(C^D[h_{t−1}]), C^D[g_t]⟩ ≤ inf_{g:X→∆n} ⟨∇ϕ(C^D[h_{t−1}]), C^D[g]⟩ + ε
        = inf_{C∈C_D} ⟨∇ϕ(C^D[h_{t−1}]), C⟩ + (1/2) δ_apx · (2/(T+1)) C_ϕ
        ≤ inf_{C∈C_D} ⟨∇ϕ(C^D[h_{t−1}]), C⟩ + (1/2) δ_apx · (2/(t+1)) C_ϕ        (9.4)

From Theorem 1 of Jaggi [49] and Equation (9.4), we have that

    ϕ(C^D[h_T]) ≤ inf_{C∈C_D} ϕ(C) + (2C_ϕ/(T + 2)) (1 + δ_apx)
                ≤ inf_{C∈C_D} ϕ(C) + 8β/(T + 2) + 2ε .

Now we show that the conditions of Lemma 9.4 hold for the ht, gt defined in Algorithm

4, with an appropriate ε.

Lemma 9.5. Let ϕ : [0, 1]^{n×n}→R+ be any differentiable function. Let h₀, h₁, . . . , h_T and g₁, g₂, . . . , g_T be functions from X to ∆n and let L̂₁, . . . , L̂_T be matrices in R^{n×n} as defined in Algorithm 4. Then for all t ∈ [T],

    ⟨∇ϕ(C^D[h_{t−1}]), C^D[g_t]⟩ − inf_{g:X→∆n} ⟨∇ϕ(C^D[h_{t−1}]), C^D[g]⟩
        ≤ ( ⟨L̂_t, C^D[g_t]⟩ − inf_{g:X→∆n} ⟨L̂_t, C^D[g]⟩ ) + 2 ‖∇ϕ(C^D[h_{t−1}]) − L̂_t‖_∞        (9.5)

Proof. Fix some t ∈ [T]. Let g* : X→∆n be such that

    ⟨∇ϕ(C^D[h_{t−1}]), C^D[g*]⟩ = inf_{g:X→∆n} ⟨∇ϕ(C^D[h_{t−1}]), C^D[g]⟩ .


We then have that

    ⟨∇ϕ(C^D[h_{t−1}]), C^D[g_t]⟩ − inf_{g:X→∆n} ⟨∇ϕ(C^D[h_{t−1}]), C^D[g]⟩
      = ⟨∇ϕ(C^D[h_{t−1}]), C^D[g_t]⟩ − ⟨∇ϕ(C^D[h_{t−1}]), C^D[g*]⟩
      = ⟨∇ϕ(C^D[h_{t−1}]) − L̂_t, C^D[g_t]⟩ − ⟨∇ϕ(C^D[h_{t−1}]) − L̂_t, C^D[g*]⟩ + ( ⟨L̂_t, C^D[g_t]⟩ − ⟨L̂_t, C^D[g*]⟩ )
      = ⟨∇ϕ(C^D[h_{t−1}]) − L̂_t, C^D[g_t] − C^D[g*]⟩ + ( ⟨L̂_t, C^D[g_t]⟩ − ⟨L̂_t, C^D[g*]⟩ )
      ≤ ‖∇ϕ(C^D[h_{t−1}]) − L̂_t‖_∞ · ‖C^D[g_t] − C^D[g*]‖₁ + ( ⟨L̂_t, C^D[g_t]⟩ − inf_{g:X→∆n} ⟨L̂_t, C^D[g]⟩ )
      ≤ ( ⟨L̂_t, C^D[g_t]⟩ − inf_{g:X→∆n} ⟨L̂_t, C^D[g]⟩ ) + 2 ‖∇ϕ(C^D[h_{t−1}]) − L̂_t‖_∞ .

The next-to-last inequality above follows from Hölder's inequality, and the last inequality follows from the observation that all confusion matrices lie in an ℓ₁-ball of diameter 2.

It now only remains to bound both terms on the RHS of Equation (9.5).

The first term on the RHS of Equation (9.5) can be bounded using the lemma below and the observation that the matrices L̂_t in Algorithm 4 all satisfy ‖L̂_t‖_∞ ≤ L, the Lipschitz constant of ϕ w.r.t. the ℓ₁ norm.

Lemma 9.6. For a fixed loss matrix L ∈ R^{n×n} with columns ℓ₁, . . . , ℓ_n and class probability estimation model p̂ : X→∆n, let g : X→∆n be a classifier such that

    g(x) = rand( argmin*_{j∈[n]} ⟨p̂(x), ℓ_j⟩ ) .

Then

    ⟨L, C^D[g]⟩ − inf_{g:X→∆n} ⟨L, C^D[g]⟩ ≤ 2 ‖L‖_∞ · E_X[ ‖p̂(X) − p(X)‖₁ ] .

Proof. Let g* : X→∆n be such that

    g*(x) = rand( argmin*_{j∈[n]} ⟨p(x), ℓ_j⟩ ) .


By Proposition 9.2, we have that

    g* ∈ argmin_{g:X→∆n} ⟨L, C^D[g]⟩ .

We have that

    ⟨L, C^D[g]⟩ − inf_{g:X→∆n} ⟨L, C^D[g]⟩
      = ⟨L, C^D[g]⟩ − ⟨L, C^D[g*]⟩
      = E_X[ ⟨p(X), L g(X)⟩ ] − E_X[ ⟨p(X), L g*(X)⟩ ]
      = E_X[ ⟨p(X) − p̂(X), L g(X)⟩ ] + E_X[ ⟨p̂(X), L g(X)⟩ ] − E_X[ ⟨p(X), L g*(X)⟩ ]
      ≤ E_X[ ⟨p(X) − p̂(X), L g(X)⟩ ] + E_X[ ⟨p̂(X), L g*(X)⟩ ] − E_X[ ⟨p(X), L g*(X)⟩ ]
      = E_X[ ⟨p(X) − p̂(X), L (g(X) − g*(X))⟩ ]
      ≤ 2 ‖L‖_∞ · E_X[ ‖p̂(X) − p(X)‖₁ ] .

The last inequality above follows from Hölder's inequality.

The second term on the RHS of Equation (9.5) can be bounded using the two lemmas

below.

Lemma 9.7. Let ϕ : [0, 1]^{n×n}→R+ be β-smooth w.r.t. the ℓ₁-norm. Let classifiers h₀, . . . , h_T, g₁, . . . , g_T and matrices L̂₁, . . . , L̂_T be as defined in Algorithm 4. Then for all t ∈ [T],

    ‖∇ϕ(C^D[h_{t−1}]) − L̂_t‖_∞ ≤ β n² sup_{h∈H_p̂} ‖C^D[h] − C^{S₂}[h]‖_∞ .


Proof. By definition we have L̂_t = ∇ϕ(C^{S₂}[h_{t−1}]). Hence,

    ‖∇ϕ(C^D[h_{t−1}]) − L̂_t‖_∞
      = ‖∇ϕ(C^D[h_{t−1}]) − ∇ϕ(C^{S₂}[h_{t−1}])‖_∞
      ≤ β ‖C^{S₂}[h_{t−1}] − C^D[h_{t−1}]‖₁
      ≤ β n² ‖C^{S₂}[h_{t−1}] − C^D[h_{t−1}]‖_∞
      ≤ β n² max_{i∈[t−1]} ‖C^{S₂}[g_i] − C^D[g_i]‖_∞
      ≤ β n² sup_{h∈H_p̂} ‖C^{S₂}[h] − C^D[h]‖_∞ .

The next-to-last inequality follows because h_{t−1} is in the convex hull of g₁, . . . , g_{t−1}. The last inequality follows because for all t ∈ [T] there exists an h ∈ H_p̂ such that C^D[h] = C^D[g_t] and C^{S₂}[h] = C^{S₂}[g_t].

Lemma 9.8. Let µ : X→∆n. Let S ∈ (X × [n])^M be a sample drawn i.i.d. from D. For any δ ∈ (0, 1], w.p. at least 1 − δ (over the draw of S from D^M),

    sup_{h∈H_µ} ‖C^D[h] − C^S[h]‖_∞ ≤ C √( (n² log(n) log(M) + log(n²/δ)) / M ),

where C > 0 is a distribution-independent constant.

Proof. For any a, b ∈ [n] we have

    sup_{h∈H_µ} | C^S_{a,b}[h] − C^D_{a,b}[h] |
      = sup_{h∈H_µ} | (1/M) Σ_{i=1}^M ( 1(y_i = a, h(x_i) = b) − E[1(Y = a, h(X) = b)] ) |
      = sup_{h∈H^b_µ} | (1/M) Σ_{i=1}^M ( 1(y_i = a, h(x_i) = 1) − E[1(Y = a, h(X) = 1)] ) |,

where

    H^b_µ = { h : X→{0, 1} : ∃ L ∈ [0, 1]^{n×n}, ∀x ∈ X, h(x) = 1( b = argmin*_{j∈[n]} ⟨µ(x), ℓ_j⟩ ) } .

The set H^b_µ can be seen as a binary hypothesis class whose concepts are the intersection of n halfspaces in R^n (corresponding to µ(x)) through the origin. Hence we have from Lemma 3.2.3 of Blumer et al. [9] that the VC-dimension of H^b_µ is at most 2n² log(3n).


From standard uniform convergence arguments we have that the following holds with probability 1 − δ:

    sup_{h∈H_µ} | C^S_{a,b}[h] − C^D_{a,b}[h] | ≤ C √( (n² log(n) log(M) + log(1/δ)) / M ),

where C > 0 is some constant. Applying a union bound over all a, b ∈ [n], we have that the following holds with probability 1 − δ:

    sup_{h∈H_µ} ‖C^S[h] − C^D[h]‖_∞ ≤ C √( (n² log(n) log(M) + log(n²/δ)) / M ) .

The proof of Theorem 9.3 now follows by applying Lemmas 9.4, 9.5, 9.6, 9.7 and 9.8.

Theorem 9.3 shows that the BFW algorithm, when used with a consistent CPE learner whose ℓ₁ probability estimation error goes to zero, is consistent for convex smooth penalties. However, many penalties used in practice, including the G-mean, H-mean, Q-mean and min-max penalties in Table 9.1, are convex but non-smooth. Theorem 9.3 can be easily extended to such non-smooth penalties as well, due to the observation that any non-smooth convex function over a compact domain can be approximated arbitrarily well (in the relative interior of its domain) by a smooth convex function. Hence, applying Algorithm 4 to such a smooth approximation, with the approximation error going to zero appropriately as the sample size increases, gives a consistent algorithm for non-smooth convex penalties as well.

Chapter 10

Conclusions and Future Directions

10.1 Summary

In the first part of the thesis, we presented the foundations of a framework to study

consistent surrogate minimizing algorithms for general multiclass learning problems with

arbitrary loss matrix based evaluation metrics. The framework constructed includes sev-

eral important and useful tools that can be used to check whether a surrogate is calibrated

w.r.t. a given loss matrix, to characterize the difficulty of constructing convex calibrated

surrogates for a given loss matrix, and most importantly, to motivate and construct novel

convex calibrated surrogates for specific learning problems.

In the second part of the thesis, we focused particularly on the problem of hierarchical

classification, with the tree distance loss matrix, and gave a template to design convex

calibrated surrogates. In particular, the reduction to the problem of multiclass classi-

fication with an abstain option allows the construction of several SVM-like consistent

algorithms for hierarchical classification, one of which also performs well empirically on

benchmark hierarchical classification datasets.

In the third part of the thesis, we considered complex evaluation metrics more general

than loss matrix based evaluation metrics. We showed that finding the classifier with

the smallest such error is equivalent to a finite dimensional optimization problem with

the linear minimization oracle being the only useful primitive available, and constructed


a learning algorithm based on the Frank-Wolfe optimization algorithm that is consistent

for a large family of such complex evaluation metrics.

10.2 Future Directions

While this thesis raises and answers several questions on consistency for multiclass learn-

ing problems that deepen our understanding, it also raises several questions which re-

main unanswered and form interesting directions of future research. We give some of

these questions below, and organize them in accordance with the three main parts of this

thesis.

10.2.1 Consistency and Calibration

• While Theorems 3.7 and 3.9 give necessary conditions and sufficient conditions for calibration, they are a far cry from the simple characterisation of calibration in binary classification given by Bartlett et al. [7]. Some extra complexity is necessary to accommodate the multiclass setting and general predictors, rather than just the sign predictor used in binary classification. But can one do better than the current conditions?

• While the convex calibration dimension is an intrinsic notion of difficulty for a loss matrix, it is not the only one, nor does it fully capture the difficulty of the entire consistent algorithm. In particular, it does not capture the complexity of the predictor, as evidenced by the pairwise disagreement and mean average precision losses. Is there a better notion that captures this as well?

• As has been observed, the upper and lower bounds on the convex calibration di-

mension are not tight in general. Can they be tightened?

• Weakening consistency with noise condition requirements is an excellent way of

trading off computational complexity with generality of assumptions, but charac-

terizing and constructing such algorithms for an arbitrary noise condition and loss

matrix remains hard to do. Can such results be obtained?


10.2.2 Application to Hierarchical Classification

• The existence of a log₂(n)-dimensional convex calibrated surrogate for the n-class abstain(1/2) loss is an interesting result, but the lower bound on the CC-dimension of the abstain(1/2) loss is actually lower than this (it is in fact just 2 for any n). Can one show a tighter lower bound on the CC-dimension of this loss?

• How do we construct convex calibrated (piecewise linear, SVM-type) surrogates for the abstain(α) loss when α ∈ [1/2, 1]?

• The reduction of the hierarchical classification problem to the multiclass classifica-

tion problem with an abstain option is valid only for tree hierarchies. Does there

exist such a result for general graph or at least DAG based hierarchies?

10.2.3 Multiclass Complex Evaluation Metrics

• The optimization viewpoint equating the problem of consistency for multiclass eval-

uation metrics with that of a finite dimensional optimization problem forms a very

useful tool, but the only known useful primitive available to us for the optimization

problem is the linear minimization oracle. Are there any other primitives that can

be used?

• The Bayes-Frank-Wolfe algorithm uses only the linear minimization oracle and is apt

for convex penalties. Harikrishna Narasimhan’s PhD thesis will describe a bisection

algorithm based method which also uses only the linear minimization oracle, and is

apt for penalties that can be expressed as a ratio of linear functions of the confusion

matrix, like the micro F-measure (see also [72]). Are there other interesting complex

penalties used in practice which can be solved with other optimization algorithms

that use only the linear minimization oracle?

• The macro F-measure in multiclass classification is an important multiclass complex

penalty, but unfortunately it is neither convex nor can it be expressed as a ratio of

linear functions. Can one get interesting guarantees for either the Frank-Wolfe or


bisection based algorithm, or give a different algorithm that is consistent for this

performance measure?

10.3 Comments

In conclusion, in this thesis we have developed a deep understanding of, and fundamental results in, the theory of supervised multiclass learning. These insights have allowed us to

develop computationally efficient and statistically consistent algorithms for a variety of

multiclass learning problems of practical interest, in many cases significantly outperform-

ing the state-of-the-art algorithms for these problems.

Appendix A

Convexity

Definition A.1 (Convex set). A set C ⊆ Rd is said to be convex if for all x1,x2 ∈ C and

λ ∈ [0, 1] we have that λx1 + (1− λ)x2 ∈ C.

Definition A.2 (Convex function). A function f : C→R is said to be convex if C is convex and for all x₁, x₂ ∈ C and λ ∈ [0, 1] we have that

    f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂) .

Definition A.3 (Strictly convex function). A function f : C→R is said to be strictly convex if C is convex and for all x₁, x₂ ∈ C with x₁ ≠ x₂ and λ ∈ (0, 1) we have that

    f(λx₁ + (1 − λ)x₂) < λf(x₁) + (1 − λ)f(x₂) .

Proposition A.1 (Minimizers of convex and strictly convex functions). If f : C→R

is a convex function, then all local minimizers are global minimizers. Also the set of

minimizers, argminx∈C f(x), forms a convex set. If f : C→R is a strictly convex function,

then the set of minimizers, argminx∈C f(x), is a singleton.

Definition A.4 (Sub-differential of a convex function). The sub-differential of a convex function f : C→R+, at a point x ∈ C, for some C ⊆ Rd, is denoted by ∂f(x) and is given as

    ∂f(x) = { w ∈ Rd : f(y) ≥ f(x) + ⟨w, y − x⟩, ∀y ∈ C } .


If f is differentiable at x, then ∂f(x) is a singleton containing only ∇f(x).

Definition A.5 (ε-sub-differential of a convex function). For any ε > 0, the ε-sub-differential of a convex function f : C→R+, at a point x ∈ C, for some C ⊆ Rd, is denoted by ∂_ε f(x) and is given as

    ∂_ε f(x) = { w ∈ Rd : f(y) ≥ f(x) + ⟨w, y − x⟩ − ε, ∀y ∈ C } .

If ε = 0, the ε-sub-differential is the same as the sub-differential.

Definition A.6 (Convex hull). The convex hull of a set A ⊆ Rd is the subset of Rd, denoted by conv(A), given by

    conv(A) = { x ∈ Rd : x = Σ_{i=1}^N λ_i x_i, for some N ∈ N, λ₁, . . . , λ_N > 0, Σ_{i=1}^N λ_i = 1, x₁, . . . , x_N ∈ A } .

Definition A.7 (Affine hull). The affine hull of a set A ⊆ Rd is the subset of Rd, denoted by aff(A), given by

    aff(A) = { x ∈ Rd : x = Σ_{i=1}^N λ_i x_i, for some N ∈ N, λ₁, . . . , λ_N ∈ R, Σ_{i=1}^N λ_i = 1, x₁, . . . , x_N ∈ A } .

Definition A.8 (Minkowski sum). For any two sets A, B ⊆ Rd, the Minkowski sum of A and B is denoted by A + B and is given by

    A + B = { x ∈ Rd : x = x₁ + x₂, for some x₁ ∈ A, x₂ ∈ B } .

Proposition A.2 (Properties of ε-sub-differentials of a convex function). Let C ⊆ Rd and let f : C→R be a convex function. Then:

• 0 ∈ ∂_ε f(x₀) ⟺ f(x₀) ≤ inf_{x∈C} f(x) + ε .

• For any λ > 0 and x₀ ∈ C,

    ∂_ε(λf)(x₀) = λ ∂_{ε/λ} f(x₀) .

• Let f = f₁ + . . . + f_n for some convex functions f_i : C→R, and let x₀ ∈ C. Then

    ∂_ε f(x₀) ⊆ ∂_ε f₁(x₀) + . . . + ∂_ε f_n(x₀) ⊆ ∂_{nε} f(x₀) ,

where the sum of sets is the Minkowski sum.

• ε₁ ≤ ε₂ ⟹ ∂_{ε₁} f(x₀) ⊆ ∂_{ε₂} f(x₀) .
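The first property can be illustrated numerically; the sketch below checks, for f(x) = x² on a grid over C = [−1, 1], that 0 ∈ ∂_ε f(x₀) exactly characterizes ε-approximate minimizers (the grid and the sample points are illustrative choices).

```python
import numpy as np

f = lambda x: x ** 2
grid = np.linspace(-1.0, 1.0, 2001)             # a discretization of C = [-1, 1]

def zero_in_eps_subdiff(x0, eps):
    # 0 in d_eps f(x0)  <=>  f(y) >= f(x0) + <0, y - x0> - eps for all y in C.
    return bool(np.all(f(grid) >= f(x0) - eps))

# eps-approximate minimizers of f are exactly the points where 0 is an
# eps-subgradient (checked here at two sample points).
assert zero_in_eps_subdiff(0.05, eps=0.01)      # f(0.05) = 0.0025 <= inf f + 0.01
assert not zero_in_eps_subdiff(0.50, eps=0.01)  # f(0.5)  = 0.25   >  inf f + 0.01
```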

Bibliography

[1] S. Agarwal. Surrogate regret bounds for the area under the ROC curve via strongly

proper losses. In Proceedings of International Conference on Learning Theory

(COLT), 2013.

[2] S. Agarwal. Surrogate regret bounds for bipartite ranking via strongly proper losses.

Journal of Machine Learning Research, 15:1653–1674, 2014.

[3] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information:

Ranking and clustering. Journal of the ACM, 55(5), 2008.

[4] V. Arya, N. Garg, R. Khandekar, A. Myerson, K. Munagala, and V. Pandit. Local

search heuristics for k-median and facility location problems. SIAM Journal of

Computing, 33:544–562, 2004.

[5] R. Babbar, I. Partalas, E. Gaussier, and M.-R. Amin. On flat versus hierarchical

classification in large-scale taxonomies. In Advances in Neural Information Pro-

cessing Systems, 2013.

[6] P. L. Bartlett and M. Wegkamp. Classification with a reject option using a hinge

loss. Journal of Machine Learning Research, 9:1823–1840, 2008.

[7] P. L. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk

bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[8] D. Bertsekas, A. Nedic, and A. Ozdaglar. Convex Analysis and Optimization.

Athena Scientific, 2003.

[9] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the

Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929–965, 1989.


[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge university press,

2004.

[11] D. Buffoni, C. Calauzenes, P. Gallinari, and N. Usunier. Learning scoring func-

tions with order-preserving losses and standardized supervision. In Proceedings of

International Conference on Machine Learning, 2011.

[12] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability esti-

mation and classification: Structure and applications. Technical report, University

of Pennsylvania, November 2005.

[13] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G.

Hullender. Learning to rank using gradient descent. In Proceedings of International

Conference on Machine Learning, 2005.

[14] C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems, 2006.

[15] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss. A database of German emotional speech. In Proceedings of the 9th European Conference on Speech Communication and Technology, 2005.

[16] L. Cai and T. Hofmann. Hierarchical document categorization with support vector

machines. In International Conference on Information and Knowledge Management

(CIKM), 2004.

[17] C. Calauzenes, N. Usunier, and P. Gallinari. On the (non-)existence of convex, cal-

ibrated surrogate losses for ranking. In Advances in Neural Information Processing

Systems, 2012.

[18] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Hierarchical classification: combining

Bayes with SVM. In Proceedings of International Conference on Machine Learning,

2006.

[19] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Incremental algorithms for hierar-

chical classification. Journal of Machine Learning Research, 7:31–54, 2006.


[20] D.-R. Chen and T. Sun. Consistency of multiclass empirical risk minimization

methods based on convex loss. Journal of Machine Learning Research, 7:2435–

2447, 2006.

[21] C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions

on Information Theory, 16:41–46, 1970.

[22] S. Clemencon and N. Vayatis. Ranking the best instances. Journal of Machine

Learning Research, 8:2671–2699, 2007.

[23] S. Clemencon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of

U-statistics. Annals of Statistics, 36:844–874, 2008.

[24] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. In Advances

in Neural Information Processing Systems, 1997.

[25] D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking.

IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.

[26] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions

on Information Theory, 13(1):21–27, 1967.

[27] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-

based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[28] O. Dekel, C. D. Manning, and Y. Singer. Log-linear models for label ranking. In

Advances in Neural Information Processing Systems, 2003.

[29] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In

Proceedings of International Conference on Machine Learning, 2004.

[30] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hullermeier. An exact algo-

rithm for F-measure maximization. In Advances in Neural Information Processing

Systems 25, 2011.


[31] K. Dembczynski, A. Jachnik, W. Kotlowski, W. Waegeman, and E. Hullermeier.

Optimizing the f-measure in multi-label classification: Plug-in rule approach ver-

sus structured loss minimization. In Proceedings of International Conference on

Machine Learning, 2013.

[32] I. Dimitrovski, D. Kocev, L. Suzana, and S. Dzeroski. Hierarchical annotation of medical images. Pattern Recognition, 2011.

[33] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of International Conference on Machine Learning, 2008.

[34] J. Duchi, L. Mackey, and M. Jordan. On the consistency of ranking algorithms. In

Proceedings of International Conference on Machine Learning, 2010.

[35] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. Liblinear: A library for large

linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[36] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research

Logistics Quarterly, 3(1-2):95–110, 1956.

[37] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.

[38] G. Fumera and F. Roli. Support vector machines with embedded reject option. Pattern Recognition with Support Vector Machines, pages 68–82, 2002.

[39] G. Fumera and F. Roli. Analysis of error-reject trade-off in linearly combined

multiple classifiers. Pattern Recognition, 37:1245–1265, 2004.

[40] G. Fumera, F. Roli, and G. Giacinto. Reject option with multiple thresholds.

Pattern Recognition, 33:2099–2101, 2000.

[41] G. Fumera, I. Pillai, and F. Roli. Classification with reject option in text categorisa-

tion systems. In IEEE International Conference on Image Analysis and Processing,

pages 582–587, 2003.

[42] J. Gallier. Notes on convex sets, polytopes, polyhedra, combinatorial topology, Voronoi diagrams and Delaunay triangulations. Technical report, Department of Computer and Information Science, University of Pennsylvania, 2009.

[43] W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. In Proceedings of International Conference on Learning Theory, 2011.

[44] M. Golfarelli, D. Maio, and D. Maltoni. On the error-reject trade-off in biometric verification systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:786–796, 1997.

[45] S. Gopal and Y. Yang. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In International Conference on Knowledge Discovery and Data Mining (KDD), 2013.

[46] S. Gopal, B. Bai, Y. Yang, and A. Niculescu-Mizil. Bayesian models for large-scale hierarchical classification. In Advances in Neural Information Processing Systems, 2012.

[47] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In Advances in Neural Information Processing Systems, 2008.

[48] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Smola, Bartlett, Schölkopf, and Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.

[49] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of International Conference on Machine Learning, 2013.

[50] K. Jain, M. Mahdian, and A. Saberi. A new greedy approach for facility location problems. In Symposium on Theory of Computing (STOC), 2002.

[51] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.

[52] W. Jiang. Process consistency for AdaBoost. Annals of Statistics, 32(1):13–29, 2004.

[53] T. Joachims. A support vector method for multivariate performance measures. In Proceedings of International Conference on Machine Learning, 2005.

[54] T. Joachims. Optimizing search engines using clickthrough data. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.

[55] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

[56] K. Kennedy, B. Namee, and S. Delany. Learning without default: A study of one-class classification and the low-default portfolio problem. In ICAICS, 2009.

[57] C. Kenyon-Mathieu and W. Schudy. How to rank with few errors. In Symposium on Theory of Computing (STOC), 2007.

[58] J.-D. Kim, Y. Wang, and Y. Yasunori. The GENIA event extraction shared task, 2013 edition - overview. Association for Computational Linguistics, 2013.

[59] S. Koco and C. Capponi. On multi-class classification through the minimization of the confusion matrix norm. In ACML, 2013.

[60] W. Kotlowski, K. Dembczynski, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In Proceedings of International Conference on Machine Learning, 2011.

[61] O. Koyejo, N. Natarajan, P. Ravikumar, and I. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, 2014.

[62] N. Lambert and Y. Shoham. Eliciting truthful answers to multiple-choice questions. In ACM Conference on Electronic Commerce, 2009.

[63] S. Lawrence, I. Burns, A. Back, A.-C. Tsoi, and C. Giles. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, LNCS volume 1524, pages 299–313. Springer, 1998.

[64] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data. Journal of the American Statistical Association, 99(465):67–81, 2004.

[65] D. Lewis. Evaluating text categorization. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, 1991.

[66] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics, 32(1):30–55, 2004.

[67] P. Machart and L. Ralaivola. Confusion matrix stability bounds for multiclass classification. Technical report, Aix-Marseille University, 2012.

[68] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[69] A. Menon, H. Narasimhan, S. Agarwal, and S. Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In Proceedings of International Conference on Machine Learning, 2013.

[70] D. Musicant, V. Kumar, and A. Ozgur. Optimizing F-measure with support vector machines. In FLAIRS, 2003.

[71] H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, 2014.

[72] H. Narasimhan*, H. G. Ramaswamy*, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In Proceedings of International Conference on Machine Learning, 2015.

[73] D. O’Brien, M. Gupta, and R. Gray. Cost-sensitive multi-class classification from probability estimates. In Proceedings of International Conference on Machine Learning, 2008.

[74] S. Parambath, N. Usunier, and Y. Grandvalet. Optimizing F-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, 2014.

[75] J. Petterson and T. Caetano. Reverse multi-label learning. In Advances in Neural Information Processing Systems, 2010.

[76] J. Petterson and T. Caetano. Submodular multi-label learning. In Advances in Neural Information Processing Systems, 2011.

[77] B. A. Pires, C. Szepesvári, and M. Ghavamzadeh. Cost-sensitive multiclass classification risk bounds. In Proceedings of International Conference on Machine Learning, 2013.

[78] L. Ralaivola. Confusion-based online learning and a passive-aggressive scheme. In Advances in Neural Information Processing Systems, 2012.

[79] H. G. Ramaswamy and S. Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, 2012.

[80] H. G. Ramaswamy, S. Agarwal, and A. Tewari. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. In Advances in Neural Information Processing Systems, 2013.

[81] H. G. Ramaswamy, B. S. Babu, S. Agarwal, and R. C. Williamson. On the consistency of output code based learning algorithms for multiclass learning problems. In Proceedings of International Conference on Learning Theory, 2014.

[82] H. G. Ramaswamy, S. Agarwal, and A. Tewari. Convex calibrated surrogates for hierarchical classification. In Proceedings of International Conference on Machine Learning, 2015.

[83] P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In International Conference on Artificial Intelligence and Statistics, 2011.

[84] M. D. Reid and R. C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387–2422, 2010.

[85] M. D. Reid and R. C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731–817, 2011.

[86] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.

[87] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7:1601–1626, 2006.

[88] L. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.

[89] M. Schervish. A general method for comparing probability assessors. Annals of Statistics, 17(4):1856–1879, 1989.

[90] C. Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.

[91] E. Shuford, A. Albert, and H. Massengill. Admissible probability measurement procedures. Psychometrika, 31(2):125–145, 1966.

[92] C. N. Silla Jr. and A. A. Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 2011.

[93] P. Simeone, C. Marrocco, and F. Tortorella. Design of reject rules for ECOC classification systems. Pattern Recognition, 45:863–875, 2012.

[94] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.

[95] I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26:225–287, 2007.

[96] C. J. Stone. Consistent nonparametric regression. Annals of Statistics, 5(4):595–620, 1977.

[97] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proceedings of International Conference on Data Mining, 2001.

[98] Y. Sun, M. Kamel, and Y. Wang. Boosting for learning multiple classes with imbalanced class distribution. In Proceedings of International Conference on Data Mining, 2006.

[99] A. Tamir. An O(pn²) algorithm for the p-median and related problems on tree graphs. Operations Research Letters, 19:59–64, 1996.

[100] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.

[101] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[102] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135–166, 2004.

[103] E. Vernet, R. C. Williamson, and M. D. Reid. Composite multiclass losses. In Advances in Neural Information Processing Systems, 2011.

[104] H. V. Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag New York, Inc., 1994.

[105] H. Wang, X. Shen, and W. Pan. Large margin hierarchical classification with mutually exclusive class membership. Journal of Machine Learning Research, 12:2721–2748, 2011.

[106] K. Wang, S. Zhou, and S. C. Liew. Building hierarchical classifiers using class proximity. In International Conference on Very Large Data Bases, 1999.

[107] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15:1523–1548, 2014.

[108] S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4):1119–1130, 2012.

[109] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In 7th European Symposium on Artificial Neural Networks, 1999.

[110] Q. Wu, C. Jia, and W. Chen. A novel classification-rejection sphere SVMs for multi-class classification problems. In IEEE International Conference on Natural Computation, 2007.

[111] F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of International Conference on Machine Learning, 2008.

[112] Z. Xiao, E. Dellandrea, W. Dou, and L. Chen. Hierarchical classification of emotional speech. IEEE Transactions on Multimedia, 2007.

[113] N. Ye, K. Chai, W. Lee, and H. Chieu. Optimizing F-measures: A tale of two approaches. In Proceedings of International Conference on Machine Learning, 2012.

[114] M. Yuan and M. Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11:111–130, 2010.

[115] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–134, 2004.

[116] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.

[117] C. Zou, E. Zheng, H. Xu, and L. Chen. Cost-sensitive multi-class SVM with reject option: A method for steam turbine generator fault diagnosis. International Journal of Computer Theory and Engineering, 2011.