
INVITED PAPER

Learning Reductions That Really Work

This paper summarizes the mathematical and computational techniques that have

enabled learning reductions to effectively address a wide class of tasks.

By Alina Beygelzimer, Hal Daumé III, John Langford, and Paul Mineiro

ABSTRACT | In this paper, we provide a summary of the

mathematical and computational techniques that have enabled

learning reductions to effectively address a wide class of tasks,

and show that this approach to solving machine learning

problems can be broadly useful. Our work is instantiated and

tested in a machine learning library, Vowpal Wabbit, to prove

that the techniques discussed here are fully viable in practice.

KEYWORDS | Learning systems; machine learning; prediction

methods

I. INTRODUCTION

In a reduction, a complex problem is decomposed into

simpler subproblems so that a solution to the subproblems

gives a solution to the complex problem. Learning

reductions differ from other types of reductions used in

computer science because they require understanding how

the distribution induced by the reduction affects the

transfer of predictive performance from the inducedproblem to the original problem.

The canonical example of a learning reduction is one-

against-all (OAA), which solves k-class classification via

reduction to k base prediction problems, one for each

class: For i ∈ {1, …, k}, the ith predictor is trained to

predict the probability of label i. To make a multiclass

prediction, the reduction chooses the class with the largest

probability estimate. Fig. 1 shows how this reduction

works experimentally, comparing the induced multiclass

loss to the average squared-error loss of the base

predictors. Because the relationship between the average

squared-error loss and multiclass loss scales with k in general, we use k = 2 for this pedagogical

experiment.

Fig. 1 confirms what is expected and guaranteed by the

analysis in Section II-B. In particular, it shows that a small

squared-error loss on the created regression problems

implies a small zero-one loss on the original two-class

classification problem. It is impossible to induce a large

zero-one loss without incurring a large squared-error loss; a large squared-error loss can (but need not) lead to a large zero-one loss.
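As a concrete illustration of the OAA scheme just described (a minimal sketch, not the Vowpal Wabbit implementation), one squared-loss regressor can be fit per class and predictions made by argmax. The least-squares base learner, the synthetic data, and all names below are illustrative assumptions:

```python
import numpy as np

def oaa_train(X, y, k):
    """Fit one squared-loss regressor per class to predict I[label == i].

    A hypothetical least-squares base learner stands in for the k base
    prediction problems; any regressor could be substituted.
    """
    X1 = np.hstack([X, np.ones((len(X), 1))])  # append a bias feature
    return np.column_stack([
        np.linalg.lstsq(X1, (y == i).astype(float), rcond=None)[0]
        for i in range(1, k + 1)
    ])

def oaa_predict(W, X):
    """Choose the class whose regressor gives the largest estimate."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return (X1 @ W).argmax(axis=1) + 1  # labels are 1..k

# Two well-separated blobs with k = 2, echoing the pedagogical experiment.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
W = oaa_train(X, y, k=2)
error_rate = float(np.mean(oaa_predict(W, X) != y))
```

On such well-separated data, a small squared error on each induced regression problem yields a small multiclass error, matching the behavior in Fig. 1.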

Are learning reductions an effective approach for

solving complex machine learning problems? The answer

is not obvious, because there is a representational concern:

maybe the process of reduction creates ‘‘hard’’ problems

that simply cannot be solved well? A simple example is

given in Fig. 2 by a three-class classification problem on the line. If all of the examples from class 1 are at x = 1, all the examples from class 2 are at x = 2, and all of class 3 at x = 3, then an OAA linear classifier cannot succeed. In

particular, it is impossible to separate class 2 from the

union of classes 1 and 3. In contrast, the all-pairs

reduction, which learns a classifier for each pair of labels,

does not suffer from this problem in this case.
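This failure mode can be checked numerically (illustrative code, not from the paper): with least-squares linear scorers, the class-2 OAA regressor is forced to be essentially constant in x, while all-pairs classifies all three points correctly.

```python
import numpy as np

# Classes 1, 2, 3 sit at x = 1, 2, 3 on the line; add a bias feature.
X1 = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([1, 2, 3])

def fit(A, t):
    # Least-squares linear fit; a stand-in for any linear base learner.
    return np.linalg.lstsq(A, t, rcond=None)[0]

# OAA: the class-2 regressor must fit targets (0, 1, 0) with a line, so
# its slope is ~0 and it cannot separate class 2 from classes 1 and 3.
w2 = fit(X1, (y == 2).astype(float))
slope_class2 = float(abs(w2[0]))

# All-pairs: one linear classifier per label pair, then a majority vote.
votes = np.zeros((3, 3))
for a, b in [(1, 2), (1, 3), (2, 3)]:
    mask = (y == a) | (y == b)
    w = fit(X1[mask], np.where(y[mask] == a, 1.0, -1.0))
    s = X1 @ w
    votes[s > 0, a - 1] += 1
    votes[s <= 0, b - 1] += 1
pairs_pred = votes.argmax(axis=1) + 1
pairs_errors = int((pairs_pred != y).sum())
```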

Although this concern is significant, there are many other convenient choices made in machine learning, such

as conjugate priors, proxy losses, and sigmoid link

functions. Perhaps the representations created by natural

learning reductions work well on natural problems. Or

perhaps there is a theory of representation-respecting

learning reductions.

We have investigated this approach to machine

learning for about a decade now, and provide a summary of results here, addressing several important desiderata.

1) A well-founded theory for analysis. A well-founded theory makes the approach teachable,

Manuscript received February 9, 2015; revised April 29, 2015; accepted

September 3, 2015. Date of publication December 10, 2015; date of current version

December 18, 2015.

A. Beygelzimer is with Yahoo Labs, New York, NY 10036 USA (e-mail:

[email protected]).

H. Daumé III is with the University of Maryland, College Park, MD 20742 USA

(e-mail: [email protected]).

J. Langford is with Microsoft Research, New York, NY 10011 USA (e-mail:

[email protected]).

P. Mineiro is with the Cloud and Information Services Laboratory, Microsoft,

Bellevue, WA 98052, USA (e-mail: [email protected]).

Digital Object Identifier: 10.1109/JPROC.2015.2494118

0018-9219 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

136 Proceedings of the IEEE | Vol. 104, No. 1, January 2016


and provides a form of assurance that good

empirical results should be expected, and carry

over to new problems.

2) Good predictive performance in practice. A theory

should provide some effective guidance about

which learning algorithms are better in practice.

3) Good computational performance. This is critical

for learning reductions, because the large data

regime is where sound algorithmics begin to

outperform clever representation and problem

understanding.

4) Good programmability. Development and maintenance burdens are not traditional concerns for machine learning but can matter significantly in practice.

5) A unique ability. To be interesting, learning

reductions must provide a means to address an

entirely new class of problems.

Here we show that all the above criteria have now been

met. Furthermore, we instantiated our work in the open

source machine learning system, Vowpal Wabbit [46].

A. Strawman OAA

A common approach to implementing OAA for k-way

multiclass classification is to create a script that processes

the data set k times, creating k intermediate binary

classification data sets, then executes a binary learningalgorithm k times, creating k different model files. For test

time evaluation, another script then invokes a testing

system k times for each example in a batch. The multiclass

prediction is the label with a positive prediction, with ties

broken arbitrarily.

A careful study of learning reductions reveals that

every aspect of this strawman approach can be improved.

B. Organization

Section II discusses the types of reduction theory that

have been developed and found most useful.

Section III discusses the programming interface we

have developed for learning reductions. Although pro-

grammability is a nonstandard concern in machine

learning applications, we have found it of critical

importance. Creating a usable interface that is not computationally constraining is critical to success.

Section IV discusses several problems for which the

only known solution is derived via a reduction mechanism,

providing evidence that the reduction approach is useful

for research.

Section V shows experimental results for a particularly

complex ‘‘deep’’ reduction for structured prediction,

including comparisons with many other approaches.

Together, these sections show that learning reductions

are a useful approach to machine learning.

II. REDUCTIONS THEORY

There are several natural learning reduction theories.

These theories differ structurally from other learning

theories, and thus offer a different mixture of strengths and weaknesses for prescribing what happens experimentally. Simple learning reductions neglect representational

concerns in favor of effective problem decomposition.

Representational concerns are important in general, but

we have found it fruitful in practice to focus on effective

problem decomposition, and let the individual problem

dictate representational choices.

Online learning [48], empirical risk minimization [60], and polynomial time probably approximately correct

(PAC) learning [59] are examples of learning theories

that take into account a choice of representation.

Only the optimization oracle reductions theory takes

representation into account (see Section II-D). The

simpler reduction theories only model the transformation

of predictive performance from one learning task to

another.

In learning reductions, unlike in other reductions used

in computer science, we need to incorporate, track, and

reason about distributions over examples. A learning

reduction from task A to task B transforms a distribution

generating A into a distribution generating B, and then

implicitly transforms a solution of some quality on task B into a solution of some quality for A.

Fig. 1. OAA reduction applied to many different k-class classification

data sets, for k = 2. The x-axis is the squared-error loss of the base

regressor. The y-axis is the k-classification loss of the implied

classifier. The lack of any points in the upper left corner for all data

sets is as predicted by analysis.

Fig. 2. Hard problem for OAA with linear representations. There is a

single feature represented by the horizontal dimension. There are

three classes 1, 2, and 3 with each example from class i having feature

value i. OAA with linear regression cannot effectively distinguish

class 2 from classes 1 and 3.


More formally, a task A is defined by:

1) a distribution $D_A$ generating examples $(x, y) \in X \times Y$, where $X$ is the instance space and $Y$ is the label space;

2) a loss function $L_A : Y \times Z \to \mathbb{R}$, where $Z$ is a task-dependent prediction space.

In k-class classification, $Y = Z = \{1, \ldots, k\}$, and

$$L_A(y, z) = I[y \neq z]$$

for $y \in Y$, $z \in Z$.

In importance-weighted binary classification, $Y = \{-1, 1\} \times \mathbb{R}^{+}$, $Z = \{-1, 1\}$, and

$$L_A(\langle y, w \rangle, z) = w \cdot I[y \neq z]$$

for $\langle y, w \rangle \in Y$, $z \in Z$. Here each example has an associated misclassification cost, and the loss is weighted by the cost.
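These two loss definitions translate directly into code; a minimal sketch with hypothetical helper names:

```python
def multiclass_loss(y, z):
    # L_A(y, z) = I[y != z] for k-class classification.
    return 1.0 if y != z else 0.0

def importance_weighted_loss(y, w, z):
    # L_A((y, w), z) = w * I[y != z]: each example carries its own
    # misclassification cost w.
    return w * (1.0 if y != z else 0.0)

ex1 = multiclass_loss(3, 2)                   # wrong class: loss 1
ex2 = importance_weighted_loss(1, 0.25, -1)   # wrong, cost-weighted
ex3 = importance_weighted_loss(-1, 5.0, -1)   # correct: no loss
```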

A reduction from task A to task B consists of two algorithms, $R$ and $R^{-1}$, where $R$ maps examples from task A to (possibly multiple instances of) examples for task B, and $R^{-1}$ maps a learned solution for task B back to a solution for task A.

For example, in the strawman OAA reduction from Section I-A, A is multiclass classification and B is binary classification. Here $R$ is the algorithm taking a multiclass example $(x, y)$ and creating k binary examples, where the ith binary example is $(x, 2 \cdot I[y = i] - 1)$. Thus, $R$ creates k instances of B. Multiple reduced instances can be combined into a single instance using a standard trick [5], which is just to augment the feature space with the name of the instance. The inverse $R^{-1}$ computes the argmax of k binary predictions, breaking ties randomly.

Critically, both A and B have loss functions associated

with them, with reductions differing in what is proved about

how losses for B translate to losses for A. Since R is an

algorithm which transforms examples of one type into

examples of another type, any distribution DA for examples of

type A induces a distribution DB for examples of type B via R.
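The pair (R, R^{-1}) for strawman OAA, including the instance-name feature trick, can be sketched as follows (the feature and name encodings here are illustrative assumptions, not Vowpal Wabbit's format):

```python
def R(example, k):
    # Create k binary examples from one multiclass example (x, y); the
    # ith binary label is 2 * I[y == i] - 1, and the instance name i is
    # appended to the features so one combined data set can be used [5].
    x, y = example
    return [((x, ('instance', i)), 2 * int(y == i) - 1)
            for i in range(1, k + 1)]

def R_inverse(binary_predictions):
    # Map the k binary predictions back to a multiclass label via argmax
    # (ties would be broken randomly; omitted here for determinism).
    best = max(binary_predictions)
    return 1 + binary_predictions.index(best)

binary_examples = R((('f1', 'f2'), 2), k=3)
induced_labels = [label for _, label in binary_examples]
multiclass_prediction = R_inverse([-0.3, 0.9, -0.1])
```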

A. Error Reductions

In an error reduction, a small error rate on the induced task implies a small error rate on the original task. Let $h_B$ be a predictor for the induced task B, and let $h_A = R^{-1}(h_B)$ be the resulting predictor for the original task. An error reduction satisfies a theorem of the form

$$E_{(x_A, y_A) \sim D_A} L_A(y_A, h_A(x_A)) \le f\left( E_{(x_B, y_B) \sim D_B} L_B(y_B, h_B(x_B)) \right)$$

where $f : \mathbb{R} \to \mathbb{R}$ is some continuous function with $f(0) = 0$.

When multiple base predictors are created, we measure their

average loss. Since it is easy to resample examples or indicate

via an importance weight that one example is more important than another, nonuniform averages are allowed.

For example, in the strawman OAA reduction, an average binary classification error rate of ε implies a multiclass error rate of at most (k − 1)ε (see [5] and [33]). This algorithm can fail in at least two ways: the first failure is if none of the k binary classifiers returns +1; the second failure is if more than one of them do. In the first case, the algorithm chooses a label arbitrarily. In the second case, the algorithm chooses a prediction from among those binary classifiers that returned +1.

A careful examination of the analysis shows how to improve the error-transformation properties of this reduction: The first observation is that it helps to break ties randomly instead of arbitrarily. The second observation is that, in the absence of other errors, a false negative implies only a 1/k probability of making the right multiclass prediction, whereas for a false positive this probability is 1/2. Thus, modifying the reduction to make the binary classifier more prone to output a positive, which can be done via an appropriate use of importance weighting, improves the error transform from (k − 1)ε to roughly (k/2)ε. As predicted by analysis, both of these elements yield an improvement in practice [14].
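One way to realize "more prone to output a positive" is to attach a larger importance weight to the single positive example in each induced binary problem. A minimal sketch follows; the weight value is purely illustrative, and the weighting that actually achieves the (k/2)ε transform is worked out in [14]:

```python
def R_weighted(example, k, positive_weight=2.0):
    # Emit (features, binary label, importance weight) triples: the one
    # positive example (the true class) carries a larger weight, so a
    # false negative costs the base learner more than a false positive.
    # positive_weight = 2.0 is a hypothetical setting for illustration.
    x, y = example
    out = []
    for i in range(1, k + 1):
        label = 1 if y == i else -1
        weight = positive_weight if label == 1 else 1.0
        out.append((x, label, weight))
    return out

weighted_examples = R_weighted(('features', 3), k=4)
```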

Another example of an error reduction for multiclass classification is based on error-correcting output codes [27], [33].

A valid criticism of error reductions is that the

guarantees they provide become vacuous if the base

problems they create are inherently noisy. For example,

when no base binary classifier can achieve an error rate

better than 2/k, the OAA guarantee above is vacuous.

Given this, all error reduction statements should be paired

with a claim that small error rates can be achieved.

B. Regret Reductions

Regret analysis addresses this criticism by analyzing the transformation of excess loss, or regret. More formally, the regret of a predictor h is the difference between its loss and the minimum achievable loss on the same task $(D, L)$:

$$\mathrm{reg}_D(h) = E_{(x, y) \sim D} L(y, h(x)) - \inf_{h'} E_{(x, y) \sim D} L(y, h'(x)).$$

Note that inf in this section is over all possible predictors

rather than predictors in some function class. (The latter

model is considered in Section II-D.) A regret reduction

bounds the regret on the original task in terms of the average regret on the base task, yielding a theorem of the form

$$\mathrm{reg}_{D_A}(h_A) \le f\left( \mathrm{reg}_{D_B}(h_B) \right)$$

where $f$, $h_A$, and $h_B$ are defined as in error reductions.


A reduction that translates any optimal (i.e., no-regret) solution to the base problems into an optimal solution to the top-level problem is called ‘‘consistent.’’ Consistency is a basic requirement for a good reduction. Unfortunately, error reductions are generally inconsistent. To see that OAA is inconsistent, consider three classes with true conditional probabilities 1/2 − 2ε, 1/4 + ε, and 1/4 + ε. The optimal base binary prediction is always the negative class, since for any i it is always more likely that the class label is not i, resulting in a multiclass loss of 2/3. The corresponding multiclass regret is 1/6 − 2ε, which is positive for any ε < 1/12.
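The arithmetic of this counterexample can be verified directly:

```python
# Check the OAA inconsistency example: p = (1/2 - 2e, 1/4 + e, 1/4 + e).
eps = 0.01
p = [0.5 - 2 * eps, 0.25 + eps, 0.25 + eps]

# Each p_i < 1/2, so every optimal binary predictor outputs the negative
# class; the multiclass tie is then broken uniformly over all 3 classes.
oaa_loss = sum((1 / 3) * (1 - pi) for pi in p)  # = 2/3 regardless of eps

# The best multiclass predictor always picks the most likely class.
best_loss = 1 - max(p)            # = 1/2 + 2*eps
regret = oaa_loss - best_loss     # = 1/6 - 2*eps, positive for eps < 1/12
```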

Strawman OAA can be easily made consistent by

reducing to squared-loss regression instead of binary

classification. The multiclass prediction is made by

evaluating the learned regressor on all labels and

predicting with the argmax. As shown below, this regression approach is consistent. It also resolves ties via

precision rather than via randomization, which is empir-

ically more effective.

Let us analyze how this approach transforms squared-loss regret into multiclass regret for any fixed x, taking expectation over x at the end. Let $h(x, a)$ be the learned regressor, predicting the conditional probability of class a given x. Let $p_a = \Pr[y = a \mid x]$ be the true conditional probability. For the analysis, we think of h as an adversary trying to induce multiclass regret without paying much in squared-loss regret.

The squared-loss regret of h on the predicted class a is

$$E_{y \mid x}\left[ (h(x, a) - I(y = a))^2 - (p_a - I(y = a))^2 \right] = (p_a - h(x, a))^2.$$

Let $a^* = \arg\max_a p_a$ be the optimal prediction on x. The regret of h on $a^*$ is, similarly, $(p_{a^*} - h(x, a^*))^2$. To incur multiclass regret, we must have $h(x, a) \ge h(x, a^*)$ for some $a \ne a^*$. The two regrets are convex, and the minimum is reached when $h(x, a) = h(x, a^*) = (p_a + p_{a^*})/2$. The corresponding squared-loss regret suffered by h on both a and $a^*$ is $(p_{a^*} - p_a)^2 / 2$. Since the regressor does not need to incur any loss on other predictions, the regressor can pay only $\mathrm{reg}(h) = (p_{a^*} - p_a)^2 / (2k)$ in average squared-loss regret to induce multiclass regret of $p_{a^*} - p_a$ on x. Solving for multiclass regret in terms of $\mathrm{reg}(h)$ shows that the multiclass regret of this approach is bounded by $\sqrt{2k \, \mathrm{reg}(h)}$. Since the adversary can actually play this optimal strategy, the bound is tight.

Moving from an error reduction to a regret reduction is

often empirically beneficial. Fig. 3 illustrates the empirical

superiority of reducing to regression rather than binary

classification for the Mnist data set [47].
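The tightness of the square-root bound in the preceding analysis can be checked numerically by playing the adversary's midpoint strategy (the probability values below are illustrative):

```python
import math

# Adversary ties h(x, a) = h(x, a*) at the midpoint (p_a + p_star) / 2,
# paying squared-loss regret only on those two of the k regressors,
# while inducing multiclass regret p_star - p_a.
k = 10
p_star, p_a = 0.6, 0.2

per_class_sq_regret = ((p_star - p_a) / 2) ** 2
avg_sq_regret = 2 * per_class_sq_regret / k  # reg(h) = (p_star - p_a)^2 / (2k)
multiclass_regret = p_star - p_a

bound = math.sqrt(2 * k * avg_sq_regret)     # matches multiclass_regret
```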

There are many known regret reductions for such

problems as multiclass classification [38], [53], cost-

sensitive classification [15], [45], and ranking [3], [4].

There is also a rich body of work on surrogate regret

bounds. It is common to use some efficiently minimizable

surrogate loss instead of the loss one actually wishes to

optimize. A surrogate regret bound quantifies the resulting regret in terms of the surrogate regret [3], [8], [65]. These

results show that standard algorithms minimizing the

surrogate are in fact consistent solutions to the problem at

hand. In some cases, commonly used surrogate losses

actually turn out to be inconsistent [31].

Many open problems exist in regret reductions, one of

which is the efficient robust conditional probability

estimation problem [44]. At a high level, the question is: How do you estimate the conditional probability of any one

of k things in time logarithmic in k with a regret ratio

bounded by a constant? A $1000 reward exists for a good

answer.

C. Adaptive Reductions

Adaptive reductions create learning problems that are

dependent on the solution to other learning problems. In

general, adaptivity is undesirable, since conditionally

defined problems are more difficult to form and solve

well: they are less amenable to parallelization, and more

prone to overfitting due to propagation and compounding

of errors. In some cases, however, the best known

approaches are adaptive reductions.

Boosting [56] is an adaptive reduction for converting any weak learner into a strong learner. Typical boosting

statements bound the error rate of the resulting classifier

in terms of the weighted training errors �t on the

distributions created adaptively by the booster. The ability

to boost is rooted in the assumption of weak

learnability: a weak learner gets a positive edge over

random guessing for any distribution created by the

Fig. 3. Mnist experimental results for OAA reduced to binary

classification and OAA reduced to squared loss regression while

varying training set size. The x-axis is the 0/1 test loss of the induced

subproblems, and the y-axis is the 0/1 test loss on the multiclass

problem. The classifiers are linear in pixels. The regression

approach (a regret transform) dominates the binary approach

(an error transform).


booster. As with any reduction, there is a concern that the booster may create ‘‘hard’’ distributions, making it difficult

to satisfy the assumption. Despite this concern, boosting

has been quite effective in practice. A boosting classifier or

regressor could compose with other reductions to allow

boosting to be applied to many new problems.

Section IV-D discusses an adaptive reduction for logarithmic time multiclass prediction. All known nonadaptive log-time approaches yield inconsistency in the presence of label noise [15].

Note that the average regret of base algorithms is still

well defined as long as there is a partial order over the base

problems, i.e., each base learning problem is defined given

a predictor for everything earlier in the order. Consequently, the regret reduction definition can extend to the adaptive case.

adaptive case.

D. Optimization Oracle Reductions

When the problem is efficiently gathering information,

as in active learning (discussed in Section IV-C) or

contextual bandit learning (discussed in Section IV-B), the

previous types of reductions are inadequate because they

lack any way to quantify progress made by the reduction.

To address this problem, we define a new class of

learning reductions based on oracle access to a set of predictors H with a limited capacity. For example, H could

be the set of all depth-2 decision trees, the set of all linear

predictors, or the set of all five-layer convolutional neural

networks with at most 1000 convolutional units.

The oracle for H takes a correctly typed data set S and returns an empirical risk minimizing predictor from H on S

$$\arg\min_{h \in H} \sum_{(x, y) \in S} L_B(y, h(x))$$

with respect to the base loss LB. The oracle gives an

abstraction of the ability to search the set H.

The form of the learning problem solved by the oracle

can be binary classification, cost-sensitive classification, or

any other reasonable primitive. Since many supervised

learning algorithms approximate such an oracle, these reductions are immediately implementable.

Since the capacity of H is limited, tools from statistical

learning theory can be used to argue about the regret of the

predictor returned by the oracle. Cleverly using this oracle

can provide solutions that are exponentially more efficient

than other approaches. Several examples are discussed in

Section IV-B and C.
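The oracle abstraction is easy to emulate for a tiny finite class. In the sketch below, H is a hypothetical set of six one-dimensional decision stumps and the oracle does brute-force empirical risk minimization under 0/1 loss; a real base learner would only approximate this:

```python
def make_stump(threshold, sign):
    # h(x) = sign if x > threshold else -sign, with sign in {+1, -1}.
    return lambda x: sign if x > threshold else -sign

# A hypothetical finite class H, identified by (threshold, sign) pairs.
H = [(t, s) for t in (0.5, 1.5, 2.5) for s in (1, -1)]

def erm_oracle(S):
    # The oracle: return the empirical risk minimizer in H on S under
    # 0/1 loss, found here by exhaustive search over the class.
    def empirical_risk(params):
        t, s = params
        h = make_stump(t, s)
        return sum(1 for x, y in S if h(x) != y)
    return min(H, key=empirical_risk)

S = [(0.0, -1), (1.0, -1), (2.0, 1), (3.0, 1)]
best_t, best_s = erm_oracle(S)
stump = make_stump(best_t, best_s)
train_errors = sum(1 for x, y in S if stump(x) != y)
```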

III. SOFTWARE ARCHITECTURE

A good interface for learning reductions should simulta-

neously be performant, generally useful, easy to program,

and eliminate systemic errors.

A. The Wrong Way

The strawman OAA approach illustrates interfacing

failures well. In particular, consider an implementation

where a binary learning executable, treated as a black box,

is orchestrated to do the OAA approach via shell scripting.

1) Scripting implies a mixed-language solution, which

is relatively difficult to maintain or understand.

2) The approach may easily fail under recursion. For

example, if another script invokes the OAA training script multiple times, it is easy to imagine

a problem where the saved models of one

invocation overwrite the saved models of another

invocation. In a good programming approach,

such errors should not be possible.

3) The transformation of multiclass examples into

binary examples is separated from the transformation of binary predictions into multiclass predictions. This substantially raises the possibility of implementation bugs compared to an approach which has encoder and decoder implemented either side by side or conformally.

4) For more advanced adaptive reductions, it is

common to require a prediction before defining

the created examples. Having a prediction script

operate separately creates a circularity (training must succeed for prediction to work, but prediction is needed for training to occur) which is extremely cumbersome to avoid in this fashion.

5) The training approach is computationally expensive

since the data set is replicated k times. Particularly

when data sets are large, this is highly undesirable.

6) The testing process is structurally slow, particularly

when there is only one test example to label. The computational time is Ω(pk), where p is the number of parameters in a saved model and k is the number of classes, simply due to the overhead of loading a model.

7) Even if all models are loaded into memory, the

process of querying each model is inherently

unfriendly to a hardware cache.

B. A Better Way

Our approach [46] eliminates all of the above interfacing bugs, resulting in a system that is general, performant,

and easily programmed.

Reductions have a structure which admits natural exploitation of information hiding. In particular, we implement algorithms which present two similar online interfaces. The first is for the purpose of prediction only (no training). An example is given in Algorithm 1 for OAA. The input is two arguments: an example and a reduction-specific context structure. The need for a predict interface exists for at least two reasons.

1) In many production deployments, an auditable

code path which only does prediction is necessary.

2) Some more complex reductions make prediction-

dependent choices about how to reduce.

Beygelzimer et al. : Learning Reductions That Really Work

140 Proceedings of the IEEE | Vol. 104, No. 1, January 2016

Page 6: INVITED PAPER LearningReductionsThat ReallyWorkresearch.cs.rutgers.edu/~lihong/ftp/papers/rl/Learning Reductions Th… · learning reductions to effectively address a wide class of

The prediction algorithm of the base learner is defined in the context and used on line 4. The system uses syntactic sugar to easily specify which base learning algorithm to use, with an index to choose which reduction context is invoked for each.

Algorithm 1: OAA_Predict(example e, context c)

1: let prediction = 1
2: let max_value = −∞
3: for i = 0 to c.k − 1 do
4:   c.predict(e, i)
5:   if e.prediction > max_value then
6:     max_value ← e.prediction
7:     prediction ← i + 1
8:   end if
9: end for
10: e.prediction ← prediction

The other interface is for learning, with an example for

OAA in Algorithm 2. This takes the same arguments, but

aggressively trains based upon available label information.

This is the only substantial difference from the prediction

interface. In particular, the learning interface also does a

prediction for several reasons.

1) Prediction is often required for learning, so it is

computationally free. Even when it is not, predic-

tion is typically cheaper than training, so the cost

of prediction is amortized.

2) Predictions allow for online monitoring of the learning process via progressive validation techniques [16]. Since in many machine learning (ML) applications finding the right features/update rule/representation is the limiting factor, the ability to debug sublinearly is beneficial.
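Progressive validation can be sketched in a few lines: score each example before training on it, so the running average loss behaves like a held-out estimate while the model keeps learning. The online mean estimator below is a hypothetical base learner standing in for a real online regressor:

```python
def progressive_validation(stream):
    # Online mean estimator as a hypothetical base learner: predict the
    # running mean, then update it with the observed value.
    mean, n, total_sq_loss = 0.0, 0, 0.0
    for y in stream:
        prediction = mean                     # predict BEFORE training
        total_sq_loss += (prediction - y) ** 2
        n += 1
        mean += (y - mean) / n                # then train on the example
    return total_sq_loss / n

pv_loss = progressive_validation([1.0, 1.0, 1.0, 1.0])
```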

Algorithm 2: OAA_Learn(example e, context c)

1: let prediction = 1
2: let min_value = ∞
3: let multiclass_label = e.label
4: for i = 0 to c.k − 1 do
5:   if multiclass_label = i + 1 then
6:     e.label = 1
7:   else
8:     e.label = −1
9:   end if
10:  let e.prediction = c.learn(e, i)
11:  if e.prediction < min_value then
12:    min_value ← e.prediction
13:    prediction ← i + 1
14:  end if
15: end for
16: e.prediction = prediction
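The control flow of Algorithms 1 and 2 can be sketched in Python. The per-class base learner here is a toy (a running mean of its ±1 labels that ignores features), which is an assumption of this sketch, not the system's base learner. Note also that because these toy regressors score the true class highest, the sketch selects the argmax, matching the introduction's "largest probability estimate" description of OAA; the pseudocode above is written in an argmin convention over base outputs.

```python
import math

class OAAContext:
    """Reduction state: k classes plus a toy per-class base regressor
    (running mean of observed +/-1 labels; stands in for c.predict/c.learn)."""
    def __init__(self, k):
        self.k = k
        self.sums = [0.0] * k
        self.counts = [0] * k

    def predict(self, i):
        return self.sums[i] / self.counts[i] if self.counts[i] else 0.0

    def learn(self, i, label):
        self.sums[i] += label
        self.counts[i] += 1
        return self.predict(i)

def oaa_predict(e, c):
    # Mirrors Algorithm 1's loop over base predictors, keeping the best.
    best, best_value = 1, -math.inf
    for i in range(c.k):
        score = c.predict(i)
        if score > best_value:
            best_value, best = score, i + 1
    e["prediction"] = best

def oaa_learn(e, c):
    # Mirrors Algorithm 2: relabel +1 for the true class and -1 otherwise,
    # train each base regressor, and also produce a prediction.
    best, best_value = 1, -math.inf
    multiclass_label = e["label"]
    for i in range(c.k):
        binary_label = 1 if multiclass_label == i + 1 else -1
        score = c.learn(i, binary_label)
        if score > best_value:
            best_value, best = score, i + 1
    e["label"] = multiclass_label   # restore the original label
    e["prediction"] = best

c = OAAContext(k=3)
for y in [2, 2, 3, 2]:              # stream of multiclass labels
    oaa_learn({"label": y}, c)
e = {}
oaa_predict(e, c)                    # class 2 was most frequent
```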

Note that although we require an online learning algorithm interface, there is no constraint that online learning must occur: the manner in which state is updated by Learn is up to the base learning algorithm. The interface certainly favors online base learning algorithms, but we have an implementation of a generalized linear model trained via L-BFGS [49] that functions as an effective (if slow) base learning algorithm.

Since reductions are composable, this interface is both a constraint on the base learning algorithm and a constraint on the learning reduction itself: the learning reduction must define its own Predict and Learn interfaces.

It is common for reductions to have associated state. Every reduction requires a base learner, which may either be another reduction or a learning algorithm. Reductions also typically have some reduction-specific state, such as the number of classes k for the OAA reduction. In a traditional object-oriented language, these arguments can be provided to the constructor of the reduction and encapsulated in a reduction object. In a purely functional language, the input arguments include an additional state variable, the context c, as implemented here. Our public implementation differs in a few minor details from the above: a template eliminates code duplication between Predict and Learn, and the base learning algorithm is explicit rather than a member of the context c.

C. Programmability

Accounting for programmability well is notoriously difficult with such easily measured metrics as lines of code. Nevertheless, the code required for simple reductions is indeed simple. For example, OAA is 69 lines, a family of link functions is 68, a noop reduction is 7 lines, and a polynomial link function learning reduction is 51 in our C++ implementation. We have seen implementations of reductions which require an order of magnitude more code despite being written in higher level languages.

Perhaps a more significant way to measure programmability is to point out the classes of bugs that have been either eliminated or mitigated by design.

1) Evaluation/learning skew. When evaluation of a learned function is separated from the learning process, a common bug is skew between these implementations, resulting in degraded performance. This can happen for mundane reasons, such as referencing parameters incorrectly or differences in read/write formats. It can also happen more structurally, for example when a hidden Markov model is trained by expectation–maximization (EM) but evaluated using beam search. The interface above discourages skew by having both written side by side, and even sometimes making them the same function, differing by a template parameter.

2) Indexing bugs. When system A reduces to multiple instances of system B, it is essential to keep track of which instance of B is desired. By using a simple countable index to reference these problems, misindexing mistakes are avoided. The simplicity


of this interface hides the complexity of composite reductions and cache-coherent parameter access, as discussed in Section III-D.

3) State capture bugs. When system A reduces to multiple instances of system B, each instance of system B must be saved and loaded for the system to restore state correctly. Often, this requires no effort on the part of the programmer, because it is automatically handled by the system. In instances where the reduction itself must save state, this can be handled either by modifying a system-provided set of saved arguments or by a save/load function that the system automatically invokes in the right order, regardless of reduction depth.

There is a third, weak measure of programmability offered by simple success. Our implementation is the product of effort by a few researchers acting as part-time programmers, yet has thousands of users, including significant industrial use. Other machine learning systems of similar complexity and userbase typically have dedicated programmers.

Another way to assert programmability is to note that reductions create modularity, which is well understood as desirable. Modularity presents opportunities for optimizations which are automatically exploited by dependent learning reductions. It also makes it worth investing effort in improving core learning algorithms and reductions, as the benefits are amortized over many settings. As an example, the superlative timings exhibited in Section V are partially attributable to the reuse of a highly optimized reduction stack.

D. Cache Coherency

The above interface addresses all the previously mentioned problems except for cache unfriendliness. To illustrate this problem, consider a multiclass Twitter classification task based on a bag of words converted to feature indices via a dictionary. For example, one document might have words represented by indices 1, 53, 720, and 860, while another document has words represented by indices 23, 820, 4003, and 5121. In a machine learning context, it is critical for efficiency to take advantage of the sparse representation here, where each word is represented by a sparse feature consisting of the index and a feature value.

When doing an OAA reduction, the number of parameters required is roughly the vocabulary size times the number of classes, which implies that the state of the learning algorithm does not fit into the fastest cache of a modern computer. Naively, an OAA reduction will have separate parameter vectors for each individual base regressor. With this memory layout, every (word, base regressor) pair creates a potential cache miss. These cache misses are typically the dominant part of the computational cost.

The implementation in the Vowpal Wabbit toolkit avoids this via a different memory layout, which can provide up to an order of magnitude improvement in computational time on sparse data sets. In essence, we interleave the parameters of the different base regressors, so the memory is laid out sequentially as follows: feature 1 for regressor 1, feature 1 for regressor 2, feature 1 for regressor 3, ..., feature 2 for regressor 1, feature 2 for regressor 2, etc. When the OAA reduction invokes the first regressor, it may generate a cache miss for every word, but for the second regressor most words are not cache misses, since caching operations are done by the hardware on many-byte segments. Thus, the maximum number of cache misses is reduced by a factor related to the cache line size.

This is a rather low level optimization technique. Can it be done in a manner which is programming friendly? The first step is to multiply each input feature index by a factor of k, the number of base regressors, mapping the feature indices as 0 → 0, 1 → k, 2 → 2k, .... Then, we can modify the feature index definition so the true index value is the nominal value produced by the mapping, plus a reduction-defined offset which varies over {0, ..., k − 1}. With respect to a base learning algorithm, this is easy to deal with: just provide a template for iterating over the features correctly under this definition. And for reductions, this offset is the index of the base learning algorithm in Algorithm 1 line 4 or Algorithm 2 line 10, so we simply record the offset for use by the feature iteration template.
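The index mapping above can be sketched as follows; the function name and constants are illustrative, not the Vowpal Wabbit code.

```python
K = 4  # number of base regressors (classes); illustrative constant

def interleaved_index(feature_index, regressor_offset, k=K):
    # The nominal index is the input feature index scaled by k
    # (0 -> 0, 1 -> k, 2 -> 2k, ...); the reduction then adds its
    # per-base-learner offset in {0, ..., k-1}.
    return feature_index * k + regressor_offset

# Parameters for one feature across all k regressors are now contiguous,
# so they tend to share a cache line:
indices_for_feature_7 = [interleaved_index(7, r) for r in range(K)]
# → [28, 29, 30, 31]
```

With the naive layout (one dense parameter vector per regressor), the same feature touches k widely separated addresses; here the k addresses are adjacent, so after the first regressor's miss the rest are usually cache hits.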

The above neglects the complexities of making this

work recursively, i.e., when A reduces to B, which reduces

to C. The principles for a recursive application are the same

and have been fully worked out in the implementation.

IV. UNIQUELY SOLVED PROBLEMS

Do reductions just provide a modular alternative that

performs as well as other, direct methods? Or do they

provide solutions to otherwise unsolved problems?

Rephrased, are learning reductions a good first tool for

developing solutions to new problems?

We provide evidence that the answer is ‘‘yes’’ by surveying an array of learning problems that, so far, have been effectively addressed only via reduction techniques. A common theme throughout these problems is computational efficiency. Often there are known inefficient approaches for solving these problems. Using a reduction approach, we can isolate the inefficiency of optimization and remove other inefficiencies, often resulting in an exponential reduction in computational time in practice.

A. Efficient Contextual Bandit Learning

In the contextual bandit learning problem, the learner repeatedly observes some context, takes an action, and observes the reward only for the chosen action.

There are two parts of this problem: 1) learning a better

policy offline using existing exploration data; and 2) optimally

controlling exploration online. A policy is functionally

equivalent to a multiclass classifier that takes as input some


context and produces an action. The term ‘‘policy’’ is used here because the action is executed: perhaps a news story is displayed, or a medical treatment is administered.

Efficient nonreduction techniques exist only for special

cases of the problem [40]. All known techniques for the

general setting [13], [32], [67] use reduction approaches.
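The interaction protocol can be sketched as follows. The epsilon-greedy policy and two-valued contexts are illustrative assumptions only; this is the problem setup, not the oracle-based algorithms cited above.

```python
import random

def contextual_bandit_loop(rounds, n_actions, reward_fn, epsilon=0.1, seed=0):
    """Protocol sketch: each round, observe a context, pick an action,
    and observe the reward of the chosen action only."""
    rng = random.Random(seed)
    totals, counts = {}, {}     # toy policy state: per-(context, action) stats
    total_reward = 0.0
    for _ in range(rounds):
        context = rng.randrange(2)               # toy two-valued context
        if rng.random() < epsilon:               # explore uniformly
            action = rng.randrange(n_actions)
        else:                                    # exploit current estimates
            action = max(range(n_actions),
                         key=lambda a: totals.get((context, a), 0.0)
                                       / max(counts.get((context, a), 0), 1))
        reward = reward_fn(context, action)      # only this reward is revealed
        totals[(context, action)] = totals.get((context, action), 0.0) + reward
        counts[(context, action)] = counts.get((context, action), 0) + 1
        total_reward += reward
    return total_reward / rounds

# Toy environment: the optimal action equals the context value.
avg = contextual_bandit_loop(2000, n_actions=3,
                             reward_fn=lambda c, a: 1.0 if a == c else 0.0)
```

A learned policy here would map the full context through a classifier instead of a lookup table; the point is only the partial-feedback structure of the loop.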

B. Efficient Exploration in Contextual Bandits

Contextual bandit learning requires a method for controlling exploration: choosing possibly suboptimal actions to gather information for better performance in the future. There are known approaches to this problem based on exponential weights [6], with a running time linear in the size of the policy set. Can statistically optimal exploration be done more efficiently?

The answer turns out to be positive [1], assuming access to an oracle for solving supervised cost-sensitive classification problems, using only O(√T) invocations of the oracle across T rounds. This result shows that the oracle can provide an exponential reduction in computational time over previous approaches, while achieving optimal regret bounds.

C. Efficient Agnostic Selective Sampling

A learning algorithm with the power to choose which examples to label can be much more efficient than a learning algorithm that passively accepts randomly labeled examples. However, most such approaches break down if strong assumptions about the nature of the problem are not met.

The canonical example is learning a threshold on the real line in the absence of any noise. A passive learning approach requires O(1/ε) samples to achieve error rate ε, whereas selective sampling requires only O(ln(1/ε)) samples using binary search. This exponential improvement is quite brittle: a small amount of label noise can yield an arbitrarily bad predictor. Inefficient approaches for addressing this brittleness statistically have been known [7], [34]. Is it possible to benefit from selective sampling in the agnostic setting efficiently?

Two algorithms have been created [12], [35] which reduce active learning for binary classification to importance-weighted binary classification, creating practical algorithms. No other efficient general approaches to agnostic selective sampling are known.
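The noise-free threshold example above can be made concrete; `binary_search_threshold` and its labeling oracle are illustrative.

```python
def binary_search_threshold(label_of, lo=0.0, hi=1.0, eps=1e-3):
    """Noise-free active learning of a threshold on [0, 1]: each label
    query halves the interval, so about log2(1/eps) labels suffice,
    versus O(1/eps) labels for passive learning."""
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if label_of(mid) == 1:   # label is 1 to the right of the threshold
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2, queries

true_threshold = 0.375
estimate, queries = binary_search_threshold(
    lambda x: 1 if x >= true_threshold else 0)
# 10 halvings shrink the interval from 1 to below 1e-3 (2**-10 < 1e-3).
```

The brittleness is also visible here: one flipped label near the boundary sends the search into the wrong half, and the final interval can exclude the true threshold entirely.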

D. Logarithmic Time Classification

Most multiclass learning algorithms have time and space complexities at least linear in the number of classes per instance when testing or training [2], [9], [11], [36], [61], [62]. Furthermore, many of these approaches tend to be inconsistent in the presence of noise: they may predict a wrong label regardless of the amount of data available for training.

It is easy to note that logarithmic time classification may be possible, since the output needs only O(log k) bits to uniquely identify a class. Can logarithmic time classification be done in a consistent and robust fashion?

Two reduction algorithms [15], [20] provide a solution to this problem. The first shows that consistency and robustness can be achieved with a logarithmic time approach. The second algorithm addresses learning of the structure directly.
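To see why O(log k) prediction is plausible, here is a sketch of descending a balanced binary tree over class ranges, where `node_decision` stands in for a learned binary classifier at each internal node; this illustrates the cost structure only, not the learned trees of [15] or [20].

```python
def tree_predict(x, k, node_decision):
    """Descend a balanced binary tree over class ranges [lo, hi):
    about log2(k) calls to node_decision instead of k one-vs-all
    evaluations."""
    lo, hi = 0, k
    calls = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        calls += 1
        if node_decision(x, lo, mid, hi):   # True -> recurse into left half
            hi = mid
        else:
            lo = mid
    return lo, calls

# Toy "learned" decision: route toward the class encoded in x itself.
decision = lambda x, lo, mid, hi: x < mid
label, calls = tree_predict(x=5, k=8, node_decision=decision)
# k = 8 classes are resolved with 3 = log2(8) binary decisions.
```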

V. LEARNING TO SEARCH FOR STRUCTURED PREDICTION

Structured prediction is the task of mapping an input to some output with complex internal structure, for example, mapping an English sentence to a sequence of part of speech tags (part of speech tagging), to a syntactic structure (parsing), or to a meaning-equivalent sentence in Chinese (translation). Structured prediction aims to induce a function f such that for any x ∈ X, f produces an output f(x) = y ∈ Y(x) in a (possibly input-dependent) space Y(x). A problem is structured if such y's can be decomposed into smaller pieces, but those pieces are tied together by features, by a loss function, or just by statistical dependence (given f, x). There is a task-specific loss function ℓ : Y × Y → R+, where ℓ(ŷ, y) tells us how bad it is to predict ŷ when the true output is y. Learning to search is a family of approaches for solving structured prediction tasks and encapsulates a number of specific algorithms (e.g., [22], [24], [26], [28], [29], [37], [50], [54], [57], [63], and [64]). Learning-to-search approaches: 1) decompose the production of the structured output in terms of an explicit search space (states, actions, etc.); and 2) learn hypotheses that control a policy that takes actions in this search space. Some learning-to-search approaches operate as learning reductions [18], [24], [54], although most have no particular theoretical guarantee associated with them.
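The decomposition can be sketched minimally for sequence tagging: the structured output is produced one action (tag) at a time, and the policy conditions on its own previous action, which is what ties the pieces together. The policy here is entirely hypothetical; this is the search-space decomposition, not the algorithm of [18].

```python
def greedy_decode(sentence, policy):
    """Produce a structured output (a tag sequence) as a sequence of
    actions: at each position the policy sees the current token and the
    previous action, so predictions are statistically coupled."""
    tags, prev = [], None
    for token in sentence:
        action = policy(token, prev)
        tags.append(action)
        prev = action
    return tags

# Hypothetical hand-written policy (a learned classifier in practice):
# tag capitalized tokens 'N' unless the previous tag was already 'N'.
def toy_policy(token, prev):
    return "N" if token[0].isupper() and prev != "N" else "O"

tags = greedy_decode(["Alice", "Smith", "runs"], toy_policy)
```

Training such a policy against the task loss, rather than hand-writing it, is exactly where the reduction to cost-sensitive classification enters.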

We implemented a learning-to-search algorithm [18] that operates via reduction to cost-sensitive classification, which is then further reduced to regression via a cost-sensitive OAA approach.1 This algorithm was then extensively tested against a suite of many structured learning algorithms, which we report here (see [25] for full details). The first task we considered was a sequence labeling problem: part of speech tagging based on data from the Wall Street Journal portion of the Penn Treebank (45 labels, evaluated by Hamming loss, 912 000 words of training data). The second is a sequence ‘‘chunking’’ problem: named entity recognition using the CoNLL 2003

1 Cost-sensitive OAA is a baseline approach which simply learns a squared-loss optimized regressor for each class and returns the argmin class. It extends OAA to support multiple labels per input example, and costs associated with classifying these labels.


data set (nine labels, macroaveraged F-measure, 205 000 words of training data).

We use the following freely available systems/algorithms as points of comparison:

1) CRF++: The popular CRF++ toolkit [42] for conditional random fields [43], which implements both L-BFGS optimization for CRFs [51] as well as ‘‘structured MIRA’’ [23], [52].

2) CRF SGD: A stochastic gradient-descent conditional random field package [17].

3) Structured Perceptron: An implementation of the structured perceptron [21] due to [19].

4) Structured SVM: The cutting-plane implementation [39] of structured SVMs [58] for ‘‘HMM’’ problems.

5) Structured SVM (DEMI-DCD): A multicore algorithm for optimizing structured SVMs called DEcoupled Model-update and Inference with Dual Coordinate Descent.

6) VW Search: Our approach, implemented in the Vowpal Wabbit toolkit on top of a cost-sensitive classifier [10], which for these experiments uses a variant of OAA called cost-sensitive OAA (CSOAA). CSOAA is a baseline approach which builds a regressor for each class that is trained to predict the conditional cost of each valid class via squared loss regression and returns the argmin as a prediction. The squared loss regression in turn was trained with an online update rule incorporating AdaGrad [30], per-feature normalized updates [55], and importance invariant updates [41]. The variant VW Search (own fts) uses computationally inexpensive feature construction facilities available in Vowpal Wabbit (e.g., token prefixes and suffixes), whereas for comparison purposes VW Search uses the same features as the other systems.

7) VW Classification: An unstructured baseline that predicts each label independently, using OAA multiclass classification [10].

These approaches vary both the objective function (CRF, MIRA, structured SVM, learning to search) and the optimization approach (L-BFGS, cutting plane, stochastic gradient descent, AdaGrad). All implementations are in C/C++, except for the structured perceptron and DEMI-DCD (Java).

In Fig. 4, we show tradeoffs between training time (x-axis, log scaled) and prediction accuracy (y-axis) for the six systems described previously. The left figure is for part of speech tagging and the right figure is for named entity recognition. For POS tagging, the independent classifier is by far the fastest (trains in less than one minute) but its performance peaks at 95% accuracy. Three other approaches are in roughly the same time/accuracy tradeoff: VW Search, VW Search (own fts), and Structured Perceptron. All three can achieve very good prediction accuracies in just a few minutes of training. CRF SGD takes about twice as long. DEMI-DCD eventually achieves the same accuracy, but it takes a half hour. CRF++ is not competitive (taking over five hours to even do as well as VW Classification). Structured SVM (cutting plane implementation) runs out of memory before achieving competitive performance, likely due to too many constraints.

For NER, the story is a bit different. The independent classifiers are far from competitive. Here, the two variants of VW Search totally dominate. In this case, Structured Perceptron, which did quite well on POS tagging, is no longer competitive and is effectively dominated by CRF SGD. The only system coming close

Fig. 4. Training time versus evaluation accuracy for part of speech tagging (left) and named entity recognition (right). The x-axis is in log scale. Different points correspond to different termination criteria for training. Both figures use hyperparameters that were tuned (for accuracy) on the heldout data. (Note: lines are curved due to the log scale x-axis.)


to VW Search's performance is DEMI-DCD, although its performance flattens out after a few minutes.2

In addition to training time, test time behavior can be of high importance in natural applications. On NER, prediction times varied from 5300 tokens/second (DEMI-DCD and Structured Perceptron) to around 20 000 (CRF SGD and Structured SVM) to 100 000 (CRF++) to 220 000 (VW (own fts)) and 285 000 (VW). Although CRF SGD and Structured Perceptron fared well in terms of training time, their test-time behavior is suboptimal.

When looking at POS tagging, the effect of the O(k) dependence on the size of the label set further increased the (relative) advantage of VW Search over alternatives.

VI. SUMMARY AND FUTURE DIRECTIONS

In working with learning reductions, the greatest benefits seem to come from modularity, deeper reductions, and computational efficiency.

Modularity means that the extra code required for, say, multiclass classification is minor compared to the code required for binary classification. It also simplifies the use of a learning system, because learning rate flags, for instance, apply to all learning algorithms. Modularity is also an easy experimentation and optimization tool, as one can plug in different black boxes for different modules.

While there are many experiments showing near-parity

prediction performance for simple reductions compared to other approaches, it appears that for deeper reductions the advantage may become more pronounced. This is well illustrated by the learning-to-search results discussed in Section V, but has been observed with contextual bandit learning as well [1]. The precise reason for this is unclear, as it is very difficult to isolate the most important difference between very different approaches to solving the problem.

Not all machine learning reductions provide computational benefits, but those that do may provide enormous benefits. These are mostly detailed in Section IV, with benefits often including an exponential reduction in computational time.

In terms of the theory itself, we have often found that

qualitative transitions from an error reduction to a regret reduction are beneficial. We have also found the isolation of concerns via encapsulation of the optimization problem to be quite helpful in developing solutions.

We have not found that precise coefficients are predictive of relative performance among two reductions accomplishing the same task with the same base learning algorithm but different representations. As an example, the theory for error-correcting tournaments [15] is substantially stronger than for OAA, yet often OAA performs better empirically. Since the theory is relativized by the performance of the base predictor, the representational compatibility issue can and does play a stronger role in predicting performance.

There are many questions we still have about learning reductions.

1) Can the proposed interface support effective use of SIMD/BLAS/GPU approaches to optimization? Marrying the computational benefits of learning reductions to the computational benefits of these approaches could be compelling.

2) Is the learning reduction approach effective when the base learner is a multitask (possibly ‘‘deep’’) learning system? Often the different subproblems created by the reduction share enough structure that a multitask approach appears plausibly effective.

3) Can the learning reduction approach be usefully applied at the representational level? Is there a theory of representational reductions?

REFERENCES

[1] A. Agarwal et al., ‘‘Taming the monster: A fast and simple algorithm for contextual bandits,’’ in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. 1638–1646.

[2] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, ‘‘Least squares revisited: Scalable approaches for multi-class prediction,’’ in Proc. 31st Int. Conf. Mach. Learn., Cycle 2, 2014, pp. 541–549.

[3] S. Agarwal, ‘‘Surrogate regret bounds for bipartite ranking via strongly proper losses,’’ J. Mach. Learn. Res., vol. 15, pp. 1653–1674, 2014.

[4] N. Ailon and M. Mohri, ‘‘Preference-based learning to rank,’’ Mach. Learn. J., vol. 8, no. 2/3, pp. 189–211, 2010.

[5] E. Allwein, R. Schapire, and Y. Singer, ‘‘Reducing multiclass to binary: A unifying approach for margin classifiers,’’ J. Mach. Learn. Res., vol. 1, pp. 113–141, 2000.

[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, ‘‘The non-stochastic multi-armed bandit problem,’’ SIAM J. Comput., vol. 32, no. 1, pp. 48–77, 2002.

[7] N. Balcan, A. Beygelzimer, and J. Langford, ‘‘Agnostic active learning,’’ in Proc. Int. Conf. Mach. Learn., 2006, pp. 65–72.

[8] P. Bartlett, M. Jordan, and J. McAuliffe, ‘‘Convexity, classification and risk bounds,’’ J. Amer. Stat. Assoc., vol. 101, no. 473, pp. 138–156, 2006.

[9] O. Beijbom, M. Saberian, D. Kriegman, and N. Vasconcelos, ‘‘Guess-averse loss functions for cost-sensitive multiclass boosting,’’ in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. 586–594.

[10] A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny, ‘‘Error limiting reductions between classification tasks,’’ in Proc. Int. Conf. Mach. Learn., 2005, pp. 49–56.

[11] S. Bengio, J. Weston, and D. Grangier, ‘‘Label embedding trees for large multi-class tasks,’’ in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 163–171.

[12] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang, ‘‘Agnostic active learning without constraints,’’ in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 199–207.

[13] A. Beygelzimer and J. Langford, ‘‘The offset tree for learning with partial labels,’’ in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, 2009, pp. 129–138.

2 We also tried giving CRF SGD the features computed by VW Search (own fts) on both POS and NER. On POS, its accuracy improved to 96.5, on par with VW Search (own fts), with effectively the same speed. On NER its performance decreased. For both tasks, clearly features matter. But which features matter is a function of the approach being taken.


[14] A. Beygelzimer, J. Langford, and B. Zadrozny, ‘‘Weighted one against all,’’ in Proc. 20th Nat. Conf. Artif. Intell., 2005, pp. 720–725.

[15] A. Beygelzimer, J. Langford, and P. Ravikumar, ‘‘Error-correcting tournaments,’’ in Proc. 20th Int. Conf. Algorithmic Learn. Theory, 2009, pp. 247–262.

[16] A. Blum, A. Kalai, and J. Langford, ‘‘Beating the holdout: Bounds for KFold and progressive cross-validation,’’ in Proc. 12th Annu. Conf. Comput. Learn. Theory, pp. 203–208.

[17] L. Bottou, ‘‘crfsgd project,’’ 2011. [Online]. Available: http://leon.bottou.org/projects/sgd

[18] K.-W. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé, III, and J. Langford, ‘‘Learning to search better than your teacher,’’ in Proc. Int. Conf. Mach. Learn., 2015, pp. 2058–2066.

[19] K.-W. Chang, V. Srikumar, and D. Roth, ‘‘Multi-core structural SVM training,’’ in Proc. Eur. Conf. Mach. Learn., vol. 8189, Lecture Notes in Computer Science, 2013, pp. 401–416.

[20] A. Choromanska and J. Langford, ‘‘Logarithmic time online multiclass prediction,’’ 2014. [Online]. Available: http://arxiv.org/abs/1406.1822

[21] M. Collins, ‘‘Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2002, DOI: 10.3115/1118693.1118694.

[22] M. Collins and B. Roark, ‘‘Incremental parsing with the perceptron algorithm,’’ in Proc. Conf. Assoc. Comput. Linguistics, 2004, DOI: 10.3115/1218955.1218970.

[23] K. Crammer and Y. Singer, ‘‘Ultraconservative online algorithms for multiclass problems,’’ J. Mach. Learn. Res., vol. 3, pp. 951–991, 2003.

[24] H. Daumé, III, J. Langford, and D. Marcu, ‘‘Search-based structured prediction,’’ Mach. Learn. J., vol. 75, no. 3, pp. 297–325, Jun. 2009.

[25] H. Daumé, III, J. Langford, and S. Ross, ‘‘Efficient programmable learning to search,’’ 2014. [Online]. Available: http://arxiv.org/abs/1406.1837

[26] H. Daumé, III and D. Marcu, ‘‘Learning as search optimization: Approximate large margin methods for structured prediction,’’ in Proc. Int. Conf. Mach. Learn., 2005, pp. 169–176.

[27] T. Dietterich and G. Bakiri, ‘‘Solving multiclass learning problems via error-correcting output codes,’’ J. Artif. Intell. Res., vol. 2, pp. 263–286, 1995.

[28] J. R. Doppa, A. Fern, and P. Tadepalli, ‘‘Output space search for structured prediction,’’ in Proc. Int. Conf. Mach. Learn., 2012, pp. 1151–1158.

[29] J. R. Doppa, A. Fern, and P. Tadepalli, ‘‘HC-Search: A learning framework for search-based structured prediction,’’ J. Artif. Intell. Res., vol. 50, pp. 369–407, 2014.

[30] J. Duchi, E. Hazan, and Y. Singer, ‘‘Adaptive subgradient methods for online learning and stochastic optimization,’’ J. Mach. Learn. Res., vol. 12, pp. 2121–2159, 2011.

[31] J. Duchi, L. Mackey, and M. I. Jordan, ‘‘On the consistency of ranking algorithms,’’ in Proc. Int. Conf. Mach. Learn., 2010, pp. 327–334.

[32] M. Dudik, J. Langford, and L. Li, ‘‘Doubly robust policy evaluation and learning,’’ in Proc. Int. Conf. Mach. Learn., 2011, pp. 1097–1104.

[33] V. Guruswami and A. Sahai, ‘‘Multiclass learning, boosting, error-correcting codes,’’ in Proc. 12th Annu. Conf. Comput. Learn. Theory, 1999, pp. 145–155.

[34] S. Hanneke, ‘‘A bound on the label complexity of agnostic active learning,’’ in Proc. Int. Conf. Mach. Learn., 2007, pp. 353–360.

[35] D. Hsu, ‘‘Algorithms for active learning,’’ Ph.D. dissertation, Dept. Comput. Sci. Eng., Univ. California San Diego, La Jolla, CA, USA, 2010.

[36] D. Hsu, S. Kakade, J. Langford, and T. Zhang, ‘‘Multi-label prediction via compressed sensing,’’ in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 772–780.

[37] L. Huang, S. Fayong, and Y. Guo, ‘‘Structured perceptron with inexact search,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2012, pp. 142–151.

[38] G. James and T. Hastie, ‘‘The error coding method and PICTs,’’ J. Comput. Graph. Stat., vol. 7, no. 3, pp. 377–387, 1998.

[39] T. Joachims, T. Finley, and C.-N. Yu, ‘‘Cutting-plane training of structural SVMs,’’ Mach. Learn. J., vol. 77, pp. 27–59, 2009.

[40] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, ‘‘Efficient bandit algorithms for online multiclass prediction,’’ in Proc. Int. Conf. Mach. Learn., 2008, pp. 440–447.

[41] N. Karampatziakis and J. Langford, ‘‘Online importance weight aware updates,’’ in Proc. 27th Conf. Uncertainty Artif. Intell., 2011, pp. 392–399.

[42] T. Kudo, ‘‘CRF++ project,’’ 2005. [Online]. Available: http://crfpp.googlecode.com

[43] J. Lafferty, A. McCallum, and F. Pereira, ‘‘Conditional random fields: Probabilistic models for segmenting and labeling sequence data,’’ in Proc. Int. Conf. Mach. Learn., 2001, pp. 282–289.

[44] J. Langford, ‘‘Robust efficient conditional probability estimation,’’ in Proc. Conf. Learn. Theory, 2010, pp. 316–317.

[45] J. Langford and A. Beygelzimer, ‘‘Sensitive error correcting output codes,’’ in Proc. Conf. Learn. Theory, 2005, pp. 158–172.

[46] J. Langford, L. Li, and A. Strehl, ‘‘Vowpal Wabbit online learning project,’’ Tech. Rep., 2007. [Online]. Available: http://hunch.net/?p=309

[47] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.

[48] N. Littlestone and M. Warmuth, ‘‘Weighted majority algorithm,’’ in Proc. IEEE Symp. Found. Comput. Sci., 1989, pp. 256–261.

[49] J. Nocedal, ‘‘Updating quasi-Newton matrices with limited storage,’’ Math. Comput., vol. 35, pp. 773–782, 1980.

[50] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, ‘‘Boosting structured prediction for imitation learning,’’ in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 1153–1160.

[51] R. Malouf, ‘‘A comparison of algorithms for maximum entropy parameter estimation,’’ in Proc. CoNLL, 2002, DOI: 10.3115/1118853.1118871.

[52] R. McDonald, K. Crammer, and F. Pereira, ‘‘Large margin online learning algorithms for scalable structured classification,’’ in Proc. NIPS Workshop Learn. Structured Outputs, 2004.

[53] H. Ramaswamy, S. B. Balaji, S. Agarwal, and R. Williamson, ‘‘On the consistency of output code based learning algorithms for multiclass learning problems,’’ in Proc. 27th Annu. Conf. Learn. Theory, 2014, pp. 885–902.

[54] S. Ross, G. J. Gordon, and J. A. Bagnell, ‘‘A reduction of imitation learning and structured prediction to no-regret online learning,’’ in Proc. Workshop Artif. Intell. Stat., 2011, pp. 627–635.

[55] S. Ross, P. Mineiro, and J. Langford, ‘‘Normalized online learning,’’ in Proc. 29th Conf. Uncertainty Artif. Intell., 2013, pp. 537–545.

[56] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms. Cambridge, MA, USA: MIT Press, 2012.

[57] U. Syed and R. E. Schapire, ‘‘A reduction from apprenticeship learning to classification,’’ in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 2253–2261.

[58] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, ‘‘Support vector machine learning for interdependent and structured output spaces,’’ in Proc. Int. Conf. Mach. Learn., 2004, DOI: 10.1145/1015330.1015341.

[59] L. Valiant, ‘‘A theory of the learnable,’’ Commun. ACM, vol. 27, pp. 1134–1142, 1984.

[60] V. Vapnik and A. Chervonenkis, ‘‘On the uniform convergence of relative frequencies of events to their probabilities,’’ Theory Probab. Appl., vol. 16, no. 2, pp. 264–280, 1971.

[61] J. Weston, A. Makadia, and H. Yee, ‘‘Label partitioning for sublinear ranking,’’ in Proc. Int. Conf. Mach. Learn., 2013, pp. 181–189.

[62] B. Zhao and E. P. Xing, ‘‘Sparse output coding for large-scale visual recognition,’’ in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3350–3357.

[63] Y. Xu and A. Fern, ‘‘On learning linear ranking functions for beam search,’’ in Proc. Int. Conf. Mach. Learn., 2007, pp. 1047–1054.

[64] Y. Xu, A. Fern, and S. W. Yoon, ‘‘Discriminative learning of beam-search heuristics for planning,’’ in Proc. Int. Joint Conf. Artif. Intell., 2007, pp. 2041–2046.

[65] M. Reid and R. Williamson, ‘‘Surrogate regret bounds for proper losses,’’ in Proc. Int. Conf. Mach. Learn., 2009, pp. 897–904.

[66] B. Zadrozny, ‘‘Policy mining: Learning decision policies from fixed sets of data,’’ Ph.D. dissertation, Dept. Comput. Sci. Eng., Univ. California San Diego, La Jolla, CA, USA, 2003.

Beygelzimer et al.: Learning Reductions That Really Work

Proceedings of the IEEE | Vol. 104, No. 1, January 2016


ABOUT THE AUTHORS

Alina Beygelzimer received the Ph.D. degree in computer science from the University of Rochester, Rochester, NY, USA, in 2003.

She is a Senior Research Scientist at Yahoo Labs, New York City, NY, USA, working on scalable machine learning. Prior to that, she was a Research Staff Member at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA.

Hal Daume, III received the B.S. degree in mathematical sciences from Carnegie Mellon University, Pittsburgh, PA, USA, and the Ph.D. degree in computer science from the University of Southern California, Los Angeles, CA, USA, with a thesis on structured prediction for language.

He is an Associate Professor in Computer Science at the University of Maryland, College Park, MD, USA. He holds joint appointments in the University of Maryland Institute for Advanced Computer Studies (UMIACS) and Linguistics. He was previously an Assistant Professor in the School of Computing, University of Utah, Salt Lake City, UT, USA. His primary research interest is in developing new learning algorithms for prototypical problems that arise in the context of language processing and artificial intelligence.

John Langford received the B.S. degrees in physics and computer science from the California Institute of Technology, Pasadena, CA, USA, in 1997, and the Ph.D. degree in computer science from Carnegie Mellon University, Pittsburgh, PA, USA, in 2002.

He is a Principal Researcher at Microsoft Research, New York City, NY, USA. He has worked at Yahoo!, Toyota Technological Institute, and IBM's Watson Research Center. He is also the primary author of the popular machine learning weblog, hunch.net, and the principal developer of Vowpal Wabbit.

Dr. Langford was the Program Co-Chair for the 2012 International Conference on Machine Learning (ICML) and is the General Chair for the 2016 ICML.

Paul Mineiro received an undergraduate degree in physics from the California Institute of Technology, Pasadena, CA, USA, and attended graduate school at the Cognitive Science Department, University of California San Diego, La Jolla, CA, USA.

He is a Research Engineer in the Cloud and Information Services Laboratory, Microsoft, Bellevue, WA, USA. His interests include online learning, extreme classification, and distributed machine learning.
