INVITED PAPER

Learning Reductions That Really Work

This paper summarizes the mathematical and computational techniques that have
enabled learning reductions to effectively address a wide class of tasks.
By Alina Beygelzimer, Hal Daume, III, John Langford, and Paul Mineiro
ABSTRACT | In this paper, we provide a summary of the
mathematical and computational techniques that have enabled
learning reductions to effectively address a wide class of tasks,
and show that this approach to solving machine learning
problems can be broadly useful. Our work is instantiated and
tested in a machine learning library, Vowpal Wabbit, to prove
that the techniques discussed here are fully viable in practice.
KEYWORDS | Learning systems; machine learning; prediction
methods
I. INTRODUCTION
In a reduction, a complex problem is decomposed into
simpler subproblems so that a solution to the subproblems
gives a solution to the complex problem. Learning
reductions differ from other types of reductions used in
computer science because they require understanding how
the distribution induced by the reduction affects the
transfer of predictive performance from the induced problem to the original problem.
The canonical example of a learning reduction is one-
against-all (OAA), which solves k-class classification via
reduction to k base prediction problems, one for each
class: For $i \in \{1, \ldots, k\}$, the $i$th predictor is trained to
predict the probability of label i. To make a multiclass
prediction, the reduction chooses the class with the largest
probability estimate. Fig. 1 shows how this reduction
works experimentally, comparing the induced multiclass
loss to the average squared-error loss of the base
predictors. Because the relationship between the average
squared-error loss and multiclass loss scales with $k$ in general, we decided to use $k = 2$ for this pedagogical experiment.
Fig. 1 confirms what is expected and guaranteed by the
analysis in Section II-B. In particular, it shows that a small
squared-error loss on the created regression problems
implies a small zero-one loss on the original two-class
classification problem. It is impossible to induce a large
zero-one loss without incurring a large squared-error loss; a large squared-error loss can (but need not) lead to a large zero-one loss.
Are learning reductions an effective approach for
solving complex machine learning problems? The answer
is not obvious, because there is a representational concern:
maybe the process of reduction creates ‘‘hard’’ problems
that simply cannot be solved well? A simple example is
given in Fig. 2 by a three-class classification problem on the line. If all of the examples from class 1 are at $x = 1$, all the examples from class 2 are at $x = 2$, and all of class 3 at $x = 3$, then an OAA linear classifier cannot succeed. In
particular, it is impossible to separate class 2 from the
union of classes 1 and 3. In contrast, the all-pairs
reduction, which learns a classifier for each pair of labels,
does not suffer from this problem in this case.
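The impossibility is elementary: any linear score $f(x) = wx + b$ satisfies $f(2) = (f(1) + f(3))/2$, so it can never be positive at $x = 2$ while negative at both $x = 1$ and $x = 3$. A minimal sketch of this check (our illustration; the function names are ours, not from the paper):

```python
# Any linear score on the line satisfies f(2) = (f(1) + f(3)) / 2,
# so no linear classifier is positive at x=2 yet negative at x=1 and x=3.
def linear_score(w, b, x):
    return w * x + b

def class2_separable(w, b):
    """True iff f(2) > 0 while f(1) < 0 and f(3) < 0."""
    return (linear_score(w, b, 2) > 0
            and linear_score(w, b, 1) < 0
            and linear_score(w, b, 3) < 0)

grid = [i / 10 for i in range(-50, 51)]
# No linear classifier on this grid isolates class 2 from classes 1 and 3...
assert not any(class2_separable(w, b) for w in grid for b in grid)
# ...while any *pair* of classes is separable by a threshold, which is why
# the all-pairs reduction succeeds here: e.g., class 1 vs. class 2.
assert linear_score(-1.0, 1.5, 1) > 0 and linear_score(-1.0, 1.5, 2) < 0
```

The exhaustive grid search is only illustrative; the midpoint identity already rules out every linear classifier.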
Although this concern is significant, there are many other convenient choices made in machine learning, such
as conjugate priors, proxy losses, and sigmoid link
functions. Perhaps the representations created by natural
learning reductions work well on natural problems. Or
perhaps there is a theory of representation-respecting
learning reductions.
We have investigated this approach to machine
learning for about a decade now, and provide a summary of results here, addressing several important desiderata.
1) A well-founded theory for analysis. A well-founded theory makes the approach teachable, and provides a form of assurance that good empirical results should be expected, and carry over to new problems.
2) Good predictive performance in practice. A theory should provide some effective guidance about which learning algorithms are better in practice.
3) Good computational performance. This is critical for learning reductions, because the large data regime is where sound algorithmics begin to outperform clever representation and problem understanding.
4) Good programmability. Development and maintenance burdens are not traditional concerns for machine learning but can matter significantly in practice.
5) A unique ability. To be interesting, learning reductions must provide a means to address an entirely new class of problems.

Manuscript received February 9, 2015; revised April 29, 2015; accepted September 3, 2015. Date of publication December 10, 2015; date of current version December 18, 2015.
A. Beygelzimer is with Yahoo Labs, New York, NY 10036 USA (e-mail:
H. Daume, III is with the University of Maryland, College Park, MD 20742 USA (e-mail: [email protected]).
J. Langford is with Microsoft Research, New York, NY 10011 USA (e-mail:
P. Mineiro is with the Cloud and Information Services Laboratory, Microsoft, Bellevue, WA 98052, USA (e-mail: [email protected]).
Digital Object Identifier: 10.1109/JPROC.2015.2494118
0018-9219 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
136 Proceedings of the IEEE | Vol. 104, No. 1, January 2016
Here we show that all the above criteria have now been
met. Furthermore, we instantiated our work in the open
source machine learning system, Vowpal Wabbit [46].
A. Strawman OAA
A common approach to implementing OAA for k-way
multiclass classification is to create a script that processes
the data set k times, creating k intermediate binary
classification data sets, then executes a binary learning algorithm k times, creating k different model files. For test
time evaluation, another script then invokes a testing
system k times for each example in a batch. The multiclass
prediction is the label with a positive prediction, with ties
broken arbitrarily.
A careful study of learning reductions reveals that
every aspect of this strawman approach can be improved.
B. Organization
Section II discusses the types of reduction theory that
have been developed and found most useful.
Section III discusses the programming interface we
have developed for learning reductions. Although pro-
grammability is a nonstandard concern in machine
learning applications, we have found it of critical
importance. Creating a usable interface that is not computationally constraining is critical to success.
Section IV discusses several problems for which the
only known solution is derived via a reduction mechanism,
providing evidence that the reduction approach is useful
for research.
Section V shows experimental results for a particularly
complex ‘‘deep’’ reduction for structured prediction,
including comparisons with many other approaches.
Together, these sections show that learning reductions
are a useful approach to machine learning.
II. REDUCTIONS THEORY
There are several natural learning reduction theories.
These theories differ structurally from other learning
theories, and thus offer a different mixture of strengths and weaknesses for prescribing what happens experimentally. Simple learning reductions neglect representational
concerns in favor of effective problem decomposition.
Representational concerns are important in general, but
we have found it fruitful in practice to focus on effective
problem decomposition, and let the individual problem
dictate representational choices.
Online learning [48], empirical risk minimization [60], and polynomial time probably approximately correct
(PAC) learning [59] are examples of learning theories
that take into account a choice of representation.
Only the optimization oracle reductions theory takes
representation into account (see Section II-D). The
simpler reduction theories only model the transformation
of predictive performance from one learning task to
another.
In learning reductions, unlike in other reductions used
in computer science, we need to incorporate, track, and
reason about distributions over examples. A learning
reduction from task A to task B transforms a distribution
generating A into a distribution generating B, and then
implicitly transforms a solution of some quality on task B into a solution of some quality for A.
Fig. 1. OAA reduction applied to many different k-class classification data sets, for $k = 2$. The x-axis is the squared-error loss of the base regressor. The y-axis is the k-classification loss of the implied classifier. The lack of any points in the upper left corner for all data sets is as predicted by analysis.
Fig. 2. Hard problem for OAA with linear representations. There is a
single feature represented by the horizontal dimension. There are
three classes 1, 2, and 3 with each example from class i having feature
value i. OAA with linear regression cannot effectively distinguish
class 2 from classes 1 and 3.
More formally, a task $A$ is defined by:
1) a distribution $D_A$ generating examples $(x, y) \in X \times Y$, where $X$ is the instance space and $Y$ is the label space;
2) a loss function $L_A : Y \times Z \to \mathbb{R}$, where $Z$ is a task-dependent prediction space.
In $k$-class classification, $Y = Z = \{1, \ldots, k\}$, and

$$L_A(y, z) = I[y \neq z]$$

for $y \in Y$, $z \in Z$.
In importance-weighted binary classification, $Y = \{-1, 1\} \times \mathbb{R}_+$, $Z = \{-1, 1\}$, and

$$L_A(\langle y, w \rangle, z) = w \cdot I[y \neq z]$$

for $\langle y, w \rangle \in Y$, $z \in Z$. Here each example has an associated misclassification cost, and the loss is weighted by the cost.
A reduction from task $A$ to task $B$ consists of two algorithms, $R$ and $R^{-1}$, where $R$ maps examples from task $A$ to (possibly multiple instances of) examples for task $B$, and $R^{-1}$ maps a learned solution for task $B$ back to a solution for task $A$.
For example, in the strawman OAA reduction from Section I-A, $A$ is multiclass classification and $B$ is binary classification. Here $R$ is the algorithm taking a multiclass example $(x, y)$ and creating $k$ binary examples, where the $i$th binary example is $(x, 2 \cdot I[y = i] - 1)$. Thus, $R$ creates $k$ instances of $B$. Multiple reduced instances can be combined into a single instance using a standard trick [5], which is just to augment the feature space with the name of the instance. The inverse $R^{-1}$ computes the argmax of $k$ binary predictions, breaking ties randomly.
Critically, both A and B have loss functions associated
with them, with reductions differing in what is proved about
how losses for B translate to losses for A. Since R is an
algorithm which transforms examples of one type into
examples of another type, any distribution $D_A$ for examples of type $A$ induces a distribution $D_B$ for examples of type $B$ via $R$.
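As a concrete illustration (our own sketch, not the Vowpal Wabbit code), $R$ and $R^{-1}$ for the strawman OAA reduction fit in a few lines, with the instance-name trick shown by tagging each binary example with its index:

```python
import random

def R(x, y, k):
    """Map one multiclass example (x, y) with y in 1..k to k binary
    examples. Each example is tagged with its instance name i, the
    standard trick for folding k induced problems into one."""
    return [((i, x), 2 * int(y == i + 1) - 1) for i in range(k)]

def R_inv(binary_predictions):
    """Map k binary predictions back to a multiclass prediction:
    argmax with ties broken randomly."""
    best = max(binary_predictions)
    ties = [i for i, p in enumerate(binary_predictions) if p == best]
    return random.choice(ties) + 1

examples = R("some-features", 2, k=3)
assert [label for (_, label) in examples] == [-1, 1, -1]
assert R_inv([-0.9, 0.7, -0.2]) == 2
```

Running a binary learner on the tagged examples and applying `R_inv` to its per-class predictions is exactly the strawman pipeline of Section I-A, minus the scripting.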
A. Error Reductions
In an error reduction, a small error rate on the induced task implies a small error rate on the original task. Let $h_B$ be a predictor for the induced task $B$, and let $h_A \doteq R^{-1}(h_B)$ be the resulting predictor for the original task. An error reduction satisfies a theorem of the form
$$\mathbb{E}_{(x_A, y_A) \sim D_A} L_A(y_A, h_A(x_A)) \le f\left(\mathbb{E}_{(x_B, y_B) \sim D_B} L_B(y_B, h_B(x_B))\right)$$

where $f : \mathbb{R} \to \mathbb{R}$ is some continuous function with $f(0) = 0$.
When multiple base predictors are created, we measure their
average loss. Since it is easy to resample examples or indicate via an importance weight that one example is more important than another, nonuniform averages are allowed.
For example, in the strawman OAA reduction, an average binary classification error rate of $\epsilon$ implies a multiclass error rate of at most $(k-1)\epsilon$ (see [5] and [33]).
This algorithm can fail in at least two ways: the first failure is if none of the $k$ binary classifiers returns $+1$; the second failure is if more than one of them do. In the first case, the algorithm chooses a label arbitrarily. In the second case, the algorithm chooses a prediction from among those binary classifiers that returned $+1$.
A careful examination of the analysis shows how to
improve the error-transformation properties of this
reduction: The first observation is that it helps to break
ties randomly instead of arbitrarily. The second observation is that, in the absence of other errors, a false negative implies only a $1/k$ probability of making the right multiclass prediction, whereas for a false positive this probability is $1/2$. Thus, modifying the reduction to make the binary classifier more prone to output a positive, which can be done via an appropriate use of importance weighting, improves the error transform from $(k-1)\epsilon$ to roughly $(k/2)\epsilon$. As predicted by analysis, both of these elements yield an improvement in practice [14].
Another example of an error reduction for multiclass classification is based on error-correcting output codes [27], [33].
A valid criticism of error reductions is that the
guarantees they provide become vacuous if the base
problems they create are inherently noisy. For example,
when no base binary classifier can achieve an error rate better than $2/k$, the OAA guarantee above is vacuous.
Given this, all error reduction statements should be paired
with a claim that small error rates can be achieved.
B. Regret Reductions
Regret analysis addresses this criticism by analyzing the transformation of excess loss, or regret. More formally, the regret of a predictor $h$ is the difference between its loss and the minimum achievable loss on the same task $(D, L)$

$$\mathrm{reg}_D(h) = \mathbb{E}_{(x,y) \sim D} L(y, h(x)) - \inf_{h'} \mathbb{E}_{(x,y) \sim D} L(y, h'(x)).$$
Note that inf in this section is over all possible predictors
rather than predictors in some function class. (The latter
model is considered in Section II-D.) A regret reduction
bounds the regret on the original task in terms of the average
regret on the base task, yielding a theorem of the form
$$\mathrm{reg}_{D_A}(h_A) \le f\left(\mathrm{reg}_{D_B}(h_B)\right)$$

where $f$, $h_A$, and $h_B$ are defined as in error reductions.
A reduction that translates any optimal (i.e., no-regret) solution to the base problems into an optimal solution to
the top-level problem is called ‘‘consistent.’’ Consistency is
a basic requirement for a good reduction. Unfortunately,
error reductions are generally inconsistent. To see that
OAA is inconsistent, consider three classes with true conditional probabilities $(1/2) - 2\epsilon$, $(1/4) + \epsilon$, and $(1/4) + \epsilon$. The optimal base binary prediction is always the negative class, since it is always more likely that the class label is not $i$, for any $i$, resulting in a multiclass loss of $2/3$. The corresponding multiclass regret is $(1/6) - 2\epsilon$, which is positive for any $\epsilon < 1/12$.
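The arithmetic in this counterexample can be checked directly; the sketch below (ours, for a concrete $\epsilon$) evaluates the always-negative strategy:

```python
eps = 0.05  # any eps < 1/12 exhibits the inconsistency
p = [1/2 - 2*eps, 1/4 + eps, 1/4 + eps]  # true conditional probabilities

# Every p_i < 1/2, so each binary classifier optimally predicts -1.
assert all(pi < 1/2 for pi in p)

# With all classifiers negative, ties are broken uniformly at random:
# the expected multiclass loss is 1 - (1/3) * sum(p) = 2/3.
oaa_loss = 1 - sum(pi / 3 for pi in p)

# The optimal multiclass predictor always plays the most likely class.
opt_loss = 1 - max(p)

regret = oaa_loss - opt_loss
assert abs(oaa_loss - 2/3) < 1e-12
assert abs(regret - (1/6 - 2*eps)) < 1e-12
assert regret > 0  # nonzero regret despite optimal base predictions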
Strawman OAA can be easily made consistent by
reducing to squared-loss regression instead of binary
classification. The multiclass prediction is made by
evaluating the learned regressor on all labels and
predicting with the argmax. As shown below, this regression approach is consistent. It also resolves ties via
precision rather than via randomization, which is empir-
ically more effective.
Let us analyze how this approach transforms squared loss regret into multiclass regret for any fixed $x$, taking expectation over $x$ at the end. Let $h(x, a)$ be the learned regressor, predicting the conditional probability of class $a$ given $x$. Let $p_a = \mathbb{E}[I[y = a] \mid x]$ be the true conditional probability. For the analysis, we think of $h$ as an adversary trying to induce multiclass regret without paying much in squared loss regret.
The squared loss regret of $h$ on the predicted class $a$ is

$$\mathbb{E}_{y \mid x}\left[(h(x, a) - I(y = a))^2 - (p_a - I(y = a))^2\right] = (p_a - h(x, a))^2.$$

Let $a^* = \arg\max_a p_a$ be the optimal prediction on $x$. The regret of $h$ on $a^*$ is, similarly, $(p_{a^*} - h(x, a^*))^2$. To incur multiclass regret, we must have $h(x, a) \ge h(x, a^*)$ for some $a \neq a^*$. The two regrets are convex and the minimum is reached when $h(x, a) = h(x, a^*) = (p_a + p_{a^*})/2$. The corresponding squared loss regret suffered by $h$ on both $a$ and $a^*$ is $(p_{a^*} - p_a)^2/2$. Since the regressor does not need to incur any loss on other predictions, the regressor can pay only $\mathrm{reg}(h) = (p_{a^*} - p_a)^2/(2k)$ in average squared loss regret to induce multiclass regret of $p_{a^*} - p_a$ on $x$. Solving for multiclass regret in terms of $\mathrm{reg}(h)$ shows that the multiclass regret of this approach is bounded by $\sqrt{2k \, \mathrm{reg}(h)}$. Since the adversary can actually play this optimal strategy, the bound is tight.
Moving from an error reduction to a regret reduction is
often empirically beneficial. Fig. 3 illustrates the empirical
superiority of reducing to regression rather than binary
classification for the Mnist data set [47].
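The tightness of the regret bound above can be verified numerically; the sketch below (ours) checks that the adversary's midpoint strategy pays exactly $(p_{a^*} - p_a)^2/(2k)$ average squared loss regret while the induced multiclass regret matches $\sqrt{2k\,\mathrm{reg}(h)}$:

```python
import math

k = 10
p_star, p_a = 0.4, 0.3          # true probabilities of a* and a runner-up a
h = (p_star + p_a) / 2          # adversary predicts the midpoint on both

# Squared loss regret on each of the two touched regressors is (p - h)^2.
per_pair = (p_star - h) ** 2 + (p_a - h) ** 2
assert abs(per_pair - (p_star - p_a) ** 2 / 2) < 1e-12

# Averaged over all k base regressors (the others incur zero regret):
reg_h = per_pair / k
assert abs(reg_h - (p_star - p_a) ** 2 / (2 * k)) < 1e-12

# The induced multiclass regret p_star - p_a matches sqrt(2 k reg(h)).
assert abs(math.sqrt(2 * k * reg_h) - (p_star - p_a)) < 1e-12
```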
There are many known regret reductions for such
problems as multiclass classification [38], [53], cost-
sensitive classification [15], [45], and ranking [3], [4].
There is also a rich body of work on surrogate regret
bounds. It is common to use some efficiently minimizable
surrogate loss instead of the loss one actually wishes to
optimize. A surrogate regret bound quantifies the resulting regret in terms of the surrogate regret [3], [8], [65]. These
results show that standard algorithms minimizing the
surrogate are in fact consistent solutions to the problem at
hand. In some cases, commonly used surrogate losses
actually turn out to be inconsistent [31].
Many open problems exist in regret reductions, one of
which is the efficient robust conditional probability
estimation problem [44]. At a high level, the question is: How do you estimate the conditional probability of any one
of k things in time logarithmic in k with a regret ratio
bounded by a constant? A $1000 reward exists for a good
answer.
C. Adaptive Reductions
Adaptive reductions create learning problems that are
dependent on the solution to other learning problems. In
general, adaptivity is undesirable, since conditionally
defined problems are more difficult to form and solve
well: they are less amenable to parallelization, and more
prone to overfitting due to propagation and compounding
of errors. In some cases, however, the best known
approaches are adaptive reductions.
Boosting [56] is an adaptive reduction for converting any weak learner into a strong learner. Typical boosting
statements bound the error rate of the resulting classifier
in terms of the weighted training errors $\epsilon_t$ on the
distributions created adaptively by the booster. The ability
to boost is rooted in the assumption of weak
learnability: a weak learner gets a positive edge over
random guessing for any distribution created by the
Fig. 3. Mnist experimental results for OAA reduced to binary
classification and OAA reduced to squared loss regression while
varying training set size. The x-axis is the 0/1 test loss of the induced
subproblems, and the y-axis is the 0/1 test loss on the multiclass
problem. The classifiers are linear in pixels. The regression
approach (a regret transform) dominates the binary approach
(an error transform).
booster. As with any reduction, there is a concern that the booster may create ‘‘hard’’ distributions, making it difficult
to satisfy the assumption. Despite this concern, boosting
has been quite effective in practice. A boosting classifier or
regressor could compose with other reductions to allow
boosting to be applied to many new problems.
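The boosting loop itself is short. The following is a minimal AdaBoost sketch over threshold stumps (our illustration of the template, not the algorithm of [56] verbatim or any Vowpal Wabbit code; `stump_oracle` and `adaboost` are our names):

```python
import math

def stump_oracle(X, y, w):
    """Weak learner: the best threshold stump on a 1-D feature under weights w."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            h = lambda x, t=thr, s=sign: s if x >= t else -s
            err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
            if best is None or err < best[0]:
                best = (err, h)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, h = stump_oracle(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            break  # weak learnability assumption violated on this distribution
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Adaptively reweight: boost the weight of misclassified examples.
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# A 1-D data set that no single stump classifies perfectly.
X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]
H = adaboost(X, y, rounds=20)
assert [H(x) for x in X] == y
```

Each created distribution is exactly the kind of adaptively defined subproblem described above: round $t$'s problem depends on the predictors learned in rounds $1, \ldots, t-1$.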
Section IV-D discusses an adaptive reduction for
logarithmic time multiclass prediction. All known non-
adaptive log-time approaches yield inconsistency in the presence of label noise [15].
Note that the average regret of base algorithms is still
well defined as long as there is a partial order over the base
problems, i.e., each base learning problem is defined given
a predictor for everything earlier in the order. Conse-
quently, the regret reduction definition can extend to the
adaptive case.
D. Optimization Oracle Reductions
When the problem is efficiently gathering information,
as in active learning (discussed in Section IV-C) or
contextual bandit learning (discussed in Section IV-B), the
previous types of reductions are inadequate because they
lack any way to quantify progress made by the reduction.
To address this problem, we define a new class of
learning reductions based on oracle access to a set of predictors H with a limited capacity. For example, H could
be the set of all depth-2 decision trees, the set of all linear
predictors, or the set of all five-layer convolutional neural
networks with at most 1000 convolutional units.
The oracle for H takes a correctly typed data set S and returns an empirical risk minimizing predictor from H on S

$$\arg\min_{h \in H} \sum_{(x, y) \in S} L_B(y, h(x))$$

with respect to the base loss $L_B$. The oracle gives an abstraction of the ability to search the set H.
The form of the learning problem solved by the oracle
can be binary classification, cost-sensitive classification, or
any other reasonable primitive. Since many supervised
learning algorithms approximate such an oracle, these reductions are immediately implementable.
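For intuition, an ERM oracle can be written directly from the definition above. This sketch is ours, over a toy capacity-limited class of threshold classifiers (it brute-forces the minimization; real supervised learners only approximate the oracle):

```python
def erm_oracle(S, H, loss):
    """Return the h in H minimizing the empirical risk sum over (x, y) in S
    of loss(y, h(x))."""
    return min(H, key=lambda h: sum(loss(y, h(x)) for x, y in S))

# Toy capacity-limited predictor set: threshold classifiers on the line.
def make_threshold(t):
    return lambda x: 1 if x >= t else -1

H = [make_threshold(t) for t in range(11)]
zero_one = lambda y, z: int(y != z)

S = [(1, -1), (2, -1), (3, -1), (4, 1), (5, 1)]
h_star = erm_oracle(S, H, zero_one)
assert [h_star(x) for x, _ in S] == [-1, -1, -1, 1, 1]
```

A reduction built on this abstraction never inspects how the minimization is performed; it only submits correctly typed data sets and consumes the returned predictor.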
Since the capacity of H is limited, tools from statistical
learning theory can be used to argue about the regret of the
predictor returned by the oracle. Cleverly using this oracle
can provide solutions that are exponentially more efficient
than other approaches. Several examples are discussed in
Section IV-B and C.
III. SOFTWARE ARCHITECTURE
A good interface for learning reductions should simulta-
neously be performant, generally useful, easy to program,
and eliminate systemic errors.
A. The Wrong Way
The strawman OAA approach illustrates interfacing
failures well. In particular, consider an implementation
where a binary learning executable, treated as a black box,
is orchestrated to do the OAA approach via shell scripting.
1) Scripting implies a mixed-language solution, which
is relatively difficult to maintain or understand.
2) The approach may easily fail under recursion. For
example, if another script invokes the OAA training script multiple times, it is easy to imagine
a problem where the saved models of one
invocation overwrite the saved models of another
invocation. In a good programming approach,
such errors should not be possible.
3) The transformation of multiclass examples into
binary examples is separated from the transfor-
mation of binary predictions into multiclass predictions. This substantially raises the possibility of implementation bugs compared to an
approach which has encoder and decoder im-
plemented either side by side or conformally.
4) For more advanced adaptive reductions, it is
common to require a prediction before defining
the created examples. Having a prediction script
operate separately creates a circularity (training must succeed for prediction to work, but prediction is needed for training to occur) which is extremely cumbersome to avoid in this fashion.
5) The training approach is computationally expensive
since the data set is replicated k times. Particularly
when data sets are large, this is highly undesirable.
6) The testing process is structurally slow, particularly
when there is only one test example to label. The computational time is $\Omega(pk)$ where $p$ is the number of parameters in a saved model and $k$ is the number of classes, simply due to the overhead of loading a model.
7) Even if all models are loaded into memory, the
process of querying each model is inherently
unfriendly to a hardware cache.
B. A Better Way
Our approach [46] eliminates all of the above interfac-
ing bugs, resulting in a system that is general, performant,
and easily programmed.
Reductions have a structure which admits natural
exploitation of information hiding. In particular, we
implement algorithms which present two similar online interfaces. The first is for the purpose of prediction only (no training). An example is given in Algorithm 1 for OAA. The input is two arguments: an example and a reduction-specific context structure. The need for a predict interface
exists for at least two reasons.
1) In many production deployments, an auditable
code path which only does prediction is necessary.
2) Some more complex reductions make prediction-
dependent choices about how to reduce.
The prediction algorithm of the base learner is defined in the context and used on line 4. The system uses syntactic sugar to
easily specify which base learning algorithm to use, with an index choosing which reduction context is invoked for each call.
Algorithm 1: OAA_Predict(example e, context c)

1: let prediction = 1
2: let max_value = −∞
3: for i = 0 to c.k − 1 do
4:   c.predict(e, i)
5:   if e.prediction > max_value then
6:     max_value ← e.prediction
7:     prediction ← i + 1
8:   end if
9: end for
10: e.prediction = prediction
The other interface is for learning, with an example for
OAA in Algorithm 2. This takes the same arguments, but
aggressively trains based upon available label information.
This is the only substantial difference from the prediction
interface. In particular, the learning interface also does a
prediction for several reasons.
1) Prediction is often required for learning, so it is
computationally free. Even when it is not, predic-
tion is typically cheaper than training, so the cost
of prediction is amortized.
2) Predictions allow for online monitoring of the
learning process via progressive validation techni-
ques [16]. Since in many machine learning (ML)
applications finding the right features/update rule/representation is the limiting factor, the ability to
sublinearly debug is beneficial.
Algorithm 2: OAA_Learn(example e, context c)

1: let prediction = 1
2: let max_value = −∞
3: let multiclass_label = e.label
4: for i = 0 to c.k − 1 do
5:   if multiclass_label = i + 1 then
6:     e.label = 1
7:   else
8:     e.label = −1
9:   end if
10:  let e.prediction = c.learn(e, i)
11:  if e.prediction > max_value then
12:    max_value ← e.prediction
13:    prediction ← i + 1
14:  end if
15: end for
16: e.prediction = prediction
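In Python-like form (our rendering of the two interfaces, not the C++ in Vowpal Wabbit; the `base` object is a hypothetical stand-in for the base learner held in the context), the paired Predict/Learn structure looks like:

```python
class OAA:
    """Sketch of OAA's paired Predict/Learn interfaces. `base` must expose
    predict(x, i) -> float and learn(x, label, i) -> float (prediction
    after the update), indexed by base problem i."""

    def __init__(self, base, k):
        self.base = base
        self.k = k

    def predict(self, x):
        # Argmax over the k base predictors, as in Algorithm 1.
        scores = [self.base.predict(x, i) for i in range(self.k)]
        return max(range(self.k), key=lambda i: scores[i]) + 1

    def learn(self, x, multiclass_label):
        # Train each base problem with a +1/-1 label and return the
        # multiclass prediction as a side effect, as in Algorithm 2.
        scores = []
        for i in range(self.k):
            binary_label = 1 if multiclass_label == i + 1 else -1
            scores.append(self.base.learn(x, binary_label, i))
        return max(range(self.k), key=lambda i: scores[i]) + 1

class TableBase:
    """Toy online base learner: per-(x, i) running mean of labels."""
    def __init__(self):
        self.sums, self.counts = {}, {}
    def predict(self, x, i):
        c = self.counts.get((x, i), 0)
        return self.sums.get((x, i), 0.0) / c if c else 0.0
    def learn(self, x, label, i):
        self.sums[(x, i)] = self.sums.get((x, i), 0.0) + label
        self.counts[(x, i)] = self.counts.get((x, i), 0) + 1
        return self.predict(x, i)

oaa = OAA(TableBase(), k=3)
for _ in range(5):
    oaa.learn("a", 2)
    oaa.learn("b", 3)
assert oaa.predict("a") == 2 and oaa.predict("b") == 3
```

Because `OAA` itself exposes `predict` and `learn`, it can serve as the `base` of another reduction, which is the composability property discussed below.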
Note that although we require an online learning
algorithm interface, there is no constraint that online
learning must occur: the manner in which state is updated by Learn is up to the base learning algorithm.
The interface certainly favors online base learning
algorithms, but we have an implementation of a general-
ized linear model trained via L-BFGS [49] that functions as
an effective (if slow) base learning algorithm.
Since reductions are composable, this interface is both
a constraint on the base learning algorithm and a
constraint on the learning reduction itself: the learning reduction must define its own Predict and Learn interfaces.
It is common for reductions to have associated state.
Every reduction requires a base learner which may either
be another reduction or a learning algorithm. Reductions
also typically have some reduction-specific state such as
number of classes k for the OAA reduction. In a traditional
object-oriented language, these arguments can be provided
to the constructor of the reduction and encapsulated in a reduction object. In a purely functional language, the input
arguments include an additional state variable, the context
c as implemented here. Our public implementation differs
in a few minor details from the above: a template
eliminates code duplication between predict and learn
and the base learning algorithm is explicit rather than a
member of the context c.
C. Programmability
Accounting for programmability well is notoriously
difficult with such easily measured metrics as lines of code.
Nevertheless, the code required for simple reductions is
indeed simple. For example, OAA is 69 lines, a family of
link functions is 68, a noop reduction is 7 lines, and a
polynomial link function learning reduction is 51 in our
C++ implementation. We have seen implementations of reductions which require an order of magnitude more code
despite being written in higher level languages.
Perhaps a more significant way to measure program-
mability is to point out the classes of bugs that have been
either eliminated or retarded by design.
1) Evaluation/learning skew. When evaluation of a
learned function is separated from the learning
process, a common bug is skew between these implementations, resulting in degraded performance.
This can happen for mundane reasons such as
referencing parameters wrong or differences in
format read/write. This can also happen more struc-
turally, for example when a hidden Markov model
is trained by expectation–maximization (EM), but
evaluated using beam search. The interface above
discourages skew by having both written side by side and even sometimes making them the same
function, differing by a template parameter.
2) Indexing bugs. When system A reduces to multiple
instances of system B, it is essential to keep track
of which instance of B is desired. By using a simple
countable index to reference these problems,
misindexing mistakes are avoided. The simplicity
of this interface hides the complexity of composite reductions and cache-coherent parameter access,
as discussed in Section III-D.
3) State capture bugs. When system A reduces to
multiple instances of system B, each instance of
system B must be saved and loaded for the system
to restore state correctly. Often, this requires no
effort on the part of the programmer, because it is
automatically handled by the system. In instances where the reduction itself must save state, this can
be handled by either modifying a system provided
set of saved arguments or a save/load function that
the system automatically invokes in the right
order, regardless of reduction depth.
There is a third weak measure of programmability
offered by simple success. Our implementation is the
product of effort by a few researchers acting as part-time programmers, yet has thousands of users, including
significant industrial use. Other machine learning systems
of a similar complexity and userbase typically have
dedicated programmers.
Another way to assert programmability is to note that
reductions create modularity, which is well understood as
desirable. Modularity presents opportunities for optimizations
which are automatically exploited by dependent learning reductions. Alternatively, it is worth investing effort in improving core learning algorithms and reductions, as the benefits are
amortized over many settings. As an example, the superlative
timings exhibited in Section V are partially attributable to the
reuse of a highly optimized reduction stack.
D. Cache Coherency
The above interface addresses all the previously
mentioned problems except for cache unfriendliness. To
illustrate this problem, consider a multiclass Twitter
classification task based on a bag of words converted to
feature indices via a dictionary. For example, one
document might have words represented by indices 1,
53, 720, and 860 when another document has words
represented by indices 23, 820, 4003, and 5121. In a
machine learning context, it is critical for efficiency to take advantage of the sparse representation here, where each
word is represented by a sparse feature consisting of the
index and a feature value.
When doing an OAA reduction, the number of
parameters required is roughly the vocabulary size times
the number of classes, which implies that the state of the
learning algorithm does not fit into the fastest cache of a
modern computer. Naively, an OAA reduction will have separate parameter vectors for each individual base
regressor. With this memory layout, every (word, base
regressor) pair creates a potential cache miss. These cache
misses are typically the dominant part of the computa-
tional cost.
The implementation in the Vowpal Wabbit toolkit
avoids this via a different memory layout, which can
provide up to an order of magnitude improvement in computational time on sparse data sets. In essence, we
interleave the parameters of the different base regressors,
so the memory is laid out sequentially as follows: feature 1
for regressor 1, feature 1 for regressor 2, feature 1 for
regressor 3, . . ., feature 2 for regressor 1, feature 2 for
regressor 2, etc. When the OAA reduction invokes the
first regressor, it may generate a cache miss for every word,
but for the second regressor every word is not a cache misssince caching operations are done by the hardware on
many-byte segments. Thus, the maximum number of cache
misses is reduced by a factor related to the cache line size.
This is a rather low-level optimization technique. Can it be done in a manner which is programming friendly? The first step is to multiply each input feature index by a factor of k, the number of base regressors, mapping the feature indices as 0 → 0, 1 → k, 2 → 2k, . . .. Then, we can modify the feature index definition so the true index value is the nominal value produced by the mapping, plus a reduction-defined offset which varies over {0, . . . , k − 1}. With respect to a base learning algorithm, this is easy to deal with: just provide a template for iterating over the features correctly under this definition. And for reductions, this offset is the index of the base learning algorithm in Algorithm 1 line 4 or Algorithm 2 line 10, so we simply record the offset for use by the feature iteration template.
The above neglects the complexities of making this
work recursively, i.e., when A reduces to B, which reduces
to C. The principles for a recursive application are the same
and have been fully worked out in the implementation.
IV. UNIQUELY SOLVED PROBLEMS
Do reductions just provide a modular alternative that
performs as well as other, direct methods? Or do they
provide solutions to otherwise unsolved problems?
Rephrased, are learning reductions a good first tool for
developing solutions to new problems?
We provide evidence that the answer is ‘‘yes’’ by surveying an array of learning problems that, so far, have been effectively addressed only via reduction techniques. A common theme throughout these problems is computational efficiency. Often there are known inefficient approaches for solving these problems. Using a reduction approach, we can isolate the inefficiency of optimization, and remove other inefficiencies, often resulting in an exponential reduction in computational time in practice.
A. Efficient Contextual Bandit Learning

In the contextual bandit learning problem, the learner repeatedly observes some context, takes an action, and observes the reward only for the chosen action.
There are two parts of this problem: 1) learning a better policy offline using existing exploration data; and 2) optimally controlling exploration online. A policy is functionally equivalent to a multiclass classifier that takes as input some context and produces an action. The term ‘‘policy’’ is used here because the action is executed: perhaps a news story is displayed, or a medical treatment is administered.

Beygelzimer et al.: Learning Reductions That Really Work
142 Proceedings of the IEEE | Vol. 104, No. 1, January 2016

Efficient nonreduction techniques exist only for special cases of the problem [40]. All known techniques for the general setting [13], [32], [67] use reduction approaches.
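For intuition on part 1), offline policy evaluation from logged exploration data can be sketched with the basic inverse propensity score (IPS) estimator (an illustrative sketch only; the methods of [13] and [32] are more sophisticated, and the record format here is our assumption):

```python
# Sketch: estimate a policy's expected reward from logged bandit data.
# Each record: (context, action_taken, observed_reward, prob_of_action),
# where prob_of_action is the logging policy's probability of that action.
def ips_value(logged_data, policy):
    total = 0.0
    for context, action, reward, prob in logged_data:
        if policy(context) == action:
            total += reward / prob  # importance-weight the matching records
    return total / len(logged_data)
```

The estimator is unbiased when the logged probabilities are correct, which is why recording the exploration probability at logging time matters.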
B. Efficient Exploration in Contextual Bandits

Contextual bandit learning requires a method for controlling exploration, choosing possibly suboptimal actions to gather information for better performance in the future. There are known approaches to this problem based on exponential weights [6], with a running time linear in the size of the policy set. Can statistically optimal exploration be done more efficiently?
The answer turns out to be positive [1], assuming access to an oracle for solving supervised cost-sensitive classification problems, using only O(√T) instances of the oracle across T rounds. This result shows that the oracle can provide an exponential reduction in computational time over previous approaches, while achieving optimal regret bounds.
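For contrast, the simplest exploration strategy, epsilon-greedy, fits in a few lines (a baseline sketch, not the statistically optimal oracle-based algorithm of [1]):

```python
import random

# Sketch: one round of epsilon-greedy contextual bandit exploration.
# predictors: one estimated-reward function per action.
def epsilon_greedy_round(context, predictors, epsilon, rng=random):
    k = len(predictors)
    scores = [p(context) for p in predictors]
    greedy = max(range(k), key=scores.__getitem__)
    action = rng.randrange(k) if rng.random() < epsilon else greedy
    # record the action probability so the logged data supports offline reuse
    prob = epsilon / k + (1.0 - epsilon) * (action == greedy)
    return action, prob
```

Epsilon-greedy is computationally trivial but explores uniformly regardless of what has been learned, which is the statistical inefficiency the oracle-based approach removes.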
C. Efficient Agnostic Selective Sampling

A learning algorithm with the power to choose which examples to label can be much more efficient than a learning algorithm that passively accepts randomly labeled examples. However, most such approaches break down if strong assumptions about the nature of the problem are not met.
The canonical example is learning a threshold on the real line in the absence of any noise. A passive learning approach requires O(1/ε) samples to achieve error rate ε, whereas selective sampling requires only O(ln(1/ε)) samples using binary search. This exponential reduction in label complexity is quite brittle: a small amount of label noise can yield an arbitrarily bad predictor.
Inefficient approaches for addressing this brittleness statistically have been known [7], [34]. Is it possible to benefit from selective sampling in the agnostic setting efficiently?
Two algorithms have been created [12], [35] which reduce active learning for binary classification to importance-weighted binary classification, creating practical algorithms. No other efficient general approaches to agnostic selective sampling are known.
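The binary-search argument above can be made concrete on a grid of n points (a noiseless sketch; `label` is a hypothetical labeling oracle):

```python
# Sketch: learn a threshold over n grid points in the noiseless case.
# Returns the smallest index whose label is 1, plus the number of label
# queries, which is O(log n) rather than the O(n) of passive labeling.
def learn_threshold(n, label):
    queries, lo, hi = 0, 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(mid) == 1:
            hi = mid
        else:
            lo = mid + 1
    return lo, queries
```

A single adversarially noisy label near the boundary can steer this search to an arbitrarily bad threshold, which is exactly the brittleness noted above.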
D. Logarithmic Time Classification

Most multiclass learning algorithms have time and space complexities at least linear in the number of classes per instance when testing or training [2], [9], [11], [36], [61], [62]. Furthermore, many of these approaches tend to be inconsistent in the presence of noise: they may predict a wrong label regardless of the amount of data available for training.
It is easy to note that logarithmic time classification may be possible, since the output needs only O(log k) bits to uniquely identify a class. Can logarithmic time classification be done in a consistent and robust fashion?
Two reduction algorithms [15], [20] provide a solution to this problem. The first shows that consistency and robustness can be achieved with a logarithmic time approach. The second algorithm addresses learning of the structure directly.
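To see why O(log k) prediction time is plausible, here is a minimal sketch of routing an example down a balanced tree of binary predictors (nodes are identified by their class ranges; achieving consistency and robustness, the hard part solved in [15], [20], is not addressed here):

```python
# Sketch: predict one of k classes with about log2(k) binary decisions.
# node_predictor((lo, hi), x) returns 0 to go left, 1 to go right.
def tree_predict(x, k, node_predictor):
    lo, hi = 0, k  # current subtree covers classes [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if node_predictor((lo, hi), x) == 0:
            hi = mid  # left subtree: classes [lo, mid)
        else:
            lo = mid  # right subtree: classes [mid, hi)
    return lo
```

Each prediction invokes only the predictors on one root-to-leaf path, so the cost per example is logarithmic in k rather than linear.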
V. LEARNING TO SEARCH FOR STRUCTURED PREDICTION

Structured prediction is the task of mapping an input to some output with complex internal structure, for example, mapping an English sentence to a sequence of part of speech tags (part of speech tagging), to a syntactic structure (parsing), or to a meaning-equivalent sentence in Chinese (translation). Structured prediction aims to induce a function f such that for any x ∈ X, f produces an output f(x) = y ∈ Y(x) in a (possibly input-dependent) space Y(x). A problem is structured if such y's can be decomposed into smaller pieces, but those pieces are tied together by features, by a loss function, or just by statistical dependence (given f, x). There is a task-specific loss function ℓ : Y × Y → R+, where ℓ(ŷ, y) tells us how bad it is to predict ŷ when the true output is y. Learning to search is a family of approaches for solving structured prediction tasks and encapsulates a number of specific algorithms (e.g., [22], [24], [26], [28], [29], [37], [50], [54], [57], [63], and [64]). Learning-to-search approaches: 1) decompose the production of the structured output in terms of an explicit search space (states, actions, etc.); and 2) learn hypotheses that control a policy that takes actions in this search space. Some learning-to-search approaches operate as learning reductions [18], [24], [54], although most have no particular theoretical guarantee associated with them.
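A minimal sketch of step 1) for sequence labeling: the structured output is produced left to right, with a policy making one multiclass decision per position. (Training that policy on its own rolled-in states is what the algorithms above add; the feature choice here is purely illustrative.)

```python
# Sketch: greedy decoding of a tag sequence as a series of policy actions.
# The policy sees the current word and the previously predicted tag.
def greedy_decode(words, policy, start_tag="<s>"):
    tags, prev = [], start_tag
    for word in words:
        tag = policy(word, prev)  # one cost-sensitive multiclass decision
        tags.append(tag)
        prev = tag
    return tags

def hamming_loss(predicted, gold):
    # the task loss for part of speech tagging: count of wrong tags
    return sum(p != g for p, g in zip(predicted, gold))
```

Because each decision conditions on earlier predictions, an early mistake changes later states; learning-to-search methods train the policy on the state distribution it actually induces rather than on gold-standard states.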
We implemented a learning-to-search algorithm [18] that operates via reduction to cost-sensitive classification, which is then further reduced to regression via a cost-sensitive OAA approach.¹ This algorithm was then extensively tested against a suite of many structured learning algorithms which we report here (see [25] for full details). The first task we considered was a sequence labeling problem: part of speech tagging based on data from the Wall Street Journal portion of the Penn Treebank (45 labels, evaluated by Hamming loss, 912 000 words of training data). The second is a sequence ‘‘chunking’’ problem: named entity recognition using the CoNLL 2003 data set (nine labels, macroaveraged F-measure, 205 000 words of training data).

¹ Cost-sensitive OAA is a baseline approach which simply learns a squared-loss optimized regressor for each class and returns the argmin class. It extends OAA to support multiple labels per input example, and costs associated with classifying these labels.
We use the following freely available systems/algorithms as points of comparison:
1) CRF++: The popular CRF++ toolkit [42] for conditional random fields [43], which implements both L-BFGS optimization for CRFs [51] as well as ‘‘structured MIRA’’ [23], [52].
2) CRF SGD: A stochastic gradient-descent conditional random field package [17].
3) Structured Perceptron: An implementation of the structured perceptron [21] due to [19].
4) Structured SVM: The cutting-plane implementation [39] of the structured SVMs [58] for ‘‘HMM’’ problems.
5) Structured SVM (DEMI-DCD): A multicore algorithm for optimizing structured SVMs called DEcoupled Model-update and Inference with Dual Coordinate Descent.
6) VW Search: Our approach is implemented in the Vowpal Wabbit toolkit on top of a cost-sensitive classifier [10], which for these experiments uses a variant of OAA called cost-sensitive OAA (CSOAA). CSOAA is a baseline approach which builds a regressor for each class that is trained to predict the conditional cost of each valid class via squared loss regression and returns the argmin as a prediction. The squared loss regression in turn was trained with an online update rule incorporating AdaGrad [30], per-feature normalized updates [55], and importance invariant updates [41]. The variant VW Search (own fts) uses computationally inexpensive feature construction facilities available in Vowpal Wabbit (e.g., token prefixes and suffixes), whereas for comparison purposes VW Search uses the same features as the other systems.
7) VW Classification: An unstructured baseline that predicts each label independently, using OAA multiclass classification [10].
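The CSOAA baseline in item 6) can be sketched as follows (an illustrative plain-SGD version; the actual Vowpal Wabbit update additionally uses the AdaGrad [30], normalized [55], and importance invariant [41] rules):

```python
# Sketch of cost-sensitive OAA: one squared-loss regressor per class
# predicts that class's cost; the prediction is the argmin over classes.
def csoaa_update(weights, x, costs, lr=0.1):
    # x: dict of feature -> value; costs: one observed cost per class
    for c, cost in enumerate(costs):
        pred = sum(weights[c].get(f, 0.0) * v for f, v in x.items())
        grad = pred - cost  # gradient of 0.5 * (pred - cost)^2 in pred
        for f, v in x.items():
            weights[c][f] = weights[c].get(f, 0.0) - lr * grad * v

def csoaa_predict(weights, x):
    scores = [sum(w.get(f, 0.0) * v for f, v in x.items()) for w in weights]
    return min(range(len(scores)), key=scores.__getitem__)
```

Predicting the cost of each class and taking the argmin is what lets the same machinery handle multiple valid labels per example with different costs, as the footnote describes.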
These approaches vary both the objective function (CRF, MIRA, structured SVM, learning to search) and the optimization approach (L-BFGS, cutting plane, stochastic gradient descent, AdaGrad). All implementations are in C/C++, except for the structured perceptron and DEMI-DCD (Java).
In Fig. 4, we show tradeoffs between training time (x-axis, log scaled) and prediction accuracy (y-axis) for the six systems described previously. The left figure is for part of speech tagging and the right figure is for named entity recognition. For POS tagging, the independent classifier is by far the fastest (trains in less than one minute) but its performance peaks at 95% accuracy. Three other approaches are in roughly the same time/accuracy tradeoff: VW Search, VW Search (own fts), and Structured Perceptron. All three can achieve very good prediction accuracies in just a few minutes of training. CRF SGD takes about twice as long. DEMI-DCD eventually achieves the same accuracy, but it takes a half hour. CRF++ is not competitive (taking over five hours to even do as well as VW Classification). Structured SVM (cutting plane implementation) runs out of memory before achieving competitive performance, likely due to too many constraints.
For NER, the story is a bit different. The independent classifiers are far from competitive. Here, the two variants of VW Search totally dominate. In this case, Structured Perceptron, which did quite well on POS tagging, is no longer competitive and is effectively dominated by CRF SGD. The only system coming close to VW Search's performance is DEMI-DCD, although its performance flattens out after a few minutes.²

Fig. 4. Training time versus evaluation accuracy for part of speech tagging (left) and named entity recognition (right). The x-axis is in log scale. Different points correspond to different termination criteria for training. Both figures use hyperparameters that were tuned (for accuracy) on the heldout data. (Note: lines are curved due to the log scale x-axis.)
In addition to training time, test-time behavior can be of high importance in natural applications. On NER, prediction times varied from 5300 tokens/second (DEMI-DCD and Structured Perceptron) to around 20 000 (CRF SGD and Structured SVM) to 100 000 (CRF++) to 220 000 (VW (own fts)) and 285 000 (VW). Although CRF SGD and Structured Perceptron fared well in terms of training time, their test-time behavior is suboptimal.
When looking at POS tagging, the effect of the O(k) dependence on the size of the label set further increased the (relative) advantage of VW Search over alternatives.
VI. SUMMARY AND FUTURE DIRECTIONS

In working with learning reductions, the greatest benefits seem to come from modularity, deeper reductions, and computational efficiency.
Modularity means that the extra code required for, say, multiclass classification is minor compared to the code required for binary classification. It also simplifies the use of a learning system, because learning rate flags, for instance, apply to all learning algorithms. Modularity is also an easy experimentation and optimization tool, as one can plug in different black boxes for different modules.
While there are many experiments showing near-parity prediction performance for simple reductions compared to other approaches, it appears that for deeper reductions the advantage may become more pronounced. This is well illustrated by the learning-to-search results discussed in Section V, but has been observed with contextual bandit learning as well [1]. The precise reason for this is unclear, as it is very difficult to isolate the most important difference between very different approaches to solving the problem.
Not all machine learning reductions provide computational benefits, but those that do may provide enormous benefits. These are mostly detailed in Section IV, with benefits often including an exponential reduction in computational time.
In terms of the theory itself, we have often found that qualitative transitions from an error reduction to a regret reduction are beneficial. We have also found the isolation of concerns via encapsulation of the optimization problem to be quite helpful in developing solutions.
We have not found that precise coefficients are predictive of relative performance among two reductions accomplishing the same task with the same base learning algorithm but different representations. As an example, the theory for error-correcting tournaments [15] is substantially stronger than for OAA, yet often OAA performs better empirically. Since the theory is relativized by the performance of the base predictor, the representational compatibility issue can and does play a stronger role in predicting performance.
There are many questions we still have about learning reductions.
1) Can the proposed interface support effective use of SIMD/BLAS/GPU approaches to optimization? Marrying the computational benefits of learning reductions to the computational benefits of these approaches could be compelling.
2) Is the learning reduction approach effective when the base learner is a multitask (possibly ‘‘deep’’) learning system? Often the different subproblems created by the reduction share enough structure that a multitask approach appears plausibly effective.
3) Can the learning reduction approach be usefully applied at the representational level? Is there a theory of representational reductions?
RE FERENCES
[1] A. Agarwal et al., ‘‘Taming the monster: A fast and simple algorithm for contextual bandits,’’ in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. 1638–1646.
[2] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant, ‘‘Least squares revisited: Scalable approaches for multi-class prediction,’’ in Proc. 31st Int. Conf. Mach. Learn., Cycle 2, 2014, pp. 541–549.
[3] S. Agarwal, ‘‘Surrogate regret bounds for bipartite ranking via strongly proper losses,’’ J. Mach. Learn. Res., vol. 15, pp. 1653–1674, 2014.
[4] N. Ailon and M. Mohri, ‘‘Preference-based learning to rank,’’ Mach. Learn. J., vol. 8, no. 2/3, pp. 189–211, 2010.
[5] E. Allwein, R. Schapire, and Y. Singer, ‘‘Reducing multiclass to binary: A unifying approach for margin classifiers,’’ J. Mach. Learn. Res., vol. 1, pp. 113–141, 2000.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, ‘‘The non-stochastic multi-armed bandit problem,’’ SIAM J. Comput., vol. 32, no. 1, pp. 48–77, 2002.
[7] N. Balcan, A. Beygelzimer, and J. Langford, ‘‘Agnostic active learning,’’ in Proc. Int. Conf. Mach. Learn., 2006, pp. 65–72.
[8] P. Bartlett, M. Jordan, and J. McAuliffe, ‘‘Convexity, classification and risk bounds,’’ J. Amer. Stat. Assoc., vol. 101, no. 473, pp. 138–156, 2006.
[9] O. Beijbom, M. Saberian, D. Kriegman, and N. Vasconcelos, ‘‘Guess-averse loss functions for cost-sensitive multiclass boosting,’’ in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. 586–594.
[10] A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny, ‘‘Error limiting reductions between classification tasks,’’ in Proc. Int. Conf. Mach. Learn., 2005, pp. 49–56.
[11] S. Bengio, J. Weston, and D. Grangier, ‘‘Label embedding trees for large multi-class tasks,’’ in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 163–171.
[12] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang, ‘‘Agnostic active learning without constraints,’’ in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 199–207.
[13] A. Beygelzimer and J. Langford, ‘‘The offset tree for learning with partial labels,’’ in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, 2009, pp. 129–138.
² We also tried giving CRF SGD the features computed by VW Search (own fts) on both POS and NER. On POS, its accuracy improved to 96.5, on par with VW Search (own fts), with effectively the same speed. On NER its performance decreased. For both tasks, clearly features matter. But which features matter is a function of the approach being taken.
[14] A. Beygelzimer, J. Langford, and B. Zadrozny, ‘‘Weighted one against all,’’ in Proc. 20th Nat. Conf. Artif. Intell., 2005, pp. 720–725.
[15] A. Beygelzimer, J. Langford, and P. Ravikumar, ‘‘Error-correcting tournaments,’’ in Proc. 20th Int. Conf. Algorithmic Learn. Theory, 2009, pp. 247–262.
[16] A. Blum, A. Kalai, and J. Langford, ‘‘Beating the holdout: Bounds for KFold and progressive cross-validation,’’ in Proc. 12th Annu. Conf. Comput. Learn. Theory, 1999, pp. 203–208.
[17] L. Bottou, ‘‘crfsgd project,’’ 2011. [Online]. Available: http://leon.bottou.org/projects/sgd
[18] K.-W. Chang, A. Krishnamurthy, A. Agarwal, H. Daume, III, and J. Langford, ‘‘Learning to search better than your teacher,’’ in Proc. Int. Conf. Mach. Learn., 2015, pp. 2058–2066.
[19] K.-W. Chang, V. Srikumar, and D. Roth, ‘‘Multi-core structural SVM training,’’ in Proc. Eur. Conf. Mach. Learn., vol. 8189, Lecture Notes in Computer Science, 2013, pp. 401–416.
[20] A. Choromanska and J. Langford, ‘‘Logarithmic time online multiclass prediction,’’ 2014. [Online]. Available: http://arxiv.org/abs/1406.1822
[21] M. Collins, ‘‘Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms,’’ in Proc. Conf. Empirical Methods Natural Lang. Process., 2002, DOI: 10.3115/1118693.1118694.
[22] M. Collins and B. Roark, ‘‘Incremental parsing with the perceptron algorithm,’’ in Proc. Conf. Assoc. Comput. Linguistics, 2004, DOI: 10.3115/1218955.1218970.
[23] K. Crammer and Y. Singer, ‘‘Ultraconservative online algorithms for multiclass problems,’’ J. Mach. Learn. Res., vol. 3, pp. 951–991, 2003.
[24] H. Daume, III, J. Langford, and D. Marcu, ‘‘Search-based structured prediction,’’ Mach. Learn. J., vol. 75, no. 3, pp. 297–325, Jun. 2009.
[25] H. Daume, III, J. Langford, and S. Ross, ‘‘Efficient programmable learning to search,’’ 2014. [Online]. Available: http://arxiv.org/abs/1406.1837
[26] H. Daume, III and D. Marcu, ‘‘Learning as search optimization: Approximate large margin methods for structured prediction,’’ in Proc. Int. Conf. Mach. Learn., 2005, pp. 169–176.
[27] T. Dietterich and G. Bakiri, ‘‘Solving multiclass learning problems via error-correcting output codes,’’ J. Artif. Intell. Res., vol. 2, pp. 263–286, 1995.
[28] J. R. Doppa, A. Fern, and P. Tadepalli, ‘‘Output space search for structured prediction,’’ in Proc. Int. Conf. Mach. Learn., 2012, pp. 1151–1158.
[29] J. R. Doppa, A. Fern, and P. Tadepalli, ‘‘HC-Search: A learning framework for search-based structured prediction,’’ J. Artif. Intell. Res., vol. 50, pp. 369–407, 2014.
[30] J. Duchi, E. Hazan, and Y. Singer, ‘‘Adaptive subgradient methods for online learning and stochastic optimization,’’ J. Mach. Learn. Res., vol. 12, pp. 2121–2159, 2011.
[31] J. Duchi, L. Mackey, and M. I. Jordan, ‘‘On the consistency of ranking algorithms,’’ in Proc. Int. Conf. Mach. Learn., 2010, pp. 327–334.
[32] M. Dudik, J. Langford, and L. Li, ‘‘Doubly robust policy evaluation and learning,’’ in Proc. Int. Conf. Mach. Learn., 2011, pp. 1097–1104.
[33] V. Guruswami and A. Sahai, ‘‘Multiclass learning, boosting, error-correcting codes,’’ in Proc. 12th Annu. Conf. Comput. Learn. Theory, 1999, pp. 145–155.
[34] S. Hanneke, ‘‘A bound on the label complexity of agnostic active learning,’’ in Proc. Int. Conf. Mach. Learn., 2007, pp. 353–360.
[35] D. Hsu, ‘‘Algorithms for active learning,’’ Ph.D. dissertation, Dept. Comput. Sci. Eng., Univ. California San Diego, La Jolla, CA, USA, 2010.
[36] D. Hsu, S. Kakade, J. Langford, and T. Zhang, ‘‘Multi-label prediction via compressed sensing,’’ in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 772–780.
[37] L. Huang, S. Fayong, and Y. Guo, ‘‘Structured perceptron with inexact search,’’ in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2012, pp. 142–151.
[38] G. James and T. Hastie, ‘‘The error coding method and PICTs,’’ J. Comput. Graph. Stat., vol. 7, no. 3, pp. 377–387, 1998.
[39] T. Joachims, T. Finley, and C.-N. Yu, ‘‘Cutting-plane training of structural SVMs,’’ Mach. Learn. J., vol. 77, pp. 27–59, 2009.
[40] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, ‘‘Efficient bandit algorithms for online multiclass prediction,’’ in Proc. Int. Conf. Mach. Learn., 2008, pp. 440–447.
[41] N. Karampatziakis and J. Langford, ‘‘Online importance weight aware updates,’’ in Proc. 27th Conf. Uncertainty Artif. Intell., 2011, pp. 392–399.
[42] T. Kudo, ‘‘CRF++ project,’’ 2005. [Online]. Available: http://crfpp.googlecode.com
[43] J. Lafferty, A. McCallum, and F. Pereira, ‘‘Conditional random fields: Probabilistic models for segmenting and labeling sequence data,’’ in Proc. Int. Conf. Mach. Learn., 2001, pp. 282–289.
[44] J. Langford, ‘‘Robust efficient conditional probability estimation,’’ in Proc. Conf. Learn. Theory, 2010, pp. 316–317.
[45] J. Langford and A. Beygelzimer, ‘‘Sensitive error correcting output codes,’’ in Proc. Conf. Learn. Theory, 2005, pp. 158–172.
[46] J. Langford, L. Li, and A. Strehl, ‘‘Vowpal Wabbit online learning project,’’ Tech. Rep., 2007. [Online]. Available: http://hunch.net/?p=309
[47] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[48] N. Littlestone and M. Warmuth, ‘‘Weighted majority algorithm,’’ in Proc. IEEE Symp. Found. Comput. Sci., 1989, pp. 256–261.
[49] J. Nocedal, ‘‘Updating quasi-Newton matrices with limited storage,’’ Math. Comput., vol. 35, pp. 773–782, 1980.
[50] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, ‘‘Boosting structured prediction for imitation learning,’’ in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 1153–1160.
[51] R. Malouf, ‘‘A comparison of algorithms for maximum entropy parameter estimation,’’ in Proc. CoNLL, 2002, DOI: 10.3115/1118853.1118871.
[52] R. McDonald, K. Crammer, and F. Pereira, ‘‘Large margin online learning algorithms for scalable structured classification,’’ in Proc. NIPS Workshop Learn. Structured Outputs, 2004.
[53] H. Ramaswamy, S. B. Balaji, S. Agarwal, and R. Williamson, ‘‘On the consistency of output code based learning algorithms for multiclass learning problems,’’ in Proc. 27th Annu. Conf. Learn. Theory, 2014, pp. 885–902.
[54] S. Ross, G. J. Gordon, and J. Andrew Bagnell, ‘‘A reduction of imitation learning and structured prediction to no-regret online learning,’’ in Proc. Workshop Artif. Intell. Stat., 2011, pp. 627–635.
[55] S. Ross, P. Mineiro, and J. Langford, ‘‘Normalized online learning,’’ in Proc. 29th Conf. Uncertainty Artif. Intell., 2013, pp. 537–545.
[56] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms. Cambridge, MA, USA: MIT Press, 2012.
[57] U. Syed and R. E. Schapire, ‘‘A reduction from apprenticeship learning to classification,’’ in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 2253–2261.
[58] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, ‘‘Support vector machine learning for interdependent and structured output spaces,’’ in Proc. Int. Conf. Mach. Learn., 2004, DOI: 10.1145/1015330.1015341.
[59] L. Valiant, ‘‘A theory of the learnable,’’ Commun. ACM, vol. 27, pp. 1134–1142, 1984.
[60] V. Vapnik and A. Chervonenkis, ‘‘On the uniform convergence of relative frequencies of events to their probabilities,’’ Theory Probab. Appl., vol. 16, no. 2, pp. 264–280, 1971.
[61] J. Weston, A. Makadia, and H. Yee, ‘‘Label partitioning for sublinear ranking,’’ in Proc. Int. Conf. Mach. Learn., 2013, pp. 181–189.
[62] B. Zhao and E. P. Xing, ‘‘Sparse output coding for large-scale visual recognition,’’ in Proc. Int. Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3350–3357.
[63] Y. Xu and A. Fern, ‘‘On learning linear ranking functions for beam search,’’ in Proc. Int. Conf. Mach. Learn., 2007, pp. 1047–1054.
[64] Y. Xu, A. Fern, and S. W. Yoon, ‘‘Discriminative learning of beam-search heuristics for planning,’’ in Proc. Int. Joint Conf. Artif. Intell., 2007, pp. 2041–2046.
[65] M. Reid and R. Williamson, ‘‘Surrogate regret bounds for proper losses,’’ in Proc. Int. Conf. Mach. Learn., 2009, pp. 897–904.
[66] B. Zadrozny, ‘‘Policy mining: Learning decision policies from fixed sets of data,’’ Ph.D. dissertation, Dept. Comput. Sci. Eng., Univ. California San Diego, La Jolla, CA, USA, 2003.
ABOUT T HE AUTHO RS
Alina Beygelzimer received the Ph.D. degree in
computer science from the University of Roche-
ster, Rochester, NY, USA, in 2003.
She is a Senior Research Scientist at Yahoo
Labs, New York City, NY, USA, working on scalable
machine learning. Prior to that, she was a Research
Staff Member at the IBM Thomas J. Watson
Research Center, Yorktown Heights, NY, USA.
Hal Daume, III received the B.S. degree in
mathematical sciences from Carnegie Mellon Uni-
versity, Pittsburgh, PA, USA and the Ph.D. degree
in computer science from the University of
Southern California, Los Angeles, CA, USA, with a
thesis on structured prediction for language.
He is an Associate Professor in Computer
Science at the University of Maryland, College
Park, MD, USA. He holds joint appointments in the
University of Maryland Institute for Advanced
Computer Studies (UMIACS) and Linguistics. He was previously an
Assistant Professor in the School of Computing, University of Utah, Salt
Lake City, UT, USA. His primary research interest is in developing new
learning algorithms for prototypical problems that arise in the context of
language processing and artificial intelligence.
John Langford received the B.S. degrees in
physics and computer science from the California
Institute of Technology, Pasadena, CA, USA, in
1997 and the Ph.D. degree in computer science
from Carnegie Mellon University, Pittsburgh, PA,
USA, in 2002.
He is a Principal Researcher at Microsoft
Research, New York City, NY, USA. He has worked
at Yahoo!, Toyota Technological Institute, and
IBM’s Watson Research Center. He is also the
primary author of the popular Machine Learning weblog, hunch.net, and
the principal developer of Vowpal Wabbit.
Dr. Langford was the Program Co-Chair for the 2012 International
Conference on Machine Learning (ICML) and is the General Chair for the
2016 ICML.
Paul Mineiro received an undergraduate degree
in physics from the California Institute of Tech-
nology, Pasadena, CA, USA and attended graduate
school at the Cognitive Science Department, Uni-
versity of California San Diego, La Jolla, CA, USA.
He is a Research Engineer in the Cloud and
Information Services Laboratory, Microsoft,
Bellevue, WA, USA. His interests include online
learning, extreme classification, and distributed
machine learning.