
OptNet: Differentiable Optimization as a Layer in Neural Networks

Brandon Amos 1 J. Zico Kolter 1

Abstract

This paper presents OptNet, a network architecture that integrates optimization problems (here, specifically in the form of quadratic programs) as individual layers in larger end-to-end trainable deep networks. These layers encode constraints and complex dependencies between the hidden states that traditional convolutional and fully-connected layers often cannot capture. In this paper, we explore the foundations for such an architecture: we show how techniques from sensitivity analysis, bilevel optimization, and implicit differentiation can be used to exactly differentiate through these layers and with respect to layer parameters; we develop a highly efficient solver for these layers that exploits fast GPU-based batch solves within a primal-dual interior point method, and which provides backpropagation gradients with virtually no additional cost on top of the solve; and we highlight the application of these approaches in several problems. In one notable example, we show that the method is capable of learning to play mini-Sudoku (4x4) given just input and output games, with no a priori information about the rules of the game; this highlights the ability of our architecture to learn hard constraints better than other neural architectures.

1. Introduction

In this paper, we consider how to treat exact, constrained optimization as an individual layer within a deep learning architecture. Unlike traditional feedforward networks, where the output of each layer is a relatively simple (though non-linear) function of the previous layer, our optimization framework allows for individual layers to capture much richer behavior, expressing complex operations that in total can reduce the overall depth of the network while preserving richness of representation. Specifically, we build a framework where the output of the (i+1)th layer in a network is the solution to a constrained optimization problem based upon previous layers. This framework naturally encompasses a wide variety of inference problems expressed within a neural network, allowing for the potential of much richer end-to-end training for complex tasks that require such inference procedures.

1 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Brandon Amos <[email protected]>, J. Zico Kolter <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Concretely, in this paper we specifically consider the task of solving small quadratic programs as individual layers. These optimization problems are well-suited to capturing interesting behavior and can be efficiently solved with GPUs. Specifically, we consider layers of the form

$$
\begin{aligned}
z_{i+1} = \operatorname*{argmin}_{z}\;\; & \tfrac{1}{2} z^T Q(z_i)\, z + q(z_i)^T z \\
\text{subject to}\;\; & A(z_i)\, z = b(z_i) \\
& G(z_i)\, z \le h(z_i)
\end{aligned}
\qquad (1)
$$

where $z$ is the optimization variable and $Q(z_i)$, $q(z_i)$, $A(z_i)$, $b(z_i)$, $G(z_i)$, and $h(z_i)$ are parameters of the optimization problem. As the notation suggests, these parameters can depend in any differentiable way on the previous layer $z_i$, and can eventually be optimized just like any other weights in a neural network. These layers can be learned by taking the gradients of some loss function with respect to the parameters. In this paper, we derive the gradients of (1) by taking matrix differentials of the KKT conditions of the optimization problem at its solution.

In order to make the approach practical for larger networks, we develop a custom solver which can simultaneously solve multiple small QPs in batch form. We do so by developing a custom primal-dual interior point method tailored specifically to dense batch operations on a GPU. In total, the solver can solve batches of quadratic programs over 100 times faster than existing highly tuned quadratic programming solvers such as Gurobi and CPLEX. One crucial algorithmic insight in the solver is that by using a specific factorization of the primal-dual interior point update, we can obtain a backward pass over the optimization layer virtually "for free" (i.e., requiring no additional factorization once the optimization problem itself has been solved). Together, these innovations enable parameterized optimization problems to be inserted within the architecture of existing deep networks.



We begin by highlighting background and related work, and then present our optimization layer itself. Using matrix differentials we derive rules for computing all the necessary backpropagation updates. We then detail our specific solver for these quadratic programs, based upon a state-of-the-art primal-dual interior point method, and highlight the novel elements as they apply to our formulation, such as the aforementioned fact that we can compute backpropagation at very little additional cost. We then provide experimental results that demonstrate the capabilities of the architecture, highlighting potential tasks that these architectures can solve, and illustrating improvements upon existing approaches.

2. Background and related work

Optimization plays a key role in modeling complex phenomena and providing concrete decision making processes in sophisticated environments. A full treatment of optimization applications is beyond our scope (Boyd & Vandenberghe, 2004), but these methods have found applicability in control frameworks (Morari & Lee, 1999; Sastry & Bodson, 2011); numerous statistical and mathematical formalisms (Sra et al., 2012); and physical simulation problems like rigid body dynamics (Lotstedt, 1984). Generally speaking, our work is a step towards learning the optimization problems behind real-world processes from data, so that they can be learned end-to-end rather than requiring human specification and intervention.

In the machine learning setting, a wide array of applications consider optimization as a means to perform inference in learning. Among many other applications, these architectures are well-studied for generic classification and structured prediction tasks (Goodfellow et al., 2013; Stoyanov et al., 2011; Brakel et al., 2013; LeCun et al., 2006; Belanger & McCallum, 2016; Belanger et al., 2017; Amos et al., 2017); in vision for tasks such as denoising (Tappen et al., 2007; Schmidt & Roth, 2014); and Metz et al. (2016) use unrolled optimization within a network to stabilize the convergence of generative adversarial networks (Goodfellow et al., 2014). Indeed, the general idea of solving restricted classes of optimization problems using neural networks goes back many decades (Kennedy & Chua, 1988; Lillo et al., 1993), but has seen a number of advances in recent years. These models are often trained by one of the following four methods.

Energy-based learning methods These methods can be used for tasks like (structured) prediction, where the training method shapes the energy function to be low around the observed data manifold and high elsewhere (LeCun et al., 2006). In recent years, there has been a strong push to further incorporate structured prediction methods like

conditional random fields as the "last layer" of a deep network architecture (Peng et al., 2009; Zheng et al., 2015; Chen et al., 2015) as well as in deeper energy-based architectures (Belanger & McCallum, 2016; Belanger et al., 2017; Amos et al., 2017). Learning in this context requires observed data, which isn't present in some of the contexts we consider in this paper, and may also suffer from instability issues when combined with deep energy-based architectures, as observed in Belanger & McCallum (2016); Belanger et al. (2017); Amos et al. (2017).

Analytically If an analytic solution to the argmin can be found, such as in an unconstrained quadratic minimization, the gradients can often also be computed analytically. This is done in Tappen et al. (2007); Schmidt & Roth (2014). We cannot use these methods for the constrained optimization problems we consider in this paper because there are no known analytic solutions.

Unrolling The argmin operation over an unconstrained objective can be approximated by a first-order gradient-based method and unrolled. These architectures typically introduce an optimization procedure such as gradient descent into the inference procedure. This is done in Domke (2012); Amos et al. (2017); Belanger et al. (2017); Metz et al. (2016); Goodfellow et al. (2013); Stoyanov et al. (2011); Brakel et al. (2013). The optimization procedure is unrolled automatically or manually (Domke, 2012) to obtain derivatives during training that incorporate the effects of these in-the-loop optimization procedures. However, unrolling the computation of a method like gradient descent typically requires a substantially larger network, and adds substantially to the computational complexity of the network.

In all of these existing cases, the optimization problem is unconstrained and unrolling gradient descent is often easy to do. When constraints are added to the optimization problem, iterative algorithms often use a projection operator that may be difficult to unroll through. In this paper, we do not unroll an optimization procedure but instead use argmin differentiation as described in the next section.

Argmin differentiation Most closely related to our own work, there have been several papers that propose some form of differentiation through argmin operators. These techniques also come up in bilevel optimization (Gould et al., 2016; Kunisch & Pock, 2013) and sensitivity analysis (Bertsekas, 1999; Fiacco & Ishizuka, 1990; Bonnans & Shapiro, 2013). In the case of Gould et al. (2016), the authors describe general techniques for differentiation through optimization problems, but only describe the case of exact equality constraints rather than both equality and inequality constraints (in the case of inequality constraints, they add these via a barrier function).


Amos et al. (2017) considers argmin differentiation within the context of a specific optimization problem (the bundle method) but does not consider a general setting. Johnson et al. (2016) performs implicit differentiation on (multi-)convex objectives with coordinate subspace constraints, but does not consider inequality constraints and does not consider in detail general linear equality constraints; their optimization problem appears only in the final layer of a variational inference network, while we propose to insert optimization problems anywhere in the network. A special case of OptNet layers (with no inequality constraints) has a natural interpretation in terms of Gaussian inference, and so Gaussian graphical models (or CRF ideas more generally) provide tools for making the computation more efficient and interpreting or constraining its structure. Similarly, the older work of Mairal et al. (2012) considered argmin differentiation for a LASSO problem, deriving specific rules for this case, and presenting an efficient algorithm based upon our ability to solve the LASSO problem efficiently.

In this paper, we use implicit differentiation (Dontchev & Rockafellar, 2009; Griewank & Walther, 2008) and techniques from matrix differential calculus (Magnus & Neudecker, 1988) to derive the gradients from the KKT matrix of the problem we are interested in. A notable difference from other work within ML that we are aware of is that we analytically differentiate through inequality as well as equality constraints, by differentiating the complementarity conditions; this differs from, e.g., Gould et al. (2016), where they instead approximately convert the problem to an unconstrained one via a barrier method. We have also developed methods to make this approach practical and reasonably scalable within the context of deep architectures.

3. OptNet: solving optimization within a neural network

Although in the most general form an OptNet layer can be any optimization problem, in this paper we will study OptNet layers defined by a quadratic program

$$
\begin{aligned}
\operatorname*{minimize}_{z}\;\; & \tfrac{1}{2} z^T Q z + q^T z \\
\text{subject to}\;\; & A z = b, \;\; G z \le h
\end{aligned}
\qquad (2)
$$

where $z \in \mathbb{R}^n$ is our optimization variable, $Q \in \mathbb{R}^{n \times n} \succeq 0$ (a positive semidefinite matrix), $q \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $G \in \mathbb{R}^{p \times n}$, and $h \in \mathbb{R}^p$ are problem data; for notational convenience we leave out the dependence on the previous layer $z_i$ shown in (1). As is well known, these problems can be solved in polynomial time using a variety of methods; if one desires exact (to numerical precision) solutions to these problems, then primal-dual interior point methods, as we will use in a later section, are the current state of the art in solution methods.

In the neural network setting, the optimal solution (or more generally, a subset of the optimal solution) of this optimization problem becomes the output of our layer, denoted $z_{i+1}$, and any of the problem data $Q, q, A, b, G, h$ can depend on the value of the previous layer $z_i$. The forward pass in our OptNet architecture thus involves simply setting up and finding the solution to this optimization problem.
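As a minimal sketch of this forward pass, the following uses cvxpy as a stand-in, generic solver (an assumption for illustration only; the actual implementation is the custom batched GPU interior point method described in Section 3.1), with a small made-up problem instance:

```python
import cvxpy as cp
import numpy as np

def optnet_forward(Q, q, A, b, G, h):
    """Forward pass of the QP layer (2): return the argmin (cvxpy stand-in)."""
    n = Q.shape[0]
    z = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.quad_form(z, Q) + q @ z)
    constraints = [A @ z == b, G @ z <= h]
    cp.Problem(objective, constraints).solve()
    return z.value

# Tiny illustrative example: project the point (2, 2) onto {z : z1 + z2 = 1, z >= 0}.
Q = np.eye(2); q = -np.array([2., 2.])
A = np.ones((1, 2)); b = np.array([1.])
G = -np.eye(2); h = np.zeros(2)
print(optnet_forward(Q, q, A, b, G, h))   # approximately [0.5, 0.5]
```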

Training deep architectures, however, requires that we not just have a forward pass in our network but also a backward pass. This requires that we compute the derivative of the solution to the QP with respect to its input parameters, a general topic we discussed in the previous section. To obtain these derivatives, we differentiate the KKT conditions (sufficient and necessary conditions for optimality) of (2) at a solution to the problem, using techniques from matrix differential calculus (Magnus & Neudecker, 1988). Our analysis here can be extended to more general convex optimization problems.

The Lagrangian of (2) is given by

$$
L(z, \nu, \lambda) = \tfrac{1}{2} z^T Q z + q^T z + \nu^T (A z - b) + \lambda^T (G z - h)
\qquad (3)
$$

where $\nu$ are the dual variables on the equality constraints and $\lambda \ge 0$ are the dual variables on the inequality constraints. The KKT conditions for stationarity, primal feasibility, and complementary slackness are

$$
\begin{aligned}
Q z^\star + q + A^T \nu^\star + G^T \lambda^\star &= 0 \\
A z^\star - b &= 0 \\
D(\lambda^\star)(G z^\star - h) &= 0,
\end{aligned}
\qquad (4)
$$

where $D(\cdot)$ creates a diagonal matrix from a vector and $z^\star$, $\nu^\star$, and $\lambda^\star$ are the optimal primal and dual variables. Taking the differentials of these conditions gives the equations

$$
\begin{aligned}
dQ\, z^\star + Q\, dz + dq + dA^T \nu^\star + A^T d\nu + dG^T \lambda^\star + G^T d\lambda &= 0 \\
dA\, z^\star + A\, dz - db &= 0 \\
D(G z^\star - h)\, d\lambda + D(\lambda^\star)(dG\, z^\star + G\, dz - dh) &= 0
\end{aligned}
\qquad (5)
$$

or, written more compactly in matrix form,

$$
\begin{bmatrix}
Q & G^T & A^T \\
D(\lambda^\star) G & D(G z^\star - h) & 0 \\
A & 0 & 0
\end{bmatrix}
\begin{bmatrix} dz \\ d\lambda \\ d\nu \end{bmatrix}
= -
\begin{bmatrix}
dQ\, z^\star + dq + dG^T \lambda^\star + dA^T \nu^\star \\
D(\lambda^\star)\, dG\, z^\star - D(\lambda^\star)\, dh \\
dA\, z^\star - db
\end{bmatrix}.
\qquad (6)
$$

Using these equations, we can form the Jacobians of $z^\star$ (or of $\lambda^\star$ and $\nu^\star$, though we don't consider this case here) with respect to any of the data parameters. For example, if we wished to compute the Jacobian $\frac{\partial z^\star}{\partial b} \in \mathbb{R}^{n \times m}$, we would


simply substitute $db = I$ (and set all other differential terms in the right hand side to zero), solve the equation, and the resulting value of $dz$ would be the desired Jacobian.

In the backpropagation algorithm, however, we never want to explicitly form the actual Jacobian matrices, but rather want to form the left matrix-vector product with some previous backward pass vector $\frac{\partial \ell}{\partial z^\star} \in \mathbb{R}^n$, i.e., $\frac{\partial \ell}{\partial z^\star} \frac{\partial z^\star}{\partial b}$. We can do this efficiently by noting that the solution for $(dz, d\lambda, d\nu)$ involves multiplying the inverse of the left-hand-side matrix in (6) by some right hand side. Thus, if we multiply the backward pass vector by the transpose of the differential matrix,

$$
\begin{bmatrix} d_z \\ d_\lambda \\ d_\nu \end{bmatrix}
= -
\begin{bmatrix}
Q & G^T D(\lambda^\star) & A^T \\
G & D(G z^\star - h) & 0 \\
A & 0 & 0
\end{bmatrix}^{-1}
\begin{bmatrix} \left(\frac{\partial \ell}{\partial z^\star}\right)^T \\ 0 \\ 0 \end{bmatrix}
\qquad (7)
$$

then the relevant gradients with respect to all the QP parameters can be given by

$$
\begin{aligned}
\nabla_Q \ell &= \tfrac{1}{2}\left(d_z {z^\star}^T + z^\star d_z^T\right) & \nabla_q \ell &= d_z \\
\nabla_A \ell &= d_\nu {z^\star}^T + \nu^\star d_z^T & \nabla_b \ell &= -d_\nu \\
\nabla_G \ell &= D(\lambda^\star)\left(d_\lambda {z^\star}^T + \lambda^\star d_z^T\right) & \nabla_h \ell &= -D(\lambda^\star)\, d_\lambda
\end{aligned}
\qquad (8)
$$

where, as in standard backpropagation, all these terms are at most the size of the parameter matrices. We note that some of these parameters may depend on the previous layer $z_i$, and the gradients with respect to the previous layer can then be obtained through the chain rule. As we will see in the next section, the solution to an interior point method in fact already provides a factorization we can use to compute these gradients efficiently.
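To make the backward pass concrete, the following dense NumPy sketch forms the transposed system (7) for a tiny QP whose solution is known in closed form and reads off $\nabla_b \ell = -d_\nu$ as in (8). The problem data and loss gradient are made up for illustration; this is an unbatched sketch of the derivation, not the batched solver of Section 3.1.

```python
import numpy as np

# Tiny QP with a known solution (illustrative data):
# minimize 1/2 z^T z   s.t.  z2 = 0  (A z = b),  z1 >= 1  (G z <= h)
Q = np.eye(2); q = np.zeros(2)
A = np.array([[0., 1.]]); b = np.array([0.])
G = np.array([[-1., 0.]]); h = np.array([-1.])
z_star   = np.array([1., 0.])   # optimal primal variable
lam_star = np.array([1.])       # optimal inequality dual
nu_star  = np.array([0.])       # optimal equality dual

def grad_b(dl_dz):
    """Backward pass for grad_b l via the transposed KKT system (7) and (8)."""
    n, m, p = Q.shape[0], A.shape[0], G.shape[0]
    K = np.zeros((n + p + m, n + p + m))
    K[:n, :n] = Q
    K[:n, n:n+p] = G.T * lam_star        # G^T D(lambda*)
    K[:n, n+p:] = A.T
    K[n:n+p, :n] = G
    K[n:n+p, n:n+p] = np.diag(G @ z_star - h)
    K[n+p:, :n] = A
    rhs = np.concatenate([dl_dz, np.zeros(p + m)])
    d = np.linalg.solve(K, -rhs)         # (d_z, d_lambda, d_nu) of (7)
    return -d[n+p:]                      # grad_b l = -d_nu, as in (8)

print(grad_b(np.array([1., 1.])))        # [1.]: only z2 = b depends on b
```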

3.1. An efficient batched QP solver

Deep networks are typically trained in mini-batches to take advantage of efficient data-parallel GPU operations. Without mini-batching on the GPU, many modern deep learning architectures become intractable for all practical purposes. However, today's state-of-the-art QP solvers like Gurobi and CPLEX do not have the capability of solving multiple optimization problems on the GPU in parallel across the entire minibatch. This quickly makes larger OptNet layers intractable compared to a fully-connected layer with the same number of parameters.

To overcome this performance bottleneck in our quadratic program layers, we have implemented a GPU-based primal-dual interior point method (PDIPM) based on Mattingley & Boyd (2012) that solves a batch of quadratic programs, and which provides the necessary gradients needed to train these in an end-to-end fashion. Our performance experiments in Section 4.1 show that our solver is significantly faster than the standard non-batch solvers Gurobi and CPLEX.

Following the method of Mattingley & Boyd (2012), our solver introduces slack variables on the inequality constraints and iteratively minimizes the residuals from the KKT conditions over the primal variable $z \in \mathbb{R}^n$, slack variable $s \in \mathbb{R}^p$, and dual variables $\nu \in \mathbb{R}^m$ associated with the equality constraints and $\lambda \in \mathbb{R}^p$ associated with the inequality constraints. Each iteration computes the affine scaling directions by solving

$$
K \begin{bmatrix} \Delta z^{\mathrm{aff}} \\ \Delta s^{\mathrm{aff}} \\ \Delta \lambda^{\mathrm{aff}} \\ \Delta \nu^{\mathrm{aff}} \end{bmatrix}
= \begin{bmatrix}
-(A^T \nu + G^T \lambda + Q z + q) \\
-S\lambda \\
-(G z + s - h) \\
-(A z - b)
\end{bmatrix}
\qquad (9)
$$

where

$$
K = \begin{bmatrix}
Q & 0 & G^T & A^T \\
0 & D(\lambda) & D(s) & 0 \\
G & I & 0 & 0 \\
A & 0 & 0 & 0
\end{bmatrix},
$$

then centering-plus-corrector directions by solving

$$
K \begin{bmatrix} \Delta z^{\mathrm{cc}} \\ \Delta s^{\mathrm{cc}} \\ \Delta \lambda^{\mathrm{cc}} \\ \Delta \nu^{\mathrm{cc}} \end{bmatrix}
= \begin{bmatrix}
0 \\
\sigma \mu \mathbf{1} - D(\Delta s^{\mathrm{aff}}) \Delta \lambda^{\mathrm{aff}} \\
0 \\
0
\end{bmatrix},
\qquad (10)
$$

where $\mu = s^T \lambda / p$ is the duality gap and $\sigma$ is defined in Mattingley & Boyd (2012). Each variable $v$ is updated with $\Delta v = \Delta v^{\mathrm{aff}} + \Delta v^{\mathrm{cc}}$ using an appropriate step size. We actually solve a symmetrized version of the KKT conditions, obtained by scaling the second row block by $D(1/s)$. We analytically decompose these systems into smaller symmetric systems and pre-factorize portions of them that don't change (i.e., that don't involve $D(\lambda/s)$ between iterations). We have implemented a batched version of this method with the PyTorch library² and have released it as an open source library at https://github.com/locuslab/qpth. It uses a custom CUBLAS extension that provides an interface to solve multiple matrix factorizations and solves in parallel, and which provides the necessary backpropagation gradients for their use in an end-to-end learning system.
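For exposition, the following dense, unbatched NumPy sketch assembles $K$ and solves the affine scaling system (9) at a given iterate; the released qpth solver instead factorizes a symmetrized, reduced system in batch on the GPU, so this is only a reference illustration of the linear algebra involved.

```python
import numpy as np

def affine_scaling_direction(Q, q, A, b, G, h, z, s, lam, nu):
    """Solve the affine scaling system (9) at the current iterate (z, s, lam, nu)."""
    n, m, p = Q.shape[0], A.shape[0], G.shape[0]
    K = np.zeros((n + 2 * p + m, n + 2 * p + m))
    K[:n, :n] = Q
    K[:n, n+p:n+2*p] = G.T
    K[:n, n+2*p:] = A.T
    K[n:n+p, n:n+p] = np.diag(lam)       # D(lambda)
    K[n:n+p, n+p:n+2*p] = np.diag(s)     # D(s)
    K[n+p:n+2*p, :n] = G
    K[n+p:n+2*p, n:n+p] = np.eye(p)
    K[n+2*p:, :n] = A
    rhs = -np.concatenate([
        A.T @ nu + G.T @ lam + Q @ z + q,   # stationarity residual
        s * lam,                            # complementarity residual
        G @ z + s - h,                      # inequality (primal) residual
        A @ z - b,                          # equality (primal) residual
    ])
    d = np.linalg.solve(K, rhs)
    return d[:n], d[n:n+p], d[n+p:n+2*p], d[n+2*p:]   # (dz, ds, dlam, dnu) aff
```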

3.1.1. EFFICIENTLY COMPUTING GRADIENTS

A key point of the particular form of primal-dual interior point method that we employ is that it is possible to compute the backward pass gradients "for free" after solving the original QP, without an additional matrix factorization or solve. Specifically, at each iteration of the primal-dual interior point method we compute an LU decomposition of the matrix $K_{\mathrm{sym}}$.³ This matrix is essentially a symmetrized version of the matrix needed for computing the backpropagated gradients, and we can similarly compute the $d_z$, $d_\lambda$, $d_\nu$ terms by solving the linear system

$$
K_{\mathrm{sym}}
\begin{bmatrix} d_z \\ d_s \\ \tilde{d}_\lambda \\ d_\nu \end{bmatrix}
=
\begin{bmatrix} \left(-\frac{\partial \ell}{\partial z_{i+1}}\right)^T \\ 0 \\ 0 \\ 0 \end{bmatrix},
\qquad (11)
$$

where $\tilde{d}_\lambda = D(\lambda^\star)\, d_\lambda$ for $d_\lambda$ as defined in (7). Thus, all the backward pass gradients can be computed using the factored KKT matrix at the solution. Crucially, because the bottleneck of solving this linear system is computing the factorization of the KKT matrix (cubic time, as opposed to the quadratic time for solving via backsubstitution once the factorization is computed), the additional time required for computing all the necessary gradients in the backward pass is virtually nonexistent compared with the time of computing the solution. To the best of our knowledge, this is the first time that this fact has been exploited in the context of learning end-to-end systems.

²https://pytorch.org

³We actually perform an LU decomposition of a certain subset of the matrix formed by eliminating variables to create only a $p \times p$ matrix (the number of inequality constraints) that needs to be factored during each iteration of the primal-dual algorithm, and one $m \times m$ and one $n \times n$ matrix once at the start of the primal-dual algorithm, though we omit the details here. We also use an LU decomposition as this routine is provided in batch form by CUBLAS, but could potentially use a (faster) Cholesky factorization if and when the appropriate functionality is added to CUBLAS.
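The "for free" claim is just the usual factor-once, back-substitute-many-times property of direct solvers. A small SciPy illustration with a random stand-in matrix (not the actual $K_{\mathrm{sym}}$, which is assembled from the QP data):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
K_sym = rng.standard_normal((6, 6))
K_sym = K_sym + K_sym.T                 # stand-in for the symmetrized KKT matrix

lu, piv = lu_factor(K_sym)              # cubic-time factorization, paid once
                                        # during the forward solve
rhs_fwd = rng.standard_normal(6)        # forward-pass right hand side
rhs_bwd = rng.standard_normal(6)        # backward pass: (-(dl/dz_{i+1})^T, 0, 0, 0)
x_fwd = lu_solve((lu, piv), rhs_fwd)    # back-substitution only, quadratic time
x_bwd = lu_solve((lu, piv), rhs_bwd)    # backward pass reuses the same factorization
```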

3.2. Properties and representational power

In this section we briefly highlight some of the mathematical properties of OptNet layers. The proofs here are straightforward, and are mostly based upon well-known results in convex analysis, so are deferred to the appendix. The first result simply highlights that (because the solution of a strictly convex QP is continuous) OptNet layers are subdifferentiable everywhere, and differentiable at all but a measure-zero set of points.

Theorem 1. Let $z^\star(\theta)$ be the output of an OptNet layer, where $\theta = \{Q, q, A, b, G, h\}$. Assuming $Q \succ 0$ and that $A$ has full row rank, then $z^\star(\theta)$ is subdifferentiable everywhere: $\partial z^\star(\theta) \neq \emptyset$, where $\partial z^\star(\theta)$ denotes the Clarke generalized subdifferential (Clarke, 1975) (an extension of the subgradient to non-convex functions), and has a single unique element (the Jacobian) for all but a measure zero set of points $\theta$.

The next two results show the representational power of the OptNet layer, specifically how an OptNet layer compares to the common linear layer followed by a ReLU activation. The first theorem shows that an OptNet layer can approximate arbitrary elementwise piecewise-linear functions, and so among other things can represent a ReLU layer.


Theorem 2. Let $f : \mathbb{R}^n \to \mathbb{R}^n$ be an elementwise piecewise linear function with $k$ linear regions. Then the function can be represented as an OptNet layer using $O(nk)$ parameters. Additionally, the layer $z_{i+1} = \max\{W z_i + b, 0\}$ for $W \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^n$ can be represented by an OptNet layer with $O(mn)$ parameters.

Finally, we show that the converse does not hold: there are functions representable by an OptNet layer which cannot be represented exactly by a two-layer ReLU network (even though such networks are universal function approximators), and which can take exponentially many units to approximate. A simple example of such a layer (and one which we use in the proof) is just the max over three linear functions $f(z) = \max\{a_1^T z, a_2^T z, a_3^T z\}$.

Theorem 3. Let $f(z) : \mathbb{R}^n \to \mathbb{R}$ be a scalar-valued function specified by an OptNet layer with $p$ parameters. Conversely, let $f'(z) = \sum_{i=1}^m w_i \max\{a_i^T z + b_i, 0\}$ be the output of a two-layer ReLU network. Then there exist functions that the ReLU network cannot represent exactly over all of $\mathbb{R}^n$, and which require $O(c^p)$ parameters to approximate over a finite region.

3.3. Limitations of the method

Although, as we will show shortly, the OptNet layer has several strong points, we also want to highlight the potential drawbacks of this approach. First, although integrating an OptNet layer into existing deep learning architectures is potentially practical with an efficient batch solver, we note that solving optimization problems exactly as we do here has cubic complexity in the number of variables and/or constraints. This contrasts with the quadratic complexity of standard feedforward layers. This means that we are ultimately limited to settings where the number of hidden variables in an OptNet layer is not too large (less than 1000 dimensions seems to be the limit of what we currently find to be practical, and substantially less if one wants real-time results for an architecture).

Secondly, there are many improvements to the OptNet layers that are still possible. Our QP solver, for instance, uses fully dense matrix operations, which makes the solves very efficient for GPU solutions, and which also makes sense for our general setting where the coefficients of the quadratic problem can be learned. However, for many real-world optimization problems (and hence for architectures that wish to more closely mimic some real-world optimization problem), there is often substantial structure (e.g., sparsity) in the data matrices that can be exploited for efficiency. There is of course no prohibition against incorporating sparse matrix methods into the fast custom solver, but doing so would require substantial added complexity, especially regarding efforts like finding minimum fill orderings for different sparsity patterns of the KKT systems. In our open source solver qpth, we have started experimenting


with cuSOLVER's batched sparse QR factorizations and solves.

Lastly, we note that while the OptNet layers can be trained just as any neural network layer, since they are a new creation and since they have manifolds in the parameter space which have no effect on the resulting solution (e.g., scaling the rows of a constraint matrix and its right hand side does not change the optimization problem), there is admittedly more tuning required to get these to work. This situation is common when developing new neural network architectures and has also been reported in the similar architecture of Schmidt & Roth (2014). Our hope is that techniques for overcoming some of the challenges in learning these layers will continue to be developed in future work.

4. Experimental results

In this section, we present several experimental results that highlight the capabilities of the QP OptNet layer. Specifically we look at 1) computational efficiency over existing solvers; 2) the ability to improve upon existing convex problems such as those used in signal denoising; 3) integrating the architecture into generic deep learning architectures; and 4) performance of our approach on a problem that is challenging for current approaches. In particular, we want to emphasize the results of our system on learning the game of (4x4) mini-Sudoku, a well-known logical puzzle; our layer is able to directly learn the necessary constraints using just gradient information and no a priori knowledge of the rules of Sudoku. The code and data for our experiments are open sourced in the icml2017 branch of https://github.com/locuslab/optnet and our batched QP solver is available as a library at https://github.com/locuslab/qpth.

4.1. Batch QP solver performance

All of the OptNet performance results in this section are run on an unloaded Titan X GPU. Gurobi is run on an unloaded quad-core Intel Core i7-5960X CPU @ 3.00GHz.

Our OptNet layers are much more computationally expensive than a linear or convolutional layer, and a natural question is to ask what the performance difference is. We set up an experiment comparing a linear layer to a QP OptNet layer with a mini-batch size of 128 on CUDA with randomly generated input vectors sized 10, 50, 100, and 500. Each layer maps this input to an output of the same dimension; the linear layer does this with a batched matrix-vector multiplication and the OptNet layer does this by taking the argmin of a random QP that has the same number of inequality constraints as the dimensionality of the problem. Figure 1 shows the profiling results (averaged over 10 trials) of the forward and backward passes. The OptNet layer is significantly slower than the linear layer as expected, yet

Figure 1. Performance of a linear layer and a QP layer (batch size 128): forward and backward runtime versus the number of variables (and inequality constraints).

Figure 2. Performance of Gurobi and our QP solver (serial and batched): runtime versus batch size.

still tractable in many practical contexts.

Our next experiment illustrates why standard baseline QP solvers like CPLEX and Gurobi without batch support are too computationally expensive for QP OptNet layers to be tractable. We set up random QPs of the form (1) that have 100 variables and 100 inequality constraints in Gurobi and in the serialized and batched versions of our solver qpth, and vary the batch size.⁴

Figure 2 shows the means and standard deviations of running each trial 10 times, showing that our batched solver outperforms Gurobi, itself a highly tuned solver, for reasonable batch sizes. For the minibatch size of 128, we solve all problems in an average of 0.18 seconds, whereas Gurobi takes an average of 4.7 seconds. In the context of training a deep architecture, this type of speed difference for a single minibatch can make the difference between a practical and a completely unusable solution.

4.2. Total variation denoising

Our next experiment studies how we can use the OptNet architecture to improve upon signal processing techniques that currently use convex optimization as a basis. Specifically, our goal in this case is to denoise a noisy 1D signal

⁴ Experimental details: we sample entries of a matrix $U$ from a random uniform distribution and set $Q = U^T U + 10^{-3} I$, sample $G$ with random normal entries, and set $h$ by generating some random normal $z_0$ and random uniform $s_0$ and setting $h = G z_0 + s_0$ (we didn't include equality constraints just for simplicity, and since the number of inequality constraints is the primary driver of complexity for the iterations in a primal-dual interior point method). The choice of $h$ guarantees the problem is feasible.
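A sketch of this instance generator follows; the linear term $q$ is not specified in the footnote, so sampling it from a standard normal is an assumption made here for completeness.

```python
import numpy as np

def random_feasible_qp(n, nineq, seed=0):
    """Random feasible QP instance, roughly as described in footnote 4."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=(n, n))
    Q = U.T @ U + 1e-3 * np.eye(n)       # positive definite by construction
    q = rng.standard_normal(n)           # assumption: q not specified in the footnote
    G = rng.standard_normal((nineq, n))
    z0 = rng.standard_normal(n)
    s0 = rng.uniform(size=nineq)
    h = G @ z0 + s0                      # z0 is strictly feasible, so the QP is feasible
    return Q, q, G, h

Q, q, G, h = random_feasible_qp(n=100, nineq=100)
```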


given training data consisting of noisy and clean signals generated from the same distribution. Such problems are often addressed by convex optimization procedures, and (1D) total variation denoising is a particularly common and simple approach. Specifically, the total variation denoising approach attempts to smooth some noisy observed signal $y$ by solving the optimization problem

$$
\operatorname*{argmin}_{z} \; \tfrac{1}{2} \|y - z\|_2^2 + \lambda \|D z\|_1
\qquad (12)
$$

where $D$ is the first-order differencing operator, which can be expressed in matrix form by a matrix with rows $D_i = e_i - e_{i+1}$. Penalizing the $\ell_1$ norm of the signal difference encourages this difference to be sparse, i.e., the number of changepoints of the signal is small, and we end up approximating $y$ by a (roughly) piecewise constant function.
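As a concrete reference, here is a sketch of (12) on synthetic data of the kind used below, with cvxpy as a stand-in solver and illustrative signal sizes and noise levels:

```python
import cvxpy as cp
import numpy as np

def tv_denoise(y, lam=13.0):
    """1D total variation denoising (12); lam ~= 13 was the best baseline value."""
    n = len(y)
    D = np.eye(n)[:-1] - np.eye(n)[1:]          # rows D_i = e_i - e_{i+1}
    z = cp.Variable(n)
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - z)
                           + lam * cp.norm1(D @ z))).solve()
    return z.value

# Piecewise constant signal corrupted with Gaussian noise (illustrative sizes).
rng = np.random.default_rng(0)
y = np.repeat([0., 3., 1.], 30) + rng.normal(scale=0.5, size=90)
z_hat = tv_denoise(y)
```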

To test this approach and competing ones on a denoising task, we generate piecewise constant signals (which are the desired outputs of the learning algorithm) and corrupt them with independent Gaussian noise (which forms the inputs to the learning algorithm). Table 1 shows the error rates of these four approaches.

4.2.1. BASELINE: TOTAL VARIATION DENOISING

To establish a baseline for denoising performance with total variation, we solve the above optimization problem for values of $\lambda$ between 0 and 100. The procedure performs best with a choice of $\lambda \approx 13$, and achieves a minimum test MSE on our task of about 16.5 (the units here are unimportant; the only relevant quantity is the relative performance of the different algorithms).

4.2.2. BASELINE: LEARNING WITH A FULLY-CONNECTED NEURAL NETWORK

An alternative approach to denoising is to learn from data. A function $f_\theta(x)$ parameterized by $\theta$ can be used to predict the original signal. The optimal $\theta$ can be learned by minimizing the mean squared error between the true and predicted signals. Denoising is typically a difficult function to learn, and Table 1 shows that a fully-connected neural network performs substantially worse on this denoising task than the convex optimization problem. Section B shows the convergence of the fully-connected network.

4.2.3. LEARNING THE DIFFERENCING OPERATOR WITH OPTNET

Between the feedforward neural network approach and the convex total variation optimization, we can instead use a generic OptNet layer that effectively allows us to solve (12) using any denoising matrix, which we randomly initialize. While the accuracy here is substantially lower than even the fully connected case, this is largely the result of

Figure 3. Initial and learned difference operators for denoising.

Method            Train MSE   Test MSE
FC Net                 18.5       29.8
Pure OptNet            52.9       53.3
Total Variation        16.3       16.5
OptNet Tuned TV        13.8       14.4

Table 1. Denoising task error rates.

learning an over-regularized solution to $D$. This is indeed a point that should be addressed in future work (we refer back to our comments in the previous section on the potential challenges of training these layers), but the point we want to highlight here is that the OptNet layer seems to be learning something very interpretable and understandable. Specifically, Figure 3 shows the $D$ matrix of our solution before and after learning (we permute the rows to make them ordered by the magnitude of where the large-absolute-value entries occur). What is interesting in this picture is that the learned $D$ matrix typically captures exactly the same intuition as the $D$ matrix used by total variation denoising: a mainly sparse matrix with a few entries of alternating sign next to each other. This implies that for the data set we have, total variation denoising is indeed the "right" way to think about denoising the resulting signal, but if some other noise process were to generate the data, then we could learn that process instead. We can then attain lower actual error for the method (in this case similar to, though slightly higher than, the TV solution) by fixing the learned sparsity of the $D$ matrix and then fine tuning.

4.2.4. FINE-TUNING AND IMPROVING THE TOTAL VARIATION SOLUTION

Finally, we highlight the ability of the OptNet methods to improve upon the results of a convex program by specifically tailoring it to the data. Here, we use the same OptNet architecture as in the previous subsection, but initialize $D$ to be the differencing matrix as in the total variation solution. As shown in Table 1, the procedure is able to improve both the training and testing MSE over the TV solution, specifically improving upon test MSE by 12%. Section B shows the convergence of fine-tuning.


Figure 4. Example mini-Sudoku initial problem and solution.

4.3. MNIST

One compelling use case of an OptNet layer is to learn constraints and dependencies over the output or latent space of a model. As a simple example to illustrate that OptNet layers can be included in existing architectures and that the gradients can be efficiently propagated through the layer, we show the performance of a fully-connected feedforward network with and without an OptNet layer in Section A in the supplemental material.

4.4. Sudoku

Finally, we present the main illustrative example of the representational power of our approach: the task of learning the game of Sudoku. Sudoku is a popular logical puzzle, where a (typically 9x9) grid of points must be arranged given some initial entries, so that each row, each column, and each 3x3 grid of points contains one of each number 1 through 9. We consider the simpler case of 4x4 Sudoku puzzles, with numbers 1 through 4, as shown in Figure 4.

Sudoku is fundamentally a constraint satisfaction problem, and is trivial for computers to solve when told the rules of the game. However, if we do not know the rules of the game, but are only presented with examples of unsolved puzzles and the corresponding solved puzzles, this is a challenging task. We consider this to be an interesting benchmark task for algorithms that seek to capture complex strict relationships between all input and output variables. The input to the algorithm consists of a 4x4 grid (really a 4x4x4 tensor with a one-hot encoding for known entries and all zeros for unknown entries), and the desired output is a 4x4x4 tensor of the one-hot encoding of the solution.

This is a problem where traditional neural networks have difficulties learning the necessary hard constraints. As a baseline inspired by the models at https://github.com/Kyubyong/sudoku, we implemented a multilayer feedforward network to attempt to solve Sudoku problems. Specifically, we report results for a network that has 10 convolutional layers with 512 3x3 filters each, and we tried other architectures as well. The OptNet layer we use on this task is a completely generic QP in "standard form" with only positivity inequality constraints but an arbitrary constraint matrix $Ax = b$, a small $Q = 0.1 I$ to make sure the problem is strictly feasible, and with the linear term $q$ simply being the input one-hot encoding of the Sudoku problem. We know that Sudoku can be approximated well with a linear program (indeed, integer programming is a typical

Figure 5. Sudoku training plots: training and testing loss and error for the convolutional and OptNet models.

solution method for such problems), but the model here is told nothing about the rules of Sudoku.
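A hedged sketch of this layer parameterization follows; the constraint count, initialization scaling, and sign of the linear term are assumptions made for illustration, and the resulting QP would be handed to the differentiable batched solver rather than solved here.

```python
import torch

n = 4 * 4 * 4                    # 4x4 board with a one-hot encoding over 4 digits
m = 40                           # number of learned equality constraints (assumed)
A = torch.nn.Parameter(0.1 * torch.randn(m, n))
b = torch.nn.Parameter(0.1 * torch.randn(m))
Q = 0.1 * torch.eye(n)           # small quadratic term keeps the QP strictly convex
G = -torch.eye(n)                # G z <= h encodes the positivity constraint z >= 0
h = torch.zeros(n)

def sudoku_qp_params(puzzle_one_hot):
    """puzzle_one_hot: flattened (n,) tensor encoding the given entries."""
    q = -puzzle_one_hot          # assumed sign: reward matching the given entries
    return Q, q, A, b, G, h      # pass these to a differentiable QP layer (e.g., qpth)
```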

We trained these models using ADAM (Kingma & Ba, 2014) to minimize the MSE (which we refer to as "loss") on a dataset we created consisting of 9000 training puzzles, and we then tested the models on 1000 different held-out puzzles. The error rate is the percentage of puzzles that are not solved correctly when each cell is assigned to whichever index is largest in the prediction. Figure 5 shows that the convolutional network is not able to learn all of the necessary logic for the task and ends up over-fitting to the training data. We contrast this with the performance of the OptNet network, which learns most of the correct hard constraints within the first three epochs and is able to generalize much better to unseen examples.

5. Conclusion

We have presented OptNet, a neural network architecture where we use optimization problems as a single layer in the network. We have derived the algorithmic formulation for differentiating through these layers, allowing for backpropagation in end-to-end architectures. We have also developed an efficient batch solver for these optimizations based upon a primal-dual interior point method, and developed a method for attaining the necessary gradient information "for free" from this approach. Our experiments highlight the potential power of these networks, showing that they can solve problems where existing networks are very poorly suited, such as learning Sudoku problems purely from data. There are many future directions of research for these approaches, but we feel that they add another important primitive to the toolbox of neural network practitioners.


Acknowledgments

BA is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1252522. We would like to thank the developers of PyTorch for helping us add core features, particularly Soumith Chintala and Adam Paszke. We also thank Ian Goodfellow, Lekan Ogunmolu, Rui Silva, Po-Wei Wang, and Eric Wong for invaluable comments, as well as Rocky Duan who helped us improve our feedforward network baseline on mini-Sudoku.

References

Amos, Brandon, Xu, Lei, and Kolter, J. Zico. Input convex neural networks. In Proceedings of the International Conference on Machine Learning, 2017.

Belanger, David and McCallum, Andrew. Structured prediction energy networks. In Proceedings of the International Conference on Machine Learning, 2016.

Belanger, David, Yang, Bishan, and McCallum, Andrew. End-to-end learning for structured prediction energy networks. In Proceedings of the International Conference on Machine Learning, 2017.

Bertsekas, Dimitri P. Nonlinear programming. Athena Scientific, Belmont, 1999.

Bonnans, J. Frederic and Shapiro, Alexander. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Brakel, Philemon, Stroobandt, Dirk, and Schrauwen, Benjamin. Training energy-based models for time-series imputation. Journal of Machine Learning Research, 14(1):2771–2797, 2013.

Chen, Liang-Chieh, Schwing, Alexander G., Yuille, Alan L., and Urtasun, Raquel. Learning deep structured models. In Proceedings of the International Conference on Machine Learning, 2015.

Clarke, Frank H. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247–262, 1975.

Domke, Justin. Generic methods for optimization-based modeling. In AISTATS, volume 22, pp. 318–326, 2012.

Dontchev, Asen L. and Rockafellar, R. Tyrrell. Implicit functions and solution mappings. Springer Monographs in Mathematics, 2009.

Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272–279, 2008.

Fiacco, Anthony V. and Ishizuka, Yo. Sensitivity and stability analysis for nonlinear programming. Annals of Operations Research, 27(1):215–235, 1990.

Goodfellow, Ian, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pp. 548–556, 2013.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gould, Stephen, Fernando, Basura, Cherian, Anoop, Anderson, Peter, Santa Cruz, Rodrigo, and Guo, Edison. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.

Griewank, Andreas and Walther, Andrea. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.

Johnson, Matthew, Duvenaud, David K., Wiltschko, Alex, Adams, Ryan P., and Datta, Sandeep R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946–2954, 2016.

Kennedy, Michael Peter and Chua, Leon O. Neural networks for nonlinear programming. IEEE Transactions on Circuits and Systems, 35(5):554–562, 1988.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kunisch, Karl and Pock, Thomas. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938–983, 2013.

LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1:0, 2006.

Lillo, Walter E., Loh, Mei Heng, Hui, Stefen, and Zak, Stanislaw H. On solving constrained optimization problems with neural networks: A penalty method approach. IEEE Transactions on Neural Networks, 4(6):931–940, 1993.

Lotstedt, Per. Numerical simulation of time-dependent contact and friction problems in rigid body mechanics. SIAM Journal on Scientific and Statistical Computing, 5(2):370–393, 1984.

Magnus, X and Neudecker, Heinz. Matrix differential calculus. New York, 1988.

Mairal, Julien, Bach, Francis, and Ponce, Jean. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791–804, 2012.

Mattingley, Jacob and Boyd, Stephen. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, 2012.

Metz, Luke, Poole, Ben, Pfau, David, and Sohl-Dickstein, Jascha. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

Morari, Manfred and Lee, Jay H. Model predictive control: past, present and future. Computers & Chemical Engineering, 23(4):667–682, 1999.

Peng, Jian, Bo, Liefeng, and Xu, Jinbo. Conditional neural fields. In Advances in Neural Information Processing Systems, pp. 1419–1427, 2009.

Sastry, Shankar and Bodson, Marc. Adaptive control: stability, convergence and robustness. Courier Corporation, 2011.

Schmidt, Uwe and Roth, Stefan. Shrinkage fields for effective image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2774–2781, 2014.

Sra, Suvrit, Nowozin, Sebastian, and Wright, Stephen J. Optimization for machine learning. MIT Press, 2012.

Stoyanov, Veselin, Ropson, Alexander, and Eisner, Jason. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS, pp. 725–733, 2011.

Tappen, Marshall F., Liu, Ce, Adelson, Edward H., and Freeman, William T. Learning Gaussian conditional random fields for low-level vision. In Computer Vision and Pattern Recognition (CVPR), pp. 1–8. IEEE, 2007.

Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip H. S. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1529–1537, 2015.


OptNet: Supplementary Material

Brandon Amos J. Zico Kolter

A. MNIST Experiment

In this section we consider the integration of QP OptNet layers into a traditional fully connected network for the MNIST problem. The results here show only very marginal improvement, if any, over a fully connected layer (MNIST, after all, is fairly well-solved by a fully connected network, let alone a convolutional network). But the main point of this comparison is simply to illustrate that we can include these layers within existing network architectures and efficiently propagate the gradients through the layer.

Specifically we use a FC600-FC10-FC10-SoftMax fully connected network and compare it to a FC600-FC10-OptNet10-SoftMax network, where the numbers after each layer indicate the layer size. The OptNet layer in this case includes only inequality constraints and the previous layer is only used in the linear objective term $q(z_i) = z_i$. To keep $Q \succ 0$, we use a Cholesky factorization $Q = L L^T + \epsilon I$ and directly learn $L$ (without any information from the previous layer). We also directly learn $A$ and $G$, and to ensure a feasible solution always exists, we select some learnable $z_0$ and $s_0$ and set $b = A z_0$ and $h = G z_0 + s_0$.
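A short PyTorch sketch of this parameterization (the layer sizes and the softplus used to keep $s_0$ positive are assumptions made here for illustration):

```python
import torch
import torch.nn.functional as F

nz, neq, nineq, eps = 10, 5, 10, 1e-4          # assumed sizes
L  = torch.nn.Parameter(torch.tril(torch.randn(nz, nz)))
A  = torch.nn.Parameter(torch.randn(neq, nz))
G  = torch.nn.Parameter(torch.randn(nineq, nz))
z0 = torch.nn.Parameter(torch.zeros(nz))
s0 = torch.nn.Parameter(torch.ones(nineq))

def qp_params(prev):                           # prev: (nz,) activations z_i
    Q = L @ L.T + eps * torch.eye(nz)          # Q = L L^T + eps*I stays positive definite
    q = prev                                   # linear term taken from the previous layer
    b = A @ z0                                 # A z0 = b is satisfiable by construction
    h = G @ z0 + F.softplus(s0)                # G z0 < h strictly (softplus keeps s0 > 0)
    return Q, q, A, b, G, h
```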

Figure 6 shows that the results are similar for both networks, with slightly lower error and less variance in the OptNet network.

Figure 6. Training performance on MNIST; top: fully connected network; bottom: OptNet as final layer.

B. Denoising Experiment Details

Figure 7 shows the error of the fully connected network on the denoising task and Figure 8 shows the error of the OptNet fine-tuned TV solution.

Figure 7. Error of the fully connected network for denoising.

Figure 8. Error rate from fine-tuning the TV solution for denoising.

C. Representational power of the QP OptNet layer

This section contains proofs for those results we highlight in Section 3.2. As mentioned before, these proofs are all quite straightforward and follow from well-known properties, but we include them here for completeness.

C.1. Proof of Theorem 1

Proof. The fact that an OptNet layer is subdifferentiable for strictly convex QPs ($Q \succ 0$) follows directly from the well-known result that the solution of a strictly convex QP is continuous (though not everywhere differentiable). Our proof essentially just boils down to showing this fact,


though we do so by explicitly showing that there is a unique solution to the Jacobian equations (6) that we presented earlier, except on a measure zero set. This measure zero set consists of QPs with degenerate solutions, points where inequality constraints can hold with equality yet also have zero-valued dual variables. For simplicity we assume that $A$ has full row rank, but this can be relaxed.

From the complementarity condition, we have that at a primal-dual solution $(z^\star, \lambda^\star, \nu^\star)$

$$
\begin{aligned}
(G z^\star - h)_i < 0 &\;\Rightarrow\; \lambda^\star_i = 0 \\
\lambda^\star_i > 0 &\;\Rightarrow\; (G z^\star - h)_i = 0
\end{aligned}
\qquad (13)
$$

(i.e., we cannot have both these terms non-zero).

First we consider the (typical) case where exactly one of $(G z^\star - h)_i$ and $\lambda^\star_i$ is zero. Then the KKT differential matrix

$$
\begin{bmatrix}
Q & G^T & A^T \\
D(\lambda^\star) G & D(G z^\star - h) & 0 \\
A & 0 & 0
\end{bmatrix}
\qquad (14)
$$

(the left hand side of (6)) is non-singular. To see this, note that if we let $\mathcal{I}$ be the set where $\lambda^\star_i > 0$, then the matrix

$$
\begin{bmatrix}
Q & G_{\mathcal{I}}^T & A^T \\
D(\lambda^\star) G_{\mathcal{I}} & D(G z^\star - h)_{\mathcal{I}} & 0 \\
A & 0 & 0
\end{bmatrix}
=
\begin{bmatrix}
Q & G_{\mathcal{I}}^T & A^T \\
D(\lambda^\star) G_{\mathcal{I}} & 0 & 0 \\
A & 0 & 0
\end{bmatrix}
\qquad (15)
$$

is non-singular (scaling the second block by $D(\lambda^\star)^{-1}$ gives a standard KKT system (Boyd & Vandenberghe, 2004, Section 10.4), which is nonsingular for invertible $Q$ and $[G_{\mathcal{I}}^T \; A^T]$ with full column rank, which must hold due to our condition on $A$ and the fact that there must be fewer than $n$ total tight constraints at the solution). Also note that for any $i \notin \mathcal{I}$, only the $D(G z^\star - h)_{ii}$ term is non-zero for the entire row in the second block of the matrix. Thus, if we want to solve the system

$$
\begin{bmatrix}
Q & G^T & A^T \\
D(\lambda^\star) G & D(G z^\star - h) & 0 \\
A & 0 & 0
\end{bmatrix}
\begin{bmatrix} z \\ \lambda \\ \nu \end{bmatrix}
=
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
\qquad (16)
$$

we simply first set $\lambda_i = b_i / (G z^\star - h)_i$ for $i \notin \mathcal{I}$ and then solve the nonsingular system

$$
\begin{bmatrix}
Q & G_{\mathcal{I}}^T & A^T \\
D(\lambda^\star) G_{\mathcal{I}} & 0 & 0 \\
A & 0 & 0
\end{bmatrix}
\begin{bmatrix} z \\ \lambda_{\mathcal{I}} \\ \nu \end{bmatrix}
=
\begin{bmatrix} a - G_{\bar{\mathcal{I}}}^T \lambda_{\bar{\mathcal{I}}} \\ b_{\mathcal{I}} \\ c \end{bmatrix}.
\qquad (17)
$$

Alternatively, suppose that we have both $\lambda^\star_i = 0$ and $(G z^\star - h)_i = 0$. Then although the KKT matrix is now singular (any row for which $\lambda^\star_i = 0$ and $(G z^\star - h)_i = 0$ will be all zero), there still exists a solution to the system (6), because the right hand side is always in the range of $D(\lambda^\star)$ and so will also be zero for these rows. In this case there will no longer be a unique solution, corresponding to the subdifferentiable but not differentiable case.

C.2. Proof of Theorem 2

Proof. The proof that an OptNet layer can represent any piecewise linear univariate function relies on the fact that we can represent any such function in "sum-of-max" form

$$
f(x) = \sum_{i=1}^{k} w_i \max\{a_i x + b_i, 0\}
\qquad (18)
$$

where $w_i \in \{-1, 1\}$, $a_i, b_i \in \mathbb{R}$ (to do so, simply proceed left to right along the breakpoints of the function, adding a properly scaled linear term to fit the next piecewise section). The OptNet layer simply represents this function directly.

That is, we encode the optimization problem

$$
\begin{aligned}
\operatorname*{minimize}_{z \in \mathbb{R},\, t \in \mathbb{R}^k}\;\; & \|t\|_2^2 + (z - w^T t)^2 \\
\text{subject to}\;\; & a_i x + b_i \le t_i, \;\; i = 1, \ldots, k.
\end{aligned}
\qquad (19)
$$

Clearly, the objective here is minimized when $z = w^T t$ and $t$ is as small as possible, meaning each $t_i$ must either be at its bound $a_i x + b_i \le t_i$ or, if $a_i x + b_i < 0$, then $t_i = 0$ will be the optimal solution due to the objective function. To obtain a multivariate but elementwise function, we simply apply this function to each coordinate of the input $x$.

To see the specific case of a ReLU network, note that the layer

$$
z = \max\{W x + b, 0\}
\qquad (20)
$$

is simply equivalent to the OptNet problem

$$
\begin{aligned}
\operatorname*{minimize}_{z}\;\; & \|z - W x - b\|_2^2 \\
\text{subject to}\;\; & z \ge 0.
\end{aligned}
\qquad (21)
$$
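A quick numerical sanity check of this equivalence, with cvxpy as a stand-in solver and randomly generated $W$, $b$, and $x$:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
W, bias, x = rng.standard_normal((3, 5)), rng.standard_normal(3), rng.standard_normal(5)

# Solve the QP (21) and compare against the closed-form ReLU (20).
z = cp.Variable(3)
cp.Problem(cp.Minimize(cp.sum_squares(z - W @ x - bias)), [z >= 0]).solve()

print(np.allclose(z.value, np.maximum(W @ x + bias, 0), atol=1e-5))   # True
```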

C.3. Proof of Theorem 3

Proof. The final theorem simply states that a two-layer ReLU network (more specifically, a ReLU followed by a linear layer, which is sufficient to achieve a universal function approximator) can often require exponentially many more units to approximate a function specified by an OptNet layer. That is, we consider a single-output ReLU network, much like in the previous section, but defined for multivariate inputs,

$$
f(x) = \sum_{i=1}^{m} w_i \max\{a_i^T x + b_i, 0\}.
\qquad (22)
$$


Figure 9. Creases for a three-term pointwise maximum (left), and a ReLU network (right).

Although there are many functions that such a network cannot represent, for illustration we consider the simple case of a maximum of three linear functions

$$
f'(x) = \max\{a_1^T x,\; a_2^T x,\; a_3^T x\}.
\qquad (23)
$$

To see why a ReLU network is not capable of representing this function exactly, even for $x \in \mathbb{R}^2$, note that any sum-of-max function, due to the nature of the term $\max\{a_i^T x + b_i, 0\}$ as stated above, must have "creases" (breakpoints in the piecewise linear function) that span the entire input space; this is in contrast to the max terms, which can have creases that only partially span the space. This is illustrated in Figure 9. It is apparent, therefore, that the two-layer ReLU network cannot exactly approximate the three-term maximum (any ReLU network would necessarily have a crease going through one of the linear regions of the original function). Yet this max function can be captured by a simple OptNet layer,

$$
\begin{aligned}
\operatorname*{minimize}_{z}\;\; & z^2 \\
\text{subject to}\;\; & a_i^T x \le z, \;\; i = 1, \ldots, 3.
\end{aligned}
\qquad (24)
$$

The fact that the ReLU network is a universal function approximator means that we are able to approximate the three-term max function, but to do so means that we require a dense covering of points over the input space, choose an equal number of ReLU terms, and then choose coefficients such that we approximate the underlying function on these points; however, for a large enough radius this will require an exponentially sized covering to approximate the underlying function arbitrarily closely.

Although the example in this proof is quite simple (and perhaps somewhat limited, since for example the function can be exactly represented using a "Maxout" network), there are a number of other such functions for which we have been unable to find any compact representation. For example, projection of a point onto the simplex is easily written as the OptNet layer

$$
\begin{aligned}
\operatorname*{minimize}_{z}\;\; & \|z - x\|_2^2 \\
\text{subject to}\;\; & z \ge 0, \;\; 1^T z = 1,
\end{aligned}
\qquad (25)
$$

yet it does not seem possible to represent this in closed form as a simple network: the closed form solution of such a projection operator requires sorting or finding a particular median term of the data (Duchi et al., 2008), which is not feasible with a single layer for any form of network that we are aware of. Yet for simplicity we stated the theorem above using just ReLU networks and a straightforward example that works even in two dimensions.
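For reference, the simplex-projection layer (25) written out with cvxpy as a stand-in solver and an arbitrary illustrative input:

```python
import cvxpy as cp
import numpy as np

x = np.array([0.9, 0.4, -0.2])            # illustrative input point
z = cp.Variable(3)
cp.Problem(cp.Minimize(cp.sum_squares(z - x)),
           [z >= 0, cp.sum(z) == 1]).solve()
print(z.value)                             # Euclidean projection of x onto the simplex
```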