
Variational Gram Functions: Convex Analysis and Optimization

Amin Jalali∗ Maryam Fazel† Lin Xiao‡

October 28, 2015

Abstract

We propose a new class of convex penalty functions, called variational Gram functions (VGFs), that can promote pairwise relations, such as orthogonality, among a set of vectors in a vector space. These functions can serve as regularizers in convex optimization problems arising from hierarchical classification, multitask learning, estimating vectors with disjoint supports, and other applications. We study necessary and sufficient conditions under which a VGF is convex, and give a characterization of its subdifferential. In addition, we show how to compute its proximal operator, and discuss efficient optimization algorithms for some structured loss-minimization problems using VGFs. Numerical experiments are presented to demonstrate the effectiveness of VGFs and the associated optimization algorithms.

1 Introduction

Let $x_1,\ldots,x_m$ be vectors in $\mathbb{R}^n$. It is well known that their pairwise inner products $x_i^T x_j$, for $i,j = 1,\ldots,m$, reveal essential information about their relative positions and orientations, and can serve as a measure for various properties such as orthogonality. In this paper, we consider a class of functions that aggregate the pairwise inner products in a variational form,

\[
\Omega_{\mathcal{M}}(x_1,\ldots,x_m) \;=\; \max_{M\in\mathcal{M}} \;\sum_{i,j=1}^{m} M_{ij}\, x_i^T x_j , \tag{1.1}
\]

where $\mathcal{M}$ is a compact subset of $m\times m$ symmetric matrices. Let $X = [x_1 \cdots x_m]$ be an $n\times m$ matrix. Then the pairwise inner products $x_i^T x_j$ are the entries of the Gram matrix $X^T X$, and the function above can be written as

\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{M\in\mathcal{M}} \;\langle X^T X, M\rangle \;=\; \max_{M\in\mathcal{M}} \;\operatorname{tr}(X M X^T) , \tag{1.2}
\]

where $\langle A, B\rangle = \operatorname{tr}(A^T B)$ denotes the matrix inner product. We call $\Omega_{\mathcal{M}}$ a variational Gram function (VGF) of the vectors $x_1,\ldots,x_m$ induced by the set $\mathcal{M}$. If the set $\mathcal{M}$ is clear from the context, we may write $\Omega(X)$ to simplify notation.

As an example, consider the case where $\mathcal{M}$ is given by a box constraint,
\[
\mathcal{M} \;=\; \bigl\{ M : |M_{ij}| \le \bar M_{ij},\; i,j = 1,\ldots,m \bigr\} , \tag{1.3}
\]

${}^*$Department of Electrical Engineering, University of Washington, Seattle, WA 98195. Email: [email protected]
${}^\dagger$Department of Electrical Engineering, University of Washington, Seattle, WA 98195. Email: [email protected]
${}^\ddagger$Machine Learning Groups, Microsoft Research, Redmond, WA 98053. Email: [email protected]


where $\bar M$ is a symmetric nonnegative matrix. In this case, the maximization in the definition of $\Omega_{\mathcal{M}}$ picks either $M_{ij} = \bar M_{ij}$ or $M_{ij} = -\bar M_{ij}$ depending on the sign of $x_i^T x_j$, for all $i,j = 1,\ldots,m$ (if $x_i^T x_j = 0$, the choice is arbitrary). Therefore,

\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{M\in\mathcal{M}} \;\sum_{i,j=1}^{m} M_{ij}\, x_i^T x_j \;=\; \sum_{i,j=1}^{m} \bar M_{ij}\, |x_i^T x_j| . \tag{1.4}
\]

In other words, $\Omega_{\mathcal{M}}(X)$ is the weighted sum of the absolute values of the pairwise inner products. This function was proposed in [42] as a regularization function to promote orthogonality between linear classifiers in the context of hierarchical classification.
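For concreteness, here is a minimal NumPy sketch (ours, not from the paper) that evaluates the box-constrained VGF in (1.4) directly from a matrix $X$ and a nonnegative weight matrix $\bar M$:

```python
import numpy as np

def vgf_box(X, M_bar):
    """Evaluate the VGF in (1.4): sum_{i,j} M_bar[i,j] * |x_i^T x_j|.

    X     : n-by-m matrix whose columns are x_1, ..., x_m.
    M_bar : m-by-m symmetric nonnegative weight matrix.
    """
    G = X.T @ X                     # Gram matrix with entries x_i^T x_j
    return np.sum(M_bar * np.abs(G))

# Example: penalize only off-diagonal inner products (promote orthogonality).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
M_bar = np.ones((3, 3)) - np.eye(3)
print(vgf_box(X, M_bar))
```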

An important question from both the theoretical and algorithmic points of view is: what are the conditions on $\mathcal{M}$ such that a VGF is convex? Observe that the function $\operatorname{tr}(X M X^T)$ is a convex quadratic function of $X$ if $M$ is positive semidefinite. As a result, the variational form $\Omega_{\mathcal{M}}(X)$ is convex if $\mathcal{M}$ is a subset of the positive semidefinite cone $\mathbb{S}^m_+$, because then it is the pointwise maximum of a family of convex functions parametrized by $M \in \mathcal{M}$ (see, e.g., [36, Theorem 5.5]). However, this is not a necessary condition. For example, the set $\mathcal{M}$ in (1.3) is not a subset of $\mathbb{S}^m_+$ unless $\bar M = 0$, but the VGF in (1.4) is convex provided that the comparison matrix of $\bar M$ (obtained by negating the off-diagonal entries) is positive semidefinite [42]. In this paper, we give a more careful analysis of the conditions for a VGF to be convex, and characterize its subdifferential and associated proximal operator.

Given a convex VGF, we can define a semi-norm${}^1$ by taking its square root as

\[
\|X\|_{\mathcal{M}} \;:=\; \sqrt{\Omega_{\mathcal{M}}(X)} \;=\; \max_{M\in\mathcal{M}} \;\biggl( \sum_{i,j=1}^{m} M_{ij}\, x_i^T x_j \biggr)^{1/2} . \tag{1.5}
\]

If $\mathcal{M} \subset \mathbb{S}^m_+$, then $\|X\|_{\mathcal{M}}$ is the pointwise maximum of the semi-norms $\|X M^{1/2}\|_F$ over all $M \in \mathcal{M}$. We call $\|\cdot\|_{\mathcal{M}}$ a VGF-induced (semi-)norm.

VGFs and the associated norms can serve as penalty or regularization functions in optimization problems to promote certain pairwise properties among a set of vector variables (such as orthogonality in the above example). In this paper, we consider optimization problems of the form

\[
\operatorname*{minimize}_{X\in\mathbb{R}^{n\times m}} \;\; L(X) + \lambda\, \Omega_{\mathcal{M}}(X) , \tag{1.6}
\]

where $L(X)$ is a convex loss function of the variable $X = [x_1 \cdots x_m]$, $\Omega_{\mathcal{M}}(X)$ is a convex VGF, and $\lambda > 0$ is a parameter that trades off the relative importance of these two terms. We will focus on problems where $L(X)$ is smooth or has an explicit variational structure, and show how to exploit the structure of $L(X)$ and $\Omega(X)$ together to derive efficient optimization algorithms.

Organization. In Section 2, we give more examples of VGFs and explain their connections with functions of Euclidean distance matrices and robust optimization. Section 3 studies the convexity of VGFs and their conjugates, semidefinite representability, VGF-induced norms and their subdifferentials. Their proximal operators are derived in Section 4. In Section 5, we study a class of structured loss minimization problems with VGF penalties, and show how to exploit their structure using the mirror-prox algorithm. Finally, in Section 6, we present two numerical examples to illustrate the application of VGFs: one on finding vectors with disjoint support, and the other on hierarchical classification.

${}^1$A semi-norm satisfies all the properties of a norm except definiteness; i.e., it can take the value zero at a nonzero input.


Notation. We use $\mathbb{S}^m$ to denote the set of symmetric matrices in $\mathbb{R}^{m\times m}$, and $\mathbb{S}^m_+ \subset \mathbb{S}^m$ is the cone of positive semidefinite (PSD) matrices. The symbol $\preceq$ represents the Loewner partial order unless subscripted by a specific cone, and $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors. We use capital letters for matrices and bold lower case letters for vectors. We use $X \in \mathbb{R}^{n\times m}$ and $x = \operatorname{vec}(X) \in \mathbb{R}^{nm}$ interchangeably, with $x_i$ denoting the $i$th column of $X$ and $x_{(i)}^T$ being its $i$th row; i.e., $X = [x_1 \cdots x_m]$ and $X^T = [x_{(1)} \cdots x_{(n)}]$. We use $\mathbf{1}$ and $\mathbf{0}$ to denote matrices or vectors of all ones and all zeros respectively, whose sizes will be clear from the context. The entry-wise absolute value of $X$ is denoted by $|X|$. We use $\|\cdot\|_p$ to denote the $\ell_p$ norm of the input vector or matrix, and $\|\cdot\|_F$ the Frobenius norm (the same as the $\ell_2$ vector norm). The convex conjugate of a function $f$ is defined as $f^*(y) = \sup_x \{\langle x, y\rangle - f(x)\}$, and the dual norm of $\|\cdot\|$ is defined as $\|y\|_* = \sup\{\langle x, y\rangle : \|x\| \le 1\}$. Finally, $\arg\min$ ($\arg\max$) returns an optimal point of a minimization (maximization) program, while $\operatorname{Arg\,min}$ (or $\operatorname{Arg\,max}$) denotes the set of all optimal points. The operator $\operatorname{diag}(\cdot)$ is used interchangeably to place a vector on the diagonal of a zero matrix of corresponding size, to extract the diagonal entries of a matrix as a vector, or to zero out the off-diagonal entries of a matrix. We write $f \equiv g$ to denote $f(x) = g(x)$ for all $x \in \operatorname{dom}(f) = \operatorname{dom}(g)$.

2 Examples and connections

In this section, we present examples of VGFs corresponding to different choices of the set $\mathcal{M}$. The list includes some well-known functions that can be expressed in the variational form of (1.1), as well as some new examples.

Vector norms. Any vector norm $\|\cdot\|$ on $\mathbb{R}^m$ is the square root of a VGF defined over $\mathbb{R}^{1\times m}$ with
\[
\mathcal{M} \;=\; \{ u u^T : \|u\|_* \le 1 \}.
\]
For a column vector $x \in \mathbb{R}^m$, the VGF is given by
\[
\Omega_{\mathcal{M}}(x^T) \;=\; \max_{u}\,\bigl\{ \operatorname{tr}(x^T u u^T x) : \|u\|_* \le 1 \bigr\} \;=\; \max_{u}\,\bigl\{ (x^T u)^2 : \|u\|_* \le 1 \bigr\} \;=\; \|x\|^2 .
\]

Another example, for $n = 1$, where $\mathcal{M}$ is a compact convex set of diagonal matrices with positive diagonals, was first introduced in [30] and later discussed in [3]. The corresponding norm is defined by
\[
\Omega_{\mathcal{M}}(x^T) \;=\; \max_{\operatorname{diag}(\theta)\in\mathcal{M}} \;\sum_{i=1}^{m} \theta_i x_i^2 \;=\; \|x\|^2 , \tag{2.1}
\]
and $(\|x\|^*)^2 = \min_{\operatorname{diag}(\theta)\in\mathcal{M}} \sum_{i=1}^{m} \tfrac{1}{\theta_i} x_i^2$. The $k$-support norm [2], which is used to encourage vectors to have $k$ or fewer nonzero entries, is an example of (2.1) corresponding to $\mathcal{M} = \{\operatorname{diag}(\theta) : 0 \le \theta_i \le 1,\; \mathbf{1}^T\theta = k\}$.

Norms of the Gram matrix. Given a symmetric nonnegative matrix $\bar M$, we can define a class of VGFs based on any vector norm $\|\cdot\|$ and its dual norm $\|\cdot\|_*$. Define the variational set as
\[
\mathcal{M} \;=\; \{ K \circ \bar M : \|K\|_* \le 1,\; K^T = K \}, \tag{2.2}
\]


where $\circ$ denotes the matrix Hadamard product, i.e., $(K \circ \bar M)_{ij} = K_{ij}\bar M_{ij}$ for all $i,j$. Then we have
\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{M\in\mathcal{M}} \,\langle M, X^T X\rangle \;=\; \max_{\|K\|_*\le 1} \,\langle K \circ \bar M, X^T X\rangle \;=\; \max_{\|K\|_*\le 1} \,\langle K, \bar M \circ (X^T X)\rangle \;=\; \bigl\| \bar M \circ (X^T X) \bigr\| . \tag{2.3}
\]

The following are several concrete examples.

• If we let $\|\cdot\|_*$ in (2.2) be the $\ell_\infty$ norm, then $\mathcal{M} = \{M : |M_{ij}/\bar M_{ij}| \le 1,\; i,j = 1,\ldots,m\}$, which is the same as in (1.3). Here we use the convention $0/0 = 0$, thus $M_{ij} = 0$ whenever $\bar M_{ij} = 0$. In this case, we obtain the VGF in (1.4):
\[
\Omega_{\mathcal{M}}(X) \;=\; \bigl\| \bar M \circ (X^T X) \bigr\|_1 \;=\; \sum_{i,j=1}^{m} \bar M_{ij}\, |x_i^T x_j| .
\]

• If we use the $\ell_2$ norm in (2.2), then $\mathcal{M} = \bigl\{ M : \sum_{i,j=1}^{m} (M_{ij}/\bar M_{ij})^2 \le 1 \bigr\}$. In this case, we have
\[
\Omega(X) \;=\; \bigl\| \bar M \circ (X^T X) \bigr\|_F \;=\; \biggl( \sum_{i,j=1}^{m} \bar M_{ij}^2\, (x_i^T x_j)^2 \biggr)^{1/2} . \tag{2.4}
\]
This function has been considered in multi-task learning [39], and also in the context of super-saturated designs [8, 13].

• We can use the $\ell_1$ norm in (2.2) to define $\mathcal{M} = \bigl\{ M : \sum_{i,j=1}^{m} |M_{ij}/\bar M_{ij}| \le 1 \bigr\}$, which results in
\[
\Omega(X) \;=\; \bigl\| \bar M \circ (X^T X) \bigr\|_\infty \;=\; \max_{i,j = 1,\ldots,m} \;\bar M_{ij}\, |x_i^T x_j| . \tag{2.5}
\]
This case can also be traced back to [8] in the statistics literature, where the maximum of $|x_i^T x_j|$ for $i \ne j$ was used as the measure to choose among supersaturated designs.

Many other interesting examples can be constructed this way. For example, using a group-$\ell_1$ norm of the Gram matrix can model sharing versus competition, which was considered in vision tasks [21]. We will revisit the above examples and their convexity conditions in Section 3.
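All three Gram-matrix-norm examples above reduce to applying a norm to $\bar M \circ (X^T X)$, so they share one evaluation routine. A minimal NumPy sketch (ours; the names are illustrative):

```python
import numpy as np

def gram_norm_vgf(X, M_bar, norm="fro"):
    """VGFs of the form (2.3): a norm of M_bar ∘ (X^T X).

    norm = "l1"   gives (1.4), the weighted sum of |x_i^T x_j|;
    norm = "fro"  gives (2.4);  norm = "linf" gives (2.5).
    """
    W = M_bar * (X.T @ X)           # Hadamard product with the Gram matrix
    if norm == "l1":
        return np.abs(W).sum()
    if norm == "fro":
        return np.linalg.norm(W, "fro")
    if norm == "linf":
        return np.abs(W).max()
    raise ValueError("unknown norm")

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))
M_bar = np.ones((4, 4))
print([gram_norm_vgf(X, M_bar, p) for p in ("l1", "fro", "linf")])
```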

Spectral functions. From the definition, the value of a VGF is invariant under left-multiplication of $X$ by an orthogonal matrix, but this is not true for right multiplication. Hence, VGFs are not functions of singular values in general (e.g., see [26]); they depend on the row space of $X$ as well. This also implies that in general $\Omega(X) \not\equiv \Omega(X^T)$. However, if the set $\mathcal{M}$ is closed under left and right multiplication by orthogonal matrices, then $\Omega_{\mathcal{M}}(X)$ becomes a function of the squared singular values of $X$. For any matrix $M \in \mathbb{S}^m$, denote the sorted vector of its singular values by $\sigma(M)$ and let $\Theta = \{\sigma(M) : M \in \mathcal{M}\}$. Then we have
\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{M\in\mathcal{M}} \;\operatorname{tr}(X M X^T) \;=\; \max_{\theta\in\Theta} \;\sum_{i=1}^{\min(n,m)} \theta_i\, \sigma_i(X)^2 , \tag{2.6}
\]
as a result of von Neumann's trace inequality [31]. Notice the similarity of the above to the VGF in (2.1). As an example, consider
\[
\mathcal{M} \;=\; \{ M : \alpha_1 I \preceq M \preceq \alpha_2 I,\; \operatorname{tr}(M) = \alpha_3 \}, \tag{2.7}
\]


where $0 < \alpha_1 < \alpha_2$ and $\alpha_3 \in [m\alpha_1, m\alpha_2]$ are given constants. The so-called spectral box-norm [29] is dual to the norm in (1.5) defined via this $\mathcal{M}$. Note that in this case $\mathcal{M} \subset \mathbb{S}^m_+$, so the VGF is convex. The square of this norm has been considered in [20] for clustering.

Finite set $\mathcal{M}$. For a finite set $\mathcal{M} = \{M_1,\ldots,M_p\} \subset \mathbb{S}^m_+$, the VGF is given by
\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{i=1,\ldots,p} \;\bigl\| X M_i^{1/2} \bigr\|_F^2 ,
\]
which is the pointwise maximum of a finite number of squared weighted Frobenius norms.

2.1 Diversification

VGFs can help in diversifying the columns of the input matrix; e.g., minimizing (1.4) pushes toward zero the inner products $x_i^T x_j$ corresponding to the nonzero entries of $\bar M$, as much as possible. As another example, observe that two nonnegative vectors have disjoint supports if and only if they are orthogonal to each other. Hence, using a VGF such as (1.4) that promotes orthogonality, we can define
\[
\Psi(X) \;=\; \Omega(|X|) \tag{2.8}
\]
to promote disjoint supports among the columns of $X$, hence diversifying the supports of the columns of $X$. Convexity of (2.8) will be discussed in Section 3.6.

2.2 Functions of Euclidean distance matrix

Consider a set $\mathcal{M} \subset \mathbb{S}^m$ with the property that $M\mathbf{1} = 0$ for all $M \in \mathcal{M}$. Let $A = \operatorname{diag}(M) - M$, and observe that
\[
\operatorname{tr}(X M X^T) \;=\; \sum_{i,j=1}^{m} M_{ij}\, x_i^T x_j \;=\; \frac{1}{2} \sum_{i,j=1}^{m} A_{ij}\, \|x_i - x_j\|_2^2 .
\]
This allows us to express the associated VGF as a function of the Euclidean distance matrix $D$, which is defined by $D_{ij} = \frac{1}{2}\|x_i - x_j\|_2^2$ for $i,j = 1,\ldots,m$ (see, e.g., [9, Section 8.3]). Let $\mathcal{A} = \{\operatorname{diag}(M) - M : M \in \mathcal{M}\}$. Then we have
\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{M\in\mathcal{M}} \;\operatorname{tr}(X M X^T) \;=\; \max_{A\in\mathcal{A}} \;\langle A, D\rangle .
\]

A sufficient condition for the above function to be convex in $X$ is that each $A \in \mathcal{A}$ is entrywise nonnegative, which implies that the corresponding $M = \operatorname{diag}(A\mathbf{1}) - A$ is diagonally dominant with nonnegative diagonal elements, hence positive semidefinite. However, this is not a necessary condition: the function can be convex without all $A$'s being entrywise nonnegative. In Section 3 we will discuss more general conditions for convexity of VGFs. See [16] and references therein for applications of this VGF.
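The identity relating $\operatorname{tr}(XMX^T)$ to the distance matrix is easy to verify numerically. In the sketch below (ours), we draw a random symmetric $M$ with $M\mathbf{1}=0$ and check that $\operatorname{tr}(XMX^T) = \langle A, D\rangle$ with $A = \operatorname{diag}(M)-M$ and $D_{ij} = \tfrac12\|x_i-x_j\|_2^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4
X = rng.standard_normal((n, m))

# A random symmetric M with zero row sums (M @ 1 = 0).
B = rng.standard_normal((m, m)); B = B + B.T
M = B - np.diag(B.sum(axis=1))

A = np.diag(np.diag(M)) - M
# Half squared pairwise distances D_ij = 0.5 * ||x_i - x_j||^2.
sq = (X**2).sum(axis=0)
D = 0.5 * (sq[:, None] + sq[None, :] - 2 * X.T @ X)

print(np.trace(X @ M @ X.T), np.sum(A * D))   # the two values agree
```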

2.3 Connection with robust optimization

The VGF-regularized loss minimization problem has the following connection to robust optimization (see, e.g., [7]): the optimization program

\[
\operatorname*{minimize}_{X} \;\; \max_{M\in\mathcal{M}} \;\bigl\{ L(X) + \operatorname{tr}(X M X^T) \bigr\}
\]


can be interpreted as seeking an $X$ with minimal worst-case value over an uncertainty set $\mathcal{M}$. Alternatively, this can be viewed as a problem with Tikhonov regularization $\|X M^{1/2}\|_F^2$, where the weight matrix $M^{1/2}$ is subject to errors characterized by the set $\mathcal{M}$.

3 Convex analysis of VGF

In this section, we study the convexity of VGFs, their conjugate functions, and their subdifferentials. First, we review some basic properties. Notice that $\Omega_{\mathcal{M}}$ is the support function of $\mathcal{M}$ evaluated at the Gram matrix $X^T X$; i.e.,
\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{M\in\mathcal{M}} \;\operatorname{tr}(X M X^T) \;=\; S_{\mathcal{M}}(X^T X) , \tag{3.1}
\]
where the support function of a set $\mathcal{M}$ is defined as $S_{\mathcal{M}}(Y) = \sup_{M\in\mathcal{M}} \langle M, Y\rangle$ (see, e.g., [36, Section 13]). Therefore, by properties of the support function, we have
\[
\Omega_{\mathcal{M}} \;\equiv\; \Omega_{\operatorname{conv}(\mathcal{M})} , \tag{3.2}
\]

where $\operatorname{conv}(\mathcal{M})$ denotes the convex hull of $\mathcal{M}$. It is clear that the representation (i.e., the associated set $\mathcal{M}$) of a VGF is not unique. Henceforth, without loss of generality, we assume $\mathcal{M}$ is convex unless explicitly noted otherwise. Moreover, while we have assumed $\mathcal{M} \subset \mathbb{S}^m$ is compact, we only need the maximum in (1.1) to be attained, which for example allows for a closed set $\mathcal{M}$ that is unbounded along any negative semidefinite direction.

As we mentioned in the introduction, a sufficient condition for the convexity of a VGF is $\mathcal{M} \subset \mathbb{S}^m_+$. The following theorem gives a necessary and sufficient condition for a VGF to be convex. Basically, it only requires the VGF to admit a representation with a set of PSD matrices.

Theorem 3.1. Suppose that $\mathcal{M}$ is compact. Then $\Omega_{\mathcal{M}}$ is convex if and only if for every $X$ there exists an $M \in \mathcal{M}\cap\mathbb{S}_+$ that achieves the maximum value in the definition of $\Omega_{\mathcal{M}}(X)$. In other words, $\Omega_{\mathcal{M}}$ is convex if and only if $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+}$.

The above theorem means that a convex VGF is essentially the pointwise maximum of a family of squared weighted Frobenius norms. Moreover, the condition is independent of the dimension $n$. That is, once $\Omega_{\mathcal{M}}$ is convex on $\mathbb{R}^{n\times m}$ for some $n$, it is convex on $\mathbb{R}^{q\times m}$ for any $q \ge 1$. We postpone the proof until after Lemma 3.4, where we derive the conjugate functions of VGFs, since our proof uses conjugate functions.

While Theorem 3.1 gives a necessary and sufficient condition for the convexity of VGFs, it does not provide an effective procedure to check whether or not such a condition holds. In Section 3.1, we discuss more concrete conditions for determining convexity when the set $\mathcal{M}$ is a polytope. In Section 3.2, we describe a more tangible sufficient condition for general sets.

3.1 Convexity with a polytope $\mathcal{M}$

Consider the case where $\mathcal{M}$ is a polytope with $p$ vertices, i.e., $\mathcal{M} = \operatorname{conv}\{M_1,\ldots,M_p\}$. The support function of this set is given by $S_{\mathcal{M}}(Y) = \max_{i=1,\ldots,p} \langle Y, M_i\rangle$ and is piecewise linear [38, Section 8.E]. We define $\mathcal{M}_{\mathrm{eff}}$ as a subset of $\{M_1,\ldots,M_p\}$ with minimal cardinality such that its support function coincides with $\Omega_{\mathcal{M}}(X)$ for every $X$. That is, $\mathcal{M}_{\mathrm{eff}} \subseteq \{M_1,\ldots,M_p\}$ and $S_{\mathcal{M}}(X^T X) = S_{\mathcal{M}_{\mathrm{eff}}}(X^T X)$ for all $X \in \mathbb{R}^{n\times m}$, but deleting any member of $\mathcal{M}_{\mathrm{eff}}$ violates this property. As an example,


considering $\mathcal{M} = \{M : |M_{ij}| \le \bar M_{ij},\; i,j = 1,\ldots,m\}$, which gives the function defined in (1.4), we have
\[
\mathcal{M}_{\mathrm{eff}} \;\subseteq\; \bigl\{ M : M_{ii} = \bar M_{ii},\;\; M_{ij} = \pm\bar M_{ij} \text{ for } i \ne j \bigr\} . \tag{3.3}
\]

Theorem 3.2. For a polytope $\mathcal{M} \subset \mathbb{R}^{m\times m}$, the associated VGF is convex if and only if $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$.

Proof. Obviously, $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$ ensures convexity of $\max_{M\in\mathcal{M}_{\mathrm{eff}}} \operatorname{tr}(X M X^T) = \Omega_{\mathcal{M}}(X)$. Next, we prove necessity for any $\mathcal{M}_{\mathrm{eff}}$. We start with an observation. Take any $M_i \in \mathcal{M}_{\mathrm{eff}}$. If for every $X \in \mathbb{R}^{n\times m}$ with $\Omega(X) = \operatorname{tr}(X M_i X^T)$ there exists another $M_j \in \mathcal{M}_{\mathrm{eff}}$ with $\Omega(X) = \operatorname{tr}(X M_j X^T)$, then $\mathcal{M}_{\mathrm{eff}}\setminus\{M_i\}$ is an effective subset of $\mathcal{M}$, which contradicts the minimality of $\mathcal{M}_{\mathrm{eff}}$. Hence, there exists $X_i$ such that $\Omega(X_i) = \operatorname{tr}(X_i M_i X_i^T) > \operatorname{tr}(X_i M_j X_i^T)$ for all $j \ne i$. Hence, for this $X_i$, $\Omega$ is twice continuously differentiable in a small neighborhood of $X_i$ with Hessian $\nabla^2\Omega(\operatorname{vec}(X_i)) = 2\,M_i \otimes I_n$, where $\otimes$ denotes the matrix Kronecker product. Since $\Omega$ is assumed to be convex, the Hessian has to be PSD, which gives $M_i \succeq 0$.

The definition of $\mathcal{M}_{\mathrm{eff}}$ requires $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}_{\mathrm{eff}}}$, and the condition in Theorem 3.2 is $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$. Comparing with Theorem 3.1, here we have $\mathcal{M}_{\mathrm{eff}} \subset \mathcal{M}\cap\mathbb{S}^m_+$, which can be a strict inclusion (even $\operatorname{conv}(\mathcal{M}_{\mathrm{eff}})$ can be a strict subset of $\mathcal{M}\cap\mathbb{S}^m_+$). Next we give a few examples to illustrate the use of Theorem 3.2.

• We continue with the example defined in (1.4). The authors of [42] provided a necessary (when $n \ge m-1$) and sufficient condition for convexity using results from M-matrix theory. First, define the comparison matrix $\tilde M$ associated with the nonnegative matrix $\bar M$ by $\tilde M_{ii} = \bar M_{ii}$ and $\tilde M_{ij} = -\bar M_{ij}$ for $i \ne j$. Then $\Omega_{\mathcal{M}}$ is convex if $\tilde M$ is positive semidefinite, and this condition is also necessary when $n \ge m-1$ [42]. Theorem 3.2 provides an alternative and more general proof. Let $\lambda_{\min}(M)$ denote the minimum eigenvalue of a symmetric matrix $M$. Considering the characterization of $\mathcal{M}_{\mathrm{eff}}$ in (3.3), we have
\[
\min_{M\in\mathcal{M}_{\mathrm{eff}}} \lambda_{\min}(M)
\;=\; \min_{M\in\mathcal{M}_{\mathrm{eff}},\, \|z\|_2 = 1} z^T M z
\;\ge\; \min_{\|z\|_2 = 1} \;\sum_{i} \bar M_{ii} z_i^2 - \sum_{i\ne j} \bar M_{ij} |z_i z_j|
\;=\; \min_{\|z\|_2 = 1} |z|^T \tilde M |z|
\;\ge\; \lambda_{\min}(\tilde M). \tag{3.4}
\]
When $n \ge m-1$, one can construct $X \in \mathbb{R}^{n\times m}$ such that all off-diagonal entries of $X^T X$ are negative (see the example in Appendix A.2 of [42]). On the other hand, Lemma 2.1(2) of [12] states that the existence of such a matrix implies $n \ge m-1$. Hence, $\tilde M \in \mathcal{M}_{\mathrm{eff}}$ if and only if $n \ge m-1$. Therefore, both inequalities in (3.4) hold with equality, which means that $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$ if and only if $\tilde M \succeq 0$. By Theorem 3.2, this is equivalent to the VGF in (1.4) being convex. If $n < m-1$, then $\tilde M$ may not belong to $\mathcal{M}_{\mathrm{eff}}$, and thus $\tilde M \succeq 0$ is only a sufficient condition for convexity for general $n$. (A small numerical check of this condition is sketched after these examples.)

• Consider a box similar to the set $\mathcal{M}$ above which is not necessarily symmetric around the origin. More specifically, let $\mathcal{M} = \{M \in \mathbb{S}^m : M_{ii} = D_{ii},\; |M - C| \le D\}$, where $C$ represents a center which is symmetric with zero diagonal, and $D$ is a symmetric nonnegative matrix. In this case, we have $\mathcal{M}_{\mathrm{eff}} \subseteq \{M : M_{ii} = D_{ii},\; M_{ij} = C_{ij} \pm D_{ij} \text{ for } i \ne j\}$. When used as a penalty function in applications, this may capture the prior information that when $x_i^T x_j$ is not zero, a particular range of acute or obtuse angles (depending on the sign of $C_{ij}$) between the two vectors is preferred. Similar to (3.4), we have
\[
\min_{M\in\mathcal{M}_{\mathrm{eff}}} \lambda_{\min}(M) \;\ge\; \min_{\|z\|_2 = 1} \;|z|^T \tilde D |z| + z^T C z \;\ge\; \lambda_{\min}(\tilde D) + \lambda_{\min}(C),
\]
where $\tilde D$ is the comparison matrix associated with $D$. Notice that $C$ has zero diagonal and hence cannot be PSD unless it is zero. Hence, a sufficient condition for convexity of $\Omega_{\mathcal{M}}$ defined via such an asymmetric box is $\lambda_{\min}(\tilde D) + \lambda_{\min}(C) \ge 0$.

• Consider the VGF defined in (2.5), whose associated variational set is
\[
\mathcal{M} \;=\; \Bigl\{ M \in \mathbb{S}^m : \sum_{(i,j):\, \bar M_{ij}\ne 0} |M_{ij}/\bar M_{ij}| \le 1 ,\;\; M_{ij} = 0 \text{ if } \bar M_{ij} = 0 \Bigr\},
\]
where $\bar M$ is a symmetric nonnegative matrix. The vertices of $\mathcal{M}$ are matrices with either only one nonzero value $\bar M_{ii}$ on the diagonal, or two symmetric nonzero off-diagonal entries at $(i,j)$ and $(j,i)$ equal to $\frac{1}{2}\bar M_{ij}$ or $-\frac{1}{2}\bar M_{ij}$. The second type of matrix cannot be PSD since its diagonal is zero. Therefore, according to Theorem 3.2, convexity of $\Omega_{\mathcal{M}}$ requires that these vertices do not belong to $\mathcal{M}_{\mathrm{eff}}$. To this end, we need $\max\{\bar M_{ii}\|x_i\|_2^2,\, \bar M_{jj}\|x_j\|_2^2\} \ge \bar M_{ij}\, x_i^T x_j$ for all $i$ and $j$, and any $X \in \mathbb{R}^{n\times m}$, which is equivalent to $\bar M_{ii}\bar M_{jj} \ge \bar M_{ij}^2$ for all $i,j$. This can be satisfied if $\bar M \succeq 0$. However, regardless of the positive semidefiniteness of $\bar M$, by the above argument a convex $\Omega$ corresponds to an $\mathcal{M}_{\mathrm{eff}}$ that contains only diagonal matrices. Hence, any such function must be simply expressible as $\Omega(X) = \max_{i=1,\ldots,m} \bar M_{ii}\|x_i\|_2^2$.

3.2 A sufficient condition for more general sets

For the VGF defined in (2.4), the associated set $\mathcal{M}$ is given in (2.2) with the Frobenius norm, i.e.,
\[
\mathcal{M} \;=\; \{ K \circ \bar M : \|K\|_F \le 1,\; K^T = K \} .
\]
In this case, $\mathcal{M}$ is not a polytope, but we can proceed with an analysis similar to the previous subsection. In particular, given any $X \in \mathbb{R}^{n\times m}$, the value of $\Omega_{\mathcal{M}}(X)$ is achieved by the optimal matrix $K = (\bar M \circ X^T X)/\|\bar M \circ X^T X\|_F$. Theorem 3.1 implies that convexity of $\Omega_{\mathcal{M}}$ requires $K \circ \bar M \succeq 0$, which is equivalent to $\bar M \circ \bar M \circ X^T X \succeq 0$. Since this needs to hold for every $X$, we must have $\bar M \circ \bar M \succeq 0$. The Schur Product Theorem [19, Theorem 7.5.1] states that $\bar M \succeq 0$ is sufficient for this requirement to hold, hence it is also a sufficient condition for convexity of $\Omega_{\mathcal{M}}$.

As a concrete example of the above analysis, choosing $\bar M = \mathbf{1}_{m\times m}$ leads to $\Omega_{\mathcal{M}}(X) = \|X^T X\|_F = \|\sigma^2(X)\|_2$, which is convex in $X \in \mathbb{R}^{n\times m}$. Alternatively, we notice that the corresponding set is $\mathcal{M} = \{K : \|K\|_F \le 1,\; K^T = K\}$, which is closed under left and right multiplication by orthogonal matrices. So $\Omega_{\mathcal{M}}$ is a spectral function as given by (2.6). In this case, the set $\Theta$ in (2.6) is the unit $\ell_2$ ball in $\mathbb{R}^m$, and we obtain $\Omega_{\mathcal{M}}(X) = \|\sigma^2(X)\|_2$.

In fact, a similar result can be stated for any absolute norm: if $\|x\| = \||x|\|$ for all $x \in \mathbb{R}^m$, then $\Omega(X) = \|\sigma^2(X)\|$ is convex in $X \in \mathbb{R}^{n\times m}$. This is because the dual of an absolute norm is also an absolute norm [19] and
\[
\Omega(X) \;=\; \|\sigma^2(X)\| \;=\; \max_{\|\theta\|_*\le 1} \,\langle \theta, \sigma^2(X)\rangle \;=\; \max_{\|\theta\|_*\le 1,\, 0\le\theta} \,\langle \theta, \sigma^2(X)\rangle \;=\; \max_{\|\theta\|_*\le 1,\, 0\le\theta} \,\operatorname{tr}\bigl(X \operatorname{diag}(\theta) X^T\bigr)
\]


is convex by Theorem 3.1. For general norms (not absolute), $\Omega(X) = \|\sigma^2(X)\|$ is convex in $X \in \mathbb{R}^{n\times m}$ if the projection of the dual norm ball onto the nonnegative orthant is a subset of the dual norm ball itself. This can be seen as a result of the following more general lemma.

Lemma 3.3 (a sufficient condition). Let $\mathcal{M}_+$ be the orthogonal projection of all matrices in $\mathcal{M}$ onto the PSD cone, and let $\mathcal{M} - \mathbb{S}_+$ be the Minkowski difference $\mathcal{M} - \mathbb{S}_+ = \{M - S : M \in \mathcal{M},\, S \in \mathbb{S}_+\}$. Then $\Omega_{\mathcal{M}}$ is convex provided that $\mathcal{M}_+ \subseteq \mathcal{M} - \mathbb{S}_+$.

Proof. Denote by $M_+$ the orthogonal projection of a symmetric matrix $M$ onto the PSD cone, which is the matrix formed by only the positive eigenvalues and their associated eigenvectors of $M$. It is easy to see that for any $X$ we have $\operatorname{tr}(X M X^T) \le \operatorname{tr}(X M_+ X^T)$. Therefore,
\[
\Omega_{\mathcal{M}}(X) \;=\; \max_{M\in\mathcal{M}} \;\operatorname{tr}(X M X^T) \;\le\; \max_{M\in\mathcal{M}_+} \;\operatorname{tr}(X M X^T) .
\]
If $\mathcal{M}_+ \subseteq \mathcal{M} - \mathbb{S}_+$, then we get equality above, and $\Omega_{\mathcal{M}}(X)$ is convex by Theorem 3.1. Notice that $\mathcal{M}_+ \subseteq \mathcal{M} - \mathbb{S}_+$ can hold while $\mathcal{M}_+ \not\subseteq \mathcal{M}$.

In the previous example, for a general norm $\|\cdot\|$, we have $\Omega(X) = \|\sigma^2(X)\| = \max\{\langle\theta, \sigma^2(X)\rangle : \|\theta\|_* \le 1\}$, and Lemma 3.3 provides a sufficient condition for convexity.

Similar to the proof of Lemma 3.3, one can check that another sufficient condition for convexity of a VGF is that all of the maximal points of $\mathcal{M}$ with respect to $\mathbb{S}_+$ are PSD. On the other hand, it is easy to see that the condition in Lemma 3.3 is not necessary. Consider $\mathcal{M} = \{M \in \mathbb{S}^2 : |M_{ij}| \le 1\}$. Although the associated VGF is convex (because the comparison matrix is PSD), we have $\mathcal{M}_+ \not\subseteq \mathcal{M}$. For example,
\[
M = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \in \mathcal{M}, \qquad\text{but}\qquad M_+ \approx \begin{bmatrix} 0.44 & 0.72 \\ 0.72 & 1.17 \end{bmatrix} \notin \mathcal{M}.
\]

3.3 Conjugate function and proof of Theorem 3.1

For any function $\Omega$, the conjugate function is defined as $\Omega^*(Y) = \sup_X \{\langle X, Y\rangle - \Omega(X)\}$, and the transformation that maps $\Omega$ to $\Omega^*$ is called the Legendre-Fenchel transform (e.g., [36, Section 12]).

Lemma 3.4 (conjugate VGF). Consider a convex VGF associated with a compact convex set $\mathcal{M}$. The conjugate function is given by
\[
\Omega^*_{\mathcal{M}}(Y) \;=\; \frac{1}{4} \inf_{M} \;\bigl\{ \operatorname{tr}(Y M^\dagger Y^T) \;:\; \operatorname{range}(Y^T) \subseteq \operatorname{range}(M) ,\;\; M \in \mathcal{M}\cap\mathbb{S}^m_+ \bigr\} , \tag{3.5}
\]
where $M^\dagger$ is the Moore-Penrose pseudoinverse of $M$.

Note that $\Omega^*(Y)$ is $+\infty$ if the optimization problem in (3.5) is infeasible; i.e., if $Y(I - M M^\dagger)$ is nonzero for all $M \in \mathcal{M}$, where $M M^\dagger$ is the orthogonal projection onto the range of $M$. This can be seen from results on the generalized Schur complement; see, e.g., Appendix A.5.5 in [9] or [11].

Proof. Applying the definition of the conjugate function to a VGF gives
\[
\Omega^*_{\mathcal{M}}(Y) \;=\; \sup_X \;\inf_{M\in\mathcal{M}} \;\langle X, Y\rangle - \operatorname{tr}(X M X^T) .
\]


Since $\mathcal{M}$ is compact and convex, we can change the order of $\sup$ and $\inf$. Next, consider splitting the minimization over $\mathcal{M}$ into minimization over $\mathcal{M}\cap\mathbb{S}^m_+$ and $\operatorname{cl}(\mathcal{M}\setminus\mathbb{S}_+)$. For any fixed non-PSD $M$ in the latter set, the maximization with respect to $X$ is unbounded from above. Therefore,
\[
\Omega^*_{\mathcal{M}}(Y) \;=\; \inf_{M\in\mathcal{M}\cap\mathbb{S}^m_+} \;\sup_X \;\langle X, Y\rangle - \operatorname{tr}(X M X^T) .
\]

This gives $\Omega^*_{\mathcal{M}} \equiv \Omega^*_{\mathcal{M}\cap\mathbb{S}_+}$. Next we define
\[
f_{\mathcal{M}}(Y) \;=\; \frac{1}{4} \inf_{M, C} \;\biggl\{ \operatorname{tr}(C) \;:\; \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 ,\;\; M \in \mathcal{M} \biggr\} . \tag{3.6}
\]

We note that the constraint on the right-hand side of the above definition automatically implies $M \succeq 0$. Therefore we have $f_{\mathcal{M}} \equiv f_{\mathcal{M}\cap\mathbb{S}_+}$. Its conjugate function is
\[
\begin{aligned}
f^*_{\mathcal{M}}(X) &\;=\; \sup_Y \;\sup_{M, C} \;\biggl\{ \langle X, Y\rangle - \tfrac{1}{4}\operatorname{tr}(C) \;:\; \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 ,\;\; M \in \mathcal{M} \biggr\} \\
&\;=\; \sup_{M\in\mathcal{M}\cap\mathbb{S}_+} \;\sup_{Y, C} \;\biggl\{ \langle X, Y\rangle - \tfrac{1}{4}\operatorname{tr}(C) \;:\; \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 \biggr\} .
\end{aligned}
\]

Replacing the optimization problem over $Y$ and $C$ with its dual problem gives $f^*_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+}$: consider a dual variable $W \succeq 0$ with corresponding blocks and write down the Lagrangian as
\[
L(Y, C, W) \;=\; \langle X, Y\rangle - \tfrac{1}{4}\operatorname{tr}(C) + \langle W_{11}, M\rangle + 2\langle W_{21}, Y\rangle + \langle W_{22}, C\rangle .
\]
The optimal value is finite only if $W_{21} = -\tfrac{1}{2}X$ and $W_{22} = \tfrac{1}{4}I$. Therefore, the dual problem is given by
\[
\min_{W_{11}} \;\biggl\{ \langle W_{11}, M\rangle \;:\; \begin{bmatrix} W_{11} & -\tfrac{1}{2}X^T \\ -\tfrac{1}{2}X & \tfrac{1}{4}I \end{bmatrix} \succeq 0 \biggr\}
\;=\; \min_{W_{11}} \;\bigl\{ \langle W_{11}, M\rangle \;:\; W_{11} \succeq X^T X \bigr\} ,
\]
which is equal to $\langle M, X^T X\rangle$. Next, convexity and lower semi-continuity of $f_{\mathcal{M}}$ imply $f^{**}_{\mathcal{M}} = f_{\mathcal{M}}$ (e.g., [38, Theorem 11.1]). Therefore, $f_{\mathcal{M}}$ is equal to $\Omega^*_{\mathcal{M}\cap\mathbb{S}_+}$, which we showed to be the same as $\Omega^*_{\mathcal{M}}$. Taking the generalized Schur complement of the semidefinite constraint in (3.6) gives the desired representation in (3.5).

Following Lemma 3.4, we are ready to prove Theorem 3.1.

Proof. [of Theorem 3.1] We know that $\Omega_{\mathcal{M}\cap\mathbb{S}_+}$ is convex because it is the pointwise maximum of convex quadratic functions parametrized by $M \in \mathcal{M}\cap\mathbb{S}_+$. On the other hand, the proof of Lemma 3.4 shows that $\Omega^*_{\mathcal{M}} \equiv \Omega^*_{\mathcal{M}\cap\mathbb{S}_+}$, which in turn gives $\Omega^{**}_{\mathcal{M}} \equiv \Omega^{**}_{\mathcal{M}\cap\mathbb{S}_+}$. Since both $\Omega_{\mathcal{M}}$ and $\Omega_{\mathcal{M}\cap\mathbb{S}_+}$ are proper, lower semi-continuous convex functions, they are equal to their biconjugates [38, Theorem 11.1]. Therefore, $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+}$.
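As a sanity check on the semidefinite representation (3.6), the following CVXPY sketch (ours; the set and sizes are illustrative) evaluates $\Omega_{\mathcal{M}}$ and $\Omega^*_{\mathcal{M}}$ for the spectral box set (2.7) with $\alpha_1 > 0$, so that $\mathcal{M}\subset\mathbb{S}^m_{++}$ and the VGF is convex, and verifies the Fenchel-Young inequality on random inputs:

```python
import numpy as np
import cvxpy as cp

# Spectral box set (2.7): M = {M : a1*I << M << a2*I, tr(M) = a3}.
m, n = 3, 4
a1, a2, a3 = 0.5, 2.0, 3.0
I = np.eye(m)

def member_constraints(M):
    return [M >> a1 * I, a2 * I >> M, cp.trace(M) == a3]

def vgf(X):
    """Omega(X) = max_{M in M} tr(X M X^T), a linear SDP in M."""
    M = cp.Variable((m, m), symmetric=True)
    return cp.Problem(cp.Maximize(cp.trace(X @ M @ X.T)),
                      member_constraints(M)).solve()

def vgf_conjugate(Y):
    """Omega^*(Y) via the SDP representation (3.6)."""
    M = cp.Variable((m, m), symmetric=True)
    C = cp.Variable((n, n), symmetric=True)
    block = cp.bmat([[M, Y.T], [Y, C]])
    return cp.Problem(cp.Minimize(0.25 * cp.trace(C)),
                      [block >> 0] + member_constraints(M)).solve()

rng = np.random.default_rng(3)
X, Y = rng.standard_normal((n, m)), rng.standard_normal((n, m))
# Fenchel-Young inequality: <X, Y> <= Omega(X) + Omega^*(Y).
print(np.sum(X * Y), "<=", vgf(X) + vgf_conjugate(Y))
```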

3.4 VGF-induced norms

Given a convex VGF $\Omega_{\mathcal{M}}$, Theorem 3.1 states that $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+}$, which in turn implies
\[
\Omega_{\mathcal{M}}(X) \;=\; \sup_{M\in\mathcal{M}\cap\mathbb{S}_+} \;\operatorname{tr}(X M X^T) \;=\; \sup_{M\in\mathcal{M}\cap\mathbb{S}_+} \;\bigl\| X M^{1/2} \bigr\|_F^2 \;\ge\; 0 .
\]

The above representation of $\Omega_{\mathcal{M}}$ shows that $\sqrt{\Omega_{\mathcal{M}}}$ is a semi-norm: absolute homogeneity holds trivially, and it is easy to prove the triangle inequality for a maximum of semi-norms. The next lemma generalizes this assertion; we provide a proof in Appendix A.


Lemma 3.5. Suppose a function $\Omega : \mathbb{R}^{n\times m} \to \mathbb{R}$ is homogeneous of order 2, i.e., $\Omega(\theta X) = \theta^2\Omega(X)$. Then its square root $\|X\| = \sqrt{\Omega(X)}$ is a semi-norm if and only if $\Omega$ is convex. Moreover, provided that $\Omega$ is strictly convex, $\sqrt{\Omega}$ is a norm.

Dual norm. Given the representation of $\Omega^*_{\mathcal{M}}$ in Lemma 3.4, one can derive a similar representation for $\sqrt{\Omega^*_{\mathcal{M}}}$ as follows.

Theorem 3.6. Consider a convex VGF $\Omega_{\mathcal{M}}$ where $\mathcal{M}$ is a compact convex set. We have
\[
\sqrt{\Omega^*_{\mathcal{M}}(Y)} \;=\; \frac{1}{4} \inf_{M, C, \alpha} \;\biggl\{ \operatorname{tr}(C) + \alpha \;:\; \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 ,\;\; \tfrac{1}{\alpha}M \in \mathcal{M}\cap\mathbb{S}_+ \biggr\} . \tag{3.7}
\]

Proof. Let $(M_1^\star, C_1^\star)$ be an optimal solution of the minimization problem in (3.6), and let $(M_2^\star, C_2^\star, \alpha_2^\star)$ be an optimal solution of the minimization problem in (3.7). In addition, let $\mathrm{OPT}_1$ and $\mathrm{OPT}_2$ be the optimal values of these two problems, respectively. First, observe that $M_1 = \frac{1}{\alpha_2^\star}M_2^\star$ and $C_1 = \alpha_2^\star C_2^\star$ are feasible for the optimization program in (3.6). Hence,
\[
\mathrm{OPT}_1 \;\le\; \frac{1}{4}\,\alpha_2^\star \operatorname{tr}(C_2^\star) \;\le\; \frac{1}{16}\bigl(\alpha_2^\star + \operatorname{tr}(C_2^\star)\bigr)^2 \;=\; \mathrm{OPT}_2^2 .
\]
On the other hand, $\alpha_2 = 2\sqrt{\mathrm{OPT}_1}$, $M_2 = \alpha_2 M_1^\star$, and $C_2 = \frac{1}{\alpha_2}C_1^\star$ are feasible for the optimization program in (3.7). Hence,
\[
\mathrm{OPT}_2 \;\le\; \frac{1}{4}\Bigl( \frac{4}{\alpha_2}\mathrm{OPT}_1 + \alpha_2 \Bigr) \;=\; \sqrt{\mathrm{OPT}_1} .
\]
These two results give $\mathrm{OPT}_2 = \sqrt{\mathrm{OPT}_1}$. From the proof of Lemma 3.4, we have $\mathrm{OPT}_1 = \Omega^*_{\mathcal{M}}(Y)$, which implies that $\mathrm{OPT}_2 = \sqrt{\Omega^*_{\mathcal{M}}(Y)}$. This is the desired result.

Considering $\|\cdot\| \equiv \sqrt{\Omega_{\mathcal{M}}}$, we have $\frac{1}{2}\Omega_{\mathcal{M}} \equiv \frac{1}{2}\|\cdot\|^2$. Taking the conjugate function of both sides yields $2\,\Omega^*_{\mathcal{M}} \equiv \frac{1}{2}(\|\cdot\|^*)^2$, where we used the order-2 homogeneity of $\Omega_{\mathcal{M}}$. Therefore,
\[
\|\cdot\|^* \;\equiv\; 2\sqrt{\Omega^*_{\mathcal{M}}} . \tag{3.8}
\]

3.5 Subdifferentials

In this section, we characterize the subdifferentials of VGFs and their conjugate functions, as well as those of their induced norms. Due to the variational definition of a VGF, where the objective function is linear in $M$, and the fact that $\mathcal{M}$ is assumed to be compact, it is straightforward to obtain the subdifferential of $\Omega_{\mathcal{M}}$ (e.g., see [18, Theorem 4.4.2]).

Proposition 3.7. The subdifferential of a convex VGF $\Omega_{\mathcal{M}}$ at a given matrix $X$ is given by
\[
\partial\Omega_{\mathcal{M}}(X) \;=\; \operatorname{conv}\bigl\{ 2 X M \;:\; M \in \mathcal{M}\cap\mathbb{S}_+ ,\;\; \operatorname{tr}(X M X^T) = \Omega(X) \bigr\} . \tag{3.9}
\]
For its induced norm $\|X\|_{\mathcal{M}} = \sqrt{\Omega_{\mathcal{M}}(X)}$, we have $\partial\|X\|_{\mathcal{M}} = \frac{1}{2\|X\|_{\mathcal{M}}}\,\partial\Omega_{\mathcal{M}}(X)$ if $\Omega_{\mathcal{M}}(X) \ne 0$.

As an example, the subdifferential of $\Omega(X) = \sum_{i,j=1}^{m} \bar M_{ij}\,|x_i^T x_j|$, defined in (1.4), is given by
\[
\partial\Omega(X) \;=\; \bigl\{ 2 X M \;:\; M_{ij} = \bar M_{ij}\operatorname{sign}(x_i^T x_j) \text{ if } \langle x_i, x_j\rangle \ne 0 ,\;\; |M_{ij}| \le \bar M_{ij} \text{ otherwise} \bigr\} . \tag{3.10}
\]
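A subgradient of (1.4) is therefore cheap to compute. The sketch below (our illustration) picks the sign matrix on the nonzero inner products, an arbitrary admissible choice (zero) elsewhere, and checks it against a finite difference along a random direction (valid where $\Omega$ is differentiable):

```python
import numpy as np

def box_vgf_subgradient(X, M_bar):
    """One element of the subdifferential (3.10) of sum_{i,j} M_bar_ij |x_i^T x_j|."""
    G = X.T @ X
    M = M_bar * np.sign(G)   # where x_i^T x_j = 0, sign is 0, which is admissible
    return 2 * X @ M

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3)); M_bar = np.ones((3, 3)); D = rng.standard_normal((5, 3))
omega = lambda Z: np.sum(M_bar * np.abs(Z.T @ Z))
eps = 1e-6
print((omega(X + eps * D) - omega(X - eps * D)) / (2 * eps),
      np.sum(box_vgf_subgradient(X, M_bar) * D))   # the two values agree
```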


Proposition 3.8. For a convex VGF $\Omega_{\mathcal{M}}$, the subdifferential of its conjugate function is given by
\[
\partial\Omega^*_{\mathcal{M}}(Y) \;=\; \Bigl\{ \tfrac{1}{2}\bigl(Y M^\dagger + W\bigr) \;:\; \Omega\bigl(Y M^\dagger + W\bigr) = 4\,\Omega^*(Y) = \operatorname{tr}(Y M^\dagger Y^T) ,\;\; \operatorname{range}(W^T) \subseteq \ker(M) \subseteq \ker(Y) ,\;\; M \in \mathcal{M}\cap\mathbb{S}_+ \Bigr\} . \tag{3.11}
\]
When $\Omega^*_{\mathcal{M}}(Y) \ne 0$, we have $\partial\|Y\|^*_{\mathcal{M}} = \frac{2}{\|Y\|^*_{\mathcal{M}}}\,\partial\Omega^*_{\mathcal{M}}(Y)$.

Proof. We use the results on subdifferentiation in parametric minimization [38, Section 10.C]. First, let us fix some notation: throughout the proof, we denote $\tfrac{1}{2}\Omega$ by $\Omega$ and $2\Omega^*$ by $\Omega^*$. Moreover, denote by $I_{\mathcal{M}}(M)$ the indicator function of the set $\mathcal{M}$, which is $0$ when $M \in \mathcal{M}$ and $+\infty$ otherwise. We write $\mathcal{M}$ instead of $\mathcal{M}\cap\mathbb{S}_+$ for simplicity of notation. Considering
\[
f(Y, M) \;:=\;
\begin{cases}
\frac{1}{2}\operatorname{tr}(Y M^\dagger Y^T) & \text{if } \operatorname{range}(Y^T) \subseteq \operatorname{range}(M), \\
+\infty & \text{otherwise,}
\end{cases}
\]
we have $\Omega^*(Y) = \inf_M f(Y, M) + I_{\mathcal{M}}(M)$. For such a function, we can use the results in [10, Theorem 4.8] to show that
\[
\partial f(Y, M) \;=\; \operatorname{conv}\bigl\{ \bigl(Z, -\tfrac{1}{2}Z^T Z\bigr) \;:\; Z = Y M^\dagger + W ,\;\; \operatorname{range}(W^T) \subseteq \ker(M) \bigr\} .
\]
Since $g(Y, M) := f(Y, M) + I_{\mathcal{M}}(M)$ is convex, we can use the second part of Theorem 10.13 in [38]: for any choice of $M_0$ which is optimal in the definition of $\Omega^*(Y)$,
\[
\partial\Omega^*(Y) \;=\; \{ Z : (Z, 0) \in \partial g(Y, M_0) \} .
\]
Therefore, for any $Z \in \partial\Omega^*(Y)$ we have
\[
\tfrac{1}{2}Z^T Z \;\in\; \partial I_{\mathcal{M}}(M_0) \;=\; \{ G : \langle G, M' - M_0\rangle \le 0 ,\; \forall M' \in \mathcal{M} \}
\]
(here $\partial I_{\mathcal{M}}(M_0)$ is the normal cone of $\mathcal{M}$ at $M_0$). This implies
\[
\tfrac{1}{2}\operatorname{tr}(Z M' Z^T) \;\le\; \tfrac{1}{2}\operatorname{tr}(Z M_0 Z^T)
\]
for all $M' \in \mathcal{M}$. Taking the supremum of the left-hand side over all $M' \in \mathcal{M}$, we get
\[
\Omega(Z) \;=\; \tfrac{1}{2}\operatorname{tr}(Z M_0 Z^T) \;=\; \tfrac{1}{2}\operatorname{tr}(Y M_0^\dagger Y^T) \;=\; \Omega^*(Y) ,
\]
where the second equality is implied by the condition $\operatorname{range}(W^T) \subseteq \ker(M_0)$ (which is equivalent to $M_0 W^T = 0$). Alternatively, for any matrix $Z$ from the right-hand side of (3.11) and any $W \in \mathbb{R}^{n\times m}$, we have
\[
\Omega^*(W) \;\ge\; \langle W, Z\rangle - \Omega(Z) \;=\; \langle W, Z\rangle - \Omega^*(Y) \;=\; \langle W - Y, Z\rangle + \Omega^*(Y) ,
\]
where we used Fenchel's inequality as well as the characterization of $Z$. Therefore, $Z \in \partial\Omega^*(Y)$. This finishes the proof. Notice that, for an $M$ achieving the infimum, $\ker(M) \subseteq \ker(Y)$ (or equivalently, $\operatorname{range}(Y^T) \subseteq \operatorname{range}(M)$) has to hold for the conjugate function to be defined.

Since $\partial\Omega^*(Y)$ is non-empty, for any choice of $M_0$ we can always find a $W$ such that $\tfrac{1}{2}(Y M_0^\dagger + W) \in \partial\Omega^*(Y)$. However, finding such a $W$ is not trivial. The following remark provides a computational, but not necessarily simple, approach for computing the subdifferential. The characterization of the whole subdifferential is helpful for understanding optimality conditions, but algorithms only need to compute a single subgradient, which is easier than computing the whole subdifferential; we will need to compute a subgradient of $\Omega^*$ as part of the reduction technique presented in Section 5.2, which is also helpful in computing the proximal operator of $\Omega$ in Section 4.


Remark 3.9. Given $Y$ and an optimal $M_0$, which by optimality satisfies $\ker(M_0) \subseteq \ker(Y)$, we have
\[
\begin{aligned}
\partial\Omega^*(Y) &\;=\; \operatorname*{Arg\,min}_{Z} \Bigl\{ \Omega(Z) \;:\; Z = \tfrac{1}{2}\bigl(Y M_0^\dagger + W\bigr) ,\;\; \operatorname{range}(W^T) \subseteq \ker(M_0) \subseteq \ker(Y) \Bigr\} \\
&\;=\; \operatorname*{Arg\,min}_{Z} \Bigl\{ \Omega(Z) \;:\; Z = \tfrac{1}{2}\bigl(Y M_0^\dagger + W\bigr) ,\;\; W M_0 M_0^\dagger = 0 \Bigr\} .
\end{aligned}
\]
This is because for all feasible $Z$ we have $\Omega(Z) \ge \operatorname{tr}(Z M_0 Z^T) = \Omega^*(Y)$.

3.6 Composition of VGF and absolute values

The characterization of the subdifferential allows us to establish conditions for convexity of $\Psi(X) = \Omega(|X|)$ defined in (2.8). Our result is based on the following lemma.

Lemma 3.10. Given a function $f : \mathbb{R}^n \to \mathbb{R}$, consider $g(x) = \min_{y\ge|x|} f(y)$ and $h(x) = f(|x|)$.

(a) $h^{**} \le g \le h$.

(b) If $f$ is convex, then $g$ is convex and $g = h^{**}$.

Proof. (a) First, $h^*(y) = \sup_x \{\langle x, y\rangle - f(|x|)\} = \sup_{x\ge 0} \{\langle x, |y|\rangle - f(x)\}$. Next, we have
\[
\begin{aligned}
h^{**}(z) &\;=\; \sup_y \Bigl\{ \langle y, z\rangle - \sup_{x\ge 0} \{\langle x, |y|\rangle - f(x)\} \Bigr\}
\;=\; \sup_{y\ge 0} \;\inf_{x\ge 0} \bigl\{ \langle y, |z|\rangle - \langle x, y\rangle + f(x) \bigr\} \\
&\;\le\; \inf_{x\ge 0} \;\sup_{y\ge 0} \bigl\{ \langle y, |z|\rangle - \langle x, y\rangle + f(x) \bigr\}
\;=\; \inf_{x\ge 0} \;\sup_{y\ge 0} \bigl\{ \langle y, |z| - x\rangle + f(x) \bigr\}
\;=\; \inf_{x\ge |z|} f(x) \;=\; g(z).
\end{aligned}
\]
This shows the first inequality in part (a). The second inequality follows directly from the definitions of $g$ and $h$.

(b) Consider $x_1, x_2 \in \mathbb{R}^n$ and $\theta \in [0,1]$. Suppose $g(x_i) = f(y_i)$ for some $y_i \ge |x_i|$, for $i = 1,2$; in other words, $y_i$ is the minimizer in the definition of $g(x_i)$. Then,
\[
\theta y_1 + (1-\theta) y_2 \;\ge\; \theta |x_1| + (1-\theta)|x_2| \;\ge\; |\theta x_1 + (1-\theta)x_2|,
\]
where the absolute values and inequalities are all entry-wise. By the definition of $g$ and convexity of $f$,
\[
g\bigl(\theta x_1 + (1-\theta)x_2\bigr) \;\le\; f\bigl(\theta y_1 + (1-\theta)y_2\bigr) \;\le\; \theta f(y_1) + (1-\theta) f(y_2) \;=\; \theta g(x_1) + (1-\theta) g(x_2) ,
\]
which implies that $g$ is convex. It is a classical result that the epigraph of the biconjugate $h^{**}$ is the closed convex hull of the epigraph of $h$; in other words, $h^{**}$ is the largest lower semi-continuous convex function that is no larger than $h$ (e.g., [36, Theorem 12.2]). Since $g$ is convex and $h^{**} \le g \le h$, we must have $h^{**} = g$.

Corollary 3.11. Let $\Omega_{\mathcal{M}}$ be a convex VGF. Then $\Omega_{\mathcal{M}}(|X|)$ is a convex function of $X$ if and only if $\Omega_{\mathcal{M}}(|X|) = \min_{Y\ge|X|} \Omega_{\mathcal{M}}(Y)$.

Proof. Let $\Omega_{\mathcal{M}}$ play the role of $f$ in Lemma 3.10. Then $g(X) = \min_{Y\ge|X|}\Omega_{\mathcal{M}}(Y)$ and $h(X) = \Omega_{\mathcal{M}}(|X|)$. If $h$ is convex, then it is a closed convex function (since $\Omega_{\mathcal{M}}$ is continuous), so $h = h^{**}$ [36, Theorem 12.2], and part (a) of Lemma 3.10 implies $h = g$. Conversely, given a convex $f$, part (b) of Lemma 3.10 states that $g = h^{**}$ is convex; hence $h = g$ implies convexity of $h$.


Lemma 3.12. Let $\Omega_{\mathcal{M}}$ be a convex VGF. If $\partial\Omega_{\mathcal{M}}(X) \cap \mathbb{R}^{n\times m}_+ \ne \emptyset$ holds for every $X \ge 0$, then $\Psi(X) = \Omega_{\mathcal{M}}(|X|)$ is convex.

Proof. Using the definition of subgradients of $\Omega$ at $|X|$, we have
\[
\Omega(|X| + \Delta) \;\ge\; \Omega(|X|) + \sup\bigl\{ \langle G, \Delta\rangle \;:\; G \in \partial\Omega(|X|) \bigr\} ,
\]
where the right-most term is the directional derivative of $\Omega$ at $|X|$ in the direction $\Delta$. Provided that the assumption of the lemma holds, we get $\Omega(Y) \ge \Omega(|X|)$ for all $Y \ge |X|$. Therefore, $\Psi(X) = \Omega_{\mathcal{M}}(|X|) = \min_{Y\ge|X|} \Omega_{\mathcal{M}}(Y)$, and Corollary 3.11 establishes the convexity of $\Psi$.

For example, consider the VGF $\Omega_{\mathcal{M}}$ defined in (1.4), and assume that it is convex. Its subdifferential $\partial\Omega_{\mathcal{M}}$ is given in (3.10). For each $X \ge 0$, the matrix product $2X\bar M$ is entrywise nonnegative since $\bar M$ is a nonnegative matrix, and it belongs to $\partial\Omega_{\mathcal{M}}(X)$. Therefore the condition in the above lemma is satisfied, and as a consequence, the function $\Psi(X) = \Omega_{\mathcal{M}}(|X|)$ is convex and has the alternative representation $\Psi(X) = \min_{Y\ge|X|}\Omega_{\mathcal{M}}(Y)$. This specific function $\Psi$ has been used in [40] for learning matrices with disjoint supports.

4 Proximal operators

The proximal operator of a closed convex function $h(\cdot)$ is defined as
\[
\operatorname{prox}_h(x) \;=\; \operatorname*{arg\,min}_{u} \;\Bigl\{ h(u) + \tfrac{1}{2}\|u - x\|_2^2 \Bigr\} ,
\]

which always exists and is unique (e.g., [36, Section 31]). Computing the proximal operator is the essential step in the proximal point algorithm ([28, 37]) and in proximal gradient methods (e.g., [35]). In each iteration of such algorithms, we need to compute $\operatorname{prox}_{\tau h}(\cdot)$, where $\tau > 0$ is a step size parameter. For a convex VGF $\Omega$ (for which, without loss of generality, we assume $\mathcal{M} \subset \mathbb{S}^m_+$), we have
\[
\operatorname{prox}_{\tau\Omega}(X) \;=\; \operatorname*{arg\,min}_{Y} \;\max_{M\in\mathcal{M}} \;\Bigl\{ \tfrac{1}{2}\|Y - X\|_F^2 + \tau \operatorname{tr}(Y M Y^T) \Bigr\} . \tag{4.1}
\]

If $\mathcal{M}$ is a compact convex set, one can exchange the min and max, and first solve for $Y$ in terms of any given $X$ and $M$, which gives $Y = X(I + 2\tau M)^{-1}$. Then we can find the optimal $M_0 \in \mathcal{M}$ given $X$ as
\[
M_0 \;=\; \operatorname*{arg\,min}_{M\in\mathcal{M}} \;\operatorname{tr}\bigl( X (I + 2\tau M)^{-1} X^T \bigr) ,
\]

which gives $\operatorname{prox}_{\tau\Omega}(X) = X(I + 2\tau M_0)^{-1}$. To compute the proximal operator of the conjugate function $\Omega^*$, one can use Moreau's formula (see, e.g., [36, Theorem 31.5]):
\[
\operatorname{prox}_{\tau\Omega}(X) + \tau\,\operatorname{prox}_{\tau^{-1}\Omega^*}\bigl(\tau^{-1}X\bigr) \;=\; X . \tag{4.2}
\]
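When $\mathcal{M} = \operatorname{conv}\{M_1,\ldots,M_p\}$ with each $M_i$ PSD, the minimization in (4.1) is a small convex program and can be solved directly. The following CVXPY sketch (ours, not an implementation from the paper) computes $\operatorname{prox}_{\tau\Omega}$ this way; Moreau's formula (4.2) then gives the proximal operator of $\Omega^*$ for free:

```python
import numpy as np
import cvxpy as cp

def prox_vgf(X, Ms, tau):
    """prox_{tau*Omega}(X) for Omega(Y) = max_i tr(Y M_i Y^T), each M_i PSD.

    Solves (4.1) directly; tr(Y M Y^T) = ||Y L||_F^2 with L L^T = M."""
    Ls = [np.linalg.cholesky(M + 1e-12 * np.eye(M.shape[0])) for M in Ms]
    Y = cp.Variable(X.shape)
    obj = 0.5 * cp.sum_squares(Y - X) + tau * cp.maximum(
        *[cp.sum_squares(Y @ L) for L in Ls])
    cp.Problem(cp.Minimize(obj)).solve()
    return Y.value

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 3))
Ms = [np.eye(3), np.diag([2.0, 1.0, 0.5])]
P = prox_vgf(X, Ms, tau=0.5)
# Moreau's formula (4.2): prox_{Omega^*/tau}(X/tau) = (X - P)/tau.
print(P)
```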

Next we discuss proximal operators of VGF-induced norms. Since computing the proximal operator of a norm is equivalent to projection onto the dual-norm ball, we can express the proximal operator of the norm $\|\cdot\| = \sqrt{\Omega(\cdot)}$ as
\[
\begin{aligned}
\operatorname{prox}_{\tau\|\cdot\|}(X) &\;=\; X - \Pi_{\|\cdot\|_*\le\tau}(X) \\
&\;=\; X - \operatorname*{arg\,min}_{Y} \;\min_{M, C} \;\biggl\{ \|Y - X\|_F^2 \;:\; \operatorname{tr}(C) \le \tau^2 ,\;\; \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 ,\;\; M \in \mathcal{M} \biggr\} ,
\end{aligned}
\]


where we used the representation of conjugate VGF in (3.6), and the characterization of the dualnorm is (3.8). On the other hand, using the definition of proximal operator for the dual normcomputed via (3.7) we have

prox τ¨˚pXq “ arg minY

minM,C,α

"

Y ´X2F ` τptrpCq ` αq :

M Y T

Y C

ľ 0 , 1αM PM

*

.

If the matrices $M$ are invertible, we can use the Schur complement and solve for $\alpha$ to get
\[
\operatorname{prox}_{\tau\|\cdot\|_*}(X) \;=\; \operatorname*{arg\,min}_{Y} \;\min_{M} \;\Bigl\{ \|Y - X\|_F^2 + 2\tau \bigl\| Y M^{-1/2} \bigr\|_F \;:\; M \in \mathcal{M} \Bigr\} .
\]

The computational cost involved in the above expressions for proximal operators can be high in general (they involve solving semidefinite programs). However, we may be able to simplify them for special cases of $\mathcal{M}$. For example, a fast algorithm for computing the proximal operator of the VGF associated with the set $\mathcal{M}$ defined in (2.7) is presented in [29]. For general problems, due to the convex-concave saddle-point structure in (4.1), we may use the mirror-prox algorithm [33] to obtain an inexact solution.

Left unitary invariance and QR factorization. Suppose $n \ge m$. Consider the QR decomposition of a matrix $Y = QR$, where $Q$ is an orthogonal matrix with $Q^TQ = QQ^T = I$ and $R = [R_Y^T \;\; 0]^T$ is upper triangular with $R_Y \in \mathbb{R}^{m\times m}$. Then for any VGF $\Omega$, we have
\[
\Omega(Y) \;=\; \Omega(R_Y), \qquad \Omega^*(Y) \;=\; \Omega^*(R_Y) . \tag{4.3}
\]

Notice that the semidefinite matrix involved in computing $\Omega^*(R_Y)$ via (3.6) is now $2m$-dimensional (independent of $n$). In fact, the matrix in (3.6) is PSD if and only if
\[
\begin{bmatrix} I_m & 0 \\ 0 & Q^T \end{bmatrix}
\begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix}
\begin{bmatrix} I_m & 0 \\ 0 & Q \end{bmatrix}
\;=\;
\begin{bmatrix} M & R^T \\ R & Q^T C Q \end{bmatrix}
\;\succeq\; 0 . \tag{4.4}
\]

Moreover, $\operatorname{tr}(C) = \operatorname{tr}(C')$ where $C' = Q^T C Q$, and restricting $C'$ to be zero outside its first $m\times m$ block can only reduce the objective function. Therefore, we can ignore the last $n - m$ rows and columns of the above PSD matrix to get $\Omega^*(Y) = \Omega^*(R_Y)$.

Similarly, for the proximal operators, we can simply plug in $R_X$ from the QR decomposition $X = Q[R_X^T \;\; 0]^T$ and get

\[
\operatorname{prox}_{\tau\Omega^*}(X) \;=\; Q \cdot \operatorname*{arg\,min}_{R} \;\min_{M,C} \;\biggl\{ \|R - R_X\|_F^2 + \frac{\tau}{2}\operatorname{tr}(C) \;:\; \begin{bmatrix} M & R^T \\ R & C \end{bmatrix} \succeq 0 ,\;\; M \in \mathcal{M} \biggr\} , \tag{4.5}
\]
where $R$ is restricted to be upper triangular and the semidefinite matrix is of size $2m$ instead of the size $n+m$ we had before.

More generally, notice that VGFs are left unitarily invariant. Therefore, the optimal $Y$'s in all of the optimization problems in this section have the same column space as the input matrix $X$; otherwise, a rotation as in (4.4) would produce a feasible $Y$ with a smaller value of the objective function.


5 Structured optimization with VGF

In this section, we discuss optimization algorithms for solving convex minimization problems with VGF penalties in the form of (1.6). The proximal operators of VGFs studied in the previous section are the key components of proximal gradient methods (see, e.g., [5, 6, 35]). More specifically, when the loss function $L(X)$ is smooth, we can iteratively update the variable $X^{(t)}$ as follows:
\[
X^{(t+1)} \;=\; \operatorname{prox}_{\gamma_t\Omega}\bigl( X^{(t)} - \gamma_t \nabla L(X^{(t)}) \bigr), \qquad t = 0, 1, 2, \ldots ,
\]
where $\gamma_t$ is the step size at iteration $t$. When $L(X)$ is not smooth, we can use subgradients of $L(X^{(t)})$ in the above algorithm, or use the classical subgradient method on the overall objective $L(X) + \lambda\Omega(X)$. In either case, we need to use diminishing step sizes and the convergence can be very slow. Even when the convergence is relatively fast (in terms of the number of iterations), the computational cost of the proximal operator in each iteration can be very high.

In this section, we focus on loss functions that have a special conjugate structure which can be exploited together with the structure of the VGF penalty functions. We assume that the loss function $L$ in (1.6) has the following representation:
\[
L(X) \;=\; \max_{g\in\mathcal{G}} \;\langle X, \mathcal{D}(g)\rangle - \hat L(g), \tag{5.1}
\]

where $\hat L : \mathbb{R}^p \to \mathbb{R}$ is a convex function, $\mathcal{G}$ is a convex and compact subset of $\mathbb{R}^p$, and $\mathcal{D} : \mathbb{R}^p \to \mathbb{R}^{n\times m}$ is a linear operator. This is also known as a Fenchel-type representation (see, e.g., [24]). Moreover, consider the infimal post-composition [4, Def. 12.33] of $\hat L : \mathcal{G} \to \mathbb{R}$ by $\mathcal{D}(\cdot)$, defined as
\[
(\mathcal{D} \,\triangleright\, \hat L)(Y) \;=\; \inf\,\bigl\{ \hat L(G) \;:\; \mathcal{D}(G) = Y ,\;\; G \in \mathcal{G} \bigr\} . \tag{5.2}
\]
Then the conjugate of this function is equal to $L$. The composition of a nonlinear convex loss function and a linear operator is very common in the optimization of linear predictors in machine learning (e.g., [17]), which we will demonstrate with several examples.

With the variational representation of $L$ in (5.1), we can write the VGF-penalized loss minimization problem (1.6) as a convex-concave saddle-point optimization problem:
\[
J_{\mathrm{opt}} \;=\; \min_{X} \;\max_{M\in\mathcal{M},\, g\in\mathcal{G}} \;\Bigl\{ \langle X, \mathcal{D}(g)\rangle - \hat L(g) + \lambda \operatorname{tr}(X M X^T) \Bigr\} . \tag{5.3}
\]

If $\hat L$ is smooth (while $L$ may still be nonsmooth) and the sets $\mathcal{G}$ and $\mathcal{M}$ are simple (e.g., admitting simple projections), we can solve problem (5.3) using the mirror-prox algorithm [33, 24]. In Section 5.1, we present a variant of the mirror-prox algorithm equipped with an adaptive line search scheme. Then, in Section 5.2, we present a preprocessing technique to transform problems of the form (5.3) into smaller dimensions, which can be solved more efficiently under favorable conditions.

Before diving into the algorithmic details, we examine some common loss functions and derive the corresponding representation (5.1) for them. This discussion will provide intuition about the linear operator $\mathcal{D}$ and the set $\mathcal{G}$ in relation to the data and predictions.

Norm loss. Given a norm $\|\cdot\|$ and its dual $\|\cdot\|_*$, consider the squared norm loss
\[
L(x) \;=\; \tfrac{1}{2}\|Ax - b\|^2 \;=\; \max_{g} \;\bigl\{ \langle g, Ax - b\rangle - \tfrac{1}{2}(\|g\|_*)^2 \bigr\} .
\]


In terms of the representation in (5.1), here we have $\mathcal{D}(g) = A^T g$ and $\hat L(g) = \tfrac{1}{2}(\|g\|_*)^2 + b^T g$. Similarly, a norm loss can be represented as
\[
L(x) \;=\; \|Ax - b\| \;=\; \max_{g}\,\bigl\{ \langle x, A^T g\rangle - b^T g \;:\; \|g\|_* \le 1 \bigr\} ,
\]
where we have $\mathcal{D}(g) = A^T g$, $\hat L(g) = b^T g$, and $\mathcal{G} = \{g : \|g\|_* \le 1\}$.

$\varepsilon$-insensitive loss. The Huber loss, defined as
\[
H(u) \;=\;
\begin{cases}
\frac{u^2}{2\delta} + \frac{\delta}{2} & \text{if } |u| \le \delta, \\
|u| & \text{otherwise,}
\end{cases} \tag{5.4}
\]
can be represented in the variational form $H(u) = \frac{1}{2}\min_{\theta\ge\delta} \bigl\{ \frac{u^2}{\theta} + \theta \bigr\}$. A variant of the Huber loss function (see, e.g., [32, Section 14.5.1] for more details and applications) is the $\varepsilon$-insensitive loss, given by
\[
L_\varepsilon(x) \;=\; (|x| - \varepsilon)_+ \;=\; \max_{\alpha,\beta}\,\bigl\{ \alpha(x - \varepsilon) + \beta(-x - \varepsilon) \;:\; \alpha, \beta \ge 0,\; \alpha + \beta \le 1 \bigr\} . \tag{5.5}
\]

Hinge loss for binary classification. In binary classification problems, we are given a set of training examples $(a_1, b_1), \ldots, (a_N, b_N)$, where each $a_s \in \mathbb{R}^p$ is a feature vector and $b_s \in \{+1, -1\}$ is a binary label. We would like to find $x \in \mathbb{R}^p$ such that the sign of the linear function $a_s^T x$ predicts the label $b_s$ for each $s = 1,\ldots,N$. The hinge loss $\max\{0,\, 1 - b_s(a_s^T x)\}$ returns $0$ if $b_s(a_s^T x) \ge 1$, and a positive loss that grows with the amount of violation otherwise. The average hinge loss over the whole data set can be expressed as
\[
L(x) \;=\; \frac{1}{N} \sum_{s=1}^{N} \max\bigl\{ 0,\; 1 - b_s(a_s^T x) \bigr\} \;=\; \max_{g\in\mathcal{G}} \;\langle g, \mathbf{1} - Dx\rangle ,
\]
where $D = [b_1 a_1, \ldots, b_N a_N]^T$. Here, in terms of (5.1), we have $\mathcal{G} = \{g \in \mathbb{R}^N : 0 \le g_s \le \tfrac{1}{N}\}$, $\mathcal{D}(g) = -D^T g$, and $\hat L(g) = -\mathbf{1}^T g$.
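The variational form of the average hinge loss is easy to verify: the maximizing $g$ places weight $1/N$ on every violated margin and $0$ elsewhere. A small sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 20, 5
A = rng.standard_normal((N, p))           # rows are feature vectors a_s
b = rng.choice([-1.0, 1.0], size=N)       # binary labels
x = rng.standard_normal(p)

D = b[:, None] * A                        # rows are b_s * a_s^T
margins = 1.0 - D @ x
avg_hinge = np.mean(np.maximum(0.0, margins))

# Maximizer over G = {g : 0 <= g_s <= 1/N}: weight 1/N on violated margins.
g = np.where(margins > 0, 1.0 / N, 0.0)
print(avg_hinge, g @ margins)             # the two values agree
```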

Multi-class hinge loss. For multiclass classification problems, each sample $a_s$ has a label $b_s \in \{1,\ldots,m\}$, for $s = 1,\ldots,N$. Our goal is to learn a set of classifiers $x_1,\ldots,x_m$ that can predict the labels $b_s$ correctly. For any given example $a_s$ with label $b_s$, we say the prediction made by $x_1,\ldots,x_m$ is correct if
\[
x_i^T a_s \;\ge\; x_j^T a_s \qquad \text{for all } (i,j) \in \mathcal{I}(b_s), \tag{5.6}
\]
where $\mathcal{I}(k)$, for $k = 1,\ldots,m$, characterizes the comparisons required for any example with label $k$. Here are two examples.

• Flat multiclass classification: $\mathcal{I}(k) = \{(k,j) : j \ne k\}$. In this case, the constraints in (5.6) are equivalent to $b_s = \arg\max_{i\in\{1,\ldots,m\}} x_i^T a_s$; see [41].

• Hierarchical classification: the labels $\{1,\ldots,m\}$ are organized in a tree structure, and each $\mathcal{I}(k)$ is a particular subset of the edges in the tree, depending on the class label $k$; see Section 6 and [14, 42] for further details.


Given the labeled data set $(a_1, b_1), \ldots, (a_N, b_N)$, we can optimize $X = [x_1, \ldots, x_m]$ to minimize the averaged multi-class hinge loss
\[
L(X) \;=\; \frac{1}{N} \sum_{s=1}^{N} \max\Bigl\{ 0,\;\; 1 - \max_{(i,j)\in\mathcal{I}(b_s)} \{ x_i^T a_s - x_j^T a_s \} \Bigr\} , \tag{5.7}
\]
which penalizes the amount of violation of the inequality constraints in (5.6). In order to represent the loss function in (5.7) in the form of (5.1), we need some more notation.

Let $p_k = |\mathcal{I}(k)|$, and define $E_k \in \mathbb{R}^{m\times p_k}$ as the incidence matrix for the pairs in $\mathcal{I}(k)$; i.e., each column of $E_k$, corresponding to a pair $(i,j) \in \mathcal{I}(k)$, has only two nonzero entries: $-1$ at the $i$th entry and $+1$ at the $j$th entry. Then the $p_k$ constraints in (5.6) can be summarized as $E_k^T X^T a_s \le 0$. It can be shown that the multi-class hinge loss $L(X)$ in (5.7) can be represented in the form (5.1) via
\[
\mathcal{D}(g) \;=\; -A\, E(g)^T , \qquad \hat L(g) \;=\; -\mathbf{1}^T g,
\]
where $A = [a_1 \cdots a_N]$ and $E(g) = [E_{b_1} g_1 \;\cdots\; E_{b_N} g_N] \in \mathbb{R}^{m\times N}$. Moreover, the domain of maximization in (5.1) is defined as
\[
\mathcal{G} \;=\; \mathcal{G}_{b_1} \times \cdots \times \mathcal{G}_{b_N} \qquad \text{where} \qquad \mathcal{G}_k \;=\; \bigl\{ g \in \mathbb{R}^{p_k} : g \ge 0 ,\; \mathbf{1}^T g \le \tfrac{1}{N} \bigr\} . \tag{5.8}
\]

Combining the above variational form for the multi-class hinge loss with a VGF penalty on $X$, we can reformulate the nonsmooth convex optimization problem $\min_X \{L(X) + \lambda\Omega_{\mathcal{M}}(X)\}$ as the convex-concave saddle-point problem
\[
\min_{X} \;\max_{M\in\mathcal{M},\, g\in\mathcal{G}} \;\bigl\{ \mathbf{1}^T g - \langle X, A\,E(g)^T\rangle + \lambda\operatorname{tr}(X M X^T) \bigr\} . \tag{5.9}
\]

5.1 Mirror-prox algorithm with adaptive line search

The mirror-prox (MP) algorithm was proposed by Nemirovski [34] for approximating the saddle points of smooth convex-concave functions and solutions of variational inequalities with Lipschitz-continuous monotone operators. It is an extension of the extra-gradient method [25], and further variants are studied in [23]. In the sequel, we first present a variant of the MP algorithm equipped with an adaptive line search scheme, and then explain how to apply it to solve the VGF-penalized loss minimization problem (5.3).

We describe the MP algorithm in the more general setting of solving variational inequality problems. Let $Z$ be a convex compact set in a Euclidean space $E$ equipped with inner product $\langle\cdot,\cdot\rangle$, and let $\|\cdot\|$ and $\|\cdot\|_*$ be a pair of conjugate norms on $E$, i.e., $\|\xi\|_* = \max_{z:\|z\|\le 1}\langle \xi, z\rangle$. Let $F : Z \to E$ be a Lipschitz continuous monotone mapping, i.e.,
\[
\forall\, z, z' \in Z : \quad \|F(z) - F(z')\|_* \;\le\; L\,\|z - z'\|, \tag{5.10}
\]
\[
\forall\, z, z' \in Z : \quad \langle F(z) - F(z'),\; z - z'\rangle \;\ge\; 0. \tag{5.11}
\]

The goal of the MP algorithm is to approximate a (strong) solution to the variational inequality associated with $(Z, F)$:
\[
\langle F(z^\star),\; z - z^\star\rangle \;\ge\; 0, \qquad \forall\, z \in Z.
\]

Let $\phi(x, y)$ be a smooth function that is convex in $x$ and concave in $y$, and let $X$ and $Y$ be closed convex sets. Then the convex-concave saddle point problem
\[
\min_{x\in X} \;\max_{y\in Y} \;\phi(x, y) \tag{5.12}
\]


Algorithm: Mirror-Prox$(z_1, \gamma_1, \varepsilon)$

repeat
  $t := t + 1$
  repeat
    $\gamma_t := \gamma_t / c_{\mathrm{dec}}$
    $w_t := P_{z_t}(\gamma_t F(z_t))$
    $z_{t+1} := P_{z_t}(\gamma_t F(w_t))$
  until $\delta_t \le 0$
  $\gamma_{t+1} := c_{\mathrm{inc}}\,\gamma_t$
until $V_{z_t}(z_{t+1}) \le \varepsilon$
return $\bar z_t := \bigl(\sum_{\tau=1}^{t}\gamma_\tau\bigr)^{-1} \sum_{\tau=1}^{t}\gamma_\tau w_\tau$

Figure 1: Mirror-prox algorithm with adaptive line search. Here $c_{\mathrm{dec}} > 1$ and $c_{\mathrm{inc}} > 1$ are two fixed parameters controlling how much to decrease and increase the step size $\gamma_t$ during the line search trials (the inner repeat loop). The stopping criterion for the line search is $\delta_t \le 0$, where $\delta_\tau = \gamma_\tau\langle F(w_\tau), w_\tau - z_{\tau+1}\rangle - V_{z_\tau}(z_{\tau+1})$.

can be posed as a variational inequality problem with $z = (x, y)$, $Z = X \times Y$, and
\[
F(z) \;=\; \begin{bmatrix} \nabla_x \phi(x, y) \\ -\nabla_y \phi(x, y) \end{bmatrix} . \tag{5.13}
\]

The setup of the mirror-prox algorithm requires a distance-generating function $h(z)$ that is compatible with the norm $\|\cdot\|$. In other words, $h(z)$ is subdifferentiable on the relative interior of $Z$, denoted $Z^o$, and is strongly convex with modulus 1 with respect to $\|\cdot\|$, i.e.,
\[
\forall\, z, z' \in Z : \quad \langle \nabla h(z) - \nabla h(z'),\; z - z'\rangle \;\ge\; \|z - z'\|^2. \tag{5.14}
\]

For any $z \in Z^o$ and $z' \in Z$, we can define the Bregman divergence at $z$ as
\[
V_z(z') \;=\; h(z') - h(z) - \langle \nabla h(z),\; z' - z\rangle,
\]
and the associated proximity mapping as
\[
P_z(\xi) \;=\; \operatorname*{arg\,min}_{z'\in Z} \;\bigl\{ \langle \xi, z'\rangle + V_z(z') \bigr\} \;=\; \operatorname*{arg\,min}_{z'\in Z} \;\bigl\{ \langle \xi - \nabla h(z), z'\rangle + h(z') \bigr\} .
\]

With the above definitions, we are now ready to present the MP algorithm in Figure 1. Compared with the original MP algorithm [33, 23], our variant in Figure 1 employs an adaptive line search procedure to determine the step sizes $\gamma_t$, for $t = 1, 2, \ldots$. We can exit the algorithm whenever $V_{z_t}(z_{t+1}) \le \varepsilon$ for some $\varepsilon > 0$. Under assumptions (5.10) and (5.11), the MP algorithm in Figure 1 enjoys the same $O(1/t)$ convergence rate as the one proposed in [33], but performs much faster in practice. The proof requires only simple modifications of the proofs in [33, 23], and we leave it as an exercise for the reader.
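For concreteness, the following is a minimal Euclidean instantiation of Figure 1 (our sketch): $h(z) = \tfrac12\|z\|_2^2$, so $V_z(z') = \tfrac12\|z'-z\|_2^2$ and the proximity mapping is a plain projection. The test problem, a small bilinear saddle point over boxes, is only illustrative:

```python
import numpy as np

def mirror_prox(F, project, z, gamma=1.0, eps=1e-10, c_dec=2.0, c_inc=1.2,
                max_iter=1000):
    """Mirror-prox with adaptive line search (Figure 1), Euclidean setup:
    V_z(z') = 0.5*||z'-z||^2 and P_z(xi) = project(z - xi)."""
    z_sum, g_sum = np.zeros_like(z), 0.0
    for _ in range(max_iter):
        while True:                                   # line search on gamma
            w = project(z - gamma * F(z))
            z_next = project(z - gamma * F(w))
            V = 0.5 * np.sum((z_next - z) ** 2)
            delta = gamma * F(w) @ (w - z_next) - V
            if delta <= 0:
                break
            gamma /= c_dec
        z_sum += gamma * w
        g_sum += gamma
        z = z_next
        gamma *= c_inc
        if V <= eps:
            break
    return z_sum / g_sum                              # weighted average of w_t

# min_{x in [-1,1]} max_{y in [-1,1]} x*y + 0.1*x; saddle point at (0, -0.1).
F = lambda z: np.array([z[1] + 0.1, -z[0]])           # (grad_x, -grad_y)
project = lambda z: np.clip(z, -1.0, 1.0)
print(mirror_prox(F, project, z=np.array([0.5, 0.5])))
```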

In order to solve the saddle-point problem (5.3), assuming that $\hat L$ is smooth, we can apply the MP algorithm directly. In particular, the gradient mapping in (5.13) becomes
\[
F(X, M, g) \;=\;
\begin{bmatrix}
\operatorname{vec}\bigl( 2\lambda X M + \mathcal{D}(g) \bigr) \\
-\lambda\operatorname{vec}\bigl( X^T X \bigr) \\
\operatorname{vec}\bigl( \nabla\hat L(g) - \mathcal{D}^*(X) \bigr)
\end{bmatrix} ,
\]


where $\mathcal{D}^*(\cdot)$ is the adjoint operator of $\mathcal{D}(\cdot)$. Assuming $g \in \mathbb{R}^p$, computing $F$ requires $O(nm^2 + nmp)$ operations for the matrix multiplications. In the next section, we present a method to reduce the problem size and replace $n$ by $\min\{mp, n\}$. In the case of a hinge loss for an SVM, as in our real-data numerical example, one can replace $n$ by $\min\{N, mp, n\}$, where $N$ is the number of samples.

5.2 Reduced formulation

As we discussed earlier, when the loss function has the structure (5.1), we can formulate the VGF-penalized minimization problem as a convex-concave saddle point problem
\[
J_{\mathrm{opt}} \;=\; \min_{X} \;\max_{g\in\mathcal{G}} \;\Bigl\{ \langle X, \mathcal{D}(g)\rangle - \hat L(g) + \lambda\Omega(X) \Bigr\} . \tag{5.15}
\]

Since $\mathcal{G}$ is compact, $\Omega$ is convex in $X$, and $\hat L$ is convex in $g$, we can use the minimax theorem to interchange the max and min, and then use the definition of the conjugate function $\Omega^*$ to obtain
\[
J_{\mathrm{opt}} \;=\; -\min_{g\in\mathcal{G}} \;\Bigl\{ \hat L(g) + \lambda\,\Omega^*\bigl( -\tfrac{1}{\lambda}\mathcal{D}(g) \bigr) \Bigr\} . \tag{5.16}
\]

While the dimensions of $\mathcal{D}(g)$ are the same as those of $X$, the structure of a VGF allows for making the optimization program in (5.16) independent of $n$ (the ambient dimension of the columns of $X$), yielding a reduced formulation.

First, observe that both the VGF and its conjugate depend only on the Gram matrix of their input matrix. Therefore, if possible, we can replace $\mathcal{D}(g) \in \mathbb{R}^{n\times m}$ with a smaller matrix $\mathcal{D}'(g)$ as long as $\mathcal{D}(g)^T\mathcal{D}(g) = \mathcal{D}'(g)^T\mathcal{D}'(g)$. Since $\mathcal{D}(g)$ is linear in $g$, consider a representation of the form
\[
\mathcal{D}(g) \;=\; [D_1 g \;\cdots\; D_m g] \;=\; [D_1 \;\cdots\; D_m]\,(I_m \otimes g) \;=\; D\,(I_m \otimes g),
\]
where $D_i \in \mathbb{R}^{n\times p}$ and $D \in \mathbb{R}^{n\times mp}$. Taking $B \in \mathbb{R}^{q\times mp}$ to be the reduced Cholesky factor of $D^T D \in \mathbb{R}^{mp\times mp}$, i.e., $B^T B = D^T D$ with $q = \operatorname{rank}(D)$, we can define
\[
\mathcal{D}'(g) \;:=\; B\,(I_m \otimes g) \;\in\; \mathbb{R}^{q\times m} . \tag{5.17}
\]

Notice that $q = \operatorname{rank}(D) \le mp$ can potentially be much smaller than $n$, depending on the application. In such cases, we can state the reduced reformulation of (5.16) as
\[
J_{\mathrm{opt}} \;=\; -\min_{g\in\mathcal{G}} \;\Bigl\{ \hat L(g) + \lambda\,\Omega^*\bigl( -\tfrac{1}{\lambda}\mathcal{D}'(g) \bigr) \Bigr\} , \tag{5.18}
\]

where Lp¨q is convex, G Ă Rp is convex and compact, and D1 : Rp Ñ Rqˆm is a linear operatorwith q ď mp . Observe that when Ω is convex on Rnˆm , it is also convex on Rqˆm for any q ď n(see discussions after Theorem 3.1). Now, we can either directly solve (5.18) or a smaller versionof (5.15) as

    J_opt = min_{X'∈R^{q×m}} max_{g∈G} { ⟨X', D'(g)⟩ − L(g) + λΩ(X') }.   (5.19)

Both have reduced dimensions.


Recovering X. The linear operator D' encodes the information that is essential for the loss minimization problem with a VGF penalty. This reduction allows us to carry out the computations in possibly much smaller dimensions than in the original formulation. However, we still need to recover X after solving (5.18). To this end, recall that the reduction step used the definition of the conjugate function:

    Ω*(−(1/λ) D(g)) = sup_X { ⟨X, −(1/λ) D(g)⟩ − Ω(X) }.

This implies the conjugacy relations (e.g. see [38, Proposition 11.3])

    X_opt ∈ ∂Ω*(−(1/λ) D(g_opt)),    −(1/λ) D(g_opt) ∈ ∂Ω(X_opt),

and

    Ω(X_opt) + Ω*(−(1/λ) D(g_opt)) = ⟨X_opt, −(1/λ) D(g_opt)⟩.

The subdifferential of Ω* has been characterized in Section 3.5. According to Remark 3.9, we can recover an optimal solution as

    X_opt = argmin_Z min_Y { Ω(Z) : Z = (1/2)(−(1/λ) D(g_opt) M_opt^† + Y),  Y M_opt M_opt^† = 0 }.   (5.20)

Solving this problem might in general require the same effort as directly solving (5.15). However, in some cases Y = 0 gives a valid subgradient in (3.11) and we do not need to solve the above optimization problem. Moreover, if M_opt is positive definite, we can output X_opt = −(1/(2λ)) D(g_opt) M_opt^{−1}. Notice that a VGF with M ⊂ S_{++} always lies in this category.

Summary. If n ≫ m, we can use the following steps to solve a problem of reduced size and then recover a solution to the original problem; a small code sketch of steps 1 and 3 is given after the list.

1. Compute D' from (5.17) by performing a Cholesky factorization once.

2. Solve (5.19) using the MP algorithm in Section 5.1, and keep the optimal M_opt and g_opt.

3. Compute a subgradient of Ω*(−(1/λ) D(g_opt)) as in (5.20), and output it as an optimal solution for (5.15). If M_opt is positive definite, we can output X_opt = −(1/(2λ)) D(g_opt) M_opt^{−1}.
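
The following is a minimal Python sketch of steps 1 and 3 above, under the assumption that the explicit block representation D = [D_1 ⋯ D_m] ∈ R^{n×mp} is available as a dense array; we use an eigendecomposition of D^T D in place of a pivoted Cholesky factorization, and the helper names are hypothetical.

```python
import numpy as np

def reduced_operator(D):
    """Step 1: given D = [D_1 ... D_m] in R^{n x mp}, return B with
    B^T B = D^T D, so that D'(g) = B (I_m kron g).  An eigendecomposition is
    used here instead of a pivoted Cholesky factorization, for simplicity."""
    G = D.T @ D                                   # mp x mp Gram matrix
    vals, vecs = np.linalg.eigh(G)
    keep = vals > 1e-12 * vals.max()
    B = (vecs[:, keep] * np.sqrt(vals[keep])).T   # q x mp, with B^T B = D^T D
    return B

def D_prime(B, g, m):
    """Evaluate D'(g) = B (I_m kron g) in R^{q x m}."""
    p = g.size
    return np.column_stack([B[:, i*p:(i+1)*p] @ g for i in range(m)])

def recover_X(D, g_opt, M_opt, lam):
    """Step 3 (special case): if M_opt is positive definite,
    X_opt = -(1/(2*lam)) * D(g_opt) * M_opt^{-1}."""
    m = M_opt.shape[0]
    p = g_opt.size
    Dg = np.column_stack([D[:, i*p:(i+1)*p] @ g_opt for i in range(m)])
    return -Dg @ np.linalg.inv(M_opt) / (2.0 * lam)
```

Step 2 would invoke the mirror-prox solver of Section 5.1 on the reduced problem (5.19).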

6 Numerical Examples

In this section, we discuss the application of VGFs in hierarchical classification to demonstrate the effectiveness of the presented algorithms in a real data experiment.

Let (a_1, b_1), ..., (a_N, b_N) be a set of labeled data where each a_i ∈ R^n is a feature vector and the associated b_i ∈ {1, ..., m} is a class label. The goal of a multi-class classification task is to learn a classification function f : R^n → {1, ..., m} so that, given any sample a ∈ R^n (not necessarily in the training example set), the prediction f(a) attains a small classification error (compared with the true label).

In hierarchical classification, the class labels {1, ..., m} are organized in a category tree, where the root of the tree is given the fictitious label 0 (see Figure 2a). For each node i ∈ {0, 1, ..., m}, let C(i) be the set of children of i, S(i) be the set of siblings of i, and A(i) be the set of ancestors of i, excluding 0 but including i itself.


[Figure 2 appears here. (a) An example of hierarchical classification with four class labels {1, 2, 3, 4}: the instance a is classified recursively until it reaches the leaf node b = 3, which is its predicted label. (b) Definition of the hierarchical classification function: initialize i := 0; while C(i) is not empty, set i := argmax_{j∈C(i)} x_j^T a; return i.]

A hierarchical linear classifier f(a) is defined in Figure 2b, parameterized by the vectors x_1, ..., x_m through a recursive procedure. In other words, an instance is labeled sequentially by choosing, among the siblings, the category whose associated vector outputs the largest score, until a leaf node is reached. An example of this recursive procedure is shown in Figure 2a.
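
For concreteness, the labeling rule of Figure 2b can be written in a few lines of Python; representing the category tree by a children map is our own choice and is not prescribed by the text.

```python
def hierarchical_classify(a, X, children):
    """Hierarchical linear classifier f(a) of Figure 2b.
    X is the n x m matrix [x_1 ... x_m] (columns indexed by the 1-based class
    labels) and children[i] lists the children of node i; children[0] are the
    top-level classes under the fictitious root."""
    i = 0
    while children[i]:                                        # descend until a leaf is reached
        i = max(children[i], key=lambda j: X[:, j - 1] @ a)   # pick the highest-scoring sibling
    return i
```

For the tree in Figure 2a, one would pass children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}.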

For the hierarchical classifier defined above, given any example a_s with label b_s, a correct prediction made by f(a_s) implies that (5.6) holds with

    I(k) = { (i, j) : j ∈ S(i), i ∈ A(k) }.
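
As an aside, this index set can be generated directly from the category tree; the following small helper, with hypothetical parent and children maps, is one way to do it.

```python
def index_set(k, parent, children):
    """Return I(k) = {(i, j) : j in S(i), i in A(k)}, where A(k) is the set of
    ancestors of k (excluding the root 0, including k itself) and S(i) is the
    set of siblings of i."""
    pairs = []
    i = k
    while i != 0:                      # walk up the ancestors of k, stopping before the root
        siblings = [j for j in children[parent[i]] if j != i]
        pairs.extend((i, j) for j in siblings)
        i = parent[i]
    return pairs
```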

Given a set of examples (a_1, b_1), ..., (a_N, b_N), we can train a hierarchical classifier parametrized by X = [x_1, ..., x_m] by solving the problem min_X { L(X) + λΩ(X) }, with the loss function L(X) defined in (5.7) and an appropriate VGF penalty function Ω(X). As discussed in Section 5, the training optimization problem can be reformulated as a convex-concave saddle point problem of the form (5.3) and solved by the mirror-prox algorithm described in Section 5.1. In addition, we can use the reduction procedure discussed in Section 5.2 to reduce computational cost.

As a real-world example, we consider the classification dataset Reuters Corpus Volume I, RCV1-v2 [27], which is an archive of over 800,000 manually categorized newswire stories and is available in libSVM. A subset of the hierarchy of labels in RCV1-v2, with m = 14 labels, is called CCAT and is used in our experiments. The samples and the classifiers are of dimension n = 47236.

          Nodes   Leaves   Total   Train   Test
    CCAT  14      10       59835   1797    58038

Table 1: Some statistics from RCV1-v2 [27].

We solve a VGF-regularized hinge loss minimization problem as in (5.9), with λ = 1, using the mirror-prox method. We use the ℓ2 norm as the mirror map, which requires the least knowledge about the optimization problem (see [22] for the requirements when combining different mirror maps).


With this choice of mirror map, the projection onto G in (5.8) reduces to separate projections onto N full-dimensional simplexes. Each projection can be done by zeroing out the negative entries, followed by a simple projection onto the ℓ1 unit-norm ball, which can be implemented using the procedure described in [15].
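
A minimal sketch of such a projection, in the spirit of the sorting-based procedure of [15], is given below; it illustrates the two-stage projection described above and is not the exact code used in our experiments.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection onto the l1 ball {x : ||x||_1 <= radius},
    following the sorting-based procedure of Duchi et al. [15]."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    css = np.cumsum(u)
    # largest index rho with u[rho] > (css[rho] - radius) / (rho + 1)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def project_full_simplex(g, radius=1.0):
    """Projection onto the full-dimensional simplex {g : g >= 0, sum(g) <= radius},
    done by clipping negative entries and then projecting onto the l1 ball,
    exactly as described in the text."""
    return project_l1_ball(np.maximum(g, 0.0), radius)
```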

For a comparison of the prediction error on test data of classifiers estimated with this VGF regularization, see [42].

Although there is a clear advantage of the MP method over RDA in terms of the theoretical guarantee, one should be aware of the difference between the notions of gap in the two methods. Figure 3a compares ‖X_t − X_final‖_F for MP and RDA, using each method's own final estimate X_final. In terms of runtime, we have observed that each iteration of MP takes about 3 times as long as an iteration of RDA. However, as evident from Figure 3a, MP is still much faster in producing a solution of fixed accuracy. Figure 3b illustrates the decay of the gap value for the mirror-prox method, V_{z_t}(z_{t+1}), which confirms the theoretical findings.

[Figure 3 appears here: convergence behavior of mirror-prox (MP) and RDA in our numerical experiment. (a) Average ℓ2 error of the classifiers between each iteration and the final estimate, ‖X_t − X_final‖_F, for the MP and RDA algorithms, on a logarithmic scale. (b) V_{z_t}(z_{t+1}) on a logarithmic scale. (c) The value of the loss function, relative to the final value, on a logarithmic scale.]


A Proof of Lemma 3.5

First, assume that Ω is convex. By plugging X and −X into the definition of convexity for Ω, we get Ω(X) ≥ 0. Therefore, it is well-defined to consider the square root of Ω. Now, for any given X and Y, we are going to establish the triangle inequality

    √Ω(X + Y) ≤ √Ω(X) + √Ω(Y).

If Ω(X + Y) is zero, then the inequality is trivially satisfied; hence, assume it is nonzero. Given X and Y, for any value θ ∈ (0, 1), define A = (1/θ)X and B = (1/(1−θ))Y, and use the definition of convexity for Ω to get

    Ω(X + Y) = Ω(θA + (1−θ)B) ≤ θΩ(A) + (1−θ)Ω(B) = (1/θ)Ω(X) + (1/(1−θ))Ω(Y),   (A.1)

where we used the second-order homogeneity of Ω.

• If Ω(X) ≥ Ω(Y) = 0, set θ = (Ω(X) + Ω(X + Y)) / (2Ω(X + Y)) > 0. Assuming that θ < 1, from (A.1) we get

    Ω(X + Y) ≤ (1/θ)Ω(X) + (1/(1−θ))Ω(Y) = 2Ω(X + Y)Ω(X) / (Ω(X + Y) + Ω(X)),

which is equivalent to θ ≥ 1 and is a contradiction. Therefore θ ≥ 1, which establishes the desired inequality.

• If Ω(X), Ω(Y) ≠ 0, plug θ = √Ω(X) / (√Ω(X) + √Ω(Y)) ∈ (0, 1) into (A.1) to get

    Ω(X + Y) ≤ (1/θ)Ω(X) + (1/(1−θ))Ω(Y) = (√Ω(X) + √Ω(Y))².

Since √Ω satisfies the triangle inequality, as well as absolute homogeneity, it is a semi-norm. Notice that Ω(X) = 0 does not necessarily imply X = 0. However, for a strictly convex Ω, plugging X ≠ 0 and −X into the definition of convexity gives Ω(X) > 0. Hence, √Ω will be a norm.

Now, suppose that √Ω is a semi-norm; hence it is convex and non-negative. Moreover, f defined by f(x) = x² for x ≥ 0 and f(x) = 0 for x ≤ 0 is a convex, non-decreasing function. The composition of these two functions is convex and equals Ω.
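
Spelled out (this chain is not stated explicitly above, but follows directly from the listed properties), for any θ ∈ [0, 1] the convexity of √Ω, the monotonicity of f, and the convexity of f give

    Ω(θX + (1−θ)Y) = f(√Ω(θX + (1−θ)Y)) ≤ f(θ√Ω(X) + (1−θ)√Ω(Y)) ≤ θ f(√Ω(X)) + (1−θ) f(√Ω(Y)) = θΩ(X) + (1−θ)Ω(Y).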

B Calculus of VGF Representations

Viewing a VGF as a function of the associated set M, one can derive a calculus over this argument, as given in Table 2.

    M                                      Ω_M
    αA, α > 0                              α Ω_A
    A ∪ B                                  Ω_A ∨ Ω_B
    ⋃_{α∈∆} (α_1 A : α_2 B)                Ω_A ♢ Ω_B
    A + B                                  Ω_A + Ω_B
    A : B = {A : B : A ∈ A, B ∈ B}         Ω_A □ Ω_B

Table 2: Calculus of representations for VGFs.

In the above table, we have used the following notations:


• The parallel sum [1] of two Hermitian positive semidefinite matrices is defined as A : B = B : A = A(A + B)^† B. If both A and B are non-singular, A : B = (A^{−1} + B^{−1})^{−1} (illustrated in the small sketch after this list).

• ∆ is the flat simplex.

• The infimal convolution is defined as (Ω_A □ Ω_B)(X) = inf{ Ω_A(X_1) + Ω_B(X_2) : X = X_1 + X_2 }.

• The sublevel convolution is defined as (Ω_A ♢ Ω_B)(X) = inf{ Ω_A(X_1) ∨ Ω_B(X_2) : X = X_1 + X_2 }.
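
As a small numerical illustration of the parallel sum (using a pseudoinverse, as in the definition above):

```python
import numpy as np

def parallel_sum(A, B):
    """Parallel sum A : B = A (A + B)^dagger B of two Hermitian PSD matrices [1]."""
    return A @ np.linalg.pinv(A + B) @ B

# For non-singular A and B this agrees with (A^{-1} + B^{-1})^{-1}:
A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 3.0]])
print(np.allclose(parallel_sum(A, B),
                  np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))))  # True
```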

References

[1] W. N. Anderson and R. J. Duffin. Series and parallel addition of matrices. Journal of Mathematical Analysis and Applications, 26(3):576–594, 1969.

[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, pages 1457–1465, 2012.

[3] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

[4] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Science & Business Media, 2011.

[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[6] A. Beck and M. Teboulle. Gradient-based algorithms with applications to signal recovery. In D. P. Palomar and Y. C. Eldar, editors, Convex Optimization in Signal Processing and Communications, chapter 2, pages 42–88. Cambridge University Press, 2010.

[7] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.

[8] K. H. Booth and D. Cox. Some systematic supersaturated designs. Technometrics, 4(4):489–495, 1962.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[10] J. V. Burke and T. Hoheisel. Matrix support functionals for inverse problems, regularization, and learning. SIAM Journal on Optimization, 2015.

[11] F. Burns, D. Carlson, E. Haynsworth, and T. Markham. Generalized inverse formulas using the Schur complement. SIAM Journal on Applied Mathematics, 26(2):254–259, 1974.

[12] X. Cia and X. Wang. A note on the positive semidefinite minimum rank of a sign pattern matrix. Electronic Journal of Linear Algebra, 26:345–356, 2013.

[13] C.-S. Cheng. E(s²)-optimal supersaturated designs. Statistica Sinica, 7(4):929–939, 1997.

[14] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proceedings of the 21st International Conference on Machine Learning, pages 27–34, 2004.


[15] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

[16] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, pages 615–637, 2005.

[17] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2nd edition, 2009.

[18] J.-B. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization Algorithms I: Fundamentals, volume 305. Springer Science & Business Media, 2013.

[19] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

[20] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In NIPS, volume 21, pages 745–752, 2008.

[21] D. Jayaraman, F. Sha, and K. Grauman. Decorrelating semantic visual attributes by resisting the urge to share. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1629–1636. IEEE, 2014.

[22] A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, II: Utilizing problems' structure. Optimization for Machine Learning, pages 149–183, 2011.

[23] A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, II: Utilizing problems' structure. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, chapter 6, pages 149–184. The MIT Press, Cambridge, MA, 2011.

[24] A. Juditsky and A. Nemirovski. Solving variational inequalities with monotone operators on domains given by linear minimization oracles. arXiv preprint arXiv:1312.1073, 2013.

[25] G. M. Korpelevic. An extragradient method for finding saddle points and for other problems. Ekonom. i Mat. Metody, 12(4):747–756, 1976.

[26] A. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2(1/2):173–183, 1995.

[27] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.

[28] B. Martinet. Regularisation d'inequations variationelles par approximations successives. Revue Francaise d'Informatique et de Recherche Operationelle, 4:154–159, 1970.

[29] A. M. McDonald, M. Pontil, and D. Stamos. New perspectives on k-support and cluster norms. arXiv preprint arXiv:1403.1481, 2014.

[30] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.


[31] L. Mirsky. A trace inequality of John von Neumann. Monatshefte fur Mathematik, 79(4):303–306, 1975.

[32] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[33] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

[34] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

[35] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, Ser. B, 140:125–161, 2013.

[36] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[37] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14:877–898, 1976.

[38] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[39] B. Romera-Paredes, A. Argyriou, N. Berthouze, and M. Pontil. Exploiting unrelated tasks in multi-task learning. In International Conference on Artificial Intelligence and Statistics, pages 951–959, 2012.

[40] K. Vervier, P. Mahe, A. D'Aspremont, J.-B. Veyrieras, and J.-P. Vert. On learning matrices with orthogonal columns or disjoint supports. In Machine Learning and Knowledge Discovery in Databases, pages 274–289. Springer, 2014.

[41] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN), pages 219–224, 1999.

[42] D. Zhou, L. Xiao, and M. Wu. Hierarchical classification via orthogonal transfer. In Proceedings of the 28th International Conference on Machine Learning (ICML), June 2011.
