
Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Luca Franceschi 1 2 Paolo Frasconi 3 Saverio Salzo 1 Riccardo Grazzi 1 Massimiliano Pontil 1 2

Abstract

We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by taking into explicit account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables take either the meaning of hyperparameters in a supervised learning problem or parameters of a meta-learner. We provide sufficient conditions under which solutions of the approximate problem converge to those of the exact problem. We instantiate our approach for meta-learning in the case of deep learning where representation layers are treated as hyperparameters shared across a set of training episodes. In experiments, we confirm our theoretical findings, present encouraging results for few-shot learning and contrast the bilevel approach against classical approaches for learning-to-learn.

1. Introduction

While in standard supervised learning problems we seek the best hypothesis in a given space and with a given learning algorithm, in hyperparameter optimization (HO) and meta-learning (ML) we seek a configuration so that the optimized learning algorithm will produce a model that generalizes well to new data. The search space in ML often incorporates choices associated with the hypothesis space and the features of the learning algorithm itself (e.g., how optimization of the training loss is performed). Under this common perspective, both HO and ML essentially boil down to nesting two search problems: at the inner level we seek a good hypothesis (as in standard supervised learning), while at the outer level we seek a good configuration (including a good hypothesis space) where the inner search takes place.

1 Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, Genoa, Italy. 2 Department of Computer Science, University College London, London, UK. 3 Department of Information Engineering, Università degli Studi di Firenze, Florence, Italy. Correspondence to: Luca Franceschi <luca.franceschi@iit.it>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Surprisingly, the literature on ML has little overlap with the literature on HO, and in this paper we present a unified framework encompassing both of them.

Classic approaches to HO (see e.g. Hutter et al., 2015, for a survey) have been able to manage only a relatively small number of hyperparameters, from a few dozen using random search (Bergstra and Bengio, 2012) to a few hundred using Bayesian or model-based approaches (Bergstra et al., 2013; Snoek et al., 2012). Recent gradient-based techniques for HO, however, have significantly increased the number of hyperparameters that can be optimized (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017), and it is now possible to tune as hyperparameters entire weight vectors associated with a neural network layer. In this way, it becomes feasible to design models that possibly have more hyperparameters than parameters. Such an approach is well suited for ML, since parameters are learned from a small dataset, whereas hyperparameters leverage multiple available datasets.

HO and ML only differ substantially in terms of the experimental settings in which they are evaluated. While in HO the available data is associated with a single task and split into a training set (used to tune the parameters) and a validation set (used to tune the hyperparameters), in ML we are often interested in the so-called few-shot learning setting, where data comes in the form of short episodes (small datasets with few examples per class) sampled from a common probability distribution over supervised tasks.

Early work on ML dates back at least to the 1990s (Schmidhuber, 1992; Baxter, 1995; Thrun and Pratt, 1998), but this research area has received considerable attention in the last few years, mainly driven by the need in real-life and industrial scenarios for learning quickly a vast multitude of tasks. These tasks, or episodes, may appear and evolve continuously over time and may only contain few examples (Lake et al., 2017). Different strategies have emerged to tackle ML. Although they do overlap in some aspects, it is possible to identify at least four of them. The metric strategy attempts to use training episodes to construct embeddings such that examples of the same class are mapped into similar representations.


Table 1. Links and naming conventions among different fields.

Bilevel programming    Hyperparameter optimization    Meta-learning
Inner variables        Parameters                     Parameters of ground models
Outer variables        Hyperparameters                Parameters of meta-learner
Inner objective        Training error                 Training errors on tasks (Eq. 3)
Outer objective        Validation error               Meta-training error (Eq. 4)

It has been instantiated in several variants that involve non-parametric (or instance-based) predictors (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017). In the related memorization strategy, the meta-learner learns to store and retrieve data point representations in memory. It can be implemented either using recurrent networks (Santoro et al., 2016) or temporal convolutions (Mishra et al., 2018). The use of an attention mechanism (Vaswani et al., 2017) is crucial both in (Vinyals et al., 2016) and in (Mishra et al., 2018). The initialization strategy (Ravi and Larochelle, 2017; Finn et al., 2017) uses training episodes to infer a good initial value for the model's parameters, so that new tasks can be learned quickly by fine tuning. The optimization strategy (Andrychowicz et al., 2016; Ravi and Larochelle, 2017; Wichrowska et al., 2017) forges an optimization algorithm that will find it easier to learn on novel related tasks.

A main contribution of this paper is a unified view of HO and ML within the natural mathematical framework of bilevel programming, where an outer optimization problem is solved subject to the optimality of an inner optimization problem. In HO the outer problem involves hyperparameters, while the inner problem is usually the minimization of an empirical loss. In ML the outer problem could involve a shared representation among tasks, while the inner problem could concern classifiers for individual tasks. Bilevel programming (Bard, 2013) has been suggested before in machine learning in the context of kernel methods and support vector machines (Keerthi et al., 2007; Kunapuli et al., 2008), multitask learning (Flamary et al., 2014), and more recently HO (Pedregosa, 2016), but never in the context of ML. The resulting framework, outlined in Sec. 2, encompasses some existing approaches to ML, in particular those based on the initialization and the optimization strategies.

A technical difficulty arises when the solution to the inner problem cannot be written analytically (for example, this happens when using the log-loss for training neural networks) and one needs to resort to iterative optimization approaches. As a second contribution, we provide in Sec. 3 sufficient conditions that guarantee good approximation properties. We observe that these conditions are reasonable and apply to concrete problems relevant to applications.

In Sec. 4, by taking inspiration from early work on representation learning in the context of multi-task and meta-learning (Baxter, 1995; Caruana, 1998), we instantiate the framework for ML in a simple way, treating the weights of the last layer of a neural network as the inner variables and the remaining weights, which parametrize the representation mapping, as the outer variables. As shown in Sec. 5, the resulting ML algorithm performs well in practice, outperforming most of the existing strategies on MiniImagenet.

2. A Bilevel Optimization Framework

In this paper we consider bilevel optimization problems (see e.g. Colson et al., 2007) of the form

min { f(λ) : λ ∈ Λ },    (1)

where the function f : Λ → ℝ is defined at λ ∈ Λ as

f(λ) = inf { E(w_λ, λ) : w_λ ∈ arg min_{u ∈ ℝ^d} L_λ(u) }.    (2)

We call E : ℝ^d × Λ → ℝ the outer objective and, for every λ ∈ Λ, we call L_λ : ℝ^d → ℝ the inner objective. Note that {L_λ : λ ∈ Λ} is a class of objective functions parameterized by λ. Specific instances of this problem include HO and ML, which we discuss next. Table 1 outlines the links among bilevel programming, HO and ML.

2.1. Hyperparameter Optimization

In the context of hyperparameter optimization, we are interested in minimizing the validation error of a model g_w : X → Y, parameterized by a vector w, with respect to a vector of hyperparameters λ. For example, we may consider representation or regularization hyperparameters that control the hypothesis space or penalties, respectively. In this setting, a prototypical choice for the inner objective is the regularized empirical error

L_λ(w) = ∑_{(x,y)∈D_tr} ℓ(g_w(x), y) + Ω_λ(w),

where D_tr = {(x_i, y_i)}_{i=1}^n is a set of input/output points, ℓ is a prescribed loss function and Ω_λ a regularizer parameterized by λ. The outer objective represents a proxy for the generalization error of g_w, and it may be given by the average loss on a validation set D_val,

E(w, λ) = ∑_{(x,y)∈D_val} ℓ(g_w(x), y),

or, in more generality, by a cross-validation error, as detailed in Appendix B. Note that in this setting the outer objective E does not depend explicitly on the hyperparameters λ, since in HO λ is instrumental in finding a good model g_w, which is our final goal.

Figure 1. Each blue line represents the average training error when varying w in g_{w,λ}, and the corresponding inner minimizer is shown as a blue dot. The validation error evaluated at each minimizer yields the black curve, representing the outer objective f(λ), whose minimizer is shown as a red dot. (Axes: w vs. error; blue curves: train errors; black curve: validation error.)

As a more specific example, consider linear models g_w(x) = ⟨w, x⟩, let ℓ be the square loss and let Ω_λ(w) = λ‖w‖², in which case the inner objective is ridge regression (Tikhonov regularization) and the bilevel problem optimizes, over the regularization parameter, the validation error of ridge regression.
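As a concrete illustration of this instance of Problem (1)-(2), the following minimal sketch (ours, not the authors' code; the matrices X_tr, y_tr, X_val, y_val are assumed given) exploits the closed-form ridge solution, so the validation error can be differentiated directly with respect to the regularization parameter λ:

import jax
import jax.numpy as jnp

def inner_solution(lam, X_tr, y_tr):
    # w_lam = argmin_w ||X_tr w - y_tr||^2 + lam * ||w||^2  (ridge regression)
    d = X_tr.shape[1]
    return jnp.linalg.solve(X_tr.T @ X_tr + lam * jnp.eye(d), X_tr.T @ y_tr)

def f(lam, X_tr, y_tr, X_val, y_val):
    # outer objective: validation error of the inner minimizer, E(w_lam, lam)
    w = inner_solution(lam, X_tr, y_tr)
    return jnp.sum((X_val @ w - y_val) ** 2)

hypergradient = jax.grad(f)  # exact df/dlam, since w_lam is available in closed form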

2.2. Meta-Learning

In meta-learning (ML) the inner and outer objectives are computed by averaging a training and a validation error over multiple tasks, respectively. The goal is to produce a learning algorithm that will work well on novel tasks¹. For this purpose, we have available a meta-training set D = {D^j = D^j_tr ∪ D^j_val}_{j=1}^N, which is a collection of datasets sampled from a meta-distribution P. Each dataset D^j = {(x^j_i, y^j_i)}_{i=1}^{n_j}, with (x^j_i, y^j_i) ∈ X × Y^j, is linked to a specific task. Note that the output space is task dependent (e.g., a multi-class classification problem with variable number of classes). The model for each task is a function g_{w^j,λ} : X → Y^j, identified by a parameter vector w^j and hyperparameters λ. A key point here is that λ is shared between the tasks. With this notation, the inner and outer objectives are

L_λ(w) = ∑_{j=1}^N L^j(w^j, λ, D^j_tr),    (3)

E(w, λ) = ∑_{j=1}^N L^j(w^j, λ, D^j_val),    (4)

respectively. The loss L^j(w^j, λ, S) represents the empirical error of the pair (w^j, λ) on a set of examples S. Note that the inner and outer losses for task j use different train/validation splits of the corresponding dataset D^j.

¹The ML problem is also related to multitask learning; however, in ML the goal is to extrapolate from the given tasks.

Furthermore, unlike in HO, in ML the final goal is to find a good λ and the w^j are now instrumental.

The cartoon in Figure 1 illustrates ML as a bilevel problem. The parameter λ indexes a hypothesis space within which the inner objective is minimized. A particular example, detailed in Sec. 4, is to choose the model g_{w,λ}(x) = ⟨w, h_λ(x)⟩, in which case λ parameterizes a feature mapping. Yet another choice would be to consider g_{w^j,λ}(x) = ⟨w + λ, x⟩, in which case λ represents a common model around which task-specific models are to be found (see e.g. Evgeniou et al., 2005; Finn et al., 2017; Khosla et al., 2012; Kuzborskij et al., 2013, and references therein).

2.3. Gradient-Based Approach

We now discuss a general approach to solve Problem (1)-(2) when the hyperparameter vector λ is real-valued. To simplify our discussion, let us assume that the inner objective has a unique minimizer w_λ. Even in this simplified scenario, Problem (1)-(2) remains challenging to solve. Indeed, in general there is no closed form expression for w_λ, so it is not possible to directly optimize the outer objective function. While a possible strategy (implicit differentiation) is to apply the implicit function theorem to ∇L_λ = 0 (Pedregosa, 2016; Koh and Liang, 2017; Beirami et al., 2017), another compelling approach is to replace the inner problem with a dynamical system. This point, discussed in (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017), is developed further in this paper.

Specifically, we let [T] = {1, . . . , T}, where T is a prescribed positive integer, and consider the following approximation of Problem (1)-(2):

min_λ f_T(λ) = E(w_{T,λ}, λ),    (5)

where E is a smooth scalar function and²

w_{0,λ} = Φ_0(λ),   w_{t,λ} = Φ_t(w_{t−1,λ}, λ),   t ∈ [T],    (6)

with Φ_0 : ℝ^m → ℝ^d a smooth initialization mapping and, for every t ∈ [T], Φ_t : ℝ^d × ℝ^m → ℝ^d a smooth mapping that represents the operation performed by the t-th step of an optimization algorithm. For example, the optimization dynamics could be gradient descent: Φ_t(w_t, λ) = w_t − η_t ∇L_λ(w_t), where (η_t)_{t∈[T]} is a sequence of step sizes.
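The dynamics (5)-(6) translate almost literally into an autodiff framework. The sketch below (our own illustration with assumed names and signatures, not the paper's FAR-HO implementation) unrolls T gradient-descent steps on a generic JAX-differentiable inner loss and obtains the gradient of f_T by reverse-mode differentiation through the whole trajectory:

import jax

def f_T(lam, w0, inner_loss, outer_loss, T=100, eta=0.1):
    # w_0 = Phi_0(lam); here the initialization is simply a given w0
    w = w0
    grad_w = jax.grad(inner_loss, argnums=0)
    for _ in range(T):
        # Phi_t(w, lam) = w - eta * grad_w L_lam(w), the gradient-descent dynamics of Eq. (6)
        w = w - eta * grad_w(w, lam)
    return outer_loss(w, lam)            # f_T(lam) = E(w_{T,lam}, lam), Eq. (5)

hypergrad = jax.grad(f_T, argnums=0)     # reverse-mode differentiation through the unrolled dynamics

A fixed step size η is used here for simplicity; making the step sizes part of λ, as discussed next, only requires passing them through lam.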

The approximation of the bilevel problem (1)-(2) by the procedure (5)-(6) raises the issue of the quality of this approximation, and we return to this issue in the next section.

²In general, the algorithm used to minimize the inner objective may involve auxiliary variables (e.g., velocities when using gradient descent with momentum), so w should be intended as a larger vector containing both model parameters and auxiliary variables.


However, it also suggests to consider the inner dynamics as a form of approximate empirical error minimization (e.g. early stopping), which is valid in its own right. From this perspective – conversely to the implicit differentiation strategy – it is possible to include among the components of λ variables which are associated with the optimization algorithm itself. For example, λ may include the step sizes or momentum factors if the dynamics Φ_t in Eq. (6) is gradient descent with momentum; in (Andrychowicz et al., 2016; Wichrowska et al., 2017) the mapping Φ_t is implemented as a recurrent neural network, while (Finn et al., 2017) focus on the initialization mapping by letting Φ_0(λ) = λ.

A major advantage of this reformulation is that it makes it possible to compute efficiently the gradient of f_T, which we call the hypergradient, either in time or in memory (Maclaurin et al., 2015; Franceschi et al., 2017), by making use of reverse or forward mode algorithmic differentiation (Griewank and Walther, 2008; Baydin et al., 2017). This allows us to optimize a number of hyperparameters of the same order as that of parameters, a situation which arises in ML.

3. Exact and Approximate Bilevel Programming

In this section we provide results about the existence of solutions of Problem (1)-(2) and the approximation properties of Procedure (5)-(6) with respect to the original bilevel problem. Proofs of these results are provided in the supplementary material.

Procedure (5)-(6), though related to the bilevel problem (1)-(2), may not be in general a good approximation of it. Indeed, making the assumptions (which sound perfectly reasonable) that, for every λ ∈ Λ, w_{T,λ} → w_λ for some w_λ ∈ arg min L_λ, and that E(·, λ) is continuous, one can only assert that lim_{T→∞} f_T(λ) = E(w_λ, λ) ≥ f(λ). This is because the optimization dynamics converge to some minimizer of the inner objective L_λ, but not necessarily to the one that also minimizes the function E. This is illustrated in Figure 2. The situation is, however, different if the inner problem admits a unique minimizer for every λ ∈ Λ. Indeed, in this case, it is possible to show that the set of minimizers of the approximate problems converges, as T → +∞ and in an appropriate sense, to the set of minimizers of the bilevel problem. More precisely, we make the following assumptions:

(i) Λ is a compact subset of ℝ^m;

(ii) E : ℝ^d × Λ → ℝ is jointly continuous;

(iii) the map (w, λ) ↦ L_λ(w) is jointly continuous and such that arg min L_λ is a singleton for every λ ∈ Λ;

(iv) w_λ = arg min L_λ remains bounded as λ varies in Λ.

Figure 2. In this cartoon, for a fixed λ, arg min L_λ = {w^(1)_λ, w^(2)_λ}; the iterates of an optimization mapping Φ could converge to w^(1)_λ, with E(w^(1)_λ, λ) > E(w^(2)_λ, λ).

Then, problem (1)-(2) becomes

min_{λ∈Λ} f(λ) = E(w_λ, λ),   w_λ = arg min_u L_λ(u).    (7)

Under the above assumptions, in the following we give results about the existence of solutions of problem (7) and the (variational) convergence of the approximate problems (5)-(6) towards problem (7), relating the minima as well as the sets of minimizers. In this respect, we note that, since both f and f_T are nonconvex, arg min f_T and arg min f are in general not singletons, so an appropriate definition of set convergence is required.

Theorem 3.1 (Existence). Under Assumptions (i)-(iv), problem (7) admits solutions.

Proof. See Appendix A.

The result below follows from general facts on the stability of minimizers in optimization problems (Dontchev and Zolezzi, 1993).

Theorem 3.2 (Convergence). In addition to Assumptions (i)-(iv), suppose that

(v) E(·, λ) is uniformly Lipschitz continuous;

(vi) the iterates (w_{T,λ})_{T∈ℕ} converge uniformly to w_λ on Λ as T → +∞.

Then

(a) inf f_T → inf f;

(b) arg min f_T → arg min f, meaning that, for every (λ_T)_{T∈ℕ} such that λ_T ∈ arg min f_T, we have that:
    - (λ_T)_{T∈ℕ} admits a convergent subsequence;
    - for every subsequence (λ_{K_T})_{T∈ℕ} such that λ_{K_T} → λ̄, we have λ̄ ∈ arg min f.

Proof. See Appendix A.

We stress that assumptions (i)-(vi) are very natural and satisfied by many problems of practical interest. Thus, the above results provide full theoretical justification to the proposed approximate procedure (5)-(6). The following remark discusses assumption (vi), while the subsequent example will be relevant to the experiments in Sec. 5.


Algorithm 1 Reverse-HG for Hyper-representation

Input: λ, current values of the hyperparameter; T, number of iterations of GD; η, ground learning rate; B, mini-batch of episodes from D
Output: Gradient of meta-training error w.r.t. λ on B

for j = 1 to |B| do
    w^j_0 = 0
    for t = 1 to T do
        w^j_t ← w^j_{t−1} − η ∇_w L^j(w^j_{t−1}, λ, D^j_tr)
    α^j_T ← ∇_w L^j(w^j_T, λ, D^j_val)
    p^j ← ∇_λ L^j(w^j_T, λ, D^j_val)
    for t = T − 1 downto 0 do
        p^j ← p^j − α^j_{t+1} η ∇_λ ∇_w L^j(w^j_t, λ, D^j_tr)
        α^j_t ← α^j_{t+1} [I − η ∇_w ∇_w L^j(w^j_t, λ, D^j_tr)]
return ∑_j p^j
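For readers who prefer code to pseudocode, the recursion of Algorithm 1 for a single episode can be expressed with vector-Jacobian products. The sketch below is an illustration under our own naming, not the authors' implementation; it assumes a JAX-differentiable loss loss(w, lam, data) and a zero initialization w0, and returns the contribution p^j of one episode (the meta-gradient is the sum over the mini-batch):

import jax

def reverse_hg_episode(lam, w0, loss, D_tr, D_val, T, eta):
    grad_w = jax.grad(loss, argnums=0)
    # forward pass: T steps of gradient descent on the training split, storing the iterates w_0..w_T
    ws = [w0]
    for _ in range(T):
        ws.append(ws[-1] - eta * grad_w(ws[-1], lam, D_tr))
    # backward pass: alpha_T = grad_w L(w_T, lam, D_val); p accumulates the hypergradient
    alpha = grad_w(ws[T], lam, D_val)
    p = jax.grad(loss, argnums=1)(ws[T], lam, D_val)
    for t in range(T - 1, -1, -1):
        step = lambda w, l: w - eta * grad_w(w, l, D_tr)   # the map Phi_{t+1}(w, lam)
        _, vjp = jax.vjp(step, ws[t], lam)
        d_w, d_lam = vjp(alpha)   # alpha [I - eta grad_w grad_w L]  and  -alpha eta grad_lam grad_w L
        p = p + d_lam             # p <- p - alpha_{t+1} eta grad_lam grad_w L^j(w_t, lam, D^j_tr)
        alpha = d_w               # alpha_t <- alpha_{t+1} [I - eta grad_w grad_w L^j(w_t, lam, D^j_tr)]
    return p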

Remark 3.3. If L_λ is strongly convex, then many gradient-based algorithms (e.g. standard and accelerated gradient descent) yield linear convergence of the iterates w_{T,λ}. Moreover, in such cases, the rate of linear convergence is of type (ν_λ − μ_λ)/(ν_λ + μ_λ), where ν_λ and μ_λ are the Lipschitz constant of the gradient and the modulus of strong convexity of L_λ, respectively. So this rate can be uniformly bounded from above by ρ ∈ ]0, 1[, provided that sup_{λ∈Λ} ν_λ < +∞ and inf_{λ∈Λ} μ_λ > 0. Thus, in these cases, w_{T,λ} converges uniformly to w_λ on Λ (at a linear rate).

Example 3.4. Let us consider the following form of the inner objective:

L_H(w) = ‖y − XHw‖² + ρ‖w‖²,    (8)

where ρ > 0 is a fixed regularization parameter and H ∈ ℝ^{d×d} is the hyperparameter, representing a linear feature map. L_H is strongly convex with modulus μ = ρ > 0 (independent of the hyperparameter H) and Lipschitz smooth with constant ν_H = 2‖(XH)ᵀXH + ρI‖, which is bounded from above if H ranges in a bounded set of square matrices. In this case, assumptions (i)-(vi) are satisfied.

4. Learning Hyper-Representations

In this section we instantiate the bilevel programming approach for ML outlined in Sec. 2.2 in the case of deep learning, where representation layers are shared across episodes. Finding good data representations is a centerpiece in machine learning. Classical approaches (Baxter, 1995; Caruana, 1998) learn both the weights of the representation mapping and those of the ground classifiers jointly on the same data. Here, we follow the bilevel approach and split each dataset/episode in training and validation sets.

Our method involves the learning of a cross-task intermediate representation h_λ : X → ℝ^k (parametrized by a vector λ), on top of which task-specific models g^j : ℝ^k → Y^j (parametrized by vectors w^j) are trained. The final ground model for task j is thus given by g^j ∘ h. To find λ, we solve Problem (1)-(2) with inner and outer objectives as in Eqs. (3) and (4), respectively. Since in general this problem cannot be solved exactly, we instantiate the approximation scheme in Eqs. (5)-(6) as follows:

min_λ f_T(λ) = ∑_{j=1}^N L^j(w^j_T, λ, D^j_val),    (9)

w^j_t = w^j_{t−1} − η ∇_w L^j(w^j_{t−1}, λ, D^j_tr),   t ∈ [T], j ∈ [N].    (10)

Starting from an initial value, the weights of the task-specific models are learned by T iterations of gradient descent. The gradient of f_T can be computed efficiently in time by making use of an extended reverse-hypergradient procedure (Franceschi et al., 2017), which we present in Algorithm 1. Since in general the number of episodes in a meta-training set is large, we compute a stochastic approximation of the gradient of f_T by sampling a mini-batch of episodes. At test time, given a new episode D, the representation h is kept fixed, and all the examples in D are used to tune the weights w of the episode-specific model g.
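A compact way to picture the whole procedure is given below. This is a sketch with assumed shapes and names (a representation function h(lam, X), episodes given as train/validation splits), not the released code: each episode trains a linear-softmax classifier on top of h_λ for T steps and contributes its validation loss to f_T, whose gradient with respect to λ is the stochastic meta-gradient.

import jax
import jax.numpy as jnp

def episode_loss(w, lam, X, y, h):
    # cross-entropy of the ground model: logits are h_lam(x) @ w
    logits = h(lam, X) @ w
    return -jnp.mean(jax.nn.log_softmax(logits)[jnp.arange(y.shape[0]), y])

def episode_f_T(lam, episode, h, T=5, eta=0.1, k=64, n_classes=5):
    X_tr, y_tr, X_val, y_val = episode
    w = jnp.zeros((k, n_classes))                 # w^j_0 = 0, as in Algorithm 1
    for _ in range(T):                            # Eq. (10): T GD steps on the episode's training split
        w = w - eta * jax.grad(episode_loss)(w, lam, X_tr, y_tr, h)
    return episode_loss(w, lam, X_val, y_val, h)  # the episode's term in Eq. (9)

def meta_batch_f_T(lam, batch, h):
    return sum(episode_f_T(lam, ep, h) for ep in batch)

meta_grad = jax.grad(meta_batch_f_T)              # stochastic hypergradient over a mini-batch of episodes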

Like other initialization and optimization strategies for ML, our method does not require lookups in a support set as the memorization and metric strategies do (Santoro et al., 2016; Vinyals et al., 2016; Mishra et al., 2018). Unlike (Andrychowicz et al., 2016; Ravi and Larochelle, 2017), we do not tune the optimization algorithm, which in our case is plain empirical loss minimization by gradient descent, and rather focus on the hypothesis space. Unlike (Finn et al., 2017), that aims at maximizing sensitivity of new task losses to the model parameters, we aim at maximizing the generalization to novel examples during training episodes with respect to λ. Our assumptions about the structure of the model are slightly stronger than in (Finn et al., 2017) but still mild, namely that some (hyper)parameters define the representation and the remaining parameters define the classification function. In (Munkhdalai and Yu, 2017) the meta-knowledge is distributed among fast and slow weights and an external memory; our approach is more direct, since the meta-knowledge is solely distilled by λ. A further advantage of our method is that, if the episode-specific models are linear (e.g. logistic regressors) and each loss L^j is strongly convex in w, the theoretical guarantees of Theorem 3.2 apply (see Remark 3.3). These assumptions are satisfied in the experiments reported in the next section.

5. Experiments

The aim of the following experiments is threefold. First, we investigate the impact of the number of iterations of the optimization dynamics on the quality of the solution on a simple multiclass classification problem. Second, we test our hyper-representation method in the context of few-shot learning on two benchmark datasets. Finally, we contrast the bilevel ML approach against classical approaches to learn shared representations.³

Figure 3. Optimization of the outer objectives f and f_T for exact and approximate problems (square error vs. number of hyperiterations; curves for the exact solution and for T = 1, 4, 16, 64, 256). The optimization of H is performed with gradient descent with momentum, with the same initialization, step size and momentum factor for each run.


5.1. The Effect of T

Motivated by the theoretical findings of Sec. 3, we empirically investigate how solving the inner problem approximately (i.e. using small T) affects convergence, generalization performances, and running time. We focus in particular on the linear feature map described in Example 3.4, which allows us to compare the approximated solution against the closed-form analytical solution, given by

w_H = [(XH)ᵀXH + ρI]⁻¹ (XH)ᵀ Y.

In this setting, the bilevel problem reduces to a (non-convex) optimization problem in H.
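The comparison of this section can be reproduced in a few lines. The sketch below is our own (with X, Y denoting design and one-hot target matrices and ρ assumed fixed), not the experimental code; it evaluates both the exact outer objective f(H), via the closed-form solution above, and its approximation f_T(H) obtained with T unrolled gradient steps.

import jax
import jax.numpy as jnp

def f_exact(H, X_tr, Y_tr, X_val, Y_val, rho=0.1):
    Z = X_tr @ H
    W = jnp.linalg.solve(Z.T @ Z + rho * jnp.eye(H.shape[1]), Z.T @ Y_tr)   # closed-form w_H
    return jnp.sum((X_val @ H @ W - Y_val) ** 2)                            # f(H)

def f_approx(H, X_tr, Y_tr, X_val, Y_val, rho=0.1, T=16, eta=0.01):
    inner = lambda W: jnp.sum((X_tr @ H @ W - Y_tr) ** 2) + rho * jnp.sum(W ** 2)   # L_H(w), Eq. (8)
    W = jnp.zeros((H.shape[1], Y_tr.shape[1]))
    for _ in range(T):
        W = W - eta * jax.grad(inner)(W)
    return jnp.sum((X_val @ H @ W - Y_val) ** 2)                            # f_T(H)

# both objectives can be minimized over H by descending jax.grad(f_exact) or jax.grad(f_approx)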

We use a subset of 100 classes extracted from the Omniglot dataset (Lake et al., 2015) to construct a HO problem aimed at tuning H. A training set D_tr and a validation set D_val, each consisting of three randomly drawn examples per class, were sampled to form the HO problem. A third set D_test, consisting of fifteen examples per class, was used for testing. Instead of using raw images as input, we employ feature vectors x ∈ ℝ^256 computed by the convolutional network trained on the one-shot, five-way ML setting, as described in Sec. 5.2.

For the approximate problems, we compute the hypergradient using Algorithm 1, where it is intended that B = {(D_tr, D_val)}. Figure 3 shows the values of the functions f and f_T (see Eqs. (1) and (5), respectively) during the optimization of H. As T increases, the solution of the approximate problem approaches the true bilevel solution.

³The code for reproducing the experiments, based on the package FAR-HO (https://bit.ly/far-ho), is available at https://bit.ly/hyper-repr.

Figure 4. Accuracy on D_test of exact and approximated solutions during optimization of H (accuracy vs. number of hyperiterations; curves for the exact solution and for T = 1, 4, 16, 64, 256). Training and validation accuracies reach almost 100% already for T = 4 and after a few hundred hyperiterations, and are therefore not reported.

Table 2. Execution times on an NVidia Tesla M40 GPU.

T            1     4     16    64     256    Exact
Time (sec)   60    119   356   1344   5532   320

However, performing a small number of gradient descent steps for solving the inner problem acts as an implicit regularizer. As is evident from Figure 4, the generalization error is better when T is smaller than the value yielding the best approximation of the inner solution. This is to be expected, since in this setting the dimensions of parameters and hyperparameters are of the same order, leading to a concrete possibility of overfitting the outer objective (validation error). An appropriate, problem dependent, choice of T may help avoiding this issue (see also Appendix C). As T increases, the number of hyperiterations required to reach the maximum test accuracy decreases, further suggesting that there is an interplay between the number of iterations used to solve the inner and the outer objective. Finally, the running time of Algorithm 1 is linear in T and the size of w and independent of the size of H (see also Table 2), making it even more appealing to reduce the number of iterations.

5.2. Few-shot Learning

We now turn our attention to learning-to-learn, precisely to few-shot supervised learning, implementing the ML strategy outlined in Sec. 4 on two different benchmark datasets:

• OMNIGLOT (Lake et al., 2015), a dataset that contains examples of 1623 different handwritten characters from 50 alphabets. We downsample the images to 28 × 28.

• MINIIMAGENET (Vinyals et al., 2016), a subset of ImageNet (Deng et al., 2009), that contains 60000 downsampled images from 100 different classes.

Following the experimental protocol used in a number of recent works, we build a meta-training set D, from which we sample datasets to solve Problem (9)-(10), a meta-validation set V for tuning ML hyperparameters, and finally a meta-test set T, which is used to estimate accuracy. Operationally, each meta-dataset consists of a pool of samples belonging to different (non-overlapping between separate meta-datasets) classes, which can be combined to form ground classification datasets D^j = D^j_tr ∪ D^j_val with 5 or 20 classes (for Omniglot). The D^j_tr's contain 1 or 5 examples per class, which are used to fit w^j (see Eq. 10). The D^j_val's, containing 15 examples per class, are used either to compute f_T(λ) (see Eq. (9)) and its (stochastic) gradient, if D^j ∈ D, or to provide a generalization score, if D^j comes from either V or T. For MiniImagenet we use the same split and images proposed in (Ravi and Larochelle, 2017), while for Omniglot we use the protocol defined by (Santoro et al., 2016).

As ground classifiers we use multinomial logistic regressors, and as task losses ℓ^j we employ cross-entropy. The inner problems, being strongly convex, admit unique minimizers, yet require numerical computation of the solutions. We initialize the ground models' parameters w^j to 0 and, according to the observation in Sec. 5.1, we perform T gradient descent steps, where T is treated as a ML hyperparameter that has to be validated. Figure 6 shows an example of meta-validation of T for one-shot learning on MiniImagenet. We compute a stochastic approximation of ∇f_T(λ) with Algorithm 1 and use Adam with decaying learning rate to optimize λ.

Regarding the specific implementation of the representation mapping h, we employ for Omniglot a four-layer convolutional neural network with strided convolutions and 64 filters per layer, as in (Vinyals et al., 2016) and other successive works. For MiniImagenet we tried two different architectures:

• C4L, a four-layer convolutional neural network with max-pooling and 32 filters per layer;

• RN, a residual network (He et al., 2016) built of four residual blocks followed by two convolutional layers.

The first network architecture has been proposed in (Ravi and Larochelle, 2017) and then used in (Finn et al., 2017), while a similar residual network architecture has been employed in a more recent work (Mishra et al., 2018). Further details on the architectures of h, as well as other ML hyperparameters, are specified in the supplementary material. We report our results, using RN for MiniImagenet, in Table 3, alongside scores from various recently proposed methods for comparison.

The proposed method achieves competitive results, highlighting the relative importance of learning a task independent representation, on top of which logistic classifiers trained with very few samples generalize well. Moreover, utilizing more expressive models such as residual networks as representation mappings is beneficial for our proposed strategy and, unlike other methods, does not result in overfitting of the outer objective, as reported in (Mishra et al., 2018). Indeed, compared to C4L, RN achieves a relative improvement of 6.5% on one-shot and 4.2% on five-shot. Figure 5 provides a visual example of the goodness of the learned representation, showing that MiniImagenet examples (the first from the meta-training and the second from the meta-testing sets) from similar classes (different dog breeds) are mapped near each other by h, and conversely samples from dissimilar classes are mapped afar.

Figure 5. After sampling two datasets D ∈ D and D′ ∈ T, we show on the top the two images x ∈ D, x′ ∈ D′ that minimize ‖h_λ(x) − h_λ(x′)‖ and on the bottom those that maximize it. In between each of the two couples, we compare a random subset of components of h_λ(x) (blue) and h_λ(x′) (green).

Figure 6. Meta-validation of the number of gradient descent steps (T) of the ground models for MiniImagenet, using the RN representation (panels: accuracy on D_tr, D_val and D_test for T = 3, 5, 8, 12). Early stopping on the accuracy on the meta-validation set during meta-training resulted in halting the optimization of λ after 42k, 40k, 22k and 15k hyperiterations for T equal to 3, 5, 8 and 12, respectively, in line with our observation in Sec. 5.1.

5.3. On Variants of Representation Learning Methods

In this section, we show the benefits of learning a representation within the proposed bilevel framework compared to other possible approaches that involve an explicit factorization of a classifier as g^j ∘ h. The representation mapping h is either pretrained or learned with different meta-learning algorithms. We focus on the problem of one-shot learning on MiniImagenet and we use C4L as the architecture for the representation mapping. In all the experiments, the ground models g^j are multinomial logistic regressors as in Sec. 5.2, tuned with 5 steps of gradient descent. We ran the following experiments:



Table 3. Accuracy scores, computed on episodes from T, of various methods on 1-shot and 5-shot classification problems on Omniglot and MiniImagenet. For MiniImagenet, 95% confidence intervals are reported. For Hyper-representation the scores are computed over 600 randomly drawn episodes. For other methods we show results as reported by their respective authors.

                                            OMNIGLOT 5 classes     OMNIGLOT 20 classes    MINIIMAGENET 5 classes
Method                                      1-shot     5-shot      1-shot     5-shot      1-shot           5-shot
Siamese nets (Koch et al., 2015)            97.3       98.4        88.2       97.0        −                −
Matching nets (Vinyals et al., 2016)        98.1       98.9        93.8       98.5        43.44 ± 0.77     55.31 ± 0.73
Neural stat. (Edwards and Storkey, 2016)    98.1       99.5        93.2       98.1        −                −
Memory mod. (Kaiser et al., 2017)           98.4       99.6        95.0       98.6        −                −
Meta-LSTM (Ravi and Larochelle, 2017)       −          −           −          −           43.56 ± 0.84     60.60 ± 0.71
MAML (Finn et al., 2017)                    98.7       99.9        95.8       98.9        48.70 ± 1.75     63.11 ± 0.92
Meta-networks (Munkhdalai and Yu, 2017)     98.9       −           97.0       −           49.21 ± 0.96     −
Prototypical Net (Snell et al., 2017)       98.8       99.7        96.0       98.9        49.42 ± 0.78     68.20 ± 0.66
SNAIL (Mishra et al., 2018)                 99.1       99.8        97.6       99.4        55.71 ± 0.99     68.88 ± 0.92
Hyper-representation                        98.6       99.5        95.5       98.4        50.54 ± 0.85     64.53 ± 0.68

• Multiclass: the mapping h : X → ℝ^64 is given by the linear outputs before the softmax operation of a network⁴ pretrained on the totality of examples contained in the training meta-dataset (600 examples for each of the 64 classes). In this setting, we found that using the second-to-last layer or the output after the softmax yields worse results.

• Bilevel-train: we use a bilevel approach but, unlike in Sec. 4, we optimize the parameter vector λ of the representation mapping by minimizing the loss on the training sets of each episode. The hypergradient is still computed with Algorithm 1, albeit we set D^j_val = D^j_tr for each training episode.

• Approx and Approx-train: we consider an approximation of the hypergradient ∇f_T(λ) by disregarding the optimization dynamics of the inner objectives (i.e. we set ∇_λ w^j_T = 0). In Approx-train we just use the training sets.

• Classic: as in (Baxter, 1995), we learn h by jointly optimizing f(λ, w^1, . . . , w^N) = ∑_{j=1}^N L^j(w^j, λ, D^j_tr) and treat the problem as standard multitask learning, with the exception that we evaluate f on mini-batches of 4 episodes, randomly sampled every 5 gradient descent iterations.

In settings where we do not use the validation sets, we let the training sets of each episode contain 16 examples per class. Using training episodes with just one example per class resulted in performances just above random chance. While the first experiment constitutes a standard baseline, the others have the specific aim of assessing (i) the importance of splitting episodes of the meta-training set into training and validation sets and (ii) the importance of computing the hypergradient of the approximate bilevel problem with Algorithm 1. The results reported in Table 4 suggest that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good representation in a meta-learning context. On the other side, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline. One possible explanation is that, in this context, some classes in the training and testing meta-datasets are rather similar (e.g. various dog breeds) and thus ground classifiers can leverage very specific representations.

⁴The network is similar to C4L, but has 64 filters per layer.

Table 4. Performance of various methods where the representation is either transferred or learned with variants of the hyper-representation method. The last row reports, for comparison, the score obtained with hyper-representation.

Method                       filters   Accuracy 1-shot
Multiclass                   64        43.02
Bilevel-train                32        29.63
Approx                       32        41.12
Approx-train                 32        38.80
Classic-train                32        40.46
Hyper-representation-C4L     32        47.51


6. Conclusions

We have shown that both HO and ML can be formulated in terms of bilevel programming and solved with an iterative approach. When the inner problem has a unique solution (e.g. is strongly convex), our theoretical results show that the iterative approach has convergence guarantees, a result that is interesting in its own right. In the case of ML, by adapting classical strategies (Baxter, 1995) to the bilevel framework with training/validation splitting, we present a method for learning hyper-representations which is experimentally effective and supported by our theoretical guarantees.

Our framework encompasses recently proposed methods for meta-learning, such as learning to optimize, but also suggests different design patterns for the inner learning algorithm which could be interesting to explore in future work. The resulting inner problems may not satisfy the assumptions of our convergence analysis, raising the need for further theoretical investigations. An additional future direction of research is the study of the statistical properties of bilevel strategies where outer objectives are based on the generalization ability of the inner model to new (validation) data. Ideas from (Maurer et al., 2016; Denevi et al., 2018) may be useful in this direction.


References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 3981-3989.
Bard, J. F. (2013). Practical bilevel optimization: algorithms and applications, volume 30. Springer Science & Business Media.
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Annual Conference on Computational Learning Theory (COLT), pages 311-320. ACM.
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.
Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems (NIPS), pages 3458-3468.
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.
Bergstra, J., Yamins, D., and Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 115-123.
Caruana, R. (1998). Multitask learning. In Learning to learn, pages 95-133. Springer.
Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of Operations Research, 153(1):235-256.
Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018). Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089. To appear in UAI 2018.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248-255.
Domke, J. (2012). Generic methods for optimization-based modeling. In AISTATS, volume 22, pages 318-326.
Dontchev, A. L. and Zolezzi, T. (1993). Well-posed optimization problems, volume 1543 of Lecture Notes in Mathematics. Springer-Verlag, Berlin.
Edwards, H. and Storkey, A. (2016). Towards a neural statistician. arXiv:1606.02185 [cs, stat].
Evgeniou, T., Micchelli, C. A., and Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615-637.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126-1135.
Flamary, R., Rakotomamonjy, A., and Gasso, G. (2014). Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, page 103.
Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1165-1173.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249-256.
Griewank, A. and Walther, A. (2008). Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770-778.
Hutter, F., Lücke, J., and Schmidt-Thieme, L. (2015). Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz, 29(4):329-337.
Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare events.
Keerthi, S. S., Sindhwani, V., and Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems (NIPS), pages 673-680.
Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., and Torralba, A. (2012). Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158-171. Springer.
Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.
Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1885-1894.
Kunapuli, G., Bennett, K., Hu, J., and Pang, J.-S. (2008). Classification model selection via bilevel programming. Optimization Methods and Software, 23(4):475-489.
Kuzborskij, I., Orabona, F., and Caputo, B. (2013). From n to n+1: Multiclass transfer incremental learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3358-3365.
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.
Maclaurin, D., Duvenaud, D. K., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2113-2122.
Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853-2884.
Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A simple neural attentive meta-learner.
Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2554-2563.
Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 737-746.
Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1842-1850.
Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131-139.
Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 4080-4090.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951-2959.
Thrun, S. and Pratt, L. (1998). Learning to learn. Springer.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000-6010.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 3630-3638.
Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Freitas, N., and Sohl-Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3751-3760.


A. Proofs of the Results in Sec. 3

Proof of Theorem 3.1. Since Λ is compact, it follows from Weierstrass' theorem that a sufficient condition for the existence of minimizers is that f is continuous. Thus, let λ ∈ Λ and let (λ_n)_{n∈ℕ} be a sequence in Λ such that λ_n → λ. We prove that f(λ_n) = E(w_{λ_n}, λ_n) → E(w_λ, λ) = f(λ). Since (w_{λ_n})_{n∈ℕ} is bounded, there exists a subsequence (w_{k_n})_{n∈ℕ} such that w_{k_n} → w̄ for some w̄ ∈ ℝ^d. Now, since λ_{k_n} → λ and the map (w, λ) ↦ L_λ(w) is jointly continuous, we have

∀ w ∈ ℝ^d:   L_λ(w̄) = lim_n L_{λ_{k_n}}(w_{k_n}) ≤ lim_n L_{λ_{k_n}}(w) = L_λ(w).

Therefore, w̄ is a minimizer of L_λ and hence w̄ = w_λ. This proves that (w_{λ_n})_{n∈ℕ} is a bounded sequence having a unique cluster point. Hence (w_{λ_n})_{n∈ℕ} converges to its unique cluster point, which is w_λ. Finally, since (w_{λ_n}, λ_n) → (w_λ, λ) and E is jointly continuous, we have E(w_{λ_n}, λ_n) → E(w_λ, λ), and the statement follows. ∎

We recall a fundamental fact concerning the stability of minima and minimizers in optimization problems (Dontchev and Zolezzi, 1993). We provide the proof for completeness.

Theorem A.1 (Convergence). Let ϕ_T and ϕ be lower semicontinuous functions defined on a compact set Λ. Suppose that ϕ_T converges uniformly to ϕ on Λ as T → +∞. Then

(a) inf ϕ_T → inf ϕ;

(b) arg min ϕ_T → arg min ϕ, meaning that, for every (λ_T)_{T∈ℕ} such that λ_T ∈ arg min ϕ_T, we have that:
    - (λ_T)_{T∈ℕ} admits a convergent subsequence;
    - for every subsequence (λ_{K_T})_{T∈ℕ} such that λ_{K_T} → λ̄, we have λ̄ ∈ arg min ϕ.

Proof. Let (λ_T)_{T∈ℕ} be a sequence in Λ such that, for every T ∈ ℕ, λ_T ∈ arg min ϕ_T. We prove that

1) (λ_T)_{T∈ℕ} admits a convergent subsequence;

2) for every subsequence (λ_{K_T})_{T∈ℕ} such that λ_{K_T} → λ̄, we have λ̄ ∈ arg min ϕ and ϕ_{K_T}(λ_{K_T}) → inf ϕ;

3) inf ϕ_T → inf ϕ.

The first point follows from the fact that Λ is compact. Concerning the second point, let (λ_{K_T})_{T∈ℕ} be a subsequence such that λ_{K_T} → λ̄. Since ϕ_{K_T} converges uniformly to ϕ, we have

|ϕ_{K_T}(λ_{K_T}) − ϕ(λ_{K_T})| ≤ sup_{λ∈Λ} |ϕ_{K_T}(λ) − ϕ(λ)| → 0.

Therefore, using also the continuity of ϕ, we have

∀ λ ∈ Λ:   ϕ(λ̄) = lim_T ϕ(λ_{K_T}) = lim_T ϕ_{K_T}(λ_{K_T}) ≤ lim_T ϕ_{K_T}(λ) = ϕ(λ).

So λ̄ ∈ arg min ϕ and ϕ(λ̄) = lim_T ϕ_{K_T}(λ_{K_T}) ≤ inf ϕ = ϕ(λ̄), that is, lim_T ϕ_{K_T}(λ_{K_T}) = inf ϕ.

Finally, as regards the last point, we proceed by contradiction. If (ϕ_T(λ_T))_{T∈ℕ} does not converge to inf ϕ, then there exist an ε > 0 and a subsequence (ϕ_{K_T}(λ_{K_T}))_{T∈ℕ} such that

|ϕ_{K_T}(λ_{K_T}) − inf ϕ| ≥ ε   ∀ T ∈ ℕ.    (11)

Now, let (λ_{K^{(1)}_T}) be a convergent subsequence of (λ_{K_T})_{T∈ℕ}. Suppose that λ_{K^{(1)}_T} → λ̄. Clearly (λ_{K^{(1)}_T}) is also a subsequence of (λ_T)_{T∈ℕ}. Then it follows from point 2) above that ϕ_{K^{(1)}_T}(λ_{K^{(1)}_T}) → inf ϕ. This latter finding, together with equation (11), gives a contradiction. ∎

Proof of Theorem 3.2. Since E(·, λ) is uniformly Lipschitz continuous, there exists ν > 0 such that, for every T ∈ ℕ and every λ ∈ Λ,

|f_T(λ) − f(λ)| = |E(w_{T,λ}, λ) − E(w_λ, λ)| ≤ ν ‖w_{T,λ} − w_λ‖.

It follows from assumption (vi) that f_T(λ) converges to f(λ) uniformly on Λ as T → +∞. Then the statement follows from Theorem A.1. ∎

B. Cross-validation and Bilevel Programming

We note that the (approximate) bilevel programming framework easily accommodates also estimations of the generalization error generated by a cross-validation procedure. We describe here the case of K-fold cross-validation, which includes also leave-one-out cross validation.

Let D = {(x_i, y_i)}_{i=1}^n be the set of available data. K-fold cross-validation, with K ∈ {1, 2, . . . , N}, consists in partitioning D in K subsets {D^j}_{j=1}^K and fitting as many models g_{w^j} on training data D^j_tr = ⋃_{i≠j} D^i. The models are then evaluated on D^j_val = D^j. Denoting by w = (w^j)_{j=1}^K the vector of stacked weights, the K-fold cross-validation error is given by

E(w, λ) = (1/K) ∑_{j=1}^K E^j(w^j, λ),

where E^j(w^j, λ) = ∑_{(x,y)∈D^j} ℓ(g_{w^j}(x), y). E can be treated as the outer objective in the bilevel framework, while the inner objective L_λ may be given by the sum of regularized empirical errors over each D^j_tr for the K models.


Under this perspective, a K-fold cross-validation procedure closely resembles the bilevel problem for ML formulated in Sec. 2.2, where, in this case, the meta-distribution collapses on the data (ground) distribution and the episodes are sampled from the same dataset of points.

By following the procedure outlined in Sec. 2.3, we can approximate the minimization of L_λ with T steps of an optimization dynamics and compute the hypergradient of f_T(λ) = (1/K) ∑_j E^j(w^j_T, λ) by training the K models and proceeding with either forward or reverse differentiation. The models may be fitted sequentially, in parallel or stochastically. Specifically, in this last case, one can repeatedly sample one fold at a time (or possibly a mini-batch of folds) and compute a stochastic hypergradient that can be used in an SGD procedure in order to minimize f_T. At last, we note that similar ideas for the leave-one-out cross-validation error are developed in (Beirami et al., 2017), where the hypergradient of an approximation of the outer objective is computed by means of the implicit function theorem.
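As an illustration, the K-fold outer objective can reuse the same unrolling idea. The sketch below is our own (the helpers inner_loss and val_loss, the folds container and the dimension d are assumptions, not part of the paper); a stochastic hypergradient is obtained by differentiating the term of a single sampled fold.

import jax
import jax.numpy as jnp

def fold_term(lam, j, folds, inner_loss, val_loss, T=100, eta=0.1, d=30):
    X_val, y_val = folds[j]                                          # D^j_val = D^j
    X_tr = jnp.concatenate([folds[i][0] for i in range(len(folds)) if i != j])
    y_tr = jnp.concatenate([folds[i][1] for i in range(len(folds)) if i != j])
    w = jnp.zeros(d)
    for _ in range(T):                                               # T GD steps on D^j_tr
        w = w - eta * jax.grad(inner_loss)(w, lam, X_tr, y_tr)
    return val_loss(w, X_val, y_val)                                 # E^j(w^j_T, lam)

def f_T(lam, folds, inner_loss, val_loss):
    K = len(folds)
    return sum(fold_term(lam, j, folds, inner_loss, val_loss) for j in range(K)) / K

full_hypergrad = jax.grad(f_T)     # or jax.grad(fold_term) on one sampled fold, for the stochastic variant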

C. The Effect of T: Ridge Regression

In Sec. 5.1 we showed that increasing the number of iterations T leads to a better optimization of the outer objective f through the approximations f_T, which converge uniformly to f by Theorem 3.2. This, however, does not necessarily result in better test scores, due to possible overfitting of the training and validation errors, as we noted for the linear hyper-representation multiclass classification experiment in Sec. 5.1. The optimal choice of T appears to be indeed problem dependent: if in the aforementioned experiment a quite small T led to the best test accuracy (after hyperparameter optimization), in this section we present a small scale linear regression experiment that benefits from an increasing number of inner iterations.

We generated 90 noisy synthetic data points with 30 features, of which only 5 were informative, and divided them equally to form training, validation and test sets. As outer objective, we set the mean squared error E(w) = ‖X_val w − y_val‖², where X_val and y_val are the validation design matrix and targets, respectively, and w ∈ ℝ^30 is the weight vector of the model. We set as inner objective

L_λ(w) = ‖X_tr w − y_tr‖² + ∑_{i=1}^{30} e^{λ_i} w_i²,

and optimized the vector of L2 regularization coefficients λ (equivalent to a diagonal Tikhonov regularization matrix). The results, reported in Table 5, show that in this scenario overfitting is not an issue and 250 inner iterations yield the best test result. Note that, as in Sec. 5.1, also this problem admits an analytical solution, from which it is possible to compute the hypergradient analytically and perform exact hyperparameter optimization, as reported in the last row of Table 5.

Table 5. Validation and test mean absolute percentage error (MAPE) for various values of T.

T       Validation MAPE    Test MAPE
10      11.35              43.49
50      1.28               5.22
100     0.55               1.26
250     0.47               0.50
Exact   0.37               0.57

D Further Details on Few-shot ExperimentsThis appendix contains implementation details of the repre-sentation mapping h and the meta-learning hyperparametersused in the few-shot learning experiments of Sec 52

To optimize the representation mapping in all the exper-iments we use Adam with learning rate set to 10minus3 anda decay-rate of 10minus5 We used the initialization strategyproposed in (Glorot and Bengio 2010) for all the weights λof the representation

For Omniglot experiments we used a meta-batch size of 32episodes for five-way and of 16 episodes for 20-way Totrain the episode-specific classifiers we set the learning rateto 01

For one set of experiments with Mini-imagenet we used anhyper-representation (C4L) consisting of 4 convolutionallayers where each layer is composed by a convolution with32 filters a batch normalization followed by a ReLU acti-vation and a 2x2 max-pooling The classifiers were trainedusing mini-batches of 4 episodes for one-shot and 2 episodesfor five-shot with learning rate set to 001

The other set of experiments with Mini-imagenet employeda Residual Network (RN) as the representation mappingbuilt of 4 residual blocks (with 64 96 128 256 filters) andthen the block that follows 1times 1 conv (2048 filters) avgpooling 1 times 1 conv (512 filters) Each residual blockrepeats the following block 3 times 1 times 1 conv batchnormalization leaky ReLU (leak 01) before the residualconnection In this case the classifiers were optimized usingmini-batches of 2 episodes for both one and five-shot withlearning rate set to 004

Page 2: Bilevel Programming for Hyperparameter Optimization and ...

Table 1. Links and naming conventions among different fields.

  Bilevel programming   Hyperparameter optimization   Meta-learning
  Inner variables       Parameters                    Parameters of ground models
  Outer variables       Hyperparameters               Parameters of meta-learner
  Inner objective       Training error                Training errors on tasks (Eq. 3)
  Outer objective       Validation error              Meta-training error (Eq. 4)

sentations. It has been instantiated in several variants that involve non-parametric (or instance-based) predictors (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017). In the related memorization strategy, the meta-learner learns to store and retrieve data point representations in memory; it can be implemented either using recurrent networks (Santoro et al., 2016) or temporal convolutions (Mishra et al., 2018). The use of an attention mechanism (Vaswani et al., 2017) is crucial both in (Vinyals et al., 2016) and in (Mishra et al., 2018). The initialization strategy (Ravi and Larochelle, 2017; Finn et al., 2017) uses training episodes to infer a good initial value for the model's parameters, so that new tasks can be learned quickly by fine-tuning. The optimization strategy (Andrychowicz et al., 2016; Ravi and Larochelle, 2017; Wichrowska et al., 2017) forges an optimization algorithm that will find it easier to learn on novel related tasks.

A main contribution of this paper is a unified view of HO and ML within the natural mathematical framework of bilevel programming, where an outer optimization problem is solved subject to the optimality of an inner optimization problem. In HO the outer problem involves hyperparameters, while the inner problem is usually the minimization of an empirical loss. In ML the outer problem could involve a shared representation among tasks, while the inner problem could concern classifiers for individual tasks. Bilevel programming (Bard, 2013) has been suggested before in machine learning in the context of kernel methods and support vector machines (Keerthi et al., 2007; Kunapuli et al., 2008), multitask learning (Flamary et al., 2014), and, more recently, HO (Pedregosa, 2016), but never in the context of ML. The resulting framework, outlined in Sec. 2, encompasses some existing approaches to ML, in particular those based on the initialization and the optimization strategies.

A technical difficulty arises when the solution to the inner problem cannot be written analytically (for example, this happens when using the log-loss for training neural networks) and one needs to resort to iterative optimization approaches. As a second contribution, we provide in Sec. 3 sufficient conditions that guarantee good approximation properties. We observe that these conditions are reasonable and apply to concrete problems relevant to applications.

In Sec. 4, taking inspiration from early work on representation learning in the context of multi-task and meta-learning (Baxter, 1995; Caruana, 1998), we instantiate the framework for ML in a simple way, treating the weights of the last layer of a neural network as the inner variables and the remaining weights, which parametrize the representation mapping, as the outer variables. As shown in Sec. 5, the resulting ML algorithm performs well in practice, outperforming most of the existing strategies on MiniImagenet.

2. A Bilevel Optimization Framework

In this paper we consider bilevel optimization problems (see e.g. Colson et al., 2007) of the form

\[
\min \{ f(\lambda) : \lambda \in \Lambda \}, \tag{1}
\]

where the function f : Λ → R is defined at λ ∈ Λ as

\[
f(\lambda) = \inf \{ E(w_\lambda, \lambda) : w_\lambda \in \operatorname*{arg\,min}_{u \in \mathbb{R}^d} L_\lambda(u) \}. \tag{2}
\]

We call E : R^d × Λ → R the outer objective and, for every λ ∈ Λ, we call L_λ : R^d → R the inner objective. Note that {L_λ : λ ∈ Λ} is a class of objective functions parameterized by λ. Specific instances of this problem include HO and ML, which we discuss next. Table 1 outlines the links among bilevel programming, HO and ML.

2.1. Hyperparameter Optimization

In the context of hyperparameter optimization we are interested in minimizing the validation error of a model g_w : X → Y, parameterized by a vector w, with respect to a vector of hyperparameters λ. For example, we may consider representation or regularization hyperparameters that control the hypothesis space or penalties, respectively. In this setting, a prototypical choice for the inner objective is the regularized empirical error

\[
L_\lambda(w) = \sum_{(x,y) \in D_{\mathrm{tr}}} \ell(g_w(x), y) + \Omega_\lambda(w),
\]

where D_tr = {(x_i, y_i)}_{i=1}^n is a set of input/output points, ℓ is a prescribed loss function and Ω_λ a regularizer parameterized by λ. The outer objective represents a proxy for the generalization error of g_w, and it may be given by the average loss on a validation set D_val,

\[
E(w, \lambda) = \sum_{(x,y) \in D_{\mathrm{val}}} \ell(g_w(x), y),
\]

or, in more generality, by a cross-validation error, as detailed in Appendix B. Note that in this setting the outer objective E does not depend explicitly on the hyperparameters λ, since in HO λ is instrumental in finding a good model g_w, which is our final goal.

[Figure 1. Each blue line represents the average training error when varying w in g_{w,λ}, and the corresponding inner minimizer is shown as a blue dot. The validation error evaluated at each minimizer yields the black curve representing the outer objective f(λ), whose minimizer is shown as a red dot. Axes: w versus error; curves: train errors and validation error.]

As a more specific example, consider linear models g_w(x) = ⟨w, x⟩, let ℓ be the square loss and let Ω_λ(w) = λ‖w‖², in which case the inner objective is ridge regression (Tikhonov regularization) and the bilevel problem optimizes, over the regularization parameter, the validation error of ridge regression.
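To make this instance concrete, the following is a minimal numerical sketch (in JAX) of the ridge example: the inner problem is solved in closed form, so the validation error f(λ) and its gradient are directly computable. The synthetic data, the shapes and the evaluation point λ = 1.0 are illustrative assumptions, not values from the paper.

```python
# Ridge-regression instance of (1)-(2): inner problem in closed form,
# outer objective = validation error of the inner minimizer.
import jax.numpy as jnp
from jax import grad
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = jnp.array(rng.normal(size=(50, 10))), jnp.array(rng.normal(size=50))
X_val, y_val = jnp.array(rng.normal(size=(30, 10))), jnp.array(rng.normal(size=30))

def inner_solution(lam):
    # w_lambda = argmin_w ||X_tr w - y_tr||^2 + lam ||w||^2  (closed form)
    d = X_tr.shape[1]
    return jnp.linalg.solve(X_tr.T @ X_tr + lam * jnp.eye(d), X_tr.T @ y_tr)

def f(lam):
    # outer objective evaluated at the inner minimizer
    w = inner_solution(lam)
    return jnp.sum((X_val @ w - y_val) ** 2)

hypergradient = grad(f)(1.0)   # df/dlambda, available because w_lambda is analytic
```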

2.2. Meta-Learning

In meta-learning (ML) the inner and outer objectives are computed by averaging a training and a validation error over multiple tasks, respectively. The goal is to produce a learning algorithm that will work well on novel tasks¹. For this purpose, we have available a meta-training set D = {D^j = D^j_tr ∪ D^j_val}_{j=1}^N, which is a collection of datasets sampled from a meta-distribution P. Each dataset D^j = {(x_i^j, y_i^j)}_{i=1}^{n_j}, with (x_i^j, y_i^j) ∈ X × Y^j, is linked to a specific task. Note that the output space is task dependent (e.g. a multi-class classification problem with a variable number of classes). The model for each task is a function g_{w^j,λ} : X → Y^j, identified by a parameter vector w^j and hyperparameters λ. A key point here is that λ is shared between the tasks. With this notation, the inner and outer objectives are

\[
L_\lambda(w) = \sum_{j=1}^{N} L^j(w^j, \lambda, D^j_{\mathrm{tr}}), \tag{3}
\]

\[
E(w, \lambda) = \sum_{j=1}^{N} L^j(w^j, \lambda, D^j_{\mathrm{val}}), \tag{4}
\]

respectively. The loss L^j(w^j, λ, S) represents the empirical error of the pair (w^j, λ) on a set of examples S. Note that the inner and outer losses for task j use different train/validation splits of the corresponding dataset D^j. Furthermore, unlike in HO, in ML the final goal is to find a good λ and the w^j are now instrumental.

¹The ML problem is also related to multitask learning; however, in ML the goal is to extrapolate from the given tasks.

The cartoon in Figure 1 illustrates ML as a bilevel problem. The parameter λ indexes a hypothesis space within which the inner objective is minimized. A particular example, detailed in Sec. 4, is to choose the model g_{w,λ}(x) = ⟨w, h_λ(x)⟩, in which case λ parameterizes a feature mapping. Yet another choice would be to consider g_{w^j,λ}(x) = ⟨w + λ, x⟩, in which case λ represents a common model around which task-specific models are to be found (see e.g. Evgeniou et al., 2005; Finn et al., 2017; Khosla et al., 2012; Kuzborskij et al., 2013, and references therein).
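To fix ideas, here is a small sketch (under illustrative assumptions) of how Eqs. (3)-(4) can be evaluated for the model g_{w,λ}(x) = ⟨w, h_λ(x)⟩ with a linear feature map h_λ(x) = λᵀx. Episodes are stored as (X_tr, y_tr, X_val, y_val) tuples and a squared loss stands in for L^j; none of these choices is the paper's experimental setup.

```python
# Inner/outer objectives (3)-(4): lambda is shared across tasks, one weight
# vector per episode.  Episodes are (X_tr, y_tr, X_val, y_val) tuples.
import jax.numpy as jnp

def episode_loss(w_j, lam, X, y):
    # L^j(w^j, lambda, S) for g_{w,lambda}(x) = <w, h_lambda(x)>, h_lambda(x) = lam^T x
    return jnp.sum((X @ lam @ w_j - y) ** 2)

def inner_objective(W, lam, episodes):     # Eq. (3): sum of training errors
    return sum(episode_loss(w_j, lam, Xtr, ytr)
               for w_j, (Xtr, ytr, _, _) in zip(W, episodes))

def outer_objective(W, lam, episodes):     # Eq. (4): sum of validation errors
    return sum(episode_loss(w_j, lam, Xval, yval)
               for w_j, (_, _, Xval, yval) in zip(W, episodes))
```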

2.3. Gradient-Based Approach

We now discuss a general approach to solve Problem (1)-(2) when the hyperparameter vector λ is real-valued. To simplify our discussion, let us assume that the inner objective has a unique minimizer w_λ. Even in this simplified scenario, Problem (1)-(2) remains challenging to solve. Indeed, in general there is no closed form expression for w_λ, so it is not possible to directly optimize the outer objective function. While a possible strategy (implicit differentiation) is to apply the implicit function theorem to ∇L_λ = 0 (Pedregosa, 2016; Koh and Liang, 2017; Beirami et al., 2017), another compelling approach is to replace the inner problem with a dynamical system. This point, discussed in (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017), is developed further in this paper.

Specifically, we let [T] = {1, . . . , T}, where T is a prescribed positive integer, and consider the following approximation of Problem (1)-(2):

\[
\min_{\lambda} f_T(\lambda) = E(w_{T,\lambda}, \lambda), \tag{5}
\]

where E is a smooth scalar function and²

\[
w_{0,\lambda} = \Phi_0(\lambda), \qquad w_{t,\lambda} = \Phi_t(w_{t-1,\lambda}, \lambda), \quad t \in [T], \tag{6}
\]

with Φ_0 : R^m → R^d a smooth initialization mapping and, for every t ∈ [T], Φ_t : R^d × R^m → R^d a smooth mapping that represents the operation performed by the t-th step of an optimization algorithm. For example, the optimization dynamics could be gradient descent: Φ_t(w_t, λ) = w_t − η_t ∇L_λ(w_t), where (η_t)_{t∈[T]} is a sequence of step sizes.
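A minimal sketch of the approximation (5)-(6) for the ridge instance of Sec. 2.1, with Φ_0 a constant initialization and Φ_t one gradient-descent step. The toy data, T and the step size are made-up illustrative values.

```python
# Approximation (5)-(6): the inner problem is replaced by T gradient-descent
# steps and the outer objective is evaluated at the last iterate w_T.
import jax.numpy as jnp
from jax import grad
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = jnp.array(rng.normal(size=(50, 10))), jnp.array(rng.normal(size=50))
X_val, y_val = jnp.array(rng.normal(size=(30, 10))), jnp.array(rng.normal(size=30))

def L(w, lam):                      # inner objective (regularized training error)
    return jnp.sum((X_tr @ w - y_tr) ** 2) + lam * jnp.sum(w ** 2)

def E(w, lam):                      # outer objective (validation error)
    return jnp.sum((X_val @ w - y_val) ** 2)

def f_T(lam, T=100, eta=1e-3):
    w = jnp.zeros(X_tr.shape[1])    # Phi_0(lambda): here a constant initialization
    for _ in range(T):              # Phi_t(w, lambda): one gradient-descent step on L
        w = w - eta * grad(L)(w, lam)
    return E(w, lam)                # Eq. (5)
```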

The approximation of the bilevel problem (1)-(2) by the procedure (5)-(6) raises the issue of the quality of this approximation, and we return to this issue in the next section.

²In general, the algorithm used to minimize the inner objective may involve auxiliary variables (e.g. velocities when using gradient descent with momentum), so w should be intended as a larger vector containing both model parameters and auxiliary variables.

However, it also suggests considering the inner dynamics as a form of approximate empirical error minimization (e.g. early stopping), which is valid in its own right. From this perspective (conversely to the implicit differentiation strategy), it is possible to include among the components of λ variables which are associated with the optimization algorithm itself. For example, λ may include the step sizes or momentum factors if the dynamics Φ_t in Eq. (6) is gradient descent with momentum; in (Andrychowicz et al., 2016; Wichrowska et al., 2017) the mapping Φ_t is implemented as a recurrent neural network, while (Finn et al., 2017) focus on the initialization mapping by letting Φ_0(λ) = λ.

A major advantage of this reformulation is that it makes it possible to compute the gradient of f_T, which we call the hypergradient, efficiently either in time or in memory (Maclaurin et al., 2015; Franceschi et al., 2017), by making use of reverse or forward mode algorithmic differentiation (Griewank and Walther, 2008; Baydin et al., 2017). This allows us to optimize a number of hyperparameters of the same order as that of parameters, a situation which arises in ML.
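Continuing the toy f_T sketched above (so this fragment is not self-contained on its own), both differentiation modes are available directly through jax.grad and jax.jacfwd. Roughly, reverse mode has a cost independent of the number of hyperparameters but stores the unrolled trajectory, while forward mode propagates the tangents alongside the iterates and is convenient when λ is low-dimensional.

```python
# Hypergradient of the unrolled objective f_T defined in the previous sketch.
from jax import grad, jacfwd

reverse_hypergrad = grad(f_T)(0.5)     # reverse (adjoint) mode through the T steps
forward_hypergrad = jacfwd(f_T)(0.5)   # forward (tangent) mode, same value
```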

3. Exact and Approximate Bilevel Programming

In this section we provide results about the existence of solutions of Problem (1)-(2) and the approximation properties of Procedure (5)-(6) with respect to the original bilevel problem. Proofs of these results are provided in the supplementary material.

Procedure (5)-(6), though related to the bilevel problem (1)-(2), may not be in general a good approximation of it. Indeed, making the assumptions (which sound perfectly reasonable) that, for every λ ∈ Λ, w_{T,λ} → w_λ for some w_λ ∈ arg min L_λ, and that E(·, λ) is continuous, one can only assert that lim_{T→∞} f_T(λ) = E(w_λ, λ) ≥ f(λ). This is because the optimization dynamics converge to some minimizer of the inner objective L_λ, but not necessarily to the one that also minimizes the function E. This is illustrated in Figure 2. The situation is, however, different if the inner problem admits a unique minimizer for every λ ∈ Λ. Indeed, in this case it is possible to show that the set of minimizers of the approximate problems converges, as T → +∞ and in an appropriate sense, to the set of minimizers of the bilevel problem. More precisely, we make the following assumptions:

(i) Λ is a compact subset of R^m;

(ii) E : R^d × Λ → R is jointly continuous;

(iii) the map (w, λ) ↦ L_λ(w) is jointly continuous and such that arg min L_λ is a singleton for every λ ∈ Λ;

(iv) w_λ = arg min L_λ remains bounded as λ varies in Λ.

[Figure 2. In this cartoon, for a fixed λ, arg min L_λ = {w_λ^(1), w_λ^(2)}; the iterates of an optimization mapping Φ could converge to w_λ^(1), with E(w_λ^(1), λ) > E(w_λ^(2), λ).]

Then problem (1)-(2) becomes

\[
\min_{\lambda \in \Lambda} f(\lambda) = E(w_\lambda, \lambda), \qquad w_\lambda = \operatorname*{arg\,min}_{u} L_\lambda(u). \tag{7}
\]

Under the above assumptions, in the following we give results about the existence of solutions of problem (7) and the (variational) convergence of the approximate problems (5)-(6) towards problem (7), relating the minima as well as the sets of minimizers. In this respect, we note that since both f and f_T are nonconvex, arg min f_T and arg min f are in general nonsingleton, so an appropriate definition of set convergence is required.

Theorem 3.1 (Existence). Under Assumptions (i)-(iv), problem (7) admits solutions.
Proof. See Appendix A.

The result below follows from general facts on the stability of minimizers in optimization problems (Dontchev and Zolezzi, 1993).

Theorem 3.2 (Convergence). In addition to Assumptions (i)-(iv), suppose that

(v) E(·, λ) is uniformly Lipschitz continuous;

(vi) the iterates (w_{T,λ})_{T∈N} converge uniformly to w_λ on Λ as T → +∞.

Then

(a) inf f_T → inf f;

(b) arg min f_T → arg min f, meaning that, for every (λ_T)_{T∈N} such that λ_T ∈ arg min f_T, we have that:
    - (λ_T)_{T∈N} admits a convergent subsequence;
    - for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ arg min f.

Proof. See Appendix A.

We stress that assumptions (i)-(vi) are very natural and satisfied by many problems of practical interest. Thus, the above results provide full theoretical justification to the proposed approximate procedure (5)-(6). The following remark discusses assumption (vi), while the subsequent example will be relevant to the experiments in Sec. 5.

Algorithm 1. Reverse-HG for Hyper-representation.

Input: λ, current values of the hyperparameters; T, number of iterations of GD; η, ground learning rate; B, mini-batch of episodes from D
Output: gradient of the meta-training error w.r.t. λ on B

for j = 1 to |B| do
    w_0^j ← 0
    for t = 1 to T do
        w_t^j ← w_{t-1}^j − η ∇_w L^j(w_{t-1}^j, λ, D_tr^j)
    α_T^j ← ∇_w L^j(w_T^j, λ, D_val^j)
    p^j ← ∇_λ L^j(w_T^j, λ, D_val^j)
    for t = T − 1 downto 0 do
        p^j ← p^j − η α_{t+1}^j ∇_λ ∇_w L^j(w_t^j, λ, D_tr^j)
        α_t^j ← α_{t+1}^j [I − η ∇_w ∇_w L^j(w_t^j, λ, D_tr^j)]
return Σ_j p^j

Remark 3.3. If L_λ is strongly convex, then many gradient-based algorithms (e.g. standard and accelerated gradient descent) yield linear convergence of the iterates w_{T,λ}. Moreover, in such cases, the rate of linear convergence is of the type (ν_λ − μ_λ)/(ν_λ + μ_λ), where ν_λ and μ_λ are the Lipschitz constant of the gradient and the modulus of strong convexity of L_λ, respectively. So this rate can be uniformly bounded from above by ρ ∈ ]0, 1[, provided that sup_{λ∈Λ} ν_λ < +∞ and inf_{λ∈Λ} μ_λ > 0. Thus, in these cases, w_{T,λ} converges uniformly to w_λ on Λ (at a linear rate).

Example 3.4. Let us consider the following form of the inner objective:

\[
L_H(w) = \lVert y - XHw \rVert^2 + \rho \lVert w \rVert^2, \tag{8}
\]

where ρ > 0 is a fixed regularization parameter and H ∈ R^{d×d} is the hyperparameter, representing a linear feature map. L_H is strongly convex with modulus μ = ρ > 0 (independent of the hyperparameter H) and Lipschitz smooth with constant ν_H = 2‖(XH)ᵀXH + ρI‖, which is bounded from above if H ranges in a bounded set of square matrices. In this case assumptions (i)-(vi) are satisfied.

4. Learning Hyper-Representations

In this section we instantiate the bilevel programming approach for ML outlined in Sec. 2.2 in the case of deep learning, where representation layers are shared across episodes. Finding good data representations is a centerpiece in machine learning. Classical approaches (Baxter, 1995; Caruana, 1998) learn both the weights of the representation mapping and those of the ground classifiers jointly on the same data. Here we follow the bilevel approach and split each dataset/episode in training and validation sets.

Our method involves the learning of a cross-task intermediate representation h_λ : X → R^k (parametrized by a vector λ), on top of which task-specific models g^j : R^k → Y^j (parametrized by vectors w^j) are trained. The final ground model for task j is thus given by g^j ∘ h. To find λ, we solve Problem (1)-(2) with inner and outer objectives as in Eqs. (3) and (4), respectively. Since in general this problem cannot be solved exactly, we instantiate the approximation scheme in Eqs. (5)-(6) as follows:

\[
\min_{\lambda} f_T(\lambda) = \sum_{j=1}^{N} L^j(w^j_T, \lambda, D^j_{\mathrm{val}}), \tag{9}
\]

\[
w^j_t = w^j_{t-1} - \eta \nabla_w L^j(w^j_{t-1}, \lambda, D^j_{\mathrm{tr}}), \quad t \in [T],\ j \in [N]. \tag{10}
\]

Starting from an initial value, the weights of the task-specific models are learned by T iterations of gradient descent. The gradient of f_T can be computed efficiently in time by making use of an extended reverse-hypergradient procedure (Franceschi et al., 2017), which we present in Algorithm 1. Since in general the number of episodes in a meta-training set is large, we compute a stochastic approximation of the gradient of f_T by sampling a mini-batch of episodes. At test time, given a new episode D, the representation h is kept fixed, and all the examples in D are used to tune the weights w of the episode-specific model g.
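The following is a compact sketch, under simplifying assumptions, of one stochastic hyper-iteration for Eqs. (9)-(10): a one-layer nonlinear representation h_λ, linear episode-specific classifiers trained from w_0 = 0 with T gradient steps, and the hypergradient obtained by reverse-mode differentiation through the unrolled updates (a plain gradient step on λ stands in for Adam). Shapes, step sizes and the mini-batch construction are illustrative, not the paper's configuration.

```python
# One stochastic hyper-iteration for the hyper-representation scheme (9)-(10).
import jax.numpy as jnp
from jax import grad, nn
import numpy as np

def cross_entropy(w, lam, X, y, n_classes):
    logits = nn.relu(X @ lam) @ w                 # h_lambda(x) = relu(x lam), g_w linear
    return -jnp.mean(jnp.sum(nn.one_hot(y, n_classes) * nn.log_softmax(logits), axis=1))

def f_T(lam, batch, T=5, eta=0.1, n_classes=5):   # Eq. (9) on a mini-batch of episodes
    total = 0.0
    for X_tr, y_tr, X_val, y_val in batch:
        w = jnp.zeros((lam.shape[1], n_classes))  # episode-specific weights, w_0 = 0
        for _ in range(T):                        # Eq. (10): T inner gradient steps
            w = w - eta * grad(cross_entropy)(w, lam, X_tr, y_tr, n_classes)
        total = total + cross_entropy(w, lam, X_val, y_val, n_classes)
    return total

# one (stochastic) hypergradient step on the representation parameters lambda
rng = np.random.default_rng(0)
lam = jnp.array(rng.normal(size=(256, 64)) * 0.05)
batch = [(jnp.array(rng.normal(size=(5, 256))), jnp.arange(5),
          jnp.array(rng.normal(size=(15, 256))), jnp.array(rng.integers(0, 5, 15)))
         for _ in range(4)]
lam = lam - 1e-3 * grad(f_T)(lam, batch)
```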

Like other initialization and optimization strategies for ML, our method does not require lookups in a support set, as the memorization and metric strategies do (Santoro et al., 2016; Vinyals et al., 2016; Mishra et al., 2018). Unlike (Andrychowicz et al., 2016; Ravi and Larochelle, 2017), we do not tune the optimization algorithm, which in our case is plain empirical loss minimization by gradient descent, and rather focus on the hypothesis space. Unlike (Finn et al., 2017), that aims at maximizing sensitivity of new task losses to the model parameters, we aim at maximizing the generalization to novel examples during training episodes with respect to λ. Our assumptions about the structure of the model are slightly stronger than in (Finn et al., 2017) but still mild, namely that some (hyper)parameters define the representation and the remaining parameters define the classification function. In (Munkhdalai and Yu, 2017) the meta-knowledge is distributed among fast and slow weights and an external memory; our approach is more direct, since the meta-knowledge is solely distilled by λ. A further advantage of our method is that, if the episode-specific models are linear (e.g. logistic regressors) and each loss L^j is strongly convex in w, the theoretical guarantees of Theorem 3.2 apply (see Remark 3.3). These assumptions are satisfied in the experiments reported in the next section.

5. Experiments

The aim of the following experiments is threefold. First, we investigate the impact of the number of iterations of the optimization dynamics on the quality of the solution on a simple multiclass classification problem.

[Figure 3. Optimization of the outer objectives f and f_T for the exact and approximate problems: squared error versus hyperiterations (0-5000), for the exact solution and for T = 1, 4, 16, 64, 256. The optimization of H is performed with gradient descent with momentum, with the same initialization, step size and momentum factor for each run.]

Second, we test our hyper-representation method in the context of few-shot learning on two benchmark datasets. Finally, we contrast the bilevel ML approach against classical approaches to learn shared representations.³

5.1. The Effect of T

Motivated by the theoretical findings of Sec. 3, we empirically investigate how solving the inner problem approximately (i.e. using small T) affects convergence, generalization performance, and running time. We focus in particular on the linear feature map described in Example 3.4, which allows us to compare the approximated solution against the closed-form analytical solution, given by

\[
w_H = \big[(XH)^\top XH + \rho I\big]^{-1} (XH)^\top Y.
\]

In this setting the bilevel problem reduces to a (non-convex) optimization problem in H.
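Since w_H is available in closed form, the exact outer objective f(H) and its hypergradient can be computed by differentiating through the solve, as in the sketch below. The dimensions, ρ and the random targets are illustrative placeholders for the Omniglot feature vectors and one-hot labels used in the actual experiment.

```python
# Exact problem of Sec. 5.1: closed-form inner solution w_H, then df/dH.
import jax.numpy as jnp
from jax import grad
import numpy as np

rho = 0.1                                                  # illustrative value
rng = np.random.default_rng(0)
X, Y = jnp.array(rng.normal(size=(300, 256))), jnp.array(rng.normal(size=(300, 100)))
X_val, Y_val = jnp.array(rng.normal(size=(300, 256))), jnp.array(rng.normal(size=(300, 100)))

def f(H):
    Z = X @ H                                              # training features X H
    W = jnp.linalg.solve(Z.T @ Z + rho * jnp.eye(H.shape[1]), Z.T @ Y)   # w_H
    return jnp.sum((X_val @ H @ W - Y_val) ** 2)           # validation error

H = jnp.eye(256)
exact_hypergrad = grad(f)(H)                               # no unrolling needed
```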

We use a subset of 100 classes extracted from the Omniglot dataset (Lake et al., 2017) to construct a HO problem aimed at tuning H. A training set D_tr and a validation set D_val, each consisting of three randomly drawn examples per class, were sampled to form the HO problem. A third set D_test, consisting of fifteen examples per class, was used for testing. Instead of using raw images as input, we employ feature vectors x ∈ R^256 computed by the convolutional network trained on the one-shot, five-way ML setting, as described in Sec. 5.2.

For the approximate problems, we compute the hypergradient using Algorithm 1, where it is intended that B = {(D_tr, D_val)}. Figure 3 shows the values of the functions f and f_T (see Eqs. (1) and (5), respectively) during the optimization of H. As T increases, the solution of the approximate problem approaches the true bilevel solution.

³The code for reproducing the experiments, based on the package FAR-HO (https://bit.ly/far-ho), is available at https://bit.ly/hyper-repr.

[Figure 4. Accuracy on D_test of exact and approximated solutions during the optimization of H: accuracy versus hyperiterations (0-5000), for the exact solution and for T = 1, 4, 16, 64, 256. Training and validation accuracies reach almost 100% already for T = 4 and after a few hundred hyperiterations, and are therefore not reported.]

Table 2. Execution times on an NVIDIA Tesla M40 GPU.

  T            1     4     16    64     256    Exact
  Time (sec)   60    119   356   1344   5532   320

However, performing a small number of gradient descent steps for solving the inner problem acts as an implicit regularizer. As is evident from Figure 4, the generalization error is better when T is smaller than the value yielding the best approximation of the inner solution. This is to be expected since, in this setting, the dimensions of parameters and hyperparameters are of the same order, leading to a concrete possibility of overfitting the outer objective (validation error). An appropriate, problem dependent choice of T may help avoid this issue (see also Appendix C). As T increases, the number of hyperiterations required to reach the maximum test accuracy decreases, further suggesting that there is an interplay between the number of iterations used to solve the inner and the outer objective. Finally, the running time of Algorithm 1 is linear in T and the size of w and independent of the size of H (see also Table 2), making it even more appealing to reduce the number of iterations.

5.2. Few-shot Learning

We now turn our attention to learning-to-learn, precisely to few-shot supervised learning, implementing the ML strategy outlined in Sec. 4 on two different benchmark datasets:

• OMNIGLOT (Lake et al., 2015), a dataset that contains examples of 1623 different handwritten characters from 50 alphabets. We downsample the images to 28 × 28.

• MINIIMAGENET (Vinyals et al., 2016), a subset of ImageNet (Deng et al., 2009) that contains 60000 downsampled images from 100 different classes.

Following the experimental protocol used in a number of recent works, we build a meta-training set D, from which we sample datasets to solve Problem (9)-(10), a meta-validation set V for tuning ML hyperparameters, and finally a meta-test set T, which is used to estimate accuracy. Operationally, each meta-dataset consists of a pool of samples belonging to different (non-overlapping between separate meta-datasets) classes, which can be combined to form ground classification datasets D^j = D^j_tr ∪ D^j_val with 5 or 20 classes (for Omniglot). The D^j_tr's contain 1 or 5 examples per class, which are used to fit the w^j (see Eq. 10). The D^j_val's, containing 15 examples per class, are used either to compute f_T(λ) (see Eq. (9)) and its (stochastic) gradient, if D^j ∈ D, or to provide a generalization score, if D^j comes from either V or T. For MiniImagenet we use the same split and images proposed in (Ravi and Larochelle, 2017), while for Omniglot we use the protocol defined by (Santoro et al., 2016).

As ground classifiers we use multinomial logistic regressors, and as task losses ℓ^j we employ cross-entropy. The inner problems, being strongly convex, admit unique minimizers, yet require numerical computation of the solutions. We initialize the ground model parameters w^j to 0 and, according to the observation in Sec. 5.1, we perform T gradient descent steps, where T is treated as a ML hyperparameter that has to be validated. Figure 6 shows an example of meta-validation of T for one-shot learning on MiniImagenet. We compute a stochastic approximation of ∇f_T(λ) with Algorithm 1 and use Adam with decaying learning rate to optimize λ.

Regarding the specific implementation of the representation mapping h, for Omniglot we employ a four-layer convolutional neural network with strided convolutions and 64 filters per layer, as in (Vinyals et al., 2016) and other successive works. For MiniImagenet we tried two different architectures:

• C4L, a four-layer convolutional neural network with max-pooling and 32 filters per layer;

• RN, a residual network (He et al., 2016) built of four residual blocks followed by two convolutional layers.

The first network architecture has been proposed in (Ravi and Larochelle, 2017) and then used in (Finn et al., 2017), while a similar residual network architecture has been employed in a more recent work (Mishra et al., 2018). Further details on the architectures of h, as well as other ML hyperparameters, are specified in the supplementary material. We report our results using RN for MiniImagenet in Table 3, alongside scores from various recently proposed methods for comparison.

The proposed method achieves competitive results, highlighting the relative importance of learning a task-independent representation, on top of which logistic classifiers trained with very few samples generalize well. Moreover, utilizing more expressive models such as residual networks as representation mappings is beneficial for our proposed strategy and, unlike for other methods, does not result in overfitting of the outer objective, as reported in (Mishra et al., 2018). Indeed, compared to C4L, RN achieves a relative improvement of 6.5% on one-shot and 4.2% on five-shot. Figure 5 provides a visual example of the goodness of the learned representation, showing that MiniImagenet examples (the first from the meta-training and the second from the meta-testing sets) from similar classes (different dog breeds) are mapped near each other by h, and, conversely, samples from dissimilar classes are mapped afar.

[Figure 5. After sampling two datasets D ∈ D and D′ ∈ T, we show on the top the two images x ∈ D, x′ ∈ D′ that minimize ‖h_λ(x) − h_λ(x′)‖ and on the bottom those that maximize it. In between each of the two couples, we compare a random subset of components of h_λ(x) (blue) and h_λ(x′) (green).]

[Figure 6. Meta-validation of the number of gradient descent steps (T ∈ {3, 5, 8, 12}) of the ground models for MiniImagenet using the RN representation: accuracy on D_tr (left) and on D_val and D_test (right). Early stopping on the accuracy on the meta-validation set during meta-training resulted in halting the optimization of λ after 42k, 40k, 22k and 15k hyperiterations for T equal to 3, 5, 8 and 12, respectively, in line with our observation in Sec. 5.1.]

5.3. On Variants of Representation Learning Methods

In this section we show the benefits of learning a representation within the proposed bilevel framework, compared to other possible approaches that involve an explicit factorization of a classifier as g^j ∘ h. The representation mapping h is either pretrained or learned with different meta-learning algorithms. We focus on the problem of one-shot learning on MiniImagenet, and we use C4L as the architecture for the representation mapping. In all the experiments the ground models g^j are multinomial logistic regressors as in Sec. 5.2, tuned with 5 steps of gradient descent. We ran the following experiments:

Table 3. Accuracy scores, computed on episodes from T, of various methods on 1-shot and 5-shot classification problems on Omniglot and MiniImagenet. For MiniImagenet, 95% confidence intervals are reported. For Hyper-representation the scores are computed over 600 randomly drawn episodes. For other methods we show results as reported by their respective authors.

                                              OMNIGLOT 5 classes   OMNIGLOT 20 classes   MINIIMAGENET 5 classes
  Method                                      1-shot   5-shot      1-shot   5-shot       1-shot          5-shot
  Siamese nets (Koch et al., 2015)            97.3     98.4        88.2     97.0         -               -
  Matching nets (Vinyals et al., 2016)        98.1     98.9        93.8     98.5         43.44 ± 0.77    55.31 ± 0.73
  Neural stat. (Edwards and Storkey, 2016)    98.1     99.5        93.2     98.1         -               -
  Memory mod. (Kaiser et al., 2017)           98.4     99.6        95.0     98.6         -               -
  Meta-LSTM (Ravi and Larochelle, 2017)       -        -           -        -            43.56 ± 0.84    60.60 ± 0.71
  MAML (Finn et al., 2017)                    98.7     99.9        95.8     98.9         48.70 ± 1.75    63.11 ± 0.92
  Meta-networks (Munkhdalai and Yu, 2017)     98.9     -           97.0     -            49.21 ± 0.96    -
  Prototypical Net (Snell et al., 2017)       98.8     99.7        96.0     98.9         49.42 ± 0.78    68.20 ± 0.66
  SNAIL (Mishra et al., 2018)                 99.1     99.8        97.6     99.4         55.71 ± 0.99    68.88 ± 0.92
  Hyper-representation                        98.6     99.5        95.5     98.4         50.54 ± 0.85    64.53 ± 0.68

• Multiclass: the mapping h : X → R^64 is given by the linear outputs before the softmax operation of a network⁴ pretrained on the totality of examples contained in the training meta-dataset (600 examples for each of the 64 classes). In this setting we found that using the second-last layer or the output after the softmax yields worse results.

• Bilevel-train: we use a bilevel approach but, unlike in Sec. 4, we optimize the parameter vector λ of the representation mapping by minimizing the loss on the training sets of each episode. The hypergradient is still computed with Algorithm 1, albeit we set D^j_val = D^j_tr for each training episode.

• Approx and Approx-train: we consider an approximation of the hypergradient ∇f_T(λ) by disregarding the optimization dynamics of the inner objectives (i.e. we set ∇_λ w^j_T = 0; see the sketch after this list). In Approx-train we just use the training sets.

• Classic: as in (Baxter, 1995), we learn h by jointly optimizing f(λ, w^1, . . . , w^N) = Σ_{j=1}^N L^j(w^j, λ, D^j_tr) and treat the problem as standard multitask learning, with the exception that we evaluate f on mini-batches of 4 episodes, randomly sampled every 5 gradient descent iterations.
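As referenced in the Approx item above, the following sketch shows one way the ∇_λ w^j_T = 0 approximation can be realized, namely by stopping gradients through the inner updates; the loss, shapes and default values are illustrative and do not reuse the paper's code.

```python
# "Approx" variant: the hypergradient ignores the inner optimization dynamics.
import jax.numpy as jnp
from jax import grad, lax, nn

def loss(w, lam, X, y, n_classes=5):
    # cross-entropy of a linear classifier on top of the linear map lam
    logits = (X @ lam) @ w
    return -jnp.mean(jnp.sum(nn.one_hot(y, n_classes) * nn.log_softmax(logits), axis=1))

def f_T_approx(lam, episode, T=5, eta=0.1, n_classes=5):
    X_tr, y_tr, X_val, y_val = episode
    w = jnp.zeros((lam.shape[1], n_classes))
    for _ in range(T):
        # stop_gradient discards d w_t / d lambda, i.e. it enforces nabla_lambda w_T = 0
        w = w - eta * lax.stop_gradient(grad(loss)(w, lam, X_tr, y_tr, n_classes))
    return loss(w, lam, X_val, y_val, n_classes)

# grad(f_T_approx)(lam, episode) then only captures the explicit dependence
# of the validation loss on lambda.
```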

In settings where we do not use the validation sets, we let the training sets of each episode contain 16 examples per class. Using training episodes with just one example per class resulted in performances just above random chance. While the first experiment constitutes a standard baseline, the others have the specific aim of assessing (i) the importance of splitting episodes of the meta-training set into training and validation, and (ii) the importance of computing the hypergradient of the approximate bilevel problem with Algorithm 1. The results reported in Table 4 suggest that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good representation in a meta-learning context. On the other side, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline. One possible explanation is that, in this context, some classes in the training and testing meta-datasets are rather similar (e.g. various dog breeds) and thus ground classifiers can leverage very specific representations.

⁴The network is similar to C4L but has 64 filters per layer.

Table 4. Performance of various methods where the representation is either transferred or learned with variants of hyper-representation methods. The last row reports, for comparison, the score obtained with hyper-representation.

  Method                       # filters   Accuracy 1-shot
  Multiclass                   64          43.02
  Bilevel-train                32          29.63
  Approx                       32          41.12
  Approx-train                 32          38.80
  Classic-train                32          40.46
  Hyper-representation-C4L     32          47.51

6. Conclusions

We have shown that both HO and ML can be formulated in terms of bilevel programming and solved with an iterative approach. When the inner problem has a unique solution (e.g. is strongly convex), our theoretical results show that the iterative approach has convergence guarantees, a result that is interesting in its own right. In the case of ML, by adapting classical strategies (Baxter, 1995) to the bilevel framework with training/validation splitting, we present a method for learning hyper-representations which is experimentally effective and supported by our theoretical guarantees.

Our framework encompasses recently proposed methods for meta-learning, such as learning to optimize, but also suggests different design patterns for the inner learning algorithm which could be interesting to explore in future work. The resulting inner problems may not satisfy the assumptions of our convergence analysis, raising the need for further theoretical investigations. An additional future direction of research is the study of the statistical properties of bilevel strategies where outer objectives are based on the generalization ability of the inner model to new (validation) data. Ideas from (Maurer et al., 2016; Denevi et al., 2018) may be useful in this direction.

References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 3981-3989.

Bard, J. F. (2013). Practical bilevel optimization: algorithms and applications, volume 30. Springer Science & Business Media.

Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Annual Conference on Computational Learning Theory (COLT), pages 311-320. ACM.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.

Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems (NIPS), pages 3458-3468.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.

Bergstra, J., Yamins, D., and Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 115-123.

Caruana, R. (1998). Multitask learning. In Learning to Learn, pages 95-133. Springer.

Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of Operations Research, 153(1):235-256.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018). Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089. To appear in UAI 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248-255.

Domke, J. (2012). Generic methods for optimization-based modeling. In AISTATS, volume 22, pages 318-326.

Dontchev, A. L. and Zolezzi, T. (1993). Well-Posed Optimization Problems, volume 1543 of Lecture Notes in Mathematics. Springer-Verlag, Berlin.

Edwards, H. and Storkey, A. (2016). Towards a neural statistician. arXiv:1606.02185 [cs, stat].

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615-637.

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126-1135.

Flamary, R., Rakotomamonjy, A., and Gasso, G. (2014). Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, page 103.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1165-1173.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249-256.

Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770-778.

Hutter, F., Lücke, J., and Schmidt-Thieme, L. (2015). Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz, 29(4):329-337.

Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare events. In International Conference on Learning Representations (ICLR).

Keerthi, S. S., Sindhwani, V., and Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems (NIPS), pages 673-680.

Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., and Torralba, A. (2012). Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158-171. Springer.

Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.

Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1885-1894.

Kunapuli, G., Bennett, K., Hu, J., and Pang, J.-S. (2008). Classification model selection via bilevel programming. Optimization Methods and Software, 23(4):475-489.

Kuzborskij, I., Orabona, F., and Caputo, B. (2013). From N to N+1: Multiclass transfer incremental learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3358-3365.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Maclaurin, D., Duvenaud, D. K., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2113-2122.

Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853-2884.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A simple neural attentive meta-learner. In International Conference on Learning Representations (ICLR).

Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2554-2563.

Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 737-746.

Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1842-1850.

Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131-139.

Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 4080-4090.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951-2959.

Thrun, S. and Pratt, L. (1998). Learning to Learn. Springer.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000-6010.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 3630-3638.

Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., de Freitas, N., and Sohl-Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3751-3760.

A. Proofs of the Results in Sec. 3

Proof of Theorem 3.1. Since Λ is compact, it follows from Weierstrass' theorem that a sufficient condition for the existence of minimizers is that f is continuous. Thus, let λ ∈ Λ and let (λ_n)_{n∈N} be a sequence in Λ such that λ_n → λ. We prove that f(λ_n) = E(w_{λ_n}, λ_n) → E(w_λ, λ) = f(λ). Since (w_{λ_n})_{n∈N} is bounded, there exists a subsequence (w_{k_n})_{n∈N} such that w_{k_n} → w̄ for some w̄ ∈ R^d. Now, since λ_{k_n} → λ and the map (w, λ) ↦ L_λ(w) is jointly continuous, we have

\[
\forall\, w \in \mathbb{R}^d \qquad L_\lambda(\bar w) = \lim_n L_{\lambda_{k_n}}(w_{k_n}) \le \lim_n L_{\lambda_{k_n}}(w) = L_\lambda(w).
\]

Therefore, w̄ is a minimizer of L_λ, and hence w̄ = w_λ. This proves that (w_{λ_n})_{n∈N} is a bounded sequence having a unique cluster point. Hence (w_{λ_n})_{n∈N} converges to its unique cluster point, which is w_λ. Finally, since (w_{λ_n}, λ_n) → (w_λ, λ) and E is jointly continuous, we have E(w_{λ_n}, λ_n) → E(w_λ, λ), and the statement follows.

We recall a fundamental fact concerning the stability of minima and minimizers in optimization problems (Dontchev and Zolezzi, 1993). We provide the proof for completeness.

Theorem A.1 (Convergence). Let φ_T and φ be lower semicontinuous functions defined on a compact set Λ. Suppose that φ_T converges uniformly to φ on Λ as T → +∞. Then

(a) inf φ_T → inf φ;

(b) arg min φ_T → arg min φ, meaning that, for every (λ_T)_{T∈N} such that λ_T ∈ arg min φ_T, we have that:
    - (λ_T)_{T∈N} admits a convergent subsequence;
    - for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ arg min φ.

Proof. Let (λ_T)_{T∈N} be a sequence in Λ such that, for every T ∈ N, λ_T ∈ arg min φ_T. We prove that:

1) (λ_T)_{T∈N} admits a convergent subsequence;

2) for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ arg min φ and φ_{K_T}(λ_{K_T}) → inf φ;

3) inf φ_T → inf φ.

The first point follows from the fact that Λ is compact. Concerning the second point, let (λ_{K_T})_{T∈N} be a subsequence such that λ_{K_T} → λ̄. Since φ_{K_T} converges uniformly to φ, we have

\[
|\varphi_{K_T}(\lambda_{K_T}) - \varphi(\lambda_{K_T})| \le \sup_{\lambda \in \Lambda} |\varphi_{K_T}(\lambda) - \varphi(\lambda)| \to 0.
\]

Therefore, using also the continuity of φ, we have

\[
\forall\, \lambda \in \Lambda \qquad \varphi(\bar\lambda) = \lim_T \varphi(\lambda_{K_T}) = \lim_T \varphi_{K_T}(\lambda_{K_T}) \le \lim_T \varphi_{K_T}(\lambda) = \varphi(\lambda).
\]

So λ̄ ∈ arg min φ and φ(λ̄) = lim_T φ_{K_T}(λ_{K_T}) ≤ inf φ = φ(λ̄), that is, lim_T φ_{K_T}(λ_{K_T}) = inf φ.

Finally, as regards the last point, we proceed by contradiction. If (φ_T(λ_T))_{T∈N} does not converge to inf φ, then there exist an ε > 0 and a subsequence (φ_{K_T}(λ_{K_T}))_{T∈N} such that

\[
|\varphi_{K_T}(\lambda_{K_T}) - \inf \varphi| \ge \varepsilon \quad \forall\, T \in \mathbb{N}. \tag{11}
\]

Now, let (λ_{K_T^{(1)}})_{T∈N} be a convergent subsequence of (λ_{K_T})_{T∈N}, and suppose that λ_{K_T^{(1)}} → λ̄. Clearly, (λ_{K_T^{(1)}})_{T∈N} is also a subsequence of (λ_T)_{T∈N}. Then it follows from point 2) above that φ_{K_T^{(1)}}(λ_{K_T^{(1)}}) → inf φ. This latter finding, together with equation (11), gives a contradiction.

Proof of Theorem 3.2. Since E(·, λ) is uniformly Lipschitz continuous, there exists ν > 0 such that, for every T ∈ N and every λ ∈ Λ,

\[
|f_T(\lambda) - f(\lambda)| = |E(w_{T,\lambda}, \lambda) - E(w_\lambda, \lambda)| \le \nu \lVert w_{T,\lambda} - w_\lambda \rVert.
\]

It follows from assumption (vi) that f_T(λ) converges to f(λ) uniformly on Λ as T → +∞. Then the statement follows from Theorem A.1.

B. Cross-validation and Bilevel Programming

We note that the (approximate) bilevel programming framework easily accommodates also estimations of the generalization error generated by cross-validation procedures. We describe here the case of K-fold cross-validation, which includes also leave-one-out cross-validation.

Let D = {(x_i, y_i)}_{i=1}^n be the set of available data. K-fold cross-validation, with K ∈ {1, 2, . . . , N}, consists in partitioning D into K subsets {D^j}_{j=1}^K and fitting as many models g_{w^j} on training data D^j_tr = ∪_{i≠j} D^i. The models are then evaluated on D^j_val = D^j. Denoting by w = (w^j)_{j=1}^K the vector of stacked weights, the K-fold cross-validation error is given by

\[
E(w, \lambda) = \frac{1}{K} \sum_{j=1}^{K} E^j(w^j, \lambda),
\]

where E^j(w^j, λ) = Σ_{(x,y)∈D^j} ℓ(g_{w^j}(x), y). E can be treated as the outer objective in the bilevel framework, while the inner objective L_λ may be given by the sum of regularized empirical errors over each D^j_tr for the K models.

Under this perspective, a K-fold cross-validation procedure closely resembles the bilevel problem for ML formulated in Sec. 2.2, where, in this case, the meta-distribution collapses on the data (ground) distribution and the episodes are sampled from the same dataset of points.

By following the procedure outlined in Sec. 2.3, we can approximate the minimization of L_λ with T steps of an optimization dynamics, compute the hypergradient of f_T(λ) = (1/K) Σ_j E^j(w^j_T, λ) by training the K models, and proceed with either forward or reverse differentiation. The models may be fitted sequentially, in parallel or stochastically. Specifically, in this last case, one can repeatedly sample one fold at a time (or possibly a mini-batch of folds) and compute a stochastic hypergradient that can be used in an SGD procedure in order to minimize f_T. Lastly, we note that similar ideas for the leave-one-out cross-validation error are developed in (Beirami et al., 2017), where the hypergradient of an approximation of the outer objective is computed by means of the implicit function theorem.
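A minimal sketch of such a stochastic hypergradient, assuming ridge-style inner losses and K = 3 folds (both illustrative choices): one fold is sampled, the corresponding model is trained with T inner gradient steps on its training split, and the fold's validation error is differentiated with respect to λ.

```python
# Stochastic hypergradient for the K-fold outer objective: sample one fold,
# unroll T inner steps on D_tr^j, differentiate E^j(w_T^j, lambda) w.r.t. lambda.
import jax.numpy as jnp
from jax import grad
import numpy as np

def fold_objective(lam, X_tr, y_tr, X_val, y_val, T=50, eta=1e-3):
    w = jnp.zeros(X_tr.shape[1])
    for _ in range(T):   # T steps on the regularized empirical error over D_tr^j
        w = w - eta * grad(lambda w_: jnp.sum((X_tr @ w_ - y_tr) ** 2)
                           + lam * jnp.sum(w_ ** 2))(w)
    return jnp.sum((X_val @ w - y_val) ** 2)      # E^j(w_T^j, lambda)

rng = np.random.default_rng(0)
X, y = jnp.array(rng.normal(size=(90, 30))), jnp.array(rng.normal(size=90))
folds = np.array_split(np.arange(90), 3)          # K = 3 folds
j = int(rng.integers(3))                          # sample one fold at random
tr_idx = np.concatenate([folds[i] for i in range(3) if i != j])
stoch_hypergrad = grad(fold_objective)(jnp.array(0.1), X[tr_idx], y[tr_idx],
                                       X[folds[j]], y[folds[j]])
```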

C. The Effect of T: Ridge Regression

In Sec. 5.1 we showed that increasing the number of iterations T leads to a better optimization of the outer objective f through the approximations f_T, which converge uniformly to f by Theorem 3.2. This, however, does not necessarily result in better test scores, due to possible overfitting of the training and validation errors, as we noted for the linear hyper-representation multiclass classification experiment in Sec. 5.1. The optimal choice of T appears to be indeed problem dependent: if in the aforementioned experiment a quite small T led to the best test accuracy (after hyperparameter optimization), in this section we present a small-scale linear regression experiment that benefits from an increasing number of inner iterations.

We generated 90 noisy synthetic data points with 30 features, of which only 5 were informative, and divided them equally to form training, validation and test sets. As outer objective we set the mean squared error E(w) = ‖X_val w − y_val‖², where X_val and y_val are the validation design matrix and targets, respectively, and w ∈ R^30 is the weight vector of the model. We set as inner objective

\[
L_\lambda(w) = \lVert X_{\mathrm{tr}} w - y_{\mathrm{tr}} \rVert^2 + \sum_{i=1}^{30} e^{\lambda_i} w_i^2,
\]

and optimized the vector of L2 regularization coefficients λ (equivalent to a diagonal Tikhonov regularization matrix). The results, reported in Table 5, show that in this scenario overfitting is not an issue and 250 inner iterations yield the best test result. Note that, as in Sec. 5.1, also this problem admits an analytical solution, from which it is possible to compute the hypergradient analytically and perform exact hyperparameter optimization, as reported in the last row of Table 5.

Table 5. Validation and test mean absolute percentage error (MAPE) for various values of T.

  T       Validation MAPE   Test MAPE
  10      11.35             43.49
  50      1.28              5.22
  100     0.55              1.26
  250     0.47              0.50
  Exact   0.37              0.57
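For reference, here is a short sketch of the exact hyperparameter optimization used in the last row of Table 5: the inner problem has the analytic solution w(λ) = (X_trᵀX_tr + diag(e^λ))⁻¹ X_trᵀ y_tr, so the hypergradient of the validation error follows by differentiating through the solve. The data-generation details (noise level, split order) are assumptions made only for illustration.

```python
# Exact hypergradient for the diagonal-Tikhonov ridge problem of Appendix C.
import jax.numpy as jnp
from jax import grad
import numpy as np

rng = np.random.default_rng(0)
w_true = np.concatenate([rng.normal(size=5), np.zeros(25)])   # 5 informative features
X = rng.normal(size=(90, 30))
y = X @ w_true + 0.1 * rng.normal(size=90)                    # noise level is a guess
X_tr, y_tr = jnp.array(X[:30]), jnp.array(y[:30])
X_val, y_val = jnp.array(X[30:60]), jnp.array(y[30:60])

def f(lam):
    # analytic inner solution of L_lambda, then the outer (validation) error
    w = jnp.linalg.solve(X_tr.T @ X_tr + jnp.diag(jnp.exp(lam)), X_tr.T @ y_tr)
    return jnp.sum((X_val @ w - y_val) ** 2)

exact_hypergrad = grad(f)(jnp.zeros(30))   # can drive gradient descent or Adam on lambda
```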

D. Further Details on Few-shot Experiments

This appendix contains implementation details of the representation mapping h and the meta-learning hyperparameters used in the few-shot learning experiments of Sec. 5.2.

To optimize the representation mapping, in all the experiments we use Adam with learning rate set to 10⁻³ and a decay rate of 10⁻⁵. We used the initialization strategy proposed in (Glorot and Bengio, 2010) for all the weights λ of the representation.

For Omniglot experiments, we used a meta-batch size of 32 episodes for five-way and of 16 episodes for 20-way. To train the episode-specific classifiers we set the learning rate to 0.1.

For one set of experiments with MiniImagenet we used a hyper-representation (C4L) consisting of 4 convolutional layers, where each layer is composed of a convolution with 32 filters, a batch normalization, followed by a ReLU activation and a 2x2 max-pooling. The classifiers were trained using mini-batches of 4 episodes for one-shot and 2 episodes for five-shot, with the learning rate set to 0.01.

The other set of experiments with MiniImagenet employed a residual network (RN) as the representation mapping, built of 4 residual blocks (with 64, 96, 128, 256 filters), followed by the block: 1×1 conv (2048 filters), avg pooling, 1×1 conv (512 filters). Each residual block repeats the following block 3 times: 1×1 conv, batch normalization, leaky ReLU (leak 0.1), before the residual connection. In this case the classifiers were optimized using mini-batches of 2 episodes for both one- and five-shot, with the learning rate set to 0.04.

Page 3: Bilevel Programming for Hyperparameter Optimization and ...

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

w

Error

Train errors

Validation error

Figure 1 Each blue line represents the average training error whenvarying w in gwλ and the corresponding inner minimizer is shownas a blue dot The validation error evaluated at each minimizeryields the black curve representing the outer objective f(λ) whoseminimizer is shown as a red dot

since in HO λ is instrumental in finding a good model gwwhich is our final goal As a more specific example considerlinear models gw(x) = 〈w x〉 let ` be the square loss andlet Ωλ(w) = λw2 in which case the inner objective isridge regression (Tikhonov regularization) and the bilevelproblem optimizes over the regularization parameter thevalidation error of ridge regression

22 Meta-Learning

In meta-learning (ML) the inner and outer objectives arecomputed by averaging a training and a validation errorover multiple tasks respectively The goal is to producea learning algorithm that will work well on novel tasks1For this purpose we have available a meta-training setD = Dj = Dj

tr cup DjvalNj=1 which is a collection of

datasets sampled from a meta-distribution P Each datasetDj = (xji y

ji )

nj

i=1 with (xji yji ) isin X times Yj is linked to

a specific task Note that the output space is task depen-dent (eg a multi-class classification problem with variablenumber of classes) The model for each task is a functiongwj λ X rarr Yj identified by a parameter vectors wj andhyperparameters λ A key point here is that λ is sharedbetween the tasks With this notation the inner and outerobjectives are

Lλ(w) =

Nsumj=1

Lj(wj λDjtr) (3)

E(w λ) =

Nsumj=1

Lj(wj λDjval) (4)

respectively The loss Lj(wj λ S) represents the empiricalerror of the pair (wj λ) on a set of examples S Note that the

1The ML problem is also related to multitask learning howeverin ML the goal is to extrapolate from the given tasks

inner and outer losses for task j use different trainvalidationsplits of the corresponding dataset Dj Furthermore unlikein HO in ML the final goal is to find a good λ and the wj

are now instrumental

The cartoon in Figure 1 illustrates ML as a bilevel problemThe parameter λ indexes an hypothesis space within whichthe inner objective is minimized A particular example de-tailed in Sec 4 is to choose the model gwλ = 〈w hλ(x)〉in which case λ parameterizes a feature mapping Yet an-other choice would be to consider gwj λ(x) = 〈w + λ x〉in which case λ represents a common model around whichtask specific models are to be found (see eg Evgeniou et al2005 Finn et al 2017 Khosla et al 2012 Kuzborskij et al2013 and reference therein)

23 Gradient-Based Approach

We now discuss a general approach to solve Problem (1)-(2)when the hyperparameter vector λ is real-valued To sim-plify our discussion let us assume that the inner objectivehas a unique minimizer wλ Even in this simplified scenarioProblem (1)-(2) remains challenging to solve Indeed ingeneral there is no closed form expression wλ so it is notpossible to directly optimize the outer objective functionWhile a possible strategy (implicit differentiation) is to ap-ply the implicit function theorem tonablaLλ = 0 (Pedregosa2016 Koh and Liang 2017 Beirami et al 2017) anothercompelling approach is to replace the inner problem with adynamical system This point discussed in (Domke 2012Maclaurin et al 2015 Franceschi et al 2017) is developedfurther in this paper

Specifically we let [T ] = 1 T where T is a pre-scribed positive integer and consider the following approxi-mation of Problem (1)-(2)

minλfT (λ) = E(wTλ λ) (5)

where E is a smooth scalar function and2

w0λ = Φ0(λ) wtλ = Φt(wtminus1λ λ) t isin [T ] (6)

with Φ0 Rm rarr Rd a smooth initialization mapping andfor every t isin [T ] Φt Rd times Rm rarr Rd a smooth mappingthat represents the operation performed by the t-th step ofan optimization algorithm For example the optimizationdynamics could be gradient descent Φt(wt λ) = wt minusηtnablaLλ(middot) where (ηt)tisin[T ] is a sequence of steps sizes

The approximation of the bilevel problem (1)-(2) by theprocedure (5)-(6) raises the issue of the quality of this ap-proximation and we return to this issue in the next section

2In general the algorithm used to minimize the inner objectivemay involve auxiliary variables eg velocities when using gradi-ent descent with momentum so w should be intended as a largervector containing both model parameters and auxiliary variables

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

However it also suggests to consider the inner dynamicsas a form of approximate empirical error minimization (egearly stopping) which is valid in its own right From thisperspective ndash conversely to the implicit differentiation strat-egy ndash it is possible to include among the components of λvariables which are associated with the optimization algo-rithm itself For example λ may include the step sizes ormomentum factors if the dynamics Φt in Eq (6) is gradientdescent with momentum in (Andrychowicz et al 2016Wichrowska et al 2017) the mapping Φt is implemented asa recurrent neural network while (Finn et al 2017) focuson the initialization mapping by letting Φ0(λ) = λ

A major advantage of this reformulation is that it makes itpossible to compute efficiently the gradient of fT which wecall hypergradient either in time or in memory (Maclaurinet al 2015 Franceschi et al 2017) by making use of re-verse or forward mode algorithmic differentiation (Griewankand Walther 2008 Baydin et al 2017) This allows us tooptimize a number of hyperparameters of the same order ofthat of parameters a situation which arise in ML

3 Exact and Approximate BilevelProgramming

In this section we provide results about the existence ofsolutions of Problem (1)-(2) and the approximation proper-ties of Procedure (5)-(6) with respect to the original bilevelproblem Proofs of these results are provided in the supple-mentary material

Procedure (5)-(6) though related to the bilevel problem(1)-(2) may not be in general a good approximation ofit Indeed making the assumptions (which sound perfectlyreasonable) that for every λ isin Λ wTλ rarr wλ for somewλ isin arg maxLλ and that E(middot λ) is continuous one canonly assert that limTrarrinfin fT (λ) = E(wλ λ) ge f(λ) Thisis because the optimization dynamics converge to someminimizer of the inner objective Lλ but not necessarilyto the one that also minimizes the function E This isillustrated in Figure 2 The situation is however differentif the inner problem admits a unique minimizer for everyλ isin Λ Indeed in this case it is possible to show that theset of minimizers of the approximate problems convergeas T rarr +infin and in an appropriate sense to the set ofminimizers of the bilevel problem More precisely wemake the following assumptions

(i) Λ is a compact subset of R^m;

(ii) E : R^d × Λ → R is jointly continuous;

(iii) the map (w, λ) ↦ L_λ(w) is jointly continuous and such that arg min L_λ is a singleton for every λ ∈ Λ;

(iv) w_λ = arg min L_λ remains bounded as λ varies in Λ.

Figure 2. In this cartoon, for a fixed λ, argmin L_λ = {w_λ^(1), w_λ^(2)}; the iterates of an optimization mapping Φ could converge to w_λ^(1), with E(w_λ^(1), λ) > E(w_λ^(2), λ).

Then problem (1)-(2) becomes

min_{λ∈Λ} f(λ) = E(w_λ, λ),    w_λ = argmin_u L_λ(u).    (7)

Under the above assumptions, in the following we give results about the existence of solutions of problem (7) and the (variational) convergence of the approximate problems (5)-(6) towards problem (7), relating the minima as well as the sets of minimizers. In this respect, we note that, since both f and f_T are nonconvex, argmin f_T and argmin f are in general nonsingletons, so an appropriate definition of set convergence is required.

Theorem 3.1 (Existence). Under Assumptions (i)-(iv), problem (7) admits solutions.
Proof. See Appendix A.

The result below follows from general facts on the stability of minimizers in optimization problems (Dontchev and Zolezzi, 1993).

Theorem 3.2 (Convergence). In addition to Assumptions (i)-(iv), suppose that:

(v) E(·, λ) is uniformly Lipschitz continuous;

(vi) the iterates (w_{T,λ})_{T∈N} converge uniformly to w_λ on Λ as T → +∞.

Then

(a) inf f_T → inf f;

(b) argmin f_T → argmin f, meaning that, for every (λ_T)_{T∈N} such that λ_T ∈ argmin f_T, we have that:
  - (λ_T)_{T∈N} admits a convergent subsequence;
  - for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ argmin f.

Proof. See Appendix A.

We stress that assumptions (i)-(vi) are very natural and satisfied by many problems of practical interest. Thus, the above results provide full theoretical justification to the proposed approximate procedure (5)-(6). The following remark discusses assumption (vi), while the subsequent example will be relevant to the experiments in Sec. 5.


Algorithm 1 Reverse-HG for Hyper-representation
Input: λ, current values of the hyperparameter; T, number of iterations of GD; η, ground learning rate; B, mini-batch of episodes from D
Output: Gradient of meta-training error w.r.t. λ on B

for j = 1 to |B| do
    w_0^j = 0
    for t = 1 to T do
        w_t^j ← w_{t−1}^j − η ∇_w L^j(w_{t−1}^j, λ, D_tr^j)
    α_T^j ← ∇_w L^j(w_T^j, λ, D_val^j)
    p^j ← ∇_λ L^j(w_T^j, λ, D_val^j)
    for t = T − 1 downto 0 do
        p^j ← p^j − η α_{t+1}^j ∇_λ∇_w L^j(w_t^j, λ, D_tr^j)
        α_t^j ← α_{t+1}^j [I − η ∇_w∇_w L^j(w_t^j, λ, D_tr^j)]
return Σ_j p^j
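To make the reverse recursion above concrete, the following sketch instantiates it for a single episode (so the sum over j disappears) with a simple inner objective L(w, λ) = ||X_tr w − y_tr||² + Σ_i e^{λ_i} w_i² (as in Appendix C) rather than the hyper-representation setting; for this choice the second derivatives needed by Algorithm 1 are available in closed form. All names and the specific losses are illustrative choices, and in practice the Jacobian-vector products would typically be obtained by reverse-mode automatic differentiation rather than hand-derived formulas.

```python
import numpy as np

def reverse_hypergradient(X_tr, y_tr, X_val, y_val, lam, T=100, eta=0.01):
    """Reverse-mode hypergradient of f_T(lam) = ||X_val w_T - y_val||^2, where w_T is
    obtained with T gradient descent steps on
    L(w, lam) = ||X_tr w - y_tr||^2 + sum_i exp(lam_i) * w_i^2.
    Mirrors the recursion of Algorithm 1 for a single episode (illustrative)."""
    d = X_tr.shape[1]
    reg = np.exp(lam)                                  # diagonal regularization weights

    def grad_w(w):                                     # grad_w L(w, lam)
        return 2 * X_tr.T @ (X_tr @ w - y_tr) + 2 * reg * w

    # forward pass: unroll T steps of gradient descent, storing all iterates
    ws = [np.zeros(d)]
    for _ in range(T):
        ws.append(ws[-1] - eta * grad_w(ws[-1]))

    # reverse pass
    alpha = 2 * X_val.T @ (X_val @ ws[-1] - y_val)     # grad_w E(w_T); here grad_lam E = 0
    p = np.zeros(d)
    H_data = 2 * X_tr.T @ X_tr                         # data part of the inner Hessian (constant in w)
    for t in range(T - 1, -1, -1):
        w_t = ws[t]
        p -= eta * (2 * reg * w_t) * alpha             # accumulate -eta * (grad_lam grad_w L)^T alpha
        alpha -= eta * (H_data @ alpha + 2 * reg * alpha)   # alpha <- (I - eta grad_w grad_w L) alpha
    return p                                           # approximates grad f_T(lam)
```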

Remark 3.3. If L_λ is strongly convex, then many gradient-based algorithms (e.g., standard and accelerated gradient descent) yield linear convergence of the iterates w_{T,λ}. Moreover, in such cases the rate of linear convergence is of type (ν_λ − μ_λ)/(ν_λ + μ_λ), where ν_λ and μ_λ are the Lipschitz constant of the gradient and the modulus of strong convexity of L_λ, respectively. So this rate can be uniformly bounded from above by ρ ∈ ]0, 1[, provided that sup_{λ∈Λ} ν_λ < +∞ and inf_{λ∈Λ} μ_λ > 0. Thus, in these cases, w_{T,λ} converges uniformly to w_λ on Λ (at a linear rate).
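To sketch why this gives the uniform convergence required by assumption (vi) (our own elaboration, not a statement from the paper): for instance, for gradient descent with a suitable constant step size, linear convergence reads ‖w_{T,λ} − w_λ‖ ≤ ρ_λ^T ‖Φ_0(λ) − w_λ‖ with ρ_λ ≤ ρ < 1, so that

  sup_{λ∈Λ} ‖w_{T,λ} − w_λ‖ ≤ ρ^T sup_{λ∈Λ} ‖Φ_0(λ) − w_λ‖ → 0 as T → +∞,

where the supremum on the right-hand side is finite by assumptions (i) and (iv) together with the continuity of Φ_0.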

Example 3.4. Let us consider the following form of the inner objective:

L_H(w) = ‖y − XHw‖² + ρ‖w‖²,    (8)

where ρ > 0 is a fixed regularization parameter and H ∈ R^{d×d} is the hyperparameter, representing a linear feature map. L_H is strongly convex with modulus μ = ρ > 0 (independent of the hyperparameter H) and Lipschitz smooth with constant ν_H = 2‖(XH)^⊤XH + ρI‖, which is bounded from above if H ranges in a bounded set of square matrices. In this case, assumptions (i)-(vi) are satisfied.

4. Learning Hyper-Representations

In this section, we instantiate the bilevel programming approach for ML outlined in Sec. 2.2 in the case of deep learning, where representation layers are shared across episodes. Finding good data representations is a centerpiece in machine learning. Classical approaches (Baxter, 1995; Caruana, 1998) learn both the weights of the representation mapping and those of the ground classifiers jointly on the same data. Here we follow the bilevel approach and split each dataset/episode in training and validation sets.

Our method involves the learning of a cross-task intermediate representation h_λ : X → R^k (parametrized by a vector λ), on top of which task-specific models g^j : R^k → Y^j (parametrized by vectors w^j) are trained. The final ground model for task j is thus given by g^j ∘ h. To find λ, we solve Problem (1)-(2) with inner and outer objectives as in Eqs. (3) and (4), respectively. Since in general this problem cannot be solved exactly, we instantiate the approximation scheme in Eqs. (5)-(6) as follows:

min_λ f_T(λ) = Σ_{j=1}^N L^j(w_T^j, λ, D_val^j),    (9)

w_t^j = w_{t−1}^j − η ∇_w L^j(w_{t−1}^j, λ, D_tr^j),   (t, j) ∈ [T] × [N].    (10)

Starting from an initial value, the weights of the task-specific models are learned by T iterations of gradient descent. The gradient of f_T can be computed efficiently in time by making use of an extended reverse-hypergradient procedure (Franceschi et al., 2017), which we present in Algorithm 1. Since in general the number of episodes in a meta-training set is large, we compute a stochastic approximation of the gradient of f_T by sampling a mini-batch of episodes. At test time, given a new episode D, the representation h is kept fixed and all the examples in D are used to tune the weights w of the episode-specific model g.
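A minimal sketch of this stochastic meta-training loop follows; the per-episode hypergradient routine, the plain SGD update on λ and all names are illustrative assumptions (the paper computes the per-episode gradient with Algorithm 1 and optimizes λ with Adam).

```python
import random
import numpy as np

def meta_train(episodes, lam0, episode_hypergrad, meta_lr=1e-3, meta_batch=4, hyper_iters=1000):
    """Stochastic optimization of the outer objective (9): at each hyperiteration, sample a
    mini-batch of episodes, average their hypergradients and update lam.
    episode_hypergrad(episode, lam) is assumed to return the gradient w.r.t. lam of
    L^j(w^j_T, lam; D^j_val) for that episode."""
    lam = np.array(lam0, dtype=float)
    for _ in range(hyper_iters):
        batch = random.sample(episodes, meta_batch)
        g = np.mean([episode_hypergrad(ep, lam) for ep in batch], axis=0)
        lam -= meta_lr * g            # plain SGD step on lam (the paper uses Adam)
    return lam
```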

Like other initialization and optimization strategies for ML, our method does not require lookups in a support set, as the memorization and metric strategies do (Santoro et al., 2016; Vinyals et al., 2016; Mishra et al., 2018). Unlike (Andrychowicz et al., 2016; Ravi and Larochelle, 2017), we do not tune the optimization algorithm, which in our case is plain empirical loss minimization by gradient descent, and rather focus on the hypothesis space. Unlike (Finn et al., 2017), which aims at maximizing the sensitivity of new task losses to the model parameters, we aim at maximizing the generalization to novel examples during training episodes with respect to λ. Our assumptions about the structure of the model are slightly stronger than in (Finn et al., 2017), but still mild: namely, that some (hyper)parameters define the representation and the remaining parameters define the classification function. In (Munkhdalai and Yu, 2017), the meta-knowledge is distributed among fast and slow weights and an external memory; our approach is more direct, since the meta-knowledge is solely distilled by λ. A further advantage of our method is that, if the episode-specific models are linear (e.g., logistic regressors) and each loss L^j is strongly convex in w, the theoretical guarantees of Theorem 3.2 apply (see Remark 3.3). These assumptions are satisfied in the experiments reported in the next section.

5. Experiments

The aim of the following experiments is threefold. First, we investigate the impact of the number of iterations of the optimization dynamics on the quality of the solution on a


Figure 3. Optimization of the outer objectives f and f_T for exact and approximate problems. The optimization of H is performed with gradient descent with momentum, with the same initialization, step size and momentum factor for each run.

simple multiclass classification problem. Second, we test our hyper-representation method in the context of few-shot learning on two benchmark datasets. Finally, we contrast the bilevel ML approach against classical approaches to learn shared representations.³

5.1. The Effect of T

Motivated by the theoretical findings of Sec. 3, we empirically investigate how solving the inner problem approximately (i.e., using small T) affects convergence, generalization performance, and running time. We focus in particular on the linear feature map described in Example 3.4, which allows us to compare the approximated solution against the closed-form analytical solution, given by

w_H = [(XH)^⊤ XH + ρI]^{−1} (XH)^⊤ Y.

In this setting, the bilevel problem reduces to a (non-convex) optimization problem in H.
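For reference, a small numpy sketch of the exact pipeline used as the baseline here: solve the inner ridge problem in closed form and plug w_H into the validation error, assuming (consistently with Figure 3) a squared validation error as outer objective and one-hot targets Y. Argument names are illustrative.

```python
import numpy as np

def exact_outer_objective(H, X_tr, Y_tr, X_val, Y_val, rho):
    """Exact bilevel objective for the linear feature map of Example 3.4:
    f(H) = ||X_val H w_H - Y_val||^2 with w_H the closed-form ridge solution."""
    Z_tr = X_tr @ H                                   # mapped training inputs
    k = Z_tr.shape[1]
    W = np.linalg.solve(Z_tr.T @ Z_tr + rho * np.eye(k), Z_tr.T @ Y_tr)   # w_H
    residual = X_val @ H @ W - Y_val
    return np.sum(residual ** 2)                      # squared validation error
```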

We use a subset of 100 classes extracted from the Omniglot dataset (Lake et al., 2017) to construct a HO problem aimed at tuning H. A training set D_tr and a validation set D_val, each consisting of three randomly drawn examples per class, were sampled to form the HO problem. A third set D_test, consisting of fifteen examples per class, was used for testing. Instead of using raw images as input, we employ feature vectors x ∈ R^256 computed by the convolutional network trained on the one-shot, five-way ML setting, as described in Sec. 5.2.

For the approximate problems, we compute the hypergradient using Algorithm 1, where it is intended that B = (D_tr, D_val). Figure 3 shows the values of the functions f and f_T (see Eqs. (1) and (5), respectively) during the optimization of H. As T increases, the solution of the approximate

³The code for reproducing the experiments, based on the package FAR-HO (https://bit.ly/far-ho), is available at https://bit.ly/hyper-repr.

Figure 4. Accuracy on D_test of exact and approximated solutions during optimization of H. Training and validation accuracies reach almost 100% already for T = 4 and after a few hundred hyperiterations, and therefore are not reported.

Table 2. Execution times on an NVidia Tesla M40 GPU.

T            1     4     16    64    256   Exact
Time (sec)   60    119   356   1344  5532  320

problem approaches the true bilevel solution. However, performing a small number of gradient descent steps for solving the inner problem acts as an implicit regularizer. As is evident from Figure 4, the generalization error is better when T is smaller than the value yielding the best approximation of the inner solution. This is to be expected since, in this setting, the dimensions of parameters and hyperparameters are of the same order, leading to a concrete possibility of overfitting the outer objective (validation error). An appropriate, problem-dependent choice of T may help avoid this issue (see also Appendix C). As T increases, the number of hyperiterations required to reach the maximum test accuracy decreases, further suggesting that there is an interplay between the number of iterations used to solve the inner and the outer objective. Finally, the running time of Algorithm 1 is linear in T and in the size of w, and independent of the size of H (see also Table 2), making it even more appealing to reduce the number of iterations.

5.2. Few-shot Learning

We now turn our attention to learning-to-learn, precisely to few-shot supervised learning, implementing the ML strategy outlined in Sec. 4 on two different benchmark datasets:

• OMNIGLOT (Lake et al., 2015), a dataset that contains examples of 1623 different handwritten characters from 50 alphabets. We downsample the images to 28 × 28.

• MINIIMAGENET (Vinyals et al., 2016), a subset of ImageNet (Deng et al., 2009) that contains 60000 downsampled images from 100 different classes.

Following the experimental protocol used in a number of recent works, we build a meta-training set D, from which we sample datasets to solve Problem (9)-(10), a meta-validation


set V for tuning ML hyperparameters, and finally a meta-test set T, which is used to estimate accuracy. Operationally, each meta-dataset consists of a pool of samples belonging to different (non-overlapping between separate meta-datasets) classes, which can be combined to form ground classification datasets D^j = D_tr^j ∪ D_val^j with 5 or 20 classes (for Omniglot). The D_tr^j's contain 1 or 5 examples per class, which are used to fit w^j (see Eq. 10). The D_val^j's, containing 15 examples per class, are used either to compute f_T(λ) (see Eq. (9)) and its (stochastic) gradient, if D^j ∈ D, or to provide a generalization score, if D^j comes from either V or T. For MiniImagenet we use the same split and images proposed in (Ravi and Larochelle, 2017), while for Omniglot we use the protocol defined by (Santoro et al., 2016).

As ground classifiers we use multinomial logistic regressors, and as task losses ℓ^j we employ cross-entropy. The inner problems, being strongly convex, admit unique minimizers, yet require numerical computation of the solutions. We initialize the ground model parameters w^j to 0 and, according to the observation in Sec. 5.1, we perform T gradient descent steps, where T is treated as a ML hyperparameter that has to be validated. Figure 6 shows an example of meta-validation of T for one-shot learning on MiniImagenet. We compute a stochastic approximation of ∇f_T(λ) with Algorithm 1 and use Adam with a decaying learning rate to optimize λ.

Regarding the specific implementation of the representation mapping h, for Omniglot we employ a four-layer convolutional neural network with strided convolutions and 64 filters per layer, as in (Vinyals et al., 2016) and other successive works. For MiniImagenet we tried two different architectures:

• C4L, a four-layer convolutional neural network with max-pooling and 32 filters per layer;

• RN, a residual network (He et al., 2016) built of four residual blocks followed by two convolutional layers.

The first network architecture has been proposed in (Ravi and Larochelle, 2017) and then used in (Finn et al., 2017), while a similar residual network architecture has been employed in a more recent work (Mishra et al., 2018). Further details on the architectures of h, as well as other ML hyperparameters, are specified in the supplementary material. We report our results using RN for MiniImagenet in Table 3, alongside scores from various recently proposed methods for comparison.

The proposed method achieves competitive results, highlighting the relative importance of learning a task-independent representation, on top of which logistic classifiers trained with very few samples generalize well. Moreover, utilizing more expressive models such as residual networks as representation mappings is beneficial for our proposed strategy and, unlike other methods, does not result in overfitting of the outer objective, as reported in (Mishra et al., 2018). Indeed, compared to C4L, RN achieves a relative improvement of 6.5% on one-shot and 4.2% on five-shot. Figure 5 provides a visual example of the goodness of the learned representation, showing that MiniImagenet examples (the first from the meta-training, the second from the meta-testing sets) from similar classes (different dog breeds) are mapped near each other by h, and conversely samples from dissimilar classes are mapped afar.

Figure 5. After sampling two datasets D ∈ D and D′ ∈ T, we show on the top the two images x ∈ D, x′ ∈ D′ that minimize ‖h_λ(x) − h_λ(x′)‖ and on the bottom those that maximize it. In between each of the two couples, we compare a random subset of components of h_λ(x) (blue) and h_λ(x′) (green).

[Plots omitted: accuracy on D_tr, D_val and D_test as a function of T ∈ {3, 5, 8, 12}.]

Figure 6. Meta-validation of the number of gradient descent steps (T) of the ground models for MiniImagenet, using the RN representation. Early stopping on the accuracy on the meta-validation set during meta-training resulted in halting the optimization of λ after 42k, 40k, 22k and 15k hyperiterations for T equal to 3, 5, 8 and 12, respectively, in line with our observation in Sec. 5.1.

5.3. On Variants of Representation Learning Methods

In this section, we show the benefits of learning a representation within the proposed bilevel framework compared to other possible approaches that involve an explicit factorization of a classifier as g^j ∘ h. The representation mapping h is either pretrained or learned with different meta-learning algorithms. We focus on the problem of one-shot learning on MiniImagenet and we use C4L as the architecture for the representation mapping. In all the experiments, the ground models g^j are multinomial logistic regressors as in Sec. 5.2, tuned with 5 steps of gradient descent. We ran the following experiments:

• Multiclass: the mapping h : X → R^64 is given by the


Table 3. Accuracy scores, computed on episodes from T, of various methods on 1-shot and 5-shot classification problems on Omniglot and MiniImagenet. For MiniImagenet, 95% confidence intervals are reported. For Hyper-representation, the scores are computed over 600 randomly drawn episodes. For other methods, we show results as reported by their respective authors.

                                            OMNIGLOT 5 classes   OMNIGLOT 20 classes   MINIIMAGENET 5 classes
Method                                      1-shot   5-shot      1-shot   5-shot       1-shot          5-shot
Siamese nets (Koch et al., 2015)            97.3     98.4        88.2     97.0         −               −
Matching nets (Vinyals et al., 2016)        98.1     98.9        93.8     98.5         43.44 ± 0.77    55.31 ± 0.73
Neural stat. (Edwards and Storkey, 2016)    98.1     99.5        93.2     98.1         −               −
Memory mod. (Kaiser et al., 2017)           98.4     99.6        95.0     98.6         −               −
Meta-LSTM (Ravi and Larochelle, 2017)       −        −           −        −            43.56 ± 0.84    60.60 ± 0.71
MAML (Finn et al., 2017)                    98.7     99.9        95.8     98.9         48.70 ± 1.75    63.11 ± 0.92
Meta-networks (Munkhdalai and Yu, 2017)     98.9     −           97.0     −            49.21 ± 0.96    −
Prototypical Net (Snell et al., 2017)       98.8     99.7        96.0     98.9         49.42 ± 0.78    68.20 ± 0.66
SNAIL (Mishra et al., 2018)                 99.1     99.8        97.6     99.4         55.71 ± 0.99    68.88 ± 0.92
Hyper-representation                        98.6     99.5        95.5     98.4         50.54 ± 0.85    64.53 ± 0.68

linear outputs before the softmax operation of a network⁴ pretrained on the totality of examples contained in the training meta-dataset (600 examples for each of the 64 classes). In this setting, we found that using the second-to-last layer or the output after the softmax yields worse results.

• Bilevel-train: we use a bilevel approach but, unlike in Sec. 4, we optimize the parameter vector λ of the representation mapping by minimizing the loss on the training sets of each episode. The hypergradient is still computed with Algorithm 1, albeit we set D_val^j = D_tr^j for each training episode.

• Approx and Approx-train: we consider an approximation of the hypergradient ∇f_T(λ) obtained by disregarding the optimization dynamics of the inner objectives (i.e., we set ∇_λ w_T^j = 0). In Approx-train we just use the training sets.

• Classic: as in (Baxter, 1995), we learn h by jointly optimizing f(λ, w^1, . . . , w^N) = Σ_{j=1}^N L^j(w^j, λ, D_tr^j) and treat the problem as standard multitask learning, with the exception that we evaluate f on mini-batches of 4 episodes, randomly sampled every 5 gradient descent iterations.

In settings where we do not use the validation sets, we let the training sets of each episode contain 16 examples per class. Using training episodes with just one example per class resulted in performances just above random chance. While the first experiment constitutes a standard baseline, the others have the specific aim of assessing (i) the importance of splitting episodes of the meta-training set into training and validation and (ii) the importance of computing the hypergradient of the approximate bilevel problem with Algorithm 1. The results reported in Table 4 suggest that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good representation in a meta-learning context. On the other side, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline. One possible explanation is that, in this context,

⁴The network is similar to C4L but has 64 filters per layer.

Table 4. Performance of various methods where the representation is either transferred or learned with variants of hyper-representation methods. The last row reports, for comparison, the score obtained with hyper-representation.

Method                       filters   Accuracy 1-shot
Multiclass                   64        43.02
Bilevel-train                32        29.63
Approx                       32        41.12
Approx-train                 32        38.80
Classic-train                32        40.46
Hyper-representation-C4L     32        47.51

some classes in the training and testing meta-datasets are rather similar (e.g., various dog breeds), and thus ground classifiers can leverage very specific representations.

6. Conclusions

We have shown that both HO and ML can be formulated in terms of bilevel programming and solved with an iterative approach. When the inner problem has a unique solution (e.g., is strongly convex), our theoretical results show that the iterative approach has convergence guarantees, a result that is interesting in its own right. In the case of ML, by adapting classical strategies (Baxter, 1995) to the bilevel framework with training/validation splitting, we present a method for learning hyper-representations which is experimentally effective and supported by our theoretical guarantees.

Our framework encompasses recently proposed methods for meta-learning, such as learning to optimize, but also suggests different design patterns for the inner learning algorithm which could be interesting to explore in future work. The resulting inner problems may not satisfy the assumptions of our convergence analysis, raising the need for further theoretical investigations. An additional future direction of research is the study of the statistical properties of bilevel strategies where outer objectives are based on the generalization ability of the inner model to new (validation) data. Ideas from (Maurer et al., 2016; Denevi et al., 2018) may be useful in this direction.


References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 3981–3989.

Bard, J. F. (2013). Practical Bilevel Optimization: Algorithms and Applications, volume 30. Springer Science & Business Media.

Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Annual Conference on Computational Learning Theory (COLT), pages 311–320. ACM.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:153:1–153:43.

Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems (NIPS), pages 3458–3468.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Bergstra, J., Yamins, D., and Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 115–123.

Caruana, R. (1998). Multitask learning. In Learning to Learn, pages 95–133. Springer.

Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018). Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089. To appear in UAI 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255.

Domke, J. (2012). Generic methods for optimization-based modeling. In AISTATS, volume 22, pages 318–326.

Dontchev, A. L. and Zolezzi, T. (1993). Well-Posed Optimization Problems, volume 1543 of Lecture Notes in Mathematics. Springer-Verlag, Berlin.

Edwards, H. and Storkey, A. (2016). Towards a neural statistician. arXiv preprint arXiv:1606.02185.

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637.

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126–1135.

Flamary, R., Rakotomamonjy, A., and Gasso, G. (2014). Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, page 103.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1165–1173.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256.

Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Hutter, F., Lücke, J., and Schmidt-Thieme, L. (2015). Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337.

Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare events.

Keerthi, S. S., Sindhwani, V., and Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems (NIPS), pages 673–680.

Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., and Torralba, A. (2012). Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158–171. Springer.

Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.

Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1885–1894.

Kunapuli, G., Bennett, K., Hu, J., and Pang, J.-S. (2008). Classification model selection via bilevel programming. Optimization Methods and Software, 23(4):475–489.

Kuzborskij, I., Orabona, F., and Caputo, B. (2013). From n to n+1: Multiclass transfer incremental learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3358–3365.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Maclaurin, D., Duvenaud, D. K., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2113–2122.

Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A simple neural attentive meta-learner.

Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2554–2563.

Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 737–746.

Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1842–1850.

Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139.

Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951–2959.

Thrun, S. and Pratt, L. (1998). Learning to Learn. Springer.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000–6010.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 3630–3638.

Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Freitas, N., and Sohl-Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3751–3760.


A. Proofs of the Results in Sec. 3

Proof of Theorem 3.1. Since Λ is compact, it follows from the Weierstrass theorem that a sufficient condition for the existence of minimizers is that f is continuous. Thus, let λ ∈ Λ and let (λ_n)_{n∈N} be a sequence in Λ such that λ_n → λ. We prove that f(λ_n) = E(w_{λ_n}, λ_n) → E(w_λ, λ) = f(λ). Since (w_{λ_n})_{n∈N} is bounded, there exists a subsequence (w_{λ_{k_n}})_{n∈N} such that w_{λ_{k_n}} → w̄ for some w̄ ∈ R^d. Now, since λ_{k_n} → λ and the map (w, λ) ↦ L_λ(w) is jointly continuous, we have

  ∀w ∈ R^d:  L_λ(w̄) = lim_n L_{λ_{k_n}}(w_{λ_{k_n}}) ≤ lim_n L_{λ_{k_n}}(w) = L_λ(w).

Therefore, w̄ is a minimizer of L_λ and hence w̄ = w_λ. This proves that (w_{λ_n})_{n∈N} is a bounded sequence having a unique cluster point. Hence, (w_{λ_n})_{n∈N} converges to its unique cluster point, which is w_λ. Finally, since (w_{λ_n}, λ_n) → (w_λ, λ) and E is jointly continuous, we have E(w_{λ_n}, λ_n) → E(w_λ, λ), and the statement follows.

We recall a fundamental fact concerning the stability of minima and minimizers in optimization problems (Dontchev and Zolezzi, 1993). We provide the proof for completeness.

Theorem A.1 (Convergence). Let ϕ_T and ϕ be lower semicontinuous functions defined on a compact set Λ. Suppose that ϕ_T converges uniformly to ϕ on Λ as T → +∞. Then:

(a) inf ϕ_T → inf ϕ;

(b) argmin ϕ_T → argmin ϕ, meaning that, for every (λ_T)_{T∈N} such that λ_T ∈ argmin ϕ_T, we have that:
  - (λ_T)_{T∈N} admits a convergent subsequence;
  - for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ argmin ϕ.

Proof. Let (λ_T)_{T∈N} be a sequence in Λ such that, for every T ∈ N, λ_T ∈ argmin ϕ_T. We prove that:

1) (λ_T)_{T∈N} admits a convergent subsequence;

2) for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ argmin ϕ and ϕ_{K_T}(λ_{K_T}) → inf ϕ;

3) inf ϕ_T → inf ϕ.

The first point follows from the fact that Λ is compact. Concerning the second point, let (λ_{K_T})_{T∈N} be a subsequence such that λ_{K_T} → λ̄. Since ϕ_{K_T} converges uniformly to ϕ, we have

  |ϕ_{K_T}(λ_{K_T}) − ϕ(λ_{K_T})| ≤ sup_{λ∈Λ} |ϕ_{K_T}(λ) − ϕ(λ)| → 0.

Therefore, using also the continuity of ϕ, we have

  ∀λ ∈ Λ:  ϕ(λ̄) = lim_T ϕ(λ_{K_T}) = lim_T ϕ_{K_T}(λ_{K_T}) ≤ lim_T ϕ_{K_T}(λ) = ϕ(λ).

So λ̄ ∈ argmin ϕ and ϕ(λ̄) = lim_T ϕ_{K_T}(λ_{K_T}) ≤ inf ϕ = ϕ(λ̄), that is, lim_T ϕ_{K_T}(λ_{K_T}) = inf ϕ.

Finally, as regards the last point, we proceed by contradiction. If (ϕ_T(λ_T))_{T∈N} does not converge to inf ϕ, then there exist an ε > 0 and a subsequence (ϕ_{K_T}(λ_{K_T}))_{T∈N} such that

  |ϕ_{K_T}(λ_{K_T}) − inf ϕ| ≥ ε   ∀T ∈ N.   (11)

Now let (λ_{K_T^{(1)}}) be a convergent subsequence of (λ_{K_T})_{T∈N}. Suppose that λ_{K_T^{(1)}} → λ̄. Clearly, (λ_{K_T^{(1)}}) is also a subsequence of (λ_T)_{T∈N}. Then it follows from point 2) above that ϕ_{K_T^{(1)}}(λ_{K_T^{(1)}}) → inf ϕ. This latter finding, together with equation (11), gives a contradiction.

Proof of Theorem 3.2. Since E(·, λ) is uniformly Lipschitz continuous, there exists ν > 0 such that, for every T ∈ N and every λ ∈ Λ,

  |f_T(λ) − f(λ)| = |E(w_{T,λ}, λ) − E(w_λ, λ)| ≤ ν ‖w_{T,λ} − w_λ‖.

It follows from assumption (vi) that f_T(λ) converges to f(λ) uniformly on Λ as T → +∞. Then the statement follows from Theorem A.1.

B. Cross-validation and Bilevel Programming

We note that the (approximate) bilevel programming framework easily accommodates also estimations of the generalization error generated by cross-validation procedures. We describe here the case of K-fold cross-validation, which includes also leave-one-out cross-validation.

Let D = {(x_i, y_i)}_{i=1}^n be the set of available data. K-fold cross-validation, with K ∈ {1, 2, . . . , N}, consists in partitioning D in K subsets {D^j}_{j=1}^K and fitting as many models g_{w^j} on training data D_tr^j = ∪_{i≠j} D^i. The models are then evaluated on D_val^j = D^j. Denoting by w = (w^j)_{j=1}^K the vector of stacked weights, the K-fold cross-validation error is given by

E(w, λ) = (1/K) Σ_{j=1}^K E^j(w^j, λ),

where E^j(w^j, λ) = Σ_{(x,y)∈D^j} ℓ(g_{w^j}(x), y). E can be treated as the outer objective in the bilevel framework, while the inner objective L_λ may be given by the sum of regularized empirical errors over each D_tr^j for the K models.


Under this perspective, a K-fold cross-validation procedure closely resembles the bilevel problem for ML formulated in Sec. 2.2, where in this case the meta-distribution collapses on the data (ground) distribution and the episodes are sampled from the same dataset of points.

By following the procedure outlined in Sec. 2.3, we can approximate the minimization of L_λ with T steps of an optimization dynamics and compute the hypergradient of f_T(λ) = (1/K) Σ_j E^j(w_T^j, λ) by training the K models and proceeding with either forward or reverse differentiation. The models may be fitted sequentially, in parallel, or stochastically. Specifically, in this last case, one can repeatedly sample one fold at a time (or possibly a mini-batch of folds) and compute a stochastic hypergradient that can be used in an SGD procedure in order to minimize f_T. At last, we note that similar ideas for the leave-one-out cross-validation error are developed in (Beirami et al., 2017), where the hypergradient of an approximation of the outer objective is computed by means of the implicit function theorem.
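A minimal sketch of this construction follows; the `fit` and `loss` callables, standing for the T-step inner solver and the outer loss ℓ, as well as all names, are illustrative placeholders.

```python
import numpy as np

def kfold_outer_objective(X, y, lam, fit, loss, K=5, seed=0):
    """K-fold cross-validation error used as the outer objective: fit one model per fold
    on the remaining K-1 folds (the inner problem) and average the held-out losses.
    fit(X, y, lam) returns fitted weights; loss(w, X, y) returns the held-out error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errs = []
    for j in range(K):
        tr_idx = np.concatenate([folds[i] for i in range(K) if i != j])   # D^j_tr
        w_j = fit(X[tr_idx], y[tr_idx], lam)                              # inner solver
        errs.append(loss(w_j, X[folds[j]], y[folds[j]]))                  # E^j on D^j_val
    return float(np.mean(errs))
```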

C. The Effect of T: Ridge Regression

In Sec. 5.1 we showed that increasing the number of iterations T leads to a better optimization of the outer objective f through the approximations f_T, which converge uniformly to f by Theorem 3.2. This, however, does not necessarily result in better test scores, due to possible overfitting of the training and validation errors, as we noted for the linear hyper-representation multiclass classification experiment in Sec. 5.1. The optimal choice of T appears to be indeed problem dependent: if in the aforementioned experiment a quite small T led to the best test accuracy (after hyperparameter optimization), in this section we present a small-scale linear regression experiment that benefits from an increasing number of inner iterations.

We generated 90 noisy synthetic data points with 30 features, of which only 5 were informative, and divided them equally to form training, validation and test sets. As the outer objective, we set the mean squared error E(w) = ||X_val w − y_val||², where X_val and y_val are the validation design matrix and targets, respectively, and w ∈ R^30 is the weight vector of the model. We set as inner objective

L_λ(w) = ||X_tr w − y_tr||² + Σ_{i=1}^{30} e^{λ_i} w_i²,

and optimized the vector of L2 regularization coefficients λ (equivalent to a diagonal Tikhonov regularization matrix). The results reported in Table 5 show that in this scenario overfitting is not an issue and 250 inner iterations yield the best test result. Note that, as in Sec. 5.1, also this problem admits an analytical solution, from which it is possible to compute the hypergradient analytically and perform exact hyperparameter optimization, as reported in the last row of Table 5.
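The analytical inner solution used for the exact baseline can be sketched as follows (a hedged illustration with illustrative names: setting the gradient of L_λ to zero gives a diagonal Tikhonov system).

```python
import numpy as np

def exact_inner_solution(X_tr, y_tr, lam):
    """Closed-form minimizer of L_lam(w) = ||X_tr w - y_tr||^2 + sum_i exp(lam_i) w_i^2:
    solving grad_w L = 0 yields (X^T X + diag(exp(lam))) w = X^T y."""
    A = X_tr.T @ X_tr + np.diag(np.exp(lam))
    return np.linalg.solve(A, X_tr.T @ y_tr)
```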

Table 5. Validation and test mean absolute percentage error (MAPE) for various values of T.

T       Validation MAPE   Test MAPE
10      11.35             43.49
50      1.28              5.22
100     0.55              1.26
250     0.47              0.50
Exact   0.37              0.57

D. Further Details on Few-shot Experiments

This appendix contains implementation details of the representation mapping h and the meta-learning hyperparameters used in the few-shot learning experiments of Sec. 5.2.

To optimize the representation mapping, in all the experiments we use Adam with learning rate set to 10^−3 and a decay rate of 10^−5. We used the initialization strategy proposed in (Glorot and Bengio, 2010) for all the weights λ of the representation.

For Omniglot experiments, we used a meta-batch size of 32 episodes for five-way and of 16 episodes for 20-way. To train the episode-specific classifiers, we set the learning rate to 0.1.

For one set of experiments with MiniImagenet, we used a hyper-representation (C4L) consisting of 4 convolutional layers, where each layer is composed of a convolution with 32 filters, a batch normalization followed by a ReLU activation, and a 2×2 max-pooling. The classifiers were trained using mini-batches of 4 episodes for one-shot and 2 episodes for five-shot, with the learning rate set to 0.01.

The other set of experiments with MiniImagenet employed a residual network (RN) as the representation mapping, built of 4 residual blocks (with 64, 96, 128, 256 filters) followed by the block: 1×1 conv (2048 filters), avg pooling, 1×1 conv (512 filters). Each residual block repeats the following block 3 times: 1×1 conv, batch normalization, leaky ReLU (leak 0.1), before the residual connection. In this case, the classifiers were optimized using mini-batches of 2 episodes for both one- and five-shot, with the learning rate set to 0.04.

Page 4: Bilevel Programming for Hyperparameter Optimization and ...

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

However it also suggests to consider the inner dynamicsas a form of approximate empirical error minimization (egearly stopping) which is valid in its own right From thisperspective ndash conversely to the implicit differentiation strat-egy ndash it is possible to include among the components of λvariables which are associated with the optimization algo-rithm itself For example λ may include the step sizes ormomentum factors if the dynamics Φt in Eq (6) is gradientdescent with momentum in (Andrychowicz et al 2016Wichrowska et al 2017) the mapping Φt is implemented asa recurrent neural network while (Finn et al 2017) focuson the initialization mapping by letting Φ0(λ) = λ

A major advantage of this reformulation is that it makes itpossible to compute efficiently the gradient of fT which wecall hypergradient either in time or in memory (Maclaurinet al 2015 Franceschi et al 2017) by making use of re-verse or forward mode algorithmic differentiation (Griewankand Walther 2008 Baydin et al 2017) This allows us tooptimize a number of hyperparameters of the same order ofthat of parameters a situation which arise in ML

3 Exact and Approximate BilevelProgramming

In this section we provide results about the existence ofsolutions of Problem (1)-(2) and the approximation proper-ties of Procedure (5)-(6) with respect to the original bilevelproblem Proofs of these results are provided in the supple-mentary material

Procedure (5)-(6) though related to the bilevel problem(1)-(2) may not be in general a good approximation ofit Indeed making the assumptions (which sound perfectlyreasonable) that for every λ isin Λ wTλ rarr wλ for somewλ isin arg maxLλ and that E(middot λ) is continuous one canonly assert that limTrarrinfin fT (λ) = E(wλ λ) ge f(λ) Thisis because the optimization dynamics converge to someminimizer of the inner objective Lλ but not necessarilyto the one that also minimizes the function E This isillustrated in Figure 2 The situation is however differentif the inner problem admits a unique minimizer for everyλ isin Λ Indeed in this case it is possible to show that theset of minimizers of the approximate problems convergeas T rarr +infin and in an appropriate sense to the set ofminimizers of the bilevel problem More precisely wemake the following assumptions

(i) Λ is a compact subset of Rm

(ii) E Rd times Λrarr R is jointly continuous

(iii) the map (w λ) 7rarr Lλ(w) is jointly continuous andsuch that arg minLλ is a singleton for every λ isin Λ

(iv) wλ = arg minLλ remains bounded as λ varies in Λ

w

E (λ)

λ λ(1) (2)w w

Figure 2 In this cartoon for a fixed λ argmin Lλ =

w(1)λ w

(2)λ the iterates of an optimization mapping Φ could

converge to w(1)λ with E(w

(1)λ λ) gt E(w

(2)λ λ)

Then problem (1)-(2) becomes

minλisinΛ

f(λ) = E(wλ λ) wλ = argminuLλ(u) (7)

Under the above assumptions in the following we giveresults about the existence of solutions of problem (7) andthe (variational) convergence of the approximate problems(5)-(6) towards problem (7) mdash relating the minima as wellas the set of minimizers In this respect we note that sinceboth f and fT are nonconvex argmin fT and argmin f arein general nonsingleton so an appropriate definition of setconvergence is required

Theorem 31 (Existence) Under Assumptions (i)-(iv) prob-lem (7) admits solutionsProof See Appendix A

The result below follows from general facts on the stabil-ity of minimizers in optimization problems (Dontchev andZolezzi 1993)

Theorem 32 (Convergence) In addition to Assump-tions (i)-(iv) suppose that

(v) E(middot λ) is uniformly Lipschitz continuous

(vi) The iterates (wTλ)TisinN converge uniformly to wλ onΛ as T rarr +infin

Then

(a) inf fT rarr inf f

(b) argmin fT rarr argmin f meaning that for every(λT )TisinN such that λT isin argmin fT we have that

- (λT )TisinN admits a convergent subsequence- for every subsequence (λKT

)TisinN such thatλKT

rarr λ we have λ isin argmin f

Proof See Appendix A

We stress that assumptions (i)-(vi) are very natural and sat-isfied by many problems of practical interests Thus theabove results provide full theoretical justification to the pro-posed approximate procedure (5)-(6) The following remarkdiscusses assumption (vi) while the subsequent examplewill be relevant to the experiments in Sec 5

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Algorithm 1 Reverse-HG for Hyper-representationInput λ current values of the hyperparameter T num-ber of iteration of GD η ground learning rate B mini-batch of episodes from DOutput Gradient of meta-training error wrt λ on Bfor j = 1 to |B| do

wj0 = 0for t = 1 to T do

wjt larr wtminus1 minus ηnablawLj(wjtminus1 λDjtr)

αjT larr nablawLj(wjT λDval)

pj larr nablaλLj(wjT λDval)for t = T minus 1 downto 0 do

pj larr pj minus αjt+1ηnablaλnablawLj(wjt λD

jtr)

αjt larr αjt+1

[I minus ηnablawnablawLj(wjt λD

jtr)

]return

sumj p

j

Remark 33 If Lλ is strongly convex then many gradient-based algorithms (eg standard and accelerated gradient de-scent) yield linear convergence of the iterates wTλrsquos More-over in such cases the rate of linear convergence is of type(νλ minus microλ)(νλ + microλ) where νλ and microλ are the Lipschitzconstant of the gradient and the modulus of strong convexityof Lλ respectively So this rate can be uniformly boundedfrom above by ρ isin ]0 1[ provided that supλisinΛ νλ lt +infinand infλisinΛ microλ gt 0 Thus in these cases wTλ convergesuniformly to wλ on Λ (at a linear rate)

Example 34 Let us consider the following form of theinner objective

LH(w) = y minusXHw2 + ρw2 (8)

where ρ gt 0 is a fixed regularization parameter andH isin Rdtimesd is the hyperparameter representing a linear fea-ture map LH is strongly convex with modulus micro = ρ gt 0(independent on the hyperparameter H) and Lipschitzsmooth with constant νH = 2(XH)gtXH+ρI which isbounded from above if H ranges in a bounded set of squarematrices In this case assumptions (i)-(vi) are satisfied

4 Learning Hyper-RepresentationsIn this section we instantiate the bilevel programming ap-proach for ML outlined in Sec 22 in the case of deep learn-ing where representation layers are shared across episodesFinding good data representations is a centerpiece in ma-chine learning Classical approaches (Baxter 1995 Caru-ana 1998) learn both the weights of the representation map-ping and those of the ground classifiers jointly on the samedata Here we follow the bilevel approach and split eachdatasetepisode in training and validation sets

Our method involves the learning of a cross-task intermedi-

ate representation hλ X rarr Rk (parametrized by a vectorλ) on top of which task specific models gj Rk rarr Yj(parametrized by vectors wj) are trained The final groundmodel for task j is thus given by gj h To find λ we solveProblem (1)-(2) with inner and outer objectives as in Eqs(3) and (4) respectively Since in general this problemcannot be solved exactly we instantiate the approximationscheme in Eqs (5)-(6) as follows

minλfT (λ) =

Nsumj=1

Lj(wjT λDjval) (9)

wjt = wjtminus1minusηnablawLj(wjtminus1 λD

jtr) t j isin [T ] [N ] (10)

Starting from an initial value the weights of the task-specificmodels are learned by T iterations of gradient descentThe gradient of fT can be computed efficiently in timeby making use of an extended reverse-hypergradient pro-cedure (Franceschi et al 2017) which we present in Al-gorithm 1 Since in general the number of episodes in ameta-training set is large we compute a stochastic approx-imation of the gradient of fT by sampling a mini-batch ofepisodes At test time given a new episode D the represen-tation h is kept fixed and all the examples in D are used totune the weights w of the episode-specific model g

Like other initialization and optimization strategies forML our method does not require lookups in a supportset as the memorization and metric strategies do (Santoroet al 2016 Vinyals et al 2016 Mishra et al 2018) Un-like (Andrychowicz et al 2016 Ravi and Larochelle 2017)we do not tune the optimization algorithm which in our caseis plain empirical loss minimization by gradient descentand rather focus on the hypothesis space Unlike (Finnet al 2017) that aims at maximizing sensitivity of new tasklosses to the model parameters we aim at maximizing thegeneralization to novel examples during training episodeswith respect to λ Our assumptions about the structure ofthe model are slightly stronger than in (Finn et al 2017)but still mild namely that some (hyper)parameters definethe representation and the remaining parameters define theclassification function In (Munkhdalai and Yu 2017) themeta-knowledge is distributed among fast and slow weightsand an external memory our approach is more direct sincethe meta-knowledge is solely distilled by λ A further advan-tage of our method is that if the episode-specific models arelinear (eg logistic regressors) and each loss Lj is stronglyconvex in w the theoretical guarantees of Theorem 32 ap-ply (see Remark 33) These assumptions are satisfied in theexperiments reported in the next section

5 ExperimentsThe aim of the following experiments is threefold Firstwe investigate the impact of the number of iterations of theoptimization dynamics on the quality of the solution on a

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

0 1000 2000 3000 4000 5000

Hyperiterations

60

80

100

120

140

160

180Sq

uare

der

ror

Exact solutonT = 1

T = 4

T = 16

T = 64

T = 256

Figure 3 Optimization of the outer objectives f and fT for exactand approximate problems The optimization of H is performedwith gradient descent with momentum with same initializationstep size and momentum factor for each run

simple multiclass classification problem Second we testour hyper-representation method in the context of few-shotlearning on two benchmark datasets Finally we constrastthe bilevel ML approach against classical approaches tolearn shared representations 3

51 The Effect of T

Motivated by the theoretical findings of Sec 3 we empir-ically investigate how solving the inner problem approxi-mately (ie using small T ) affects convergence generaliza-tion performances and running time We focus in particularon the linear feature map described in Example 34 whichallows us to compare the approximated solution against theclosed-form analytical solution given by

wH = [(XH)TXH + ρI]minus1(XH)TY

In this setting the bilevel problem reduces to a (non-convex)optimization problem in H

We use a subset of 100 classes extracted from Omniglotdataset (Lake et al 2017) to construct a HO problem aimedat tuning H A training set Dtr and a validation set Dvaleach consisting of three randomly drawn examples per classwere sampled to form the HO problem A third set Dtestconsisting of fifteen examples per class was used for testingInstead of using raw images as input we employ featurevectors x isin R256 computed by the convolutional networktrained on one-shot five-ways ML setting as described inSec 52

For the approximate problems we compute the hypergra-dient using Algorithm 1 where it is intended that B =(Dtr Dval) Figure 3 shows the values of functions f andfT (see Eqs (1) and (5) respectively) during the optimiza-tion of H As T increases the solution of the approximate

3The code for reproducing the experiments based on thepackage FAR-HO (httpsbitlyfar-ho) is availableat httpsbitlyhyper-repr

0 1000 2000 3000 4000 5000

Hyperiterations

85

86

87

88

89

90

91

92

93

Acc

urac

y

Exact solutonT = 1

T = 4

T = 16

T = 64

T = 256

Figure 4 Accuracy on Dtest of exact and approximated solutionsduring optimization of H Training and validation accuraciesreach almost 100 already for T = 4 and after few hundredhyperiterations and therefore are not reported

Table 2 Execution times on a NVidia Tesla M40 GPU

T 1 4 16 64 256 ExactTime (sec) 60 119 356 1344 5532 320

problem approaches the true bilevel solution However per-forming a small number of gradient descent steps for solvingthe inner problem acts as implicit regularizer As it is evi-dent from Figure 4 the generalization error is better whenT is smaller than the value yielding the best approximationof the inner solution This is to be expected since in thissetting the dimensions of parameters and hyperparametersare of the same order leading to a concrete possibility ofoverfitting the outer objective (validation error) An appro-priate problem dependent choice of T may help avoidingthis issue (see also Appendix C) As T increases the numberof hyperiterations required to reach the maximum test accu-racy decreases further suggesting that there is an interplaybetween the number of iterations used to solve the inner andthe outer objective Finally the running time of Algorithm1 is linear in T and the size of w and independent of thesize of H (see also Table 2) making it even more appealingto reduce the number of iterations

52 Few-shot Learning

We new turn our attention to learning-to-learn precisely tofew-shot supervised learning implementing the ML strategyoutlined in Sec 4 on two different benchmark datasets

bull OMNIGLOT (Lake et al 2015) a dataset that containsexamples of 1623 different handwritten characters from 50alphabets We downsample the images to 28times 28

bull MINIIMAGENET (Vinyals et al 2016) a subset of Ima-geNet (Deng et al 2009) that contains 60000 downsampledimages from 100 different classes

Following the experimental protocol used in a number of re-cent works we build a meta-training set D from which wesample datasets to solve Problem (9)-(10) a meta-validation

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

set V for tuning ML hyperparameters and finally a meta-test set T which is used to estimate accuracy Operationallyeach meta-dataset consists of a pool of samples belonging todifferent (non-overlapping between separate meta-dataset)classes which can be combined to form ground classifica-tion datasets Dj = Dj

tr cup Djval with 5 or 20 classes (for

Omniglot) The Djtrrsquos contain 1 or 5 examples per class

which are used to fit wj (see Eq 10) The Djvalrsquos contain-

ing 15 examples per class is used either to compute fT (λ)(see Eq (9)) and its (stochastic) gradient if Dj isin D or toprovide a generalization score if Dj comes from either V orT For MiniImagenet we use the same split and images pro-posed in (Ravi and Larochelle 2017) while for Omniglotwe use the protocol defined by (Santoro et al 2016)

As ground classifiers we use multinomial logistic regressorsand as task losses `j we employ cross-entropy The innerproblems being strongly convex admit unique minimizersyet require numerical computation of the solutions We ini-tialize ground models parameters wj to 0 and according tothe observation in Sec 51 we perform T gradient descentsteps where T is treated as a ML hyperparameter that has tobe validated Figure 6 shows an example of meta-validationof T for one-shot learning on MiniImagenet We compute astochastic approximation ofnablafT (λ) with Algorithm 1 anduse Adam with decaying learning rate to optimize λ

Regarding the specific implementation of the representationmapping h we employ for Omniglot a four-layers convo-lutional neural network with strided convolutions and 64filters per layer as in (Vinyals et al 2016) and other suc-cessive works For MiniImagenet we tried two differentarchitectures

bull C4L a four-layers convolutional neural network with max-pooling and 32 filters per layer

bull RN a residual network (He et al 2016) built of fourresidual blocks followed by two convolutional layers

The first network architecture has been proposed in (Raviand Larochelle 2017) and then used in (Finn et al 2017)while a similar residual network architecture has been em-ployed in a more recent work (Mishra et al 2018) Furtherdetails on the architectures of h as well as other ML hyper-parameters are specified in the supplementary material Wereport our results using RN for MiniImagenet in Table 3alongside scores from various recently proposed methodsfor comparison

The proposed method achieves competitive results highlight-ing the relative importance of learning a task independentrepresentation on the top of which logistic classifiers trainedwith very few samples generalize well Moreover utilizingmore expressive models such as residual network as repre-sentation mappings is beneficial for our proposed strategyand unlike other methods does not result in overfitting

of the outer objective as reported in (Mishra et al 2018)Indeed compared to C4L RN achieves a relative improve-ment of 65 on one-shot and 42 on five-shot Figure 5provides a visual example of the goodness of the learned rep-resentation showing that MiniImagenet examples (the firstfrom meta-training the second from the meta-testing sets)from similar classes (different dog breeds) are mapped neareach other by h and conversely samples from dissimilarclasses are mapped afar

Figure 5 After sampling two datasets D isin D and Dprime isin T weshow on the top the two images x isin D xprime isin Dprime that minimize||hλ(x)minus hλ(xprime)|| and on the bottom those that maximize it Inbetween each of the two couples we compare a random subset ofcomponents of hλ(x) (blue) and hλ(xprime) (green)

3 5 8 1266

68

70

72

74

76

78Accuracy on Dtr

3 5 8 12

49

50

51

52

53

Accuracy on DtestAccuracy on Dval

Figure 6 Meta-validation of the number of gradient descent steps(T ) of the ground models for MiniImagenet using the RN repre-sentation Early stopping on the accuracy on meta-validation setduring meta-training resulted in halting the optimization of λ after42k 40k 22k and 15k hyperiterations for T equal to 3 5 8 and12 respectively in line with our observation in Sec 51

53 On Variants of Representation Learning Methods

In this section we show the benefits of learning a represen-tation within the proposed bilevel framework compared toother possible approaches that involve an explicit factoriza-tion of a classifier as gj h The representation mapping his either pretrained or learned with different meta-learningalgorithms We focus on the problem of one-shot learningon MiniImagenet and we use C4L as architecture for therepresentation mapping In all the experiments the groundmodels gj are multinomial logistic regressor as in Sec 52tuned with 5 steps of gradient descent We ran the followingexperiments

• Multiclass: the mapping $h : \mathcal{X} \to \mathbb{R}^{64}$ is given by the linear outputs before the softmax operation of a network⁴ pretrained on the totality of examples contained in the training meta-dataset (600 examples for each of the 64 classes). In this setting we found that using the second-to-last layer or the output after the softmax yields worse results.

Table 3. Accuracy scores of various methods on 1-shot and 5-shot classification problems on Omniglot and MiniImagenet, computed on episodes from T. For MiniImagenet, 95% confidence intervals are reported. For Hyper-representation, the scores are computed over 600 randomly drawn episodes. For other methods, we show results as reported by their respective authors.

Method                                    | Omniglot (5 classes)  | Omniglot (20 classes) | MiniImagenet (5 classes)
                                          | 1-shot    5-shot      | 1-shot    5-shot      | 1-shot          5-shot
Siamese nets (Koch et al., 2015)          | 97.3      98.4        | 88.2      97.0        | -               -
Matching nets (Vinyals et al., 2016)      | 98.1      98.9        | 93.8      98.5        | 43.44 ± 0.77    55.31 ± 0.73
Neural stat. (Edwards and Storkey, 2016)  | 98.1      99.5        | 93.2      98.1        | -               -
Memory mod. (Kaiser et al., 2017)         | 98.4      99.6        | 95.0      98.6        | -               -
Meta-LSTM (Ravi and Larochelle, 2017)     | -         -           | -         -           | 43.56 ± 0.84    60.60 ± 0.71
MAML (Finn et al., 2017)                  | 98.7      99.9        | 95.8      98.9        | 48.70 ± 1.75    63.11 ± 0.92
Meta-networks (Munkhdalai and Yu, 2017)   | 98.9      -           | 97.0      -           | 49.21 ± 0.96    -
Prototypical Net (Snell et al., 2017)     | 98.8      99.7        | 96.0      98.9        | 49.42 ± 0.78    68.20 ± 0.66
SNAIL (Mishra et al., 2018)               | 99.1      99.8        | 97.6      99.4        | 55.71 ± 0.99    68.88 ± 0.92
Hyper-representation                      | 98.6      99.5        | 95.5      98.4        | 50.54 ± 0.85    64.53 ± 0.68

• Bilevel-train: we use a bilevel approach but, unlike in Sec. 4, we optimize the parameter vector λ of the representation mapping by minimizing the loss on the training sets of each episode. The hypergradient is still computed with Algorithm 1, albeit we set $D^j_{val} = D^j_{tr}$ for each training episode.

• Approx and Approx-train: we consider an approximation of the hypergradient $\nabla f_T(\lambda)$ by disregarding the optimization dynamics of the inner objectives (i.e. we set $\nabla_\lambda w^j_T = 0$); a sketch contrasting this with the full hypergradient is given after the list. In Approx-train we just use the training sets.

• Classic: as in (Baxter, 1995), we learn $h$ by jointly optimizing $f(\lambda, w^1, \dots, w^N) = \sum_{j=1}^{N} L^j(w^j, \lambda, D^j_{tr})$ and we treat the problem as standard multitask learning, with the exception that we evaluate $f$ on mini-batches of 4 episodes, randomly sampled every 5 gradient descent iterations.
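To make the distinction between the full and the approximate hypergradient concrete, the following minimal sketch (written in JAX; the linear representation, the binary logistic ground loss and the step sizes are illustrative assumptions, and this is not the implementation used for the experiments) differentiates through the unrolled inner gradient descent in the first case and blocks that path in the second, so that $\nabla_\lambda w^j_T$ is effectively zero.

```python
# Minimal sketch: full vs. "Approx" hypergradient for one episode.
# Assumptions: features mapped by a linear representation lam, logistic ground loss.
import jax
import jax.numpy as jnp

def inner_loss(w, lam, X, y):
    # episode-specific loss L^j: logistic loss on features X @ lam with weights w (y in {-1, +1})
    logits = (X @ lam) @ w
    return jnp.mean(jnp.log(1.0 + jnp.exp(-y * logits)))

def inner_solver(lam, X_tr, y_tr, T=5, eta=0.1):
    # T steps of gradient descent on the inner objective, starting from w = 0
    w = jnp.zeros(lam.shape[1])
    for _ in range(T):
        w = w - eta * jax.grad(inner_loss)(w, lam, X_tr, y_tr)
    return w

def outer_loss(lam, episode):
    # validation loss of the T-step iterate: one episode's contribution to f_T
    X_tr, y_tr, X_val, y_val = episode
    return inner_loss(inner_solver(lam, X_tr, y_tr), lam, X_val, y_val)

def approx_outer_loss(lam, episode):
    # "Approx" variant: block gradients through the inner dynamics,
    # so only the direct dependence of the validation loss on lam remains
    X_tr, y_tr, X_val, y_val = episode
    w_T = jax.lax.stop_gradient(inner_solver(lam, X_tr, y_tr))
    return inner_loss(w_T, lam, X_val, y_val)

full_hypergrad = jax.grad(outer_loss)           # reverse-mode AD through the T inner steps
approx_hypergrad = jax.grad(approx_outer_loss)  # ignores nabla_lam w_T
```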

In settings where we do not use the validation sets, we let the training sets of each episode contain 16 examples per class. Using training episodes with just one example per class resulted in performances just above random chance. While the first experiment constitutes a standard baseline, the others have the specific aim of assessing (i) the importance of splitting episodes of the meta-training set into training and validation and (ii) the importance of computing the hypergradient of the approximate bilevel problem with Algorithm 1. The results reported in Table 4 suggest that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good representation in a meta-learning context. On the other side, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline. One possible explanation is that, in this context, some classes in the training and testing meta-datasets are rather similar (e.g. various dog breeds) and thus ground classifiers can leverage very specific representations.

⁴ The network is similar to C4L but has 64 filters per layer.

Table 4. Performance of various methods where the representation is either transferred or learned with variants of hyper-representation methods. The last row reports, for comparison, the score obtained with hyper-representation.

Method                    | filters | Accuracy 1-shot
Multiclass                | 64      | 43.02
Bilevel-train             | 32      | 29.63
Approx                    | 32      | 41.12
Approx-train              | 32      | 38.80
Classic-train             | 32      | 40.46
Hyper-representation-C4L  | 32      | 47.51

6. Conclusions

We have shown that both HO and ML can be formulated in terms of bilevel programming and solved with an iterative approach. When the inner problem has a unique solution (e.g. is strongly convex), our theoretical results show that the iterative approach has convergence guarantees, a result that is interesting in its own right. In the case of ML, by adapting classical strategies (Baxter, 1995) to the bilevel framework with training/validation splitting, we present a method for learning hyper-representations which is experimentally effective and supported by our theoretical guarantees.

Our framework encompasses recently proposed methods for meta-learning, such as learning to optimize, but also suggests different design patterns for the inner learning algorithm which could be interesting to explore in future work. The resulting inner problems may not satisfy the assumptions of our convergence analysis, raising the need for further theoretical investigations. An additional future direction of research is the study of the statistical properties of bilevel strategies where outer objectives are based on the generalization ability of the inner model to new (validation) data. Ideas from (Maurer et al., 2016; Denevi et al., 2018) may be useful in this direction.


References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 3981-3989.

Bard, J. F. (2013). Practical Bilevel Optimization: Algorithms and Applications, volume 30. Springer Science & Business Media.

Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Annual Conference on Computational Learning Theory (COLT), pages 311-320. ACM.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1-43.

Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems (NIPS), pages 3458-3468.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.

Bergstra, J., Yamins, D., and Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 115-123.

Caruana, R. (1998). Multitask learning. In Learning to Learn, pages 95-133. Springer.

Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of Operations Research, 153(1):235-256.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018). Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089. To appear in UAI 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248-255.

Domke, J. (2012). Generic methods for optimization-based modeling. In AISTATS, volume 22, pages 318-326.

Dontchev, A. L. and Zolezzi, T. (1993). Well-Posed Optimization Problems, volume 1543 of Lecture Notes in Mathematics. Springer-Verlag, Berlin.

Edwards, H. and Storkey, A. (2016). Towards a neural statistician. arXiv preprint arXiv:1606.02185.

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615-637.

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126-1135.

Flamary, R., Rakotomamonjy, A., and Gasso, G. (2014). Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, page 103.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1165-1173.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249-256.

Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770-778.

Hutter, F., Lücke, J., and Schmidt-Thieme, L. (2015). Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz, 29(4):329-337.

Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare events.

Keerthi, S. S., Sindhwani, V., and Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems (NIPS), pages 673-680.

Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., and Torralba, A. (2012). Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158-171. Springer.

Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.

Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1885-1894.

Kunapuli, G., Bennett, K., Hu, J., and Pang, J.-S. (2008). Classification model selection via bilevel programming. Optimization Methods and Software, 23(4):475-489.

Kuzborskij, I., Orabona, F., and Caputo, B. (2013). From n to n+1: Multiclass transfer incremental learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3358-3365.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Maclaurin, D., Duvenaud, D. K., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2113-2122.

Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853-2884.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A simple neural attentive meta-learner.

Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2554-2563.

Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 737-746.

Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1842-1850.

Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131-139.

Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 4080-4090.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951-2959.

Thrun, S. and Pratt, L. (1998). Learning to Learn. Springer.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000-6010.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 3630-3638.

Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Freitas, N., and Sohl-Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3751-3760.


A. Proofs of the Results in Sec. 3

Proof of Theorem 3.1. Since Λ is compact, it follows from Weierstrass' theorem that a sufficient condition for the existence of minimizers is that $f$ is continuous. Thus, let $\bar\lambda \in \Lambda$ and let $(\lambda_n)_{n \in \mathbb{N}}$ be a sequence in Λ such that $\lambda_n \to \bar\lambda$. We prove that $f(\lambda_n) = E(w_{\lambda_n}, \lambda_n) \to E(w_{\bar\lambda}, \bar\lambda) = f(\bar\lambda)$. Since $(w_{\lambda_n})_{n \in \mathbb{N}}$ is bounded, there exists a subsequence $(w_{k_n})_{n \in \mathbb{N}}$ such that $w_{k_n} \to \bar w$ for some $\bar w \in \mathbb{R}^d$. Now, since $\lambda_{k_n} \to \bar\lambda$ and the map $(w, \lambda) \mapsto L_\lambda(w)$ is jointly continuous, we have

$$\forall\, w \in \mathbb{R}^d \qquad L_{\bar\lambda}(\bar w) = \lim_n L_{\lambda_{k_n}}(w_{k_n}) \le \lim_n L_{\lambda_{k_n}}(w) = L_{\bar\lambda}(w).$$

Therefore, $\bar w$ is a minimizer of $L_{\bar\lambda}$ and hence $\bar w = w_{\bar\lambda}$. This proves that $(w_{\lambda_n})_{n \in \mathbb{N}}$ is a bounded sequence having a unique cluster point. Hence $(w_{\lambda_n})_{n \in \mathbb{N}}$ converges to its unique cluster point, which is $w_{\bar\lambda}$. Finally, since $(w_{\lambda_n}, \lambda_n) \to (w_{\bar\lambda}, \bar\lambda)$ and $E$ is jointly continuous, we have $E(w_{\lambda_n}, \lambda_n) \to E(w_{\bar\lambda}, \bar\lambda)$, and the statement follows.

We recall a fundamental fact concerning the stability of minima and minimizers in optimization problems (Dontchev and Zolezzi, 1993). We provide the proof for completeness.

Theorem A.1 (Convergence). Let $\varphi_T$ and $\varphi$ be lower semicontinuous functions defined on a compact set Λ. Suppose that $\varphi_T$ converges uniformly to $\varphi$ on Λ as $T \to +\infty$. Then

(a) $\inf \varphi_T \to \inf \varphi$;

(b) $\operatorname{argmin} \varphi_T \to \operatorname{argmin} \varphi$, meaning that, for every $(\lambda_T)_{T \in \mathbb{N}}$ such that $\lambda_T \in \operatorname{argmin} \varphi_T$, we have that:
- $(\lambda_T)_{T \in \mathbb{N}}$ admits a convergent subsequence;
- for every subsequence $(\lambda_{K_T})_{T \in \mathbb{N}}$ such that $\lambda_{K_T} \to \bar\lambda$, we have $\bar\lambda \in \operatorname{argmin} \varphi$.

Proof. Let $(\lambda_T)_{T \in \mathbb{N}}$ be a sequence in Λ such that, for every $T \in \mathbb{N}$, $\lambda_T \in \operatorname{argmin} \varphi_T$. We prove that:

1) $(\lambda_T)_{T \in \mathbb{N}}$ admits a convergent subsequence;

2) for every subsequence $(\lambda_{K_T})_{T \in \mathbb{N}}$ such that $\lambda_{K_T} \to \bar\lambda$, we have $\bar\lambda \in \operatorname{argmin} \varphi$ and $\varphi_{K_T}(\lambda_{K_T}) \to \inf \varphi$;

3) $\inf \varphi_T \to \inf \varphi$.

The first point follows from the fact that Λ is compact. Concerning the second point, let $(\lambda_{K_T})_{T \in \mathbb{N}}$ be a subsequence such that $\lambda_{K_T} \to \bar\lambda$. Since $\varphi_{K_T}$ converges uniformly to $\varphi$, we have

$$|\varphi_{K_T}(\lambda_{K_T}) - \varphi(\lambda_{K_T})| \le \sup_{\lambda \in \Lambda} |\varphi_{K_T}(\lambda) - \varphi(\lambda)| \to 0.$$

Therefore, using also the continuity of $\varphi$, we have

$$\forall\, \lambda \in \Lambda \qquad \varphi(\bar\lambda) = \lim_T \varphi(\lambda_{K_T}) = \lim_T \varphi_{K_T}(\lambda_{K_T}) \le \lim_T \varphi_{K_T}(\lambda) = \varphi(\lambda).$$

So $\bar\lambda \in \operatorname{argmin} \varphi$ and $\varphi(\bar\lambda) = \lim_T \varphi_{K_T}(\lambda_{K_T}) \le \inf \varphi = \varphi(\bar\lambda)$, that is, $\lim_T \varphi_{K_T}(\lambda_{K_T}) = \inf \varphi$.

Finally, as regards the last point, we proceed by contradiction. If $(\varphi_T(\lambda_T))_{T \in \mathbb{N}}$ does not converge to $\inf \varphi$, then there exist an $\varepsilon > 0$ and a subsequence $(\varphi_{K_T}(\lambda_{K_T}))_{T \in \mathbb{N}}$ such that

$$|\varphi_{K_T}(\lambda_{K_T}) - \inf \varphi| \ge \varepsilon \quad \forall\, T \in \mathbb{N}. \qquad (11)$$

Now, let $(\lambda_{K^{(1)}_T})$ be a convergent subsequence of $(\lambda_{K_T})_{T \in \mathbb{N}}$ and suppose that $\lambda_{K^{(1)}_T} \to \bar\lambda$. Clearly, $(\lambda_{K^{(1)}_T})$ is also a subsequence of $(\lambda_T)_{T \in \mathbb{N}}$. Then it follows from point 2) above that $\varphi_{K^{(1)}_T}(\lambda_{K^{(1)}_T}) \to \inf \varphi$. This latter finding, together with equation (11), gives a contradiction.

Proof of Theorem 3.2. Since $E(\cdot, \lambda)$ is uniformly Lipschitz continuous, there exists $\nu > 0$ such that, for every $T \in \mathbb{N}$ and every $\lambda \in \Lambda$,

$$|f_T(\lambda) - f(\lambda)| = |E(w_{T,\lambda}, \lambda) - E(w_\lambda, \lambda)| \le \nu\, \|w_{T,\lambda} - w_\lambda\|.$$

It follows from assumption (vi) that $f_T(\lambda)$ converges to $f(\lambda)$ uniformly on Λ as $T \to +\infty$. Then the statement follows from Theorem A.1.

B. Cross-validation and Bilevel Programming

We note that the (approximate) bilevel programming framework easily accommodates also estimations of the generalization error generated by cross-validation procedures. We describe here the case of K-fold cross-validation, which includes also leave-one-out cross-validation.

Let $D = \{(x_i, y_i)\}_{i=1}^n$ be the set of available data. K-fold cross-validation, with $K \in \{1, 2, \dots, N\}$, consists in partitioning $D$ in $K$ subsets $\{D^j\}_{j=1}^K$ and fitting as many models $g_{w^j}$ on training data $D^j_{tr} = \bigcup_{i \neq j} D^i$. The models are then evaluated on $D^j_{val} = D^j$. Denoting by $w = (w^j)_{j=1}^K$ the vector of stacked weights, the K-fold cross-validation error is given by

$$E(w, \lambda) = \frac{1}{K} \sum_{j=1}^{K} E^j(w^j, \lambda),$$

where $E^j(w^j, \lambda) = \sum_{(x,y) \in D^j} \ell(g_{w^j}(x), y)$. $E$ can be treated as the outer objective in the bilevel framework, while the inner objective $L_\lambda$ may be given by the sum of regularized empirical errors over each $D^j_{tr}$ for the $K$ models.


Under this perspective, a K-fold cross-validation procedure closely resembles the bilevel problem for ML formulated in Sec. 2.2, where, in this case, the meta-distribution collapses on the data (ground) distribution and the episodes are sampled from the same dataset of points.

By following the procedure outlined in Sec. 2.3, we can approximate the minimization of $L_\lambda$ with $T$ steps of an optimization dynamics and compute the hypergradient of $f_T(\lambda) = \frac{1}{K} \sum_j E^j(w^j_T, \lambda)$ by training the $K$ models and proceeding with either forward or reverse differentiation. The models may be fitted sequentially, in parallel or stochastically. Specifically, in this last case, one can repeatedly sample one fold at a time (or possibly a mini-batch of folds) and compute a stochastic hypergradient that can be used in an SGD procedure in order to minimize $f_T$. At last, we note that similar ideas for the leave-one-out cross-validation error are developed in (Beirami et al., 2017), where the hypergradient of an approximation of the outer objective is computed by means of the implicit function theorem.
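As a concrete illustration of the stochastic variant, the sketch below (a hypothetical setup in JAX with a ridge-style inner objective; the fold data, scalar hyperparameter, step sizes and iteration counts are assumptions, not code from this work) treats each fold's validation error as one term of $f_T$ and samples one fold per hyperiteration.

```python
# Hypothetical sketch: K-fold cross-validation error as the outer objective
# of the approximate bilevel problem, with a stochastic hypergradient.
import jax
import jax.numpy as jnp

def inner_objective(w, lam, X, y):
    # regularized empirical error on the union of the other folds (one model per fold)
    return jnp.mean((X @ w - y) ** 2) + jnp.exp(lam) * jnp.sum(w ** 2)

def fold_outer_error(lam, X_tr, y_tr, X_val, y_val, T=100, eta=0.01):
    # E^j evaluated at the T-step inner iterate w^j_T
    w = jnp.zeros(X_tr.shape[1])
    for _ in range(T):
        w = w - eta * jax.grad(inner_objective)(w, lam, X_tr, y_tr)
    return jnp.mean((X_val @ w - y_val) ** 2)

# reverse-mode hypergradient of one fold's contribution to f_T
fold_hypergrad = jax.grad(fold_outer_error)

def minimize_fT(folds, lam=0.0, hyperiterations=500, outer_lr=0.05, seed=0):
    # folds: list of (X_tr, y_tr, X_val, y_val) tuples, one per fold
    rng = jax.random.PRNGKey(seed)
    for _ in range(hyperiterations):
        rng, sub = jax.random.split(rng)
        j = int(jax.random.randint(sub, (), 0, len(folds)))     # sample a fold
        lam = lam - outer_lr * fold_hypergrad(lam, *folds[j])   # stochastic step on f_T
    return lam
```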

C. The Effect of T: Ridge Regression

In Sec. 5.1 we showed that increasing the number of iterations $T$ leads to a better optimization of the outer objective $f$ through the approximations $f_T$, which converge uniformly to $f$ by Proposition 3.2. This, however, does not necessarily result in better test scores, due to possible overfitting of the training and validation errors, as we noted for the linear hyper-representation multiclass classification experiment in Sec. 5.1. The optimal choice of $T$ appears to be indeed problem dependent: if in the aforementioned experiment a quite small $T$ led to the best test accuracy (after hyperparameter optimization), in this section we present a small-scale linear regression experiment that benefits from an increasing number of inner iterations.

We generated 90 noisy synthetic data points with 30 features, of which only 5 were informative, and divided them equally to form training, validation and test sets. As outer objective we set the mean squared error $E(w) = \|X_{val} w - y_{val}\|^2$, where $X_{val}$ and $y_{val}$ are the validation design matrix and targets, respectively, and $w \in \mathbb{R}^{30}$ is the weight vector of the model. We set as inner objective

$$L_\lambda(w) = \|X_{tr} w - y_{tr}\|^2 + \sum_{i=1}^{30} e^{\lambda_i} w_i^2,$$

and optimized the vector λ of L2 regularization coefficients (equivalent to a diagonal Tikhonov regularization matrix). The results reported in Table 5 show that in this scenario overfitting is not an issue and 250 inner iterations yield the best test result. Note that, as in Sec. 5.1, also this problem admits an analytical solution, from which it is possible to compute the hypergradient analytically and perform exact hyperparameter optimization, as reported in the last row of Table 5.

Table 5. Validation and test mean absolute percentage error (MAPE) for various values of T.

T     | Validation MAPE | Test MAPE
10    | 11.35           | 43.49
50    | 1.28            | 5.22
100   | 0.55            | 1.26
250   | 0.47            | 0.50
Exact | 0.37            | 0.57
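A compact sketch of this experiment (in JAX; the data generation, step sizes and iteration counts are illustrative assumptions rather than the exact settings used above) shows both the $T$-step approximation $f_T$ and the exact outer objective obtained through the closed-form ridge solution.

```python
# Sketch of the diagonal-Tikhonov ridge experiment: hypergradients of the
# approximate and of the exact outer objective; constants are illustrative.
import jax
import jax.numpy as jnp

def inner_objective(w, lam, X_tr, y_tr):
    # L_lambda(w) = ||X_tr w - y_tr||^2 + sum_i exp(lambda_i) * w_i^2
    return jnp.sum((X_tr @ w - y_tr) ** 2) + jnp.sum(jnp.exp(lam) * w ** 2)

def f_T(lam, X_tr, y_tr, X_val, y_val, T=250, eta=1e-4):
    # approximate outer objective: validation error of the T-step inner iterate
    w = jnp.zeros(X_tr.shape[1])
    for _ in range(T):
        w = w - eta * jax.grad(inner_objective)(w, lam, X_tr, y_tr)
    return jnp.sum((X_val @ w - y_val) ** 2)

def f_exact(lam, X_tr, y_tr, X_val, y_val):
    # exact outer objective through the analytical ridge solution
    w_star = jnp.linalg.solve(X_tr.T @ X_tr + jnp.diag(jnp.exp(lam)), X_tr.T @ y_tr)
    return jnp.sum((X_val @ w_star - y_val) ** 2)

grad_f_T = jax.grad(f_T)          # hypergradient through the unrolled dynamics
grad_f_exact = jax.grad(f_exact)  # exact hypergradient through the linear solve
```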

D. Further Details on Few-shot Experiments

This appendix contains implementation details of the representation mapping $h$ and the meta-learning hyperparameters used in the few-shot learning experiments of Sec. 5.2.

To optimize the representation mapping, in all the experiments we use Adam with learning rate set to $10^{-3}$ and a decay rate of $10^{-5}$. We used the initialization strategy proposed in (Glorot and Bengio, 2010) for all the weights λ of the representation.

For Omniglot experiments we used a meta-batch size of 32 episodes for five-way and of 16 episodes for 20-way classification. To train the episode-specific classifiers we set the learning rate to 0.1.

For one set of experiments with MiniImagenet we used a hyper-representation (C4L) consisting of 4 convolutional layers, where each layer is composed of a convolution with 32 filters, a batch normalization followed by a ReLU activation, and a 2x2 max-pooling. The classifiers were trained using mini-batches of 4 episodes for one-shot and 2 episodes for five-shot, with learning rate set to 0.01.
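For concreteness, a sketch of such a C4L mapping is given below using jax.example_libraries.stax; the 3x3 kernel size, SAME padding and 84x84 RGB input resolution are assumptions based on the description above, not the exact code used for the experiments.

```python
# Hypothetical stax sketch of the C4L hyper-representation: four blocks of
# (conv with 32 filters -> batch norm -> ReLU -> 2x2 max-pool), then flatten.
import jax
from jax.example_libraries import stax

def conv_block(filters=32):
    return stax.serial(
        stax.Conv(filters, (3, 3), padding='SAME'),  # 3x3 kernels assumed
        stax.BatchNorm(),
        stax.Relu,
        stax.MaxPool((2, 2), strides=(2, 2)),
    )

init_h, apply_h = stax.serial(
    conv_block(), conv_block(), conv_block(), conv_block(),
    stax.Flatten,
)

# Example initialization for (assumed) 84x84 RGB inputs: lam collects all the
# representation weights, and apply_h(lam, images) plays the role of h_lambda.
rng = jax.random.PRNGKey(0)
out_shape, lam = init_h(rng, (-1, 84, 84, 3))
```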

The other set of experiments with MiniImagenet employed a residual network (RN) as the representation mapping, built of 4 residual blocks (with 64, 96, 128, 256 filters) followed by the block: 1x1 conv (2048 filters), avg pooling, 1x1 conv (512 filters). Each residual block repeats the following block 3 times: 1x1 conv, batch normalization, leaky ReLU (leak 0.1), before the residual connection. In this case the classifiers were optimized using mini-batches of 2 episodes for both one- and five-shot, with learning rate set to 0.04.

Page 5: Bilevel Programming for Hyperparameter Optimization and ...

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Algorithm 1 Reverse-HG for Hyper-representationInput λ current values of the hyperparameter T num-ber of iteration of GD η ground learning rate B mini-batch of episodes from DOutput Gradient of meta-training error wrt λ on Bfor j = 1 to |B| do

wj0 = 0for t = 1 to T do

wjt larr wtminus1 minus ηnablawLj(wjtminus1 λDjtr)

αjT larr nablawLj(wjT λDval)

pj larr nablaλLj(wjT λDval)for t = T minus 1 downto 0 do

pj larr pj minus αjt+1ηnablaλnablawLj(wjt λD

jtr)

αjt larr αjt+1

[I minus ηnablawnablawLj(wjt λD

jtr)

]return

sumj p

j

Remark 33 If Lλ is strongly convex then many gradient-based algorithms (eg standard and accelerated gradient de-scent) yield linear convergence of the iterates wTλrsquos More-over in such cases the rate of linear convergence is of type(νλ minus microλ)(νλ + microλ) where νλ and microλ are the Lipschitzconstant of the gradient and the modulus of strong convexityof Lλ respectively So this rate can be uniformly boundedfrom above by ρ isin ]0 1[ provided that supλisinΛ νλ lt +infinand infλisinΛ microλ gt 0 Thus in these cases wTλ convergesuniformly to wλ on Λ (at a linear rate)

Example 34 Let us consider the following form of theinner objective

LH(w) = y minusXHw2 + ρw2 (8)

where ρ gt 0 is a fixed regularization parameter andH isin Rdtimesd is the hyperparameter representing a linear fea-ture map LH is strongly convex with modulus micro = ρ gt 0(independent on the hyperparameter H) and Lipschitzsmooth with constant νH = 2(XH)gtXH+ρI which isbounded from above if H ranges in a bounded set of squarematrices In this case assumptions (i)-(vi) are satisfied

4 Learning Hyper-RepresentationsIn this section we instantiate the bilevel programming ap-proach for ML outlined in Sec 22 in the case of deep learn-ing where representation layers are shared across episodesFinding good data representations is a centerpiece in ma-chine learning Classical approaches (Baxter 1995 Caru-ana 1998) learn both the weights of the representation map-ping and those of the ground classifiers jointly on the samedata Here we follow the bilevel approach and split eachdatasetepisode in training and validation sets

Our method involves the learning of a cross-task intermedi-

ate representation hλ X rarr Rk (parametrized by a vectorλ) on top of which task specific models gj Rk rarr Yj(parametrized by vectors wj) are trained The final groundmodel for task j is thus given by gj h To find λ we solveProblem (1)-(2) with inner and outer objectives as in Eqs(3) and (4) respectively Since in general this problemcannot be solved exactly we instantiate the approximationscheme in Eqs (5)-(6) as follows

minλfT (λ) =

Nsumj=1

Lj(wjT λDjval) (9)

wjt = wjtminus1minusηnablawLj(wjtminus1 λD

jtr) t j isin [T ] [N ] (10)

Starting from an initial value the weights of the task-specificmodels are learned by T iterations of gradient descentThe gradient of fT can be computed efficiently in timeby making use of an extended reverse-hypergradient pro-cedure (Franceschi et al 2017) which we present in Al-gorithm 1 Since in general the number of episodes in ameta-training set is large we compute a stochastic approx-imation of the gradient of fT by sampling a mini-batch ofepisodes At test time given a new episode D the represen-tation h is kept fixed and all the examples in D are used totune the weights w of the episode-specific model g

Like other initialization and optimization strategies forML our method does not require lookups in a supportset as the memorization and metric strategies do (Santoroet al 2016 Vinyals et al 2016 Mishra et al 2018) Un-like (Andrychowicz et al 2016 Ravi and Larochelle 2017)we do not tune the optimization algorithm which in our caseis plain empirical loss minimization by gradient descentand rather focus on the hypothesis space Unlike (Finnet al 2017) that aims at maximizing sensitivity of new tasklosses to the model parameters we aim at maximizing thegeneralization to novel examples during training episodeswith respect to λ Our assumptions about the structure ofthe model are slightly stronger than in (Finn et al 2017)but still mild namely that some (hyper)parameters definethe representation and the remaining parameters define theclassification function In (Munkhdalai and Yu 2017) themeta-knowledge is distributed among fast and slow weightsand an external memory our approach is more direct sincethe meta-knowledge is solely distilled by λ A further advan-tage of our method is that if the episode-specific models arelinear (eg logistic regressors) and each loss Lj is stronglyconvex in w the theoretical guarantees of Theorem 32 ap-ply (see Remark 33) These assumptions are satisfied in theexperiments reported in the next section

5 ExperimentsThe aim of the following experiments is threefold Firstwe investigate the impact of the number of iterations of theoptimization dynamics on the quality of the solution on a

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

0 1000 2000 3000 4000 5000

Hyperiterations

60

80

100

120

140

160

180Sq

uare

der

ror

Exact solutonT = 1

T = 4

T = 16

T = 64

T = 256

Figure 3 Optimization of the outer objectives f and fT for exactand approximate problems The optimization of H is performedwith gradient descent with momentum with same initializationstep size and momentum factor for each run

simple multiclass classification problem Second we testour hyper-representation method in the context of few-shotlearning on two benchmark datasets Finally we constrastthe bilevel ML approach against classical approaches tolearn shared representations 3

51 The Effect of T

Motivated by the theoretical findings of Sec 3 we empir-ically investigate how solving the inner problem approxi-mately (ie using small T ) affects convergence generaliza-tion performances and running time We focus in particularon the linear feature map described in Example 34 whichallows us to compare the approximated solution against theclosed-form analytical solution given by

wH = [(XH)TXH + ρI]minus1(XH)TY

In this setting the bilevel problem reduces to a (non-convex)optimization problem in H

We use a subset of 100 classes extracted from Omniglotdataset (Lake et al 2017) to construct a HO problem aimedat tuning H A training set Dtr and a validation set Dvaleach consisting of three randomly drawn examples per classwere sampled to form the HO problem A third set Dtestconsisting of fifteen examples per class was used for testingInstead of using raw images as input we employ featurevectors x isin R256 computed by the convolutional networktrained on one-shot five-ways ML setting as described inSec 52

For the approximate problems we compute the hypergra-dient using Algorithm 1 where it is intended that B =(Dtr Dval) Figure 3 shows the values of functions f andfT (see Eqs (1) and (5) respectively) during the optimiza-tion of H As T increases the solution of the approximate

3The code for reproducing the experiments based on thepackage FAR-HO (httpsbitlyfar-ho) is availableat httpsbitlyhyper-repr

0 1000 2000 3000 4000 5000

Hyperiterations

85

86

87

88

89

90

91

92

93

Acc

urac

y

Exact solutonT = 1

T = 4

T = 16

T = 64

T = 256

Figure 4 Accuracy on Dtest of exact and approximated solutionsduring optimization of H Training and validation accuraciesreach almost 100 already for T = 4 and after few hundredhyperiterations and therefore are not reported

Table 2 Execution times on a NVidia Tesla M40 GPU

T 1 4 16 64 256 ExactTime (sec) 60 119 356 1344 5532 320

problem approaches the true bilevel solution However per-forming a small number of gradient descent steps for solvingthe inner problem acts as implicit regularizer As it is evi-dent from Figure 4 the generalization error is better whenT is smaller than the value yielding the best approximationof the inner solution This is to be expected since in thissetting the dimensions of parameters and hyperparametersare of the same order leading to a concrete possibility ofoverfitting the outer objective (validation error) An appro-priate problem dependent choice of T may help avoidingthis issue (see also Appendix C) As T increases the numberof hyperiterations required to reach the maximum test accu-racy decreases further suggesting that there is an interplaybetween the number of iterations used to solve the inner andthe outer objective Finally the running time of Algorithm1 is linear in T and the size of w and independent of thesize of H (see also Table 2) making it even more appealingto reduce the number of iterations

52 Few-shot Learning

We new turn our attention to learning-to-learn precisely tofew-shot supervised learning implementing the ML strategyoutlined in Sec 4 on two different benchmark datasets

bull OMNIGLOT (Lake et al 2015) a dataset that containsexamples of 1623 different handwritten characters from 50alphabets We downsample the images to 28times 28

bull MINIIMAGENET (Vinyals et al 2016) a subset of Ima-geNet (Deng et al 2009) that contains 60000 downsampledimages from 100 different classes

Following the experimental protocol used in a number of re-cent works we build a meta-training set D from which wesample datasets to solve Problem (9)-(10) a meta-validation

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

set V for tuning ML hyperparameters and finally a meta-test set T which is used to estimate accuracy Operationallyeach meta-dataset consists of a pool of samples belonging todifferent (non-overlapping between separate meta-dataset)classes which can be combined to form ground classifica-tion datasets Dj = Dj

tr cup Djval with 5 or 20 classes (for

Omniglot) The Djtrrsquos contain 1 or 5 examples per class

which are used to fit wj (see Eq 10) The Djvalrsquos contain-

ing 15 examples per class is used either to compute fT (λ)(see Eq (9)) and its (stochastic) gradient if Dj isin D or toprovide a generalization score if Dj comes from either V orT For MiniImagenet we use the same split and images pro-posed in (Ravi and Larochelle 2017) while for Omniglotwe use the protocol defined by (Santoro et al 2016)

As ground classifiers we use multinomial logistic regressorsand as task losses `j we employ cross-entropy The innerproblems being strongly convex admit unique minimizersyet require numerical computation of the solutions We ini-tialize ground models parameters wj to 0 and according tothe observation in Sec 51 we perform T gradient descentsteps where T is treated as a ML hyperparameter that has tobe validated Figure 6 shows an example of meta-validationof T for one-shot learning on MiniImagenet We compute astochastic approximation ofnablafT (λ) with Algorithm 1 anduse Adam with decaying learning rate to optimize λ

Regarding the specific implementation of the representationmapping h we employ for Omniglot a four-layers convo-lutional neural network with strided convolutions and 64filters per layer as in (Vinyals et al 2016) and other suc-cessive works For MiniImagenet we tried two differentarchitectures

bull C4L a four-layers convolutional neural network with max-pooling and 32 filters per layer

bull RN a residual network (He et al 2016) built of fourresidual blocks followed by two convolutional layers

The first network architecture has been proposed in (Raviand Larochelle 2017) and then used in (Finn et al 2017)while a similar residual network architecture has been em-ployed in a more recent work (Mishra et al 2018) Furtherdetails on the architectures of h as well as other ML hyper-parameters are specified in the supplementary material Wereport our results using RN for MiniImagenet in Table 3alongside scores from various recently proposed methodsfor comparison

The proposed method achieves competitive results highlight-ing the relative importance of learning a task independentrepresentation on the top of which logistic classifiers trainedwith very few samples generalize well Moreover utilizingmore expressive models such as residual network as repre-sentation mappings is beneficial for our proposed strategyand unlike other methods does not result in overfitting

of the outer objective as reported in (Mishra et al 2018)Indeed compared to C4L RN achieves a relative improve-ment of 65 on one-shot and 42 on five-shot Figure 5provides a visual example of the goodness of the learned rep-resentation showing that MiniImagenet examples (the firstfrom meta-training the second from the meta-testing sets)from similar classes (different dog breeds) are mapped neareach other by h and conversely samples from dissimilarclasses are mapped afar

Figure 5 After sampling two datasets D isin D and Dprime isin T weshow on the top the two images x isin D xprime isin Dprime that minimize||hλ(x)minus hλ(xprime)|| and on the bottom those that maximize it Inbetween each of the two couples we compare a random subset ofcomponents of hλ(x) (blue) and hλ(xprime) (green)

3 5 8 1266

68

70

72

74

76

78Accuracy on Dtr

3 5 8 12

49

50

51

52

53

Accuracy on DtestAccuracy on Dval

Figure 6 Meta-validation of the number of gradient descent steps(T ) of the ground models for MiniImagenet using the RN repre-sentation Early stopping on the accuracy on meta-validation setduring meta-training resulted in halting the optimization of λ after42k 40k 22k and 15k hyperiterations for T equal to 3 5 8 and12 respectively in line with our observation in Sec 51

53 On Variants of Representation Learning Methods

In this section we show the benefits of learning a represen-tation within the proposed bilevel framework compared toother possible approaches that involve an explicit factoriza-tion of a classifier as gj h The representation mapping his either pretrained or learned with different meta-learningalgorithms We focus on the problem of one-shot learningon MiniImagenet and we use C4L as architecture for therepresentation mapping In all the experiments the groundmodels gj are multinomial logistic regressor as in Sec 52tuned with 5 steps of gradient descent We ran the followingexperiments

bull Multiclass the mapping h X rarr R64 is given by the

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Table 3 Accuracy scores computed on episodes from T of various methods on 1-shot and 5-shot classification problems on Omniglotand MiniImagenet For MiniImagenet 95 confidence intervals are reported For Hyper-representation the scores are computed over 600randomly drawn episodes For other methods we show results as reported by their respective authors

OMNIGLOT 5 classes OMNIGLOT 20 classes MINIIMAGENET 5 classesMethod 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot

Siamese nets (Koch et al 2015) 973 984 882 970 minus minusMatching nets (Vinyals et al 2016) 981 989 938 985 4344plusmn 077 5531plusmn 073Neural stat (Edwards and Storkey 2016) 981 995 932 981 minus minusMemory mod (Kaiser et al 2017) 984 996 950 986 minus minusMeta-LSTM (Ravi and Larochelle 2017) minus minus minus minus 4356plusmn 084 6060plusmn 071MAML (Finn et al 2017) 987 999 958 989 4870plusmn 175 6311plusmn 092Meta-networks (Munkhdalai and Yu 2017) 989 minus 970 minus 4921plusmn 096 minusPrototypical Net (Snell et al 2017) 988 997 960 989 4942plusmn 078 6820plusmn 066SNAIL (Mishra et al 2018) 991 998 976 994 5571plusmn 099 6888plusmn 092Hyper-representation 986 995 955 984 5054plusmn 085 6453plusmn 068

linear outputs before the softmax operation of a network4

pretrained on the totality of examples contained in the train-ing meta-dataset (600 examples for each of the 64 classes)In this setting we found that using the second last layer orthe output after the softmax yields worst results

bull Bilevel-train we use a bilevel approach but unlike in Sec4 we optimize the parameter vector λ of the representationmapping by minimizing the loss on the training sets of eachepisode The hypergradient is still computed with Algorithm1 albeit we set Dj

val = Djtr for each training episodes

bull Approx and Approx-train we consider an approxima-tion of the hypergradient nablafT (λ) by disregarding the op-timization dynamics of the inner objectives (ie we setnablaλwjT = 0) In Approx-train we just use the training sets

bull Classic as in (Baxter 1995) we learn h by jointly opti-mize f(λw1 wN ) =

sumNj=1 L

j(wj λDjtr) and treat

the problem as standard multitask learning with the ex-ception that we evaluate f on mini-batches of 4 episodesrandomly sampled every 5 gradient descent iterations

In settings where we do not use the validation sets we letthe training sets of each episode contain 16 examples perclass Using training episodes with just one example perclass resulted in performances just above random chanceWhile the first experiment constitutes a standard baselinethe others have the specific aim of assessing (i) the impor-tance of splitting episodes of meta-training set into trainingand validation and (ii) the importance of computing thehypergradient of the approximate bilevel problem with Al-gorithm 1 The results reported in Table 4 suggest thatboth the trainingvalidation splitting and the full computa-tion of the hypergradient constitute key factors for learninga good representation in a meta-learning context On theother side using pretrained representations especially ina low-dimensional space turns out to be a rather effectivebaseline One possible explanation is that in this context

4The network is similar to C4L but has 64 filters per layer

Table 4 Performance of various methods where the representationis either transfered or learned with variants of hyper-representationmethods The last raw reports for comparison the score obtainedwith hyper-representation

Method filters Accuracy 1-shot

Multiclass 64 4302Bilevel-train 32 2963Approx 32 4112Approx-train 32 3880Classic-train 32 4046Hyper-representation-C4L 32 4751

some classes in the training and testing meta-datasets arerather similar (eg various dog breeds) and thus groundclassifiers can leverage on very specific representations

6 ConclusionsWe have shown that both HO and ML can be formulated interms of bilevel programming and solved with an iterativeapproach When the inner problem has a unique solution(eg is strongly convex) our theoretical results show that theiterative approach has convergence guarantees a result thatis interesting in its own right In the case of ML by adaptingclassical strategies (Baxter 1995) to the bilevel frameworkwith trainingvalidation splitting we present a method forlearning hyper-representations which is experimentally ef-fective and supported by our theoretical guarantees

Our framework encompasses recently proposed methodsfor meta-learning such as learning to optimize but alsosuggests different design patterns for the inner learningalgorithm which could be interesting to explore in futurework The resulting inner problems may not satisfy theassumptions of our convergence analysis raising the needfor further theoretical investigations An additional futuredirection of research is the study of the statistical propertiesof bilevel strategies where outer objectives are based on thegeneralization ability of the inner model to new (validation)data Ideas from (Maurer et al 2016 Denevi et al 2018)may be useful in this direction

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

ReferencesAndrychowicz M Denil M Gomez S Hoffman M W

Pfau D Schaul T and de Freitas N (2016) Learningto learn by gradient descent by gradient descent In Ad-vances in Neural Information Processing Systems (NIPS)pages 3981ndash3989

Bard J F (2013) Practical bilevel optimization algo-rithms and applications volume 30 Springer Science ampBusiness Media 01251

Baxter J (1995) Learning internal representations In Pro-ceedings of the 8th Annual Conference on ComputationalLearning Theory (COLT) pages 311ndash320 ACM

Baydin A G Pearlmutter B A Radul A A and SiskindJ M (2017) Automatic differentiation in machine learn-ing a survey Journal of Machine Learning Research181531ndash15343

Beirami A Razaviyayn M Shahrampour S and TarokhV (2017) On optimal generalizability in parametriclearning In Advances in Neural Information ProcessingSystems (NIPS) pages 3458ndash3468

Bergstra J and Bengio Y (2012) Random search for hyper-parameter optimization Journal of Machine LearningResearch 13(Feb)281ndash305

Bergstra J Yamins D and Cox D D (2013) Makinga science of model search Hyperparameter optimiza-tion in hundreds of dimensions for vision architecturesIn Proceedings of the 30th International Conference onMachine Learning (ICML) pages 115ndash123

Caruana R (1998) Multitask learning In Learning tolearn pages 95ndash133 Springer 02683

Colson B Marcotte P and Savard G (2007) Anoverview of bilevel optimization Annals of operationsresearch 153(1)235ndash256

Denevi G Ciliberto C Stamos D and Pontil M (2018)Incremental learning-to-learn with statistical guaranteesarXiv preprint arXiv180308089 To appear in UAI 2018

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL (2009) Imagenet A large-scale hierarchical imagedatabase In Computer Vision and Pattern Recognition(CVPR) pages 248ndash255

Domke J (2012) Generic Methods for Optimization-BasedModeling In AISTATS volume 22 pages 318ndash326

Dontchev A L and Zolezzi T (1993) Well-posed op-timization problems volume 1543 of Lecture Notes inMathematics Springer-Verlag Berlin

Edwards H and Storkey A (2016) Towards a NeuralStatistician arXiv160602185 [cs stat] 00027 arXiv160602185

Evgeniou T Micchelli C A and Pontil M (2005) Learn-ing multiple tasks with kernel methods Journal of Ma-chine Learning Research 6(Apr)615ndash637

Finn C Abbeel P and Levine S (2017) Model-agnosticmeta-learning for fast adaptation of deep networks InProceedings of the 34th International Conference on Ma-chine Learning (ICML) pages 1126ndash1135

Flamary R Rakotomamonjy A and Gasso G(2014) Learning constrained task similarities in graph-regularized multi-task learning Regularization Opti-mization Kernels and Support Vector Machines page103

Franceschi L Donini M Frasconi P and Pontil M(2017) Forward and reverse gradient-based hyperparam-eter optimization In Proceedings of the 34th Interna-tional Conference on Machine Learning (ICML) pages1165ndash1173

Glorot X and Bengio Y (2010) Understanding the diffi-culty of training deep feedforward neural networks InProceedings of the 13th International Conference on Arti-ficial Intelligence and Statistics (AIStat) pages 249ndash256

Griewank A and Walther A (2008) Evaluating deriva-tives principles and techniques of algorithmic differenti-ation SIAM

He K Zhang X Ren S and Sun J (2016) Deep residuallearning for image recognition In Computer Vision andPattern Recognition (CVPR) pages 770ndash778

Hutter F Lcke J and Schmidt-Thieme L (2015) BeyondManual Tuning of Hyperparameters KI - KunstlicheIntelligenz 29(4)329ndash337

Kaiser L Nachum O Roy A and Bengio S (2017)Learning to remember rare events

Keerthi S S Sindhwani V and Chapelle O (2007) Anefficient method for gradient-based adaptation of hyper-parameters in svm models In Advances in Neural Infor-mation Processing Systems (NIPS) pages 673ndash680

Khosla A Zhou T Malisiewicz T Efros A A andTorralba A (2012) Undoing the damage of dataset biasIn European Conference on Computer Vision pages 158ndash171 Springer

Koch G Zemel R and Salakhutdinov R (2015) Siameseneural networks for one-shot image recognition In ICMLDeep Learning Workshop volume 2 00127

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Koh P W and Liang P (2017) Understanding black-boxpredictions via influence functions In Proceedings ofthe 34th International Conference on Machine Learning(ICML) pages 1885ndash1894

Kunapuli G Bennett K Hu J and Pang J-S (2008)Classification model selection via bilevel programmingOptimization Methods and Software 23(4)475ndash489

Kuzborskij I Orabona F and Caputo B (2013) Fromn to n+ 1 Multiclass transfer incremental learning InComputer Vision and Pattern Recognition (CVPR) 2013IEEE Conference on pages 3358ndash3365

Lake B M Salakhutdinov R and Tenenbaum J B(2015) Human-level concept learning through probabilis-tic program induction Science 350(6266)1332ndash1338

Lake B M Ullman T D Tenenbaum J B and Gersh-man S J (2017) Building machines that learn and thinklike people Behavioral and Brain Sciences 40 00152

Maclaurin D Duvenaud D K and Adams R P (2015)Gradient-based hyperparameter optimization through re-versible learning In Proceedings of the 32nd Interna-tional Conference on Machine Learning (ICML pages2113ndash2122

Maurer A Pontil M and Romera-Paredes B (2016) Thebenefit of multitask representation learning The Journalof Machine Learning Research 17(1)2853ndash2884

Mishra N Rohaninejad M Chen X and Abbeel P(2018) A simple neural attentive meta-learner

Munkhdalai T and Yu H (2017) Meta networks InProceedings of the 34th International Conference on Ma-chine Learning (ICML) pages 2554ndash2563

Pedregosa F (2016) Hyperparameter optimization withapproximate gradient In Proceedings of The 33rd Inter-national Conference on Machine Learning (ICML) pages737ndash746

Ravi S and Larochelle H (2017) Optimization as a modelfor few-shot learning In In International Conference onLearning Representations (ICLR)

Santoro A Bartunov S Botvinick M Wierstra Dand Lillicrap T (2016) Meta-learning with memory-augmented neural networks In Proceedings ofthe 33rdInternational Conference on Machine Learning pages1842ndash1850

Schmidhuber J (1992) Learning to Control Fast-WeightMemories An Alternative to Dynamic Recurrent Net-works Neural Computation 4(1)131ndash139 00082

Snell J Swersky K and Zemel R S (2017) Prototypicalnetworks for few-shot learning In Advances in neuralinformation processing systems pages 4080ndash4090

Snoek J Larochelle H and Adams R P (2012) Practicalbayesian optimization of machine learning algorithmsIn Advances in Neural Information Processing Systems(NIPS) pages 2951ndash2959

Thrun S and Pratt L (1998) Learning to learn Springer

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser L and Polosukhin I (2017)Attention is all you need In Advances in Neural Informa-tion Processing Systems (NIPS) pages 6000ndash6010

Vinyals O Blundell C Lillicrap T Kavukcuoglu Kand Wierstra D (2016) Matching networks for one shotlearning In Advances in Neural Information ProcessingSystems (NIPS) pages 3630ndash3638

Wichrowska O Maheswaranathan N Hoffman M WColmenarejo S G Denil M Freitas N and Sohl-Dickstein J (2017) Learned optimizers that scale andgeneralize In Proceedings of the 34th International Con-ference on Machine Learning (ICML) pages 3751ndash3760

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

A Proofs of the Results in Sec 3Proof of Theorem 31 Since Λ is compact it follows fromWeierstrass theorem that a sufficient condition for the exis-tence of minimizers is that f is continuous Thus let λ isin Λand let (λn)nisinN be a sequence in Λ such that λn rarr λWe prove that f(λn) = E(wλn

λn)rarr E(wλ λ) = f(λ)Since (wλn

)nisinN is bounded there exists a subsequence(wkn)nisinN such that wkn rarr w for some w isin Rd Nowsince λkn rarr λ and the map (w λ) 7rarr Lλ(w) is jointlycontinuous we have

forallw isin Rd Lλ(w) = limnLλkn

(wkn)

le limnLλkn

(w) = Lλ(w)

Therefore w is a minimizer of Lλ and hence w = wλThis prove that (wλn)nisinN is a bounded sequence havinga unique cluster point Hence (wλn)nisinN is convergenceto its unique cluster point which is wλ Finally since(wλn

λn)rarr (wλ λ) and E is jointly continuous we haveE(wλn λn)rarr E(wλ λ) and the statement follows

We recal a fundamental fact concerning the stability of min-ima and minimizers in optimization problems (Dontchevand Zolezzi 1993) We provide the proof for completeness

Theorem A1 (Convergence) Let ϕT and ϕ be lower semi-continuous functions defined on a compact set Λ Supposethat ϕT converges uniformly to ϕ on Λ as T rarr +infin Then

(a) inf ϕT rarr inf ϕ

(b) argminϕT rarr argminϕ meaning that for every(λT )TisinN such that λT isin argminϕT we have that

- (λT )TisinN admits a convergent subsequence- for every subsequence (λKT

)TisinN such thatλKT

rarr λ we have λ isin argminϕ

Proof Let (λT )TisinN be a sequence in Λ such that for everyT isin N λT isin argminϕT We prove that

1) (λT )TisinN admits a convergent subsequence

2) for every subsequence (λKT)TisinN such that λKT

rarr λwe have λ isin argminϕ and ϕKT

(λKT)rarr inf ϕ

3) inf ϕT rarr inf ϕ

The first point follows from the fact that Λ is compactConcerning the second point let (λKT

)TisinN be a subse-quence such that λKT

rarr λ Since ϕKTconverge uniformly

to ϕ we have

|ϕKT(λKT

)minus ϕ(λKT)| le sup

λisinΛ|ϕKT

(λ)minus ϕ(λ)| rarr 0

Therefore using also the continuity of ϕ we have

forallλ isin Λ ϕ(λ) = limTϕ(λKT

) = limTϕKT

(λKT)

le limTϕKT

(λ) = ϕ(λ)

So λ isin argminϕ andϕ(λ) = limT ϕKT(λKT

) le inf ϕ =ϕ(λ) that is limT ϕKT

(λKT) = inf ϕ

Finally as regards the last point we proceed by contradic-tion If (ϕT (λT ))TisinN does not convergce to inf f thenthere exists an ε gt 0 and a subsequence (ϕKT

(λKT))TisinN

such that

|ϕKT(λKT

)minus inf ϕ| ge ε forallT isin N (11)

Now let (λK

(1)T

) be a convergent subsequence of (λKT)TisinN

Suppose that λK

(1)T

rarr λ Clearly (λK

(1)T

) is also a subse-quence of (λT )TisinN Then it follows from point 2) abovethat ϕ

K(1)T

(λK

(1)T

) rarr inf ϕ This latter finding togetherwith equation (11) gives a contradiction

Proof of Theorem 32 Since E(middot λ) is uniformly Lipschitzcontinuous there exists ν gt 0 such that for every T isin Nand every λ isin Λ

|fT (λ)minus f(λ)| = |E(wTλ λ)minus E(wλ λ)|le νwTλ minus wλ

It follows from assumption (vi) that fT (λ) converges tof(λ) uniformly on Λ as T rarr +infin Then the statementfollows from Theorem A1

B Cross-validation and Bilevel ProgrammingWe note that the (approximate) bilevel programming frame-work easily accommodates also estimations of the gener-alization error generated by a cross-validation proceduresWe describe here the case of K-fold cross-validation whichincludes also leave-one-out cross validation

Let D = (xi yi)ni=1 be the set of available data K-foldcross validation with K isin 1 2 N consists in parti-tioning D in K subsets DjKj=1 and fit as many modelsgwj on training data Dj

tr =⋃i6=j D

i The models are thenevaluated on Dj

val = Dj Denoting by w = (wj)Kj=1 thevector of stacked weights the K-fold cross validation erroris given by

E(w λ) =1

K

Ksumj=1

Ej(wj λ)

where Ej(wj λ) =sum

(xy)isinDj `(gwj (x) y) E can betreated as the outer objective in the bilevel framework whilethe inner objective Lλ may be given by the sum of regu-larized empirical errors over each Dj

tr for the K models

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Under this perspective a K-fold cross-validation procedureclosely resemble the bilevel problem for ML formulatedin Sec 22 where in this case the meta-distribution col-lapses on the data (ground) distribution and the episodes aresampled from the same dataset of points

By following the procedure outlined in Sec 23 we canapproximate the minimization of Lλ with T steps of anoptimization dynamics and compute the hypergradient offT (λ) = 1

K

sumj E

j(wjT λ) by training the K models andproceed with either forward or reverse differentiation Themodels may be fitted sequentially in parallel or stochas-tically Specifically in this last case one can repeatedlysample one fold at a time (or possibly a mini-batch of folds)and compute a stochastic hypergradient that can be used ina SGD procedure in order to minimize fT At last we notethat similar ideas for leave-one out cross-validation error aredeveloped in (Beirami et al 2017) where the hypergradientof an approximation of the outer objective is computed bymeans of the implicit function theorem

C The Effect of T Ridge RegressionIn Sec 51 we showed that increasing the number of itera-tions T leads to a better optimization of the outer objective fthrough the approximations fT which converge uniformlyto f by Proposition 32 This however does not necessaryresults in better test scores due to possible overfitting ofthe training and validation errors as we noted for the linearhyper-representation multiclass classification experiment inSec 51 The optimal choice of T appears to be indeedproblem dependent if in the aforementioned experiment aquite small T led to the best test accuracy (after hyperparam-eter optimization) in this section we present a small scalelinear regression experiment that benefits from an increasingnumber of inner iterations

We generated 90 noisy synthetic data points with 30 featuresof which only 5 were informative and divided them equallyto form training validation and test sets As outer objectivewe set the mean squared error E(w) = ||Xvalw minus yval||2where Xval and yval are the validation design matrix andtargets respectively and w isin R30 is the weight vector of themodel We set as inner inner objective

Lλ(w) = ||Xtrw minus ytr||2 +

30sumi=1

eλiw2i

and optimized the L2 vector of regularization coefficients λ(equivalent to a diagonal Tikhonov regularization matrix)The results reported in Table 5 show that in this scenariooverfitting is not an issue and 250 inner iterations yield thebest test result Note that as in Sec 51 also this problemadmits an analytical solution from which it is possible tocompute the hypergradient analytically and perform exact

hyperparameter optimization as reported in the last row ofthe Table 5

Table 5 Validation and test mean absolute percentage error(MAPE) for various values of T

T Validation MAPE Test MAPE

10 1135 434950 128 522100 055 126250 047 050Exact 037 057

D Further Details on Few-shot ExperimentsThis appendix contains implementation details of the repre-sentation mapping h and the meta-learning hyperparametersused in the few-shot learning experiments of Sec 52

To optimize the representation mapping in all the exper-iments we use Adam with learning rate set to 10minus3 anda decay-rate of 10minus5 We used the initialization strategyproposed in (Glorot and Bengio 2010) for all the weights λof the representation

For Omniglot experiments we used a meta-batch size of 32episodes for five-way and of 16 episodes for 20-way Totrain the episode-specific classifiers we set the learning rateto 01

For one set of experiments with Mini-imagenet we used anhyper-representation (C4L) consisting of 4 convolutionallayers where each layer is composed by a convolution with32 filters a batch normalization followed by a ReLU acti-vation and a 2x2 max-pooling The classifiers were trainedusing mini-batches of 4 episodes for one-shot and 2 episodesfor five-shot with learning rate set to 001

The other set of experiments with Mini-imagenet employeda Residual Network (RN) as the representation mappingbuilt of 4 residual blocks (with 64 96 128 256 filters) andthen the block that follows 1times 1 conv (2048 filters) avgpooling 1 times 1 conv (512 filters) Each residual blockrepeats the following block 3 times 1 times 1 conv batchnormalization leaky ReLU (leak 01) before the residualconnection In this case the classifiers were optimized usingmini-batches of 2 episodes for both one and five-shot withlearning rate set to 004

Page 6: Bilevel Programming for Hyperparameter Optimization and ...

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

0 1000 2000 3000 4000 5000

Hyperiterations

60

80

100

120

140

160

180Sq

uare

der

ror

Exact solutonT = 1

T = 4

T = 16

T = 64

T = 256

Figure 3 Optimization of the outer objectives f and fT for exactand approximate problems The optimization of H is performedwith gradient descent with momentum with same initializationstep size and momentum factor for each run

simple multiclass classification problem Second we testour hyper-representation method in the context of few-shotlearning on two benchmark datasets Finally we constrastthe bilevel ML approach against classical approaches tolearn shared representations 3

51 The Effect of T

Motivated by the theoretical findings of Sec 3 we empir-ically investigate how solving the inner problem approxi-mately (ie using small T ) affects convergence generaliza-tion performances and running time We focus in particularon the linear feature map described in Example 34 whichallows us to compare the approximated solution against theclosed-form analytical solution given by

wH = [(XH)TXH + ρI]minus1(XH)TY

In this setting the bilevel problem reduces to a (non-convex)optimization problem in H

We use a subset of 100 classes extracted from Omniglotdataset (Lake et al 2017) to construct a HO problem aimedat tuning H A training set Dtr and a validation set Dvaleach consisting of three randomly drawn examples per classwere sampled to form the HO problem A third set Dtestconsisting of fifteen examples per class was used for testingInstead of using raw images as input we employ featurevectors x isin R256 computed by the convolutional networktrained on one-shot five-ways ML setting as described inSec 52

For the approximate problems, we compute the hypergradient using Algorithm 1, where it is intended that B = (D_tr, D_val). Figure 3 shows the values of the functions f and f_T (see Eqs. (1) and (5), respectively) during the optimization of H.

3 The code for reproducing the experiments, based on the package FAR-HO (https://bit.ly/far-ho), is available at https://bit.ly/hyper-repr.

[Figure 4 plot omitted: accuracy on D_test vs. number of hyperiterations, for the exact solution and for T ∈ {1, 4, 16, 64, 256}.]

Figure 4. Accuracy on D_test of exact and approximated solutions during optimization of H. Training and validation accuracies reach almost 100% already for T = 4 and after a few hundred hyperiterations, and are therefore not reported.

Table 2. Execution times on an NVidia Tesla M40 GPU.

T           1     4     16    64     256    Exact
Time (sec)  60    119   356   1344   5532   320

As T increases, the solution of the approximate problem approaches the true bilevel solution. However, performing a small number of gradient descent steps for solving the inner problem acts as an implicit regularizer. As is evident from Figure 4, the generalization error is better when T is smaller than the value yielding the best approximation of the inner solution. This is to be expected since, in this setting, the dimensions of parameters and hyperparameters are of the same order, leading to a concrete possibility of overfitting the outer objective (validation error). An appropriate, problem-dependent choice of T may help avoid this issue (see also Appendix C). As T increases, the number of hyperiterations required to reach the maximum test accuracy decreases, further suggesting that there is an interplay between the number of iterations used to solve the inner and the outer objective. Finally, the running time of Algorithm 1 is linear in T and the size of w, and independent of the size of H (see also Table 2), making it even more appealing to reduce the number of iterations.

5.2. Few-shot Learning

We now turn our attention to learning-to-learn, precisely to few-shot supervised learning, implementing the ML strategy outlined in Sec. 4 on two different benchmark datasets:

• OMNIGLOT (Lake et al., 2015): a dataset that contains examples of 1623 different handwritten characters from 50 alphabets. We downsample the images to 28 × 28.

• MINIIMAGENET (Vinyals et al., 2016): a subset of ImageNet (Deng et al., 2009) that contains 60,000 downsampled images from 100 different classes.

Following the experimental protocol used in a number of recent works, we build a meta-training set D, from which we sample datasets to solve Problem (9)-(10), a meta-validation set V for tuning ML hyperparameters, and finally a meta-test set T which is used to estimate accuracy. Operationally, each meta-dataset consists of a pool of samples belonging to different (non-overlapping between separate meta-datasets) classes, which can be combined to form ground classification datasets D^j = D^j_tr ∪ D^j_val with 5 or 20 classes (for Omniglot). The D^j_tr's contain 1 or 5 examples per class, which are used to fit w^j (see Eq. (10)). The D^j_val's, containing 15 examples per class, are used either to compute f_T(λ) (see Eq. (9)) and its (stochastic) gradient if D^j ∈ D, or to provide a generalization score if D^j comes from either V or T. For MiniImagenet we use the same split and images proposed in (Ravi and Larochelle, 2017), while for Omniglot we use the protocol defined by (Santoro et al., 2016).

As ground classifiers we use multinomial logistic regressors, and as task losses ℓ^j we employ cross-entropy. The inner problems, being strongly convex, admit unique minimizers, yet require numerical computation of the solutions. We initialize the ground models' parameters w^j to 0 and, according to the observation in Sec. 5.1, we perform T gradient descent steps, where T is treated as an ML hyperparameter that has to be validated. Figure 6 shows an example of meta-validation of T for one-shot learning on MiniImagenet. We compute a stochastic approximation of ∇f_T(λ) with Algorithm 1 and use Adam with decaying learning rate to optimize λ.
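To make the episodic computation concrete, here is a minimal JAX sketch of how a single meta-training episode contributes to the stochastic hypergradient; the representation function h(λ, ·), the episode arrays and the step sizes are placeholders, and this is not the released FAR-HO implementation.

```python
import jax
import jax.numpy as jnp

T, inner_lr = 5, 0.1   # assumed values; T is meta-validated as described above

def cross_entropy(logits, y):
    # y: integer class labels
    return -jnp.mean(jax.nn.log_softmax(logits)[jnp.arange(y.shape[0]), y])

def episode_outer_loss(lam, h, episode):
    """Contribution of one episode to f_T: fit w^j with T gradient descent steps
    on D^j_tr, then evaluate the cross-entropy on D^j_val (differentiable in lam)."""
    X_tr, y_tr, X_val, y_val, n_classes = episode
    feats_tr, feats_val = h(lam, X_tr), h(lam, X_val)

    def inner_loss(w):
        return cross_entropy(feats_tr @ w, y_tr)

    w = jnp.zeros((feats_tr.shape[1], n_classes))   # w^j initialized to 0
    for _ in range(T):                              # T inner gradient descent steps
        w = w - inner_lr * jax.grad(inner_loss)(w)
    return cross_entropy(feats_val @ w, y_val)

def meta_batch_hypergrad(lam, h, episodes):
    """Stochastic hypergradient over a meta-batch of episodes (reverse mode)."""
    grads = [jax.grad(episode_outer_loss)(lam, h, ep) for ep in episodes]
    return jax.tree_util.tree_map(lambda *g: sum(g) / len(g), *grads)
```

An Adam step on λ with this averaged gradient would then complete one hyperiteration.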

Regarding the specific implementation of the representation mapping h, for Omniglot we employ a four-layer convolutional neural network with strided convolutions and 64 filters per layer, as in (Vinyals et al., 2016) and other subsequent works. For MiniImagenet we tried two different architectures:

• C4L: a four-layer convolutional neural network with max-pooling and 32 filters per layer;

• RN: a residual network (He et al., 2016) built of four residual blocks followed by two convolutional layers.

The first network architecture was proposed in (Ravi and Larochelle, 2017) and then used in (Finn et al., 2017), while a similar residual network architecture was employed in a more recent work (Mishra et al., 2018). Further details on the architectures of h, as well as other ML hyperparameters, are specified in the supplementary material. We report our results using RN for MiniImagenet in Table 3, alongside scores from various recently proposed methods for comparison.

The proposed method achieves competitive results, highlighting the relative importance of learning a task-independent representation, on top of which logistic classifiers trained with very few samples generalize well. Moreover, utilizing more expressive models such as residual networks as representation mappings is beneficial for our proposed strategy and, unlike other methods, does not result in overfitting of the outer objective, as reported in (Mishra et al., 2018). Indeed, compared to C4L, RN achieves a relative improvement of 6.5% on one-shot and 4.2% on five-shot. Figure 5 provides a visual example of the goodness of the learned representation, showing that MiniImagenet examples (the first from the meta-training, the second from the meta-testing sets) from similar classes (different dog breeds) are mapped near each other by h, and conversely samples from dissimilar classes are mapped afar.

Figure 5. After sampling two datasets D ∈ D and D′ ∈ T, we show on the top the two images x ∈ D, x′ ∈ D′ that minimize ||h_λ(x) − h_λ(x′)|| and on the bottom those that maximize it. In between each of the two couples we compare a random subset of components of h_λ(x) (blue) and h_λ(x′) (green).

[Figure 6 plots omitted: accuracy on D_tr (left) and on D_val and D_test (right) as a function of T ∈ {3, 5, 8, 12}.]

Figure 6. Meta-validation of the number of gradient descent steps (T) of the ground models for MiniImagenet, using the RN representation. Early stopping on the accuracy on the meta-validation set during meta-training resulted in halting the optimization of λ after 42k, 40k, 22k and 15k hyperiterations for T equal to 3, 5, 8 and 12, respectively, in line with our observation in Sec. 5.1.

5.3. On Variants of Representation Learning Methods

In this section we show the benefits of learning a representation within the proposed bilevel framework, compared to other possible approaches that involve an explicit factorization of a classifier as g^j ∘ h. The representation mapping h is either pretrained or learned with different meta-learning algorithms. We focus on the problem of one-shot learning on MiniImagenet, and we use C4L as the architecture for the representation mapping. In all the experiments the ground models g^j are multinomial logistic regressors, as in Sec. 5.2, tuned with 5 steps of gradient descent. We ran the following experiments:

Table 3. Accuracy scores, computed on episodes from T, of various methods on 1-shot and 5-shot classification problems on Omniglot and MiniImagenet. For MiniImagenet, 95% confidence intervals are reported. For Hyper-representation, the scores are computed over 600 randomly drawn episodes. For other methods we show results as reported by their respective authors.

                                          OMNIGLOT 5 classes   OMNIGLOT 20 classes   MINIIMAGENET 5 classes
Method                                    1-shot   5-shot      1-shot   5-shot       1-shot          5-shot

Siamese nets (Koch et al., 2015)          97.3     98.4        88.2     97.0         −               −
Matching nets (Vinyals et al., 2016)      98.1     98.9        93.8     98.5         43.44 ± 0.77    55.31 ± 0.73
Neural stat. (Edwards and Storkey, 2016)  98.1     99.5        93.2     98.1         −               −
Memory mod. (Kaiser et al., 2017)         98.4     99.6        95.0     98.6         −               −
Meta-LSTM (Ravi and Larochelle, 2017)     −        −           −        −            43.56 ± 0.84    60.60 ± 0.71
MAML (Finn et al., 2017)                  98.7     99.9        95.8     98.9         48.70 ± 1.75    63.11 ± 0.92
Meta-networks (Munkhdalai and Yu, 2017)   98.9     −           97.0     −            49.21 ± 0.96    −
Prototypical Net (Snell et al., 2017)     98.8     99.7        96.0     98.9         49.42 ± 0.78    68.20 ± 0.66
SNAIL (Mishra et al., 2018)               99.1     99.8        97.6     99.4         55.71 ± 0.99    68.88 ± 0.92
Hyper-representation                      98.6     99.5        95.5     98.4         50.54 ± 0.85    64.53 ± 0.68

• Multiclass: the mapping h : X → R^64 is given by the linear outputs before the softmax operation of a network4 pretrained on the totality of examples contained in the training meta-dataset (600 examples for each of the 64 classes). In this setting, we found that using the second-to-last layer or the output after the softmax yields worse results.

• Bilevel-train: we use a bilevel approach but, unlike in Sec. 4, we optimize the parameter vector λ of the representation mapping by minimizing the loss on the training sets of each episode. The hypergradient is still computed with Algorithm 1, albeit we set D^j_val = D^j_tr for each training episode.

• Approx and Approx-train: we consider an approximation of the hypergradient ∇f_T(λ) obtained by disregarding the optimization dynamics of the inner objectives (i.e. we set ∇_λ w^j_T = 0; see the sketch after this list). In Approx-train we just use the training sets.

• Classic: as in (Baxter, 1995), we learn h by jointly optimizing f(λ, w^1, …, w^N) = Σ_{j=1}^N L^j(w^j, λ, D^j_tr) and treat the problem as standard multitask learning, with the exception that we evaluate f on mini-batches of 4 episodes, randomly sampled every 5 gradient descent iterations.
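The Approx variant referenced above can be sketched by reusing the structure of the episode objective from the Sec. 5.2 sketch and simply blocking the gradient path through the inner dynamics (again an illustration with hypothetical h and episode arrays, not the experimental code):

```python
import jax
import jax.numpy as jnp

def cross_entropy(logits, y):
    return -jnp.mean(jax.nn.log_softmax(logits)[jnp.arange(y.shape[0]), y])

def episode_outer_loss_approx(lam, h, episode, T=5, inner_lr=0.1):
    """Approx variant: w^j_T is treated as a constant w.r.t. lam (i.e. grad_lam w^j_T = 0),
    so the hypergradient only captures the direct dependence of the validation loss on h."""
    X_tr, y_tr, X_val, y_val, n_classes = episode
    feats_tr = h(lam, X_tr)

    def inner_loss(w):
        return cross_entropy(feats_tr @ w, y_tr)

    w = jnp.zeros((feats_tr.shape[1], n_classes))
    for _ in range(T):
        w = w - inner_lr * jax.grad(inner_loss)(w)
    w = jax.lax.stop_gradient(w)          # drop the dependence of w^j_T on lam
    return cross_entropy(h(lam, X_val) @ w, y_val)
```

Replacing episode_outer_loss with this function in the meta-batch hypergradient above would yield the Approx baseline; Approx-train would additionally reuse D^j_tr in place of D^j_val.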

In settings where we do not use the validation sets, we let the training sets of each episode contain 16 examples per class. Using training episodes with just one example per class resulted in performances just above random chance. While the first experiment constitutes a standard baseline, the others have the specific aim of assessing (i) the importance of splitting episodes of the meta-training set into training and validation, and (ii) the importance of computing the hypergradient of the approximate bilevel problem with Algorithm 1. The results reported in Table 4 suggest that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good representation in a meta-learning context. On the other side, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline. One possible explanation is that, in this context,

4 The network is similar to C4L, but has 64 filters per layer.

Table 4. Performance of various methods where the representation is either transferred or learned with variants of hyper-representation methods. The last row reports, for comparison, the score obtained with hyper-representation.

Method                      filters   Accuracy 1-shot
Multiclass                  64        43.02
Bilevel-train               32        29.63
Approx                      32        41.12
Approx-train                32        38.80
Classic-train               32        40.46
Hyper-representation-C4L    32        47.51

some classes in the training and testing meta-datasets are rather similar (e.g. various dog breeds), and thus ground classifiers can leverage very specific representations.

6. Conclusions

We have shown that both HO and ML can be formulated in terms of bilevel programming and solved with an iterative approach. When the inner problem has a unique solution (e.g. is strongly convex), our theoretical results show that the iterative approach has convergence guarantees, a result that is interesting in its own right. In the case of ML, by adapting classical strategies (Baxter, 1995) to the bilevel framework with training/validation splitting, we present a method for learning hyper-representations which is experimentally effective and supported by our theoretical guarantees.

Our framework encompasses recently proposed methods for meta-learning, such as learning to optimize, but also suggests different design patterns for the inner learning algorithm which could be interesting to explore in future work. The resulting inner problems may not satisfy the assumptions of our convergence analysis, raising the need for further theoretical investigations. An additional future direction of research is the study of the statistical properties of bilevel strategies where outer objectives are based on the generalization ability of the inner model to new (validation) data. Ideas from (Maurer et al., 2016; Denevi et al., 2018) may be useful in this direction.


References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 3981–3989.

Bard, J. F. (2013). Practical bilevel optimization: algorithms and applications, volume 30. Springer Science & Business Media.

Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Annual Conference on Computational Learning Theory (COLT), pages 311–320. ACM.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43.

Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems (NIPS), pages 3458–3468.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Bergstra, J., Yamins, D., and Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 115–123.

Caruana, R. (1998). Multitask learning. In Learning to learn, pages 95–133. Springer.

Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018). Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089. To appear in UAI 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255.

Domke, J. (2012). Generic methods for optimization-based modeling. In AISTATS, volume 22, pages 318–326.

Dontchev, A. L. and Zolezzi, T. (1993). Well-posed optimization problems, volume 1543 of Lecture Notes in Mathematics. Springer-Verlag, Berlin.

Edwards, H. and Storkey, A. (2016). Towards a neural statistician. arXiv preprint arXiv:1606.02185.

Evgeniou, T., Micchelli, C. A., and Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637.

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126–1135.

Flamary, R., Rakotomamonjy, A., and Gasso, G. (2014). Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, page 103.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1165–1173.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256.

Griewank, A. and Walther, A. (2008). Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Hutter, F., Lücke, J., and Schmidt-Thieme, L. (2015). Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337.

Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare events.

Keerthi, S. S., Sindhwani, V., and Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems (NIPS), pages 673–680.

Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., and Torralba, A. (2012). Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158–171. Springer.

Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.


Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1885–1894.

Kunapuli, G., Bennett, K., Hu, J., and Pang, J.-S. (2008). Classification model selection via bilevel programming. Optimization Methods and Software, 23(4):475–489.

Kuzborskij, I., Orabona, F., and Caputo, B. (2013). From N to N+1: Multiclass transfer incremental learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3358–3365.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Maclaurin, D., Duvenaud, D. K., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2113–2122.

Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A simple neural attentive meta-learner.

Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2554–2563.

Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 737–746.

Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1842–1850.

Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139.

Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951–2959.

Thrun, S. and Pratt, L. (1998). Learning to learn. Springer.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000–6010.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 3630–3638.

Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Freitas, N., and Sohl-Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3751–3760.


A. Proofs of the Results in Sec. 3

Proof of Theorem 3.1. Since Λ is compact, it follows from Weierstrass' theorem that a sufficient condition for the existence of minimizers is that f is continuous. Thus, let λ ∈ Λ and let (λ_n)_{n∈N} be a sequence in Λ such that λ_n → λ. We prove that f(λ_n) = E(w_{λ_n}, λ_n) → E(w_λ, λ) = f(λ). Since (w_{λ_n})_{n∈N} is bounded, there exists a subsequence (w_{k_n})_{n∈N} such that w_{k_n} → w̄ for some w̄ ∈ R^d. Now, since λ_{k_n} → λ and the map (w, λ) ↦ L_λ(w) is jointly continuous, we have

$$ \forall\, w \in \mathbb{R}^d:\quad L_{\lambda}(\bar w) = \lim_n L_{\lambda_{k_n}}(w_{k_n}) \le \lim_n L_{\lambda_{k_n}}(w) = L_{\lambda}(w). $$

Therefore, w̄ is a minimizer of L_λ and hence w̄ = w_λ. This proves that (w_{λ_n})_{n∈N} is a bounded sequence having a unique cluster point. Hence (w_{λ_n})_{n∈N} converges to its unique cluster point, which is w_λ. Finally, since (w_{λ_n}, λ_n) → (w_λ, λ) and E is jointly continuous, we have E(w_{λ_n}, λ_n) → E(w_λ, λ), and the statement follows.

We recall a fundamental fact concerning the stability of minima and minimizers in optimization problems (Dontchev and Zolezzi, 1993). We provide the proof for completeness.

Theorem A.1 (Convergence). Let φ_T and φ be lower semicontinuous functions defined on a compact set Λ. Suppose that φ_T converges uniformly to φ on Λ as T → +∞. Then

(a) inf φ_T → inf φ;

(b) argmin φ_T → argmin φ, meaning that, for every (λ_T)_{T∈N} such that λ_T ∈ argmin φ_T, we have that:
  - (λ_T)_{T∈N} admits a convergent subsequence;
  - for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ argmin φ.

Proof. Let (λ_T)_{T∈N} be a sequence in Λ such that, for every T ∈ N, λ_T ∈ argmin φ_T. We prove that:

1) (λ_T)_{T∈N} admits a convergent subsequence;

2) for every subsequence (λ_{K_T})_{T∈N} such that λ_{K_T} → λ̄, we have λ̄ ∈ argmin φ and φ_{K_T}(λ_{K_T}) → inf φ;

3) inf φ_T → inf φ.

The first point follows from the fact that Λ is compact. Concerning the second point, let (λ_{K_T})_{T∈N} be a subsequence such that λ_{K_T} → λ̄. Since φ_{K_T} converges uniformly to φ, we have

$$ |\varphi_{K_T}(\lambda_{K_T}) - \varphi(\lambda_{K_T})| \le \sup_{\lambda\in\Lambda} |\varphi_{K_T}(\lambda) - \varphi(\lambda)| \to 0. $$

Therefore, using also the continuity of φ, we have

$$ \forall\, \lambda \in \Lambda:\quad \varphi(\bar\lambda) = \lim_T \varphi(\lambda_{K_T}) = \lim_T \varphi_{K_T}(\lambda_{K_T}) \le \lim_T \varphi_{K_T}(\lambda) = \varphi(\lambda). $$

So λ̄ ∈ argmin φ and φ(λ̄) = lim_T φ_{K_T}(λ_{K_T}) ≤ inf φ = φ(λ̄), that is, lim_T φ_{K_T}(λ_{K_T}) = inf φ.

Finally, as regards the last point, we proceed by contradiction. If (φ_T(λ_T))_{T∈N} does not converge to inf φ, then there exist an ε > 0 and a subsequence (φ_{K_T}(λ_{K_T}))_{T∈N} such that

$$ |\varphi_{K_T}(\lambda_{K_T}) - \inf \varphi| \ge \varepsilon \quad \forall\, T \in \mathbb{N}. \qquad (11) $$

Now, let (λ_{K^{(1)}_T}) be a convergent subsequence of (λ_{K_T})_{T∈N} and suppose that λ_{K^{(1)}_T} → λ̄. Clearly (λ_{K^{(1)}_T}) is also a subsequence of (λ_T)_{T∈N}. Then it follows from point 2) above that φ_{K^{(1)}_T}(λ_{K^{(1)}_T}) → inf φ. This latter finding, together with equation (11), gives a contradiction.

Proof of Theorem 3.2. Since E(·, λ) is uniformly Lipschitz continuous, there exists ν > 0 such that, for every T ∈ N and every λ ∈ Λ,

$$ |f_T(\lambda) - f(\lambda)| = |E(w_{T,\lambda}, \lambda) - E(w_{\lambda}, \lambda)| \le \nu\, \| w_{T,\lambda} - w_{\lambda} \|. $$

It follows from assumption (vi) that f_T(λ) converges to f(λ) uniformly on Λ as T → +∞. Then the statement follows from Theorem A.1.

B. Cross-validation and Bilevel Programming

We note that the (approximate) bilevel programming framework easily accommodates also estimations of the generalization error generated by cross-validation procedures. We describe here the case of K-fold cross-validation, which includes also leave-one-out cross-validation.

Let D = {(x_i, y_i)}_{i=1}^n be the set of available data. K-fold cross-validation, with K ∈ {1, 2, ..., N}, consists in partitioning D into K subsets {D_j}_{j=1}^K and fitting as many models g_{w^j} on the training data D^j_tr = ⋃_{i≠j} D_i. The models are then evaluated on D^j_val = D_j. Denoting by w = (w^j)_{j=1}^K the vector of stacked weights, the K-fold cross-validation error is given by

$$ E(w, \lambda) = \frac{1}{K} \sum_{j=1}^{K} E^j(w^j, \lambda), $$

where E^j(w^j, λ) = Σ_{(x,y)∈D_j} ℓ(g_{w^j}(x), y). E can be treated as the outer objective in the bilevel framework, while the inner objective L_λ may be given by the sum of regularized empirical errors over each D^j_tr for the K models.


Under this perspective, a K-fold cross-validation procedure closely resembles the bilevel problem for ML formulated in Sec. 2.2, where in this case the meta-distribution collapses on the data (ground) distribution and the episodes are sampled from the same dataset of points.

By following the procedure outlined in Sec. 2.3, we can approximate the minimization of L_λ with T steps of an optimization dynamics and compute the hypergradient of f_T(λ) = (1/K) Σ_j E^j(w^j_T, λ) by training the K models and proceeding with either forward or reverse differentiation. The models may be fitted sequentially, in parallel or stochastically. Specifically, in this last case, one can repeatedly sample one fold at a time (or possibly a mini-batch of folds) and compute a stochastic hypergradient that can be used in an SGD procedure in order to minimize f_T. At last, we note that similar ideas for the leave-one-out cross-validation error are developed in (Beirami et al., 2017), where the hypergradient of an approximation of the outer objective is computed by means of the implicit function theorem.
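As a concrete illustration (with hypothetical fold arrays, a ridge-style inner objective and illustrative step sizes; not code from the paper), a JAX sketch of the K-fold outer objective and its hypergradient could look as follows:

```python
import jax
import jax.numpy as jnp

def kfold_outer_objective(lam, folds, T=100, inner_lr=0.01):
    """f_T(lambda) = (1/K) * sum_j E^j(w^j_T, lambda): each fold j is held out once,
    and w^j is obtained by T gradient steps on the remaining folds."""
    K = len(folds)
    total = 0.0
    for j in range(K):
        X_val, y_val = folds[j]
        X_tr = jnp.concatenate([folds[i][0] for i in range(K) if i != j])
        y_tr = jnp.concatenate([folds[i][1] for i in range(K) if i != j])

        def inner_loss(w):
            # regularized empirical error on D^j_tr; exp(lam) keeps the
            # regularization strength positive (illustrative choice)
            return jnp.sum((X_tr @ w - y_tr) ** 2) + jnp.exp(lam) * jnp.sum(w ** 2)

        w = jnp.zeros(X_tr.shape[1])
        for _ in range(T):
            w = w - inner_lr * jax.grad(inner_loss)(w)
        total = total + jnp.sum((X_val @ w - y_val) ** 2)
    return total / K

hypergrad = jax.grad(kfold_outer_objective)   # reverse-mode, through all K inner dynamics
```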

C. The Effect of T: Ridge Regression

In Sec. 5.1 we showed that increasing the number of iterations T leads to a better optimization of the outer objective f through the approximations f_T, which converge uniformly to f by Proposition 3.2. This, however, does not necessarily result in better test scores, due to possible overfitting of the training and validation errors, as we noted for the linear hyper-representation multiclass classification experiment in Sec. 5.1. The optimal choice of T appears to be indeed problem dependent: if in the aforementioned experiment a quite small T led to the best test accuracy (after hyperparameter optimization), in this section we present a small-scale linear regression experiment that benefits from an increasing number of inner iterations.

We generated 90 noisy synthetic data points with 30 features, of which only 5 were informative, and divided them equally to form training, validation and test sets. As the outer objective, we set the mean squared error E(w) = ||X_val w − y_val||², where X_val and y_val are the validation design matrix and targets, respectively, and w ∈ R^30 is the weight vector of the model. We set as inner objective

$$ L_\lambda(w) = \| X_{\mathrm{tr}} w - y_{\mathrm{tr}} \|^2 + \sum_{i=1}^{30} e^{\lambda_i} w_i^2, $$

and optimized the vector of L2 regularization coefficients λ (equivalent to a diagonal Tikhonov regularization matrix). The results reported in Table 5 show that, in this scenario, overfitting is not an issue and 250 inner iterations yield the best test result. Note that, as in Sec. 5.1, also this problem admits an analytical solution, from which it is possible to compute the hypergradient analytically and perform exact hyperparameter optimization, as reported in the last row of Table 5.

Table 5. Validation and test mean absolute percentage error (MAPE) for various values of T.

T       Validation MAPE    Test MAPE
10      11.35              43.49
50      1.28               5.22
100     0.55               1.26
250     0.47               0.50
Exact   0.37               0.57
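For the "Exact" row of Table 5, no inner iterations are needed, since the inner problem admits the closed-form solution w(λ) = (X_tr^T X_tr + diag(e^λ))^{-1} X_tr^T y_tr. A minimal JAX sketch (with hypothetical arrays, not the code used to produce the table) is:

```python
import jax
import jax.numpy as jnp

def exact_outer_objective(lam, X_tr, y_tr, X_val, y_val):
    """Exact f(lambda): validation error at the closed-form solution of the
    inner ridge problem with per-coordinate penalties exp(lambda_i)."""
    A = X_tr.T @ X_tr + jnp.diag(jnp.exp(lam))
    w = jnp.linalg.solve(A, X_tr.T @ y_tr)
    return jnp.sum((X_val @ w - y_val) ** 2)

# Exact hypergradient: reverse-mode AD through the linear solve (no inner iterations).
exact_hypergrad = jax.grad(exact_outer_objective)
```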

D. Further Details on Few-shot Experiments

This appendix contains implementation details of the representation mapping h and the meta-learning hyperparameters used in the few-shot learning experiments of Sec. 5.2.

To optimize the representation mapping, in all the experiments we use Adam with learning rate set to 10⁻³ and a decay rate of 10⁻⁵. We used the initialization strategy proposed in (Glorot and Bengio, 2010) for all the weights λ of the representation.

For the Omniglot experiments we used a meta-batch size of 32 episodes for five-way and of 16 episodes for 20-way classification. To train the episode-specific classifiers we set the learning rate to 0.1.

For one set of experiments with Mini-imagenet we used a hyper-representation (C4L) consisting of 4 convolutional layers, where each layer is composed of a convolution with 32 filters, a batch normalization, a ReLU activation and a 2x2 max-pooling. The classifiers were trained using mini-batches of 4 episodes for one-shot and 2 episodes for five-shot, with learning rate set to 0.01.

The other set of experiments with Mini-imagenet employed a Residual Network (RN) as the representation mapping, built of 4 residual blocks (with 64, 96, 128, 256 filters) followed by the block: 1×1 conv (2048 filters), avg pooling, 1×1 conv (512 filters). Each residual block repeats the following block 3 times: 1×1 conv, batch normalization, leaky ReLU (leak 0.1), before the residual connection. In this case the classifiers were optimized using mini-batches of 2 episodes for both one- and five-shot, with learning rate set to 0.04.

Page 7: Bilevel Programming for Hyperparameter Optimization and ...

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

set V for tuning ML hyperparameters and finally a meta-test set T which is used to estimate accuracy Operationallyeach meta-dataset consists of a pool of samples belonging todifferent (non-overlapping between separate meta-dataset)classes which can be combined to form ground classifica-tion datasets Dj = Dj

tr cup Djval with 5 or 20 classes (for

Omniglot) The Djtrrsquos contain 1 or 5 examples per class

which are used to fit wj (see Eq 10) The Djvalrsquos contain-

ing 15 examples per class is used either to compute fT (λ)(see Eq (9)) and its (stochastic) gradient if Dj isin D or toprovide a generalization score if Dj comes from either V orT For MiniImagenet we use the same split and images pro-posed in (Ravi and Larochelle 2017) while for Omniglotwe use the protocol defined by (Santoro et al 2016)

As ground classifiers we use multinomial logistic regressorsand as task losses `j we employ cross-entropy The innerproblems being strongly convex admit unique minimizersyet require numerical computation of the solutions We ini-tialize ground models parameters wj to 0 and according tothe observation in Sec 51 we perform T gradient descentsteps where T is treated as a ML hyperparameter that has tobe validated Figure 6 shows an example of meta-validationof T for one-shot learning on MiniImagenet We compute astochastic approximation ofnablafT (λ) with Algorithm 1 anduse Adam with decaying learning rate to optimize λ

Regarding the specific implementation of the representationmapping h we employ for Omniglot a four-layers convo-lutional neural network with strided convolutions and 64filters per layer as in (Vinyals et al 2016) and other suc-cessive works For MiniImagenet we tried two differentarchitectures

bull C4L a four-layers convolutional neural network with max-pooling and 32 filters per layer

bull RN a residual network (He et al 2016) built of fourresidual blocks followed by two convolutional layers

The first network architecture has been proposed in (Raviand Larochelle 2017) and then used in (Finn et al 2017)while a similar residual network architecture has been em-ployed in a more recent work (Mishra et al 2018) Furtherdetails on the architectures of h as well as other ML hyper-parameters are specified in the supplementary material Wereport our results using RN for MiniImagenet in Table 3alongside scores from various recently proposed methodsfor comparison

The proposed method achieves competitive results highlight-ing the relative importance of learning a task independentrepresentation on the top of which logistic classifiers trainedwith very few samples generalize well Moreover utilizingmore expressive models such as residual network as repre-sentation mappings is beneficial for our proposed strategyand unlike other methods does not result in overfitting

of the outer objective as reported in (Mishra et al 2018)Indeed compared to C4L RN achieves a relative improve-ment of 65 on one-shot and 42 on five-shot Figure 5provides a visual example of the goodness of the learned rep-resentation showing that MiniImagenet examples (the firstfrom meta-training the second from the meta-testing sets)from similar classes (different dog breeds) are mapped neareach other by h and conversely samples from dissimilarclasses are mapped afar

Figure 5 After sampling two datasets D isin D and Dprime isin T weshow on the top the two images x isin D xprime isin Dprime that minimize||hλ(x)minus hλ(xprime)|| and on the bottom those that maximize it Inbetween each of the two couples we compare a random subset ofcomponents of hλ(x) (blue) and hλ(xprime) (green)

3 5 8 1266

68

70

72

74

76

78Accuracy on Dtr

3 5 8 12

49

50

51

52

53

Accuracy on DtestAccuracy on Dval

Figure 6 Meta-validation of the number of gradient descent steps(T ) of the ground models for MiniImagenet using the RN repre-sentation Early stopping on the accuracy on meta-validation setduring meta-training resulted in halting the optimization of λ after42k 40k 22k and 15k hyperiterations for T equal to 3 5 8 and12 respectively in line with our observation in Sec 51

53 On Variants of Representation Learning Methods

In this section we show the benefits of learning a represen-tation within the proposed bilevel framework compared toother possible approaches that involve an explicit factoriza-tion of a classifier as gj h The representation mapping his either pretrained or learned with different meta-learningalgorithms We focus on the problem of one-shot learningon MiniImagenet and we use C4L as architecture for therepresentation mapping In all the experiments the groundmodels gj are multinomial logistic regressor as in Sec 52tuned with 5 steps of gradient descent We ran the followingexperiments

bull Multiclass the mapping h X rarr R64 is given by the

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Table 3 Accuracy scores computed on episodes from T of various methods on 1-shot and 5-shot classification problems on Omniglotand MiniImagenet For MiniImagenet 95 confidence intervals are reported For Hyper-representation the scores are computed over 600randomly drawn episodes For other methods we show results as reported by their respective authors

OMNIGLOT 5 classes OMNIGLOT 20 classes MINIIMAGENET 5 classesMethod 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot

Siamese nets (Koch et al 2015) 973 984 882 970 minus minusMatching nets (Vinyals et al 2016) 981 989 938 985 4344plusmn 077 5531plusmn 073Neural stat (Edwards and Storkey 2016) 981 995 932 981 minus minusMemory mod (Kaiser et al 2017) 984 996 950 986 minus minusMeta-LSTM (Ravi and Larochelle 2017) minus minus minus minus 4356plusmn 084 6060plusmn 071MAML (Finn et al 2017) 987 999 958 989 4870plusmn 175 6311plusmn 092Meta-networks (Munkhdalai and Yu 2017) 989 minus 970 minus 4921plusmn 096 minusPrototypical Net (Snell et al 2017) 988 997 960 989 4942plusmn 078 6820plusmn 066SNAIL (Mishra et al 2018) 991 998 976 994 5571plusmn 099 6888plusmn 092Hyper-representation 986 995 955 984 5054plusmn 085 6453plusmn 068

linear outputs before the softmax operation of a network4

pretrained on the totality of examples contained in the train-ing meta-dataset (600 examples for each of the 64 classes)In this setting we found that using the second last layer orthe output after the softmax yields worst results

bull Bilevel-train we use a bilevel approach but unlike in Sec4 we optimize the parameter vector λ of the representationmapping by minimizing the loss on the training sets of eachepisode The hypergradient is still computed with Algorithm1 albeit we set Dj

val = Djtr for each training episodes

bull Approx and Approx-train we consider an approxima-tion of the hypergradient nablafT (λ) by disregarding the op-timization dynamics of the inner objectives (ie we setnablaλwjT = 0) In Approx-train we just use the training sets

bull Classic as in (Baxter 1995) we learn h by jointly opti-mize f(λw1 wN ) =

sumNj=1 L

j(wj λDjtr) and treat

the problem as standard multitask learning with the ex-ception that we evaluate f on mini-batches of 4 episodesrandomly sampled every 5 gradient descent iterations

In settings where we do not use the validation sets we letthe training sets of each episode contain 16 examples perclass Using training episodes with just one example perclass resulted in performances just above random chanceWhile the first experiment constitutes a standard baselinethe others have the specific aim of assessing (i) the impor-tance of splitting episodes of meta-training set into trainingand validation and (ii) the importance of computing thehypergradient of the approximate bilevel problem with Al-gorithm 1 The results reported in Table 4 suggest thatboth the trainingvalidation splitting and the full computa-tion of the hypergradient constitute key factors for learninga good representation in a meta-learning context On theother side using pretrained representations especially ina low-dimensional space turns out to be a rather effectivebaseline One possible explanation is that in this context

4The network is similar to C4L but has 64 filters per layer

Table 4 Performance of various methods where the representationis either transfered or learned with variants of hyper-representationmethods The last raw reports for comparison the score obtainedwith hyper-representation

Method filters Accuracy 1-shot

Multiclass 64 4302Bilevel-train 32 2963Approx 32 4112Approx-train 32 3880Classic-train 32 4046Hyper-representation-C4L 32 4751

some classes in the training and testing meta-datasets arerather similar (eg various dog breeds) and thus groundclassifiers can leverage on very specific representations

6 ConclusionsWe have shown that both HO and ML can be formulated interms of bilevel programming and solved with an iterativeapproach When the inner problem has a unique solution(eg is strongly convex) our theoretical results show that theiterative approach has convergence guarantees a result thatis interesting in its own right In the case of ML by adaptingclassical strategies (Baxter 1995) to the bilevel frameworkwith trainingvalidation splitting we present a method forlearning hyper-representations which is experimentally ef-fective and supported by our theoretical guarantees

Our framework encompasses recently proposed methodsfor meta-learning such as learning to optimize but alsosuggests different design patterns for the inner learningalgorithm which could be interesting to explore in futurework The resulting inner problems may not satisfy theassumptions of our convergence analysis raising the needfor further theoretical investigations An additional futuredirection of research is the study of the statistical propertiesof bilevel strategies where outer objectives are based on thegeneralization ability of the inner model to new (validation)data Ideas from (Maurer et al 2016 Denevi et al 2018)may be useful in this direction

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

ReferencesAndrychowicz M Denil M Gomez S Hoffman M W

Pfau D Schaul T and de Freitas N (2016) Learningto learn by gradient descent by gradient descent In Ad-vances in Neural Information Processing Systems (NIPS)pages 3981ndash3989

Bard J F (2013) Practical bilevel optimization algo-rithms and applications volume 30 Springer Science ampBusiness Media 01251

Baxter J (1995) Learning internal representations In Pro-ceedings of the 8th Annual Conference on ComputationalLearning Theory (COLT) pages 311ndash320 ACM

Baydin A G Pearlmutter B A Radul A A and SiskindJ M (2017) Automatic differentiation in machine learn-ing a survey Journal of Machine Learning Research181531ndash15343

Beirami A Razaviyayn M Shahrampour S and TarokhV (2017) On optimal generalizability in parametriclearning In Advances in Neural Information ProcessingSystems (NIPS) pages 3458ndash3468

Bergstra J and Bengio Y (2012) Random search for hyper-parameter optimization Journal of Machine LearningResearch 13(Feb)281ndash305

Bergstra J Yamins D and Cox D D (2013) Makinga science of model search Hyperparameter optimiza-tion in hundreds of dimensions for vision architecturesIn Proceedings of the 30th International Conference onMachine Learning (ICML) pages 115ndash123

Caruana R (1998) Multitask learning In Learning tolearn pages 95ndash133 Springer 02683

Colson B Marcotte P and Savard G (2007) Anoverview of bilevel optimization Annals of operationsresearch 153(1)235ndash256

Denevi G Ciliberto C Stamos D and Pontil M (2018)Incremental learning-to-learn with statistical guaranteesarXiv preprint arXiv180308089 To appear in UAI 2018

Deng J Dong W Socher R Li L-J Li K and Fei-FeiL (2009) Imagenet A large-scale hierarchical imagedatabase In Computer Vision and Pattern Recognition(CVPR) pages 248ndash255

Domke J (2012) Generic Methods for Optimization-BasedModeling In AISTATS volume 22 pages 318ndash326

Dontchev A L and Zolezzi T (1993) Well-posed op-timization problems volume 1543 of Lecture Notes inMathematics Springer-Verlag Berlin

Edwards H and Storkey A (2016) Towards a NeuralStatistician arXiv160602185 [cs stat] 00027 arXiv160602185

Evgeniou T Micchelli C A and Pontil M (2005) Learn-ing multiple tasks with kernel methods Journal of Ma-chine Learning Research 6(Apr)615ndash637

Finn C Abbeel P and Levine S (2017) Model-agnosticmeta-learning for fast adaptation of deep networks InProceedings of the 34th International Conference on Ma-chine Learning (ICML) pages 1126ndash1135

Flamary R Rakotomamonjy A and Gasso G(2014) Learning constrained task similarities in graph-regularized multi-task learning Regularization Opti-mization Kernels and Support Vector Machines page103

Franceschi L Donini M Frasconi P and Pontil M(2017) Forward and reverse gradient-based hyperparam-eter optimization In Proceedings of the 34th Interna-tional Conference on Machine Learning (ICML) pages1165ndash1173

Glorot X and Bengio Y (2010) Understanding the diffi-culty of training deep feedforward neural networks InProceedings of the 13th International Conference on Arti-ficial Intelligence and Statistics (AIStat) pages 249ndash256

Griewank A and Walther A (2008) Evaluating deriva-tives principles and techniques of algorithmic differenti-ation SIAM

He K Zhang X Ren S and Sun J (2016) Deep residuallearning for image recognition In Computer Vision andPattern Recognition (CVPR) pages 770ndash778

Hutter F Lcke J and Schmidt-Thieme L (2015) BeyondManual Tuning of Hyperparameters KI - KunstlicheIntelligenz 29(4)329ndash337

Kaiser L Nachum O Roy A and Bengio S (2017)Learning to remember rare events

Keerthi S S Sindhwani V and Chapelle O (2007) Anefficient method for gradient-based adaptation of hyper-parameters in svm models In Advances in Neural Infor-mation Processing Systems (NIPS) pages 673ndash680

Khosla A Zhou T Malisiewicz T Efros A A andTorralba A (2012) Undoing the damage of dataset biasIn European Conference on Computer Vision pages 158ndash171 Springer

Koch G Zemel R and Salakhutdinov R (2015) Siameseneural networks for one-shot image recognition In ICMLDeep Learning Workshop volume 2 00127

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Koh P W and Liang P (2017) Understanding black-boxpredictions via influence functions In Proceedings ofthe 34th International Conference on Machine Learning(ICML) pages 1885ndash1894

Kunapuli G Bennett K Hu J and Pang J-S (2008)Classification model selection via bilevel programmingOptimization Methods and Software 23(4)475ndash489

Kuzborskij I Orabona F and Caputo B (2013) Fromn to n+ 1 Multiclass transfer incremental learning InComputer Vision and Pattern Recognition (CVPR) 2013IEEE Conference on pages 3358ndash3365

Lake B M Salakhutdinov R and Tenenbaum J B(2015) Human-level concept learning through probabilis-tic program induction Science 350(6266)1332ndash1338

Lake B M Ullman T D Tenenbaum J B and Gersh-man S J (2017) Building machines that learn and thinklike people Behavioral and Brain Sciences 40 00152

Maclaurin D Duvenaud D K and Adams R P (2015)Gradient-based hyperparameter optimization through re-versible learning In Proceedings of the 32nd Interna-tional Conference on Machine Learning (ICML pages2113ndash2122

Maurer A Pontil M and Romera-Paredes B (2016) Thebenefit of multitask representation learning The Journalof Machine Learning Research 17(1)2853ndash2884

Mishra N Rohaninejad M Chen X and Abbeel P(2018) A simple neural attentive meta-learner

Munkhdalai T and Yu H (2017) Meta networks InProceedings of the 34th International Conference on Ma-chine Learning (ICML) pages 2554ndash2563

Pedregosa F (2016) Hyperparameter optimization withapproximate gradient In Proceedings of The 33rd Inter-national Conference on Machine Learning (ICML) pages737ndash746

Ravi S and Larochelle H (2017) Optimization as a modelfor few-shot learning In In International Conference onLearning Representations (ICLR)

Santoro A Bartunov S Botvinick M Wierstra Dand Lillicrap T (2016) Meta-learning with memory-augmented neural networks In Proceedings ofthe 33rdInternational Conference on Machine Learning pages1842ndash1850

Schmidhuber J (1992) Learning to Control Fast-WeightMemories An Alternative to Dynamic Recurrent Net-works Neural Computation 4(1)131ndash139 00082

Snell J Swersky K and Zemel R S (2017) Prototypicalnetworks for few-shot learning In Advances in neuralinformation processing systems pages 4080ndash4090

Snoek J Larochelle H and Adams R P (2012) Practicalbayesian optimization of machine learning algorithmsIn Advances in Neural Information Processing Systems(NIPS) pages 2951ndash2959

Thrun S and Pratt L (1998) Learning to learn Springer

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser L and Polosukhin I (2017)Attention is all you need In Advances in Neural Informa-tion Processing Systems (NIPS) pages 6000ndash6010

Vinyals O Blundell C Lillicrap T Kavukcuoglu Kand Wierstra D (2016) Matching networks for one shotlearning In Advances in Neural Information ProcessingSystems (NIPS) pages 3630ndash3638

Wichrowska O Maheswaranathan N Hoffman M WColmenarejo S G Denil M Freitas N and Sohl-Dickstein J (2017) Learned optimizers that scale andgeneralize In Proceedings of the 34th International Con-ference on Machine Learning (ICML) pages 3751ndash3760

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

A Proofs of the Results in Sec 3Proof of Theorem 31 Since Λ is compact it follows fromWeierstrass theorem that a sufficient condition for the exis-tence of minimizers is that f is continuous Thus let λ isin Λand let (λn)nisinN be a sequence in Λ such that λn rarr λWe prove that f(λn) = E(wλn

λn)rarr E(wλ λ) = f(λ)Since (wλn

)nisinN is bounded there exists a subsequence(wkn)nisinN such that wkn rarr w for some w isin Rd Nowsince λkn rarr λ and the map (w λ) 7rarr Lλ(w) is jointlycontinuous we have

forallw isin Rd Lλ(w) = limnLλkn

(wkn)

le limnLλkn

(w) = Lλ(w)

Therefore w is a minimizer of Lλ and hence w = wλThis prove that (wλn)nisinN is a bounded sequence havinga unique cluster point Hence (wλn)nisinN is convergenceto its unique cluster point which is wλ Finally since(wλn

λn)rarr (wλ λ) and E is jointly continuous we haveE(wλn λn)rarr E(wλ λ) and the statement follows

We recal a fundamental fact concerning the stability of min-ima and minimizers in optimization problems (Dontchevand Zolezzi 1993) We provide the proof for completeness

Theorem A1 (Convergence) Let ϕT and ϕ be lower semi-continuous functions defined on a compact set Λ Supposethat ϕT converges uniformly to ϕ on Λ as T rarr +infin Then

(a) inf ϕT rarr inf ϕ

(b) argminϕT rarr argminϕ meaning that for every(λT )TisinN such that λT isin argminϕT we have that

- (λT )TisinN admits a convergent subsequence- for every subsequence (λKT

)TisinN such thatλKT

rarr λ we have λ isin argminϕ

Proof Let (λT )TisinN be a sequence in Λ such that for everyT isin N λT isin argminϕT We prove that

1) (λT )TisinN admits a convergent subsequence

2) for every subsequence (λKT)TisinN such that λKT

rarr λwe have λ isin argminϕ and ϕKT

(λKT)rarr inf ϕ

3) inf ϕT rarr inf ϕ

The first point follows from the fact that Λ is compactConcerning the second point let (λKT

)TisinN be a subse-quence such that λKT

rarr λ Since ϕKTconverge uniformly

to ϕ we have

|ϕKT(λKT

)minus ϕ(λKT)| le sup

λisinΛ|ϕKT

(λ)minus ϕ(λ)| rarr 0

Therefore using also the continuity of ϕ we have

forallλ isin Λ ϕ(λ) = limTϕ(λKT

) = limTϕKT

(λKT)

le limTϕKT

(λ) = ϕ(λ)

So λ isin argminϕ andϕ(λ) = limT ϕKT(λKT

) le inf ϕ =ϕ(λ) that is limT ϕKT

(λKT) = inf ϕ

Finally as regards the last point we proceed by contradic-tion If (ϕT (λT ))TisinN does not convergce to inf f thenthere exists an ε gt 0 and a subsequence (ϕKT

(λKT))TisinN

such that

|ϕKT(λKT

)minus inf ϕ| ge ε forallT isin N (11)

Now let (λK

(1)T

) be a convergent subsequence of (λKT)TisinN

Suppose that λK

(1)T

rarr λ Clearly (λK

(1)T

) is also a subse-quence of (λT )TisinN Then it follows from point 2) abovethat ϕ

K(1)T

(λK

(1)T

) rarr inf ϕ This latter finding togetherwith equation (11) gives a contradiction

Proof of Theorem 32 Since E(middot λ) is uniformly Lipschitzcontinuous there exists ν gt 0 such that for every T isin Nand every λ isin Λ

|fT (λ)minus f(λ)| = |E(wTλ λ)minus E(wλ λ)|le νwTλ minus wλ

It follows from assumption (vi) that fT (λ) converges tof(λ) uniformly on Λ as T rarr +infin Then the statementfollows from Theorem A1

B Cross-validation and Bilevel ProgrammingWe note that the (approximate) bilevel programming frame-work easily accommodates also estimations of the gener-alization error generated by a cross-validation proceduresWe describe here the case of K-fold cross-validation whichincludes also leave-one-out cross validation

Let D = (xi yi)ni=1 be the set of available data K-foldcross validation with K isin 1 2 N consists in parti-tioning D in K subsets DjKj=1 and fit as many modelsgwj on training data Dj

tr =⋃i6=j D

i The models are thenevaluated on Dj

val = Dj Denoting by w = (wj)Kj=1 thevector of stacked weights the K-fold cross validation erroris given by

E(w λ) =1

K

Ksumj=1

Ej(wj λ)

where Ej(wj λ) =sum

(xy)isinDj `(gwj (x) y) E can betreated as the outer objective in the bilevel framework whilethe inner objective Lλ may be given by the sum of regu-larized empirical errors over each Dj

tr for the K models

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Under this perspective a K-fold cross-validation procedureclosely resemble the bilevel problem for ML formulatedin Sec 22 where in this case the meta-distribution col-lapses on the data (ground) distribution and the episodes aresampled from the same dataset of points

By following the procedure outlined in Sec 23 we canapproximate the minimization of Lλ with T steps of anoptimization dynamics and compute the hypergradient offT (λ) = 1

K

sumj E

j(wjT λ) by training the K models andproceed with either forward or reverse differentiation Themodels may be fitted sequentially in parallel or stochas-tically Specifically in this last case one can repeatedlysample one fold at a time (or possibly a mini-batch of folds)and compute a stochastic hypergradient that can be used ina SGD procedure in order to minimize fT At last we notethat similar ideas for leave-one out cross-validation error aredeveloped in (Beirami et al 2017) where the hypergradientof an approximation of the outer objective is computed bymeans of the implicit function theorem

C The Effect of T Ridge RegressionIn Sec 51 we showed that increasing the number of itera-tions T leads to a better optimization of the outer objective fthrough the approximations fT which converge uniformlyto f by Proposition 32 This however does not necessaryresults in better test scores due to possible overfitting ofthe training and validation errors as we noted for the linearhyper-representation multiclass classification experiment inSec 51 The optimal choice of T appears to be indeedproblem dependent if in the aforementioned experiment aquite small T led to the best test accuracy (after hyperparam-eter optimization) in this section we present a small scalelinear regression experiment that benefits from an increasingnumber of inner iterations

We generated 90 noisy synthetic data points with 30 featuresof which only 5 were informative and divided them equallyto form training validation and test sets As outer objectivewe set the mean squared error E(w) = ||Xvalw minus yval||2where Xval and yval are the validation design matrix andtargets respectively and w isin R30 is the weight vector of themodel We set as inner inner objective

Lλ(w) = ||Xtrw minus ytr||2 +

30sumi=1

eλiw2i

and optimized the L2 vector of regularization coefficients λ(equivalent to a diagonal Tikhonov regularization matrix)The results reported in Table 5 show that in this scenariooverfitting is not an issue and 250 inner iterations yield thebest test result Note that as in Sec 51 also this problemadmits an analytical solution from which it is possible tocompute the hypergradient analytically and perform exact

hyperparameter optimization as reported in the last row ofthe Table 5

Table 5 Validation and test mean absolute percentage error(MAPE) for various values of T

T Validation MAPE Test MAPE

10 1135 434950 128 522100 055 126250 047 050Exact 037 057

D Further Details on Few-shot ExperimentsThis appendix contains implementation details of the repre-sentation mapping h and the meta-learning hyperparametersused in the few-shot learning experiments of Sec 52

To optimize the representation mapping in all the exper-iments we use Adam with learning rate set to 10minus3 anda decay-rate of 10minus5 We used the initialization strategyproposed in (Glorot and Bengio 2010) for all the weights λof the representation

For Omniglot experiments we used a meta-batch size of 32episodes for five-way and of 16 episodes for 20-way Totrain the episode-specific classifiers we set the learning rateto 01

For one set of experiments with Mini-imagenet we used anhyper-representation (C4L) consisting of 4 convolutionallayers where each layer is composed by a convolution with32 filters a batch normalization followed by a ReLU acti-vation and a 2x2 max-pooling The classifiers were trainedusing mini-batches of 4 episodes for one-shot and 2 episodesfor five-shot with learning rate set to 001

The other set of experiments with Mini-imagenet employeda Residual Network (RN) as the representation mappingbuilt of 4 residual blocks (with 64 96 128 256 filters) andthen the block that follows 1times 1 conv (2048 filters) avgpooling 1 times 1 conv (512 filters) Each residual blockrepeats the following block 3 times 1 times 1 conv batchnormalization leaky ReLU (leak 01) before the residualconnection In this case the classifiers were optimized usingmini-batches of 2 episodes for both one and five-shot withlearning rate set to 004

Page 8: Bilevel Programming for Hyperparameter Optimization and ...

Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Table 3 Accuracy scores computed on episodes from T of various methods on 1-shot and 5-shot classification problems on Omniglotand MiniImagenet For MiniImagenet 95 confidence intervals are reported For Hyper-representation the scores are computed over 600randomly drawn episodes For other methods we show results as reported by their respective authors

OMNIGLOT 5 classes OMNIGLOT 20 classes MINIIMAGENET 5 classesMethod 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot

Siamese nets (Koch et al 2015) 973 984 882 970 minus minusMatching nets (Vinyals et al 2016) 981 989 938 985 4344plusmn 077 5531plusmn 073Neural stat (Edwards and Storkey 2016) 981 995 932 981 minus minusMemory mod (Kaiser et al 2017) 984 996 950 986 minus minusMeta-LSTM (Ravi and Larochelle 2017) minus minus minus minus 4356plusmn 084 6060plusmn 071MAML (Finn et al 2017) 987 999 958 989 4870plusmn 175 6311plusmn 092Meta-networks (Munkhdalai and Yu 2017) 989 minus 970 minus 4921plusmn 096 minusPrototypical Net (Snell et al 2017) 988 997 960 989 4942plusmn 078 6820plusmn 066SNAIL (Mishra et al 2018) 991 998 976 994 5571plusmn 099 6888plusmn 092Hyper-representation 986 995 955 984 5054plusmn 085 6453plusmn 068

linear outputs before the softmax operation of a network4

pretrained on the totality of examples contained in the train-ing meta-dataset (600 examples for each of the 64 classes)In this setting we found that using the second last layer orthe output after the softmax yields worst results

bull Bilevel-train we use a bilevel approach but unlike in Sec4 we optimize the parameter vector λ of the representationmapping by minimizing the loss on the training sets of eachepisode The hypergradient is still computed with Algorithm1 albeit we set Dj

val = Djtr for each training episodes

bull Approx and Approx-train we consider an approxima-tion of the hypergradient nablafT (λ) by disregarding the op-timization dynamics of the inner objectives (ie we setnablaλwjT = 0) In Approx-train we just use the training sets

• Classic: as in (Baxter, 1995), we learn h by jointly optimizing f(λ, w^1, ..., w^N) = Σ_{j=1}^N L^j(w^j, λ, D^j_tr) and treat the problem as standard multitask learning, with the exception that we evaluate f on mini-batches of 4 episodes, randomly sampled every 5 gradient descent iterations.
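To make the role of the inner optimization dynamics concrete, the following is a minimal PyTorch sketch (our own illustration, not the code used for the experiments) that contrasts the full hypergradient, obtained by differentiating through T unrolled inner gradient steps, with the Approx variant that sets ∇_λ w_T = 0 by detaching the inner iterates. The quadratic inner objective, the random data, and all constants are made-up assumptions.

```python
import torch

torch.manual_seed(0)
A, b = torch.randn(20, 5), torch.randn(20)            # toy training data
A_val, b_val = torch.randn(20, 5), torch.randn(20)    # toy validation data
lam = torch.tensor(0.0, requires_grad=True)           # hyperparameter: log ridge coefficient

def inner_loss(w, lam):
    # L_lambda(w) = ||A w - b||^2 + exp(lambda) * ||w||^2
    return ((A @ w - b) ** 2).sum() + torch.exp(lam) * (w ** 2).sum()

def unroll(lam, T=100, eta=0.005, approx=False):
    # T steps of gradient descent on the inner objective; with approx=True the
    # dependence of the iterates on lambda is cut (i.e., grad_lambda w_T = 0).
    w = torch.zeros(5, requires_grad=True)
    for _ in range(T):
        g, = torch.autograd.grad(inner_loss(w, lam), w, create_graph=True)
        w = w - eta * (g.detach() if approx else g)
    return w

for approx in (False, True):
    outer = ((A_val @ unroll(lam, approx=approx) - b_val) ** 2).sum()
    hg, = torch.autograd.grad(outer, lam, allow_unused=True)
    # With approx=True the validation loss no longer depends on lambda through w_T,
    # so the hypergradient reduces to the direct term, which here is None (zero).
    print('approx' if approx else 'full', hg)
```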

In settings where we do not use the validation sets, we let the training sets of each episode contain 16 examples per class. Using training episodes with just one example per class resulted in performance just above random chance. While the first experiment constitutes a standard baseline, the others have the specific aim of assessing (i) the importance of splitting episodes of the meta-training set into training and validation sets and (ii) the importance of computing the hypergradient of the approximate bilevel problem with Algorithm 1. The results reported in Table 4 suggest that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good representation in a meta-learning context. On the other side, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline.

⁴ The network is similar to C4L, but has 64 filters per layer.

Table 4. Performance of various methods where the representation is either transferred or learned with variants of the hyper-representation method. The last row reports, for comparison, the score obtained with hyper-representation.

Method                      # filters    Accuracy 1-shot
Multiclass                  64           43.02
Bilevel-train               32           29.63
Approx                      32           41.12
Approx-train                32           38.80
Classic-train               32           40.46
Hyper-representation-C4L    32           47.51

One possible explanation is that, in this context, some classes in the training and testing meta-datasets are rather similar (e.g., various dog breeds), and thus ground classifiers can leverage very specific representations.

6. Conclusions

We have shown that both HO and ML can be formulated in terms of bilevel programming and solved with an iterative approach. When the inner problem has a unique solution (e.g., is strongly convex), our theoretical results show that the iterative approach has convergence guarantees, a result that is interesting in its own right. In the case of ML, by adapting classical strategies (Baxter, 1995) to the bilevel framework with training/validation splitting, we present a method for learning hyper-representations which is experimentally effective and supported by our theoretical guarantees.

Our framework encompasses recently proposed methods for meta-learning, such as learning-to-optimize, but also suggests different design patterns for the inner learning algorithm which could be interesting to explore in future work. The resulting inner problems may not satisfy the assumptions of our convergence analysis, raising the need for further theoretical investigations. An additional future direction of research is the study of the statistical properties of bilevel strategies where outer objectives are based on the generalization ability of the inner model to new (validation) data. Ideas from (Maurer et al., 2016; Denevi et al., 2018) may be useful in this direction.


References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 3981–3989.
Bard, J. F. (2013). Practical Bilevel Optimization: Algorithms and Applications, volume 30. Springer Science & Business Media.
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Annual Conference on Computational Learning Theory (COLT), pages 311–320. ACM.
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43.
Beirami, A., Razaviyayn, M., Shahrampour, S., and Tarokh, V. (2017). On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems (NIPS), pages 3458–3468.
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
Bergstra, J., Yamins, D., and Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 115–123.
Caruana, R. (1998). Multitask learning. In Learning to Learn, pages 95–133. Springer.
Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256.
Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. (2018). Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089. To appear in UAI 2018.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255.
Domke, J. (2012). Generic methods for optimization-based modeling. In AISTATS, volume 22, pages 318–326.
Dontchev, A. L. and Zolezzi, T. (1993). Well-Posed Optimization Problems, volume 1543 of Lecture Notes in Mathematics. Springer-Verlag, Berlin.
Edwards, H. and Storkey, A. (2016). Towards a neural statistician. arXiv preprint arXiv:1606.02185.
Evgeniou, T., Micchelli, C. A., and Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126–1135.
Flamary, R., Rakotomamonjy, A., and Gasso, G. (2014). Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, page 103.
Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1165–1173.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256.
Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778.
Hutter, F., Lücke, J., and Schmidt-Thieme, L. (2015). Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337.
Kaiser, L., Nachum, O., Roy, A., and Bengio, S. (2017). Learning to remember rare events.
Keerthi, S. S., Sindhwani, V., and Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems (NIPS), pages 673–680.
Khosla, A., Zhou, T., Malisiewicz, T., Efros, A. A., and Torralba, A. (2012). Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158–171. Springer.
Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2.
Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1885–1894.
Kunapuli, G., Bennett, K., Hu, J., and Pang, J.-S. (2008). Classification model selection via bilevel programming. Optimization Methods and Software, 23(4):475–489.
Kuzborskij, I., Orabona, F., and Caputo, B. (2013). From N to N+1: Multiclass transfer incremental learning. In Computer Vision and Pattern Recognition (CVPR), pages 3358–3365.
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.
Maclaurin, D., Duvenaud, D. K., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2113–2122.
Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884.
Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A simple neural attentive meta-learner.
Munkhdalai, T. and Yu, H. (2017). Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2554–2563.
Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 737–746.
Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1842–1850.
Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139.
Snell, J., Swersky, K., and Zemel, R. S. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 4080–4090.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951–2959.
Thrun, S. and Pratt, L. (1998). Learning to Learn. Springer.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000–6010.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), pages 3630–3638.
Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Freitas, N., and Sohl-Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3751–3760.


A. Proofs of the Results in Sec. 3

Proof of Theorem 3.1. Since Λ is compact, it follows from Weierstrass' theorem that a sufficient condition for the existence of minimizers is that f is continuous. Thus, let λ̄ ∈ Λ and let (λ_n)_{n∈ℕ} be a sequence in Λ such that λ_n → λ̄. We prove that f(λ_n) = E(w_{λ_n}, λ_n) → E(w_{λ̄}, λ̄) = f(λ̄). Since (w_{λ_n})_{n∈ℕ} is bounded, there exists a subsequence (w_{λ_{k_n}})_{n∈ℕ} such that w_{λ_{k_n}} → w̄ for some w̄ ∈ ℝ^d. Now, since λ_{k_n} → λ̄ and the map (w, λ) ↦ L_λ(w) is jointly continuous, we have, for every w ∈ ℝ^d,

    L_{λ̄}(w̄) = lim_n L_{λ_{k_n}}(w_{λ_{k_n}}) ≤ lim_n L_{λ_{k_n}}(w) = L_{λ̄}(w).

Therefore, w̄ is a minimizer of L_{λ̄}, and hence w̄ = w_{λ̄}. This proves that (w_{λ_n})_{n∈ℕ} is a bounded sequence having a unique cluster point. Hence (w_{λ_n})_{n∈ℕ} converges to its unique cluster point, which is w_{λ̄}. Finally, since (w_{λ_n}, λ_n) → (w_{λ̄}, λ̄) and E is jointly continuous, we have E(w_{λ_n}, λ_n) → E(w_{λ̄}, λ̄), and the statement follows.

We recall a fundamental fact concerning the stability of minima and minimizers in optimization problems (Dontchev and Zolezzi, 1993). We provide the proof for completeness.

Theorem A.1 (Convergence). Let ϕ_T and ϕ be lower semicontinuous functions defined on a compact set Λ. Suppose that ϕ_T converges uniformly to ϕ on Λ as T → +∞. Then

(a) inf ϕ_T → inf ϕ;

(b) argmin ϕ_T → argmin ϕ, meaning that, for every sequence (λ_T)_{T∈ℕ} such that λ_T ∈ argmin ϕ_T, we have that:
    - (λ_T)_{T∈ℕ} admits a convergent subsequence;
    - for every subsequence (λ_{K_T})_{T∈ℕ} such that λ_{K_T} → λ̄, we have λ̄ ∈ argmin ϕ.

Proof. Let (λ_T)_{T∈ℕ} be a sequence in Λ such that, for every T ∈ ℕ, λ_T ∈ argmin ϕ_T. We prove that

1) (λ_T)_{T∈ℕ} admits a convergent subsequence;

2) for every subsequence (λ_{K_T})_{T∈ℕ} such that λ_{K_T} → λ̄, we have λ̄ ∈ argmin ϕ and ϕ_{K_T}(λ_{K_T}) → inf ϕ;

3) inf ϕ_T → inf ϕ.

The first point follows from the fact that Λ is compact. Concerning the second point, let (λ_{K_T})_{T∈ℕ} be a subsequence such that λ_{K_T} → λ̄. Since ϕ_{K_T} converges uniformly to ϕ, we have

    |ϕ_{K_T}(λ_{K_T}) − ϕ(λ_{K_T})| ≤ sup_{λ∈Λ} |ϕ_{K_T}(λ) − ϕ(λ)| → 0.

Therefore, using also the continuity of ϕ, we have, for every λ ∈ Λ,

    ϕ(λ̄) = lim_T ϕ(λ_{K_T}) = lim_T ϕ_{K_T}(λ_{K_T}) ≤ lim_T ϕ_{K_T}(λ) = ϕ(λ).

So λ̄ ∈ argmin ϕ and ϕ(λ̄) = lim_T ϕ_{K_T}(λ_{K_T}) ≤ inf ϕ = ϕ(λ̄), that is, lim_T ϕ_{K_T}(λ_{K_T}) = inf ϕ.

Finally, as regards the last point, we proceed by contradiction. If (ϕ_T(λ_T))_{T∈ℕ} does not converge to inf ϕ, then there exist an ε > 0 and a subsequence (ϕ_{K_T}(λ_{K_T}))_{T∈ℕ} such that

    |ϕ_{K_T}(λ_{K_T}) − inf ϕ| ≥ ε   for every T ∈ ℕ.   (11)

Now, let (λ_{K^{(1)}_T}) be a convergent subsequence of (λ_{K_T})_{T∈ℕ} and suppose that λ_{K^{(1)}_T} → λ̄. Clearly, (λ_{K^{(1)}_T}) is also a subsequence of (λ_T)_{T∈ℕ}. Then it follows from point 2) above that ϕ_{K^{(1)}_T}(λ_{K^{(1)}_T}) → inf ϕ. This latter finding, together with equation (11), gives a contradiction.

Proof of Theorem 3.2. Since E(·, λ) is uniformly Lipschitz continuous, there exists ν > 0 such that, for every T ∈ ℕ and every λ ∈ Λ,

    |f_T(λ) − f(λ)| = |E(w_{T,λ}, λ) − E(w_λ, λ)| ≤ ν ||w_{T,λ} − w_λ||.

It follows from assumption (vi) that f_T(λ) converges to f(λ) uniformly on Λ as T → +∞. Then the statement follows from Theorem A.1.

B. Cross-validation and Bilevel Programming

We note that the (approximate) bilevel programming framework easily accommodates also estimates of the generalization error produced by a cross-validation procedure. We describe here the case of K-fold cross-validation, which also includes leave-one-out cross-validation.

Let D = {(x_i, y_i)}_{i=1}^n be the set of available data. K-fold cross-validation, with K ∈ {1, 2, ..., N}, consists in partitioning D into K subsets {D^j}_{j=1}^K and fitting as many models g_{w^j} on training data D^j_tr = ∪_{i≠j} D^i. The models are then evaluated on D^j_val = D^j. Denoting by w = (w^j)_{j=1}^K the vector of stacked weights, the K-fold cross-validation error is given by

    E(w, λ) = (1/K) Σ_{j=1}^{K} E^j(w^j, λ),

where E^j(w^j, λ) = Σ_{(x,y)∈D^j} ℓ(g_{w^j}(x), y). E can be treated as the outer objective in the bilevel framework, while the inner objective L_λ may be given by the sum of regularized empirical errors over each D^j_tr for the K models.

Under this perspective, a K-fold cross-validation procedure closely resembles the bilevel problem for ML formulated in Sec. 2.2, where, in this case, the meta-distribution collapses onto the data (ground) distribution and the episodes are sampled from the same dataset of points.

By following the procedure outlined in Sec. 2.3, we can approximate the minimization of L_λ with T steps of an optimization dynamics and compute the hypergradient of f_T(λ) = (1/K) Σ_j E^j(w^j_T, λ) by training the K models and proceeding with either forward or reverse differentiation. The models may be fitted sequentially, in parallel or stochastically. Specifically, in this last case, one can repeatedly sample one fold at a time (or possibly a mini-batch of folds) and compute a stochastic hypergradient that can be used in an SGD procedure in order to minimize f_T. At last, we note that similar ideas for the leave-one-out cross-validation error are developed in (Beirami et al., 2017), where the hypergradient of an approximation of the outer objective is computed by means of the implicit function theorem.
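As a rough illustration of the stochastic variant described above, the sketch below (our own code, not part of the paper's implementation; the data, the quadratic model, the step sizes, and the use of Adam instead of plain SGD for the outer updates are all assumptions) samples one fold per outer iteration, unrolls T inner gradient descent steps on the corresponding training split, and backpropagates the held-out fold error to the regularization hyperparameters λ.

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(60, 8), torch.randn(60)      # toy dataset D
K = 5
folds = torch.arange(60).chunk(K)               # indices of D^1, ..., D^K
lam = torch.zeros(8, requires_grad=True)        # per-feature log regularization coefficients
outer_opt = torch.optim.Adam([lam], lr=0.01)

def fit(train_idx, T=50, eta=0.005):
    # T gradient descent steps on the regularized inner objective; create_graph=True
    # keeps the dependence of w_T on lambda so the hypergradient can flow back.
    w = torch.zeros(8, requires_grad=True)
    Xtr, ytr = X[train_idx], y[train_idx]
    for _ in range(T):
        inner = ((Xtr @ w - ytr) ** 2).sum() + (torch.exp(lam) * w ** 2).sum()
        g, = torch.autograd.grad(inner, w, create_graph=True)
        w = w - eta * g
    return w

for step in range(100):
    j = int(torch.randint(K, (1,)))                          # sample one fold at a time
    train_idx = torch.cat([f for i, f in enumerate(folds) if i != j])
    w_T = fit(train_idx)
    E_j = ((X[folds[j]] @ w_T - y[folds[j]]) ** 2).sum()     # error on the held-out fold
    outer_opt.zero_grad()
    E_j.backward()                                           # stochastic hypergradient w.r.t. lambda
    outer_opt.step()
```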

C. The Effect of T: Ridge Regression

In Sec. 5.1 we showed that increasing the number of iterations T leads to a better optimization of the outer objective f through the approximations f_T, which converge uniformly to f by Proposition 3.2. This, however, does not necessarily result in better test scores, due to possible overfitting of the training and validation errors, as we noted for the linear hyper-representation multiclass classification experiment in Sec. 5.1. The optimal choice of T appears to be problem dependent: if in the aforementioned experiment a quite small T led to the best test accuracy (after hyperparameter optimization), in this section we present a small-scale linear regression experiment that benefits from an increasing number of inner iterations.

We generated 90 noisy synthetic data points with 30 features, of which only 5 were informative, and divided them equally to form training, validation and test sets. As outer objective we set the mean squared error E(w) = ||X_val w − y_val||², where X_val and y_val are the validation design matrix and targets, respectively, and w ∈ ℝ³⁰ is the weight vector of the model. We set as inner objective

    L_λ(w) = ||X_tr w − y_tr||² + Σ_{i=1}^{30} e^{λ_i} w_i²,

and optimized the vector of L2 regularization coefficients λ (equivalent to a diagonal Tikhonov regularization matrix). The results reported in Table 5 show that, in this scenario, overfitting is not an issue and 250 inner iterations yield the best test result. Note that, as in Sec. 5.1, also this problem admits an analytical solution, from which it is possible to compute the hypergradient analytically and perform exact hyperparameter optimization, as reported in the last row of Table 5.

Table 5. Validation and test mean absolute percentage error (MAPE) for various values of T.

T        Validation MAPE    Test MAPE
10       11.35              43.49
50       1.28               5.22
100      0.55               1.26
250      0.47               0.50
Exact    0.37               0.57
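For the "Exact" row, the hypergradient can be obtained by differentiating through the closed-form ridge solution w_λ = (X_trᵀ X_tr + diag(e^λ))⁻¹ X_trᵀ y_tr rather than through an unrolled inner dynamics. The following is a minimal sketch of this exact variant (our own code on freshly generated synthetic data, so its numbers will not match Table 5; the noise level, optimizer and iteration counts are assumptions).

```python
import torch

torch.manual_seed(0)
n, d, informative = 90, 30, 5
w_true = torch.zeros(d)
w_true[:informative] = torch.randn(informative)        # only 5 informative features
X = torch.randn(n, d)
y = X @ w_true + 0.5 * torch.randn(n)                  # noisy targets
Xtr, ytr = X[:30], y[:30]                              # equal train/validation/test split
Xval, yval = X[30:60], y[30:60]
Xte, yte = X[60:], y[60:]

lam = torch.zeros(d, requires_grad=True)               # log of per-coordinate ridge coefficients
opt = torch.optim.Adam([lam], lr=0.05)

for step in range(300):
    # Closed-form minimizer of ||Xtr w - ytr||^2 + sum_i exp(lam_i) w_i^2
    A = Xtr.T @ Xtr + torch.diag(torch.exp(lam))
    w_lam = torch.linalg.solve(A, Xtr.T @ ytr)
    E = ((Xval @ w_lam - yval) ** 2).sum()             # outer (validation) objective
    opt.zero_grad()
    E.backward()                                       # exact hypergradient via the linear solve
    opt.step()

with torch.no_grad():
    A = Xtr.T @ Xtr + torch.diag(torch.exp(lam))
    w_lam = torch.linalg.solve(A, Xtr.T @ ytr)
    # MAPE on this made-up data can be inflated by near-zero targets; it is only illustrative.
    test_mape = ((Xte @ w_lam - yte).abs() / yte.abs()).mean() * 100
    print(f"test MAPE: {test_mape:.1f}%")
```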

D. Further Details on Few-shot Experiments

This appendix contains implementation details of the representation mapping h and the meta-learning hyperparameters used in the few-shot learning experiments of Sec. 5.2.

To optimize the representation mapping, in all the experiments we use Adam with learning rate set to 10⁻³ and a decay rate of 10⁻⁵. We used the initialization strategy proposed in (Glorot and Bengio, 2010) for all the weights λ of the representation.

For Omniglot experiments, we used a meta-batch size of 32 episodes for five-way and of 16 episodes for 20-way classification. To train the episode-specific classifiers we set the learning rate to 0.1.

For one set of experiments with Mini-ImageNet, we used a hyper-representation (C4L) consisting of 4 convolutional layers, where each layer is composed of a convolution with 32 filters, a batch normalization, followed by a ReLU activation and a 2×2 max-pooling. The classifiers were trained using mini-batches of 4 episodes for one-shot and 2 episodes for five-shot, with learning rate set to 0.01.

The other set of experiments with Mini-ImageNet employed a residual network (RN) as the representation mapping, built of 4 residual blocks (with 64, 96, 128, 256 filters), followed by the block: 1×1 convolution (2048 filters), average pooling, 1×1 convolution (512 filters). Each residual block repeats the following block 3 times: 1×1 convolution, batch normalization, leaky ReLU (leak 0.1), before the residual connection. In this case the classifiers were optimized using mini-batches of 2 episodes for both one- and five-shot, with learning rate set to 0.04.
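For reference, a C4L-style representation mapping of the kind described above can be written in a few lines of PyTorch. This is a hedged sketch rather than the authors' implementation: the kernel size, padding and input channels are assumptions, and only the 32-filter / batch-norm / ReLU / 2×2 max-pooling structure is taken from the text.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch=32):
    # conv -> batch norm -> ReLU -> 2x2 max-pool, as in the C4L description
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # kernel size assumed
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class C4L(nn.Module):
    """Hyper-representation h_lambda: four convolutional blocks with 32 filters each."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels), conv_block(32), conv_block(32), conv_block(32)
        )

    def forward(self, x):
        return self.features(x).flatten(1)   # flatten to one feature vector per example

# The weights of this module play the role of the hyperparameters lambda; the Adam
# learning rate of 1e-3 mentioned above could be set as, e.g.:
#   opt = torch.optim.Adam(C4L().parameters(), lr=1e-3)
```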
