Hyperparameter optimization with approximate gradient


Page 1: Hyperparameter optimization with approximate gradient

HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT

Fabian Pedregosa

Chaire Havas-Dauphine, Paris-Dauphine / École Normale Supérieure

Page 2: Hyperparameter optimization with approximate gradient

HYPERPARAMETERS

Most machine learning models depend on at least one hyperparameter to control model complexity. Examples include:

Amount of regularization. Kernel parameters. Architecture of a neural network.

Model parameters: estimated using some (regularized) goodness of fit on the data.

Hyperparameters: cannot be estimated using the same criterion as the model parameters (overfitting).

Page 3: Hyperparameter optimization with approximate gradient

HYPERPARAMETER SELECTION

Criteria for hyperparameter selection:

Optimize the loss on unseen data: cross-validation. Minimize a risk estimator: SURE, AIC/BIC, etc.

Example: least squares with ℓ2 regularization,

$$\text{loss}(\lambda) = \sum_{i=1}^{n} \bigl(b_i - \langle a_i, X(\lambda)\rangle\bigr)^2 .$$

Costly evaluation function, non-convex. Common methods: grid search, random search, SMBO.
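As a concrete illustration of why this criterion is costly, here is a minimal numpy sketch of the grid-search baseline for the ℓ2 least-squares example. The synthetic data, the grid of candidate values, and the closed-form inner solver are assumptions made for this sketch, not material from the slides.

```python
import numpy as np

# Minimal sketch: grid search for the l2 regularization of least squares.
# Each evaluation of loss(lambda) requires solving the inner problem to
# completion, which is what makes the criterion costly to evaluate.
rng = np.random.RandomState(0)
n, p = 100, 20
A_train, A_test = rng.randn(n, p), rng.randn(n, p)
x_true = rng.randn(p)
b_train = A_train @ x_true + 0.5 * rng.randn(n)
b_test = A_test @ x_true + 0.5 * rng.randn(n)

def inner_solution(lam):
    """X(lambda): closed-form ridge solution on the training data."""
    return np.linalg.solve(A_train.T @ A_train + lam * np.eye(p),
                           A_train.T @ b_train)

def validation_loss(lam):
    """loss(lambda) = sum_i (b_i - <a_i, X(lambda)>)^2 on held-out data."""
    x = inner_solution(lam)
    return np.sum((b_test - A_test @ x) ** 2)

lambdas = np.logspace(-3, 3, 50)          # grid of candidate values
losses = [validation_loss(lam) for lam in lambdas]
best = lambdas[int(np.argmin(losses))]
print(f"best lambda on the grid: {best:.4g}")
```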

Page 4: Hyperparameter optimization with approximate gradient

GRADIENT-BASED HYPERPARAMETER OPTIMIZATION

Compute gradients with respect to hyperparameters [Larsen 1996, 1998, Bengio 2000]. Hyperparameter optimization as nested or bi-level optimization:

$$\arg\min_{\lambda \in \Lambda} \; \underbrace{f(\lambda) \triangleq g(X(\lambda), \lambda)}_{\text{loss on test set}}
\quad \text{s.t.} \quad \underbrace{X(\lambda)}_{\text{model parameters}} \in \arg\min_{x \in \mathbb{R}^p} \underbrace{h(x, \lambda)}_{\text{loss on train set}}$$
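For instance, the ℓ2-regularized least-squares example from the previous slide is one instance of this template (a worked instantiation added here for clarity; the explicit train/test split in the sums is an assumption about notation):

$$
g(x, \lambda) = \sum_{i \in \text{test}} \bigl(b_i - \langle a_i, x\rangle\bigr)^2, \qquad
h(x, \lambda) = \sum_{i \in \text{train}} \bigl(b_i - \langle a_i, x\rangle\bigr)^2 + \lambda \lVert x \rVert^2 ,
$$

$$
X(\lambda) = \arg\min_{x \in \mathbb{R}^p} h(x, \lambda), \qquad
f(\lambda) = g(X(\lambda), \lambda) .
$$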

Page 5: Hyperparameter optimization with approximate gradient

GOAL: COMPUTE ∇f(λ)

By the chain rule,

$$\nabla f(\lambda) = \frac{\partial g}{\partial \lambda} + \underbrace{\frac{\partial g}{\partial X}}_{\text{known}} \cdot \underbrace{\frac{\partial X}{\partial \lambda}}_{\text{unknown}}$$

Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin et al. 2015].

Implicit differentiation [Larsen 1996, Bengio 2000]: formulate the inner optimization as an implicit equation,

$$X(\lambda) = \arg\min_{x} h(x, \lambda) \;\Longleftrightarrow\; \nabla_1 h(X(\lambda), \lambda) = 0 \quad \text{(implicit equation for } X(\lambda)\text{)} .$$
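Differentiating this implicit equation with respect to λ supplies the unknown Jacobian (a standard implicit-function-theorem step, spelled out here to bridge to the next slide):

$$
0 = \frac{\mathrm{d}}{\mathrm{d}\lambda}\, \nabla_1 h(X(\lambda), \lambda)
  = \nabla^2_1 h \cdot \frac{\partial X}{\partial \lambda} + \nabla^2_{1,2} h
\quad\Longrightarrow\quad
\frac{\partial X}{\partial \lambda} = -\bigl(\nabla^2_1 h\bigr)^{-1} \nabla^2_{1,2} h .
$$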

Page 6: Hyperparameter optimization with approximate gradient

GRADIENT-BASED HYPERPARAMETER OPTIMIZATION

$$\nabla f(\lambda) = \nabla_2 g - \bigl(\nabla^2_{1,2} h\bigr)^{T} \bigl(\nabla^2_1 h\bigr)^{-1} \nabla_1 g$$

It is possible to compute the gradient with respect to the hyperparameters, given:

the solution X(λ) to the inner optimization,
the solution to the linear system $\bigl(\nabla^2_1 h\bigr)\, q = \nabla_1 g$

⟹ computationally expensive.
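To make the formula concrete, here is a small numpy sketch (an illustration, not the authors' code) that computes this exact hypergradient for a ridge-regression instance, where the inner solution, the Hessian and the cross-derivatives are available in closed form. The synthetic data and all variable names are assumptions made for the sketch.

```python
import numpy as np

# Sketch: exact hypergradient of f(lambda) = ||A_te X(lambda) - b_te||^2
# for ridge regression, via the implicit-differentiation formula above.
rng = np.random.RandomState(0)
n, p = 100, 20
A_tr, A_te = rng.randn(n, p), rng.randn(n, p)
x_true = rng.randn(p)
b_tr = A_tr @ x_true + 0.5 * rng.randn(n)
b_te = A_te @ x_true + 0.5 * rng.randn(n)

def hypergradient(lam):
    # Inner solution X(lambda): argmin_x ||A_tr x - b_tr||^2 + lam ||x||^2
    H = 2 * (A_tr.T @ A_tr + lam * np.eye(p))      # Hessian of h in x
    x = np.linalg.solve(H / 2, A_tr.T @ b_tr)      # closed-form solution
    grad1_g = 2 * A_te.T @ (A_te @ x - b_te)       # gradient of g in x
    cross = 2 * x                                  # d/dlambda of grad_x h
    q = np.linalg.solve(H, grad1_g)                # solve the linear system
    return -cross @ q                              # grad_2 g = 0 here

print(hypergradient(1.0))
```

Note that both an exact inner solve and an exact linear-system solve are required here, which is precisely the expense that HOAG relaxes on the next slides.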

Page 7: Hyperparameter optimization with approximate gradient

HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT

Replace X(λ) by an approximate solution of the inner optimization. Approximately solve the linear system. Update λ using the approximate gradient p_k.

Tradeoff:

Loose approximation: cheap iterations, might diverge.
Precise approximation: costly iterations, convergence to a stationary point.

Page 8: Hyperparameter optimization with approximate gradient

HOAG: at iteration k = 1, 2, … perform the following:

i) Solve the inner optimization problem up to tolerance ε_k, i.e. find x_k ∈ ℝ^p such that
$$\lVert X(\lambda_k) - x_k \rVert \le \varepsilon_k .$$

ii) Solve the linear system up to tolerance ε_k. That is, find q_k such that
$$\bigl\lVert \nabla^2_1 h(x_k, \lambda_k)\, q_k - \nabla_1 g(x_k, \lambda_k) \bigr\rVert \le \varepsilon_k .$$

iii) Compute the approximate gradient as
$$p_k = \nabla_2 g(x_k, \lambda_k) - \nabla^2_{1,2} h(x_k, \lambda_k)^{T} q_k .$$

iv) Update the hyperparameters:
$$\lambda_{k+1} = P_{\Lambda}\bigl(\lambda_k - \tfrac{1}{L}\, p_k\bigr) .$$
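Below is a compact numpy sketch of this loop for the ridge example used earlier. It is an illustrative reimplementation under assumptions (inexact inner solves via plain gradient steps, the linear system solved approximately with scipy's conjugate gradient, a hand-picked step size 1/L, and a simple non-negativity projection), not the authors' released code.

```python
import numpy as np
from scipy.sparse.linalg import cg

# Illustrative HOAG loop for the ridge example with one hyperparameter.
rng = np.random.RandomState(0)
n, p = 100, 20
A_tr, A_te = rng.randn(n, p), rng.randn(n, p)
x_true = rng.randn(p)
b_tr = A_tr @ x_true + 0.5 * rng.randn(n)
b_te = A_te @ x_true + 0.5 * rng.randn(n)

lam = 1.0                  # initial hyperparameter
inv_L = 0.01               # assumed step size 1/L (hand-picked)
x = np.zeros(p)            # warm-started inner variable

for k in range(1, 51):
    eps_k = 0.1 / k ** 2                         # quadratic tolerance decrease
    # (i) inexact inner solve: gradient steps on h(., lam); the gradient
    #     norm is used as a practical proxy for ||x_k - X(lambda_k)||.
    H = 2 * (A_tr.T @ A_tr + lam * np.eye(p))    # Hessian of h in x
    step_x = 1.0 / np.linalg.norm(H, 2)
    grad_h = 2 * (A_tr.T @ (A_tr @ x - b_tr) + lam * x)
    while np.linalg.norm(grad_h) > eps_k:
        x -= step_x * grad_h
        grad_h = 2 * (A_tr.T @ (A_tr @ x - b_tr) + lam * x)
    # (ii) approximately solve the linear system  H q = grad_1 g
    grad1_g = 2 * A_te.T @ (A_te @ x - b_te)
    q, _ = cg(H, grad1_g, atol=eps_k)
    # (iii) approximate hypergradient (grad_2 g = 0, d/dlambda grad_1 h = 2x)
    p_k = -(2 * x) @ q
    # (iv) projected update, keeping the regularization non-negative
    lam = max(lam - inv_L * p_k, 1e-6)

print("final lambda:", lam)
```

In this sketch the inner variable x is warm-started across outer iterations, which is what allows the hyperparameter to be updated before the model parameters have fully converged.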

Page 9: Hyperparameter optimization with approximate gradient

ANALYSIS - GLOBAL CONVERGENCE

Assumptions:

(A1) $\nabla g$ and $\nabla^2 h$ Lipschitz continuous.
(A2) $\nabla^2_1 h(X(\lambda), \lambda)$ non-singular.
(A3) The domain Λ is bounded.

Corollary: if $\sum_{i=1}^{\infty} \varepsilon_i < \infty$, then λ_k converges to a stationary point $\bar{\lambda}$:

$$\bigl\langle \nabla f(\bar{\lambda}),\, \lambda - \bar{\lambda} \bigr\rangle \ge 0 \quad \forall \lambda \in \Lambda .$$

If $\bar{\lambda}$ is in the interior of Λ, then $\nabla f(\bar{\lambda}) = 0$.

Page 10: Hyperparameter optimization with approximate gradient

EXPERIMENTS

How to choose the tolerance ε_k?

Different strategies for the tolerance decrease. Quadratic: $\varepsilon_k = 0.1/k^2$; Cubic: $\varepsilon_k = 0.1/k^3$; Exponential: $\varepsilon_k = 0.1 \times 0.9^k$.

Approximate-gradient strategies achieve a much faster decrease in early iterations.
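For reference, the three decrease schedules quoted above can be generated as follows (a trivial sketch; the iteration count is arbitrary):

```python
import numpy as np

# The three tolerance-decrease strategies compared on the slide.
k = np.arange(1, 26)                    # iteration counter (arbitrary length)
quadratic   = 0.1 / k ** 2
cubic       = 0.1 / k ** 3
exponential = 0.1 * 0.9 ** k
# All three sequences are summable, which is the condition required by the
# convergence corollary (sum of eps_k finite).
```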

Page 11: Hyperparameter optimization with approximate gradient

EXPERIMENTS I

Model: ℓ2-regularized logistic regression, 1 hyperparameter.

Datasets: 20news (18k × 130k), real-sim (73k × 20k).

Page 12: Hyperparameter optimization with approximate gradient

EXPERIMENTS II

Kernel ridge regression, 2 hyperparameters. Parkinson dataset: 654 × 17.

Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]: 784 × 10 hyperparameters. MNIST dataset: 60k × 784.

Page 13: Hyperparameter optimization with approximate gradient

CONCLUSION

Hyperparameter optimization with inexact gradient:

Can update hyperparameters before the model parameters have fully converged. Independent of the inner optimization algorithm. Convergence guarantees under smoothness assumptions.

Open questions.

Non-smooth inner optimization (e.g. sparse models)? Stochastic / online approximation?

Page 14: Hyperparameter optimization with approximate gradient

REFERENCES

[Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of hyperparameters." Neural Computation 12.8 (2000): 1889-1900.

[J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.

[J. Snoek et al., 2015] Snoek, J. et al. "Scalable Bayesian Optimization Using Deep Neural Networks." (2015). http://arxiv.org/abs/1502.05700

[K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. "Freeze-Thaw Bayesian Optimization." arXiv preprint arXiv:1406.3896, 1-12 (2014). http://arxiv.org/abs/1406.3896

[F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. "An evaluation of sequential model-based optimization for expensive blackbox functions." (2013).

Page 15: Hyperparameter optimization with approximate gradient

REFERENCES 2

[M. Schmidt et al., 2013] Schmidt, M., Le Roux, N. & Bach, F. "Minimizing finite sums with the stochastic average gradient." arXiv preprint arXiv:1309.2388, 1-45 (2013). http://arxiv.org/abs/1309.2388

[J. Domke et al., 2012] Domke, J. "Generic Methods for Optimization-Based Modeling." Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318-326 (2012).

[M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. "Hybrid Deterministic-Stochastic Methods for Data Fitting." SIAM J. Sci. Comput. 34, A1380-A1405 (2012).

Page 16: Hyperparameter optimization with approximate gradient

EXPERIMENTS - COST FUNCTION

Page 17: Hyperparameter optimization with approximate gradient

EXPERIMENTS

Comparison with other hyperparameter optimization methods.

Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.

Page 18: Hyperparameter optimization with approximate gradient

EXPERIMENTS

Comparison in terms of validation loss.

Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.