Hyperparameter optimization with approximate gradient
HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
Fabian Pedregosa
Chaire Havas-Dauphine, Paris-Dauphine / École Normale Supérieure
HYPERPARAMETERS
Most machine learning models depend on at least one hyperparameter to control model complexity. Examples include:
Amount of regularization.
Kernel parameters.
Architecture of a neural network.
Model parameters: estimated using some (regularized) goodness of fit on the data.
Hyperparameters: cannot be estimated using the same criteria as model parameters (overfitting).
HYPERPARAMETER SELECTION
Criteria for hyperparameter selection:
Optimize loss on unseen data: cross-validation.
Minimize a risk estimator: SURE, AIC/BIC, etc.
Example: least squares with ℓ2 regularization,
$$f(\lambda) = \sum_{i=1}^{n} \big(b_i - \langle a_i, X(\lambda)\rangle\big)^2 .$$
Costly evaluation function, non-convex.
Common methods: grid search, random search, SMBO.
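As a concrete sketch of this example (not from the slides; the data and names are made up), the validation loss for ℓ2-regularized least squares can be evaluated on a grid of λ values, using the closed-form inner solution:

```python
import numpy as np

def ridge_solution(A_train, b_train, lam):
    """Inner solution X(lam) = argmin_x ||A x - b||^2 + lam ||x||^2 (closed form)."""
    n_features = A_train.shape[1]
    return np.linalg.solve(A_train.T @ A_train + lam * np.eye(n_features),
                           A_train.T @ b_train)

def validation_loss(A_val, b_val, x):
    """f(lam) = sum_i (b_i - <a_i, X(lam)>)^2 on held-out data."""
    return float(np.sum((b_val - A_val @ x) ** 2))

# Synthetic data standing in for a train/validation split.
rng = np.random.default_rng(0)
A_train, b_train = rng.standard_normal((50, 10)), rng.standard_normal(50)
A_val, b_val = rng.standard_normal((20, 10)), rng.standard_normal(20)

# Grid search: one full inner problem per candidate value of lam.
losses = {lam: validation_loss(A_val, b_val, ridge_solution(A_train, b_train, lam))
          for lam in (0.01, 0.1, 1.0, 10.0)}
```

Each grid point requires solving the inner problem from scratch, which is what makes f(λ) costly to evaluate and motivates the gradient-based alternatives below.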
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
Compute gradients with respect to hyperparameters [Larsen 1996, 1998; Bengio 2000].
Hyperparameter optimization as nested or bi-level optimization:
$$\arg\min_{\lambda}\; \underbrace{f(\lambda) \triangleq g(X(\lambda), \lambda)}_{\text{loss on test set}} \quad \text{s.t.} \quad \underbrace{X(\lambda)}_{\text{model parameters}} \in \arg\min_{x \in \mathbb{R}^p} \underbrace{h(x, \lambda)}_{\text{loss on train set}}$$
GOAL: COMPUTE ∇f(λ)
By the chain rule,
$$\nabla f = \frac{\partial g}{\partial \lambda} + \underbrace{\left(\frac{\partial X}{\partial \lambda}\right)^{\!T}}_{\text{unknown}} \underbrace{\frac{\partial g}{\partial X}}_{\text{known}}$$
Two main approaches: implicit differentiation and iterative differentiation [Domke 2012, Maclaurin 2015].
Implicit differentiation [Larsen 1996, Bengio 2000]: formulate the inner optimization as an implicit equation,
$$X(\lambda) \in \arg\min_x h(x, \lambda) \;\Longleftrightarrow\; \nabla_1 h(X(\lambda), \lambda) = 0 \quad \text{(implicit equation for } X\text{)}$$
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
$$\nabla f = \nabla_2 g - \nabla_{1,2}^2 h^{T} \, (\nabla_1^2 h)^{-1} \nabla_1 g$$
Possible to compute the gradient w.r.t. hyperparameters, given:
the solution $X(\lambda)$ of the inner optimization,
the solution of the linear system $(\nabla_1^2 h)^{-1} \nabla_1 g$
⟹ computationally expensive.
HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
Replace X(λ) by an approximate solution of the inner optimization.
Approximately solve the linear system.
Update λ using p_k ≈ ∇f.
Tradeoff:
Loose approximation: cheap iterations, might diverge.
Precise approximation: costly iterations, convergence to a stationary point.
HOAG. At iteration k = 1, 2, … perform the following:
i) Solve the inner optimization problem up to tolerance $\varepsilon_k$, i.e. find $x_k \in \mathbb{R}^p$ such that
$$\|X(\lambda_k) - x_k\| \le \varepsilon_k \,.$$
ii) Solve the linear system up to tolerance $\varepsilon_k$. That is, find $q_k$ such that
$$\|\nabla_1^2 h(x_k, \lambda_k)\, q_k + \nabla_1 g(x_k, \lambda_k)\| \le \varepsilon_k \,.$$
iii) Compute the approximate gradient as
$$p_k = \nabla_2 g(x_k, \lambda_k) + \nabla_{1,2}^2 h(x_k, \lambda_k)^{T} q_k \,.$$
iv) Update the hyperparameters:
$$\lambda_{k+1} = P_\Lambda\big(\lambda_k - \tfrac{1}{L}\, p_k\big) \,.$$
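A minimal sketch of the four steps for the ridge example (gradient descent as the inner solver, a direct solve for the linear system, projection onto λ ≥ 10⁻⁶ as the domain, and a fixed step size; all of these choices are illustrative assumptions, not the method's prescribed components):

```python
import numpy as np

rng = np.random.default_rng(0)
A_tr, b_tr = rng.standard_normal((50, 10)), rng.standard_normal(50)
A_val, b_val = rng.standard_normal((20, 10)), rng.standard_normal(20)

def hoag_step(lam, x, tol, step=0.1):
    """One HOAG-style iteration for the ridge example."""
    n = A_tr.shape[1]
    H = 2 * (A_tr.T @ A_tr + lam * np.eye(n))     # grad_1^2 h
    # i) inner problem solved inexactly: gradient descent until the gradient
    #    norm drops below tol (a proxy for the distance tolerance eps_k).
    lr = 1.0 / np.linalg.norm(H, 2)
    grad = H @ x - 2 * A_tr.T @ b_tr
    while np.linalg.norm(grad) > tol:
        x = x - lr * grad
        grad = H @ x - 2 * A_tr.T @ b_tr
    # ii) linear system (solved directly here; in practice also inexactly).
    q = np.linalg.solve(H, 2 * A_val.T @ (A_val @ x - b_val))
    # iii) approximate hypergradient p_k (grad_2 g = 0, grad_{1,2} h = 2 x).
    p = -(2 * x) @ q
    # iv) projected update of the hyperparameter.
    lam = max(1e-6, lam - step * p)
    return lam, x

# Warm-started iterations with a quadratically decreasing tolerance.
lam, x = 1.0, np.zeros(10)
for k in range(1, 21):
    lam, x = hoag_step(lam, x, tol=0.1 / k ** 2)
```

Because x is warm-started across iterations, the early iterations with a loose tolerance are cheap, which is the point of the method.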
ANALYSIS - GLOBAL CONVERGENCE
Assumptions:
(A1) $\nabla g$ and $\nabla^2 h$ Lipschitz.
(A2) $\nabla_1^2 h(X(\lambda), \lambda)$ non-singular.
(A3) Domain $\Lambda$ is bounded.
Corollary: If $\sum_{i=1}^{\infty} \varepsilon_i < \infty$, then $\lambda_k$ converges to a stationary point $\lambda_0$:
$$\langle \nabla f(\lambda_0),\, \alpha - \lambda_0 \rangle \ge 0 \,, \quad \forall \alpha \in \Lambda \,.$$
If $\lambda_0$ is in the interior of $\Lambda$, then $\nabla f(\lambda_0) = 0$.
EXPERIMENTS
How to choose the tolerance $\varepsilon_k$?
Different strategies for the tolerance decrease. Quadratic: $\varepsilon_k = 0.1/k^2$, Cubic: $\varepsilon_k = 0.1/k^3$, Exponential: $\varepsilon_k = 0.1 \times 0.9^k$.
Approximate-gradient strategies achieve a much faster decrease in early iterations.
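The three schedules can be written down directly; note that all three are summable, so each satisfies the $\sum_i \varepsilon_i < \infty$ condition of the convergence corollary:

```python
# Tolerance-decrease schedules from the experiments (k = 1, 2, ...).
def quadratic(k):
    return 0.1 / k ** 2

def cubic(k):
    return 0.1 / k ** 3

def exponential(k):
    return 0.1 * 0.9 ** k

# All three sequences are summable, as the convergence corollary requires.
partial_sums = [sum(quadratic(k) for k in range(1, 1000)),
                sum(cubic(k) for k in range(1, 1000)),
                sum(exponential(k) for k in range(1, 1000))]
```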
EXPERIMENTS I
Model: ℓ2-regularized logistic regression. 1 hyperparameter.
Datasets: 20news (18k × 130k), real-sim (73k × 20k).
EXPERIMENTS II
Kernel ridge regression: 2 hyperparameters. Parkinson dataset: 654 × 17.
Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]: 784 × 10 hyperparameters. MNIST dataset: 60k × 784.
CONCLUSION
Hyperparameter optimization with inexact gradient:
can update hyperparameters before model parameters have fully converged.
independent of the inner optimization algorithm.
convergence guarantees under smoothness assumptions.
Open questions.
Non-smooth inner optimization (e.g. sparse models)?
Stochastic / online approximation?
REFERENCES
[Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of hyperparameters." Neural Computation 12.8 (2000): 1889-1900.
[J. Bergstra, Y. Bengio, 2012] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.
[J. Snoek et al., 2015] Snoek, J. et al. "Scalable Bayesian Optimization Using Deep Neural Networks." (2015). http://arxiv.org/abs/1502.05700
[K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. "Freeze-Thaw Bayesian Optimization." arXiv preprint arXiv:1406.3896, 1-12 (2014). http://arxiv.org/abs/1406.3896
[F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. "An evaluation of sequential model-based optimization for expensive blackbox functions." (2013).
REFERENCES 2
[M. Schmidt et al., 2013] Schmidt, M., Roux, N. & Bach, F. "Minimizing finite sums with the stochastic average gradient." arXiv preprint arXiv:1309.2388, 1-45 (2013). http://arxiv.org/abs/1309.2388
[J. Domke et al., 2012] Domke, J. "Generic Methods for Optimization-Based Modeling." Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318-326 (2012).
[M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. "Hybrid Deterministic-Stochastic Methods for Data Fitting." SIAM J. Sci. Comput. 34, A1380-A1405 (2012).
EXPERIMENTS - COST FUNCTION
EXPERIMENTS
Comparison with other hyperparameter optimization methods.
Random = random search, SMBO = sequential model-based optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
EXPERIMENTS
Comparison in terms of validation loss.
Random = random search, SMBO = sequential model-based optimization (Gaussian process), Iterdiff = reverse-mode differentiation.