
POLITECNICO DI MILANO

School of Industrial and Information Engineering

Master of Science in Mathematical Engineering

A discrete Bandit approach to continuous Hyperparameter Optimization

Supervisor: Dr. Daniele Loiacono

Author: Davide di Nello (student ID 875672)

Academic Year 2018-2019


To Cece


Abstract

In machine learning, the step of choosing the right algorithm parameters is known as Hyperparameter Optimization, and it is a growing field of research due to the exploding popularity of Data Science. This work addresses the tuning of the hyperparameters of a boosted trees model (a popular machine learning algorithm): it reviews some of the common approaches to this type of optimization and presents a new algorithm, created as a more stable adaptation of Hyperband, one of the most famous methods in the literature.

Sommario

In the Machine Learning literature, the process of choosing the right parameters for an algorithm is known as Hyperparameter Optimization, and it is a topic of growing interest owing to the great popularity gained by Data Science. This document presents some of the most common techniques for hyperparameter optimization, and introduces a new algorithm that aims to improve Hyperband, one of the most famous optimization methods in the literature, in the specific case where it is applied to boosted tree models.

Ringraziamenti

I thank my parents, for having helped, motivated and supported me (and perhaps even put up with me), despite my wandering.

I thank my lifelong friends, in the hope that they never change.

I thank my university classmates, in Italy as in France, for making even the hardest moments enjoyable.

I thank all the people with whom I have shared a basketball and a parquet floor; without basketball I would probably have graduated sooner and with better marks, but I would never have been as happy.

Contents

Abstract
Ringraziamenti

1 Introduction
  1.1 Motivation and Objectives
  1.2 Outline of contents

2 State of the Art
  2.1 Introduction
  2.2 Definition
  2.3 Model Free Optimization
    2.3.1 Grid Search
    2.3.2 Random Search
    2.3.3 Quasi Random Search
    2.3.4 Population Based Methods
  2.4 Bayesian Optimization
    2.4.1 Surrogate Models
    2.4.2 Acquisition Functions
  2.5 Multi-fidelity Optimization
    2.5.1 Hyperband
    2.5.2 Freeze-Thaw Bayesian Optimization
    2.5.3 Fabolas
  2.6 Summary

3 Best Arm Identification Hyperband
  3.1 Introduction
  3.2 The effect of Learning Rate on the Hyperband algorithm
    3.2.1 Learning Rate
    3.2.2 Bias in the optimization
    3.2.3 Intuition behind the algorithm
  3.3 Multi Armed Bandits
  3.4 Bayesian updates
  3.5 Thompson Sampling
  3.6 Best Arm Identification Hyperband (BAIHB)
    3.6.1 Epsilon greedy BAIHB
    3.6.2 TS BAIHB
  3.7 Better choices of the distributions

4 Implementation and Results
  4.1 Introduction
  4.2 Dataset
  4.3 Machine Learning Algorithm
  4.4 Experiment design
    4.4.1 Hyperband
    4.4.2 Random Search
    4.4.3 Thompson Sampling BAIHB
    4.4.4 Number of equivalent evaluations
  4.5 Results
  4.6 Summary and Final Considerations

5 Conclusions and future work

Bibliography

A Appendix
  A.1 Derivation of Expected Improvement for Gaussian random variables
  A.2 Original Hyperband algorithm


Chapter 1

Introduction

With the broadest of definitions, Machine Learning is the science of teaching computers how to act without programming them explicitly, and it is one of the fields of engineering that has seen an explosion in the past decade. It is a discipline that crosses the boundaries of mathematics, statistics, and programming, and finds applications in almost any domain thanks to the availability of data in this new Information Age society.

The financial industry, as often happens, is pushing the limits of new technologies, and Machine Learning algorithms have been applied to the financial markets since the late 1990s. Nevertheless, thanks to the fast evolution that this field is experiencing, new technologies and algorithms are always waiting to be developed and tested.

1.1 Motivation and Objectives

Financial data is very peculiar and particularly challenging for data scientists. The efficiency of the market makes the signal-to-noise ratio almost non-existent, making it extremely hard to find recognizable patterns. For this reason, many statistical and machine learning practices that work well in other environments are ineffective when dealing with financial data. During one of my latest working experiences, I worked for quite some time on the automation of procedures to optimize the hyperparameters of machine learning algorithms for financial forecasting; these are the parameters related to the implementation of the models themselves, often defining their complexity and strongly impacting their performance. Examples are the penalization coefficient of a Lasso Regression, the depth of the trees of a Random Forest, or the number of neurons of a Deep Neural Network. The goal of this document is to present the results of my work, and to show an application of hyperparameter optimization to financial data by means of a new algorithm which I developed during my working time to improve the robustness and efficiency of Hyperband, a famous automated hyperparameter tuning algorithm.

1.2 Outline of contents

The present work is structured as follows:

• Chapter 2 will define the hyperparameter optimization problem and analyze the current literature, presenting both the classic methods and some of the latest techniques.

• Chapter 3 will serve as a presentation of the original work, explaining its inception, detailing the underlying mathematical concepts and presenting the algorithm itself.

• Chapter 4 will contain some practical results on a real set of data, showing the potential advantages of this algorithm over the one it is designed to improve.

• Chapter 5 will hold the conclusions, final remarks, and suggestions about future work to be developed following this document.

Chapter 2

State of the Art

2.1 Introduction

Every Machine Learning algorithm depends on intrinsic parameters that help specify its complexity. We can find examples of these parameters in linear models, for example with regularization coefficients, in unsupervised methods, such as the number of centroids of K-Means, or in deep architectures, where parameters define the shape, dimension, and functionality of the network's layers. The tuning of these values is called Hyperparameter Optimization (HPO from now on) and it is one of the main steps of machine learning. Lately, the rising complexity of models, as well as the push for automation in this field, has boosted the research for fast and efficient HPO algorithms. Indeed, algorithms are more than ever dependent on the setting of their parameters, as the number of available configurations grows exponentially with the complexity of the models themselves, as does the computational power required to train them. With these premises, automated HPO can bring several advantages [11]:

• Reducing the manual effort and time required to apply machine learning procedures - many HPO algorithms nowadays work by budgeting evaluations of expensive functions to speed up the tuning process.

• Improving the performance of machine learning algorithms - tailoring the hyperparameter configuration to the problem at hand can result in drastic improvements over the "suggested configurations" provided by library documentation.

• Improving the reproducibility of studies - reducing the manual intervention of humans in the parameter choice makes work easier to reproduce, as it removes the variability of personal choices.

Several factors make HPO a tough challenge and differentiate it from classical mathematical optimization problems. To begin with, function evaluations can be extremely expensive, both in terms of time and money. For example, a single evaluation of a Deep Learning architecture entails its full training with a given parametrization, which can take hours, days, or even weeks, and may require renting an online cluster for computations. Another reason is that the configuration space is often very complex, containing a mix of continuous, categorical and conditional hyperparameters, sometimes without a clear meaning or range. Furthermore, the loss function is often retrieved via a black-box approach, without any (or with limited) knowledge of mathematical properties such as convexity, smoothness or derivatives (i.e. no Gradient Descent). Finally, as often happens in statistics, limited training data is available, but the performance of the optimization should generalize.

In the following sections, we will give a formal definition of the problem and provide an overview of the current literature, going from classical practices to the most recent algorithms.

2.2 Definition

Let us call A a machine learning algorithm with N hyperparameters, lying in the configuration space Λ = Λ1 × Λ2 × ... × ΛN. The single domains of the hyperparameters can be real-valued (e.g. penalization coefficients, learning rate), integer-valued (e.g. depth of a tree, number of layers) or categorical (e.g. loss metric). The configuration space can even contain some conditionalities, meaning that some parameters may only be relevant for certain values of a different parameter. This happens often in the case of neural networks, since the choice of the architecture becomes a parameter itself. In that case, parameters related to the optimization of a given layer are dependent on the existence of the layer itself, and consequently on the number of total layers.

On a general note, given a dataset D drawn from a population $\mathcal{D}$, our goal is to find

$$\lambda^{\star} = \operatorname*{argmin}_{\lambda \in \Lambda} \; \mathbb{E}_{\mathcal{D}}[L(\mathcal{A}_{\lambda})]$$

i.e. the configuration λ that minimizes loss L over the population $\mathcal{D}$, for algorithm A. However, in practice we only have a limited amount of data, and we will try to use this available data to generalize our choices as much as possible, by choosing:

$$\lambda^{\star} = \operatorname*{argmin}_{\lambda \in \Lambda} \; \mathbb{E}_{D}[V(L, \mathcal{A}_{\lambda}, D_{\mathrm{train}}, D_{\mathrm{val}})]$$

where $V(L, \mathcal{A}_{\lambda}, D_{\mathrm{train}}, D_{\mathrm{val}})$ measures the loss generated by algorithm A with configuration λ, using validation protocol V on dataset D. Popular choices for V(·,·,·,·) are holdout validation, k-fold cross validation, or walk forward validation (for time series data), which are well known methods in statistics [23] that were introduced in the late 1960s.
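As a concrete illustration, a minimal Python sketch of one possible validation protocol V (a simple holdout) follows; the function names and the 80/20 split are illustrative assumptions, not part of the definition above.

import numpy as np

def holdout_V(loss, train_and_predict, X, y, frac=0.8, seed=0):
    """Minimal holdout protocol V(L, A_lambda, D_train, D_val) (sketch).
    `train_and_predict(X_tr, y_tr, X_va)` stands in for algorithm A_lambda;
    X and y are assumed to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # random train/validation split
    cut = int(frac * len(y))
    tr, va = idx[:cut], idx[cut:]
    preds = train_and_predict(X[tr], y[tr], X[va])
    return loss(y[va], preds)                # validation loss for this lambda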

Another valid approach to HPO used in machine learning, but that finds its origin in classical Bayesian statistics, is ensembling. This approach suggests averaging the outputs of the models resulting from multiple (possibly uncorrelated) configurations rather than only using the best one, thus improving generalization performance [27]. From an HPO perspective, this means that our algorithm would be better off finding many "good" configurations rather than a single "great" one.

2.3 Model Free Optimization

In the following sections, we will go through the main methods used for HPO, starting with the simplest practices used by data scientists and moving on to the latest algorithms on multi-fidelity optimization. One method that will not be included is manual optimization. To some, it might seem naive even to discuss it, but this is still the most valid method for many practitioners, as some of the best Data Scientists on Kaggle (a popular website for Machine Learning contests, [1]) admit. The main problem with this approach is that it is extremely time-consuming, and even though with enough experience and knowledge it can lead to sound solutions, it cannot be used to implement an automated production system.

2.3.1 Grid Search

The most basic method for parameter optimization is Grid Search (full factorial design [28]). As the name suggests, it implies the evaluation of the model on a preselected discrete grid of the parameters. The user selects a finite set of values for each hyperparameter, usually at a uniform distance (on a linear or logarithmic scale), and generates a grid as the cartesian product of these sets. The simplicity of this approach is however heavily outweighed by its drawbacks; mainly, Grid Search suffers terribly from the curse of dimensionality, since the number of evaluations grows exponentially with the size of the configuration space. Nevertheless, this is still a very common method among practitioners, often combined with sequential manual optimization.

2.3.2 Random Search

A very simple and yet effective alternative is Random Search. This method has been used for a long time but was popularized in 2012 by a paper of Bergstra and Bengio [4] that aimed at demonstrating its superiority to Grid Search. Random Search works by choosing each configuration by sampling uniformly (or log-uniformly) on the domain of each hyperparameter. This works very well in any case where the importance of the parameters differs significantly, and a visual representation can be seen in Figure 2.1. The reason for this is that, given a fixed budget of evaluations B and a dimensionality of the configuration space N, Grid Search will evaluate each hyperparameter at only $B^{1/N}$ distinct values, while Random Search will explore B different values.

Another major advantage of Random Search over Grid Search is that it is easier to parallelize: workers do not need to communicate to perform parallel evaluations, and any failure on a worker does not impact the whole design of the experiment, granting a more flexible allocation of resources.
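To illustrate the sampling scheme, a single Random Search draw can be written as one independent sample per hyperparameter; the parameter names and ranges below are invented for the example.

import numpy as np

rng = np.random.default_rng(42)

def sample_configuration():
    """One Random Search draw: each hyperparameter is sampled independently,
    uniformly or log-uniformly on its own domain (ranges are illustrative)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform draw
        "max_depth": int(rng.integers(2, 11)),       # integer-valued domain
        "subsample": rng.uniform(0.5, 1.0),          # real-valued domain
    }

configurations = [sample_configuration() for _ in range(20)]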

Random Search is often considered a good baseline for HPO methods and it is often integrated within other algorithms, in particular for early exploration of the configuration space. Of course, it also has some pitfalls; for starters, it is a purely exploratory algorithm and as such it will not learn from past iterations, and it might end up sampling configurations very close to each other even if they have extremely bad results. Therefore, in many settings guided search algorithms tend to perform far better, as proved in many papers [25, 10, 3].

Figure 2.1: Graphical representation of the difference between Grid Search and Random Search over a 2-D space, with one important and one unimportant parameter. Random Search allows for a better exploration of the important parameter [4].

2.3.3 Quasi Random Search

A slight variation of random search, also discussed by Bergstra and Bengio in their paper [4], involves the use of quasi-random number generation. Techniques of this type - such as Sobol sequences or Latin Hypercube Sampling - have been developed for Monte Carlo algorithms and have better convergence properties than Random Search [5, 15], granting a more uniform exploration of the space by avoiding sampling points that are too close together. Obviously, HPO differs from Monte Carlo simulations, as the number of sampled points is lower by many orders of magnitude; but even though these techniques are asymptotically equivalent to Random Search, they are proven to have a better convergence curve, achieving good results in a shorter time.
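A minimal sketch of quasi-random sampling, assuming a recent version of SciPy (its qmc module) and illustrative parameter ranges:

import numpy as np
from scipy.stats import qmc  # requires SciPy >= 1.7

# Sobol sequence over a 2-D unit hypercube, rescaled to the hyperparameter
# domains (illustrative: log-uniform learning rate, uniform subsample).
sampler = qmc.Sobol(d=2, scramble=True)
unit = sampler.random(16)                    # 16 quasi-random points in [0,1)^2
learning_rate = 10 ** (-4 + 3 * unit[:, 0])  # mapped to [1e-4, 1e-1]
subsample = 0.5 + 0.5 * unit[:, 1]           # mapped to [0.5, 1.0]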

2.3.4 Population Based Methods

Population-based methods are a large family that includes multiple algorithms, such as genetic algorithms, evolutionary strategies and particle swarms. These algorithms usually define a population (which in our case is a set of configurations) and try to improve it over multiple iterations by modifying (mutations) or combining (crossovers) some members, thereby creating new "generations" of the population [31]. Some of these methods, such as CMA-ES (Covariance Matrix Adaptation Evolutionary Strategy) [26], have obtained some of the best performances for black-box function optimization. However, in this document, I will not dig deeper into these, as I preferred focusing on different methods.

2.4 Bayesian Optimization

Bayesian optimization techniques have become very popular in recent years for HPO tasks, obtaining extremely competitive results in the tuning of machine learning algorithms such as neural networks for image classification or speech recognition. The main point of Bayesian optimization is to use the information that is extracted from previously tested configurations to build a better choice strategy, therefore "guiding" the sampling procedure towards more promising areas of the search space.

Bayesian Optimization algorithms can differ strongly, but they are all based on two key parts: a probabilistic model and an acquisition function. The first one is a surrogate model over the space of the configurations which gives a probabilistic estimation of the performance of the machine learning algorithm based on sampled points; the second one is a function on the surrogate model that is used to determine the suitability of all future candidate points and choose where to sample next, balancing the exploration of the space and the exploitation of current information.

Algorithm 1 Meta algorithm for Bayesian Optimization

1: Build a surrogate probability model for the objective function
2: Find the hyperparameters that perform best on the surrogate by the use of a choice function
3: Apply the chosen configuration to the true objective function
4: Update the surrogate model incorporating the new point
5: Repeat steps 2-4 until max iteration or max time reached
6: Return best configuration
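In Python, the loop of Algorithm 1 might be sketched as follows; fit_surrogate and acquisition are assumed interfaces (not a specific library's API), and optimizing the acquisition over a finite candidate set is a simplification of step 2.

def bayesian_optimization(objective, candidates, fit_surrogate, acquisition, n_iter):
    """Skeleton of Algorithm 1 (sketch). `fit_surrogate(history)` returns a
    probabilistic model; `acquisition(model, lam)` scores a candidate point."""
    history = []                                          # (lambda, loss) pairs
    for _ in range(n_iter):
        model = fit_surrogate(history)                    # steps 1/4: (re)fit surrogate
        lam = max(candidates, key=lambda c: acquisition(model, c))  # step 2
        y = objective(lam)                                # step 3: expensive evaluation
        history.append((lam, y))                          # step 4: add the new point
    return min(history, key=lambda p: p[1])               # step 6: best configuration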


2.4.1 Surrogate Models

The most common approach to building a surrogate model for a target black-box function is through the use of Gaussian processes. Gaussian processes (GP) are a very convenient way to express a prior distribution over functions, and are defined in such a way that any finite set of M configurations $\{x_n\}_{n=1}^{M}$ defines a multivariate Gaussian distribution on $\mathbb{R}^M$ [29]. Moreover, they allow us to analytically compute the predictive distributions for new configurations given sampled ones. Given a mean function $\mu : \mathcal{X} \to \mathbb{R}$, a covariance function $\kappa : \mathcal{X}^2 \to \mathbb{R}$, and $f \sim GP(\mu, \kappa)$, we have that f(x) follows a Gaussian distribution $N(\mu(x), \kappa(x, x))$ for every $x \in \mathcal{X}$. Then, if we have a set of n observations $D_n = \{(x_i, y_i)\}_{i=1}^{n}$ from our function, we know the form of the predictive distribution $f_n \sim GP(\mu_n, \kappa_n)$, which will have a posterior mean and covariance function given by:

$$\mu_n(x) = k^{T} (K + \eta^2 I)^{-1} Y$$
$$\kappa_n(x, x') = \kappa(x, x') - k^{T} (K + \eta^2 I)^{-1} k'$$

where $Y \in \mathbb{R}^n$ is the vector containing the observations $y_i$, while $k, k' \in \mathbb{R}^n$ are defined as $k_i = \kappa(x, x_i)$, $k'_i = \kappa(x', x_i)$, and $K \in \mathbb{R}^{n \times n}$ is characterized by $K_{i,j} = \kappa(x_i, x_j)$. The parameters of the GP itself, such as those of the kernel (covariance) function, can usually be updated through maximum likelihood or Markov Chain Monte Carlo integration [12].
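These posterior formulas translate directly into a few lines of NumPy; the sketch below assumes a generic covariance function kappa(a, b) and a fixed noise level eta.

import numpy as np

def gp_posterior(X, Y, x_star, kappa, eta=1e-3):
    """Posterior mean and variance at a new point x_star, implementing the
    two formulas above (sketch; `kappa(a, b)` is any covariance function)."""
    K = np.array([[kappa(a, b) for b in X] for a in X])      # K_ij = kappa(x_i, x_j)
    k = np.array([kappa(x_star, a) for a in X])              # k_i = kappa(x_star, x_i)
    A = K + eta ** 2 * np.eye(len(X))                        # noisy Gram matrix
    mu = k @ np.linalg.solve(A, np.asarray(Y))               # posterior mean
    var = kappa(x_star, x_star) - k @ np.linalg.solve(A, k)  # posterior variance
    return mu, var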

GPs have reached great results compared to vanilla Random Search, but they have a major downfall, which is their limited scalability: the computation of the kernel function scales like $O(N^3)$ with the number of sampled configurations and like $O(N^2)$ with the dimensionality of the configuration space. For this reason, other approaches have been proposed, such as approximated Gaussian processes, sparse Gaussian processes and other methods based on different techniques such as Random Forests.

A particularly successful model, introduced in 2011 by Bergstra [3], is the Tree Parzen Estimator (TPE). Contrary to Gaussian processes, this algorithm does not directly model the posterior distribution of our function given sampled points, p(y|λ), but rather p(λ|y). Starting from a percentile α on the target function, sampled observations are clustered into "good" and "bad", and their densities (respectively l(λ) and g(λ)) are estimated via Kernel Density Estimation. This allows for easy optimization of Expected Improvement, a popular acquisition function, and for very good scalability both in the number of sampled points and in the number of dimensions (O(N)).

2.4.2 Acquisition Functions

Acquisition functions have the goal of indicating which configurations would be most interesting to test next, given current information. Some simple examples are Probability of Improvement and Expected Improvement, which are defined as follows (for a configuration λ, and current sampled minimum $f^{\star}$):

$$PI(\lambda) = \mathbb{E}[\mathbb{1}_{f(\lambda) < f^{\star}} \mid \lambda, D]$$
$$EI(\lambda) = \mathbb{E}[\max(f^{\star} - f(\lambda), 0) \mid \lambda, D]$$

These two particular functions are pretty much self-explanatory and they are easy to compute, in particular when (as in the case of GPs) our surrogate model can be described via Gaussian distributions. Moreover, Expected Improvement is one-step optimal, meaning that it always proposes the best possible configuration to achieve a new minimum on the next step [12].
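For a Gaussian surrogate, Expected Improvement has a closed form (the derivation is in Appendix A.1); the following sketch implements it for the minimization setting, with mu and sigma taken from the surrogate's posterior.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_star):
    """Closed-form EI under a Gaussian posterior N(mu, sigma^2), for
    minimization: (f* - mu) * Phi(z) + sigma * phi(z), z = (f* - mu)/sigma."""
    z = (f_star - mu) / sigma
    return (f_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)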

Newer acquisition functions such as Entropy Search and Knowledge Gradient are instead less intuitive from a probabilistic standpoint; they look for configurations achieving a greater decrease in the differential entropy of the surrogate model. These are usually not available in closed form, and therefore need more computation time, but they have been proven to give consistently good results on a set of different problems [14]. Nevertheless, given its interpretability and its perfect match with the TPE model, Expected Improvement remains the most popular acquisition function in the field.

Figure 2.2: Representation of three steps of a Gaussian process Bayesian optimization algorithm (for maximisation) with a generic acquisition function over a 1-D domain. At every step we compute the cheap acquisition function over the domain, find its maximum, use it as the next sample point for the expensive objective function, and finally update our Gaussian surrogate model with the new information.

2.5 Multi-fidelity Optimization

Multi-fidelity Optimization is a methodology for HPO that involves the use of low-budget evaluations to speed up the exploratory process. A good formal definition is given by Kandasamy [19]: let us assume that we want to maximise a function $f : \mathcal{X} \to \mathbb{R}$ over its domain $\mathcal{X}$. Our goal is to find $x^{\star} = \operatorname{argmax}_{x \in \mathcal{X}} f(x)$, and consequently $f^{\star} = f(x^{\star})$.

Multi-fidelity optimization assumes the existence of a fidelity space $\mathcal{Z}$ and a function $g : (\mathcal{Z} \times \mathcal{X}) \to \mathbb{R}$, related to f via $f(\cdot) = g(z_{\bullet}, \cdot)$, where $z_{\bullet} \in \mathcal{Z}$. Under the assumption that evaluations $g(z, \cdot)$ at a cheaper fidelity $z \in \mathcal{Z}$ are informative about our target function $g(z_{\bullet}, \cdot)$, our goal becomes to maximise f through evaluations of g, by exploring the space through low-fidelity evaluations and refining our search with high-fidelity tests only in the most promising regions of $\mathcal{X}$. In most common cases, the fidelity space represents the amount of training data used for the model or the length of the training, and $z_{\bullet}$ is a thorough training with all available points. A good example for a neural network would be $\mathcal{Z} = [0, N_{\bullet}] \times [0, T_{\bullet}]$, where the two intervals represent the number of points and the number of epochs. In this case $z_{\bullet} = (N_{\bullet}, T_{\bullet})$ would be an evaluation with the full training dataset and with the highest number of epochs allowed.

Recent years have seen a surge in the popularity of this field, with many research groups from prestigious universities focusing on this topic [19, 25, 22, 21, 32]. We will concentrate in particular on three recently published algorithms that use different approaches: Hyperband is a simple numerical algorithm that works as a "smartly budgeted" Random Search, Freeze-Thaw Bayesian Optimization tries to model and predict the learning curve of iterative models, and FABOLAS fits a Gaussian Process regression model onto the extended fidelity space $\mathcal{Z}$.

Figure 2.3: The plot shows an iteration of Successive Halving. The validation error of a set of configurations is displayed as a function of assigned resources (training time). As one can see, more promising configurations are trained longer, while bad ones are dropped at every step.

2.5.1 Hyperband

The first algorithm that we will introduce for multi-fidelity optimization is called Hyperband, and was published in 2016 by Li and Jamieson [25]. The algorithm itself is an extension of Successive Halving, which was proposed by Jamieson and Talwalkar in 2015 [18]. The idea of Successive Halving is straightforward: allocate budget uniformly to a set of configurations, evaluate their performance, throw out the worst-performing half of them and then repeat until only one configuration is left. At every iteration, the total allocated budget is constant while the number of configurations decreases, and therefore the algorithm allocates exponentially increasing resources to the configurations that are not excluded at every round.

This algorithm can have great results, but its performance depends strongly on the choice of the number of starting configurations n and the budget B. The solution introduced by Li addresses this issue by proposing a sort of hedging strategy that iterates over multiple values of n. In substance, Hyperband loops over Successive Halving, changing every time the values of n - the number of tested configurations - and r - the minimum resources to be allocated to a single evaluation. The terminology used in the paper refers to each separate iteration level of Successive Halving as a "bracket", and we will do the same in the rest of this document. Brackets are designed to use approximately the same amount of resources but with an ever-changing allocation of n and r.

Figure 2.4: Summary of a Hyperband run. Every block represents a bracket with new configurations being tested, and it is indexed by s. In every bracket, configurations are selected sequentially by testing $n_i$ of them with budget $r_i$.

A good way of understanding the algorithm is looking at Figure 2.4. The algorithm needs two parameters, R and η; the first one represents the maximum budget we are allowed to allocate, while the second one is the inverse of the ratio of retained configurations at every step. In the example, R = 81 and η = 3, therefore the algorithm uses five brackets. Within every bracket an initial number of configurations $n_0$ is tested with budget $r_0$. The best $n_1 = n_0/\eta$ configurations are kept and reevaluated with budget $r_1 = \eta r_0$, and so on until we reach $r_i = R$. By using different starting values for $n_0$ and $r_0$ the algorithm generally achieves a good performance, while having a worst case scenario equivalent to a slower Random Search (given by the last bracket). [Full algorithm in Appendix A.2]
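A compact sketch of this bracket schedule, following the pseudocode of Li et al. [25]; sample_config and evaluate are assumed callables returning a random configuration and a validation loss respectively.

import math

def hyperband(sample_config, evaluate, R=81, eta=3):
    """Hyperband schedule (sketch): one Successive Halving run per bracket s,
    trading off the number of configurations n against the initial budget r."""
    s_max = round(math.log(R, eta))
    best = (float("inf"), None)                       # (loss, configuration)
    for s in reversed(range(s_max + 1)):              # one bracket per s
        n = math.ceil((s_max + 1) / (s + 1) * eta ** s)   # initial configurations
        r = R * eta ** (-s)                           # initial per-config budget
        configs = [sample_config() for _ in range(n)]
        for i in range(s + 1):                        # Successive Halving rounds
            n_i = math.floor(n * eta ** (-i))
            r_i = r * eta ** i
            losses = [evaluate(c, r_i) for c in configs]
            ranked = sorted(zip(losses, configs), key=lambda p: p[0])
            best = min(best, ranked[0], key=lambda p: p[0])
            configs = [c for _, c in ranked[: max(n_i // eta, 1)]]  # keep top 1/eta
    return best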

This algorithm has been tested on a multitude of datasets, providing significant speedups compared to Bayesian Optimization methods (×5 to ×30), in particular when applied to kernel methods and neural networks. One of its main drawbacks is that the search is not "really" guided, and therefore its performance on long runs will suffer compared to Bayesian methods. For this reason, a Bayesian version of this algorithm, called BOHB [10], was developed in 2018, guiding the choice of new configurations through a Tree Parzen Estimator approach.

2.5.2 Freeze-Thaw Bayesian Optimization

A more complex but extremely interesting approach to multi-fidelity optimization is given by [32]. For the sake of simplicity, we will stay clear of the math presented in the paper as much as possible, but the interested reader can find more detailed explanations in the original publication.

The idea behind Freeze-Thaw is to build a regression model that allows us to predict the learning curve of an iterative training procedure, and to do so via Gaussian Processes. In particular, the authors make the (fair) assumption that learning curves roughly follow an exponentially decaying function of the form $e^{-\lambda t}$, and use this as a building block for the kernel of their Gaussian Process. Instead of using a finite set of values for λ, Swersky and Snoek integrate over the whole $[0, \infty)$ range using a non-negative mixing measure ψ; therefore, the GP kernel function between two time steps takes the following form:

$$k(t, t') = \int_{0}^{\infty} e^{-\lambda t} e^{-\lambda t'} \, \psi(d\lambda)$$

The authors then suggest using a Gamma distribution for the mixing measure, as this leads to simplifications and to an analytic solution of the integral.
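With a Gamma(α, β) mixing measure, the integral admits a closed form of the type $\beta^{\alpha}/(t + t' + \beta)^{\alpha}$ (as derived in the original paper), so the kernel reduces to a one-liner; the parameter values below are purely illustrative.

def freeze_thaw_kernel(t, t_prime, alpha=1.0, beta=0.5):
    """Exponential-decay kernel integrated against a Gamma(alpha, beta)
    mixing measure: beta**alpha / (t + t' + beta)**alpha (sketch)."""
    return beta ** alpha / (t + t_prime + beta) ** alpha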


Figure 2.5: (a) Directed acyclic graph representing the conditional independence model used for training. Every row represents a learning curve that is drawn from an independent GP prior, conditioned on its mean. The mean of each learning curve is jointly drawn with the mean of the other learning curves using another GP prior. (b) Partially completed training curves, and GP prediction of their asymptote. (c) Posterior GP prediction at the asymptote. Each colored point represents the GP prediction at the configuration corresponding to the training curve of the same color. [32]

Finally, the authors make an assumption of independence between training curves, conditioned on a prior mean. This is, of course, a simplification, but it is an essential one, since a Gaussian Process fitting all configurations and all training points would scale like $O(N^3 T^3)$ in the number of configurations N and the number of curve training points T. Conditional independence implies that every learning curve is drawn from a separate GP whose mean is itself drawn from a global GP; this allows the whole algorithm to scale like $O(N^3 + T^3 + NT^2)$.

As the training of a curve is occurring, the algorithm computes the Expected Improvement for the current training configuration as well as for new untested ones - if new configurations look more promising than the current one given the state of the learning curve, the training is frozen and the algorithm switches to a new configuration. Moreover, a set of (10) past learning curves is kept at all times, updating their expected improvement, and old configurations can be picked up again and trained from where they had previously been frozen.

Altogether, Freeze-Thaw is a very interesting algorithm that allows for the exploitation of partial information as training is happening, and it is a good solution for problems where the fidelity is related to the number of iterations/epochs. Its main problem is that, like every GP-based algorithm, it has cubic complexity, and this can be a problem in situations with very high-dimensional configuration spaces where many different tries are needed.

2.5.3 Fabolas

Fabolas - an acronym for FAst Bayesian Optimization on LArge data Sets - is another approach to hyperparameter optimization based on Gaussian Processes, published by Klein and Falkner [21].

This approach models the loss of a machine learning algorithm across dataset sizes; to use the notation of the section on multi-fidelity optimization, Fabolas fits a Gaussian Process over the space $(\mathcal{Z} \times \mathcal{X})$, where $\mathcal{Z} = [0, 1]$ is a unit interval representing the fraction of the dataset being used, with $z_{\bullet} = 1$ being the full dataset. Then, Entropy Search is used to find out how much can be learned about performance on the full dataset from an evaluation at any z. Having a Gaussian Process on the whole space $(\mathcal{Z} \times \mathcal{X})$ allows predicting the performance of the algorithm on the full dataset without necessarily having to evaluate there, but this is only true if the regression model correctly captures the behaviour of the loss with respect to the dataset size used. It is safe to assume that the loss of a machine learning algorithm decreases with the size of the dataset used, and for this reason the authors suggest the use of a factorized kernel, made of a standard kernel over hyperparameter configurations (on $\mathcal{X}$ only) and a covariance function in z (on $\mathcal{Z}$ only):

$$k((x, z), (x', z')) = k_{5/2}(x, x') \cdot (\phi(z)^{T} \, \Sigma_{\phi} \, \phi(z'))$$

The authors choose $\phi(z) = (1, (1 - z)^2)^{T}$ to enforce the monotonicity of the loss with respect to z. The algorithm then uses a complicated acquisition function that models not only the potential gain in information but also the computational cost of sampling at a given z, trying to balance both in the smartest way possible.
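A sketch of the fidelity part of this factorized kernel; Sigma_phi stands for the learned 2×2 covariance of the basis-function weights and is treated here as a given placeholder.

import numpy as np

def fidelity_covariance(z, z_prime, Sigma_phi):
    """Covariance in the fidelity variable z from the factorized kernel
    above, with phi(z) = (1, (1-z)^2)^T (sketch; Sigma_phi is assumed)."""
    phi = lambda s: np.array([1.0, (1.0 - s) ** 2])
    return phi(z) @ Sigma_phi @ phi(z_prime)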

When tested against other HPO algorithms on similar problems, Fabolas has shown great performance, in particular with large datasets and medium to small configuration spaces. Once again, its weakness lies in the use of Gaussian Processes and their limited scalability in terms of points, which does not allow it to work well in problems where the dimensionality of the space is too high or where the maximum is particularly tricky to find.

2.6 Summary

In this chapter, we have defined the problem of Hyperparameter Optimization and shown both some of the most common and some of the latest algorithms on the market. Recent algorithms have shown better results, but are often more complicated to set up and to use, which is why many practitioners still prefer using manual techniques on easier problems. Nevertheless, the inclusion of environment variables such as dataset size, number of iterations, etc., and the definition of multi-fidelity optimization, is probably one of the greatest recent advances in this field and is still a very active topic of research.


Chapter 3

Best Arm Identification Hyperband

3.1 Introduction

The goal of this chapter is to introduce a new algorithm that tries to address the shortcomings of the previously mentioned Hyperband algorithm. There are many reasons behind the choice of this algorithm rather than the other ones presented in the last chapter. First of all, it is intuitive and does not make assumptions on the optimization space $(\mathcal{Z} \times \mathcal{X})$. By avoiding the use of Gaussian Processes, it also scales better in terms of speed to higher-dimensional spaces, and its implementation from scratch into production code is much quicker.

The first step of the chapter will be to present some of the pitfalls that I encountered when using the Hyperband algorithm and the main ideas behind the modifications that I wished to include. Then, I will discuss some concepts about Bayesian statistics and Multi-Armed Bandit models that will be useful when presenting the new algorithm. Finally, I will present the full algorithm and discuss its use and implications.


3.2 The effect of Learning Rate on the Hyperband algorithm

When introducing multi-fidelity algorithms in the previous chapter, we briefly explained that common choices for the fidelity (or budget) space are dataset size and training time. For iterative algorithms such as tree-based models, additive models, or neural networks, the number of iterations is also a valid choice, as it allows for easier implementation and it has a more direct and understandable impact on the training of a model. Moreover, many of these iterative algorithms, like the Gradient Boosted Machines used during this study, allow saving partially trained models, loading them, and resuming training. From the perspective of Hyperband, this is an advantage that can save a huge amount of training time, and that is not available when budgeting through the amount of data.

3.2.1 Learning Rate

Before diving into the details of the algorithm and its problems, we need to take a step back and define what the learning rate is. Some algorithms, such as Neural Networks, train their parameters using Gradient Descent (or some variation of it), which is an iterative method that updates parameters by moving along the negative gradient of the chosen loss function. Others, such as Gradient Boosted Machines and Generalized Additive Models, are built through an iterative additive procedure that greatly resembles Gradient Descent.

The learning rate is the hyperparameter η common to these procedures that influences the size of the update at every step. This parameter has proven to be crucial to the training: values that are too high will make the iterative process diverge, while values that are too low will take too long to converge and will be more likely to get stuck in local minima. For this reason, many studies have been published trying to assess the optimal value of this parameter, proposing variants that adapt its magnitude to the state of the training with the goal of improving convergence. All in all, this hyperparameter differentiates itself from the others in that it is the one that most influences the speed and the convergence of the training [8].

3.2.2 Bias in the optimization

The problem we face when using Hyperband for iterative models is that it makes the implicit assumption that different configurations of a machine learning algorithm will rank similarly at different budgets. If this were completely true, it would mean that learning curves for different configurations never cross. We know of course that this is false, and this is why the evaluations are compared at different stages of the learning (i.e. at different budgets); but if we want to extract as much information as possible out of the learning curve we have to remove possible biases, and the learning rate is one. As explained in the previous subsection, the learning rate explicitly drives the convergence speed, and therefore configurations with highly different learning rates should only be compared once they have been fully trained, as the shapes of their curves will be extremely different, and some of them will converge much more quickly than others.

For example, let us assume we are trying to optimize the hyperparameters of a machine learning algorithm that makes use of the learning rate, and let us call $\mathcal{A}_{\eta}$ an instance of this algorithm that uses rate η. We will be using the Hyperband framework and budgeting our evaluations through the number of iterations (or epochs). Assuming that our learning rate is a continuous parameter that lies in the range [α, β], the first step of the Hyperband algorithm will be to sample multiple configurations, each with a randomly picked learning rate $\eta_i \in [\alpha, \beta]$, and compare the validation score $V(L, \mathcal{A}_{\eta_i}, D_{\mathrm{train}}, D_{\mathrm{val}}, n_1)$ of our machine learning algorithm $\mathcal{A}_{\eta_i}$ after $n_1$ iterations. As it is the first exploration step, we have that $n_1 \ll n_{\mathrm{full}}$ ($n_{\mathrm{full}}$ representing the number of iterations needed for this algorithm's instance to converge). Given the shape of the learning curves, we will have that

$$V(L, \mathcal{A}_{\eta_j}, D_{\mathrm{train}}, D_{\mathrm{val}}, n_1) \ll V(L, \mathcal{A}_{\eta_k}, D_{\mathrm{train}}, D_{\mathrm{val}}, n_1) \qquad \forall \eta_j, \eta_k : \eta_j \gg \eta_k$$

This means that Hyperband, over low-budget evaluations, will consistently choose configurations with a high learning rate. This problem has somewhat been addressed by the original authors, as every new bracket's evaluations start with a higher budget, therefore reducing the bias. However, this is an ineffective solution, as it relies on the last brackets to test configurations with low learning rates, while a high learning rate configuration will be output by the lower brackets. This not only alters the original sampling distribution for the learning rate, but is also biased, since the last brackets have a very limited exploration power (they sample a reduced number of configurations), and this can importantly reduce the efficiency of this algorithm when the best solutions have low learning rates.

3.2.3 Intuition behind the algorithm

Given the issues listed in the previous paragraph, the intuition behind the algorithm is quite straightforward: only compare budgeted evaluations that have a similar learning rate. The first step to do so is splitting the search interval of the learning rate uniformly on a logarithmic scale into non-overlapping sub-intervals. Once the interval has been split into smaller ones, the Hyperband algorithm can be run separately on these intervals, so that the full space is explored thoroughly and fairly, comparing configurations that use the same order of magnitude of learning rate. Finally, based on the performance obtained, one can choose to run Hyperband on the sub-intervals that reward the best results.

The algorithm also takes care of automating this last step, choosing the most performing split and balancing the exploration of the space and the exploitation of the best performing interval, by using a Best Arm Identification framework.
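A minimal sketch of the splitting step, assuming the interval is divided into k equal sub-intervals on a log10 scale (the helper name is hypothetical):

import numpy as np

def split_log_uniform(alpha, beta, k):
    """Split the learning-rate range [alpha, beta] into k non-overlapping
    sub-intervals of equal width on a logarithmic scale."""
    edges = np.logspace(np.log10(alpha), np.log10(beta), k + 1)
    return list(zip(edges[:-1], edges[1:]))

# e.g. split_log_uniform(1e-4, 1e-1, 3)
#      -> [(1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1)]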

3.3 Multi Armed Bandits

Multi Armed Bandits are a vast framework of algorithms that has been studied in probability theory since 1933 [33]. In the simplest setting, the algorithm has K possible choices - called arms - and T rounds. At each round, the algorithm chooses an arm and receives a reward, which is drawn from an unknown fixed distribution $\pi_i$ that depends on the arm. The goal of the algorithm is then to maximise its total reward over the T rounds, or better, to minimize its cumulative regret, which is defined as the difference between the reward that the player received and the one he would have observed by always pulling the optimal arm.

This type of framework poses the classical problem of exploration versus exploitation. Since there is an amount of uncertainty given by the fact that rewards are stochastic, an algorithm must have the right amount of "greediness": one that is too greedy might get stuck pulling a suboptimal arm forever just because it gave a lucky reward at the beginning; one that is not greedy enough will keep pulling suboptimal arms even when they are clearly underperforming with respect to the others.

A very easy option to balance exploration and exploitation are epsilon-greedy algorithms; these algorithms start by pulling every arm at least once and then keep pulling the best performing arm with probability 1 − ε, while they choose another random arm with probability ε. To improve performance, it is possible to decrease ε with the number of rounds, in order to explore more at the beginning, when there is more uncertainty about the arms, and then take advantage of the best performing arm once many rewards are available. Improving over this benchmark, many probabilistic algorithms have been proposed with better asymptotic regret-minimizing properties, such as Thompson Sampling [33] and Upper Confidence Bound [24]. Most of these algorithms build some probabilistic model for the rewards and then choose the arm that maximises some acquisition function (Probability of Improvement, Expected Improvement, Upper Confidence Bound, etc.). Thompson Sampling, in particular, introduces extra randomization by sampling the parameters in a Bayesian setting, and we will discuss it further in an upcoming section.
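For concreteness, a minimal sketch of the epsilon-greedy strategy described above, with pull(arm) standing in for one stochastic reward draw:

import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(pull, k, t_max, eps=0.1):
    """Epsilon-greedy bandit sketch: play each arm once, then the empirical
    best with probability 1 - eps and a uniformly random arm otherwise."""
    counts = np.zeros(k)
    means = np.zeros(k)
    for t in range(t_max):
        if t < k:
            arm = t                        # initial round-robin over the arms
        elif rng.random() < eps:
            arm = int(rng.integers(k))     # explore
        else:
            arm = int(np.argmax(means))    # exploit the best empirical mean
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running mean update
    return means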

Going back to our problem, it is easy to see how it falls into this framework. Once we split up the learning rate interval, the sub-intervals become our arms, and the stochastic reward is the loss obtained by a run of Hyperband on that particular split. However, our goal differs slightly from the one of the basic setting; we are not interested in the arm that gives the best results on average, but rather in the arm that can give us the best result overall.

3.4 Bayesian updates

In this section we will give a very essential introduction to Bayesian statistics, just enough to give the reader the means to understand the following parts of this document.

When dealing with data, the goal is often to learn about some unknown parameter such as the mean of a distribution, a correlation or a proportion. In both frequentist and Bayesian statistics, data is used to make inference on this parameter, but the two methods differ strongly when it comes to dealing with uncertainty. Where frequentist methods give point estimates and confidence intervals, Bayesian statistics expresses beliefs through probability distributions. The Bayes theorem is the fundamental tool of Bayesian statistics, and expresses the relationship between prior and posterior belief:

$$p(\theta \mid y) \propto p(y \mid \theta) \, p(\theta)$$

The posterior distribution p(θ|y) of our unknown parameter θ is proportional to its prior distribution p(θ) times the likelihood of the data p(y|θ).

The constant of proportionality is given by an integral that usually does not have closed form, and for this reason most of the time we need to estimate posterior densities through Monte Carlo methods. In some special cases, however, this is not necessary and we can easily compute the distribution. Using the definition from Jackman [17]: suppose a prior density p(θ) belongs to a class of parametric densities, $\mathcal{F}$. Then the prior density is said to be conjugate with respect to a likelihood p(y|θ) if the posterior density p(θ|y) is also in $\mathcal{F}$. Adding to this definition, we can say that in most cases the parameters of the posterior distribution can be computed instantly through some simple update rule.

Let us now give an example, which will be useful later, of such conjugate distributions. Let us assume our data comes from some Gaussian distribution, $y = (y_1, \dots, y_n)'$ with $y_i \sim_{iid} N(\mu, \sigma^2)$. In this case, our interest focuses on a vector of parameters $\theta = (\mu, \sigma^2)$. The likelihood of the data with respect to our parameters is given by

$$L(\mu, \sigma^2; y) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right)$$

and it is a function of μ and $\sigma^2$ defined on a two-dimensional space. We find that for this likelihood, the conjugate prior density $p(\mu, \sigma^2)$ is defined (by marginalizing) as the product of a normal, conditional density for μ, and a density for $\sigma^2$. Explicitly: $p(\mu, \sigma^2) = p(\mu \mid \sigma^2) \, p(\sigma^2)$, where

$$\mu \mid \sigma^2 \sim N(\mu_0, \sigma^2/n_0)$$
$$\sigma^2 \sim \mathrm{InverseGamma}(\nu_0/2, \; \nu_0 \sigma_0^2/2)$$

and the distributions are defined by the hyperparameters $\mu_0$, $n_0$, $\nu_0$ and $\sigma_0^2$. What is interesting about this combination is that the posterior density is also a Normal-InverseGamma, with updated hyperparameters

$$\mu_1 = \frac{n_0 \mu_0 + n \bar{y}}{n_0 + n}, \qquad n_1 = n_0 + n, \qquad \nu_1 = \nu_0 + n$$
$$\nu_1 \sigma_1^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2 + \frac{n_0 n}{n_0 + n} (\mu_0 - \bar{y})^2$$

where n is the number of sampled points and $\bar{y}$ their average. This means we have a very easy update rule and we can easily make inference on the parameters μ and $\sigma^2$. Even better, in this case we also know the predictive density for a new point $y^*$ given the current data, i.e. $p(y^* \mid y)$, which is a Student-t density with location parameter $\mu_1$, scale parameter $\sigma_1 \sqrt{(n_1 + 1)/n_1}$, and $n + \nu_0$ degrees of freedom: a known density that we can sample from.
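These update rules, and sampling from the resulting Student-t predictive density, take only a few lines; the sketch below is a direct transcription of the formulas above, using SciPy for the Student-t draws.

import numpy as np
from scipy.stats import t as student_t

def nig_update(y, mu0, n0, nu0, sigma2_0):
    """Conjugate Normal-InverseGamma update for Gaussian data (sketch)."""
    y = np.asarray(y)
    n, ybar = len(y), float(np.mean(y))
    mu1 = (n0 * mu0 + n * ybar) / (n0 + n)
    n1, nu1 = n0 + n, nu0 + n
    nu1_s1 = (nu0 * sigma2_0 + np.sum((y - ybar) ** 2)
              + n0 * n / (n0 + n) * (mu0 - ybar) ** 2)
    return mu1, n1, nu1, nu1_s1 / nu1          # last entry is sigma_1^2

def sample_predictive(mu1, n1, nu1, sigma2_1, size=1):
    """Draw new points from the Student-t posterior predictive density."""
    scale = np.sqrt(sigma2_1 * (n1 + 1) / n1)  # sigma_1 * sqrt((n1+1)/n1)
    return student_t.rvs(df=nu1, loc=mu1, scale=scale, size=size)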

3.5 Thompson Sampling

As previously introduced in the section about Multi Armed Bandits, Thompson Sampling is a common algorithm for online decision-making in the Bandit framework and was introduced in the 1930s [33]. Even though it is one of the oldest such algorithms, it was ignored for many years, coming to fame in the 2010s thanks to two important publications that showed its empirical performance [30, 6]. Its popularity has been rising since then, with applications ranging from A/B testing [34] to recommender systems [20] and hyperparameter tuning [19].

The original Thompson Sampling algorithm was proposed in the scenario of a Bernoulli Bandit. In this simple setting, every arm $i$ is characterized by a Bernoulli distribution and produces a reward of 1 with probability $p_i$. The algorithm itself is based on the Bayesian paradigm and therefore treats the parameter $p_i$ as an unknown variable, quantifying its uncertainty by setting a prior distribution given by a $\text{Beta}(\alpha_i, \beta_i)$ density. This prior distribution happens to be conjugate to the Bernoulli likelihood, and therefore allows for an easy update of the parameters of the prior in order to obtain the posterior.

Algorithm 2 Beta-Bernoulli Thompson Sampling
1: Initialization - Choose prior distribution parameters $(\alpha, \beta)$
2: for $t \in \{1, 2, \dots\}$ do
3:   for $k \in \{1, \dots, K\}$ do
4:     Sample $p_k \sim \text{Beta}(\alpha_k, \beta_k)$
5:   end for
6:   $x_t \leftarrow \arg\max_k(p_k)$
7:   Pull arm $x_t$ and observe reward $r_t \in \{0, 1\}$
8:   $(\alpha_{x_t}, \beta_{x_t}) \leftarrow (\alpha_{x_t} + r_t,\ \beta_{x_t} + 1 - r_t)$
9: end for

In this simple case, the only step that differentiates this algorithm from a greedy MAB one is step 4. A greedy approach would estimate the parameter $p_k$ in a frequentist way ($p_k \leftarrow \frac{\alpha_k}{\alpha_k + \beta_k}$), ignoring the uncertainty around it. The fact that the algorithm instead samples this parameter allows it to take the uncertainty into account in the long run, better balancing its exploration choices while still giving an edge to the best-performing arms as time goes on.
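As an illustration, the following is a minimal Python sketch of Algorithm 2, assuming a hypothetical function pull(k) that returns the 0/1 reward of arm k:

import numpy as np

def thompson_bernoulli(pull, K, T, alpha0=1.0, beta0=1.0):
    """Beta-Bernoulli Thompson Sampling over K arms for T rounds."""
    alpha = np.full(K, alpha0)
    beta = np.full(K, beta0)
    for _ in range(T):
        p = np.random.beta(alpha, beta)   # one posterior sample per arm
        k = int(np.argmax(p))             # step 6: act greedily on the samples
        r = pull(k)                       # step 7: observe the binary reward
        alpha[k] += r                     # step 8: conjugate Beta update
        beta[k] += 1 - r
    return alpha, beta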

This algorithm can be generalized to a broad array of problems beyond the Bernoulli bandit, with more complex parameter spaces and non-binary rewards. In a more general setting, for every action $x_t$ selected by the agent at time $t$, an outcome $y_t$ is observed, coming from some distribution $q_\theta(\cdot|x_t)$. The agent then receives a reward $r_t = r(y_t)$, where $r$ is a known function. Given the observations, the algorithm updates its posterior distribution $p$ on the parameter $\theta$, and at every round the choice of the arm happens by maximising the expected reward over the likelihood $q_\theta(\cdot|x_t)$, where $\theta$ is sampled from the posterior distribution.

Algorithm 3 Generic K-arms Thompson Sampling
1: Initialization - Choose prior distribution $\pi(\theta)$
2: for $t \in \{1, 2, \dots\}$ do
3:   for $k \in \{1, \dots, K\}$ do
4:     Sample $\theta_k \sim \pi$
5:   end for
6:   $x_t \leftarrow \arg\max_k \mathbb{E}_{q_{\theta_k}}[r(y_t) \mid x = x_t]$
7:   Pull arm $x_t$, observe outcome $y_t$ and compute reward $r_t$
8:   $\pi \leftarrow P(\theta \mid x_1, y_1, \dots, x_t, y_t)$
9: end for

As previously mentioned, this algorithm has very good regret-minimizing properties, as it balances its choices very well. Of course, given its Bayesian nature, it is dependent on the choice of the prior distribution, particularly in the early runs, when there is not much evidence from rewards.

3.6 Best Arm Identification Hyperband (BAIHB)

In this section we will finally put together all the pieces presented in this chapter and propose a new algorithm for automatic Hyperparameter Optimization for iterative machine learning algorithms, one that builds on the strengths of Hyperband and tries to obviate some of its shortcomings.

As the name suggests, BAIHB is based on two parts - a best arm identification (multi armed bandit) setting and the Hyperband algorithm itself. It is a framework more than an algorithm, as it can be set up differently depending on the optimization procedure that is used within the Multi Armed Bandit part. As explained in section 3.2.3, the intuition is simple - discretize the learning rate parameter into separate intervals and run Hyperband separately on each one, while the Multi Armed Bandit part has the goal of choosing the arms in a way that maximises the probability of getting a good configuration. A small change was however made to the Hyperband algorithm itself: as suggested by Falkner and Klein in [10], I removed the last bracket of Hyperband, which is all in all equivalent to a step of Random Search. The reason behind that last Random Search step was to avoid penalizing good runs that had a slow learning curve, but given that our adaptation was conceived and structured to solve this very problem, it only ended up slowing everything down.

Algorithm 4 Meta algorithm for BAIHB
1: Split learning rate interval into non-overlapping subintervals
2: Evaluate which interval (arm) to use next based on past runs (BAI)
3: Run Hyperband on selected interval and store results (HB)
4: Repeat steps 2-3 until budget is exhausted

In general, Hyperband needs to generate hyperparameter configurations on the whole configuration space, which is the Cartesian product of the single intervals where each hyperparameter is defined. Running Hyperband on a sub-interval simply means generating the configurations on the same space, but using the sub-interval of the learning rate in the Cartesian product. Also, Hyperband evaluates several configurations at different budgets, but only the full-budget evaluations are stored and fed as rewards to the Multi Armed Bandit part of the optimization.
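To make this concrete, the following is a minimal Python sketch of the idea (the hyperparameter names and ranges are illustrative): the learning rate is drawn from the chosen sub-interval, while every other hyperparameter is drawn from its full range.

import math
import random

def sample_configuration(lr_subinterval):
    """Draw one configuration from the usual space, except that the
    learning rate is restricted to the given sub-interval."""
    lo, hi = lr_subinterval
    return {
        "learning_rate": 10 ** random.uniform(math.log10(lo), math.log10(hi)),
        "max_depth": random.randint(2, 12),        # full range, as usual
        "subsample": random.uniform(0.5, 1.0),     # full range, as usual
    }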

The Multi Armed Bandit part takes up step 2 of the meta algorithm presented above and can be designed in different ways, as mentioned in sections 3.3 and 3.5. In the following paragraphs, we will present two different possible designs, detailing the acquisition procedure.


3.6.1 Epsilon greedy BAIHB

The epsilon greedy version of the algorithm is of course the simpler of the two, and it is quite straightforward. Please note that in this subsection we will use the terms "outcome" and "reward" interchangeably; they both represent the validation score obtained by fully trained configurations coming from a run of the Hyperband algorithm on a given learning rate interval.

After an initial round of pure exploration, needed to obtain at least one reward per interval, the best arm is chosen based on past outcomes. In the classic Bandit literature this is done by choosing the arm with the highest average outcome, as the goal is to maximise the total cumulative reward; however, since our goal is to retain a small set of "best" configurations, we will choose the arm that has the best α-percentile mean (i.e. the mean of the top α outcomes). The next run of Hyperband is then executed on the best performing interval with probability 1 − ε, or on one of the others, chosen at random, with probability ε.

This works quite well, as very little computational overhead is added, and the greediness of the algorithm can be tweaked through the parameter ε. Also, this algorithm makes no assumption on the distribution of the outcomes or the parameters it estimates, and therefore it does not introduce any bias. Of course, as previously discussed, a fixed ε either explores for too long if the parameter is big, or risks getting stuck in a local optimum, and it is challenging to find the appropriate value. However, this should still be sufficient to improve performance over the original Hyperband algorithm.
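A minimal Python sketch of this acquisition step follows (the helper names and the value of α are hypothetical; outcomes[k] is assumed to hold the list of past full-budget scores for interval k):

import random
import numpy as np

def tail_average(scores, alpha=0.2):
    """Mean of the top alpha fraction of past outcomes (alpha-percentile mean)."""
    top = sorted(scores, reverse=True)[: max(1, int(alpha * len(scores)))]
    return float(np.mean(top))

def next_arm(outcomes, eps=0.1):
    """Epsilon-greedy choice of the next learning rate interval."""
    best = int(np.argmax([tail_average(s) for s in outcomes]))
    if random.random() < 1 - eps:
        return best                        # exploit with probability 1 - eps
    return random.choice([k for k in range(len(outcomes)) if k != best])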

3.6.2 TS BAIHB

The application of the Thompson Sampling algorithm to Hyperband bears only a few differences from the generic TS algorithm presented in the previous section. In particular, we will define this application by specifying the distributions used both for the outcomes and for the parameters.


Algorithm 5 Epsilon Greedy BAIHB
Split the learning rate interval into K sub-intervals.
for $t \in \{1, 2, \dots\}$ do
  $u_t \sim \text{Uniform}(0, 1)$
  if $u_t < 1 - \varepsilon$ then
    $x_t \leftarrow \arg\max_k(\alpha_k)$
  else
    $x_t \sim \text{RandInt}(1, K)$
  end if
  Run Hyperband on interval $x_t$, observe outcomes $Y_t = (y_{1,t}, y_{2,t}, \dots)$
  $\alpha_{x_t} \leftarrow$ tail average of $\{Y_i \mathbb{1}\{x_i = x_t\}\}_{i=1..t}$
end for

These distributions can of course be modified, and we will discuss this in a following section. Also, please bear in mind that, as opposed to the previous subsection, the terms "reward" and "outcome" are not interchangeable here, as rewards will be a transformation of outcomes.

We will use this algorithm with a Normal likelihood and a Normal-IG prior, because of the ease of computation that they allow. We therefore assume that, for each arm, the outcomes $y_t$ of our Hyperband runs will be approximately normally distributed, with parameters $\theta = (\mu, \sigma^2)$, where $\mu|\sigma^2 \sim N(\mu_0, \sigma^2/n_0)$ and $\sigma^2 \sim \text{InverseGamma}(\nu_0/2, \nu_0\sigma_0^2/2)$. The fixed hyperparameters $(\mu_0, n_0, \nu_0, \sigma_0^2)$ must be chosen to define the amount of information that the prior distribution bears, and as such are dependent on the specific application, in particular on the score that will be returned by Hyperband: for example, using the log-likelihood as score will require different values than using the mean squared error. In general, a good approach is to train the algorithm with a couple of random configurations and, based on the scores obtained, set up values for the prior that roughly mirror the expected behaviour without inserting too much information, or simply set up


fully uninformative values. Once the setup is done, the Thompson Sampler starts. For every interval, we sample $(\mu, \sigma^2)$ from the respective distributions and compute the expected reward given these values. In order to express our interest in the extreme values of the distribution rather than its mean, we use the Improvement as our reward function, i.e. $r(y_t, y^\star) = \max(y_t - y^\star, 0)$. This means that what we end up maximizing is the Expected Improvement, a known quantity in the literature that we already introduced in the chapter about Bayesian Optimization, and that has an easy closed form in the case of Normal distributions [derivation in Appendix A]:

$$\mathbb{E}[\max(y_t - y^\star, 0) \mid (\mu, \sigma^2)] = \sigma\,\phi\!\left(\frac{y^\star - \mu}{\sigma}\right) + (\mu - y^\star)\,\Phi\!\left(-\frac{y^\star - \mu}{\sigma}\right)$$

Once the arm with the highest Expected Improvement is selected, Hyperband is run on the related interval and the outcomes are stored. All the outcomes relative to a given interval are then used to compute the posterior distribution of its parameters $(\mu, \sigma^2)$ through the update rules that we saw in section 3.4.

Compared to the greedy version, this algorithm adds a bit more computational overhead to Hyperband. Between every two Hyperband runs, the model has to sample 2K values (K being the number of arms), compute the Expected Improvement for each interval, select the maximum and update the posterior distribution. Nevertheless, there is no reason to use a particularly high value of K, and therefore the added computational cost is still minuscule compared to the one usually needed for the training of the machine learning models that we are optimizing. Moreover, contrary to its greedy version, this setup should automatically balance exploration and exploitation, resulting in better performance in the long run.
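To make the acquisition step concrete, here is a minimal Python sketch of one round of arm selection, assuming numpy and scipy and a hypothetical per-arm list of posterior hyperparameters (the outcome storage and the conjugate update are as in section 3.4):

import numpy as np
from scipy import stats

def ts_select_arm(posteriors, y_star):
    """posteriors: per-arm tuples (mu1, n1, nu1, sigma1_sq) of the current
    Normal-InverseGamma hyperparameters. Samples (mu, sigma^2) for every
    arm and returns the one with the highest Expected Improvement."""
    ei = []
    for mu1, n1, nu1, sigma1_sq in posteriors:
        sigma_sq = stats.invgamma.rvs(a=nu1 / 2.0, scale=nu1 * sigma1_sq / 2.0)
        mu = np.random.normal(mu1, np.sqrt(sigma_sq / n1))
        sigma = np.sqrt(sigma_sq)
        z = (y_star - mu) / sigma
        ei.append(sigma * stats.norm.pdf(z) + (mu - y_star) * stats.norm.cdf(-z))
    return int(np.argmax(ei))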

3.7 Better choices of the distributions

After the last section on the Thompson Sampling BAIHB algorithm, any

reader with some statistical background and mathematical rigor at heart


Algorithm 6 TS BAIHB
Split the learning rate interval into K sub-intervals.
Initialize the parameters of the prior distributions $\mu_0, \sigma_0^2, \nu_0, n_0$.
for $t \in \{1, 2, \dots\}$ do
  for $k \in \{1, \dots, K\}$ do
    Sample $\sigma_k^2 \sim IG(\nu_{1,k}/2,\ \nu_{1,k}\sigma_{1,k}^2/2)$
    Sample $\mu_k | \sigma_k^2 \sim N(\mu_{1,k},\ \sigma_k^2/n_{1,k})$
    $EI_k \leftarrow \sigma_k\,\phi\!\left(\frac{y^\star - \mu_k}{\sigma_k}\right) + (\mu_k - y^\star)\,\Phi\!\left(-\frac{y^\star - \mu_k}{\sigma_k}\right)$
  end for
  $x_t \leftarrow \arg\max_k EI_k$
  Run Hyperband on interval $x_t$, observe outcomes $Y_t = (y_{1,t}, y_{2,t}, \dots)$
  if $\max_i(y_{i,t}) > y^\star$ then
    $y^\star \leftarrow \max_i(y_{i,t})$
  end if
  Update $\mu_{1,x_t}, \sigma_{1,x_t}^2, \nu_{1,x_t}, n_{1,x_t}$ based on $Y_t$
end for


will be left cringing in disgust. Indeed, I opted for the normal distribution to represent the likelihood of the outcomes obtained by Hyperband. However, these results come in the form of regression or classification scores - $R^2$, Root Mean Squared Error, Logarithmic Loss, Accuracy, Area Under the Curve - which are all bounded quantities, unlike Gaussian random variables.

The main reasons behind this mathematical murder are computational speed and simplicity. Our algorithm needs to sequentially compute posterior distributions multiple times, and it is based on an optimization procedure (Hyperband) whose speed is one of the main reasons for its competitiveness. Using a custom pair of distributions for likelihood and prior would most likely mean having to compute posterior distributions with a Markov Chain Monte Carlo sampler, making the computational overhead of our algorithm explode. Moreover, the Bayesian step of the algorithm has the goal of giving an idea of the mean and the variance of the outputs, helping discern good intervals from bad ones, and the fact that the Gaussian distribution is R-valued does not prevent it from giving us useful information about these quantities. Finally, some distributions that could be used as more "proper" likelihoods, such as the Beta or the Gamma, have the Gaussian as their limiting distribution [16], which suggests that in the long run our approximation does not fall too far from the real values. Another point is that it is important to consider the environment in which this algorithm was created. As much as Data Science and Machine Learning are based on mathematics and statistics, the everyday life of practitioners is often very far from the formulas and the rigor that the academic world requires. The conception and the implementation of this algorithm came, as briefly explained in the introduction, in a work environment, with the goal of speeding up the existing procedures. For this reason, it was better to have an algorithm that could easily adapt to any score metric, rather than a mathematically sound one that would need major adjustments for every new problem.

Nevertheless, I will briefly discuss a possible alternative choice of distributions that could be part of future developments. We know from the literature that it is possible to find a conjugate prior for any likelihood belonging to the exponential family [9]. This means that for any of these distributions, we know explicitly the shape of the prior distribution and the update rule. However, for known distributions such as the Gamma and the Beta, the resulting prior is not a known distribution, and the challenge is finding an efficient sampling strategy, possibly through Importance Sampling or Accept-Reject Sampling, that allows obtaining values from these distributions without slowing down the optimization process excessively.


Chapter 4

Implementation and Results

4.1 Introduction

This chapter contains one of the main contributions of this thesis: it shows the results of a practical implementation of the new algorithm, comparing its performance to the original version of Hyperband and to Random Search. We will first go over the data and the machine learning algorithm that were used, and we will then discuss the design of the experiment and its setup process. Finally, we will analyze the results and discuss possible future improvements to the work.

4.2 Dataset

Given the confidentiality of its content, it was unfortunately not possible to retain the original dataset that I used to train and develop the algorithm, nor can I extensively discuss its content.

As an alternative, I decided to test the application on a toy dataset instead. As it was not easy to find a dataset with properties similar to the original, I turned to Kaggle [1]. There I found a good dataset, both in terms of size and complexity, allowing for a useful application of hyperparameter optimization. The dataset contains anonymized information about around 75 thousand clients of Santander, a Spanish credit institution, described by more than 360 predictors, with the goal of predicting their satisfaction with the bank through a binary variable. Before the application, the data was cleaned, normalized, and transformed by taking the principal components. No other operation on the data was needed, nor possible, since all variables were anonymized and had no explicit meaning.
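As a rough sketch of this preprocessing step (the file name, column names and retained-variance threshold below are hypothetical, not those of the actual experiment):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("santander_train.csv")        # hypothetical local file name
X = df.drop(columns=["ID", "TARGET"])          # hypothetical column names
y = df["TARGET"]

# normalize, then project onto the principal components
X_pca = PCA(n_components=0.99).fit_transform(StandardScaler().fit_transform(X))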

4.3 Machine Learning Algorithm

In order to perform classification on the data, we will use one of the most common and effective algorithms in the Machine Learning literature: XGBoost [7]. XGBoost is an ensemble algorithm that makes use of boosting, creating an additive model by sequentially fitting Classification and Regression Trees (CART) to the residuals of the previous ones [13]. This algorithm is known to be one of the best performing ones for classification and regression tasks on structured data, and has been used continuously, and with great results, in machine learning and data science competitions since its publication. Its performance is however strongly dependent on multiple hyperparameters [2]:

• Number of trees - the maximum number of trees that are grown additively
• Learning rate - drives the size of the update steps of the iterative procedure
• Depth - indicates the number of binary splits that each tree makes. Higher numbers imply deeper, more complex and accurate trees, with a higher risk of overfitting.
• Subsample ratio - indicates the percentage of the training set that is used when building each tree. Having a subsample lower than 1 can reduce the correlation between trees and therefore improve the ensembling score, in a procedure similar to bagging.
• Column sample ratio - the ratio of predictors that is sampled when building each tree. It can have an effect similar to the data subsample, reducing the correlation between trees.
• Gamma - the minimum loss reduction needed to make a new split in the trees. It is a regularization parameter that can be used to avoid growing unnecessarily deep trees.
• Alpha and Lambda - L1 and L2 regularization coefficients on the weights, similar to those of Lasso and Ridge regressions.

While the number of trees is used as the budget by our HPO algorithm, the other parameters need to be sampled. The learning rate and the depth will be sampled on a logarithmic scale, as changes to these parameters have little impact when their values are high; in terms of predictive power, a 15-split-deep tree and a 17-split-deep tree will be almost equivalent, while the same cannot be said of trees of depth 2 and 4. The sampling rates and the regularization coefficients, on the other hand, will be sampled on a uniform scale, but with non-zero probability on their default value. What this means in practice is that, for example, the data subsample rate will be drawn uniformly between 0.5 and 1 half of the time, and will be set to 1 the other half of the time. The reason for this is that these parameters are not always necessary, and it is therefore a good idea to also test plain models with no subsampling, no regularization or only partial regularization. Finally, in order to avoid overfitting, we will train the model through cross validation, stopping the training early when no improvement is made with respect to the Area Under the ROC Curve, and then score the configuration's performance on a validation set.
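The following is a minimal Python sketch of this sampling scheme (the exact ranges, apart from the learning rate interval defined later, are illustrative):

import numpy as np

def sample_xgb_config(rng=np.random):
    """One draw from the sampling scheme described above."""
    def spiked(default, lo, hi):
        # default value half of the time, uniform draw the other half
        return default if rng.rand() < 0.5 else rng.uniform(lo, hi)
    return {
        "learning_rate": 10 ** rng.uniform(np.log10(0.005), np.log10(0.3)),
        "max_depth": int(round(2 ** rng.uniform(1.0, 4.0))),  # log scale, ~2..16
        "subsample": spiked(1.0, 0.5, 1.0),
        "colsample_bytree": spiked(1.0, 0.5, 1.0),
        "gamma": spiked(0.0, 0.0, 5.0),
        "reg_alpha": spiked(0.0, 0.0, 1.0),
        "reg_lambda": spiked(1.0, 0.0, 1.0),
    }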


4.4 Experiment design

The design of the experiment is quite simple. The three algorithms ran for a comparable amount of time, with the goal of optimizing the validation results of XGBoost on the described dataset. For each of them, we will compare the scores of the fully trained configurations and their distributions, showing their similarities and differences. For BAIHB, I decided to test the Thompson Sampling version, as it is the most interesting from both an academic and an implementation perspective.

In the following subsections we will detail the setup of each individual algorithm.

4.4.1 Hyperband

The setup of Hyperband, which is also needed for BAIHB, requires choosing the budgets and the exploration coefficient η. For the latter I kept the standard value of η = 3 suggested by Li in the original paper, as it has been shown to be the best performing one in most standard situations. This value implies that at every step of Successive Halving we keep one third of the configurations and triple the budget. In order to set the minimum and maximum budgets, the easiest thing to do is to reason in terms of equivalences. Hyperband is most intuitive when using a budget that ranges from 1 to a power of η, so in this case I set an equivalence of 1 budget unit to 30 boosting iterations. This means that the models will first be compared after growing 30 trees, so as to avoid the early training phases where the score can be particularly volatile. At the same time, I set a maximum budget $R = \eta^4 = 81$, which equates to 2430 boosting iterations. This value has been explicitly chosen to be very high, so as to allow slow growing configurations to reach full training while only partially affecting fast ones, thanks to the use of early stopping.
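As a quick worked check of this setup (a minimal sketch, with the budget unit defined as above):

eta = 3
unit = 30                          # 1 budget unit = 30 boosting iterations
R = eta ** 4                       # maximum budget of 81 units
budgets = [unit * eta ** i for i in range(5)]
print(budgets)                     # [30, 90, 270, 810, 2430]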


4.4.2 Random Search

The setup of Random Search is rather trivial, as we have already defined the sampling strategy for all the parameters in the section about XGBoost. The only variable left undefined is the number of trees; in order to make the results comparable to Hyperband's, Random Search will test configurations on the equivalent of the maximum budget, which is 2430 trees, while using early stopping.

4.4.3 Thompson Sampling BAIHB

The preparation phase for the Thompson Sampling BAIHB consists of three parts:

• The setup of Hyperband, which is identical to the one explained earlier
• The setup of the Multi Armed Bandit via the discretization of the space
• The choice of the parameters for the prior distributions of the Thompson Sampler

For the Multi Armed Bandit part I chose to use only three arms, and therefore three sub-intervals in the learning rate domain [0.005, 0.3]. Indeed, the domain in this application spans three orders of magnitude, so three subintervals should be enough to make the learning rates comparable while avoiding overparametrizing the problem. As previously defined, the split points for the interval are taken uniformly on the logarithmic scale (a short sketch of this computation follows the list), which gives us the following:

• Arm 1 - η in [0.005, 0.020]
• Arm 2 - η in [0.020, 0.077]
• Arm 3 - η in [0.077, 0.3]
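A minimal sketch of how these split points can be obtained (numpy assumed):

import numpy as np

lo, hi, K = 0.005, 0.3, 3
edges = np.logspace(np.log10(lo), np.log10(hi), K + 1)
print(np.round(edges, 3))   # approximately [0.005, 0.02, 0.077, 0.3]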


As for the parameters of the prior distributions, I followed a rather heuristic approach. First, I set $\mu_0 = 0.8$ by running a couple of low budget evaluations of the model and seeing where they landed. This keeps the prior mean from being too far from the actual evaluations, which could positively or negatively influence the convergence time of the posterior.

Second, I set $n_0 = 1$. The reason for this is that $n_0$ can be seen as an "equivalent prior sample size", and therefore by selecting a low value we make sure that our posterior will take most of its information from the data, even with only a handful of samples.

Finally, I set $\nu_0 = 1$ and $\sigma_0^2 = 0.005$ by looking at the predictive distribution (see section 3.4). In particular, by setting this value for $\sigma_0^2$, I make sure that most of the density of the predictive distribution lies in [0.5, 1], which is the domain of the AUC score that we are maximizing.

Figure 4.1: Initial predictive distribution given prior parameters

4.4.4 Number of equivalent evaluations

In order to compare the results of the three algorithms, we need them to run for a comparable amount of time. Since the code was executed in different conditions and on different machines, the easiest way to compare the computational effort is to ignore the overhead generated by BAIHB and Hyperband and compare the three algorithms based on the total number of boosting iterations they perform. When doing this, we also take into consideration the fact that our implementation of BAIHB and Hyperband allows saving and loading of partially trained models. Summing up, a single run of Hyperband uses (ignoring early stopping) 42120 boosting iterations, while BAIHB stops at 29970. Since for this application I decided to run 25 iterations of BAIHB, the equivalent was 18 runs of Hyperband and 310 Random Search iterations.

4.5 Results

In this section, we will walk through the results of the three algorithms in order of increasing complexity.

Figure 4.2 shows the results of the run of 310 Random Search evaluations, plotted as AUC score vs learning rate. We can see that the distribution of the results in terms of AUC score is quite scattered and slightly bimodal, with most configurations settling between 0.80 and 0.82. On the other axis, unsurprisingly, configurations are logarithmically distributed and concentrate around small values.

Figure 4.3 instead shows the results for Hyperband. Here, the 180 final configurations are distinguished by colour, which represents the different brackets that are run within the algorithm. The first bracket (blue) contains the configurations that were first run at very low budget (30 trees) and then filtered up to full budget, while at the opposite end of the spectrum is the fifth one (grey), which contains configurations that were tested directly at full budget. We can see that the marginal distributions of the fifth bracket, both for the learning rate and the score, are very similar to the ones found in the previous figure for the Random Search case (although they look squeezed in the density plot because of the other distributions), as one would expect.


Figure 4.2: Scatterplot and Kernel Density Estimates for 300 configurations sampled through Random Search. Learning rate is on the X-axis, and AUC score is on the Y-axis.

As we move up the brackets, we see the behaviour described in section 2.2.2: samples tend to have a better distribution in terms of loss, since they come from rounds of exploration and successive halving, but they also tend to be biased towards higher learning rates. In particular, for the first two brackets, we see that there is just one configuration sampled between learning rates of 0.005 and 0.05, which is instead where most of our points should be. Nevertheless, we can see how the algorithm returns better configurations than Random Search: except for those of its fifth bracket, very few results have an AUC lower than 0.83.

Finally, we can compare these previous results with the distributions obtained by the newly adapted algorithm, BAIHB, which are depicted in Figure 4.4. Here, since the algorithm forced Hyperband to run separately on the different subintervals of the learning rate, we can see that the distributions of our configurations are much more stable through the brackets, with a more thorough exploration of the configuration space.


Figure 4.3: Scatterplot and Kernel Density Estimates for 180 configurations sampled through the Hyperband algorithm. Learning rate is on the X-axis, and AUC score is on the Y-axis. Configurations are colored differently depending on the bracket they were sampled from.

Moreover, the lack of a fifth bracket (equivalent to Random Search in Hyperband) allows more budget to be allocated to the lower brackets, which show better distributional properties in both Figure 4.3 and Figure 4.4. The 125 configurations are mostly concentrated between AUCs of 0.83 and 0.84, and although the overall sampling distribution for η is not as skewed as the original logarithmic one of Random Search, it is still a big improvement over Hyperband's.

In terms of performance comparison, our algorithm's worst-case scenario would be to perform like Hyperband without the last bracket. In this particular application, our algorithm outperforms Hyperband both in terms of the best configuration found and in terms of consistency, as its ranked results are better all around, as shown in Table 4.1. Compared to Random Search, we can see that it is the latter that finds the best results, with two configurations scoring over 0.8386. However, this only paints a partial picture.

Page 61: New A discrete Bandit approach to continuous Hyperparameter … · 2020. 2. 21. · known as Hyperparameter Optimization, and it is a growing eld of research due to the exploding

Figure 4.4: Scatterplot and Kernel Density Estimates for 125 configurations sampled

through Best Arm Identification Hyperband algorithm. Learning rate is on the X-axis,

and AUC score is on the Y-axis. Configurations are colored differently depending on

the bracket they were sampled from

Indeed, we can see how the score of the configurations found by Random Search decreases quickly as we go through them, while the ones found by BAIHB are consistently good. This has three important upsides. First of all, it implies that, on average, the algorithm will find a good configuration faster, as most of its configurations have high scores (which is why Hyperband is popular in the first place). Secondly, it allows for more choice in case of further ensembling or averaging, which can grant a more stable performance on a test set. Finally, it is an indicator of consistency: since HPO has a big stochastic component, it is preferable to have an algorithm whose good results are stable and do not come from a statistical fluke. This is particularly important in the case of more complex optimization problems where good configurations are harder to find, as was the case for the original dataset this algorithm was developed on.


Figure 4.5: Comparison between posterior predictive distributions for Arms 2 and 3

Configuration   RS        HB        BAIHB
1st             0.838697  0.838324  0.838573
2nd             0.838674  0.838280  0.838539
3rd             0.838264  0.838162  0.838448
5th             0.837980  0.838030  0.838297
10th            0.837804  0.837781  0.838018
20th            0.836917  0.837235  0.837404
50th            0.832810  0.835362  0.836196

Table 4.1: Best performing configurations (AUC score) for each algorithm

Lastly, I will analyze the contribution of Thompson Sampling to the BAIHB algorithm.

By the end of the 25 runs, the three arms corresponding to the subintervals [0.005, 0.020], [0.020, 0.077], and [0.077, 0.300] had been pulled respectively 8, 7 and 10 times. However, this difference does not show in the posterior distributions, where the arms are not statistically distinguishable.


Figure 4.6: Comparison between prior and posterior predictive distribution for Arm 2

Indeed, we can see in Figure 4.5 that the predictive distribution for Arm 3, which has a sample size of 50, is almost the same as that of Arm 2, which has a sample size of 35. The main reason for this is that, given the simplicity of our problem, XGBoost ended up generalizing well for all levels of the learning rate, not showing any particular difference in performance. This problem is of course also dependent on the number of sampled configurations, and going deeper into the BAIHB exploration might have helped reduce the variance of the predictive distributions and therefore make them more separable. This is an average scenario, in which our Bandit algorithm is not able to discern between the performances of the arms and therefore keeps working in a purely exploratory way. However, we can expect Thompson Sampling to have a bigger impact on longer runs or on more complex problems where the learning rate plays a bigger role.

4.6 Summary and Final Considerations

In this chapter, we have seen an application of the BAIHB algorithm and compared its performance to that of Random Search and traditional Hyperband. On the toy dataset used, our algorithm showed better distributional properties in terms of learning rate compared to Hyperband, while managing to obtain more stable results than Random Search, sampling good configurations faster and in a more reliable manner. However, Thompson Sampling proved to be inefficient (or rather neutral) in this application, not bringing any additional value. When running on a single machine, therefore, we can expect the effect of the Multi Armed Bandit part to be either neutral or positive, depending on the length of the experiment and the difference in performance between the arms. When running HPO on multiple parallel machines, however, it is important to notice that the MAB setting makes the algorithm inherently sequential. This can, of course, be a big disadvantage, as Hyperband is instead an embarrassingly parallel algorithm whose speed-up is linearly related to the number of available machines. A partial solution could be to allow our Bandit to make multiple pulls at the same time, with each pull being handled by a different machine. This solution, if developed, could greatly speed up the learning process and would be quite advantageous in settings with many arms and long HPO times. Finally, another functional, although less fancy, solution would be to simply discretize the learning rate and run Hyperband in parallel on different machines on different sub-intervals, avoiding the communication overhead introduced by the MAB setting.


Chapter 5

Conclusions and future work

In this document, we have presented the concept of Hyperparameter Optimization for Machine Learning algorithms and described some of the most common approaches to it. We then described the details of a high performing algorithm, Hyperband, discussing some of its weaknesses in dealing with iterative Machine Learning algorithms. Finally, based on said weaknesses, we have proposed a simple solution called Best Arm Identification Hyperband that aims at extending Hyperband while maintaining its speed, performance and simplicity. In an optimization example with a real-life dataset, the algorithm was able to consistently sample good performing hyperparameter configurations, making it a valid alternative to Hyperband and Random Search. At the same time, the algorithm is simpler and more interpretable than most advanced HPO methods based on Bayesian statistics, in particular regarding its setup.

This document still leaves various alternatives opened by this algorithm unexplored. First of all, I did not have the time to carry out thorough testing of the algorithm on standard benchmark optimization datasets, which would be a more definitive indicator of performance. Also, it would be interesting to see adaptations of this model to other iterative algorithms such as Neural Networks, to different Multi Armed Bandit algorithms, or to different statistical distributions for the Thompson Sampler. Finally, as mentioned in the last chapter, it could be useful to develop a version of this algorithm that could deal with multiprocess computation without relying exclusively on the parallelization internal to the Machine Learning model used. This would not only grant faster performance but also yield an algorithm which is more general and therefore more valuable for the Data Science community.


Bibliography

[1] Kaggle. http://www.kaggle.com. Accessed: 2019-10-30.

[2] Xgboost parameters.

[3] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 2546–2554, USA, 2011. Curran Associates Inc.

[4] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305, February 2012.

[5] Sebastian Burhenne, Dirk Jacob, and Gregor Henze. Sampling based on Sobol' sequences for Monte Carlo techniques applied to building simulations. pages 1816–1823, 01 2011.

[6] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011.

[7] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.

[8] Yann N. Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated adaptive learning rates for non-convex optimization, 2015.

[9] Persi Diaconis and Don Ylvisaker. Conjugate priors for exponential families. The Annals of Statistics, 7, 03 1979.

[10] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 1436–1445, July 2018.

[11] Matthias Feurer and Frank Hutter. Hyperparameter optimization. In Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors, AutoML: Methods, Systems, Challenges, chapter 1, pages 3–37. Springer, December 2018. To appear.

[12] Peter I. Frazier. Tutorial: Optimization via simulation with Bayesian statistics and dynamic programming. In Proceedings of the 2012 Winter Simulation Conference (WSC), pages 1–16, 2012.

[13] Jerome H. Friedman. Stochastic gradient boosting. Comput. Stat. Data Anal., 38(4):367–378, February 2002.

[14] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization, 2011.

[15] R. L. Iman, J. M. Davenport, and D. K. Zeigler. Latin hypercube sampling (program user's guide). [LHC, in Fortran]. 1 1980.

[16] Mark E. Irwin. Convergence in distribution, central limit theorem, 2006.

[17] Simon Jackman. Bayesian Analysis for the Social Sciences. Wiley, 2009.

[18] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization, 2015.

[19] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff G. Schneider, and Barnabás Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In ICML, 2017.

[20] Jaya Kawale, Hung Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. 12 2015.

[21] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. CoRR, abs/1605.07079, 2016.

[22] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with Bayesian neural networks. In ICLR, 2017.

[23] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'95, pages 1137–1143, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[24] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[25] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Efficient hyperparameter optimization and infinitely many armed bandits. CoRR, abs/1603.06560, 2016.

[26] Ilya Loshchilov and Frank Hutter. CMA-ES for hyperparameter optimization of deep neural networks. CoRR, abs/1604.07269, 2016.

[27] Michinari Momma and Kristin P. Bennett. A Pattern Search Method for Model Selection of Support Vector Regression, pages 261–274.

[28] Douglas C. Montgomery. Design and Analysis of Experiments. John Wiley & Sons, Inc., 2013.

[29] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2008.

[30] Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Appl. Stoch. Model. Bus. Ind., 26(6):639–658, November 2010.

[31] Dan Simon. Evolutionary Optimization Algorithms. Wiley-Blackwell, 2013.

[32] Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw Bayesian optimization, 2014.

[33] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[34] Xinhua Zhang, Thore Graepel, and Ralf Herbrich. Bayesian online learning for multi-label and multi-variate performance measures. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 956–963, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.


Appendix A

Appendix

A.1 Derivation of Expected Improvement for Gaussian random variables

Given a Gaussian random variable $X$ and some fixed value $x^\star$, we want to compute the expected value of the improvement function $\max(X - x^\star, 0)$. Let us call $Z$ a standard Gaussian random variable, so that $X \sim \sigma Z + \mu$.

$$
\begin{aligned}
\mathbb{E}[\max(X - x^\star, 0) \mid (\mu, \sigma)]
&= \int_{\mathbb{R}} (\sigma z + \mu - x^\star)\, \mathbb{1}\{\sigma z + \mu - x^\star > 0\}\, \phi(z)\, dz \\
&= \int_{\frac{x^\star - \mu}{\sigma}}^{\infty} \sigma z\, \phi(z)\, dz + \int_{\frac{x^\star - \mu}{\sigma}}^{\infty} (\mu - x^\star)\, \phi(z)\, dz \\
&= \sigma \int_{\frac{x^\star - \mu}{\sigma}}^{\infty} \frac{1}{\sqrt{2\pi}}\, z\, e^{-\frac{z^2}{2}}\, dz + (\mu - x^\star) \int_{\frac{x^\star - \mu}{\sigma}}^{\infty} \phi(z)\, dz \\
&= \frac{\sigma}{\sqrt{2\pi}} \left[-e^{-\frac{z^2}{2}}\right]_{\frac{x^\star - \mu}{\sigma}}^{\infty} + (\mu - x^\star)\, \Phi\!\left(-\frac{x^\star - \mu}{\sigma}\right) \\
&= \sigma\, \phi\!\left(\frac{x^\star - \mu}{\sigma}\right) + (\mu - x^\star)\, \Phi\!\left(\frac{\mu - x^\star}{\sigma}\right)
\end{aligned}
\tag{A.1}
$$


A.2 Original Hyperband algorithm

Hereafter is the original pseudo-code for the Hyperband algorithm, as published in [25].

Algorithm 7 Original HYPERBAND algorithm
input: $R$, $\eta$
initialization: $s_{max} = \lfloor \log_\eta(R) \rfloor$, $B = (s_{max} + 1)R$
for $s \in \{s_{max}, s_{max} - 1, \dots, 0\}$ do
  $n = \left\lceil \frac{B}{R} \frac{\eta^s}{s+1} \right\rceil$, $r = R\eta^{-s}$
  $T$ = GetHyperparameterConfiguration($n$)
  for $i \in \{0, \dots, s\}$ do
    $n_i = \lfloor n\eta^{-i} \rfloor$
    $r_i = r\eta^{i}$
    $L = \{\text{Evaluate}(t, r_i) : t \in T\}$
    $T = \text{TopK}(T, L, \lfloor n_i/\eta \rfloor)$
  end for
end for
return Configuration with the smallest intermediate loss

In this algorithm, three methods are used:

• GetHyperparameterConfiguration(n): samples and returns n random configurations to be evaluated
• Evaluate(t, r): runs the machine learning algorithm for configuration t with budget r, and returns its validation score
• TopK(T, L, k): given a set of tested configurations T and their respective scores L, returns the k best performing ones
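As a complement to the pseudo-code, the following is a minimal Python sketch of the algorithm under the same interface (here losses are minimized, and the bookkeeping of the best configuration is folded into the loop):

import math

def hyperband(get_config, evaluate, R=81, eta=3):
    """Minimal sketch of Algorithm 7. get_config() samples one random
    configuration; evaluate(t, r) trains configuration t with budget r
    and returns its validation loss (lower is better)."""
    s_max = int(math.log(R, eta) + 1e-9)       # floor of log_eta(R)
    B = (s_max + 1) * R
    best, best_loss = None, float("inf")
    for s in range(s_max, -1, -1):
        n = math.ceil(B / R * eta ** s / (s + 1))
        r = R * eta ** (-s)
        T = [get_config() for _ in range(n)]
        for i in range(s + 1):                 # inner Successive Halving loop
            n_i = n // eta ** i
            r_i = r * eta ** i
            scored = sorted((evaluate(t, r_i), j) for j, t in enumerate(T))
            if scored[0][0] < best_loss:       # track the best loss seen so far
                best_loss, best = scored[0][0], T[scored[0][1]]
            T = [T[j] for _, j in scored[: max(1, n_i // eta)]]  # keep top k
    return best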
