Gaussian processes: surrogate models for continuous black-box optimization
Lukáš Bajer
MFF UK, 04/2018
Contents
1 Optimization: continuous optimization; metaheuristics, black-box functions
2 Gaussian processes: Gaussian process prediction; Gaussian process covariance functions
3 Doubly trained Surrogate CMA-ES: CMA-ES; doubly trained Surrogate CMA-ES; experimental results
Optimization
optimization (minimization) is finding x^\ast \in \mathbb{R}^n such that

    f(x^\ast) = \min_{x \in \mathbb{R}^n} f(x)

a “near-optimal” solution is usually sufficient
[Figure: two example objective-function landscapes f(x) over x]
Continuous white-box optimization
also known as numerical optimization methods
requirements:
- gradients ∇f(x), which can be approximated by finite differences
- and sometimes also Hessians ∇²f(x)
1 gradient descent (1st order)
2 Newton's method (2nd order)
3 quasi-Newton methods (2nd order, approximated)
4 trust-region methods, conjugate gradients
1st order: gradient descent
iterative steps in the direction of the negative gradient:

    x^{(k+1)} = x^{(k)} - \sigma \nabla f(x^{(k)})

σ – step size; usually changes every iteration, adapted, for example, using a line search along the gradient direction
[Figure: gradient-descent iterates x_0, x_1, x_2, x_3, x_4 converging towards the minimum; source: (CC) Wikipedia]
pros:
- theoretically guaranteed to converge
- suitable even for large problems (deep neural networks, ...)
limitations:
- can be very slow, especially without momentum
- often stops much earlier due to round-off errors
source: (GNU) Wikipedia, author: P.A. Simionescu
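To make the update rule concrete, here is a minimal Python sketch with a fixed step size σ; the quadratic example and all names are illustrative, and the line-search adaptation mentioned above is omitted.

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.05, n_iters=100):
    """Iterate x_{k+1} = x_k - step * grad_f(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad_f(x)
    return x

# Example: minimize f(x) = x^T A x, whose gradient is 2 A x.
A = np.diag([1.0, 10.0])
x_min = gradient_descent(lambda z: 2 * A @ z, x0=[3.0, 2.0])  # -> close to [0, 0]
```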
2nd order: Newton’s method
take into account the second-order term of a Taylor expansion of f(x) around x^{(k)}:

    f(x^{(k)} + h) \approx q^{(k)}(h) = f(x^{(k)}) + h^\top \nabla f^{(k)} + \frac{1}{2} h^\top [\nabla^2 f^{(k)}] h

the next iterate is then x^{(k+1)} = x^{(k)} + h^{(k)}, where h^{(k)} minimizes q^{(k)}(h):

    x^{(k+1)} = x^{(k)} - \gamma [\nabla^2 f^{(k)}]^{-1} \nabla f^{(k)}
pros:
- very fast convergence on quadratic-like functions
limitations:
- needs the Hessian matrix to be computed and inverted, and is thus rarely usable in practice
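As a sketch of the update above: in practice one solves the linear system ∇²f · h = ∇f rather than inverting the Hessian explicitly. The quadratic example is illustrative.

```python
import numpy as np

def newton_step(grad_f, hess_f, x, gamma=1.0):
    """One (damped) Newton iteration; solves the Hessian system instead of inverting."""
    h = np.linalg.solve(hess_f(x), grad_f(x))
    return x - gamma * h

# On the quadratic f(x) = x^T A x, a single full step (gamma = 1) reaches the minimum.
A = np.diag([1.0, 10.0])
x_next = newton_step(lambda z: 2 * A @ z, lambda z: 2 * A, np.array([3.0, 2.0]))  # -> [0, 0]
```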
Quasi-Newton methods
the Hessian matrix ∇²f^{(k)} is not computed, only iteratively approximated by B^{(k)}, B^{(k+1)}, ...
the Hessians' inverses are often obtained directly, without explicit matrix inversion
BFGS
- the most successful method of this kind for the last three decades
- independently discovered by four (!) people in 1970: C. G. Broyden, R. Fletcher, D. Goldfarb and D. Shanno
- Hessian approximation updated via rank-two updates
- works even without derivatives (with finite differences)
- shown to behave well on a variety of (even multimodal) functions
- L-BFGS – a popular limited-memory version (Nocedal, 1980), available in every optimization package (Matlab, Python, ...)
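With SciPy, for instance, the BFGS variant is one flag away; when no gradient is supplied, it falls back to finite-difference approximations (the Rosenbrock test function below is just an example).

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):  # classic Rosenbrock test function
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# No gradient is passed, so BFGS approximates it by finite differences.
res = minimize(rosen, x0=np.array([-1.2, 1.0]), method="BFGS")
print(res.x)  # close to the optimum [1, 1]
```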
Other numerical optimization techniques
quadratic approximations: by far the most popular optimization technique
trust-region methods
- build quadratic approximations around the current point x^{(k)}
- minimize the model within the region of trust
NEWUOA, BOBYQA (M. J. D. Powell, 2004, 2009)
- construct the quadratic model using far fewer points than (n + 1)(n + 2)/2 by additionally minimizing a norm
- this saves time and enhances performance
conjugate gradients
- do not approximate Hessians
- conjugate vectors – a momentum guiding the search
- a cheaper alternative to quasi-Newton methods
Optimization of black-box functions
black-box functions
[Diagram: a black box f mapping an input x to the output f(x)]
only evaluations of the function value are available, no derivatives or gradients → no gradient methods applicable
we consider a continuous domain: x ∈ R^n
Optimization of empirical black-box functions
empirical function:
- the function value is assessed via an experiment (measuring, intensive calculation, evaluating a prototype)
- evaluating such functions is expensive (in time and/or money)
- search cost ∼ the number of function evaluations
Metaheuristics
Metaheuristic
- optimization techniques finding a sufficiently good solution
- treat the objective function as a black box
- sample a set of candidate solutions (the search space is often too large to be sampled completely)
- often nature-inspired
examples: particle swarm optimization, simulated annealing, ..., evolutionary computation (EA, GA, ES, ...)
EAs for empirical black-box optimization
what can help decrease the number of function evaluations:
- utilize already measured values (at least prevent measuring the same thing twice)
- learn the shape of the function landscape
- or learn the (global) gradient or step direction & size
[Diagram: the evolutionary-algorithm loop – evaluation of f(X), selection, crossover, mutation – converging to X*; source: (GNU) Wikipedia, author: Johann "nojhan" Dréo]
Model-based methods accelerating the convergence
several methods are used in order to decrease the number of objective function evaluations needed by EAs
1 Bayesian optimization (EGO)
2 Surrogate modelling
Bayesian optimization
Bayesian optimizer
Input: objective function f, the size d of the initial sample
x_1, ..., x_d ← generate an initial sample
A ← {(x_i, y_i)}                    /* initialize the archive */
for generation g = 1, 2, ... until stopping conditions met do
    M ← generate the probabilistic model based on A
    x_1, ... ← choose next points x ∈ X according to C_M(x)
    y_1, ... ← f(x_1), ...          /* evaluate the new point(s) */
    A ← A ∪ {(x_1, y_1), ...}       /* update the archive */
- suitable for very low budgets of f-evaluations (∼ 10 · D)
- Gaussian processes are used in the criterion C_M most often
- existing algorithms: EGO (D. R. Jones, 1998), SPOT (T. Bartz-Beielstein, 2005), SMAC (F. Hutter, 2011), etc.
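A minimal Python sketch of this archive-based loop, assuming scikit-learn's GP regressor and a simple lower-confidence-bound criterion in place of C_M; the 1-D objective g and all ranges are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def g(x):                                    # illustrative expensive black-box function
    return np.sin(3 * x) + 0.5 * x**2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))          # initial sample x_1, ..., x_d
y = g(X).ravel()                             # archive A = {(x_i, y_i)}

for generation in range(20):
    model = GaussianProcessRegressor().fit(X, y)   # model M based on A
    cand = rng.uniform(-2, 2, size=(200, 1))       # candidate points
    mu, s = model.predict(cand, return_std=True)
    x_next = cand[np.argmin(mu - s)]               # lower-confidence-bound criterion
    X = np.vstack([X, x_next])                     # evaluate the new point and
    y = np.append(y, g(x_next[0]))                 # update the archive
```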
Surrogate modelling
Surrogate modelling
- a technique which builds an approximating model of the fitness-function landscape
- the model provides a cheap and fast, but also inaccurate, replacement of the fitness function for part of the population
- an inaccurate approximating model can deceive the optimizer
Gaussian Process
a GP is a stochastic approximation method based on Gaussian distributions
a GP can express the uncertainty of the prediction in a new point x: it gives a probability distribution of the output value
Gaussian Process
Gaussian process
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A Gaussian process is completely specified by its
mean function m(x) = \mathbb{E}[f_{GP}(x)] and
covariance function cov(x_i, x_j) = cov(f_{GP}(x_i), f_{GP}(x_j)),
and we write the Gaussian process as

    f(x) \sim \mathcal{GP}(m(x), \mathrm{cov}(x, x')).
(Rasmussen, Williams, 2006)
Gaussian Process
given a set of N training points X_N = (x_1, ..., x_N)^\top, x_i ∈ R^d, and measured values y_N = (y_1, ..., y_N)^\top of a function f being approximated,

    y_i = f(x_i),  i = 1, ..., N,

a GP considers the vector of these function values to be a sample from an N-variate Gaussian distribution

    y_N \sim \mathcal{N}(0, C_N)
Gaussian Process prior distribution
[Figure: draws from the Gaussian process prior for three different covariance functions, K_SE, K_Matérn^{ν=3/2} and K_Matérn^{ν=5/2} (in that order), all with parameters ℓ = 1 and σ_f² = 1, without noise]
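Such prior draws are easy to reproduce; a sketch for the squared-exponential case (ℓ = 1, σ_f² = 1, plus a small jitter term for numerical stability):

```python
import numpy as np

# Three draws from a zero-mean GP prior with the squared-exponential covariance.
x = np.linspace(-5, 5, 101)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)   # K_SE with ell = 1, sigma_f^2 = 1
K += 1e-10 * np.eye(len(x))                       # jitter keeps K positive definite
draws = np.random.default_rng(1).multivariate_normal(np.zeros(len(x)), K, size=3)
```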
Gaussian Process prediction (posterior)
Making predictions
Let C_{N+1} be the covariance matrix extended by the entries belonging to an unseen point (x, y^\ast). Because y_N is known and the inverse C_{N+1}^{-1} can be expressed using the inverse of the training covariance, C_N^{-1}, the density in the new point marginalizes to a 1-D Gaussian density

    p(y^\ast \mid X_{N+1}, y_N) \propto \exp\left( -\frac{1}{2} \frac{(y^\ast - \hat{y}_{N+1})^2}{s^2_{y_{N+1}}} \right)

where the mean \hat{y}_{N+1} and the variance s^2_{y_{N+1}} are easily expressible from C_N^{-1} and y_N.
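A sketch of these predictive formulas for a zero-mean, noise-free GP with SE covariance on 1-D inputs; following standard practice (Rasmussen and Williams, 2006), the linear system is solved instead of inverting C_N explicitly.

```python
import numpy as np

def gp_predict(X, y, x_star, ell=1.0, sf2=1.0):
    """Posterior mean and variance at x_star for a zero-mean, noise-free GP (1-D inputs)."""
    cov = lambda a, b: sf2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell)**2)
    C_N = cov(X, X) + 1e-10 * np.eye(len(X))    # training covariance (+ jitter)
    k = cov(X, np.atleast_1d(x_star))[:, 0]     # covariances between X and the new point
    mean = k @ np.linalg.solve(C_N, y)          # uses C_N^{-1} y_N, no explicit inverse
    var = sf2 - k @ np.linalg.solve(C_N, k)
    return mean, var
```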
Gaussian Process prediction (posterior)
[Figure: Gaussian process predictions with N = 2, 3, 4 training data; (+) – training set, thick line – mean prediction, thin lines – three draws from the GP posterior (without noise). Predictions ŷ* and ±2s* are generated for 101 points; this is computationally stable, as the matrix inversion is needed only for the training covariance C_N.]
Gaussian Process covariance
The covariance matrix C_N is determined by the covariance function cov(x_i, x_j), which is defined on pairs of points from the input space:

    (C_N)_{ij} = \mathrm{cov}(x_i, x_j),  x_i, x_j \in \mathbb{R}^d

it expresses the degree of correlation between the function values at two points; typically a decreasing function of the distance between the points
[Figure: cov(x_i, x_j) decreasing from 1 as the distance d(x_i, x_j) grows]
Gaussian Process covariance
The most frequently used covariance function is the squared exponential:

    (K)_{ij} = \mathrm{cov}_{SE}(x_i, x_j) = \theta \exp\left( -\frac{1}{2\ell^2} (x_i - x_j)^\top (x_i - x_j) \right)

with the parameters (usually fitted by MLE):
θ – signal variance (scales the correlation)
ℓ – characteristic length scale
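A vectorized sketch of this kernel matrix, with θ and ℓ as keyword arguments:

```python
import numpy as np

def cov_se(X, theta=1.0, ell=1.0):
    """Squared-exponential covariance matrix for the rows of X (shape n x d)."""
    diff = X[:, None, :] - X[None, :, :]       # pairwise differences x_i - x_j
    sq_dist = np.sum(diff**2, axis=-1)         # (x_i - x_j)^T (x_i - x_j)
    return theta * np.exp(-sq_dist / (2 * ell**2))
```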
Gaussian Process covariance
Another usual option in data-mining applications is the Matérn covariance, which, for r = ‖x_i − x_j‖, reads

    (K)_{ij} = \mathrm{cov}_{\mathrm{Matérn}}^{\nu=5/2}(r) = \theta \left( 1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2} \right) \exp\left( -\frac{\sqrt{5}\,r}{\ell} \right)

with the parameters (same as for the squared exponential):
θ – signal variance
ℓ – characteristic length scale
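The same kind of sketch for the ν = 5/2 Matérn kernel:

```python
import numpy as np

def cov_matern52(X, theta=1.0, ell=1.0):
    """Matern nu=5/2 covariance matrix for the rows of X, with r = ||x_i - x_j||."""
    diff = X[:, None, :] - X[None, :, :]
    r = np.sqrt(np.sum(diff**2, axis=-1))      # pairwise Euclidean distances
    a = np.sqrt(5) * r / ell
    return theta * (1 + a + 5 * r**2 / (3 * ell**2)) * np.exp(-a)
```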
Gaussian Process covariance
[Figure: comparison of covariance functions; source: (Rasmussen and Williams, 2006)]
Stochastic search of Evolutionary algorithms
Stochastic black-box search
initialize distribution parameters θ
set population size λ ∈ N
while not terminate
    1 sample distribution P(x|θ) → x_1, ..., x_λ ∈ R^n
    2 evaluate x_1, ..., x_λ on f
    3 update parameters θ
(A. Auger, Tutorial CMA-ES, GECCO 2013)
this is the schema of most evolutionary strategies (and EDA algorithms), as well as of CMA-ES (Covariance Matrix Adaptation ES) – the current state of the art in continuous optimization
The CMA-ES
Input: m ∈ R^n, σ ∈ R_+, λ ∈ N
Initialize: C = I (and several other parameters)
Set the weights w_1, ..., w_λ appropriately
while not terminate
    1 x_i = m + σy_i,  y_i ∼ N(0, C),  for i = 1, ..., λ        [sampling]
    2 m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σy_w, where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}        [update mean]
    3 update C
    4 update step size σ
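A sketch of steps 1 and 2 only (the C and σ adaptation rules of steps 3 and 4 are omitted here); the logarithmic weights are one common choice.

```python
import numpy as np

def cma_sample_and_update_mean(f, m, sigma, C, lam=12, mu=3, rng=np.random.default_rng()):
    """One CMA-ES sampling + mean-update step (C and sigma adaptation omitted)."""
    Y = rng.multivariate_normal(np.zeros(len(m)), C, size=lam)  # y_i ~ N(0, C)
    X = m + sigma * Y                                           # x_i = m + sigma * y_i
    order = np.argsort([f(x) for x in X])                       # ranking x_{1:lam}, x_{2:lam}, ...
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))         # decreasing positive weights
    w /= w.sum()
    y_w = w @ Y[order[:mu]]                                     # y_w = sum_i w_i y_{i:lam}
    return m + sigma * y_w                                      # new mean
```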
[Figure: three CMA-ES generations – λ = 12 offspring x_i sampled from N(m_1, σ_1²C_1), the µ = 3 best steps σy_{1:λ}, σy_{2:λ}, σy_{3:λ} selected, the mean moved to m_2 with adapted σ_2, C_2, and then to m_3 with σ_3, C_3]
Covariance matrix adaptation
- eigenvectors of the covariance matrix C are the principal components – the principal axes of the mutation ellipsoid
- CMA-ES learns and updates a new Mahalanobis metric
- it successively approximates the inverse Hessian on quadratic functions
  – it transforms an ellipsoid function into a sphere function
  – this holds to some degree for other functions, too
[Figure: mutation ellipsoid with principal axes b_1, b_2; source: (S. Finck, N. Hansen, R. Ros, and A. Auger, 2009)]
Is the CMA-ES the best for everything?
- CMA-ES is a state-of-the-art optimization algorithm, especially for rugged and ill-conditioned objective functions
- however, it is not the fastest if we can afford only very few objective function evaluations
what we have already seen: use a surrogate model!
- however, original-evaluated solutions are available only along the search path
- solution: construct local surrogate models
Doubly trained Surrogate CMA-ES
[Diagram: one generation of DTS-CMA-ES around the CMA-ES state (m, σ) – (1) training of the 1st model, (2) sampling from N(m, σ²C), (3) criterion ranking according to the 1st model, (4) fitness evaluation of a few chosen points, (5) training of the 2nd model, (6) mean prediction of the 2nd model for the rest of the population]
Doubly trained Surrogate CMA-ES
1 sample a new population of size λ (standard CMA-ES offspring),
2 train the first surrogate model on the original-evaluated points from the archive A,
3 select ⌈αλ⌉ point(s) w.r.t. a criterion C which is based on the first model's prediction,
4 evaluate these point(s) with the original fitness,
5 re-train the surrogate model, also using these new point(s), and
6 predict the fitness of the non-original-evaluated points with this second model.
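Schematically, one such generation could look as follows in Python; sample_pop, train_gp and criterion are illustrative stand-ins for the CMA-ES sampler, the GP training routine, and the selection criterion C.

```python
import numpy as np

def dts_generation(f, sample_pop, train_gp, criterion, archive, alpha=0.05):
    """One doubly-trained generation (schematic; the helper callables are stand-ins)."""
    X = sample_pop()                              # 1: lambda new CMA-ES offspring
    model1 = train_gp(archive)                    # 2: first model from the archive
    n_orig = int(np.ceil(alpha * len(X)))         # 3: choose ceil(alpha*lambda) points by C
    chosen = np.argsort(criterion(model1, X))[:n_orig]
    for i in chosen:                              # 4: evaluate them with the original fitness
        archive.append((X[i], f(X[i])))
    model2 = train_gp(archive)                    # 5: re-train with the new points included
    return X, model2.predict(X)                   # 6: predicted fitness for the population
```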
Criteria for the selection of original-evaluated points
GP predictive mean

    C_M(x) = -\hat{y}(x)

GP predictive standard deviation

    C_{STD}(x) = s(x)
Criteria for the selection of original-evaluated points
Expected improvement (EI); y_min – the minimal fitness found so far:

    C_{EI}(x) = \mathbb{E}\left( (y_{\min} - f(x)) \, I(f(x) < y_{\min}) \mid y_1, ..., y_N \right), where
    I(f(x) < y_{\min}) = 1 for f(x) < y_{\min}, and 0 for f(x) ≥ y_{\min}

Probability of improvement (PoI); the probability of finding a lower fitness than some threshold T:

    C_{PoI}(x, T) = P(f(x) \le T \mid y_1, ..., y_N) = \Phi\left( \frac{T - \hat{y}(x)}{s(x)} \right)

where Φ is the CDF of N(0, 1) and T = y_min or a slightly higher value
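For a Gaussian predictive distribution N(ŷ(x), s²(x)), both criteria have well-known closed forms; a sketch:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, s, y_min):
    """Closed-form EI for a Gaussian predictive distribution N(mu, s^2)."""
    z = (y_min - mu) / s
    return (y_min - mu) * norm.cdf(z) + s * norm.pdf(z)

def probability_of_improvement(mu, s, T):
    """PoI = Phi((T - mu) / s) for a threshold T, e.g. T = y_min."""
    return norm.cdf((T - mu) / s)
```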
Criteria for the selection of original-evaluated points
[Figure: comparison of the selection criteria – GP predictive mean (M), GP predictive standard deviation (STD), expected improvement (EI), probability of improvement (PoI), and expected RDE (ERDE) – on selected unimodal COCO functions f1, f2, f8–f14 and multimodal COCO functions f3, f4, f15–f24, in 5-D and 20-D; x-axis: number of evaluations / D. The log10 of the median best f-value distances to the benchmarks' optima were scaled linearly to [−8, 0] for each COCO function.]
GP model training
trainModel(A, N_max, TSS, r_max^A, K, σ^{(g)}, C^{(g)}, m^{(g)})
    (X_N, y_N) ← select at most N_max points from the archive A using TSS and r_max^A
    X_N ← transform the selected points into the (σ^{(g)})²C^{(g)} basis with the origin at m^{(g)}
    y_N ← standardize the f-values in y_N to zero mean and unit variance
    (m_µ, σ_f², ℓ, σ_n) ← fit the hyperparameters of µ(x) and K using ML estimation
Training set selection
1 TSS1: taking up to N_max most recently evaluated points
2 TSS2: selecting the union of the k nearest neighbours of every point for which the fitness should be predicted, where k is maximal such that the total number of selected points does not exceed N_max
3 TSS3: clustering the points in the input space into N_max clusters and taking the points nearest to the clusters' centroids
4 TSS4: selecting the N_max points which are closest to any point in the current population
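As an illustration of TSS2, a sketch using plain Euclidean distance (the Mahalanobis transformation of the inputs described above would happen beforehand):

```python
import numpy as np

def tss2(X_archive, X_query, n_max):
    """Union of the k nearest archive points of every query point, with k maximal
    such that the union contains at most n_max points."""
    d = np.linalg.norm(X_query[:, None, :] - X_archive[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)              # per-query ranking of archive points
    selected = np.array([], dtype=int)
    for k in range(1, X_archive.shape[0] + 1):
        union = np.unique(nearest[:, :k])        # union of the k nearest neighbours
        if union.size > n_max:
            break
        selected = union
    return X_archive[selected]
```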
GP model parameters
parameter                            considered values
training set selection method TSS    TSS1, TSS2, TSS3, TSS4
maximum distance r_max^A             2√(Q_χ²(0.99, D)), 4√(Q_χ²(0.99, D))
N_max                                10·D, 15·D, 20·D
covariance function K                K_SE, K_Matérn^{ν=3/2}, K_Matérn^{ν=5/2}

Parameters of the GP surrogate models. The maximum distance r_max^A is derived using the Mahalanobis distance given by the covariance matrix σ²C. Q_χ²(0.99, D) is the 0.99-quantile of the χ²_D distribution, and therefore √(Q_χ²(0.99, D)) is the 0.99-quantile of the norm of a D-dimensional normally distributed random vector.
Gaussian process parameter settings – heatmap
[Figure: heatmaps of the ranking prediction error of the GP parameter sets on the COCO/BBOB functions f1–f4 and f6–f24, in 2-D, 5-D, 10-D and 20-D. Solid horizontal lines separate the different TSS methods TSS1–TSS4 (in that order), dashed lines separate sectors with the smaller and larger maximum distance r_max^A. Three triples of settings within each sector represent rising values of N_max, with the three covariance functions K_SE, K_Matérn^{ν=3/2}, K_Matérn^{ν=5/2} within each triple.]
Testing framework
- Black-Box Optimization Benchmarking (BBOB)
- COmparing Continuous Optimisers (COCO)
- 24 artificial functions
- different degrees of separability, conditioning, modality, with or without a global structure
- testing sets defined for dimensions 2, 3, 5, 10, 20 (and 40)
[Figure: landscapes of selected BBOB functions f1, f3, f4, f8, f12, f13, f14, f16, f17, f20, f22, f23, f24]
Aggregated experimental results on BBOB
[Figure: aggregated convergence graphs in 2-D, 5-D, 10-D and 20-D. Compared algorithms: S-CMA-ES, 0.05/2pop DTS-CMA-ES, adaptive DTS-CMA-ES, MA-ES, GPOP, SAPEO, CMA-ES, CMA-ES 2pop, BIPOP-s*ACMES-k, lmm-CMA, BOBYQA, SMAC, fmincon; x-axis: number of evaluations / D, y-axis: scaled log10 of the distance to the optimum.]
Experimental results on BBOB (5 D)
[Figure: convergence graphs on f1 Sphere, f2 Ellipsoidal, f3 Rastrigin and f4 Bueche-Rastrigin in 5-D; same algorithms and axes as in the aggregated results.]
Experimental results on BBOB (5 D)
[Figure: convergence graphs on f5 Linear Slope, f6 Attractive Sector, f8 Rosenbrock (original) and f9 Rosenbrock (rotated) in 5-D.]
Experimental results on BBOB (5 D)
[Figure: convergence graphs on f13 Sharp Ridge, f15 Rastrigin (multi-modal), f17 Schaffers F7 and f18 Schaffers F7 (ill-conditioned) in 5-D.]
Experimental results on BBOB (5 D)
[Figure: convergence graphs on f19 Composite Griewank-Rosenbrock F8F2, f22 Gallagher's Gaussian 21-hi Peaks, f23 Katsuura and f24 Lunacek bi-Rastrigin in 5-D.]
Experimental results on BBOB (20 D)
[Figure: convergence graphs on f1 Sphere, f2 Ellipsoidal, f3 Rastrigin and f4 Bueche-Rastrigin in 20-D.]
Experimental results on BBOB (20 D)
[Figure: convergence graphs on f5 Linear Slope, f6 Attractive Sector, f8 Rosenbrock (original) and f9 Rosenbrock (rotated) in 20-D.]
Experimental results on BBOB (20 D)
[Figure: convergence graphs on f13 Sharp Ridge, f15 Rastrigin (multi-modal), f17 Schaffers F7 and f18 Schaffers F7 (ill-conditioned) in 20-D.]
Experimental results on BBOB (20 D)
[Figure: convergence graphs on f19 Composite Griewank-Rosenbrock F8F2, f22 Gallagher's Gaussian 21-hi Peaks, f23 Katsuura and f24 Lunacek bi-Rastrigin in 20-D.]