
Learning Stochastic Binary Tasks using Bayesian Optimization with Shared Task Knowledge

Matthew Tesch [email protected]
Jeff Schneider [email protected]
Howie Choset [email protected]

Robotics Institute, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA

Abstract

Robotic systems often have tunable parameters which can affect performance; Bayesian optimization methods provide for efficient parameter optimization, reducing required tests on the robot. This paper addresses Bayesian optimization in the setting where performance is only observed through a stochastic binary outcome – success or failure. We define the stochastic binary optimization problem, present a Bayesian framework using Gaussian processes for classification, adapt the existing expected improvement metric for the binary case, and benchmark its performance. We also exploit problem structure and task similarity to generate principled task priors allowing efficient search for difficult tasks. This method is used to create an adaptive policy for climbing over obstacles of varying heights.

1. Introduction

Many real-world optimization tasks take the form of optimization problems where the number of objective function samples is severely limited. This often occurs with physical systems which are expensive to test, such as choosing optimal parameters for a robot's control policy. In cases where the objective is a continuous real-valued function, the use of Bayesian sequential experiment selection metrics such as expected improvement (EI) has led to efficient optimization of these objectives. An advantage of EI is that it requires no tuning parameters.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Figure 1. Efficient optimization is possible even with limited function evaluations which only return a noisy ‘success’ or ‘failure’. Top: Moving over a 3.5 inch beam with the predicted best motion after 20 evaluations, using no prior information. Bottom: Sharing results from previous optimizations on different obstacles allows the robot to move over an 11 inch beam on the first attempt.

We are interested in the problem setting where the objective is not a deterministic continuous-valued function, but a stochastic binary-valued function. In the case of a robot, instead of choosing parameters which maximize locomotive speed, the task may be to choose the parameters of a policy which maximize the probability of successfully moving over an obstacle (Fig. 1), where the success of this task is stochastic due to noise in the system.

Inspired by the success of Bayesian optimization for continuous problems, we propose using a similar framework for the stochastic binary setting. This paper defines the stochastic binary optimization problem, describes the application of Gaussian processes for classification to this problem, proposes a selection metric based on EI, and benchmarks performance on synthetic test functions against a number of potential baseline approaches.

Unfortunately, working in parameter spaces where regions with significant probability of success are relatively sparse (e.g., a snake robot attempting to overcome a tall obstacle) will amount to a blind search. Inspired by ideas in multi-task learning, we exploit task structure to solve simpler problems first (smaller obstacles), and then use the learned knowledge as a principled prior for the difficult task. This enables efficient optimization of the task, which would otherwise require us to resort to an exhaustive search.

2. Related Work

For optimization problems where each function evaluation is expensive (either requiring significant time or resources), the choice of which point to sample becomes more important than the speed at which a sample can be chosen. Thus, Bayesian optimization of such functions relies on a function regression method, such as Gaussian processes (GPs) (Rasmussen & Williams, 2006), to predict the entire unknown objective from limited sampled data. Given this prediction of the true objective, the central challenge is the exploration/exploitation tradeoff – balancing the need to explore unknown areas of the space with the need to refine the knowledge in areas that are known to have high function values. Metrics such as the upper confidence bound (Auer et al., 2002), probability of improvement (Zilinskas, 1992), and expected improvement (Mockus et al., 1978) attempt to trade off these conflicting goals. Existing literature primarily focuses on deterministic, continuous, real-valued functions, rather than stochastic ones or ones with binary outputs.

Active learning (cf. the survey by Settles (2009)) is primarily focused on learning the binary class membership of a set of unlabeled data points, but generally attempts to accurately learn class membership of all unlabeled points with high confidence, which is inefficient if the loss function is asymmetric (i.e., if it is more important to identify successes than failures). The active binary-classification problem discussed in (Garnett et al., 2012) focuses on finding a Bayesian optimal policy for identifying a particular class, but assumes deterministic class membership.

In the bandit literature, the subtopic of continuous-armed bandits or metric bandits (e.g., (Agrawal, 1995; Auer et al., 2007)) has a similar problem structure to that described in our work; these embed the “arms” of the classic multi-arm bandit problem into a metric space, allowing a potentially uncountably infinite number of arms. The focus of much bandit work is minimizing asymptotic bounds on the cumulative regret in an online setting, whereas we are concerned only with the performance of the algorithm's recommendation after an offline training phase.

Prior work in multi-task learning postulates that tabula rasa learning for multiple similar problems is to be avoided, especially when the task has descriptive features (or parameters). A number of techniques have been applied (e.g., (Bakker & Heskes, 2003) suggest neural network predictors for generalizing task knowledge), but perhaps the most relevant are those of (Bonilla et al., 2007) and (Tesch et al., 2011); these both incorporate the task as additional parameters of the GP used to model the objective. Bonilla et al. attempt to efficiently and accurately model, rather than optimize, the objective at a new task. Tesch et al. focus on Bayesian optimization, but allow the algorithm to choose the task parameters of each experiment as well. Additionally, neither of these approaches considers the case of binary information.

3. Binary Stochastic Problem

Given an input (parameter) space X ⊂ R^d and an unknown function π : X → [0, 1] which represents the underlying binomial probability of success of an experiment, the learner sequentially chooses a series of points x = {x1, x2, . . . , xn | xi ∈ X} to evaluate. After choosing each xi, the learner receives feedback yi, where yi = 1 with probability π(xi) and yi = 0 with probability 1 − π(xi). The choice of xi is made with knowledge of {y1, y2, . . . , yi−1}. The goal of the learner is to recommend, after n experiments, a point xr which minimizes the (typically unknown) error, or simple regret, max_{x∈X} π(x) − π(xr); this is equivalent to maximizing the performance π(xr).
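To make this protocol concrete, the following minimal Python sketch implements the evaluation loop just defined; pi_true, select_next, and recommend are hypothetical stand-ins (not from the paper) for the unknown objective, a selection metric, and a recommendation rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_true(x):
    # Hypothetical underlying success probability (unknown to the learner).
    return 0.8 * np.exp(-0.5 * ((x - 0.3) / 0.1) ** 2)

def run_trial(n, select_next, recommend, lo=0.0, hi=1.0):
    xs, ys = [], []
    for _ in range(n):
        # Choose the next experiment using only the past (x, y) pairs.
        x_i = select_next(np.array(xs), np.array(ys), lo, hi)
        # Stochastic binary feedback: a Bernoulli draw with mean pi(x_i).
        y_i = rng.binomial(1, pi_true(x_i))
        xs.append(x_i)
        ys.append(y_i)
    # After n experiments, recommend a single point x_r.
    x_r = recommend(np.array(xs), np.array(ys), lo, hi)
    # Simple regret: gap between the best achievable and the recommended point.
    grid = np.linspace(lo, hi, 1001)
    return x_r, pi_true(grid).max() - pi_true(x_r)
```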

4. Background

4.1. Bayesian Optimization

In Bayesian optimization of a continuous real-valued deterministic function, the goal is to find x_best which maximizes the function f : X → R. The process relies on a data-driven probabilistic model f̂ (often a GP) of the underlying function f, and a selection metric which selects the next point to sample at each iteration.

The algorithm is an iterative process – at each step i, fit a model based on x and y, select a next xi, and evaluate xi on the true function f to obtain yi. The crux of the algorithm is the metric which is optimized to choose the next point. EI has been popularized as such a selection metric in the Efficient Global Optimization algorithm (Jones et al., 1998). Given a function estimate f̂, improvement is defined as

I(f̂(x)) = max(f̂(x) − y_best, 0),    (1)

where y_best is the maximum of the previously sampled values y. The GP defines f̂(x) as a posterior distribution over f(x); the expectation of improvement under this posterior, EI(x) = E[I(f̂(x))], defines the EI:

EI(x) = (f^x_μ − y_best)(1 − Φ((y_best − f^x_μ)/f^x_σ)) + f^x_σ φ((y_best − f^x_μ)/f^x_σ)

Above, φ and Φ are the probability density and cumulative density functions (the pdf and cdf) of the standard normal distribution; p^x_f is the posterior pdf of f̂(x), and f^x_μ and f^x_σ are its mean and standard deviation.
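For reference, this closed form transcribes directly into a few lines of Python; a sketch, where mu and sigma stand for f^x_μ and f^x_σ at the query points, using the identity 1 − Φ(−z) = Φ(z):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2) at each point."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = (mu - y_best) / np.maximum(sigma, 1e-12)
    # Equals (mu - y_best)(1 - Phi((y_best - mu)/sigma)) + sigma*phi(...),
    # since 1 - Phi(-z) = Phi(z) and phi(-z) = phi(z).
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```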

4.2. Gaussian Processes for Classification

A key idea behind Bayesian optimization is the probabilistic modeling of the unknown function. In the binary stochastic case, standard GPs are not appropriate because they are a regression technique, fitting continuous data. We use an adaptation of GPs for classification; this provides a similar probabilistic model, but for stochastic binary data (cf. (Rasmussen & Williams, 2006)).

As in linear binary classification, the use of a sigmoidal response function σ¹ converts a model with a range of (−∞, ∞) to an output that lies within [0, 1] (i.e., a valid probability). In Gaussian processes for classification (GPC), a latent GP f defines a Gaussian pdf p^x_f for each x ∈ X (as well as joint Gaussian pdfs for any set of points in X). The corresponding probability density over class probability functions, p^x_π, is

p^x_π(y) = p^x_f(σ⁻¹(y)) (dσ⁻¹/dy)(y).    (2)

Although the response function σ maps from the latent space F to the class probability space Π, p^x_π(y) ≠ p^x_f(σ⁻¹(y)) due to the change of variables. Also note that as we do not observe values of f directly, the inference step requires an analytically intractable integral. Advantages and disadvantages of different approximate methods are discussed in (Nickisch & Rasmussen, 2008); we use Minka's expectation propagation (EP) method (2001) due to its accuracy and reasonable speed.

¹In this work, we assume σ is the standard normal cdf; however, any monotonically increasing function mapping from R to the unit interval can be used.
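For readers who want to experiment, a minimal sketch of fitting a GPC model to binary outcomes with scikit-learn follows; note that scikit-learn's GaussianProcessClassifier uses a Laplace approximation with a logistic link, rather than the EP/probit combination used in this paper, and the data below are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Placeholder success/failure observations at sampled parameter values.
X = np.array([[0.10], [0.25], [0.30], [0.50], [0.80]])
y = np.array([0, 1, 1, 0, 0])

# A latent GP squashed through a sigmoid; scikit-learn fits it with a
# Laplace approximation (the paper uses EP with a probit link instead).
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=0.2))
gpc.fit(X, y)

grid = np.linspace(0.0, 1.0, 5)[:, None]
print(gpc.predict_proba(grid)[:, 1])  # predicted probability of success
```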

4.2.1. Expectation of Posterior on Success Probability

As noted above, p^x_π(y) ≠ p^x_f(σ⁻¹(y)); therefore the expectation of the posterior over the success probability, E[p^x_π], is not generally equal to σ(E[p^x_f]). To calculate the former, we use the definition of expectation along with a change-of-variables substitution (π = σ(f) and y = σ(z)) to take this integral in the latent space (where approximations for the standard normal cdf can be used):

E[p^x_π] = ∫₀¹ y p^x_π(y) dy    (3)
         = ∫_{σ⁻¹(0)}^{σ⁻¹(1)} σ(z) p^x_f(z) (dσ⁻¹/dy)(σ(z)) (dσ/dz)(z) dz
         = ∫_{−∞}^{∞} σ(z) p^x_f(z) dz    (4)

As noted in section 3.9 of (Rasmussen & Williams, 2006), if σ is the Gaussian cdf this can be rewritten as follows (for notational simplicity, we define π̄(x) = E[p^x_π] for use later in the paper):

E[p^x_π] = Φ( E[p^x_f] / √(1 + V[p^x_f]) )    (5)
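Eqn. (5) is cheap to evaluate; a short Python sketch, assuming the mean and variance of the latent posterior at x are available (e.g., from the EP approximation):

```python
import numpy as np
from scipy.stats import norm

def expected_success_prob(f_mean, f_var):
    """Eqn. (5): E[p_pi^x] = Phi(E[f] / sqrt(1 + V[f])) for a probit link."""
    f_mean = np.asarray(f_mean, dtype=float)
    f_var = np.asarray(f_var, dtype=float)
    return norm.cdf(f_mean / np.sqrt(1.0 + f_var))
```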

5. Expected Improvement for Binary Responses

In the case of stochastic binomial feedback, the notion of improvement that underlies the definition of EI must change. Because the only potential values for yi are 1 and 0, after the first 1 is sampled, y_best would be set to 1. Because there is no possibility for a returned value higher than 1, improvement (and therefore EI) would be identically zero for each x ∈ X.

Instead, we note that these are noisy observations of an underlying success probability, and query the GP posterior at each sampled point in x. Let

π̄_max = max_{xi∈x} π̄(xi).    (6)

As the 0 and 1 responses are samples from a Bernoulli distribution with mean π(x), we define the improvement as if we could truly sample the underlying mean. Choosing this, rather than conditioning our improvement on 0/1 outcomes, is consistent with the fact that our π̄_max represents a probability, not a single sample of 0/1. In this case,


Iπ(π̄(x)) = max(π̄(x) − π̄_max, 0)    (7)

To calculate the EI, we follow a procedure similar to that in §4.2.1 to calculate the expectation of Iπ(π̄(x)):

EIπ(x) = ∫_{π̄_max}^{1} (y − π̄_max) p^x_π(y) dy    (8)
       = ∫_{σ⁻¹(π̄_max)}^{∞} (σ(z) − π̄_max) p^x_f(z) dz

The marginalization trick that allowed us to evaluate the integral for π̄ and obtain a solution requiring only the Gaussian cdf (Eqn. (5)) does not work here, because the integral is not from −∞ to ∞; fortunately, it is one-dimensional regardless of the dimension of X and is easy to numerically evaluate in practice (e.g., using adaptive quadrature techniques).
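A minimal sketch of this quadrature in Python, assuming a probit response function and a Gaussian latent posterior with mean f_mean and standard deviation f_std at the query point (the names are illustrative):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def binary_ei(f_mean, f_std, pi_bar_max):
    """EI_pi at one point: the latent-space integral of Eqn. (8)."""
    # sigma^{-1}(pi_bar_max) for a probit link; assumes 0 < pi_bar_max < 1.
    lower = norm.ppf(pi_bar_max)

    def integrand(z):
        # (sigma(z) - pi_bar_max) * p_f^x(z), with p_f^x Gaussian.
        return (norm.cdf(z) - pi_bar_max) * norm.pdf(z, loc=f_mean, scale=f_std)

    value, _ = quad(integrand, lower, np.inf)
    return value
```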

6. Performance Benchmarks for Stochastic Binary EI

To validate the performance of our EI metric for stochastic binary outputs, we created several challenging synthetic test functions for π(x) on which we could run a large number of optimizations; these functions exhibit properties such as multiple local optima, a narrow global optimum, and stochasticity (π(x) ∉ {0, 1} over much of X).

As baselines to compare against the metric we propose in §5, we use uniform random selection, the upper confidence bound (UCB) on the latent function f, EI on the latent function f (EIf), and the Upper Confidence Bound for Continuous-armed bandits algorithm (UCBC) proposed in (Auer et al., 2007). For the UCB parameter β (the standard deviation coefficient), we chose the value which performed best, β = 1, and for the UCBC algorithm we chose the algorithm parameter n = (T/ln(T))^{1/4} = 2 via the method given in the paper.² Because we are not directly observing the sampled function value, we redefine the y_best term for EIf as y_best = max_{xi∈x} σ⁻¹(π̄(xi)), where x is the set of all sampled xi.

To compare the various algorithms, we allowed each algorithm to sequentially choose a series of x = {x1, x2, . . . , x50}, with feedback yi generated from a Bernoulli distribution with mean π(xi) (according to the test function) after each choice of xi. This was completed 100 times for each test function.³

²We also set n = 10, but obtained similar results.
³Our MATLAB implementation of these algorithms and more extensive results are available at http://www.mtesch.net/ICML2013/

To obtain a measure of the algorithm's performance at step i, we use the natural Bayesian recommendation strategy of choosing the point which has the highest expected probability of success given the predicted function (x_best = argmax_{x∈X} E[p^x_π | {x1, x2, . . . , xi} and {y1, y2, . . . , yi}]).⁴ The point x_best is then evaluated on the underlying true success probability function π, and the resulting value π(x_best) is given as the expected performance of the algorithm at step i. For the random selection and UCBC algorithms, which do not have a notion of π̄, a GP was fit to the data collected by the algorithm to obtain π̄, using the same parameters as for the Bayesian optimization algorithms.

⁴In practice one may optimize a utility function that considers risk (e.g., the uncertainty in that probability).
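This recommendation rule is straightforward on a grid of candidates; a sketch under the same probit-link assumption as above, with latent_mean and latent_var denoting the latent posterior moments over the grid (hypothetical names):

```python
import numpy as np
from scipy.stats import norm

def recommend(latent_mean, latent_var, grid):
    """Return the candidate with the highest expected success probability."""
    # Eqn. (5) evaluated over the grid, then argmax.
    pi_bar = norm.cdf(np.asarray(latent_mean) / np.sqrt(1.0 + np.asarray(latent_var)))
    return grid[int(np.argmax(pi_bar))]
```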

In Fig. 2, we plot the average performance over 100 runs of the proposed stochastic binary expected improvement EIπ as well as the various baselines. As expected, the knowledge of the underlying function grew slowly but steadily as random sampling characterized the entire function. The focus of EIπ on areas of the function with the highest expectation for improvement led to a more efficient strategy which still chose to explore, but focused experimental evaluations on more promising areas of the search space. Notably, EIπ matched or outperformed tuned versions of all other algorithms tested, without requiring a tuning parameter. The UCBC algorithm worked well for simple cases (test function A had a significant region with high probability of success) but faltered as the functions became more difficult to optimize; challenges with this algorithm include the lack of shared knowledge between nearby intervals, the dependence on a tuning parameter (the number of intervals), and the fact that it is not defined for higher dimensions.

We also note that EIπ outperforms the naïve use of Bayesian optimization techniques on the latent GP f, as shown in Fig. 2. This is largely because the interpretation of variances on the latent function, when used in the classification framework, is unintuitive – the variance f^x_σ is not based solely on the sampled points as in the regression case; instead, larger values of f^x_μ tend to have larger variances due to the nonlinear mapping into the space of probabilities π.

Figure 2. After each sample, the algorithm was queried as to its recommendation for a point x that would have a maximum expectation of success π̄(x). These results show the underlying probability value of that point, averaged over 100 runs of each algorithm. We compare the stochastic binary EI (EIπ) to Auer's continuous-armed bandit algorithm UCBC (n = 2) and uniform random selection in panels (a) and (c), and to EIf and UCBf (β = 1) on the latent function in panels (b) and (d); panels (a) and (b) use test function A, panels (c) and (d) test function B. Each plot shows π(x) ∈ [0, 1] versus sample number, 0–50. More complete results are available online. (Plot images omitted.)

7. Robotic Application: Snakes and Obstacles

The snake robot described in (Wright et al., 2012) has impressive locomotive capabilities, and is able to use cyclic motions called gaits to move quickly across flat ground, forward through narrow openings, and up the inside and outside of irregular horizontal and vertical pipes. However, moving over cluttered, obstacle-laden surfaces (such as a rubble pile) provides a challenge for the system. One such obstacle, which we have encountered in the field during disaster response training exercises, is a 4x4 beam.

A master-slave system was set up to record an expert's input to move the robot over the obstacle. Using a sparse function approximation of the expert's input, we created a 7-parameter model that was able to overcome obstacles of various sizes, albeit unreliably – the same parameters would only sometimes result in success. The parameters of this model (offsets, widths, and amplitudes of travelling waves) were difficult to optimize by hand to produce reliable results.

Using the EIπ metric, a 3-dimensional subspace was searched to identify parameters which resulted in a robust motion over the original obstacle. Running 20 robot experiments (function evaluations) resulted in the recommendation of a parameter setting which produced robust, successful motions (top of Fig. 1).

Attempting this same optimization on a 9 inch obstacle resulted in no successes within the first 20 trials; solutions with a non-zero probability of success were sparse enough that we were essentially conducting a blind search of the parameter space.

7.1. Exploiting Task Structure to Solve Difficult Problems

We wish to avoid an exhaustive search, even for problems where the regions with high success probability are sparse within the space. When these problems represent the optimization of a task, such as a robot moving over an obstacle, one can often parameterize that task. With a carefully chosen task parameterization, one can learn the general behavior and location of optima of the objective from one or more simpler optimization problems, and use these as a principled prior for optimization of the difficult task.

Applying ideas from (Bonilla et al., 2007) and (Tesch et al., 2011) to the snake robot task, we attempted to learn parameters of our expert-based model for more difficult obstacles, such as the 9 inch beam we could not overcome. We added a fourth parameter, representing obstacle height, to our GP function approximation. This generated a prediction for all obstacle heights, allowing us to have a strong prior for subsequent optimizations by incorporating previous data. Figure 3 shows a selected trial for each intermediate task parameter (5.5, 7, and 9 inches), each using the data from all previous optimizations.
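The mechanics of this task augmentation are simple; the sketch below illustrates the idea with scikit-learn's GPC on placeholder data (the paper's own implementation is in MATLAB and uses EP, so this is only an illustration of the input augmentation, not the authors' code):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Placeholder data standing in for logged robot trials: three policy
# parameters per trial, an obstacle height (inches) as the task parameter,
# and a binary success label.
params = rng.random((40, 3))
heights = rng.choice([3.5, 5.5, 7.0], size=40)
labels = rng.integers(0, 2, size=40)
labels[:2] = [0, 1]  # ensure both classes are present in the toy data

# Augment each input with its task parameter so one joint model is learned.
X_task = np.hstack([params, heights[:, None]])

# A longer length-scale on the task dimension shares strength across heights.
kernel = 1.0 * RBF(length_scale=[0.3, 0.3, 0.3, 2.0])
gpc = GaussianProcessClassifier(kernel=kernel).fit(X_task, labels)

# The joint model yields a prior over a new, harder task (e.g., 9 inches).
candidates = np.hstack([rng.random((100, 3)), np.full((100, 1), 9.0)])
prior_success = gpc.predict_proba(candidates)[:, 1]
```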

As opposed to the initial experimental trial, we found a successful 9 inch trial on the first experiment suggested by EIπ, demonstrating that shared knowledge between tasks can improve real-world optimization performance. Parameters for overcoming an 11 inch beam were then successfully predicted with no required optimization (Fig. 1).

Although generalizing results from an easier task to a more difficult task works well for many problems, there are caveats. Common choices for GP covariance functions are axis-aligned, resulting in poorer generalization if a trend across multiple tasks has a principal direction that is not primarily along the task parameter axes. In addition, if the global optimum for a difficult task is unrelated to an optimum for a simple task, the sharing of knowledge across tasks is less likely to increase efficiency (unless it helps identify global properties of the function that could improve the search).


Figure 3. Successful trials from optimizations for 5.5 (top), 7 (center), and 9 (bottom) inch obstacles. The robot started at the left side of the obstacle, and moved over the obstacle to the right.

8. Conclusion and Future Work

We have defined the stochastic binary optimization problem for expensive functions, presented a novel use of GPC to frame this problem as Bayesian optimization, and presented a new optimization algorithm that computes expected improvement in the stochastic binary case, outperforming several baseline metrics as well as a leading continuous-armed bandit algorithm. We used our algorithm to learn a robust motion for moving a snake robot over an obstacle, and used multi-task learning concepts to efficiently create an adaptive policy for obstacles of various heights.

The problem we define is not limited to the demonstrated snake robot application; it applies to many expensive problems with parameterized policies and stochastic success/failure feedback, including variations of applications where continuous-armed bandits are currently used, such as auction mechanisms and oblivious routing, recast with an offline training phase that penalizes simple rather than cumulative regret.

References

Agrawal, Rajeev. The Continuum-Armed Bandit Problem. SIAM Journal on Control and Optimization, 33(6):1926–1951, November 1995.

Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.

Auer, Peter, Ortner, Ronald, and Szepesvári, Csaba. Improved rates for the stochastic continuum-armed bandit problem. In Learning Theory (COLT), 2007.

Bakker, Bart and Heskes, Tom. Task Clustering and Gating for Bayesian Multitask Learning. Journal of Machine Learning Research, 4:83–99, 2003.

Bonilla, Edwin, Agakov, Felix, and Williams, Christopher. Kernel multi-task learning using task-specific features. In 11th International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.

Garnett, Roman, Krishnamurthy, Yamuna, Xiong, Xuehan, Schneider, Jeff, and Mann, Richard P. Bayesian optimal active search and surveying. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.

Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13(4):455–492, 1998.

Minka, Thomas P. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.

Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.

Nickisch, Hannes and Rasmussen, Carl Edward. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.

Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006. ISBN 026218253X.

Settles, Burr. Active Learning Literature Survey. Technical report, University of Wisconsin–Madison, 2009.

Tesch, Matthew, Schneider, Jeff, and Choset, Howie. Adapting Control Policies for Expensive Systems to Changing Environments. In International Conference on Intelligent Robots and Systems (IROS), 2011.

Wright, C., Buchan, A., Brown, B., Geist, J., Schwerin, M., Rollinson, D., Tesch, M., and Choset, H. Design and Architecture of the Unified Modular Snake Robot. In 2012 IEEE International Conference on Robotics and Automation, St. Paul, MN, 2012.

Zilinskas, Antanas. A review of statistical models for global optimization. Journal of Global Optimization, 2(2):145–153, June 1992.