Multi-armed Bandit Problem and Bayesian Optimization in Reinforcement Learning

From Cognitive Science and Machine Learning Summer School 2010

Loris Bazzani

Outline Summer School

www.videolectures.net

Outline Presentation

• What are Machine Learning and Cognitive Science?
• How are they related to each other?
• Reinforcement Learning
– Background
– Discrete case
– Continuous case

What is Machine Learning (ML)?
• Endow computers with the ability to “learn” from “data”
• Present data from sensors, the internet, experiments
• Expect computer to make decisions
• Traditionally categorized as:
– Supervised Learning: classification, regression
– Unsupervised Learning: dimensionality reduction, clustering
– Reinforcement Learning: learning from feedback, planning

From N. Lawrence slides

What is Cognitive Science (CogSci)?

• How does the mind get so much out of so little?
– Rich models of the world
– Make strong generalizations
• Process of reverse engineering of the brain
– Create computational models of the brain
• Much of cognition involves induction: finding patterns in data

From N. Chater slides

Link between CogSci and ML
• ML takes inspiration from psychology, CogSci and computer science
– Rosenblatt’s Perceptron
– Neural Networks
– …
• CogSci uses ML as an engineering toolkit
– Bayesian inference in generative models
– Hierarchical probabilistic models
– Approximate methods of learning and inference
– …

Multi-armed Bandit Problem [Auer et al. ‘95]

I wanna win a lot of cash!

• Trade-off between Exploration and Exploitation

• Adversary controls payoffs
• No statistical assumptions on the reward distribution
• Performance measure: Regret = Best Reward - Player Reward
• Upper bound on the expected regret

[Figure labels: Actions, Sequence of Trials, Reward(s)]

Goal: define a probability distribution over the actions

The Full Information Game [Freund & Schapire ‘95]

Regret Bound:

Problem: Compute the reward for each action!
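The full-information setting can be sketched in a few lines of Python. This is a minimal illustration, not the slide's exact algorithm: it assumes the exponential-weights form p_i proportional to exp(eta * G_i), where G_i is the cumulative reward of action i, and the cited papers may parameterize the weights differently. The names `hedge_probabilities`, `run_hedge`, `reward_table` and `eta` are hypothetical.

```python
import math
import random


def hedge_probabilities(cumulative_rewards, eta):
    """Exponential weights: p_i is proportional to exp(eta * G_i)."""
    top = max(cumulative_rewards)  # subtract the max for numerical stability
    weights = [math.exp(eta * (g - top)) for g in cumulative_rewards]
    total = sum(weights)
    return [w / total for w in weights]


def run_hedge(reward_table, eta, rng):
    """Play one game. reward_table[t][i] is the reward in [0, 1] of action i
    at trial t. Returns (player_reward, reward_of_best_fixed_action)."""
    k = len(reward_table[0])
    cumulative = [0.0] * k
    player_reward = 0.0
    for rewards in reward_table:
        probs = hedge_probabilities(cumulative, eta)
        action = rng.choices(range(k), weights=probs)[0]
        player_reward += rewards[action]
        # Full information: the reward of EVERY action is observed and used,
        # which is exactly the requirement the slide flags as the problem.
        for i in range(k):
            cumulative[i] += rewards[i]
    return player_reward, max(cumulative)


# Toy game (assumed data): action 1 always pays 1, action 0 always pays 0.
table = [[0.0, 1.0] for _ in range(200)]
player, best = run_hedge(table, eta=0.5, rng=random.Random(0))
regret = best - player  # Regret = Best Reward - Player Reward
```

The inner loop over all `k` actions is what makes this a full-information algorithm: in a real bandit only `rewards[action]` would be available.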

The Partial Information Game
Exp3 = Exponential-weight algorithm for Exploration and Exploitation

Update only the selected action

Tries out all the possible actions

Bound holds for certain values of the parameters, depending on the best reward
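A minimal Python sketch of the partial-information idea follows. It assumes the commonly stated form of Exp3, in which the exponential weights are mixed with a uniform distribution (so every action keeps being tried) and only the selected action is updated, using the importance-weighted estimate reward / probability. The names `exp3_probabilities`, `run_exp3`, `reward_fn` and `gamma` are hypothetical, and the parameterization in the cited paper may differ.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with the uniform distribution.
    The gamma / k term guarantees that all actions are tried out."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def run_exp3(reward_fn, k, n_trials, gamma, rng):
    """reward_fn(t, action) returns a reward in [0, 1]; only the reward of
    the chosen action is ever queried (partial information)."""
    weights = [1.0] * k
    total_reward = 0.0
    for t in range(n_trials):
        probs = exp3_probabilities(weights, gamma)
        action = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, action)
        total_reward += reward
        # Importance-weighted estimate: unbiased for the unseen reward vector.
        estimate = reward / probs[action]
        # Update only the selected action.
        weights[action] *= math.exp(gamma * estimate / k)
        # Rescale so the weights never overflow; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return total_reward, weights


# Toy game (assumed data): only action 2 ever pays.
total, final_weights = run_exp3(
    lambda t, a: 1.0 if a == 2 else 0.0,
    k=3, n_trials=2000, gamma=0.1, rng=random.Random(0),
)
```

Because each probability is at least gamma / k, the estimate is bounded by k / gamma and the exponent in the update never exceeds 1.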

The Partial Information Game
Exp3.1 = Exp3 with rounds, where a round consists of a sequence of trials

Each round guesses a bound for the total reward of the best action

Bound:

Applications [Hedge] [Bazzani et al. ‘10]

Bayesian Optimization [Brochu et al. ‘10]

• Optimize a nonlinear function over a set:

Classic Optimization Tools:
• Known mathematical representation
• Convex
• Evaluation of the function at all the points

Bayesian Optimization Tools:
• No closed-form expression
• Not convex
• Evaluating the function at a single point returns a noisy response

[Annotations: the points of the set are the actions; the function gives the rewards]

• Uses Bayes' theorem

Prior: our beliefs about the space of possible objective functions

Posterior: our updated beliefs about the unknown objective function

Likelihood: given what we think we know about the prior, how likely is the data we have seen?

Goal: maximize the posterior at each step, so that each new evaluation decreases the distance between the true global maximum and the expected maximum given the model.
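The relation between these three quantities can be written compactly. The notation is an assumption (the slide's own formula did not survive extraction): f is the unknown objective and D_{1:t} is the set of the first t observations.

```latex
% Posterior is proportional to likelihood times prior
P(f \mid \mathcal{D}_{1:t}) \;\propto\; P(\mathcal{D}_{1:t} \mid f)\, P(f),
\qquad \mathcal{D}_{1:t} = \{(x_i, y_i)\}_{i=1}^{t}
```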

Priors over Functions

• Convergence conditions of BO:
– The acquisition function is continuous and approximately minimizes the risk
– Conditional variance converges to zero

Guaranteed by Gaussian Processes (GP)

– The objective is continuous
– The prior is homogeneous
– The optimization is independent of the m-th differences

Priors over Functions

• GP = extension of the multivariate Gaussian distribution to an infinite-dimensional stochastic process

• Any finite linear combination of samples will be normally distributed

• Defined by its mean function and covariance function

• Focus on defining the covariance function

Why use GPs?
• Assume a zero-mean GP: function values are drawn according to

, where

• When a new observation comes

• Using Sherman-Morrison-Woodbury formula
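The practical appeal is that the prediction at a new point is available in closed form. The sketch below assumes the standard zero-mean GP conditional, with mean k^T K^{-1} y and variance k(x, x) - k^T K^{-1} k, for one-dimensional inputs and a squared exponential kernel. It uses a Cholesky solve in place of the Sherman-Morrison-Woodbury identity named on the slide, and all names (`gp_posterior`, `xs`, `ys`, `jitter`) are hypothetical.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (symmetric positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][p] * lower[j][p] for p in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def cholesky_solve(lower, rhs):
    """Solve (L L^T) x = rhs by forward then backward substitution."""
    n = len(rhs)
    z = [0.0] * n
    for i in range(n):
        z[i] = (rhs[i] - sum(lower[i][p] * z[p] for p in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(lower[p][i] * x[p] for p in range(i + 1, n))
        x[i] = (z[i] - tail) / lower[i][i]
    return x


def gp_posterior(xs, ys, x_new, kernel=sq_exp_kernel, jitter=1e-9):
    """Predictive mean and variance at x_new given observations (xs, ys)."""
    n = len(xs)
    # Gram matrix K, with a small jitter on the diagonal for stability.
    gram = [[kernel(xs[i], xs[j]) + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    k_vec = [kernel(x, x_new) for x in xs]
    lower = cholesky(gram)
    alpha = cholesky_solve(lower, ys)   # K^{-1} y
    v = cholesky_solve(lower, k_vec)    # K^{-1} k
    mean = sum(k * a for k, a in zip(k_vec, alpha))
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_vec, v))
    return mean, max(var, 0.0)


# Assumed toy data: three noiseless observations.
mean_at_seen, var_at_seen = gp_posterior([0.0, 1.0, 2.0], [0.0, 1.0, 0.5], 1.0)
mean_far, var_far = gp_posterior([0.0, 1.0, 2.0], [0.0, 1.0, 0.5], 10.0)
```

At an observed input the prediction reproduces the observation with near-zero variance; far from the data it falls back to the prior (mean 0, variance 1).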

Choice of Covariance Functions

• Isotropic model with hyperparameter

• Squared Exponential Kernel

• Matérn Kernel (defined using the Gamma function and a Bessel function)
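As an illustration, both kernels can be written as functions of the distance r between two points. This sketch assumes an isotropic model with a single length-scale hyperparameter, and it shows the Matérn kernel only for smoothness 3/2 and 5/2, where the Gamma and Bessel terms reduce to well-known closed forms. The function names and the `length_scale` parameter are hypothetical.

```python
import math


def squared_exponential(r, length_scale=1.0):
    """k(r) = exp(-r^2 / (2 * length_scale^2))."""
    return math.exp(-0.5 * (r / length_scale) ** 2)


def matern32(r, length_scale=1.0):
    """Matern kernel with smoothness 3/2: (1 + s) * exp(-s), s = sqrt(3) r / l."""
    s = math.sqrt(3.0) * r / length_scale
    return (1.0 + s) * math.exp(-s)


def matern52(r, length_scale=1.0):
    """Matern kernel with smoothness 5/2: (1 + s + s^2 / 3) * exp(-s),
    s = sqrt(5) r / l."""
    s = math.sqrt(5.0) * r / length_scale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


# Covariance at a few distances for each kernel.
for r in (0.0, 0.5, 1.0, 2.0):
    print(r, squared_exponential(r), matern32(r), matern52(r))
```

All three kernels equal 1 at r = 0 and decay with distance; larger smoothness in the Matérn family gives functions closer to those of the squared exponential kernel.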

Acquisition Functions

• The role of the acquisition function is to guide the search for the optimum toward regions where the predicted objective is high or the uncertainty is great

• Assumption: Optimizing the acquisition function is simple and cheap

• Goal: high acquisition corresponds to potentially high values of the objective function

• Maximizing the probability of improvement

Acquisition Functions
• Expected improvement

• Confidence bound criterion

CDF and PDF of normal distribution
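The three criteria can be sketched as functions of the GP's predictive mean and standard deviation at a candidate point. This assumes the standard maximization forms, with `best` the highest value observed so far, `xi` an exploration margin, and `kappa` a confidence-bound multiplier. These names and their default values are illustrative assumptions, not taken from the slides.

```python
import math


def normal_pdf(z):
    """PDF of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best, xi=0.0):
    """P(f(x) >= best + xi) under the Gaussian prediction."""
    if std <= 0.0:
        return 1.0 if mean - best - xi > 0.0 else 0.0
    return normal_cdf((mean - best - xi) / std)


def expected_improvement(mean, std, best, xi=0.0):
    """E[max(f(x) - best - xi, 0)] under the Gaussian prediction."""
    improvement = mean - best - xi
    if std <= 0.0:
        return max(improvement, 0.0)
    z = improvement / std
    return improvement * normal_cdf(z) + std * normal_pdf(z)


def upper_confidence_bound(mean, std, kappa=2.0):
    """Optimistic estimate: mean plus kappa standard deviations."""
    return mean + kappa * std


# Two candidates with the same predicted mean: the more uncertain one
# scores higher on expected improvement, which rewards exploration.
ei_certain = expected_improvement(mean=0.0, std=1.0, best=1.0)
ei_uncertain = expected_improvement(mean=0.0, std=2.0, best=1.0)
```

In a Bayesian optimization loop, the next point to evaluate would be the maximizer of one of these functions over the candidate set.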

Applications [BO]

Learn a set of robot gait parameters that maximize velocity of a Sony AIBO ERS-7 robot

Find a policy for robot path planning that would minimize uncertainty about its location and heading

Select the locations of a set of sensors (e.g., cameras) in a dynamic system

Take-home Message

• ML and CogSci are connected

• Reinforcement Learning is useful for optimization when dealing with temporal information
– Discrete case: Multi-armed bandit problem
– Continuous case: Bayesian optimization

• We can employ these techniques for Computer Vision and System Control problems

http://heli.stanford.edu/

[Abbeel et al. 2007]

Some References

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS '95.

Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT '95.

Eric Brochu, Vlad Cora and Nando de Freitas. 2009. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. Technical Report TR-2009-023. UBC.

Loris Bazzani, Nando de Freitas and Jo-Anne Ting. 2010. Learning attentional mechanisms for simultaneous object tracking and recognition with deep networks. NIPS 2010 Deep Learning and Unsupervised Feature Learning Workshop.

Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning. The MIT Press.

Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y. Ng. 2007. An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 2007.
