Post on 14-Nov-2020
Reinforcement Learning
6. Monte Carlo, Bias-Variance and Model-Based RL
Olivier Sigaud
Sorbonne Université
http://people.isir.upmc.fr/sigaud
Reinforcement learning, Monte Carlo
Over-estimation bias
▶ In Q-learning, due to the max operator, if some Q-value is over-estimated, this over-estimation propagates
▶ This is not the case for under-estimation
▶ Over-estimation propagation cannot be prevented through Q-Table initialization
▶ Solution: use two Q-Tables, one for value estimation and one for value propagation
Van Hasselt, H. (2010) Double q-learning. Advances in Neural Information Processing Systems, pages 2613–2621
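The two-table idea can be sketched in a few lines. This is a minimal tabular sketch, not Van Hasselt's exact pseudocode; the names (`qa`, `qb`, `alpha`, `gamma`, the two-action space) are illustrative assumptions:

```python
import random
from collections import defaultdict

# Sketch of the tabular Double Q-learning update: one table selects the
# greedy action, the other evaluates it, which decouples action selection
# from value estimation and curbs over-estimation propagation.
alpha, gamma = 0.1, 0.99
actions = [0, 1]
qa = defaultdict(float)  # Q^A
qb = defaultdict(float)  # Q^B

def double_q_update(s, a, r, s_next):
    if random.random() < 0.5:
        # Update Q^A: A picks the argmax action, B provides its value
        a_star = max(actions, key=lambda u: qa[(s_next, u)])
        qa[(s, a)] += alpha * (r + gamma * qb[(s_next, a_star)] - qa[(s, a)])
    else:
        # Update Q^B: B picks the argmax action, A provides its value
        b_star = max(actions, key=lambda u: qb[(s_next, u)])
        qb[(s, a)] += alpha * (r + gamma * qa[(s_next, b_star)] - qb[(s, a)])
```

Each table is updated with a target evaluated by the other table, so an over-estimate in one table does not feed back into its own max operator.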
Reminder: TD error
▶ The goal of TD methods is to estimate the value function V(s)
▶ If the estimates V(st) and V(st+1) were exact, we would get V(st) = rt + γV(st+1)
▶ δt = rt + γV(st+1) − V(st) measures the error between V(st) and the value it should have given rt
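The TD error translates directly into the TD(0) value update; a minimal sketch (the states, `alpha` and `gamma` values are illustrative assumptions):

```python
# TD(0): compute the TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
# and move V(s_t) a fraction alpha toward its target.
gamma, alpha = 0.9, 0.5
V = {"s0": 0.0, "s1": 1.0}

def td0_update(V, s, r, s_next):
    delta = r + gamma * V[s_next] - V[s]  # TD error
    V[s] += alpha * delta                 # partial correction of V(s)
    return delta

delta = td0_update(V, "s0", 1.0, "s1")  # target = 1 + 0.9 * 1 = 1.9
```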
Monte Carlo (MC) methods
▶ Much used in games (e.g., Go) to evaluate a state
▶ It uses the average estimation method Ek+1(s) = Ek(s) + α[rk+1 − Ek(s)]
▶ Generate a lot of trajectories: s0, s1, ..., sN with observed rewards r0, r1, ..., rN
▶ Update state values V(sk), k = 0, ..., N−1 with:
  V(sk) ← V(sk) + α(sk)(rk + rk+1 + ... + rN − V(sk))
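The update above can be sketched as a short evaluation routine. For simplicity this sketch uses a constant α rather than a per-state α(sk), and an undiscounted return as on the slide; all names are illustrative:

```python
# Monte Carlo evaluation along one trajectory:
# V(s_k) <- V(s_k) + alpha * (r_k + r_{k+1} + ... + r_N - V(s_k))
def mc_update(V, states, rewards, alpha=0.1):
    # states: s_0 .. s_{N-1}; rewards: r_0 .. r_N
    for k, s in enumerate(states):
        ret = sum(rewards[k:])        # observed return from step k
        V[s] += alpha * (ret - V[s])  # running-average update
    return V
```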
TD vs MC
▶ Temporal Difference (TD) methods combine the properties of DP methods and Monte Carlo methods:
  ▶ In Monte Carlo, T and r are unknown, but the value update is global along full trajectories
  ▶ In DP, T and r are known, but the value update is local
  ▶ In TD, as in DP, V(st) is updated locally given an estimate of V(st+1), and T and r are unknown
▶ Note: Monte Carlo can be reformulated incrementally using the temporal difference δk update
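The incremental reformulation rests on a telescoping identity: the Monte Carlo error (return minus current value) equals the sum of the TD errors along the trajectory. A small numerical check, with γ = 1 as in the slide's undiscounted update and illustrative values:

```python
# Check that G_0 - V(s_0) equals the sum of one-step TD errors (gamma = 1).
V = {"s0": 0.2, "s1": -0.1, "s2": 0.4, "end": 0.0}
traj = [("s0", 1.0), ("s1", 0.5), ("s2", 2.0)]  # (state, reward) pairs
states = [s for s, _ in traj] + ["end"]
rewards = [r for _, r in traj]

mc_error = sum(rewards) - V["s0"]  # G_0 - V(s_0)
td_errors = [rewards[t] + V[states[t + 1]] - V[states[t]] for t in range(3)]
assert abs(mc_error - sum(td_errors)) < 1e-9  # telescoping identity
```

The intermediate V terms cancel pairwise, which is why MC updates can be applied one TD error at a time.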
Bias-variance compromise
▶ A bigger model may have more variance and less bias
▶ Trajectories are a large model of the value; a Q-Table is a smaller model
Monte Carlo, One-step TD and N-step return
▶ MC suffers from variance due to exploration (+ stochastic trajectories)
▶ MC is on-policy → less sample efficient
▶ One-step TD suffers from bias
▶ N-step TD: tuning N controls the bias-variance compromise
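The N-step return that interpolates between the two extremes can be sketched as follows (a minimal version under the assumption that `values` holds one entry per visited state, terminal included, and `rewards[k]` follows state k):

```python
# n-step return: accumulate n discounted rewards, then bootstrap on V.
# n = 1 recovers one-step TD; n >= episode length recovers Monte Carlo.
def n_step_return(rewards, values, t, n, gamma=0.9):
    end = min(t + n, len(rewards))
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    if end == t + n:                     # episode not over yet:
        g += gamma ** n * values[t + n]  # bootstrap with V(s_{t+n})
    return g
```

Small n keeps the variance of the sampled rewards low but inherits the bias of the bootstrap value; large n does the opposite.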
The N-step return in practice
▶ How do we store N-step transitions into the replay buffer?
▶ N-step Q-learning is more efficient than Q-learning
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015b) High-dimensional continuous control using generalized
advantage estimation. arXiv preprint arXiv:1506.02438
Sharma, S., Ramesh, S., Ravindran, B., et al. (2017) Learning to mix N-step returns: Generalizing λ-returns for deep
reinforcement learning. arXiv preprint arXiv:1705.07445
Model-based reinforcement learning
Eligibility traces
▶ To improve over Q-learning
▶ Naive approach: store all (s, a) pairs and back-propagate values
  ▶ Limited to finite horizon trajectories
  ▶ Speed/memory trade-off
▶ TD(λ), Sarsa(λ) and Q(λ): more sophisticated approaches to deal with infinite horizon trajectories
▶ A variable e(s) is decayed with a factor λ after s was visited and reinitialized each time s is visited again
▶ TD(λ): V(s) ← V(s) + αδe(s) (similar for Sarsa(λ) and Q(λ))
▶ If λ = 0, e(s) goes to 0 immediately, thus we get TD(0), Sarsa or Q-learning
▶ TD(1) = Monte Carlo...
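A minimal tabular TD(λ) sketch with replacing traces, matching the description above (the decay also includes γ, as is standard; parameter values are illustrative):

```python
from collections import defaultdict

# TD(lambda): every visited state keeps a trace e(s); each TD error
# updates all traced states in proportion to e(s), then traces decay.
def td_lambda_episode(trajectory, V, alpha=0.1, gamma=0.9, lam=0.8):
    # trajectory: list of (s, r, s_next) transitions; V: state -> value
    e = defaultdict(float)
    for s, r, s_next in trajectory:
        delta = r + gamma * V[s_next] - V[s]  # TD error
        e[s] = 1.0        # reset trace on visit (replacing traces)
        for x in e:       # credit every traced state
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam  # decay after the visit
    return V
```

With `lam=0` the trace vanishes immediately and only the current state is updated, recovering TD(0).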
Model-based Reinforcement Learning
▶ General idea: planning with a learnt model of T and r amounts to performing back-ups "in the agent's head" ([Sutton, 1990, Sutton, 1991])
▶ Learning T and r is an incremental self-supervised learning problem
▶ Several approaches:
  ▶ Draw random transitions in the model and apply TD back-ups
  ▶ Dyna-PI, Dyna-Q, Dyna-AC
  ▶ Better propagation: Prioritized Sweeping
Moore, A. W. & Atkeson, C. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine
Learning, 13:103–130.
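The Dyna idea can be sketched as tabular Dyna-Q. This is an illustrative sketch under the assumption of a deterministic learnt model (each (s, a) maps to a single remembered (r, s')); names and constants are not from the slides:

```python
import random
from collections import defaultdict

alpha, gamma, k_planning = 0.1, 0.95, 5
actions = [0, 1]
Q = defaultdict(float)
model = {}  # (s, a) -> (r, s_next), learnt incrementally

def q_backup(s, a, r, s_next):
    best = max(Q[(s_next, u)] for u in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

def dyna_q_step(s, a, r, s_next):
    q_backup(s, a, r, s_next)    # direct RL from real experience
    model[(s, a)] = (r, s_next)  # self-supervised model learning
    for _ in range(k_planning):  # planning: back-ups "in the head"
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_backup(ps, pa, pr, ps_next)
```

Each real transition triggers `k_planning` extra simulated back-ups, which is why Dyna propagates values more often per real sample than plain Q-learning.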
Dyna architecture and generalization
▶ Thanks to the model of transitions, Dyna can propagate values more often
▶ Problem: in the stochastic case, the model of transitions is in card(S) × card(S) × card(A)
▶ Hence the usefulness of compact models
▶ MACS: Dyna with generalisation (Learning Classifier Systems)
▶ SPITI: Dyna with generalisation (Factored MDPs)
Gerard, P., Meyer, J.-A., & Sigaud, O. (2005) Combining latent learning with dynamic programming in MACS. European Journal
of Operational Research, 160:614–637.
Degris, T., Sigaud, O., & Wuillemin, P.-H. (2006) Learning the Structure of Factored Markov Decision Processes in Reinforcement
Learning Problems. Proceedings of the 23rd International Conference on Machine Learning (ICML’2006), pages 257–264
Any questions?
Send mail to: Olivier.Sigaud@upmc.fr
References
Degris, T., Sigaud, O., & Wuillemin, P.-H. (2006). Learning the Structure of Factored Markov Decision Processes in Reinforcement Learning Problems. In Proceedings of the 23rd International Conference on Machine Learning, pages 257–264, CMU, Pennsylvania.

Gerard, P., Meyer, J.-A., & Sigaud, O. (2005). Combining latent learning with dynamic programming in MACS. European Journal of Operational Research, 160:614–637.

Moore, A. W. & Atkeson, C. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103–130.

Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Sharma, S., Ramesh, S., Ravindran, B., et al. (2017). Learning to mix n-step returns: Generalizing λ-returns for deep reinforcement learning. arXiv preprint arXiv:1705.07445.

Sutton, R. S. (1990). Integrating architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, San Mateo, CA. Morgan Kaufmann.

Sutton, R. S. (1991). DYNA, an integrated architecture for learning, planning and reacting. SIGART Bulletin, 2:160–163.

Van Hasselt, H. (2010). Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621.