RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research
Alborz Geramifard*, Christoph Dann, Robert H. Klein, William Dabney*, Jonathan P. How
* Author is currently employed at Amazon.

Abstract
RLPy is an object-oriented reinforcement learning (RL) software package with a focus on value-function-based methods using linear function approximation and discrete actions. The framework was designed for both education and research purposes. It provides a rich library of fine-grained, easily exchangeable components for learning agents (e.g., policies or representations of value functions), facilitating the recent increased specialization in RL. RLPy is written in Python to allow fast prototyping, but is also suitable for large-scale experiments through its built-in support for optimized numerical libraries and parallelization. Code profiling, domain visualizations, and data analysis are integrated in a self-contained package available under the Modified BSD License. All these properties allow users to compare various RL algorithms with little effort.
[Figure: linear value-function approximation — a simulator produces states whose features φ_1, …, φ_n are combined with learned weights θ_1, …, θ_n to form the value function or policy: V^π(s) ≈ θ^⊤ φ(s).]
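To make the equation concrete, here is a minimal NumPy sketch of the approximation; the feature values and weights are illustrative placeholders, not RLPy API:

import numpy as np

def v_hat(phi_s: np.ndarray, theta: np.ndarray) -> float:
    """Linear value-function approximation: V^pi(s) ~= theta . phi(s)."""
    return float(theta @ phi_s)

# Toy example: four binary features (e.g., active tiles) and their weights.
phi_s = np.array([1.0, 0.0, 1.0, 0.0])   # feature vector phi(s)
theta = np.array([0.5, -0.2, 1.5, 0.3])  # weight vector theta
print(v_hat(phi_s, theta))               # -> 2.0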
Problem
Interest in value-based RL with linear function approximation [1-4]
Increased specialization in RL ➙ need for a more granular framework
Comparison with state-of-the-art techniques in various domains [3-4]
An easy-to-use RL framework for both research and education
Easy Installation
> pip install -U RLPy
Rapid Prototyping in Python
Existing Gap
Lack of granularity to accommodate recent advances in RL [6]
Challenging for entry-level users due to the programming languages used (e.g., C++) [6-7]
Not self-contained [5]
Conclusion
RLPy is a new Python-based open-source RL framework.
Simplifies the construction of new RL ideas
Accessible to both novice and expert users
Enables reproducible experiments
References
[1] A. Tamar, H. Xu, S. Mannor, “Scaling Up Approximate Value Iteration with Options: Better Policies with Fewer Iterations”, in Proceedings of the 31st International Conference on Machine Learning (ICML), pages 127-135, 2014.
[2] D. Calandriello, A. Lazaric, M. Restelli, “Sparse Multi-Task Reinforcement Learning”, in Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 819-827, Quebec, Canada, 2014.
[3] J. Z. Kolter and A. Y. Ng, “Regularization and Feature Selection in Least-squares Temporal Difference Learning”, in Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 521-528, New York, NY, USA, 2009.
[4] A. Geramifard, T. Walsh, N. Roy, and J. How, “Batch iFDD: A Scalable Matching Pursuit Algorithm for Solving MDPs”, in Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
[5] D. Aberdeen, “LibPGRL: A High Performance Reinforcement Learning Library in C++”, 2007. URL https://code.google.com/p/libpgrl.
[6] F. de Comite, “PIQLE: A Platform for Implementation of Q-Learning Experiments”, 2006. URL http://piqle.sourceforge.net.
[7] G. Neumann, “The Reinforcement Learning Toolbox, Reinforcement Learning for Optimal Control Tasks”, MSc thesis, TU Graz, 2005.
Examples
[Figure: RLPy software architecture — an Experiment connects a Learning Agent to a Domain. Inside the agent, a Representation maintains the value function or policy (Q, V), a Policy π selects actions, and an RL Agent updates the representation; at each step the agent sends action a_t to the Domain and receives s_{t+1}, r_{t+1}.]
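A minimal sketch of how these components compose, following the quick-start style of the RLPy documentation; exact constructor arguments (e.g., initial_learn_rate, num_policy_checks) may differ between versions:

from rlpy.Domains import GridWorld
from rlpy.Agents import Q_Learning
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Experiments import Experiment

# Each box in the architecture figure maps to one exchangeable object.
domain = GridWorld()                           # Domain
representation = Tabular(domain)               # Representation of Q/V
policy = eGreedy(representation, epsilon=0.2)  # Policy pi
agent = Q_Learning(policy, representation,     # RL Agent
                   discount_factor=domain.discount_factor,
                   initial_learn_rate=0.1)
experiment = Experiment(agent=agent, domain=domain,  # Experiment
                        exp_id=1, max_steps=10000,
                        num_policy_checks=10)
experiment.run()
experiment.save()

Swapping in a different representation (e.g., tile coding) or learning algorithm means changing one line, which is the point of the fine-grained design.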
[Figure: domain visualizations — screenshots of the Grid World and Cart Pole simulators.]
[Figure: built-in plotting — learning curves of Return, Number of Features, and Termination plotted against Learning Steps, Learning Episodes, or Learning Time.]
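The curves above can be regenerated after a run; a sketch, assuming the experiment object from the earlier quick-start example and its documented plot helper:

# Plot learning statistics gathered during experiment.run(),
# e.g., Return and Number of Features vs. Learning Steps.
experiment.plot()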
[Figure: call-graph profile of a Greedy-GQ learning run on the Pendulum domain, total time 84.49 s. The hot spots are feature computation and action selection: Representation.phi (48.6%), tile coding (tiles.phi_nonTerminal 48.3%, tiles._physical_addr 30.5%, tiles._hash 15.7%), Greedy_GQ.learn (46.1%), Representation.bestActions (35.4%), and eGreedy.pi (33.9%).]
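The call graph above comes from RLPy's integrated profiler; an equivalent stand-alone sketch using only the standard library's cProfile (the main() entry point is a hypothetical stand-in for an experiment script):

import cProfile
import pstats

def main():
    # Hypothetical stand-in: construct and run an RLPy experiment here.
    pass

# Profile the run and report the ten most expensive calls, mirroring
# the cumulative percentages shown in the call graph above.
cProfile.run("main()", "rlpy_profile.out")
pstats.Stats("rlpy_profile.out").sort_stats("cumulative").print_stats(10)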
RLPy (rlpy.readthedocs.org)
Sponsored by: FA9550-09-1-0522, N000141110688
Built-in Hyperparameter Optimization
Improved Experimentation: Reproducible, Parallel Execution, Optimized Implementation (Cython, C++)
Batteries Included! 20 Domains, 8 Learning Algorithms, 4 Policies, 7 Representations
Built-in Profiling
Improved Granularity of the Agent using Object-Oriented Python
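As a sketch of the "Reproducible, Parallel Execution" point: in RLPy the experiment id seeds a run, so repeating an id reproduces its results, and distinct ids can run in parallel. The construction below reuses the hedged quick-start components from earlier; argument names may vary by version:

from concurrent.futures import ProcessPoolExecutor

from rlpy.Domains import GridWorld
from rlpy.Agents import Q_Learning
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Experiments import Experiment

def run_seed(exp_id: int) -> None:
    # exp_id seeds the random number generators: rerunning the same id
    # reproduces the run, while different ids give independent trials.
    domain = GridWorld()
    representation = Tabular(domain)
    policy = eGreedy(representation, epsilon=0.2)
    agent = Q_Learning(policy, representation,
                       discount_factor=domain.discount_factor,
                       initial_learn_rate=0.1)
    experiment = Experiment(agent=agent, domain=domain,
                            exp_id=exp_id, max_steps=5000, path="./results")
    experiment.run()
    experiment.save()

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:          # one process per seed
        list(pool.map(run_seed, range(1, 11)))   # exp_id = 1..10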