Value Function Approximation via Low-Rank Models
Hao Yi Ong
AA 222, Stanford University
May 28, 2015
Outline
Introduction
Formulation
Approach
Numerical experiments
Value function approximation
- a Markov decision process can be solved optimally given the state-action value function
  – the value function gives the utility of taking an action in a given state; we want to find the action that maximizes utility
  – can be represented as a matrix for discrete problems
  – typically millions or billions of dimensions for practical problems
- value function approximation finds a compact alternative
  – basis functions are widely used in reinforcement learning (RL)
  – e.g., Gaussian radial basis functions, neural networks
Value function decomposition
idea: approximate value function as low-rank plus sparse components
- assumes intrinsic low-dimensionality
  – i.e., the value function can be captured by a small set of features
  – hinted at by the success of basis function approximation in RL
- falls under the category of Robust Principal Component Analysis (PCA)
  – widely used in image/video analysis and collaborative filtering, e.g., the Netflix challenge
  – a novel application of Robust PCA, as far as the author is aware
Markov decision process
defined by the tuple $(S, A, T, R)$
- $S$ and $A$ are the sets of all possible states and actions, respectively
- $T$ gives the probability of transitioning into state $s'$ after taking action $a$ in the current state $s$, and is often denoted $T(s, a, s')$
- $R$ gives a scalar value indicating the immediate reward received for taking action $a$ in the current state $s$, and is denoted $R(s, a)$
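For a discrete problem, the tuple can be held as two arrays. The sketch below is only an illustration: the names `TabularMDP` and `make_mdp` are hypothetical and not taken from the slides, and the array layout (`T[s, a, s2]`, `R[s, a]`) is an assumed convention.

```python
from typing import NamedTuple

import numpy as np


class TabularMDP(NamedTuple):
    """Hypothetical container for a discrete MDP (S and A are implied by the shapes)."""
    T: np.ndarray  # shape (|S|, |A|, |S|); T[s, a, s2] = probability of s -> s2 under a
    R: np.ndarray  # shape (|S|, |A|); R[s, a] = immediate reward


def make_mdp(T, R) -> TabularMDP:
    """Validate shapes and check that each T[s, a, :] is a probability distribution."""
    T = np.asarray(T, dtype=float)
    R = np.asarray(R, dtype=float)
    if T.ndim != 3 or T.shape[0] != T.shape[2]:
        raise ValueError("T must have shape (|S|, |A|, |S|)")
    if R.shape != T.shape[:2]:
        raise ValueError("R must have shape (|S|, |A|)")
    if np.any(T < 0) or not np.allclose(T.sum(axis=2), 1.0):
        raise ValueError("each T[s, a, :] must be nonnegative and sum to 1")
    return TabularMDP(T, R)


# Two states, two actions: action 0 stays put, action 1 switches state.
T_example = np.array([[[1.0, 0.0], [0.0, 1.0]],
                      [[0.0, 1.0], [1.0, 0.0]]])
R_example = np.array([[0.0, 1.0],
                      [0.0, -1.0]])
mdp = make_mdp(T_example, R_example)
```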
Value iteration
want to find the optimal policy $\pi^\star(s)$
- returns the action that maximizes the utility from any given state
- related to the state-action value function $Q^\star(s, a)$

$$\pi^\star(s) = \operatorname*{argmax}_{a \in A} Q^\star(s, a)$$

- value iteration updates the value function guess $\hat{Q}$ until convergence

$$\hat{Q}(s, a) := R(s, a) + \sum_{s' \in S} T(s, a, s') \max_{a' \in A} \hat{Q}(s', a')$$
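The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: the function names, the stopping tolerance, and the toy chain problem are all assumptions. Because the update as written has no discount factor, the toy problem uses an absorbing zero-reward terminal state so that the iteration settles.

```python
import numpy as np


def value_iteration(T, R, tol=1e-8, max_iter=10_000):
    """Iterate Q(s,a) := R(s,a) + sum_s' T(s,a,s') max_a' Q(s',a') until it stops changing.

    T has shape (|S|, |A|, |S|) and R has shape (|S|, |A|) (assumed layout).
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iter):
        V = Q.max(axis=1)        # max over a' of Q(s', a'), one value per state
        Q_new = R + T @ V        # (|S|,|A|,|S|) @ (|S|,) -> (|S|,|A|)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q


def chain_mdp(n=4):
    """Hypothetical toy problem: walk left (0) or right (1); the last state is terminal."""
    T = np.zeros((n, 2, n))
    R = np.full((n, 2), -1.0)            # each step costs 1 ...
    for s in range(n - 1):
        T[s, 0, max(s - 1, 0)] = 1.0     # left (stays put at the left wall)
        T[s, 1, s + 1] = 1.0             # right
    T[n - 1, :, n - 1] = 1.0             # terminal state is absorbing
    R[n - 1, :] = 0.0                    # ... except in the terminal state
    return T, R


T, R = chain_mdp()
Q = value_iteration(T, R)
policy = Q.argmax(axis=1)                # greedy policy: argmax over actions of Q(s, a)
```

The resulting `Q` array is exactly the matrix that the decomposition in the following slides operates on: one row per state, one column per action.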
Matrix decomposition
- suppose the matrix $M \in \mathbb{R}^{m \times n}$ encodes $Q^\star(s, a)$
  – $m$ and $n$ are the cardinalities of the state and action spaces
- approximate with the decomposition $M = L_0 + S_0$
  – $L_0$ and $S_0$ are the true low-rank and sparse components
- why should this work?
  – implicit assumption about correlation of utility values across actions
Matrix decomposition
$$\underbrace{M}_{m \times n} = \underbrace{A_{L_0}}_{m \times r} \, \underbrace{B_{L_0}^T}_{r \times n} + \underbrace{S_0}_{m \times n}$$
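To see why the factored form is compact, the sketch below builds a synthetic matrix of this shape and counts stored numbers. The sizes, the random data, and the helper name `storage_counts` are illustrative assumptions, and the sparse part is assumed to be stored as (row, column, value) triplets.

```python
import numpy as np


def storage_counts(m, n, r, nnz):
    """Numbers stored for the dense matrix versus factors plus sparse triplets."""
    dense = m * n
    compact = m * r + r * n + 3 * nnz    # A, B^T, and (row, col, value) per nonzero
    return dense, compact


rng = np.random.default_rng(0)
m, n, r, nnz = 200, 50, 3, 100           # hypothetical sizes

A = rng.standard_normal((m, r))          # plays the role of A_{L_0}
B = rng.standard_normal((n, r))          # plays the role of B_{L_0}
S0 = np.zeros((m, n))
idx = rng.choice(m * n, size=nnz, replace=False)
S0.flat[idx] = rng.standard_normal(nnz)  # a few isolated corrections

M = A @ B.T + S0                         # M = A B^T + S_0
dense, compact = storage_counts(m, n, r, nnz)
```

With these sizes the dense matrix holds 10000 numbers while the factored form holds 1050, and the gap widens as $m$ and $n$ grow with $r$ fixed.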
Principal Component Pursuit (PCP)
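- best known convex relaxation of Robust PCA

$$\begin{array}{ll} \text{minimize} & \|L\|_* + \lambda \|S\|_1 \\ \text{subject to} & L + S = M \end{array}$$

- intuitively
  – the nuclear norm $\|\cdot\|_*$ (the sum of singular values) is the best convex surrogate for rank
  – the $\ell_1$-norm has a sparsifying property
- remarkably, under suitable conditions (an incoherent low-rank component and a sufficiently sparse, randomly supported sparse component), the solution to PCP recovers $L_0$ and $S_0$ exactly with high probability [CLMW11]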
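One common way to solve PCP numerically is the augmented Lagrangian (ADMM-style) iteration described in [CLMW11], which alternates singular value thresholding for $L$ with entrywise soft thresholding for $S$. The sketch below is a minimal illustration under that assumption and is not necessarily what the author's implementation does. The function names are hypothetical, and the defaults $\lambda = 1/\sqrt{\max(m, n)}$ and $\mu = mn / (4\|M\|_1)$ follow the choices suggested in that paper.

```python
import numpy as np


def shrink(X, tau):
    """Entrywise soft thresholding: the proximal operator of tau * l1-norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Singular value thresholding: the proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def pcp(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Approximately solve: minimize ||L||_* + lam ||S||_1 subject to L + S = M."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = m * n / (4.0 * np.abs(M).sum()) if mu is None else mu
    S = np.zeros((m, n))
    Y = np.zeros((m, n))                       # multiplier for L + S = M
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S


# Synthetic check: rank-2 matrix plus 5% sparse corruption (illustrative data only).
rng = np.random.default_rng(0)
m, n, r = 60, 60, 2
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
S0 = np.zeros((m, n))
idx = rng.choice(m * n, size=180, replace=False)
S0.flat[idx] = rng.choice([-5.0, 5.0], size=180)

L_hat, S_hat = pcp(L0 + S0)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
```

In a value function setting, `M` would be the tabulated $\hat{Q}$ matrix; a truncated SVD of `L_hat` then yields the factors $A_{L_0}$ and $B_{L_0}$.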
Mountain car
Inverted pendulum
Implementation
https://github.com/haoyio/LowRankMDP
References
[CLMW11] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the Association for Computing Machinery, 58(3), 2011.