Managing Uncertainty within Value Function Approximation in Reinforcement Learning
Workshop on Active Learning and Experimental Design
Matthieu GEIST and Olivier PIETQUIN
May 16, 2010
Outline
1. Some Background: Paradigm; Formalism; Solving (Dynamic Programming)
2. Value function approximation & Managing uncertainty: Generalized policy iteration; Why managing uncertainty in RL?
3. A Kalman-based value function tracker: Casting VFA as a filtering problem; Kalman Temporal Differences
4. Managing Uncertainty: Computing values standard deviations; An active learning scheme; Exploration/exploitation dilemma
5. Conclusion and perspective
Formalism

Markov Decision Process (system): MDP = {S, A, P, R, γ}
- S state space, A action space, γ discount factor
- transition probabilities $P : (s, a) \in S \times A \mapsto p(\cdot|s, a) \in \mathcal{P}(S)$
- reward function $R : (s, a, s') \in S \times A \times S \mapsto r = R(s, a, s') \in \mathbb{R}$
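As a concrete illustration (not part of the slides), such an MDP can be stored as plain arrays; the toy sizes and names below are arbitrary.

```python
import numpy as np

# A toy MDP {S, A, P, R, gamma} with 3 states and 2 actions (illustrative only).
n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = p(s' | s, a): one probability distribution over S for each (s, a).
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# R[s, a, s'] = reward obtained when moving from s to s' under action a.
R = np.random.randn(n_states, n_actions, n_states)

assert np.allclose(P.sum(axis=-1), 1.0)  # each p(.|s, a) sums to one
```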
Policy (control of the system by the agent): $\pi : s \in S \mapsto \pi(\cdot|s) \in \mathcal{P}(A)$
Value function (control quality): $Q^\pi(s, a) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, a_0 = a, \pi\right]$
Policy Iteration
- Bellman evaluation equation: $Q^\pi(s, a) = \mathbb{E}_{s', a' \mid \pi, s, a}\left[R(s, a, s') + \gamma Q^\pi(s', a')\right], \; \forall s, a$
- greedy policy: $\pi_{\text{greedy}}(s) = \arg\max_{a \in A} Q^\pi(s, a), \; \forall s$
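A minimal tabular sketch of the resulting policy-iteration loop (iterate the Bellman evaluation equation for the current policy, then improve greedily); the toy MDP and all names below are illustrative, not from the talk.

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi, n_sweeps=500):
    """Iterate the Bellman evaluation equation for a deterministic policy pi."""
    n_states, n_actions = P.shape[:2]
    Q = np.zeros((n_states, n_actions))
    expected_R = (P * R).sum(axis=-1)            # E_{s'}[R(s, a, s')]
    for _ in range(n_sweeps):
        V_pi = Q[np.arange(n_states), pi]        # Q^pi(s', pi(s'))
        Q = expected_R + gamma * P @ V_pi        # Bellman evaluation backup
    return Q

def policy_iteration(P, R, gamma, n_iters=20):
    pi = np.zeros(P.shape[0], dtype=int)
    for _ in range(n_iters):
        Q = evaluate_policy(P, R, gamma, pi)
        pi = Q.argmax(axis=1)                    # greedy improvement
    return pi, Q

# Toy MDP: P[s, a, s'] transition kernel, R[s, a, s'] rewards (illustrative only).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))
R = rng.normal(size=(4, 2, 4))
pi_star, Q_pi = policy_iteration(P, R, gamma=0.9)
```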
Value Iteration
- (nonlinear) Bellman optimality equation: $Q^*(s, a) = \mathbb{E}_{s'}\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right] = TQ^*(s, a)$
- $T$ is a contraction and $Q^*$ its unique fixed point: $Q_{i+1} = TQ_i$, $Q_i \to Q^*$
- optimal policy: greedy with respect to $Q^*$
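The corresponding value-iteration sketch simply applies the operator $T$ repeatedly (same kind of illustrative toy MDP, written out so the snippet is self-contained).

```python
import numpy as np

def value_iteration(P, R, gamma, n_sweeps=500):
    """Iterate Q <- T Q, with T the Bellman optimality operator."""
    n_states, n_actions = P.shape[:2]
    Q = np.zeros((n_states, n_actions))
    expected_R = (P * R).sum(axis=-1)                 # E_{s'}[R(s, a, s')]
    for _ in range(n_sweeps):
        Q = expected_R + gamma * P @ Q.max(axis=1)    # (T Q)(s, a)
    return Q

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))            # P[s, a, s']
R = rng.normal(size=(4, 2, 4))                        # R[s, a, s']
Q_star = value_iteration(P, R, gamma=0.9)
pi_star = Q_star.argmax(axis=1)                       # greedy w.r.t. Q*
```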
Value function approximation & Managing uncertainty
Generalized policy iteration: the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.

Evaluation
- learn the value function of the followed policy
- model-free learning
- generalization

Improvement
- ε-greedy policy
- Boltzmann policy: $\pi(a|s) = \frac{\exp(\hat{Q}(s, a)/T)}{\sum_{b \in A} \exp(\hat{Q}(s, b)/T)}$
- exploration/exploitation dilemma
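A short sketch of these two improvement rules, given the estimated values $\hat{Q}(s, \cdot)$ for one state as a vector (function names are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability 1 - epsilon take the greedy action, otherwise explore uniformly."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature, rng):
    """Sample a with probability exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)."""
    logits = q_values / temperature
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

rng = np.random.default_rng(0)
action = boltzmann(np.array([1.0, 2.0, 0.5]), temperature=0.5, rng=rng)
```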
Why should we manage uncertainty in RL?

Exploration/exploitation dilemma
- uncertainty information about the value estimates proves useful for deciding between exploration and exploitation
- this kind of information is used in some schemes addressing the exploration/exploitation dilemma; however, most of them are designed for the tabular case

Active learning
- consider the Bellman optimality operator (Q-learning-like algorithm)
- learn the optimal Q-function from any sufficiently explorative policy
- if this policy can be controlled, how should actions be chosen in order to speed up learning?
A Kalman-based value function tracker
VFA reformulation
Statistical filtering point of view
- evolution equation: $\theta_i = \theta_{i-1} + v_i$
- observation equation: $r_i = \hat{Q}_{\theta_i}(s_i, a_i) - \gamma \begin{cases} \hat{Q}_{\theta_i}(s_{i+1}, a_{i+1}) & \text{(evaluation)} \\ \max_a \hat{Q}_{\theta_i}(s_{i+1}, a) & \text{(optimality)} \end{cases} + n_i$
Minimized cost function
- conditional mean square error minimization: $\hat{\theta}_i = \arg\min_\theta \mathbb{E}\left[\|\theta - \theta_i\|^2 \mid r_{1:i}\right]$
- restriction to linear estimators: $\hat{\theta}_i = \hat{\theta}_{i-1} + K_i (r_i - \hat{r}_i)$
Kalman equations
- Prediction step
  $\hat{\theta}_{i|i-1} = \hat{\theta}_{i-1|i-1}$
  $P_{i|i-1} = P_{i-1|i-1} + P_{v_{i-1}}$
  $\hat{r}_i = \mathbb{E}\left[\hat{Q}_{\theta_i}(s_i, a_i) - \gamma \hat{Q}_{\theta_i}(s_{i+1}, a_{i+1}) \mid r_{1:i-1}\right]$
- Correction step
  $K_i = P_{\theta r_i} P_{r_i|i-1}^{-1}$
  $\hat{\theta}_{i|i} = \hat{\theta}_{i|i-1} + K_i (r_i - \hat{r}_i)$
  $P_{i|i} = P_{i|i-1} - K_i P_{r_i|i-1} K_i^T$

Problems
- choice of priors $\hat{\theta}_{0|0}$ and $P_{0|0}$ (classical)
- how to compute the statistics of interest $\hat{r}_i$, $P_{\theta r_i}$ and $P_{r_i|i-1}$? This amounts to computing the 1st and 2nd moments of a nonlinearly mapped random variable.
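These moments generally have no closed form because $\hat{Q}_\theta$ is a nonlinear function of $\theta$; a derivative-free approximation such as the unscented transform is one standard way to estimate them (this step is not detailed in the excerpt). Below is a hedged sketch of one full prediction/correction cycle for the optimality (KTD-Q-like) case; all names, the sigma-point construction, and the linear-feature usage example are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def sigma_points(mean, cov, kappa=0.0):
    """One standard unscented-transform construction: 2n+1 points and weights."""
    n = mean.size
    sqrt_cov = np.linalg.cholesky((n + kappa) * cov)   # columns give the spreads
    points = [mean] + [mean + sqrt_cov[:, j] for j in range(n)] \
                    + [mean - sqrt_cov[:, j] for j in range(n)]
    weights = np.array([kappa / (n + kappa)] + [0.5 / (n + kappa)] * (2 * n))
    return np.array(points), weights

def ktd_q_step(theta, P, transition, q_hat, actions, gamma,
               P_evolution, P_obs_noise):
    """One KTD-Q-style prediction/correction step (sketch)."""
    s, a, r, s_next = transition
    # Prediction step: random-walk evolution model theta_i = theta_{i-1} + v_i.
    theta_pred, P_pred = theta.copy(), P + P_evolution
    # Approximate r_hat, P_theta_r and P_r with sigma points.
    pts, w = sigma_points(theta_pred, P_pred)
    r_pts = np.array([q_hat(th, s, a)
                      - gamma * max(q_hat(th, s_next, b) for b in actions)
                      for th in pts])
    r_hat = w @ r_pts
    P_r = w @ (r_pts - r_hat) ** 2 + P_obs_noise
    P_theta_r = (pts - theta_pred).T @ (w * (r_pts - r_hat))
    # Correction step.
    K = P_theta_r / P_r
    theta_new = theta_pred + K * (r - r_hat)
    P_new = P_pred - P_r * np.outer(K, K)              # = P_pred - K P_r K^T
    return theta_new, P_new

# Usage sketch with an (assumed) linear parameterization over 3 features.
phi = lambda s, a: np.array([1.0, s, s * (a + 1)])
q_hat = lambda th, s, a: phi(s, a) @ th
theta, P = np.zeros(3), 10.0 * np.eye(3)               # priors theta_0|0, P_0|0
theta, P = ktd_q_step(theta, P, transition=(0.2, 0, 1.0, 0.4),
                      q_hat=q_hat, actions=[0, 1], gamma=0.95,
                      P_evolution=1e-3 * np.eye(3), P_obs_noise=1.0)
```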
Managing Uncertainty
Computing uncertainty and illustration
[Figure: illustration of the computed uncertainty over a 2D state space (x position vs. y position); color scale from 0 to 0.9.]
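The slides do not spell out how these standard deviations are obtained; one simple way, sketched below under the assumption of a parameterization $\hat{Q}_\theta(s, a) = \phi(s, a)^\top \theta$ that is linear in $\theta$ (names illustrative), is to propagate the parameter covariance $P$ through the features; for nonlinear parameterizations the same sigma-point machinery as in the update step can be reused.

```python
import numpy as np

def q_mean_and_std(theta_hat, P, phi, s, a):
    """Mean and standard deviation of Q_theta(s, a) when Q is linear in theta
    and theta ~ N(theta_hat, P)."""
    f = phi(s, a)
    return f @ theta_hat, np.sqrt(f @ P @ f)
```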
Active learning for KTD-Q
- KTD-Q: off-policy, value-iteration-like algorithm
- how to choose actions in order to speed up learning?
- behavior policy that favors actions whose value estimate is most uncertain:
  $b(\cdot|s_i) = \frac{\hat{\sigma}_{Q_{i|i-1}}(s_i, \cdot)}{\sum_{a \in A} \hat{\sigma}_{Q_{i|i-1}}(s_i, a)}$
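A sketch of sampling from this behavior policy, assuming the per-action standard deviations $\hat{\sigma}_{Q_{i|i-1}}(s_i, \cdot)$ are available (e.g., computed as in the previous sketch); names illustrative.

```python
import numpy as np

def active_behavior_action(sigma_per_action, rng):
    """Sample an action with probability proportional to its predicted std."""
    probs = sigma_per_action / sigma_per_action.sum()
    return int(rng.choice(len(sigma_per_action), p=probs))

rng = np.random.default_rng(0)
a_i = active_behavior_action(np.array([0.3, 1.2, 0.5]), rng)
```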
[Figure: learning curves (steps vs. number of episodes) comparing Q-learning, KTD-Q and active KTD-Q.]
Adapting existing (tabular) approaches
- there exist some schemes for the exploration/exploitation dilemma which use uncertainty information
- most of them are designed for the tabular case (where the uncertainty information is linked to the number of visits to the state)
- we propose to adapt some of them (quite directly)
- ε-greedy (baseline):
  $\pi(a|s_{i+1}) = (1 - \epsilon)\,\delta\!\left(a = \arg\max_{b \in A} \bar{Q}_{i|i-1}(s_{i+1}, b)\right) + \epsilon\,\delta\!\left(a \neq \arg\max_{b \in A} \bar{Q}_{i|i-1}(s_{i+1}, b)\right)$
- confident-greedy:
  $\pi(a|s_{i+1}) = \delta\!\left(a = \arg\max_{b \in A} \left(\bar{Q}_{i|i-1}(s_{i+1}, b) + \alpha\,\hat{\sigma}_{Q_{i|i-1}}(s_{i+1}, b)\right)\right)$
- bonus-greedy:
  $\pi(a|s_{i+1}) = \delta\!\left(a = \arg\max_{b \in A} \left(\bar{Q}_{i|i-1}(s_{i+1}, b) + \beta\,\frac{\hat{\sigma}^2_{Q_{i|i-1}}(s_{i+1}, b)}{\beta_0 + \hat{\sigma}^2_{Q_{i|i-1}}(s_{i+1}, b)}\right)\right)$
- Thompson sampling: draw $\xi \sim \mathcal{N}(\hat{\theta}_{i|i-1}, P_{i|i-1})$, then act greedily with respect to the sampled Q-function:
  $a_{i+1} = \arg\max_{b \in A} \hat{Q}_\xi(s_{i+1}, b)$
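A combined sketch of these four adapted selection rules, taking as inputs the predicted means $\bar{Q}_{i|i-1}(s, \cdot)$ and standard deviations $\hat{\sigma}_{Q_{i|i-1}}(s, \cdot)$ per action and, for Thompson sampling, the predicted parameter distribution; all names are illustrative.

```python
import numpy as np

def epsilon_greedy(q_mean, epsilon, rng):
    # Greedy w.r.t. the predicted mean, with epsilon-probability of a random action.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_mean)))
    return int(np.argmax(q_mean))

def confident_greedy(q_mean, q_std, alpha):
    # Greedy w.r.t. the optimistic bound Q + alpha * sigma.
    return int(np.argmax(q_mean + alpha * q_std))

def bonus_greedy(q_mean, q_std, beta, beta0):
    # Greedy w.r.t. Q plus a saturating variance bonus beta * var / (beta0 + var).
    var = q_std ** 2
    return int(np.argmax(q_mean + beta * var / (beta0 + var)))

def thompson(theta_hat, P, q_hat, state, actions, rng):
    # Sample parameters from N(theta_hat, P), then act greedily w.r.t. the sampled Q.
    xi = rng.multivariate_normal(theta_hat, P)
    return max(actions, key=lambda b: q_hat(xi, state, b))
```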
Conclusion and perspective
- KTD provides uncertainty information (among other things: handling of nonlinearities and nonstationarities)
- straightforwardly adapted tabular approaches work fine on simple problems
- however:
  - there is a need to assess whether the approach is theoretically founded
  - it should be experimented on more complex tasks
  - the choice of parameters (especially the prior variance and the noises) can be difficult; adaptive KTD?