Transcript of: Generic Reinforcement Learning Agents - Marcus Hutter
Generic Reinforcement
Learning Agents
Marcus Hutter
Canberra, ACT, 0200, Australia
http://www.hutter1.net/
ANU RSISE NICTA
Marcus Hutter - 2 - Generic Reinforcement Learning Agents
Abstract
General purpose intelligent learning agents cycle through
(complex, non-MDP) sequences of observations, actions, and rewards.
On the other hand, reinforcement learning is well-developed for small
finite state Markov Decision Processes (MDPs). It is an art performed
by human designers to extract the right state representation out of the
bare observations, i.e. to reduce the agent setup to the MDP framework.
Before we can think of mechanizing this search for suitable MDPs, we
need a formal objective criterion. The main contribution in these slides
is to develop such a criterion. I also integrate the various parts into one
learning algorithm. Extensions to more realistic dynamic Bayesian
networks are briefly discussed.
Marcus Hutter - 3 - Generic Reinforcement Learning Agents
Contents
• UAI, AIXI, ΦMDP, ... in Perspective
• Agent-Environment Model with Reward
• Universal Artificial Intelligence
• Markov Decision Processes (MDPs)
• Learn Map Φ from Real Problem to MDP
• Optimal Action and Exploration
• Extension to Dynamic Bayesian Networks
• Outlook and Jobs
Marcus Hutter - 4 - Generic Reinforcement Learning Agents
Universal AI in Perspective
What is A(G)I?   Thinking             Acting
humanly          Cognitive Science    Turing Test
rationally       Laws of Thought      Doing the right thing
Difference matters until systems reach self-improvement threshold
• Universal AI: analytically analyzable generic learning systems
• Real world is nasty: partially unobservable, uncertain, unknown,
non-ergodic, reactive, vast but luckily structured, ...
• Dealing properly with uncertainty and learning is crucial.
• Never trust a theory if it is not supported by an experiment, and never trust an experiment if it is not supported by a theory.
Progress is achieved by an interplay between theory and experiment !
Marcus Hutter - 5 - Generic Reinforcement Learning Agents
ΦMDP in Perspective
[Diagram: Universal AI (AIXI) at the top, specialized to ΦMDP / ΦDBN / .?., which rests on the four pillars Information, Learning, Planning, and Complexity, on the base Search – Optimization – Computation – Logic – KR.]
Agents = General Framework, Interface = Robots, Vision, Language
Marcus Hutter - 6 - Generic Reinforcement Learning Agents
ΦMDP Overview in 1 Slide
Goal: Develop efficient general purpose intelligent agent.
State-of-the-art: (a) AIXI: Incomputable theoretical solution.
(b) MDP: Efficient limited problem class.
(c) POMDP: Notoriously difficult. (d) PSRs: Underdeveloped.
Idea: ΦMDP reduces real problem to MDP automatically by learning.
Accomplishments so far: (i) Criterion for evaluating quality of reduction.
(ii) Integration of the various parts into one learning algorithm.
(iii) Generalization to structured MDPs (DBNs)
ΦMDP is a promising path towards the grand goal and an alternative to (a)-(d).
Problem: Find reduction Φ efficiently (generic optimization problem?)
Marcus Hutter - 7 - Generic Reinforcement Learning Agents
Agent Model with Reward
Framework for all AI problems!
Is there a universal solution?
r1 | o1 r2 | o2 r3 | o3 r4 | o4 r5 | o5 r6 | o6 ...
a1 a2 a3 a4 a5 a6 ...
[Diagram: Agent (with work tape) and Environment (with work tape) in a loop: the agent outputs actions a_t, the environment returns rewards r_t and observations o_t.]
Marcus Hutter - 8 - Generic Reinforcement Learning Agents
Types of Environments / Problems
All fit into the general agent setup, but few are MDPs.
sequential (prediction) ⇔ i.i.d. (classification/regression)
supervised ⇔ unsupervised ⇔ reinforcement learning
known environment ⇔ unknown environment
planning ⇔ learning
exploitation ⇔ exploration
passive prediction ⇔ active learning
Fully Observable MDP ⇔ Partially Observable MDP
Unstructured (MDP) ⇔ Structured (DBN)
Competitive (Multi-Agents) ⇔ Stochastic Env (Single Agent)
Games ⇔ Optimization
Marcus Hutter - 9 - Generic Reinforcement Learning Agents
Universal Artificial Intelligence
Key idea: Optimal action/plan/policy based on the simplest world model consistent with history. Formally ...
AIXI:  a_k := arg max_{a_k} ∑_{o_k r_k} ... max_{a_m} ∑_{o_m r_m} [r_k + ... + r_m] ∑_{p : U(p,a_1..a_m)=o_1 r_1..o_m r_m} 2^{−ℓ(p)}

(a: action, r: reward, o: observation, U: universal TM, p: environment program, k: now)
AIXI is an elegant, complete, essentially unique, and limit-computable mathematical theory of AI.
Claim: AIXI is the most intelligent environment-independent, i.e. universally optimal, agent possible.
Proof: For formalizations, quantifications, and proofs see ⇒
Problem: Computationally intractable.
Achievement: Well-defines AI. Gold standard to aim at. Inspired practical algorithms. Cf. infeasible exact minimax.
Marcus Hutter - 10 - Generic Reinforcement Learning Agents
Markov Decision Processes (MDPs)
a computationally tractable class of problems
• MDP Assumption: State st := ot and rt are
probabilistic functions of ot−1 and at−1 only.
Example MDP: [diagram of four states s1–s4 with rewards r1–r4 and transitions between them]
• Further Assumption:
State = observation space S is finite and small.
• Goal: Maximize long-term expected reward.
• Learning: Probability distribution is unknown but can be learned.
• Exploration: Optimal exploration is intractable
but there are polynomial approximations.
• Problem: Real problems are not of this simple form.
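The Bellman-equation planning step for a known finite MDP can be sketched with standard value iteration (the toy 2-state MDP below is my own illustration, not from the slides):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, eps=1e-9):
    """T[s, a, s'] = transition probability, R[s, a] = expected reward.
    Iterates the Bellman optimality equation to a fixed point."""
    S, A, _ = T.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Toy 2-state MDP: action 1 always leads to state 1; state 1 pays reward 1.
T = np.array([[[1., 0.], [0., 1.]],    # from state 0: a=0 stays, a=1 -> state 1
              [[1., 0.], [0., 1.]]])   # from state 1: a=0 -> state 0, a=1 stays
R = np.array([[0., 0.],                # rewards for leaving state 0
              [1., 1.]])               # rewards for leaving state 1
V, pi = value_iteration(T, R)          # V ≈ [9, 10], pi = [1, 1]
```

At the fixed point V(1) = 1/(1−γ) = 10 and V(0) = γ·V(1) = 9, so the greedy policy always moves towards the rewarding state.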
Marcus Hutter - 11 - Generic Reinforcement Learning Agents
Map Real Problem to MDP
Map history h_t := o_1 a_1 r_1 ... o_{t−1} to state s_t := Φ(h_t), for example:
Games: Full-information with static opponent: Φ(ht) = ot.
Classical physics: Position+velocity of objects = position at two
time-slices: st = Φ(ht) = otot−1 is (2nd order) Markov.
I.i.d. processes of unknown probability (e.g. clinical trials ≈ Bandits):
the frequency of observations Φ(h_n) = (∑_{t=1}^{n} δ_{o_t o})_{o∈O} is a sufficient statistic.
Identity: Φ(h) = h is always sufficient, but not learnable.
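As an illustration, the feature maps above can be written as plain functions over an assumed history encoding (lists of (o, a, r) triples; all names here are mine):

```python
from collections import Counter

# History: list of (observation, action, reward) triples, oldest first.
def phi_last_obs(h):              # games with static opponent: Phi(h) = o_t
    return h[-1][0]

def phi_second_order(h):          # classical physics: last two observations
    return (h[-1][0], h[-2][0]) if len(h) > 1 else (h[-1][0], None)

def phi_freq(h):                  # i.i.d./bandits: observation counts suffice
    return tuple(sorted(Counter(o for o, a, r in h).items()))

def phi_identity(h):              # always sufficient, but not learnable
    return tuple(h)

h = [(1, 0, 0), (0, 1, 2), (1, 0, 1)]
```

Each map throws away a different amount of history; the Cost criterion below is what decides how much may be thrown away.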
Find/Learn Map Automatically
Φ_best := arg min_Φ Cost(Φ|h_t)
• What is the best map/MDP? (i.e. what is the right Cost criterion?)
• Is the best MDP good enough? (i.e. is reduction always possible?)
• How to find the map Φ (i.e. minimize Cost) efficiently?
Marcus Hutter - 12 - Generic Reinforcement Learning Agents
Minimum Description Length Principle
suitable for large semi-/non-parametric model classes
MDL Principle: The best model P̂ from a class of models P for data D minimizes the 2-part code length of the data D given probability P plus the code length CL of the model:
P̂ = arg min_{P∈P} {|log P(D)| + CL(P)}
Shortest Code for I.I.D. Data
D ≡ x_{1:n} ≡ x_1...x_n ∈ {1,..,d}^n i.i.d., i.e. P_θ(x_{1:n}) = θ_{x_1}·...·θ_{x_n} = ∏_{i=1}^{d} θ_i^{n_i}
Each θ_i needs to be coded to precision O(1/√n) ⇒ CL(θ) = (d/2) log n.
⇒ θ̂ = arg min_θ {|log P_θ(x_{1:n})| + CL(θ)} = ... = n_•/n
If P = {all P} then still P̂ = P_θ̂ with high probability.
CL(x_{1:n}) = CL(n_•) := n·H(n_•/n) + (d/2) log n
is the shortest code length for i.i.d. data x_{1:n}, where H(θ̂) := −∑_{i=1}^{d} θ̂_i log θ̂_i is the entropy of θ̂.
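A quick numerical sanity check of the two-part code length CL(x_{1:n}) = n·H(n_•/n) + (d/2) log n (a sketch; base-2 logarithms are my assumption):

```python
import math
from collections import Counter

def two_part_cl(data, d):
    """Two-part MDL code length for i.i.d. data over d symbols:
    n * H(theta_hat) + (d/2) * log2(n), with theta_hat_i = n_i / n."""
    n = len(data)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
    return n * entropy + (d / 2) * math.log2(n)

# Fair-coin data: about 1 bit per symbol plus the model cost (d/2) log n.
data = [0, 1] * 500                 # n = 1000, theta_hat = (0.5, 0.5)
bits = two_part_cl(data, d=2)       # 1000 * 1 + log2(1000) ≈ 1010 bits
```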
Marcus Hutter - 13 - Generic Reinforcement Learning Agents
MDP Coding
• Definition: nar′ss′ := #{t ≤ n : st−1 = s, at−1 = a, st = s′, rt = r′}
= number of times at which action a in state s resulted in
state s′ and reward r′ (sa→ s′r′).
• Notation: n_• = (n_1, n_2, ...) = vector, n_+ = n_1 + n_2 + ... = sum.
• Code length of s_1:n: CL(s_1:n|a_1:n) = ∑_{s,a} CL(n^{a+}_{s•}).
• Code length of r_1:n given s_1:n: CL(r_1:n|s_1:n, a_1:n) = ∑_{s'} CL(n^{+•}_{+s'}).
Code is optimal (w.h.p.) iff s_1:n and r_1:n are samples from an MDP.
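The two code lengths combine into Cost(Φ|h_n) directly from the transition and reward counts; a minimal sketch (the count bookkeeping, base-2 logs, and CL of an empty count vector being 0 are my assumptions):

```python
import math
from collections import defaultdict

def cl(counts):
    """MDL code length of a count vector: n*H(n_i/n) + (d/2)*log2(n)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
    return n * h + (len(counts) / 2) * math.log2(n)

def cost(states, actions, rewards):
    """Cost(Phi|h_n) = CL(s_1:n | a_1:n) + CL(r_1:n | s_1:n, a_1:n)."""
    trans = defaultdict(lambda: defaultdict(int))  # (s, a) -> successor counts
    rew = defaultdict(lambda: defaultdict(int))    # s' -> reward counts
    for t in range(1, len(states)):
        trans[(states[t - 1], actions[t - 1])][states[t]] += 1
        rew[states[t]][rewards[t]] += 1
    return (sum(cl(list(c.values())) for c in trans.values())
            + sum(cl(list(c.values())) for c in rew.values()))

# Deterministic alternating chain: both code parts reduce to parameter costs.
states = [0, 1] * 50
actions = [0] * 100
rewards = states[:]                  # reward equals the current state
total = cost(states, actions, rewards)
```

For this deterministic sequence all entropies vanish, so the cost is pure parameter cost of order log n.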
Marcus Hutter - 14 - Generic Reinforcement Learning Agents
ΦMDP Cost Criterion: Reward ↔ State Trade-Off
• Small state space S has short CL(s1:n|a1:n) but obscures
structure of reward sequence ⇒ CL(r1:n|s1:na1:n) large.
• Large |S| usually makes predicting=compressing r1:n easier,
but a large model is hard to learn, i.e. the code for s1:n will be large
Cost(Φ|h_n) := CL(s_1:n|a_1:n) + CL(r_1:n|s_1:n, a_1:n)
is minimized for the Φ that keeps all and only the information relevant for predicting rewards.
• Recall st := Φ(ht) and ht := a1o1r1...ot.
Marcus Hutter - 15 - Generic Reinforcement Learning Agents
A Tiny Example
Observations o_t ∈ {0, 1} are i.i.d., and the reward is r_t = 2o_{t−1} + o_t.
Considered features: Φk(ht) = ot−k+1...ot = last k obs. (S = {0, 1}k)
Φ0-MDP: one state ε with self-loop, r = 0|1|2|3.
  Cost(Φ0|h) = 2n + (3/2) log n
Φ1-MDP: two states 0 (r = 0|2) and 1 (r = 1|3), transitions between them.
  Cost(Φ1|h) = 2n + 2 log(n/2)
Φ2-MDP: four states 00 (r = 0), 01 (r = 1), 10 (r = 2), 11 (r = 3).
  Cost(Φ2|h) = n + 2 log(n/4)
In general, Cost(Φk|h) = n + 2^(k−1) log(n/2^k) for k ≥ 2. Cost is minimal for k = 2 (if n ≫ 2^k).
Intuition: Φ2 is smallest exact MDP ⇒ best observation summary.
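The claim that k = 2 minimizes Cost can be checked by evaluating the asymptotic formulas from this slide (a sketch; base-2 logs are my choice and do not affect where the minimum lies):

```python
import math

def cost_phi(k, n):
    """Asymptotic Cost(Phi_k|h) from the slides for r_t = 2*o_{t-1} + o_t."""
    if k == 0:
        return 2 * n + 1.5 * math.log2(n)
    if k == 1:
        return 2 * n + 2 * math.log2(n / 2)
    return n + 2 ** (k - 1) * math.log2(n / 2 ** k)

n = 10_000
costs = {k: cost_phi(k, n) for k in range(6)}
best = min(costs, key=costs.get)     # k = 2 wins for n >> 2^k
```

For k < 2 the reward sequence stays partly unpredictable (the 2n term), while for k > 2 the extra states only add parameter cost.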
Marcus Hutter - 16 - Generic Reinforcement Learning Agents
Cost(Φ) Minimization
• Minimize Cost(Φ|h) by search: random, blind, informed, adaptive,
local, global, population based, exhaustive, heuristic, other search.
• Most algs require a neighborhood relation between candidate Φ.
• Φ is equivalent to a partitioning of (O ×A×R)∗.
• Example partitioners: Decision trees/lists/grids/etc.
• Example neighborhood: Subdivide=split or merge partitions.
Stochastic Φ-Search (Monte Carlo)
• Randomly choose a neighbor Φ′ of Φ (by splitting or merging states)
• Replace Φ by Φ′ for sure if Cost gets smaller or with some small
probability if Cost gets larger. Repeat.
Marcus Hutter - 17 - Generic Reinforcement Learning Agents
Context Tree Example
State Space S = History Suffix Tree

ΦImprove(Φ_S, h_n)
  Randomly choose a state s ∈ S;
  Let p and q be uniform random numbers in [0, 1];
  if (p > 1/2) then split s, i.e. S' = S \ {s} ∪ {os : o ∈ O}
  else if {os : o ∈ O} ⊆ S then merge them, i.e. S' = S \ {os : o ∈ O} ∪ {s};
  if (Cost(Φ_S|h_n) − Cost(Φ_S'|h_n) > log(q)) then S := S';
  return (Φ_S);
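A possible Python transcription of ΦImprove (hedged: I represent states as observation-suffix strings, interpret the merge step as merging s with its siblings into their common parent, and leave Cost abstract as a supplied function):

```python
import math
import random

def phi_improve(S, cost, observations=('0', '1')):
    """One stochastic split/merge step on a suffix-tree state set S.
    `cost` maps a state set to Cost(Phi_S | h_n); smaller is better."""
    s = random.choice(sorted(S))
    p, q = random.random(), random.random()
    if p > 0.5:
        # Split s into its children {o+s : o in O}.
        S_new = (S - {s}) | {o + s for o in observations}
    else:
        # Merge s with its siblings into their common parent s[1:].
        parent = s[1:]
        siblings = {o + parent for o in observations}
        if not s or not siblings <= S:
            return S
        S_new = (S - siblings) | {parent}
    # Metropolis acceptance: improvements always, worse sets with small prob.
    if cost(S) - cost(S_new) > math.log(q):
        return S_new
    return S

random.seed(0)
S = {'0', '1'}
for _ in range(200):
    S = phi_improve(S, cost=len)     # len as a stand-in cost: prefer few states
```

Since log(q) ≤ 0, a cost improvement is always accepted, while a cost increase of ΔC is accepted with probability e^(−ΔC), which lets the search escape local minima.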
Marcus Hutter - 18 - Generic Reinforcement Learning Agents
Optimal Action
• Let Φ̂ be a good estimate of Φbest.
⇒ Compressed history: s1a1r1...snanrn ≈ MDP sequence.
• For a finite MDP with known transition probabilities,
optimal action an+1 follows from Bellman equations.
• Use simple frequency estimate of transition probability
and reward function ⇒ Infamous problem ...
Exploration & Exploitation
• Polynomially optimal solutions: Rmax, E3, OIM [KS98,SL08].
• Main idea: Motivate agent to explore by pretending
high-reward for unexplored state-action pairs.
• Now compute the agent’s action based on modified rewards.
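The reward-modification idea can be sketched as follows (an Rmax-flavoured illustration; the visit threshold m and the array layout are my assumptions):

```python
import numpy as np

def optimistic_rewards(R_hat, visit_counts, r_max, m=5):
    """Rmax-style modification: any (state, action) pair visited fewer than
    m times gets the maximal reward r_max, so the planner is driven to it."""
    R_e = R_hat.copy()
    R_e[visit_counts < m] = r_max
    return R_e

R_hat = np.array([[0.1, 0.2],        # estimated rewards, shape (|S|, |A|)
                  [0.0, 0.5]])
counts = np.array([[10, 1],
                   [7, 0]])          # (s, a) visit counts
R_e = optimistic_rewards(R_hat, counts, r_max=1.0)
```

Planning (e.g. value iteration) on R_e instead of R_hat then makes under-explored pairs look maximally attractive until they have been tried m times.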
Marcus Hutter - 19 - Generic Reinforcement Learning Agents
Computational Flow
[Diagram: Environment sends reward r and observation o into History h. Cost(Φ|h) minimization yields the Feature Vector Φ̂; frequency estimation yields Transition Probabilities T̂ and Reward estimates R̂; adding the exploration bonus gives T̂^e, R̂^e; the Bellman equations yield the (Q̂-)V̂alue, which implicitly defines the Best Policy p̂, whose action a feeds back into the Environment.]
Marcus Hutter - 20 - Generic Reinforcement Learning Agents
Overall Algorithm for ΦMDP Agent
ΦMDP-Agent(A, R)
  Initialize Φ ≡ ε; S = {ε}; h_0 = a_0 = r_0 = ε;
  for (n = 0, 1, 2, 3, ...)
    Choose e.g. γ = 1 − 1/(n+1);
    Set R^e_max = Polynomial((1−γ)^(−1), |S × A|) · max R;
    While waiting for o_{n+1} { Φ := ΦImprove(Φ, h_n) };
    Observe o_{n+1}; h_{n+1} = h_n a_n r_n o_{n+1};
    s_{n+1} := Φ(h_{n+1}); S := S ∪ {s_{n+1}};
    Compute action a_{n+1} from Bellman Eq. with explore bonus;
    Output action a_{n+1};
    Observe reward r_{n+1};
Marcus Hutter - 21 - Generic Reinforcement Learning Agents
Extension to Dynamic Bayesian Networks
• Unstructured MDPs are only suitable for relatively small problems.
• Dynamic Bayesian Networks = Structured MDPs for large problems.
• Φ(h) is now vector of (loosely coupled binary) features=nodes.
• Assign global reward to local nodes by linear regression.
• Cost(Φ, Structure|h) = sum of local node Costs.
• Learn optimal DBN structure in pseudo-polynomial time.
• Search for approximation Φ̂ of Φbest as before.
Neighborhood = adding/removing features.
• Use local linear value function approximation.
• Optimal action/policy by combining [KK99,KP00,SDL07,SL09].
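The "assign global reward to local nodes by linear regression" step above is ordinary least squares on the binary feature vectors; a minimal sketch with illustrative data (the feature matrix and the true local weights (2, 1, 1) are my choice):

```python
import numpy as np

# Rows: binary feature vectors Phi(h_t); targets: observed global rewards r_t.
# Least squares finds local weights w with r ≈ Phi @ w, i.e. the global
# reward decomposed into per-node contributions.
Phi = np.array([[1., 0., 1.],
                [0., 1., 1.],
                [1., 1., 0.],
                [0., 0., 1.]])
r = Phi @ np.array([2., 1., 1.])     # rewards generated from true weights
w, *_ = np.linalg.lstsq(Phi, r, rcond=None)
reconstructed = Phi @ w              # per-step reward reconstruction
```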
Marcus Hutter - 22 - Generic Reinforcement Learning Agents
Incremental Updates
Incremental updates/computations can significantly speedup many
computations.
• Cost(Φ|h) can be computed incrementally when updating Φ or h.
• Local rewards can be updated in linear time.
• Old value estimate can be used as initialization for next cycle.
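Since Cost is a sum of per-count-vector code lengths, one new transition changes only one summand; a sketch of this incremental bookkeeping (my own illustration, not the slides' implementation):

```python
import math

def cl(counts):
    """Code length n*H + (d/2)*log2(n) of a count dict (0 when empty)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * h + (len(counts) / 2) * math.log2(n)

class IncrementalCost:
    """Total CL over many count vectors; each new transition touches exactly
    one vector, so an update costs O(d) instead of recoding everything."""
    def __init__(self):
        self.tables = {}             # key (e.g. (s, a)) -> outcome counts
        self.total = 0.0

    def update(self, key, outcome):
        counts = self.tables.setdefault(key, {})
        self.total -= cl(counts)     # subtract the old summand ...
        counts[outcome] = counts.get(outcome, 0) + 1
        self.total += cl(counts)     # ... and add the updated one
        return self.total

ic = IncrementalCost()
for o in (0, 1, 0, 1):
    total = ic.update(('s', 'a'), o)
```

The running total always equals a from-scratch recomputation, but each step only re-codes the single count vector that changed.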
Marcus Hutter - 23 - Generic Reinforcement Learning Agents
Outlook
• Real-valued observations suggest real-valued features.
• Extra edges in DBN can improve the linear value function
approximation. Needs right incentives in Cost criterion.
• Extend ΦDBN to large structured action spaces.
• Improve various polynomial-time algs. to linear-time algorithms.
• Optimal learning & exploration in DBNs requires a lot of assumptions.
• Developing smart Φ generation and smart stochastic
search algorithms for Φ are the major open challenges.
Marcus Hutter - 24 - Generic Reinforcement Learning Agents
Conclusion
Goal: Develop efficient general purpose intelligent agent.
State-of-the-art: (a) AIXI: Incomputable theoretical solution.
(b) MDP: Efficient limited problem class.
(c) POMDP: Notoriously difficult. (d) PSRs: Underdeveloped.
Idea: ΦMDP reduces real problem to MDP automatically by learning.
Accomplishments so far: (i) Criterion for evaluating quality of reduction.
(ii) Integration of the various parts into one learning algorithm.
(iii) Generalization to structured MDPs (DBNs)
ΦMDP is a promising path towards the grand goal and an alternative to (a)-(d).
Problem: Find reduction Φ efficiently (generic optimization problem?)
Marcus Hutter - 25 - Generic Reinforcement Learning Agents
Connection to AI SubFields
• Agents: ΦMDP is a reinforcement learning (single) agent.
• Search & Optimization: Minimizing Cost(Φ,Structure|h) is a well-defined but hard (non-continuous, non-convex) optimization problem.
• Planning: More sophisticated planners needed for large DBNs.
• Information Theory: Needed for analyzing&improving Cost criterion.
• Learning: So far mainly reinforcement learning, but others relevant.
• Logic/Reasoning: For agents that reason, rule-based logical recursive partitions of the domain (O×A×R)∗ are predestined.
• Knowledge Representation (KR): Searching for Φbest is actually a search for the best KR. Restrict the search space to reasonable KR Φ.
• Complexity Theory: We need polynomial-time and ultimately linear-time approximation algorithms for all building blocks.
• Application dependent interfaces: Robotics, Vision, Language.
Marcus Hutter - 26 - Generic Reinforcement Learning Agents
Thanks! Questions? Details:
• M.H., Feature Markov Decision Processes. (AGI 2009)
• M.H., Feature Dynamic Bayesian Networks. (AGI 2009)
• Human Knowledge Compression Contest: (50'000 €)
• M. Hutter, Universal Artificial Intelligence:
Sequential Decisions based on Algorithmic Probability.
EATCS, Springer, 300 pages, 2005.
http://www.idsia.ch/~marcus/ai/uaibook.htm
Decision Theory = Probability + Utility Theory
+
Universal Induction = Ockham + Bayes + Turing
=
A Unified View of Artificial Intelligence