Learning to Play with Intrinsically-Motivated Self-Aware Agents
Nick Haber1,2,3*, Damian Mrowca4*, Stephanie Wang4, Li Fei-Fei4, Daniel L. K. Yamins1,4,5
Departments of Psychology1, Pediatrics2, Biomedical Data Science3, and Computer Science4, and Wu Tsai Neurosciences Institute5, Stanford, CA 94305
Introduction
Learning in a physically realistic embodied environment: agents move around and interact with objects.
- Artificially intelligent robots cannot learn in unstructured environments without supervision.
- Babies, in contrast, learn in unstructured environments through intrinsically motivated, curious play.
Approach: Self-Model vs. World-Model
Curiosity-based learning mechanism:
- The world model (blue) solves a dynamics prediction problem.
- The self-model (red) seeks to predict the world model's loss and is learned simultaneously.
- Actions are chosen to antagonize the world model, leading to novel and surprising events in the environment (black).
- This creates a virtuous cycle in which the agent chooses novel but predictable actions.
- Playful behavior emerges as the agent pushes the boundaries of what its world-model prediction system can achieve.
- As world-modeling capacity improves, what used to be novel becomes old hat, and the cycle repeats.
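In symbols (using the notation ω_φ, Λ_ψ, and L_t from the architecture figure below; a hedged reconstruction of the objective, not necessarily the paper's exact losses):

```latex
% World model \omega_\phi predicts the next state; its prediction error is the loss L_t.
\[
\hat{s}_{t+1} = \omega_\phi(s_t, a_t),
\qquad
L_t = \lVert \hat{s}_{t+1} - s_{t+1} \rVert^2
\]
% Self-model \Lambda_\psi predicts that loss; the policy maximizes the prediction.
\[
a_t = \arg\max_{a} \; \Lambda_\psi(s_t, a)
\;\approx\; \arg\max_{a} \; \mathbb{E}\left[ L_t \mid s_t, a \right]
\]
```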
[Figure: (a) The agent receives states from the environment; the world model ω_φ (blue) predicts dynamics, the self-model Λ_ψ (red) predicts the world model's loss, and the next action is chosen by an argmax over the self-model's predictions. (b) Unrolled in time: ego-motion and object actions α^ego, α^obj1, α^obj2 together with states s_{t-2}, …, s_{t+1} feed the world model, whose losses L_t, L_{t+1}, …, L_{t+T} the self-model forecasts.]
Intrinsically-motivated self-aware architecture.
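A minimal sketch of this loop in PyTorch may help fix ideas. Everything here (network sizes, the two MLPs, random candidate actions, MSE losses) is an illustrative assumption, not the authors' implementation:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_CANDIDATES = 16, 4, 32

# World model: predicts the next state from (state, action).
world_model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
    nn.Linear(64, STATE_DIM))
# Self-model: predicts the world model's loss for (state, action).
self_model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
    nn.Linear(64, 1))
w_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
s_opt = torch.optim.Adam(self_model.parameters(), lr=1e-3)

def choose_action(state):
    """Pick the candidate action whose predicted world-model loss is
    largest -- the 'antagonize the world model' step."""
    candidates = torch.randn(N_CANDIDATES, ACTION_DIM)
    inp = torch.cat([state.expand(N_CANDIDATES, -1), candidates], dim=1)
    with torch.no_grad():
        predicted_loss = self_model(inp).squeeze(1)
    return candidates[predicted_loss.argmax()]

def update(state, action, next_state):
    """Train the world model on dynamics and the self-model on that loss."""
    inp = torch.cat([state, action]).unsqueeze(0)
    w_loss = ((world_model(inp) - next_state.unsqueeze(0)) ** 2).mean()
    w_opt.zero_grad(); w_loss.backward(); w_opt.step()
    s_loss = (self_model(inp).squeeze() - w_loss.detach()) ** 2
    s_opt.zero_grad(); s_loss.backward(); s_opt.step()
```

As the world model improves, the self-model's loss predictions drop for familiar actions, steering the agent toward whatever is still surprising.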
Emergent Behaviors
Emerging behaviors across different models. Different combinations of world-model tasks and policy mechanisms are compared. ID = Inverse Dynamics prediction; RW = Random World model; RP = Random Policy; SP = Self-aware Policy; LF = Latent Future prediction.
A sequence of behaviors emerges: ego-motion prediction → object attention → navigation towards objects → improved object dynamics prediction → multi-object gathering → improvement on task transfers.
[Figure: four panels over training steps (in thousands), comparing ID-SP-40, ID-SP-5, ID-RP, and LF-SP. (1) Ego-motion learning: training loss. (2) Emergence of object attention: agent–object distance. (3) 1-object learning: 1-object frequency and loss. (4) 2-object learning: 2-object frequency and loss.]
Planning
Navigation and planning behavior for 12 consecutive time steps. Red force vectors on the objects depict the predicted actions. Ego-motion self-prediction maps are drawn at the agent's position.
The agent starts off without seeing an object and turns around to explore for one. Once an object is in view, the agent approaches it or turns towards it to keep it in view. A sketch of how the self-model can be unrolled for such planning follows below.
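Reading the unrolled losses L_t, …, L_{t+T} in panel (b) as a score, one hypothetical way to plan is to evaluate candidate action sequences with the self-model and execute the first action of the best one. This reuses the sketch above and is an assumption about the mechanism, not the paper's exact planner:

```python
def plan(state, horizon=12, n_sequences=64):
    """Score random action sequences by the self-model's total predicted
    world-model loss over the horizon, then return the first action of the
    best sequence. Treating the state as fixed over the horizon is a
    simplification made for this sketch."""
    seqs = torch.randn(n_sequences, horizon, ACTION_DIM)
    inp = torch.cat([state.expand(n_sequences, horizon, -1), seqs], dim=2)
    with torch.no_grad():
        scores = self_model(inp).squeeze(2).sum(dim=1)  # sum of L_t..L_{t+T}
    return seqs[scores.argmax(), 0]
```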
Task Transfers
Model comparison on task transfers, including object presence (binary classification), localization (pixel-wise 2D centroid position), and recognition (16-way categorization). Linear estimators were trained on top of the output features of each model. ID = Inverse Dynamics; RW = Random World model; RP = Random Policy; SP = Self-aware Policy; LF = Latent Future prediction.
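This evaluation protocol amounts to a linear probe on frozen features. A hedged sketch with scikit-learn, using synthetic stand-in features (the real inputs would be the trained world model's activations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 64))   # stand-in for frozen model features
labels = rng.integers(0, 16, size=1000)  # e.g. 16-way object recognition

# Fit a linear readout on one split, report accuracy on the other.
probe = LogisticRegression(max_iter=1000).fit(features[:800], labels[:800])
print("transfer accuracy:", probe.score(features[800:], labels[800:]))
```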
[Figure: transfer to other tasks. Bar plots compare models on ego-motion prediction accuracy in % (easy and hard), action prediction accuracy in % (hard), play frequency, recognition accuracy in %, object-presence error in %, and localization error in pixels. Annotation: ID-SP trained with 2 objects.]
Future Work
Next steps include:
- more photorealistic simulation
- more realistic agent embodiment
- curiosity towards the novel but learnable
- animate attention and theory of mind
- comparison to human developmental data
Physically and visually realistic simulation environment.

Acknowledgments
This work was supported by grants from the James S. McDonnell Foundation, Simons Foundation, and Sloan Foundation (DLKY); a Berry Foundation postdoctoral fellowship and Stanford Department of Biomedical Data Science NLM T-15 LM007033-35 (NH); and ONR MURI (Stanford lead) N00014-16-1-2127 and ONR MURI (UCLA lead) 1015 G TA275 (LF).