Maximizing Information Gain via Prediction Reward
Yash Satsangi, Sungsu Lim, Shimon Whiteson, Frans Oliehoek, Martha White
Active perception
• Sensor selection
• Visual attention
The ability to take actions to reduce uncertainty
Active perception as an RL task
Reward: $\rho(b) = -H(b)$
Explicit belief inference?
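To make the belief-based reward concrete, here is a minimal sketch of what "explicit belief inference" involves: a Bayes update of a discrete belief, followed by the negative-entropy reward. The function names and the likelihood values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def belief_update(belief, likelihood):
    """Bayes rule for a discrete belief: b'(s) is proportional to P(o | s) * b(s)."""
    posterior = np.asarray(belief, dtype=float) * np.asarray(likelihood, dtype=float)
    return posterior / posterior.sum()

def negative_entropy(belief):
    """Belief-based reward rho(b) = -H(b) = sum_s b(s) * log b(s)."""
    b = np.asarray(belief, dtype=float)
    nonzero = b[b > 0]  # treat 0 * log 0 as 0
    return float(np.sum(nonzero * np.log(nonzero)))

# Hypothetical example: uniform prior over 4 states, one informative observation.
prior = np.array([0.25, 0.25, 0.25, 0.25])
likelihood = np.array([0.9, 0.1, 0.1, 0.1])  # assumed P(o | s), for illustration only
posterior = belief_update(prior, likelihood)

print("reward before observation:", negative_entropy(prior))
print("reward after observation: ", negative_entropy(posterior))
```

The reward rises as the belief sharpens, but computing it requires the observation model and a belief update at every step, which is the cost the slide's question points at.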
This paper
• Main theoretical result
  • How can we design state-based reward functions that approximate information gain?
  • Prediction rewards are a linear approximation to negative entropy.
• Deep Anticipatory Networks (DAN)
  • A deep RL algorithm that trains two deep neural networks simultaneously on each other's feedback.
  • Useful when the reward is a convex function of the agent's belief.
• Experiments
  • Sensor selection with DAN
  • Visual attention with DAN
Prediction reward
A connection between prediction reward and information gain
Prediction reward: reward the agent for making accurate predictions.
Expected prediction reward
• $r'$ for a correct prediction
• $r''$ otherwise
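As a worked illustration of the expected prediction reward, the sketch below assumes a discrete belief b over states and a prediction that is "correct" when it equals the true state. Under that assumption, predicting state s earns b(s) * r' + (1 - b(s)) * r'' in expectation. The function names and default reward values are hypothetical.

```python
import numpy as np

def expected_prediction_reward(belief, prediction, r_correct=1.0, r_wrong=0.0):
    """Expected reward for predicting `prediction` when the true state is drawn from `belief`.

    r_correct plays the role of r' and r_wrong the role of r'' (assumed values).
    """
    p_correct = float(np.asarray(belief, dtype=float)[prediction])
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong

def best_expected_prediction_reward(belief, r_correct=1.0, r_wrong=0.0):
    """Expected reward of the best single prediction under `belief`."""
    return max(
        expected_prediction_reward(belief, s, r_correct, r_wrong)
        for s in range(len(belief))
    )

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]
print(best_expected_prediction_reward(uniform))
print(best_expected_prediction_reward(peaked))
```

For a fixed prediction the expected reward is linear in the belief, and the best achievable value grows as the belief becomes more certain.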
Main result
A connection between prediction reward and information gain
$\rho(b) = -H(b)$
$\rho'(b)$
expected prediction reward
constant term
$\epsilon_1$
$\epsilon_2$
Main consequences
A connection between prediction reward and information gain
Information gain can be estimated using samples
Question answering
Visual attention
Intrinsic motivation
This paper
Active sensing
Active perception
Sensor selection
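The "estimate using samples" consequence can be sketched as a Monte Carlo average: given pairs of (true state, prediction), the mean prediction reward estimates the expected prediction reward without computing a belief. The helper name, the belief used to generate data, and the sample size below are illustrative assumptions.

```python
import numpy as np

def sample_prediction_reward(true_states, predictions, r_correct=1.0, r_wrong=0.0):
    """Average prediction reward over sampled (true state, prediction) pairs."""
    true_states = np.asarray(true_states)
    predictions = np.asarray(predictions)
    rewards = np.where(true_states == predictions, r_correct, r_wrong)
    return float(rewards.mean())

# Hypothetical data: the belief is used only to simulate samples,
# the estimator itself never sees it.
rng = np.random.default_rng(0)
belief = np.array([0.7, 0.2, 0.1])
states = rng.choice(3, size=20000, p=belief)
predictions = np.zeros_like(states)  # always predict state 0

estimate = sample_prediction_reward(states, predictions)
print("sample estimate:", estimate, "exact value:", belief[0])
```

The estimator only needs to know whether each prediction was correct, which is available during training whenever the true state is observed after the fact.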
DAN: Deep Anticipatory Networks
Train the Q and M networks simultaneously
The Q agent is rewarded if the M agent predicts the unknown variable correctly.
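The interplay between the two learners can be sketched with a small tabular stand-in. This is not the deep-network algorithm from the talk: both "networks" are replaced by lookup tables, and the one-step environment (two sensors, one informative and one pure noise) is a hypothetical toy. It only illustrates the feedback loop in which Q chooses what M observes and M's prediction accuracy becomes Q's reward.

```python
import numpy as np

def train_toy_dan(episodes=3000, n_states=4, epsilon=0.2, lr=0.05, seed=0):
    """Tabular toy: Q picks a sensor, M predicts the hidden state from the observation."""
    rng = np.random.default_rng(seed)
    n_sensors = 2
    q_values = np.zeros(n_sensors)                        # stand-in for the Q network
    m_counts = np.ones((n_sensors, n_states, n_states))   # stand-in for the M network

    for _ in range(episodes):
        state = int(rng.integers(n_states))               # hidden variable to predict

        # Q agent: epsilon-greedy choice of sensor.
        if rng.random() < epsilon:
            sensor = int(rng.integers(n_sensors))
        else:
            sensor = int(np.argmax(q_values))

        # Assumed sensors: sensor 0 reveals the state, sensor 1 returns noise.
        obs = state if sensor == 0 else int(rng.integers(n_states))

        # M agent: predict the most frequent state seen with this (sensor, obs).
        prediction = int(np.argmax(m_counts[sensor, obs]))

        # Q is rewarded only if M predicted correctly.
        reward = 1.0 if prediction == state else 0.0
        q_values[sensor] += lr * (reward - q_values[sensor])

        # M is trained on the data that Q's choice produced.
        m_counts[sensor, obs, state] += 1

    return q_values, m_counts

q_values, m_counts = train_toy_dan()
print("learned sensor values:", q_values)
```

In this toy the Q table ends up preferring the informative sensor because that is the one from which M can learn to predict well.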
Experiments
Sensor selection
Baselines
• Coverage
• Random
• Coverage + DAN
• Shared representations
At each time step:
• The agent must select 1 out of 10 sensors to process observations from.
• The agent is rewarded for correctly predicting the <x-y> position of a person.
[Figure: Correct Predictions in Multi-person Tracking. x-axis: Num. Tracked People; y-axis: Correct Predictions per Episode; curves: DAN + Coverage, Coverage, Random Policy, DAN shared, DAN]
Experiments
Visual attention
[Figure: Test Curve in Terminal Reward Setting. x-axis: Training Episodes; y-axis: Total reward in an episode (out of 1); curves: Fashion-MNIST terminal-reward, MNIST terminal-reward, MNIST DAN, Fashion-MNIST DAN]
[Figure: Test Curve in Continuous Reward Setting. x-axis: Training Episodes; y-axis: Total reward in an episode (out of 12); curves: Fashion-MNIST terminal-reward, MNIST terminal-reward, MNIST DAN, Fashion-MNIST DAN]
Thank you!
Contact: [email protected]
yashsatsangi.com
Summary
• Connection between prediction rewards and information gain.
• Compute information gain estimates without explicit belief inference.