
Maximizing Information Gain via Prediction Reward
Yash Satsangi, Sungsu Lim, Shimon Whiteson, Frans Oliehoek, Martha White

Transcript of "Maximizing Information Gain via Prediction Reward"

Page 1:

Maximizing Information Gain via Prediction Reward

Yash Satsangi, Sungsu Lim, Shimon Whiteson, Frans Oliehoek, Martha White

Page 2:

Active perception

Sensor selection • Visual attention

The ability to take actions to reduce uncertainty

Page 3:

Active perception as an RL task

Reward: ρ(b) = -H(b)

Page 4:

Active perception as an RL task

Reward: ρ(b) = -H(b)

Explicit belief inference?
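For concreteness, here is a minimal sketch (mine, not the paper's code) of the negative-entropy reward ρ(b) = -H(b) for a discrete belief vector. Note that it needs the belief b explicitly, which is exactly the dependence on belief inference this slide is questioning.

```python
import numpy as np

def negative_entropy_reward(belief, eps=1e-12):
    """rho(b) = -H(b) for a discrete belief vector b (entries sum to 1)."""
    b = np.asarray(belief, dtype=float)
    return float(np.sum(b * np.log(b + eps)))  # -H(b) = sum_s b(s) log b(s)

# A peaked belief earns a higher (less negative) reward than a uniform one.
print(negative_entropy_reward([0.97, 0.01, 0.01, 0.01]))  # ≈ -0.17
print(negative_entropy_reward([0.25, 0.25, 0.25, 0.25]))  # ≈ -1.39
```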

Page 5:

This paper

• Main theoretical result

• How can we design state-based reward functions that approximate information gain?

• Prediction rewards are a linear approximation to entropy.

• Deep Anticipatory Networks (DAN)

• A deep RL algorithm that trains two deep neural networks simultaneously on each other’s feedback.

• Useful when reward is a convex function of the belief of the agent.

• Experiments

• Sensor selection with DAN

• Visual attention with DAN

Page 6:

Prediction reward
A connection between prediction reward and information gain

Prediction reward: reward the agent for making an accurate prediction.

Expected prediction reward: r' for a correct prediction, r'' otherwise.
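As an illustration (my sketch, under the assumption that the agent predicts the most likely state under its belief), the expected prediction reward collapses to r'' + (r' - r'') · max_s b(s):

```python
import numpy as np

def expected_prediction_reward(belief, r_correct=1.0, r_wrong=0.0):
    """Expected prediction reward under belief b, assuming the agent
    predicts the most likely state: r'' + (r' - r'') * max_s b(s)."""
    b = np.asarray(belief, dtype=float)
    return r_wrong + (r_correct - r_wrong) * float(b.max())

# Peaked beliefs yield a higher expected prediction reward, mirroring -H(b).
print(expected_prediction_reward([0.97, 0.01, 0.01, 0.01]))  # 0.97
print(expected_prediction_reward([0.25, 0.25, 0.25, 0.25]))  # 0.25
```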

Page 7:

Main result
A connection between prediction reward and information gain

ρ(b) = -H(b) is related to ρ'(b), the expected prediction reward, up to a constant term and error terms ε1 and ε2.

Page 8:

Main consequences
A connection between prediction reward and information gain

• Can estimate information gain using samples (a sketch follows the figure note below).

[Figure: this paper in the context of question answering, visual attention, intrinsic motivation, active sensing, active perception, and sensor selection]
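A rough sketch of the "estimate using samples" point: if the true hidden state can be sampled during training and we have a predictor, averaging the prediction reward estimates the expected prediction reward without ever constructing the belief b. The sample_state and predict callables below are hypothetical stand-ins, not the paper's interface.

```python
import random

def monte_carlo_prediction_reward(sample_state, predict, n=1000,
                                  r_correct=1.0, r_wrong=0.0):
    """Estimate the expected prediction reward from samples of the true state,
    with no explicit belief vector required."""
    total = 0.0
    for _ in range(n):
        s = sample_state()          # sample the hidden state (hypothetical)
        s_hat = predict()           # predictor's guess (hypothetical)
        total += r_correct if s_hat == s else r_wrong
    return total / n

# Toy usage: states drawn from a skewed distribution, predictor always says 0.
states, probs = [0, 1, 2, 3], [0.7, 0.1, 0.1, 0.1]
estimate = monte_carlo_prediction_reward(
    lambda: random.choices(states, probs)[0], lambda: 0)
print(estimate)  # ≈ 0.7, matching max_s b(s) for this belief
```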

Page 9:

DAN: Deep Anticipatory Networks
Train the Q and M networks simultaneously.

The Q agent is rewarded if the M agent predicts the unknown variable correctly.
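A minimal, self-contained sketch of this loop as I read it from the slide: the Q network chooses which sensor to query, the M network predicts the hidden variable from everything seen so far, M is trained supervised on the ground truth, and Q is trained with reward 1 whenever M is correct. The toy environment, network sizes, and one-step updates are simplifying assumptions, not the paper's exact training procedure.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

N, STEPS, EPS, GAMMA = 10, 3, 0.1, 0.9   # sensors/values, episode length, exploration, discount

q_net = nn.Sequential(nn.Linear(2 * N, 64), nn.ReLU(), nn.Linear(64, N))  # picks a sensor
m_net = nn.Sequential(nn.Linear(2 * N, 64), nn.ReLU(), nn.Linear(64, N))  # predicts hidden value
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
m_opt = torch.optim.Adam(m_net.parameters(), lr=1e-3)

for episode in range(2000):
    s = random.randrange(N)                  # hidden variable the M net must predict
    seen = torch.zeros(2 * N)                # [which sensors were queried | positive sightings]
    for t in range(STEPS):
        with torch.no_grad():                # Q agent selects a sensor (epsilon-greedy)
            a = random.randrange(N) if random.random() < EPS else int(q_net(seen).argmax())
        nxt = seen.clone()
        nxt[a] = 1.0                         # sensor a was queried
        if a == s:
            nxt[N + a] = 1.0                 # and it "saw" the hidden variable
        logits = m_net(nxt)                  # M agent predicts from what has been seen
        reward = 1.0 if int(logits.argmax()) == s else 0.0
        m_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([s]))   # supervised M update
        m_opt.zero_grad(); m_loss.backward(); m_opt.step()
        with torch.no_grad():                # one-step Q target: M's correctness is the reward
            target = reward + (GAMMA * float(q_net(nxt).max()) if t < STEPS - 1 else 0.0)
        q_loss = F.mse_loss(q_net(seen)[a], torch.tensor(target))
        q_opt.zero_grad(); q_loss.backward(); q_opt.step()
        seen = nxt
```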

Page 10:

Experiments
Sensor selection

Baselines:
• Coverage
• Random
• Coverage + DAN
• Shared representations

At each time step:
• The agent must select 1 out of 10 sensors to process observations from.
• The agent is rewarded for correctly predicting the (x, y) position of a person.
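A hypothetical gym-style environment matching this description; the grid size, sensor coverage model, and random-walk dynamics below are assumptions for illustration only.

```python
import random

class SensorSelectionEnv:
    GRID = 5                      # 5x5 grid of cells (assumed)
    N_SENSORS = 10                # as on the slide

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        # each sensor covers a fixed set of cells (assumed coverage model)
        cells = [(x, y) for x in range(self.GRID) for y in range(self.GRID)]
        self.coverage = [set(self.rng.sample(cells, 6)) for _ in range(self.N_SENSORS)]
        self.reset()

    def reset(self):
        self.pos = (self.rng.randrange(self.GRID), self.rng.randrange(self.GRID))

    def sense(self, sensor):
        """Select a sensor; returns the person's cell if covered, else None."""
        return self.pos if self.pos in self.coverage[sensor] else None

    def predict(self, prediction):
        """Score a prediction of the person's (x, y) cell, then move the person."""
        reward = 1.0 if prediction == self.pos else 0.0
        x, y = self.pos
        dx, dy = self.rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)])
        self.pos = (min(max(x + dx, 0), self.GRID - 1),
                    min(max(y + dy, 0), self.GRID - 1))
        return reward

env = SensorSelectionEnv()
obs = env.sense(3)             # observation from sensor 3 (or None if not covered)
reward = env.predict((2, 2))   # 1.0 only if the person is at (2, 2)
```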

Page 11:

Experiments
Sensor selection

[Figure: correct predictions per episode vs. number of tracked people in multi-person tracking, comparing DAN, DAN shared, DAN + Coverage, Coverage, and a random policy]

Page 12:

Experiments
Visual attention

Page 13:

Experiments
Visual attention

[Figures: test curves over training episodes for MNIST DAN, Fashion-MNIST DAN, MNIST terminal-reward, and Fashion-MNIST terminal-reward. Terminal reward setting: total reward in an episode, out of 1. Continuous reward setting: total reward in an episode, out of 12.]

Page 14:

Thank you!

Contact: [email protected]

yashsatsangi.com

Summary

• Connection between prediction rewards and information gain.

• Compute information gain estimates without explicit belief inference.