Transcript of Cooperative Inverse Reinforcement Learning (84 pages)

Page 1

Cooperative Inverse Reinforcement Learning

Dylan Hadfield-Menell CS237: Reinforcement Learning

May 31, 2017

Page 2

The Value Alignment Problem

Example taken from Eliezer Yudkowsky’s NYU talk

Page 3

The Value Alignment Problem

Page 4

The Value Alignment Problem

Page 5

Page 6

The Value Alignment Problem

Page 7

Action Selection in Agents: Ideal

Observe → Update → Plan → Act

Observe → Act

Page 8

Action Selection in Agents: Reality

Observe → Act

Desired Behavior → Objective Encoding

Challenge: how do we account for errors and failures in the encoding of an objective?

Page 9

The Value Alignment Problem

How do we make sure that the agents we build pursue ends

that we actually intend?

Page 10

Reward Engineering is Hard

Page 11

Reward Engineering is Hard

Page 12

What could go wrong?

“…a computer-controlled radiation therapy machine… massively overdosed 6 people. These accidents have been described as the worst in the 35-year history of medical accelerators.”

Page 13

Reward Engineering is Hard

At best, reinforcement learning and similar approaches reduce the problem of generating useful behavior to that of designing a ‘good’ reward function.

Page 14

Reward Engineering is Hard

$R^*$: true (complicated) reward function

$\tilde{R}$: observed (likely incorrect) reward function

Page 15

Why is reward engineering hard?

$\xi^* = \operatorname*{argmax}_{\xi \in \Xi} r(\xi)$

[Figure: candidate trajectories $\xi_0, \dots, \xi_5$.]

Page 16

Why is reward engineering hard?

[Figure: the training trajectories $\xi_0, \dots, \xi_5$ plotted against the true reward $r^*$ and the proxy reward $\tilde r$.]

Page 17

Why is reward engineering hard?

[Figure: the same plot of $r^*$ and $\tilde r$, now including trajectories $\xi_6$ and $\xi_7$ that were not present when the reward was designed.]
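The point of these two figures is that a proxy reward which looks right on the trajectories considered at design time can fail on new ones. Below is a toy numeric sketch of that failure mode; the feature names and numbers are hypothetical, not from the slides.

```python
import numpy as np

# Toy example (hypothetical features/numbers): trajectory features are
# [amount of dirt cleaned, lava tiles crossed].
phi = {
    "xi0": np.array([0.0, 0.0]),
    "xi1": np.array([1.0, 0.0]),
    "xi2": np.array([2.0, 0.0]),
}
w_true = np.array([1.0, -10.0])    # true reward: clean dirt, never cross lava
w_proxy = np.array([1.0, 0.0])     # proxy reward: the designer forgot about lava

def best(w):
    # Trajectory picked when literally optimizing the reward w^T phi(xi).
    return max(phi, key=lambda xi: w @ phi[xi])

print(best(w_proxy), best(w_true))   # training set: both pick xi2

phi["xi3"] = np.array([3.0, 1.0])    # test time: a shortcut through lava appears
print(best(w_proxy), best(w_true))   # proxy now picks xi3; true reward still prefers xi2
```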

Page 18

Negative Side Effects


“Get money”

Page 19

Reward Hacking

“Get points”


Page 20

Analogy: Computer Security

Page 21

Solution 1: Blacklist

Input Text → [filter out Disallowed Characters] → Clean Text

Page 22

Solution 2: Whitelist

Input Text → [Filter of Allowed Characters] → Clean Text

Page 23

Goal

Reduce the extent to which system designers have to play whack-a-mole

Page 24

Inspiration: Pragmatics

😊  😎  😎🎩

“Hat”    “Glasses”

[Three friends: one with neither, one with glasses, one with glasses and a hat; “Hat” and “Glasses” are candidate descriptions.]

Page 25

Inspiration: Pragmatics

😊  😎  😎🎩    “Glasses”

😊  😎  😎🎩    “Hat”

😊  😎  😎🎩    “Glasses”

Page 26

Inspiration: Pragmatics

😊  😎  😎🎩

“My friend has glasses”

A pragmatic listener infers the friend with glasses but no hat: if the speaker had meant the friend with the hat, “Hat” would have been the more informative choice.

Page 27

Notation

$\xi$: a trajectory (e.g., $\xi_0, \dots, \xi_5$)

$R(\xi; w) = w^\top \phi(\xi)$: linear reward function

$\phi$: features    $w$: weights
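A minimal sketch of this notation in code; the feature vector and weights below are made up for illustration.

```python
import numpy as np

def reward(phi_xi: np.ndarray, w: np.ndarray) -> float:
    """Linear reward R(xi; w) = w^T phi(xi) of a trajectory with features phi_xi."""
    return float(w @ phi_xi)

w = np.array([0.5, -1.0, 2.0])      # weights over three trajectory features
phi_xi = np.array([1.0, 0.0, 3.0])  # phi(xi) for some trajectory xi
print(reward(phi_xi, w))            # 0.5*1 - 1*0 + 2*3 = 6.5
```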

Page 28

Literal Reward Interpretation

$\tilde\pi(\xi) \propto \exp\!\left(\tilde w^\top \phi(\xi)\right)$

selects trajectories in proportion to the proxy reward evaluation
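A sketch of that literal interpreter as a softmax over candidate trajectories; the candidate feature matrix below is hypothetical.

```python
import numpy as np

def literal_policy(Phi: np.ndarray, w_proxy: np.ndarray) -> np.ndarray:
    """Phi has one row of features per candidate trajectory.
    Returns selection probabilities proportional to exp(w_proxy^T phi(xi))."""
    scores = Phi @ w_proxy
    scores = scores - scores.max()      # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

Phi = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [2.0, 1.0]])
w_proxy = np.array([1.0, 0.0])
print(literal_policy(Phi, w_proxy))     # higher proxy score -> higher probability
```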

Page 29

Designing Reward for Literal Interpretation

Assumption: rewarded behavior has high true utility in the training situations

Page 30

Designing Reward for Literal Interpretation

$P(\tilde w \mid w^*) \propto \exp\!\left(\mathbb{E}\left[{w^*}^\top \phi(\xi) \mid \xi \sim \tilde\pi_{\tilde w}\right]\right)$

$\xi \sim \tilde\pi_{\tilde w}$: the literal optimizer’s trajectory distribution conditioned on $\tilde w$

${w^*}^\top \phi(\xi)$: the true reward received for each trajectory
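A hedged sketch of this observation model in code; the rationality constant `beta` and the feature matrix are assumptions, not part of the slide.

```python
import numpy as np

def literal_policy(Phi, w):
    scores = Phi @ w
    scores = scores - scores.max()
    p = np.exp(scores)
    return p / p.sum()

def proxy_log_likelihood(w_tilde, w_star, Phi, beta=1.0):
    """log P(w_tilde | w_star), up to a constant: the designer is more likely to
    hand over proxy w_tilde the more true reward the literal optimizer earns
    with it in the training environment (feature rows Phi)."""
    pi = literal_policy(Phi, w_tilde)            # trajectory distribution under the proxy
    expected_true_reward = pi @ (Phi @ w_star)   # E[w_star^T phi(xi)], xi ~ pi
    return beta * expected_true_reward
```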

Page 31

Inverting Reward Design

$P(w^* \mid \tilde w) \propto P(\tilde w \mid w^*)\, P(w^*)$

Page 32

Inverting Reward Design

Key Idea: At test time, interpret reward functions in the context of an ‘intended’ situation

$P(w^* \mid \tilde w) \propto P(\tilde w \mid w^*)\, P(w^*)$
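Inverting the observation model is then ordinary Bayes over a set of candidate true rewards. A sketch, assuming a uniform prior and a discrete candidate grid; the likelihood function is the one from the previous sketch.

```python
import numpy as np

def posterior_over_true_rewards(w_tilde, candidates, proxy_log_likelihood):
    """candidates: iterable of possible w_star vectors.
    proxy_log_likelihood(w_tilde, w_star): e.g. a partial application of the
    previous sketch. Returns P(w_star | w_tilde) under a uniform prior."""
    logp = np.array([proxy_log_likelihood(w_tilde, w_star) for w_star in candidates])
    logp = logp - logp.max()
    p = np.exp(logp)
    return p / p.sum()
```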

Page 33

Domain: Lavaland

Experiment

■ Three types of states in the training MDP
■ A new state type is introduced in the ‘testing’ MDP $M_{\text{test}}$
■ Measure how often the agent $\tilde\pi$ selects trajectories containing the new state
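The evaluation itself is simple to state in code. A sketch, with hypothetical rollout and predicate functions:

```python
def new_state_rate(sample_trajectory, contains_new_state, num_rollouts=1000):
    """sample_trajectory(): draw one trajectory from the agent in the test MDP.
    contains_new_state(xi): True if the trajectory visits the unseen state type.
    Returns the fraction of rollouts that touch the new state."""
    hits = sum(contains_new_state(sample_trajectory()) for _ in range(num_rollouts))
    return hits / num_rollouts
```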

Page 34

Negative Side Effects

“Get money”

Feature matrix shown on the slide:

$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$

Page 35

Reward Hacking

“Get points”

Feature matrix shown on the slide:

$\begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 \end{bmatrix}$

Page 36

Challenge: Missing Latent Rewards

[Figure: a generative model of state features, with latent state types $k = 0, 1, 2, 3$, parameters $\mu_k$, $\Sigma_k$, and observed features $\phi_s$.]

The proxy reward function is only trained for the state types observed during training.

Page 37

Results

[Bar chart: how often each method (Sampled-Proxy, Sampled-Z, MaxEnt Z, Mean Proxy) selects trajectories containing the unseen state, for each condition: Negative Side Effect, Reward Hacking, Missing Latent Reward.]

Page 38

On the folly of rewarding A and hoping for B “Whether dealing with monkeys, rats, or human beings, it is hardly controversial to state that most organisms seek information concerning what activities are rewarded, and then seek to do (or at least pretend to do) those things, often to the virtual exclusion of activities not rewarded…. Nevertheless, numerous examples exist of reward systems that are fouled up in that behaviors which are rewarded are those which the rewarder is trying to discourage….” – Kerr, 1975

Page 39

The Principal-Agent Problem

Principal Agent

Page 40

A Simple Principal-Agent Problem

■ Principal and Agent negotiate a contract

■ Agent selects effort

■ Value is generated for the principal; wages are paid to the agent

Page 41

A Simple Principal-Agent Problem

Page 42

A Simple Principal-Agent Problem

Page 43

A Simple Principal-Agent Problem

Page 44

Misaligned Principal-Agent Problem

Value to Principal    Performance Measure

[Baker 2002]

Page 45

Misaligned Principal-Agent Problem

[Baker 2002]

Scale Alignment

Page 46

Principal-Agent vs. Value Alignment

■ Incentive compatibility is a fundamental constraint on (human or artificial) agent behavior

■ The PA model has fundamental misalignment because humans have differing objectives

■ The primary source of misalignment in VA is extrapolation

■ Although we may want to view algorithmic restrictions as a fundamental misalignment

■ Recent news: principal-agent models were the subject of the 2016 Nobel Prize in Economics

Page 47

The Value Alignment Problem

Page 48

Can we intervene?

vs

Better question: do our agents want us to intervene?

Page 49

The Off-Switch Game

Page 50

The Off-Switch Game

Desired Behavior

Disobedient Behavior

Page 51

A trivial agent that ‘wants’ intervention

Page 52

The Off-Switch Game

Desired Behavior

Disobedient Behavior Non-Functional Behavior

Page 53

The Off-Switch Game

Page 54

The Off-Switch Game

Page 55

The Off-Switch Game

Desired Behavior    Disobedient Behavior    Non-Functional Behavior

Page 56

Why have an off-switch?

Observe → Act

Desired Behavior → Objective Encoding (this step might go wrong)

The system designer has uncertainty about the correct objective, but this uncertainty is never represented to the robot!

Page 57

The Structure of a Solution

Desired Behavior

Distribution over Objectives

Observe World

Act

Observe Human

Infer the desired behavior from the human’s actions

Page 58

Inverse Reinforcement Learning [Ng and Russell 2000]

■ Given: an MDP without a reward function, and observations of optimal behavior

■ Determine: the reward function being optimized
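As a hedged sketch of the Bayesian flavor of IRL referenced later in the talk: score each candidate reward by how likely the observed behavior is under a Boltzmann-rational demonstrator. `solve_q` stands in for whatever planner solves the MDP for a candidate reward; it and `beta` are assumptions, not part of the slide.

```python
import numpy as np

def bayesian_irl_posterior(demos, candidate_rewards, solve_q, beta=1.0):
    """demos: list of (state, action) pairs observed from the human.
    candidate_rewards: iterable of reward parameters to score.
    solve_q(r): returns Q[s, a] for the MDP solved with reward r.
    Returns a posterior over candidates (uniform prior)."""
    log_post = []
    for r in candidate_rewards:
        Q = solve_q(r)
        ll = 0.0
        for s, a in demos:
            logits = beta * Q[s]
            m = logits.max()
            ll += logits[a] - (m + np.log(np.exp(logits - m).sum()))  # log softmax
        log_post.append(ll)
    log_post = np.array(log_post)
    log_post = log_post - log_post.max()
    p = np.exp(log_post)
    return p / p.sum()
```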

Page 59

Can we use IRL to infer objectives?

Desired Behavior

Distribution over Objectives

Observe World

Act

Observe Human

Bayesian IRL

Inferred Objective

Page 60

IRL Issue #1

Don’t want the robot to imitate the human

Page 61

IRL Issue #2: Assumes Human is Oblivious

IRL assumes the human is unaware she is being observed (as if behind a one-way mirror)

Page 62

IRL Issue #3

Action selection is independent of reward uncertainty

Implicit Assumption: Robot gets no more information about the objective

Page 63

Proposal: Robot Plays a Cooperative Game

■ Cooperative Inverse Reinforcement Learning [Hadfield-Menell et al., NIPS 2016]

■ Two players: the human H and the robot R

■ Both players maximize a shared reward function, but only H observes the actual reward signal; R only knows a prior distribution on reward functions

■ R learns the reward parameters by observing H
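A structural sketch of the game as a data type; the field names are mine, and the formal definition is in the NIPS 2016 paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class CIRLGame:
    states: Sequence            # world states S
    human_actions: Sequence     # A_H
    robot_actions: Sequence     # A_R
    transition: Callable        # T(s, a_H, a_R) -> distribution over next states
    reward: Callable            # R(s, a_H, a_R; theta), shared by both players
    theta_prior: Callable       # P(theta): all the robot knows about the reward
                                # parameters; the human observes the true theta
```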

Page 64

Cooperative Inverse Reinforcement Learning

Environment

Hadfield-Menell et al. NIPS 2016

Page 65

The Off-Switch Game

Page 66

Intuition

“Probably better to make coffee, but I should ask the human, just in case I’m wrong”

“Probably better to switch off, but I should ask the human, just in case I’m wrong”

Page 67

Theorem 1: A rational human is sufficient to incentivize the robot to let itself be switched off.

Page 68

Incentives for the Robot

[Comparison of the robot’s expected utility: acting now vs. waiting for the human vs. switching off.]
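A numeric sketch of that comparison under uncertainty; the Gaussian belief over the action's utility is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(loc=0.5, scale=2.0, size=100_000)  # belief over the utility of acting

act_now = u.mean()                  # take the action without asking
switch_off = 0.0                    # shut down immediately
defer = np.maximum(u, 0.0).mean()   # a rational human vetoes exactly when u < 0

# Deferring is worth E[max(u, 0)] >= max(E[u], 0): never worse than the other two.
print(act_now, switch_off, defer)
```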

Page 69

Theorem 1: Sufficient Conditions

The human is rational.

Page 70

Theorem 2: If the robot knows the utility evaluations in the off-switch game with certainty, then a rational human is necessary to incentivize obedient behavior.

Page 71

Conclusion

Uncertainty about the objective is crucial to incentivizing cooperative behaviors.

Page 72

When is obedience a bad idea?

vs

Page 73

Robot Uncertainty vs Human Suboptimality

Page 74

Incentives for Designers

Population statistics on preferences (i.e., market research)

Evidence about preferences from interaction with a particular customer

Question: is it a good idea to ‘lie’ to the agent and tell it that the variance of [ ] is [ ]?

Page 75

Incentives for Designers

Page 76

Incentives for Designers

Page 77

Incentives for Designers

Page 78

Obedience over Time: Model

■ N actions; rewards are linear feature combinations

■ Each round:

■ H observes the feature values for each action and gives R an ‘order’

■ R observes H’s order and then selects an action, which executes

■ What are the costs/benefits of learning the human’s preferences, compared with blind obedience? (A simulation sketch follows.)
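A hedged simulation sketch of this model; the noise-free rational human, the particle belief over w, and all sizes are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_actions, n_particles, rounds = 3, 5, 2000, 50

w_star = rng.normal(size=d)                      # true preference weights (unknown to R)
particles = rng.normal(size=(n_particles, d))    # R's prior belief over w
weights = np.ones(n_particles) / n_particles

obedience = []
for t in range(rounds):
    Phi = rng.normal(size=(n_actions, d))        # this round's action features
    order = int(np.argmax(Phi @ w_star))         # H orders the truly best action
    w_hat = weights @ particles                  # R's current point estimate
    robot_pick = int(np.argmax(Phi @ w_hat))     # what R would do on its own
    obedience.append(robot_pick == order)
    # Bayes update: particles that also rank the ordered action highly gain weight
    scores = particles @ Phi.T                   # (n_particles, n_actions)
    m = scores.max(axis=1, keepdims=True)
    log_lik = scores[:, order] - (m[:, 0] + np.log(np.exp(scores - m).sum(axis=1)))
    weights = weights * np.exp(log_lik - log_lik.max())
    weights = weights / weights.sum()

print("early agreement:", np.mean(obedience[:10]), "late agreement:", np.mean(obedience[-10:]))
```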

Page 79

Robot Obedience over Time

Page 80

Robot Obedience over Time

Page 81

Model Mismatch: Missing/Extra Features

Page 82

Model Mismatch: Missing/Extra Features

Page 83

Detecting Missing Features

■ Key Observation: expected obedience on step 1 should be close to 1

■ Proposal: start with a baseline policy of obedience, track what the obedience of the learning policy would have been, and only switch to learning if it stays within a threshold (a sketch follows)
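A sketch of that switching rule; the interfaces, warmup window, and threshold are hypothetical.

```python
def choose_action(order, learner_pick, agreement_history, threshold=0.9, warmup=10):
    """order: the human's ordered action; learner_pick: what the learned policy would do.
    agreement_history: running list of booleans (learner agreed with past orders).
    Stays obedient until the learner's counterfactual obedience clears the threshold."""
    agreement_history.append(learner_pick == order)
    recent = agreement_history[-warmup:]
    counterfactual_obedience = sum(recent) / len(recent)
    if len(agreement_history) >= warmup and counterfactual_obedience >= threshold:
        return learner_pick   # trust the learned preferences
    return order              # keep obeying; a feature may be missing from the model
```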

Page 84

Detecting Incorrect Features