
Reinforcement Learning

Summer 2019

Stefan Riezler

Computational Linguistics & IWR, Heidelberg University, Germany

[email protected]


Overview


- Formalizing the reinforcement learning problem: Markov Decision Processes (MDPs)
- Dynamic programming techniques for policy evaluation and policy optimization
- Sampling-based techniques: Monte-Carlo methods, Temporal-Difference learning, Q-learning
- Policy-gradient methods: score function gradient estimators, actor-critic methods
- Seq2seq reinforcement learning: bandit structured prediction, actor-critic neural seq2seq learning
- Off-policy/counterfactual seq2seq reinforcement learning
- Seq2seq reinforcement learning from human feedback


Textbooks

- Richard S. Sutton and Andrew G. Barto (2018, 2nd edition): Reinforcement Learning: An Introduction. MIT Press.
  http://incompleteideas.net/sutton/book/the-book-2nd.html
- Csaba Szepesvari (2010): Algorithms for Reinforcement Learning. Morgan & Claypool.
  https://sites.ualberta.ca/~szepesva/RLBook.html
- Dimitri Bertsekas and John Tsitsiklis (1996): Neuro-Dynamic Programming. Athena Scientific.
  Neuro-dynamic programming is another name for deep reinforcement learning; the book contains a lot of proofs. An analog version can be ordered at http://www.athenasc.com/ndpbook.html


Introduction

Reinforcement Learning (RL) Philosophy

- A hedonistic learning system that wants something, and adapts its behavior in order to maximize a special signal or reward from its environment.
- Interactive learning by trial and error, using the consequences of its own actions in uncharted territory to learn to maximize expected reward.
- Weak supervision signal, since no gold standard examples from an expert are available.


Reinforcement Learning Schema

- RL as Google DeepMind would like to see it (image from David Silver).


- A real-world example: Interactive Machine Translation
  - action = predicting a target word
  - reward = per-sentence translation quality
  - state = source sentence and target history


Agent/system and environment/user interact

- at each of a sequence of time steps t = 0, 1, 2, ...,
- where the agent receives a state representation S_t,
- on which basis it selects an action A_t,
- and as a consequence, it receives a reward R_{t+1},
- and finds itself in a new state S_{t+1}.

Goal of RL: Maximize the total amount of reward an agent receives in such interactions in the long run.


Markov Decision Processes

Formalizing User/Environment: Markov Decision Processes (MDPs)

A Markov decision process is a tuple 〈S, A, P, R〉 where

- S is a set of states,
- A is a finite set of actions,
- P is a state transition probability matrix s.t. P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a],
- R is a reward function s.t. R^a_s = E[R_{t+1} | S_t = s, A_t = a].
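To make these objects concrete, here is a minimal Python sketch of a toy MDP with hypothetical numbers (two states, two actions); only the array shapes of P and R matter, the values are made up for illustration.

import numpy as np

n_states, n_actions = 2, 2

# P[a, s, s'] = probability of moving from s to s' when taking action a (hypothetical values)
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])

# R[a, s] = expected immediate reward E[R_{t+1} | S_t = s, A_t = a] (hypothetical values)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

assert np.allclose(P.sum(axis=2), 1.0)  # each P[a, s, :] is a distribution over next states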


Dynamics of MDPs

The one-step dynamics of the environment under the Markov property is completely specified by the probability distribution over pairs of next state and reward s', r, given state and action s, a:

- p(s', r | s, a) = P[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a].

Exercise: Specify P^a_{ss'} and R^a_s in terms of p(s', r | s, a).

P^a_{ss'} = p(s' | s, a) = \sum_{r ∈ R} p(s', r | s, a),

R^a_s = \sum_{r ∈ R} r \sum_{s' ∈ S} p(s', r | s, a).


Formalizing Agent/System: Policies

A stochastic policy is a distribution over actions given states s.t.

- π(a|s) = P[A_t = a | S_t = s].
- A policy completely specifies the behavior of an agent/system.
- Policies are parameterized as π_θ, e.g. by a linear model or a neural network; we write π for π_θ where unambiguous.
- Deterministic policies a = π(s) are also possible.


Dynamic Programming

Policy Evaluation and Policy Optimization

Two central tasks in RL:

- Policy evaluation (a.k.a. prediction): Evaluate the expected reward for a given policy.
- Policy optimization (a.k.a. learning/control): Find the optimal policy / optimize a parametric policy under the expected reward criterion.


Return and Value Functions

- The total discounted return from time-step t for discount γ ∈ [0, 1] is
  G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = \sum_{k=0}^{∞} γ^k R_{t+k+1}.
- The action-value function q_π(s, a) on an MDP is the expected return starting from state s, taking action a, and following policy π s.t.
  q_π(s, a) = E_π[G_t | S_t = s, A_t = a].
- The state-value function v_π(s) on an MDP is the expected return starting from state s and following policy π s.t.
  v_π(s) = E_π[G_t | S_t = s] = E_{a∼π}[q_π(s, a)].


Bellman Expectation Equation

The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state s.t.

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
       = \sum_{a ∈ A} π(a|s) ( R^a_s + γ \sum_{s' ∈ S} P^a_{ss'} v_π(s') ).

In matrix notation:

v_π = R^π + γ P^π v_π,   i.e.

[ v(1) ]   [ R_1 ]       [ P_11 ... P_1n ] [ v(1) ]
[  ...  ] = [ ... ]  + γ [       ...     ] [  ...  ]
[ v(n) ]   [ R_n ]       [ P_n1 ... P_nn ] [ v(n) ]


Policy Evaluation by Linear Programming

The value of v_π can be found directly by solving the linear equations of the Bellman Expectation Equation:

- Solving linear equations:
  v_π = (I − γ P^π)^{-1} R^π
- Only applicable to small MDPs.

Exercise: Derive v_π from the Bellman Expectation Equation.

v_π = R^π + γ P^π v_π
(I − γ P^π) v_π = R^π
v_π = (I − γ P^π)^{-1} R^π
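A minimal numpy sketch of this direct solution, assuming hypothetical policy-collapsed arrays P^π (transition matrix under π) and R^π (expected rewards under π) for a 3-state MDP; the numbers are made up for illustration.

import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])   # P^pi[s, s']
R_pi = np.array([1.0, 0.0, 2.0])     # R^pi[s]

# v_pi = (I - gamma * P^pi)^{-1} R^pi, solved as a linear system rather than by explicit inversion
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(v_pi)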


Policy Evaluation by Dynamic Programming (DP)

The value v_π can also be found by iterative application of the Bellman Expectation Equation:

- Iterative policy evaluation:
  v_{k+1} = R^π + γ P^π v_k.
- Performs dynamic programming by recursive decomposition of the Bellman equation.
- Can be parallelized (or backed up asynchronously), thus applicable to large MDPs.
- Converges to v_π.
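A minimal sketch of this iterative scheme, reusing the same hypothetical 3-state P^π and R^π as above; the loop repeats the Bellman expectation backup until it is numerically a fixed point.

import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])   # hypothetical policy-collapsed transitions
R_pi = np.array([1.0, 0.0, 2.0])     # hypothetical policy-collapsed rewards

v = np.zeros(3)
for k in range(10_000):
    v_new = R_pi + gamma * P_pi @ v          # v_{k+1} = R^pi + gamma * P^pi v_k
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(k, v)                                  # converges to the same v_pi as the direct solve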


Policy Optimization using the Bellman Optimality Equation

An optimal policy π* can be found by maximizing over the optimal action-value function q*(s, a) = max_π q_π(s, a) s.t.

π*(s) = argmax_a q*(s, a).

The optimal value functions are recursively related by the Bellman Optimality Equation:

q*(s, a) = E_{π*}[R_{t+1} + γ max_{a'} q*(S_{t+1}, a') | S_t = s, A_t = a]
         = R^a_s + γ \sum_{s' ∈ S} P^a_{ss'} max_{a'} q*(s', a').


Policy Optimization by Value Iteration

The Bellman Optimality Equation is non-linear and requires iterative solutions such as value iteration by dynamic programming:

- Value iteration for the q-function:
  q_{k+1}(s, a) = R^a_s + γ \sum_{s' ∈ S} P^a_{ss'} max_{a'} q_k(s', a').
- Converges to q*(s, a).
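A minimal sketch of q-value iteration, assuming hypothetical arrays P[a, s, s'] and R[a, s] for a 3-state, 2-action MDP (values made up for illustration).

import numpy as np

gamma = 0.9
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
              [[0.0, 0.5, 0.5], [0.9, 0.0, 0.1], [0.2, 0.2, 0.6]]])  # P[a, s, s']
R = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])                                      # R[a, s]

q = np.zeros((3, 2))                                    # q[s, a]
for _ in range(10_000):
    v = q.max(axis=1)                                   # max_{a'} q_k(s', a')
    q_new = R.T + gamma * np.einsum('ast,t->sa', P, v)  # Bellman optimality backup
    if np.max(np.abs(q_new - q)) < 1e-10:
        break
    q = q_new
greedy_policy = q.argmax(axis=1)                        # pi*(s) = argmax_a q*(s, a)
print(q.round(3), greedy_policy)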


Summary: Dynamic Programming

- Earliest RL algorithms with well-defined convergence properties.
- The Bellman equation gives a recursive decomposition for iterative solutions to various problems in policy evaluation and policy optimization.
- Can be trivially parallelized or even run asynchronously.
- We need to know a full MDP model with all transitions and rewards, and touch all of them in learning!


Monte-Carlo Methods

Policy Evaluation by Monte-Carlo (MC) Sampling

- Monte-Carlo Policy Evaluation:
  - Sample episodes S_0, A_0, R_1, ..., R_T ∼ π.
  - For each sampled episode:
    - Increment the state counter N(s) ← N(s) + 1.
    - Increment the total return S(s) ← S(s) + G_t.
  - Estimate the mean return V(s) = S(s)/N(s).
- Learns v_π from episodes sampled under policy π, thus model-free.
- Updates can be done at the first step or at every time step t where state s is visited in an episode.
- Converges to v_π for a large number of samples.
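A minimal sketch of every-visit Monte-Carlo policy evaluation on a hypothetical 1D random walk (states 0..4, uniform random policy, reward 1 only for exiting to the right); environment and constants are made up for illustration.

import numpy as np
rng = np.random.default_rng(0)

gamma, n_states = 1.0, 5

def sample_episode(start=2):
    # Episodes under the uniform random policy: step left or right until leaving [0, n_states)
    s, traj = start, []
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0   # reward only for exiting right
        traj.append((s, r))
        s = s_next
    return traj

N = np.zeros(n_states)            # state counters N(s)
S = np.zeros(n_states)            # accumulated returns S(s)
for _ in range(5000):
    G = 0.0
    for s, r in reversed(sample_episode()):   # accumulate returns G_t backwards
        G = r + gamma * G
        N[s] += 1                              # every-visit update
        S[s] += G
V = S / np.maximum(N, 1)                       # mean return V(s) = S(s) / N(s)
print(V)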


Incremental Mean

Use the definition of the incremental mean µ_k s.t.

µ_k = (1/k) \sum_{j=1}^{k} x_j
    = (1/k) ( x_k + \sum_{j=1}^{k-1} x_j )
    = (1/k) ( x_k + (k - 1) µ_{k-1} )
    = µ_{k-1} + (1/k) ( x_k - µ_{k-1} ).


Incremental Monte-Carlo Updates

- Incremental Monte-Carlo Policy Evaluation:
  - For each sampled episode, for each step t:
    - N(S_t) ← N(S_t) + 1,
    - V(S_t) ← V(S_t) + α (G_t − V(S_t)).
- Can be seen as an incremental update towards the actual return.
- α can be 1/N(S_t) or a more general term α > 0.


Policy Evaluation by Temporal Difference (TD) Learning

- TD(0):
  - For each sampled episode, for each step t:
    - V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t)).
- Combines sampling and recursive computation by updating toward the estimated return R_{t+1} + γ V(S_{t+1}).
- Recall R_{t+1} + γ V(S_{t+1}) from the Bellman Expectation Equation, here called the TD target.
- δ_t = (R_{t+1} + γ V(S_{t+1}) − V(S_t)) is called the TD error.
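A minimal sketch of TD(0) on the same hypothetical random walk as in the Monte-Carlo sketch; the only change is that the update bootstraps from V(S_{t+1}) instead of waiting for the full return.

import numpy as np
rng = np.random.default_rng(0)

alpha, gamma, n_states = 0.1, 1.0, 5
V = np.zeros(n_states)

for _ in range(5000):
    s = 2
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        v_next = V[s_next] if 0 <= s_next < n_states else 0.0   # terminal states have value 0
        V[s] += alpha * (r + gamma * v_next - V[s])             # update toward the TD target
        s = s_next
print(V)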


TD Learning with n-Step Returns

n-Step Returns:

- G_t^{(1)} = R_{t+1} + γ V(S_{t+1}).
- G_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 V(S_{t+2}).
- ...
- G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n V(S_{t+n}).

n-Step TD Learning:

- V(S_t) ← V(S_t) + α ( G_t^{(n)} − V(S_t) ).

Exercise: How can incremental Monte Carlo be recovered by TD(n)?

Monte Carlo: G_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T-1} R_T.


TD Learning with λ-Weighted Returns

λ-Returns:

- Average n-step returns using
  G_t^λ = (1 − λ) \sum_{n=1}^{∞} λ^{n-1} G_t^{(n)},
  where λ ∈ [0, 1].

TD(λ) Learning:

- V(S_t) ← V(S_t) + α ( G_t^λ − V(S_t) ).

Exercise: How can TD(0) be recovered from TD(λ)?

λ = 0 ⇒ G_t^λ = G_t^{(1)}, i.e. the TD(0) target.


Policy Optimization by Q-Learning

- Q-Learning [Watkins and Dayan, 1992]:
  - For each sampled episode:
    - Initialize S_t.
    - For each step t:
      - Sample A_t, observe R_{t+1}, S_{t+1}.
      - Q(S_t, A_t) ← Q(S_t, A_t) + α (R_{t+1} + γ max_{a'} Q(S_{t+1}, a') − Q(S_t, A_t)).
      - S_t ← S_{t+1}.
- Q-Learning combines sampling and TD(0)-style recursive computation for policy optimization.
- Recall R_{t+1} + γ max_{a'} Q(S_{t+1}, a') from the Bellman Optimality Equation.
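A minimal sketch of tabular Q-learning with an ε-greedy behavior policy on a hypothetical 1D chain (action 0 steps left, action 1 steps right, reward 1 only for stepping off the right end); all constants are made up.

import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
alpha, gamma, eps = 0.1, 0.95, 0.1
Q = np.zeros((n_states, n_actions))

for _ in range(2000):
    s = 2
    while 0 <= s < n_states:
        # epsilon-greedy exploration; the update below bootstraps from the greedy action (off-policy)
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = s + (1 if a == 1 else -1)
        r = 1.0 if s_next == n_states else 0.0
        q_next = Q[s_next].max() if 0 <= s_next < n_states else 0.0
        Q[s, a] += alpha * (r + gamma * q_next - Q[s, a])        # Q-learning update
        s = s_next
print(Q.round(2))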


Summary: Monte-Carlo and Temporal-Difference Learning

- MC has zero bias, but high variance that grows with the number of random actions, transitions, and rewards in the computation of the return.
- TD techniques
  - reduce variance since the TD target depends on a single random action, transition, and reward,
  - can learn from incomplete episodes and can use online updates,
  - introduce bias and use approximations which are exact only in special cases.


Policy Gradient Methods

Summary: Value-Based/Critic-Only Methods

- All techniques discussed so far, DP, MC, and TD, focus on value functions, not policies.
- The value function is also called the critic, hence critic-only methods.
- Value-based techniques are inherently indirect: they learn an approximate value function and extract a near-optimal policy from it.
- Overview of DP, MC, and TD in [Sutton and Barto, 1998] and [Szepesvari, 2009].
- Problems:
  - Closeness to the optimal policy cannot be quantified.
  - Continuous action spaces have to be discretized in order to fit into the MDP model.
  - Focus is on deterministic instead of stochastic policies.


Policy-Gradient Methods

- Policy-gradient techniques attempt direct optimization of the expected return
  E_{π_θ}[G_t]
  for a parameterized stochastic policy
  π_θ(a|s) = P[A_t = a | S_t = s, θ].
- The policy function is also called the actor.
- We will discuss actor-only (optimize a parametric policy) and actor-critic (learn both policy and critic parameters in tandem) methods.


One-Step MDPs/Gradient Bandits

Let p_θ(y) denote the probability of an action/output, and ∆(y) the reward/quality of an output.

Objective: E_{p_θ}[∆(y)]

Gradient: ∇_θ E_{p_θ}[∆(y)] = ∇_θ \sum_y p_θ(y) ∆(y)
                            = \sum_y ∇_θ p_θ(y) ∆(y)
                            = \sum_y (p_θ(y)/p_θ(y)) ∇_θ p_θ(y) ∆(y)
                            = \sum_y p_θ(y) ∇_θ log p_θ(y) ∆(y)
                            = E_{p_θ}[∆(y) ∇_θ log p_θ(y)].


Score Function Gradient Estimator for Bandit

- Bandit Gradient Ascent:
  - Sample y_i ∼ p_θ,
  - Update θ ← θ + α (∆(y_i) ∇_θ log p_θ(y_i)).
- The update by the stochastic gradient g_i = ∆(y_i) ∇_θ log p_θ(y_i) yields an unbiased estimator of ∇_θ E_{p_θ}[∆(y)].
- Intuition: ∇_θ log p_θ(y) is called the score function.
  - Moving in the direction of g_i pushes up the score of the sample y_i in proportion to its reward ∆(y_i).
  - In RL terms: high-reward samples are weighted higher, i.e. reinforced!
  - The estimator is valid even if ∆(y) is non-differentiable.
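A minimal sketch of this estimator for a hypothetical 3-armed bandit with a softmax policy; the reward function and constants are made up, and the score ∇_θ log p_θ(y) of the softmax is computed in closed form.

import numpy as np
rng = np.random.default_rng(0)

expected_reward = np.array([0.2, 0.5, 0.9])   # hypothetical mean rewards Delta(y), unknown to the learner
theta = np.zeros(3)                           # parameters of the softmax policy p_theta(y)
alpha = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(2000):
    p = softmax(theta)
    y = rng.choice(3, p=p)                                    # sample y_i ~ p_theta
    delta = expected_reward[y] + 0.1 * rng.standard_normal()  # observe noisy reward Delta(y_i)
    score = -p
    score[y] += 1.0                                           # grad_theta log softmax(theta)[y_i]
    theta += alpha * delta * score                            # stochastic score function gradient step
print(softmax(theta))                                         # probability mass concentrates on high-reward arms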


Score Function Gradient Estimator for MDPs

Let y = S_0, A_0, R_1, ..., R_T ∼ π_θ be an episode, and R(y) = R_1 + γ R_2 + ... + γ^{T-1} R_T = \sum_{t=1}^{T} γ^{t-1} R_t its total discounted reward.

- Objective: E_{π_θ}[R(y)].
- Gradient: E_{π_θ}[R(y) \sum_{t=0}^{T-1} ∇_θ log π_θ(A_t | S_t)].
- Reinforcement Gradient Ascent:
  - Sample an episode y = S_0, A_0, R_1, ..., R_T ∼ π_θ,
  - Obtain the reward R(y) = \sum_{t=1}^{T} γ^{t-1} R_t,
  - Update θ ← θ + α (R(y) \sum_{t=0}^{T-1} ∇_θ log π_θ(A_t | S_t)).
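A minimal REINFORCE-style sketch of this episodic update, using a tabular softmax policy on the hypothetical 1D chain from the Q-learning example; all constants are made up.

import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
gamma, alpha = 0.95, 0.05
theta = np.zeros((n_states, n_actions))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(3000):
    s, traj = 2, []
    while 0 <= s < n_states:                        # sample an episode y ~ pi_theta
        a = rng.choice(n_actions, p=softmax(theta[s]))
        s_next = s + (1 if a == 1 else -1)
        r = 1.0 if s_next == n_states else 0.0
        traj.append((s, a, r))
        s = s_next
    R_y = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))   # total discounted reward R(y)
    for s, a, _ in traj:                            # theta += alpha * R(y) * sum_t grad log pi(A_t|S_t)
        score = -softmax(theta[s])
        score[a] += 1.0
        theta[s] += alpha * R_y * score
print(softmax(theta[2]))                            # action probabilities at the start state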


General Form of Policy Gradient Algorithms

Formalized for the expected per time-step reward with respect to the action-value q_{π_θ}(S_t, A_t).

- Objective: E_{π_θ}[q_{π_θ}(S_t, A_t)].
- Gradient: E_{π_θ}[q_{π_θ}(S_t, A_t) ∇_θ log π_θ(A_t | S_t)].
- Policy Gradient Ascent:
  - Sample an episode y = S_0, A_0, R_1, ..., R_T ∼ π_θ.
  - For each time step t:
    - Obtain the reward q_{π_θ}(S_t, A_t),
    - Update θ ← θ + α (q_{π_θ}(S_t, A_t) ∇_θ log π_θ(A_t | S_t)).


Policy Gradient Algorithms

- The general form for the expected per time-step return q_{π_θ}(S_t, A_t) is known as the Policy Gradient Theorem [Sutton et al., 2000].
- Since q_{π_θ}(s, a) is normally not known, one can use the actual discounted return G_t at time step t, calculated from the sampled episode. This leads to the REINFORCE algorithm [Williams, 1992].
- Problems of policy gradient algorithms, esp. REINFORCE:
  - Large variance in discounted returns calculated from sampled episodes.
  - Each gradient update is done independently of past gradient estimates.


Variance Reduction by Baselines

- The variance of REINFORCE can be reduced by comparing the actual return G_t to a baseline b(s) for state s that is constant with respect to actions a. Example: the average return so far.
- Update:
  θ ← θ + α (G_t − b(S_t)) ∇_θ log π_θ(A_t | S_t).
- Can be interpreted as a Control Variate [Ross, 2013]:
  - Goal is to augment the random variable X (= stochastic gradient) with a highly correlated variable Y such that Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y) is reduced.
  - The gradient remains unbiased since E[X − Y + E[Y]] = E[X].


Exercise: Show that E[Y] = 0 for constant baselines.

Proof:

E_{π_θ}[∇_θ log π_θ(a|s) b(s)] = \sum_a π_θ(a|s) (∇_θ π_θ(a|s) / π_θ(a|s)) b(s)
                               = b(s) ∇_θ \sum_a π_θ(a|s)
                               = b(s) ∇_θ 1
                               = 0.


Actor-Critic Methods

- Learning a critic in order to get an improved estimate of the expected return will also reduce variance.
  - Critic: TD(0) update for the linear approximation q_{π_θ}(s, a) ≈ q_w(s, a) = φ(s, a)^T w.
  - Actor: policy gradient update reinforced by q_w(s, a).
- Simple Actor-Critic [Konda and Tsitsiklis, 2000]:
  - Sample a ∼ π_θ.
  - For each step t:
    - Sample reward r ∼ R^a_s, transition s' ∼ P^a_{s,·}, action a' ∼ π_θ(s', ·),
    - δ ← r + γ q_w(s', a') − q_w(s, a),
    - θ ← θ + α ∇_θ log π_θ(a|s) q_w(s, a),
    - w ← w + β δ φ(s, a),
    - a ← a', s ← s'.
- True online updates of the policy π_θ in each step!
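A minimal tabular sketch of this algorithm on the same hypothetical chain; with one-hot features φ(s, a), the linear critic q_w(s, a) = φ(s, a)^T w reduces to a table w[s, a], and the critic's TD(0) update becomes a per-entry update.

import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
gamma, alpha, beta = 0.95, 0.05, 0.1
theta = np.zeros((n_states, n_actions))   # actor parameters
w = np.zeros((n_states, n_actions))       # critic parameters (one-hot features)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(3000):
    s = 2
    a = rng.choice(n_actions, p=softmax(theta[s]))
    while 0 <= s < n_states:
        s_next = s + (1 if a == 1 else -1)            # take a, observe r and s'
        r = 1.0 if s_next == n_states else 0.0
        if 0 <= s_next < n_states:
            a_next = rng.choice(n_actions, p=softmax(theta[s_next]))
            q_next = w[s_next, a_next]
        else:
            a_next, q_next = None, 0.0                # terminal: no successor action-value
        delta = r + gamma * q_next - w[s, a]          # TD error of the critic
        score = -softmax(theta[s])
        score[a] += 1.0
        theta[s] += alpha * score * w[s, a]           # actor step, reinforced by q_w(s, a)
        w[s, a] += beta * delta                       # critic step
        s, a = s_next, a_next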


Bias and Compatible Function Approximation

- Approximating q_{π_θ}(s, a) ≈ q_w(s, a) introduces bias, unless
  1. the value approximator is compatible with the policy, i.e., the change in value equals the score function s.t.
     ∇_w q_w(s, a) = ∇_θ log π_θ(a|s),
  2. the parameters w are set to minimize the squared error
     ε = E_{π_θ}[(q_{π_θ}(s, a) − q_w(s, a))^2].
- Then the policy gradient is exact:
  E_{π_θ}[q_{π_θ}(s, a) ∇_θ log π_θ(a|s)] = E_{π_θ}[q_w(s, a) ∇_θ log π_θ(a|s)].


Exercise: Prove the Compatible Function Approximation Theorem.

Proof: At the minimum of the squared error, ∇_w ε = 0. Thus

E_{π_θ}[(q_{π_θ}(s, a) − q_w(s, a)) ∇_w q_w(s, a)] = 0,
E_{π_θ}[(q_{π_θ}(s, a) − q_w(s, a)) ∇_θ log π_θ(a|s)] = 0,
E_{π_θ}[q_{π_θ}(s, a) ∇_θ log π_θ(a|s)] = E_{π_θ}[q_w(s, a) ∇_θ log π_θ(a|s)].


Advantage Actor-Critic

- Combine the idea of a baseline with actor-critic by using an advantage function that compares the action-value function q_{π_θ}(s, a) to the state-value function v_{π_θ}(s) = E_{a∼π}[q_{π_θ}(s, a)].
- Use the approximate TD error
  δ_w = r + γ v_w(s') − v_w(s),
  where the state-value is approximated by v_w(s), and the action-value is approximated by the sample q_w(s') = r + γ v_w(s').
- Update Actor: θ ← θ + α ∇_θ log π_θ(a|s) (q_w(s') − v_w(s)).
- Update Critic: w = argmin_w (q_w(s') − v_w(s))^2.
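A minimal tabular sketch of the advantage actor-critic on the same hypothetical chain, with a state-value critic v_w and the approximate TD error used as the advantage estimate; all constants are made up.

import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
gamma, alpha, beta = 0.95, 0.05, 0.1
theta = np.zeros((n_states, n_actions))   # actor parameters
v = np.zeros(n_states)                    # state-value critic v_w(s)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(3000):
    s = 2
    while 0 <= s < n_states:
        p = softmax(theta[s])
        a = rng.choice(n_actions, p=p)
        s_next = s + (1 if a == 1 else -1)
        r = 1.0 if s_next == n_states else 0.0
        v_next = v[s_next] if 0 <= s_next < n_states else 0.0
        delta = r + gamma * v_next - v[s]     # delta_w = r + gamma*v_w(s') - v_w(s), the advantage estimate
        score = -p
        score[a] += 1.0
        theta[s] += alpha * delta * score     # actor: score weighted by the advantage
        v[s] += beta * delta                  # critic: move v_w(s) toward the TD target
        s = s_next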


Summary: Policy-Gradient Methods

- Build upon a huge body of knowledge in stochastic optimization, which provides an excellent theoretical understanding of convergence properties.
- Gradient-based techniques are model-free since the MDP transition matrix does not depend on θ.
- Directly applicable to continuous output spaces and stochastic policies.
- The problem of high variance in actor-only methods can be mitigated by the critic's low-variance estimate of the expected return.


Quick Summary and Outlook

What have we covered:

- Policy evaluation (a.k.a. prediction) using DP.
- Policy optimization (a.k.a. control) using value-based techniques of DP, MC, or both: TD.
- Policy-gradient techniques for direct stochastic optimization of parametric policies.

Where we go from here:

- Sequence-to-Sequence Reinforcement Learning
  - Algorithms for seq2seq RL from simulated feedback
  - Algorithms for offline learning from logged feedback
  - Seq2seq RL from human bandit feedback


Sequence-to-Sequence Reinforcement Learning

Sequence-to-Sequence RL

Sequence-to-sequence (seq2seq) learning:

- x = x_1 ... x_S represents an input sequence, indexed over a source vocabulary V_Src.
- y = y_1 ... y_T represents an output sequence, indexed over a target vocabulary V_Trg.
- The goal of seq2seq learning is to estimate a function for mapping an input sequence x into an output sequence y, defined as a product of conditional token probabilities:
  p_θ(y | x) = \prod_{t=1}^{T} p_θ(y_t | x; y_{<t}).

Reinforcement Learning, Summer 2019 42(86)

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL: Neural Machine Translation

Neural machine translation (NMT):

- x are source sentences, y are human reference translations.

- Maximize the likelihood of parallel data D = {(x^(i), y^(i))}_{i=1}^{n} (a small sketch of this objective follows below):

  L(θ) = ∑_{i=1}^{n} log p_θ(y^(i) | x^(i)).

- p_θ(y_t | x; y_{<t}) is defined by the neural model's softmax-normalized output vector of size R^{|V_Trg|}:

  p_θ(y_t | x; y_{<t}) = softmax(NN_θ(x; y_{<t})).

- Various options for NN_θ, such as recurrent [Sutskever et al., 2014, Bahdanau et al., 2015], convolutional [Gehring et al., 2017] or attentional [Vaswani et al., 2017] encoder-decoder architectures (or a mix [Chen et al., 2018]).
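
To make the factorization concrete, here is a minimal sketch in plain NumPy, with a stand-in scoring function nn_theta in place of a real encoder-decoder, of how the log-likelihood objective decomposes into per-token softmax terms; everything below is illustrative, not the lecture's implementation.

    import numpy as np

    vocab_size = 8          # |V_Trg| (toy size)
    rng = np.random.default_rng(0)

    def nn_theta(x, y_prefix):
        # Stand-in for an encoder-decoder: returns one logit per target token.
        # A real NMT model would condition on x and y_prefix through its network.
        return rng.normal(size=vocab_size)

    def log_prob_sequence(x, y):
        # log p_theta(y | x) = sum_t log softmax(NN_theta(x; y_{<t}))[y_t]
        total = 0.0
        for t, y_t in enumerate(y):
            logits = nn_theta(x, y[:t])
            log_probs = logits - np.log(np.sum(np.exp(logits - logits.max()))) - logits.max()
            total += log_probs[y_t]
        return total

    # MLE objective over a toy parallel corpus D = {(x_i, y_i)}
    D = [([1, 2, 3], [4, 5]), ([2, 2], [6, 1, 0])]
    L = sum(log_prob_sequence(x, y) for x, y in D)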

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT

Why deviate from supervised learning using parallel data?

- What if no human references are available, e.g., for under-resourced language pairs?

- Weak human feedback signals may be easier to obtain than full translations, e.g., from logged user interactions in commercial NMT services.

- [Sutton and Barto, 2018] on the "Future of Artificial Intelligence":

  The full potential of reinforcement learning requires reinforcement learning agents to be embedded into the flow of real-world experience, where they act, explore, and learn in our world, not just in their worlds.

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT

- Learning from weak user feedback in the form of user clicks is state-of-the-art in computational advertising [Bottou et al., 2013, Chapelle et al., 2014].

- Let's dig the gold mine of user feedback to improve NMT!

Sequence-to-Sequence Reinforcement Learning

Collecting Feedback: Facebook / Microsoft / Google

[Slide series of screenshots illustrating how Facebook, Microsoft (including community translation), and Google (including community translation) collect user feedback on machine translations; images not preserved in this transcript.]

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT: Simulations

- NMT in the standard RL framework:
  - In timestep t, a state is defined by the input x and the currently produced tokens y_{<t}.
  - A reward is obtained by evaluating the quality of the fully generated sequence y.
  - An action corresponds to generating the next token y_t.

- Exercise: How would this translate into an MDP's state transitions and an agent's policy?
  - p_θ(y_t | x; y_{<t}) corresponds to a stochastic policy, while the state transition is deterministic given an action (see the sketch below).

- Interactive NMT: The NMT system is the agent that performs actions, while the human user provides rewards.
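
A minimal sketch of this MDP view, assuming a stand-in token distribution policy_step in place of a real NMT decoder: the state is (x, y_{<t}), sampling a token is the stochastic action, and appending it to the prefix is the deterministic state transition.

    import numpy as np

    vocab = ["<eos>", "the", "house", "is", "green"]   # toy target vocabulary
    rng = np.random.default_rng(1)

    def policy_step(x, y_prefix):
        # Stand-in for p_theta(y_t | x; y_{<t}); a real system would run the decoder here.
        logits = rng.normal(size=len(vocab))
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def sample_translation(x, max_len=10):
        y = []                                   # state: (x, y_{<t})
        for _ in range(max_len):
            p = policy_step(x, y)                # stochastic policy over actions (tokens)
            a = rng.choice(len(vocab), p=p)      # action: generate next token
            y.append(vocab[a])                   # deterministic transition: append token
            if vocab[a] == "<eos>":
                break
        return y                                 # reward is computed on the full y

    y = sample_translation(x="das haus ist gruen")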

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT: Simulations

- Expected loss/reward objective:

  L(θ) = E_{p(x) p_θ(y|x)} [∆(y)],

  where ∆(y) is the task loss, e.g., −BLEU(y).

- Sampling an input x and an output y and performing a stochastic gradient descent update corresponds to a policy gradient algorithm (see the derivation below).
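
The claim that this SGD update is a policy gradient step follows from the score-function (log-derivative) identity; a short derivation using only the definitions above:

  ∇_θ L(θ) = ∇_θ E_{p(x) p_θ(y|x)}[∆(y)]
           = E_{p(x)}[ ∑_y ∆(y) ∇_θ p_θ(y|x) ]
           = E_{p(x) p_θ(y|x)}[ ∆(y) ∇_θ log p_θ(y|x) ],

using ∇_θ p_θ = p_θ ∇_θ log p_θ. Sampling x ∼ p(x) and y ∼ p_θ(y|x) and stepping θ ← θ − γ ∆(y) ∇_θ log p_θ(y|x) therefore uses an unbiased estimate of ∇_θ L(θ).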

Sequence-to-Sequence Reinforcement Learning

(Neural) Bandit Structured Prediction

Algorithm 1: (Neural) Bandit Structured Prediction (a Python sketch of this loop is given below)

  1: for k = 0, ..., K do
  2:   Observe input x_k
  3:   Sample output y_k ∼ p_θ(y|x_k)
  4:   Obtain feedback ∆(y_k)
  5:   Update parameters θ_{k+1} = θ_k − γ_k s_k,
       where the stochastic gradient is s_k = ∆(y_k) ∂ log p_θ(y_k|x_k) / ∂θ.

- [Sokolov et al., 2015, Sokolov et al., 2016, Kreutzer et al., 2017]
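
A minimal sketch of Algorithm 1 with a log-linear policy over a small candidate set per input, standing in for a neural seq2seq policy; the feature function phi, the candidate sets, and the simulated feedback delta are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    dim = 4
    theta = np.zeros(dim)

    def phi(x, y):
        # Toy joint feature vector of input x and output y.
        h = abs(hash((x, y)))
        return np.array([(h >> i) % 2 for i in range(dim)], dtype=float)

    def policy(x, candidates):
        scores = np.array([theta @ phi(x, y) for y in candidates])
        p = np.exp(scores - scores.max())
        return p / p.sum()

    def delta(y):
        # Simulated bandit feedback (task loss), e.g. -per-sentence BLEU; here: a toy loss.
        return float(len(y)) / 10.0

    inputs = ["x1", "x2", "x3"]
    candidates = {x: [x + "_a", x + "_bb", x + "_ccc"] for x in inputs}
    gamma = 0.1

    for k in range(100):
        x = inputs[k % len(inputs)]                       # observe input x_k
        cands = candidates[x]
        p = policy(x, cands)
        i = rng.choice(len(cands), p=p)                   # sample y_k ~ p_theta(y|x_k)
        fb = delta(cands[i])                              # obtain feedback Delta(y_k)
        feats = np.array([phi(x, y) for y in cands])
        grad_log_p = feats[i] - p @ feats                 # d log p_theta(y_k|x_k) / d theta
        theta -= gamma * fb * grad_log_p                  # theta_{k+1} = theta_k - gamma_k * s_k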

Sequence-to-Sequence Reinforcement Learning

(Neural) Bandit Structured Prediction

- Why (Neural) Bandit Structured Prediction?
  - An action is defined as generating a full output sequence, thus corresponding to a one-state MDP.
  - The term bandit feedback is inherited from the problem of maximizing the reward for a sequence of pulls of arms of so-called "one-armed bandit" slot machines [Bubeck and Cesa-Bianchi, 2012]:
    - In contrast to fully supervised learning, the learner receives feedback for a single prediction. It does not know what the correct output looks like, nor what would have happened if it had predicted differently.
  - Related to gradient bandit algorithms [Sutton and Barto, 2018] and contextual bandits [Li et al., 2010].

Sequence-to-Sequence Reinforcement Learning

(Neural) Bandit Structured Prediction

- Important measure for variance reduction: control variates.
  - The random variable X is the stochastic gradient s_k in the case of Algorithm 1.
  - Two choices of control variate Y_k in [Kreutzer et al., 2017] (a sketch of the baseline variant follows below):

    1. Baseline [Williams, 1992]:

       Y_k = ∇ log p_θ(y|x_k) · (1/k) ∑_{j=1}^{k} ∆(y_j).

    2. Score function [Ranganath et al., 2014]:

       Y_k = ∇ log p_θ(y|x_k).
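
A minimal sketch of the baseline control variate as it is commonly implemented: the running average of observed feedback is subtracted from ∆(y_k) before the gradient step. The function name and the dummy inputs are illustrative assumptions, and this is a simplification of the full control-variate analysis in [Kreutzer et al., 2017].

    import numpy as np

    def baseline_corrected_updates(feedbacks, grad_log_ps, theta, gamma=0.1):
        # feedbacks: list of Delta(y_k); grad_log_ps: list of gradients d log p(y_k|x_k)/d theta.
        running_avg = 0.0
        for k, (fb, g) in enumerate(zip(feedbacks, grad_log_ps), start=1):
            running_avg += (fb - running_avg) / k        # (1/k) * sum_{j<=k} Delta(y_j)
            theta -= gamma * (fb - running_avg) * g      # baseline-corrected gradient step
        return theta

    theta = baseline_corrected_updates(
        feedbacks=[0.3, 0.5, 0.1],
        grad_log_ps=[np.array([0.1, -0.2]), np.array([0.0, 0.3]), np.array([-0.1, 0.1])],
        theta=np.zeros(2),
    )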

Sequence-to-Sequence Reinforcement Learning

Advantage Actor-Critic for Bandit NMT

- Neural encoder-decoder A2C [Nguyen et al., 2017]:
  - Gradient approximation

    ∇L(θ) ≈ ∑_{t=1}^{T} R_t(y) ∇_θ log p_θ(y_t | x; y_{<t}).

  - Uses the per-action advantage function

    R_t(y) := ∆(y) − V(y_{<t}).

  - The state-value function V(y_{<t}) centers the reward and uses a separate neural encoder-decoder network that is trained to minimize the squared error [V_w(y_{<t}) − ∆(y)]².

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT: Simulation Results

- EuroParl→NewsComm NMT conservative domain adaptation.

- ∆(y) simulated by per-sentence BLEU against the reference.

[Bar chart "Domain Adaptation with Weak Feedback: EP to NC": difference in BLEU per model (full-info, EL, PR, SF-EL, BL-EL) on the EP and NC test sets; plotted values are +6.11, +2.89, +1.79, +3.00, +4.08 (gains) and −4.67, −1.72, −0.90, −1.35, −1.93 (losses).]

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT: Simulation Results

- EuroParl→TED NMT conservative domain adaptation task.

[Bar chart "Domain Adaptation with Weak Feedback: EP to TED": difference in BLEU per model (full-info, EL, PR, SF-EL, BL-EL) on the EP and TED test sets; plotted values are +11.88, +4.18, +0.24, +4.16, +5.89 (gains) and −8.42, −1.78, −0.02, −1.49, −1.97 (losses).]

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT: To Simulate or Not

- Domain adaptation experiments show impressive gains for learning from simulated bandit feedback only.

- Most work on seq2seq RL for NMT is confined to simulations, aiming to improve "exposure bias" and "loss-evaluation mismatch" [Ranzato et al., 2016].

- Recall [Sutton and Barto, 2018] on the "Future of Artificial Intelligence":

  A major reason for wanting a reinforcement learning agent to act and learn in the real world is that it is often difficult, sometimes impossible, to simulate real-world experience with enough fidelity to make the resulting policies [...] work well—and safely—when directing real actions.

Sequence-to-Sequence Reinforcement Learning

Seq2seq RL for NMT: To Simulate or Not

- Where do simulations fall short?
  - Real-world RL only has access to human bandit feedback for a single prediction, with no summation over all actions that would amount to full supervision [Shen et al., 2016, Bahdanau et al., 2017].
  - Online/on-policy learning might be undesirable given concerns about the safety and stability of commercial systems.
  - A reward function for human translation quality is not well defined; reward signals are noisy and skewed.
  - (Super)human performance (similar to playing Atari or Go) of real-world RL is not to be expected soon!

Sequence-to-Sequence Reinforcement Learning

Offline Learning from Logged Feedback

Standard: Online/On-Policy RL

- Undesirable if the stability of a real-world system has priority over frequent updates after each interaction.

Offline/Off-Policy RL from Logged Bandit Feedback

- Attempts to learn from logged feedback that has been given to the predictions of a historic system following a different policy.

- Allows control over system updates.

- Prior work in counterfactual bandit learning [Dudik et al., 2011, Bottou et al., 2013] and off-policy RL [Precup et al., 2000, Jiang and Li, 2016].

Sequence-to-Sequence Reinforcement Learning

Offline Learning = Counterfactual Learning

- Counterfactual question: Estimate how the new system would have performed if it had been in control of choosing the logged predictions.

Sequence-to-Sequence Reinforcement Learning

Offline Learning from Logged Feedback

- Logged data D = {(x^(h), y^(h), r(y^(h)))}_{h=1}^{H}, where y^(h) is sampled from a logging system µ(y^(h)|x^(h)), and the reward/loss r(y^(h)) ∈ [0, 1] is obtained from a human user.

- Inverse propensity scoring (IPS) to learn a target policy p_θ(y|x) (a small numerical sketch follows below):

  L(θ) = (1/H) ∑_{h=1}^{H} r(y^(h)) ρ_θ(y^(h)|x^(h)).

- IPS uses importance sampling to correct for the sampling bias of the logging system, with ρ_θ(y^(h)|x^(h)) = p_θ(y^(h)|x^(h)) / µ(y^(h)|x^(h)).

- Exercise: Show unbiasedness of the IPS estimator. Taking the expectation over the log,

  E[ (1/H) ∑_{h=1}^{H} r(y^(h)) p_θ(y^(h)|x^(h)) / µ(y^(h)|x^(h)) ] = E_{p(x)} E_{µ(y|x)}[ r(y) p_θ(y|x) / µ(y|x) ]
                                                                    = E_{p(x)} E_{p_θ(y|x)}[ r(y) ].
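
A minimal numerical sketch of the IPS estimate from a logged dataset, with hypothetical arrays standing in for the logged rewards and the two policies' probabilities of the logged outputs:

    import numpy as np

    # Logged bandit data (hypothetical values): rewards r(y_h) in [0,1],
    # logging propensities mu(y_h|x_h), and target probabilities p_theta(y_h|x_h).
    r = np.array([0.9, 0.2, 0.6, 1.0])
    mu = np.array([0.5, 0.4, 0.25, 0.8])
    p_theta = np.array([0.6, 0.1, 0.3, 0.9])

    rho = p_theta / mu                 # importance weights correcting the sampling bias
    ips_estimate = np.mean(r * rho)    # L(theta) = (1/H) * sum_h r(y_h) * rho_theta(y_h|x_h)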

Sequence-to-Sequence Reinforcement Learning

Offline Learning under Deterministic Logging: Problems

- Commercial NMT systems try to avoid risk by showing only the most probable translation to users = exploration-free, deterministic logging.

- Problems with deterministic logging [Lawrence et al., 2017a]:
  - No correction of sampling bias as in IPS, since µ(y|x) = 1.
  - Degenerate behavior: the empirical reward over the log is maximized by setting the probability of all logged data to 1 → undesirable to increase the probability of low-reward examples.

- Unbiased learning is thought to be impossible for exploration-free off-policy learning [Langford et al., 2008, Strehl et al., 2010].

Sequence-to-Sequence Reinforcement Learning

Offline Learning under Deterministic Logging: Solutions

- Implicit exploration via inputs [Bastani et al., 2017].

- Deterministic Propensity Matching (DPM) [Lawrence et al., 2017b, Lawrence and Riezler, 2018] (a small sketch follows below):

  L(θ) = (1/H) ∑_{h=1}^{H} r(y^(h)) p̄_θ(y^(h)|x^(h)),

  with reweighting by a multiplicative control variate, evaluated one-step-late at θ′ from some previous iteration:

  p̄_{θ,θ′}(y^(h)|x^(h)) = p_θ(y^(h)|x^(h)) / ∑_{b=1}^{B} p_θ′(y^(b)|x^(b)).

- Effect of self-normalization: introduces a bias that decreases as B increases [Kong, 1992], but prevents increasing the probability of low-reward data by taking away probability mass from higher-reward outputs.
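
A minimal sketch of the reweighted objective on one minibatch of logged examples, with hypothetical probability arrays: p_theta are the current model's probabilities of the logged outputs, and p_theta_old the probabilities under the previous iterate θ′ used for one-step-late normalization.

    import numpy as np

    # Minibatch of B logged examples (hypothetical values).
    r = np.array([0.8, 0.1, 0.5])                 # logged rewards r(y_h)
    p_theta = np.array([0.30, 0.05, 0.20])        # p_theta(y_h|x_h) under current parameters
    p_theta_old = np.array([0.25, 0.10, 0.15])    # p_theta'(y_h|x_h) under previous iterate

    # One-step-late multiplicative control variate: normalize by the previous
    # iterate's total probability mass on the minibatch.
    p_bar = p_theta / p_theta_old.sum()
    dpm_objective = np.mean(r * p_bar)            # (1/B) * sum_h r(y_h) * p_bar_theta(y_h|x_h)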

Sequence-to-Sequence Reinforcement Learning

Offline Learning under Deterministic Logging: Gradients

- Optimization by stochastic gradient descent.
  - IPS:

    ∇L(θ) = (1/H) ∑_{h=1}^{H} r(y^(h)) ρ_θ(y^(h)|x^(h)) ∇ log p_θ(y^(h)|x^(h)).

  - OSL self-normalized deterministic propensity matching:

    ∇L(θ) = (1/H) ∑_{h=1}^{H} r(y^(h)) p̄_θ(y^(h)|x^(h)) ∇ log p_θ(y^(h)|x^(h)).

Sequence-to-Sequence Reinforcement Learning

Offline Learning from Human Feedback: e-commerce

- [Kreutzer et al., 2018]: 69k translated item titles (en-es) with 148k individual ratings.

- No agreement of paid raters with e-commerce users, low inter-rater agreement, learning impossible.

Sequence-to-Sequence Reinforcement Learning

Offline Learning from Human Feedback: e-commerce

- Lessons from the e-commerce experiments:
  - Offline learning from direct user feedback on e-commerce titles is equivalent to learning from noise.
  - Conjecture: missing reliability and validity of human feedback in the e-commerce experiment.
  - Experiment: controlled feedback collection!

Sequence-to-Sequence Reinforcement Learning

Offline Learning from Controlled Human Feedback

[Screenshots of the two rating interfaces (five-point ratings vs. pairwise preferences); images not preserved in this transcript.]

- Comparison of judgments on a five-point Likert scale to pairwise preferences.

- Feedback collected from ∼15 bilinguals for 800 translations (de-en).¹

¹Data: https://www.cl.uni-heidelberg.de/statnlpgroup/humanmt/

Sequence-to-Sequence Reinforcement Learning

Reliability and Learnability of Human Feedback

- Controlled study on the main factors in human RL:

  1. Reliability: Collect five-point and pairwise feedback on the same data, evaluate intra- and inter-rater agreement.
  2. Learnability: Train reward estimators on human feedback, evaluate correlation to TER on held-out data.
  3. RL: Use rewards directly, or estimated rewards, to improve an NMT system.

What are your guesses on reliability and learnability: five-point or pairwise?

Sequence-to-Sequence Reinforcement Learning

Reliability: α-agreement

  Rating Type        Inter-rater α   Intra-rater mean α   Intra-rater stdev α
  5-point            0.2308          0.4014               0.1907
  + normalization    0.2820          -                    -
  + filtering        0.5059          0.5527               0.0470
  Pairwise           0.2385          0.5085               0.2096
  + filtering        0.3912          0.7264               0.0533

- Inter- and intra-rater reliability measured by Krippendorff's α for 5-point and pairwise ratings of 1,000 translations, of which 200 translations are repeated twice.

- Filtered variants are restricted to either a subset of participants or a subset of translations.

Sequence-to-Sequence Reinforcement Learning

Reliability: Qualitative Analysis

  Rating Type   Avg. subjective difficulty [1-10]
  5-point       4.8
  Pairwise      5.69

- Difficulties with 5-point ratings:
  - Weighing of error types; long sentences with few essential errors.

- Difficulties with pairwise ratings (incl. ties):
  - Distinguishing between similar translations.
  - Ties: no absolute anchoring of the quality of the pair.
  - Final score: no normalization for individual biases possible.

Sequence-to-Sequence Reinforcement Learning

Learnability

5-point feedback: standard MSE on scaled ratings

  L(ψ) = (1/n) ∑_{i=1}^{n} (r(y_i) − r_ψ(y_i))².

Pairwise feedback: predict human preferences Q[· ≻ ·] [Christiano et al., 2017] (a small sketch of this loss follows below):

  L(ψ) = −(1/n) ∑_{i=1}^{n} ( Q[y_i^1 ≻ y_i^2] log P_ψ[y_i^1 ≻ y_i^2] + Q[y_i^2 ≻ y_i^1] log P_ψ[y_i^2 ≻ y_i^1] ),

with the Bradley-Terry model for preferences

  P_ψ[y^1 ≻ y^2] = exp r_ψ(y^1) / ( exp r_ψ(y^1) + exp r_ψ(y^2) ).
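
A minimal sketch of both losses, assuming the reward estimator's predictions r_ψ are already available as arrays (the estimator architecture is shown on the following slide); the numbers are hypothetical.

    import numpy as np

    # 5-point feedback: MSE between (scaled) human ratings and predicted rewards.
    r_human = np.array([0.8, 0.2, 0.6])
    r_pred = np.array([0.7, 0.3, 0.5])
    mse_loss = np.mean((r_human - r_pred) ** 2)

    # Pairwise feedback: Bradley-Terry preference probabilities from predicted rewards.
    r1 = np.array([0.7, 0.1])          # r_psi(y^1) for each pair
    r2 = np.array([0.4, 0.5])          # r_psi(y^2) for each pair
    q1 = np.array([1.0, 0.0])          # Q[y^1 > y^2]: human preference (1 = prefers y^1)
    q2 = 1.0 - q1                      # Q[y^2 > y^1]
    p1 = np.exp(r1) / (np.exp(r1) + np.exp(r2))   # P_psi[y^1 > y^2]
    p2 = 1.0 - p1
    pw_loss = -np.mean(q1 * np.log(p1) + q2 * np.log(p2))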

Sequence-to-Sequence Reinforcement Learning

Reward Estimator Architecture

[Architecture diagram: source tokens and target tokens (with padding) are each encoded by a BiLSTM, followed by a 1D convolution, max-over-time pooling, and a fully connected output layer.]

- A biLSTM-enhanced bilingual extension of the convolutional model for sentence classification [Kim, 2014] (a rough code sketch follows below).
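
For readers who want to see this in code, a rough PyTorch sketch of such a biLSTM + convolution reward estimator is given below; all layer sizes are illustrative, the convolution is shared across source and target for brevity, and no claim is made that this matches the exact architecture used in the experiments.

    import torch
    import torch.nn as nn

    class RewardEstimator(nn.Module):
        # Sketch: BiLSTM encoders per language, 1D convolution, max-over-time pooling,
        # fully connected output layer predicting a scalar reward.
        def __init__(self, src_vocab, trg_vocab, emb=64, hidden=64, channels=32, kernel=3):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.trg_emb = nn.Embedding(trg_vocab, emb)
            self.src_lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
            self.trg_lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
            self.conv = nn.Conv1d(in_channels=2 * hidden, out_channels=channels, kernel_size=kernel)
            self.out = nn.Linear(2 * channels, 1)

        def forward(self, src, trg):
            # src, trg: (batch, seq_len) padded token-id tensors
            s, _ = self.src_lstm(self.src_emb(src))      # (batch, S, 2*hidden)
            t, _ = self.trg_lstm(self.trg_emb(trg))      # (batch, T, 2*hidden)
            # 1D convolution over time, then max-over-time pooling, per language
            s = torch.relu(self.conv(s.transpose(1, 2))).max(dim=2).values
            t = torch.relu(self.conv(t.transpose(1, 2))).max(dim=2).values
            return self.out(torch.cat([s, t], dim=1)).squeeze(1)  # predicted reward r_psi

    model = RewardEstimator(src_vocab=100, trg_vocab=100)
    src = torch.randint(0, 100, (2, 7))   # batch of 2 source sentences, length 7
    trg = torch.randint(0, 100, (2, 6))   # batch of 2 target sentences, length 6
    rewards = model(src, trg)             # shape: (2,)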

Sequence-to-Sequence Reinforcement Learning

Learnability: Results

  Model   Feedback          Spearman's ρ with −TER
  MSE     5-point norm.     0.2193
          + filtering       0.2341
  PW      Pairwise          0.1310
          + filtering       0.1255

- Comparatively better results for reward estimation from cardinal human judgements.

- Overall relatively low correlation, presumably due to overfitting on the small training data set.

Sequence-to-Sequence Reinforcement Learning

End-to-end Seq2seq RL

1. Tackle the arguably simpler problem of learning a reward estimator from human feedback first.

2. Then provide unlimited learned feedback to generalize to unseen outputs in off-policy RL.

Sequence-to-Sequence Reinforcement Learning

End-to-End RL from Estimated Rewards

Expected Risk Minimization from Estimated Rewards

Estimated rewards allow the use of minimum risk training [Shen et al., 2016] such that feedback can be collected for k samples (a small sketch follows below):

  L(θ) = E_{p(x) p_θ(y|x)}[r_ψ(y)]
       ≈ ∑_{s=1}^{S} ∑_{i=1}^{k} p_θ^τ(y_i^(s) | x^(s)) r_ψ(y_i^(s)).

- Softmax temperature τ controls the amount of exploration by sharpening the sampling distribution p_θ^τ(y|x) = softmax(o/τ) at lower temperatures.

- Subtract the running average of rewards from r_ψ to reduce gradient variance and estimation bias.
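
A minimal sketch of this approximation for one input, assuming the sequence scores o (unnormalized log-probabilities of k sampled outputs) and their estimated rewards are given; the batch mean stands in for the running-average baseline, and the temperature-renormalization over the k samples is the part being illustrated.

    import numpy as np

    tau = 0.8                                   # softmax temperature (illustrative)
    o = np.array([-1.2, -0.7, -2.0])            # scores of k sampled outputs for one input x
    r_est = np.array([0.6, 0.9, 0.3])           # estimated rewards r_psi(y_i)

    z = o / tau
    p_tau = np.exp(z - z.max())
    p_tau /= p_tau.sum()                        # p_theta^tau(y_i | x) over the k samples

    baseline = r_est.mean()                     # stand-in for the running average of rewards
    risk = np.sum(p_tau * (r_est - baseline))   # contribution of x to the expected-risk objective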

Sequence-to-Sequence Reinforcement Learning

Results on TED Talk Translations

- Significant improvements over the baseline (27.0 BLEU / 30.7 METEOR / 59.48 BEER):
  - Gains of 1.1 BLEU for expected risk (ER) minimization from estimated rewards.
  - Deterministic propensity matching (DPM) on directly logged human feedback yields gains of up to 0.5 BLEU points.

Summary

Basic RL:

- Policy evaluation using Dynamic Programming.

- Policy optimization using Dynamic Programming, Monte Carlo, or both: Temporal Difference learning.

- Policy-gradient techniques for direct policy optimization.

Seq2seq RL:

- Seq2seq RL simulations: bandit neural machine translation.

- Offline learning from deterministically logged feedback: Deterministic Propensity Matching.

- Seq2seq RL from human feedback: collecting reliable feedback, learning reward estimators, end-to-end RL from estimated rewards.

Thank you!

Questions?

P.S.: I'm currently hiring PhD/PostDoc candidates for projects on seq2seq RL for machine translation and conversational question answering.
Email: [email protected]
Research: https://www.cl.uni-heidelberg.de/statnlpgroup/

References

- Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. (2017). An actor-critic algorithm for sequence prediction. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France.

- Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA.

- Bastani, H., Bayati, M., and Khosravi, K. (2017). Exploiting the natural exploration in contextual bandits. ArXiv e-prints, 1704.09011.

- Bottou, L., Peters, J., Quinonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. (2013). Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260.

- Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.

- Chapelle, O., Masnavoglu, E., and Rosales, R. (2014). Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology, 5(4).

- Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. (2018). The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (NIPS), Long Beach, CA.

- Dudik, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA.

- Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. (2017). Convolutional sequence to sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.

- Jiang, N. and Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY.

- Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.

- Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada.

- Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago, Illinois.

- Kreutzer, J., Khadivi, S., Matusov, E., and Riezler, S. (2018). Can neural machine translation be improved with user feedback? In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Industry Track (NAACL-HLT), New Orleans, LA.

- Kreutzer, J., Sokolov, A., and Riezler, S. (2017). Bandit structured prediction for neural sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.

- Langford, J., Strehl, A., and Wortman, J. (2008). Exploration scavenging. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland.

- Lawrence, C., Gajane, P., and Riezler, S. (2017a). Counterfactual learning for machine translation: Degeneracies and solutions. In Proceedings of the NIPS WhatIF Workshop, Long Beach, CA.

- Lawrence, C. and Riezler, S. (2018). Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.

- Lawrence, C., Sokolov, A., and Riezler, S. (2017b). Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark.

- Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW).

- Nguyen, K., Daume, H., and Boyd-Graber, J. (2017). Reinforcement learning for bandit neural machine translation with simulated feedback. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark.

- Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), San Francisco, CA.

- Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), Reykjavik, Iceland.

- Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2016). Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico.

- Ross, S. M. (2013). Simulation. Elsevier, fifth edition.

- Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2016). Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany.

- Sokolov, A., Kreutzer, J., Lo, C., and Riezler, S. (2016). Stochastic structured prediction under bandit feedback. In Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain.

- Sokolov, A., Riezler, S., and Urvoy, T. (2015). Bandit structured prediction for learning from user feedback in statistical machine translation. In Proceedings of MT Summit XV, Miami, FL.

- Strehl, A. L., Langford, J., Li, L., and Kakade, S. M. (2010). Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada.

- Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), Montreal, Canada.

- Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning. An Introduction. The MIT Press.

- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning. An Introduction. The MIT Press, second edition.

- Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada.

- Szepesvari, C. (2009). Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool.

- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), Long Beach, CA.

- Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.

- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.