
Learning Reward Machines for Partially Observable Reinforcement Learning

Rodrigo Toro Icarte, Ethan Waldie, Toryn Q. Klassen, Richard Valenzano, Margarita P. Castro, Sheila A. McIlraith

KR 2020, September 16

Hi, I’m Rodrigo :)

“The ultimate goal of AI is to create computer programs that can solve problems in the world as well as humans.”

— John McCarthy

Our research incorporates insights from knowledge, reasoning, and learning, in service of building general-purpose agents.


Reinforcement Learning (RL)

[Diagram: an RL agent (its policy) interacts with an environment (its transition probabilities and reward function); the agent sends actions and receives observations and rewards.]

This learning process captures some aspects of human intelligence.


How to enhance RL with KR


Long-standing RL problems that we tackled using KR:

Reward specification.

Sample efficiency.

Memory.

...

Reward specification


Make a bridge: get wood, iron, and use the factory

LTL specifications1: ◊(got wood ∧ ◊used factory) ∧ ◊(got iron ∧ ◊used factory)

Reward machines2: Automata-based reward functions

1 Teaching Multiple Tasks to an RL Agent using LTL (AAMAS-18).

2 Using Reward Machines for High-Level Task Specification and Decomposition in RL (ICML-18).


Reward machine

[Diagram: a reward machine with states u0 (initial), u1, u2, and u3, whose edges are labeled ⟨condition, reward⟩. u0 loops on ⟨¬w ∧ ¬i, 0⟩, moves to u1 on ⟨w, 0⟩, and moves to u2 on ⟨¬w ∧ i, 0⟩; u1 loops on ⟨¬i, 0⟩ and moves to u3 on ⟨i, 0⟩; u2 loops on ⟨¬w, 0⟩ and moves to u3 on ⟨w, 0⟩; u3 loops on ⟨¬f, 0⟩ and emits reward 1 on ⟨f, 1⟩ when the factory is used. Here w, i, and f abbreviate got wood, got iron, and used factory.]

Make a bridge: get wood, iron, and use the factory
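To make this concrete, here is a minimal sketch of the reward machine above as a Python data structure (illustrative only, not the authors' implementation; the class name, the single-event-per-step encoding, and the 'done' terminal state are assumptions):

```python
# Sketch of the bridge-making reward machine (illustrative, not the authors'
# code). Events: 'w' = got wood, 'i' = got iron, 'f' = used factory.
# For simplicity we assume at most one event fires per environment step,
# so each edge is keyed by a single event symbol.

class RewardMachine:
    def __init__(self, initial_state, transitions, terminal_states):
        self.u0 = initial_state
        self.delta = transitions          # (state, event) -> (next state, reward)
        self.terminal = terminal_states

    def step(self, u, event):
        """Advance the machine on one detected event (None = no event)."""
        return self.delta.get((u, event), (u, 0.0))  # default: self-loop, reward 0


BRIDGE_RM = RewardMachine(
    initial_state='u0',
    transitions={
        ('u0', 'w'): ('u1', 0.0),    # got wood first
        ('u0', 'i'): ('u2', 0.0),    # got iron first
        ('u1', 'i'): ('u3', 0.0),    # now have both wood and iron
        ('u2', 'w'): ('u3', 0.0),
        ('u3', 'f'): ('done', 1.0),  # used the factory: task complete
    },
    terminal_states={'done'},
)

# Example: the event sequence wood -> (nothing) -> iron -> factory earns reward 1.
u, total = BRIDGE_RM.u0, 0.0
for event in ['w', None, 'i', 'f']:
    u, r = BRIDGE_RM.step(u, event)
    total += r
assert u == 'done' and total == 1.0
```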

Reward specification


Formal languages3: Many formal languages → Reward machines.


3 LTL and Beyond: Formal Languages for Reward Function Specification in RL (IJCAI-19).

Sample efficiency


How to exploit the reward machine’s structure:

CRM: Counterfactual reasoning (see the sketch after this list).

HRM: Task decomposition.

RS: Reward shaping.
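A rough sketch of the CRM idea, reusing the hypothetical RewardMachine class from the sketch above: every environment transition is relabeled under every machine state, so one step of experience produces one training sample per RM state. The rm_states helper and the experience-tuple layout are assumptions, not the paper's API.

```python
# Sketch of CRM-style counterfactual relabeling (illustrative; builds on the
# hypothetical RewardMachine above). One environment transition produces one
# training experience per reward-machine state.

def rm_states(rm):
    # Non-terminal states appearing in the transition table (assumed helper).
    sources = {u for (u, _event) in rm.delta}
    targets = {v for (v, _reward) in rm.delta.values()}
    return (sources | targets) - rm.terminal

def counterfactual_experiences(rm, s, a, s_next, event, env_done):
    """From one transition (s, a, s_next) with detected event `event`, build
    one (state, action, reward, next state, done) tuple per RM state, where
    the agent's state is the pair (observation, RM state)."""
    experiences = []
    for u in rm_states(rm):
        u_next, r = rm.step(u, event)
        done = env_done or (u_next in rm.terminal)
        experiences.append(((s, u), a, r, (s_next, u_next), done))
    return experiences
```

Each tuple can then feed an off-policy learner (e.g., Q-learning) over (observation, RM state) pairs, so progress is learned for every RM state at once instead of only the state the agent actually visited.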

[Plot: Craft World. Average reward per step vs. training steps (in thousands, up to 1,500). Legend: QL, QL+RS, HRM, CRM, CRM+RS.]

[Plot: Half-Cheetah. Average reward per step vs. training steps (in thousands, up to 2,500). Legend: DDPG, DDPG+RS, HRM, CRM, CRM+RS.]

Memory

[Figure: the cookie domain, showing the agent, a button, and a cookie. Pressing the button makes a cookie appear; eating the cookie yields +1 reward.]

The most popular approach: training LSTM policies using a policy gradient method.

... starves in the cookie domain.

[Plot: reward vs. training steps (up to 5·10^6) in the cookie domain. Legend: Optimal, ACER, A3C, PPO, DDQN.]

Reward Machines as memory

If the agent can detect the color of the rooms, and when it presses the button, eats a cookie, and sees a cookie, then:

[Diagram: a reward machine with states B0 (initial), B1, B2, and B3. Each state has an ⟨o/w, 0⟩ ("otherwise") self-loop, the remaining edges are labeled by the detected events above (the original icon labels are not reproduced here), and the edges for eating a cookie emit reward 1.]

... becomes a “perfect” memory for the cookie domain.

Learning Reward Machines for Partially Observable Reinforcement Learning (NeurIPS-19).
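Read this way, the RM state is a compact memory: a memoryless policy over (observation, RM state) pairs can behave well despite partial observability. Below is a minimal sketch of that agent loop, assuming hypothetical env, policy, and detect_events interfaces plus the RewardMachine sketch from earlier; in the paper the machine is learned from traces rather than given.

```python
# Sketch: using a reward machine as external memory (illustrative; `env`,
# `policy`, and `detect_events` are assumed interfaces, not the authors' API).

def run_episode(env, rm, policy, detect_events):
    obs = env.reset()
    u = rm.u0                          # RM state plays the role of memory
    total_reward = 0.0
    while True:
        a = policy(obs, u)             # memoryless policy over (obs, RM state)
        obs_next, env_reward, done, info = env.step(a)
        event = detect_events(obs, a, obs_next)  # high-level event detectors
        u, r = rm.step(u, event)       # the machine advances and emits reward
        total_reward += r
        obs = obs_next
        if done or u in rm.terminal:
            return total_reward
```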


Memory

[Plot: Cookie domain. Reward vs. training steps (up to 3·10^6).]

[Plot: Two keys domain. Reward vs. training steps (up to 4·10^6).]

Legend: Optimal, LRM-V2, LRM-V1, ACER, A3C, PPO, DDQN.

∗Note: The detectors were also given to the baselines.

Summary

If you are interested in KR ∩ RL, consider reading our papers:

Advice-Based Exploration in Model-Based Reinforcement Learning (Canadian AI-18)
Teaching Multiple Tasks to an RL Agent using LTL (AAMAS-18)
Using Reward Machines for High-Level Task Specification and Decomposition in RL (ICML-18)
LTL and Beyond: Formal Languages for Reward Function Specification in RL (IJCAI-19)
Learning Reward Machines for Partially Observable RL (NeurIPS-19)
Symbolic Plans as High-Level Instructions for Reinforcement Learning (ICAPS-20)

Code: https://bitbucket.org/RToroIcarte/

Thanks! :)
