Takeshi Shibuya, University of Tsukuba, [email protected]


A fundamental study on the representation of reward for reinforcement learning in dynamic environments, plus an introduction to rescue simulation

Outline
・ Reinforcement learning: an interactive learning framework in soft computing, and a method for learning in dynamic environments
・ RoboCup Rescue (overview): an application of soft computing

[Diagram: Reinforcement learning (theoretical side); learning in dynamic environments; Rescue simulation (application side)]

REINFORCEMENT LEARNING

Contents:
・ Reinforcement learning in psychology
・ Learning in dynamic environments

Reinforcement Learning in psychology

When he finishes pressing the numbers in order, he gets a peanut as a reward.

Kyoto University

Notable things in reinforcement learning

The learner acquires suitable behavior from the reward alone.

The trainer does not have to tell the learner how to behave step by step.

What is reinforcement learning (RL)?

The agent increases the values of actions that bring rewards.

The agent selects the action whose value is highest.

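[Figure: the agent-environment loop. The environment gives the agent a state and a reward, and the agent returns an action. A bar chart shows the agent's value for each action (Actions 1 and 2).]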
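The two rules above (increase the value of rewarded actions, then pick the highest-valued action) can be sketched in a few lines. This is a minimal illustration, not the method from these slides: it assumes a hypothetical two-action environment in which only action 1 is rewarded, an incremental update with an illustrative step size `alpha`, and greedy selection with random tie-breaking.

```python
import random


def select_action(values: list[float]) -> int:
    """Greedy selection: return the index of a highest-valued action."""
    best = max(values)
    # Break ties at random so no action is favored by its position in the list.
    candidates = [a for a, v in enumerate(values) if v == best]
    return random.choice(candidates)


def update_value(values: list[float], action: int, reward: float, alpha: float = 0.1) -> None:
    """Move the chosen action's value a step of size alpha toward the reward."""
    values[action] += alpha * (reward - values[action])


def reward_of(action: int) -> float:
    """Hypothetical environment: only action 1 is rewarded."""
    return 1.0 if action == 1 else 0.0


values = [0.0, 0.0]
random.seed(0)
for _ in range(50):
    a = select_action(values)
    update_value(values, a, reward_of(a))
print(values)
```

Pure greedy selection is enough here only because the toy reward never changes; in practice some exploration (for example ε-greedy or softmax selection) is usually added.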

Research theme: learning in dynamic environments
How can the agent learn suitable behavior when the suitable action changes over time?

[Figure: a timeline showing Action 1 and Action 2, with a question mark over which action yields a great reward after the change.]

Dividing the reward into two parts:
・ Time-dependent part: to be designed.
・ Time-independent part: to be learnt.
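The slides do not give the formulation of this split, so the following is only one possible reading, sketched under stated assumptions: the observed reward for action a at time t is modeled as r_t(a) = c(a) + d(a, t), where c(a) is an unknown time-independent constant that the agent learns and d(a, t) is a time-dependent function supplied by the designer. The class name, the decay used as d(a, t), and the constants in `TRUE_C` are all hypothetical.

```python
from typing import Callable


class DecomposedRewardLearner:
    """Hypothetical sketch: reward = learned constant c(a) + designed term d(a, t).

    Only the time-independent part c(a) is estimated from experience;
    the time-dependent part d(a, t) is a known function given by the designer.
    """

    def __init__(self, n_actions: int, designed_part: Callable[[int, int], float], alpha: float = 0.1):
        self.c_hat = [0.0] * n_actions  # estimates of the time-independent part
        self.d = designed_part           # designed, time-dependent part d(a, t)
        self.alpha = alpha

    def value(self, action: int, t: int) -> float:
        # Value used for selection: learned constant plus the known designed term.
        return self.c_hat[action] + self.d(action, t)

    def select(self, t: int) -> int:
        # Greedy choice over the combined value at time t.
        return max(range(len(self.c_hat)), key=lambda a: self.value(a, t))

    def update(self, action: int, t: int, reward: float) -> None:
        # Subtract the known time-dependent part, then learn the constant residual.
        residual = reward - self.d(action, t)
        self.c_hat[action] += self.alpha * (residual - self.c_hat[action])


TRUE_C = [0.5, 0.2]  # hypothetical unknown constants, hidden from the learner


def designed(action: int, t: int) -> float:
    # Hypothetical designed part: action 0's reward is known to decay over time.
    return -0.01 * t if action == 0 else 0.0


learner = DecomposedRewardLearner(2, designed, alpha=0.5)
for t in range(20):
    for a in (0, 1):  # try both actions so both constants get estimated
        learner.update(a, t, TRUE_C[a] + designed(a, t))
print(learner.select(0), learner.select(60))
```

Because the known part d(a, t) is added back at decision time, the greedy choice switches from action 0 to action 1 as t grows, without re-learning c(a).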

Research theme: learning in dynamic environments

The probability of selecting the EAST action increases.

The proposed method enables the agent to adapt to the change in the environment.

The action-selection probabilities switch after the environment changes.
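The result slide refers to action-selection probabilities without saying how they are computed. A common choice, assumed here, is Boltzmann (softmax) selection, P(a) = exp(Q(a)/τ) / Σ_b exp(Q(b)/τ). The sketch shows how the probability of EAST moves once its value overtakes that of another action; the WEST action, the values, and the temperature are hypothetical, and this is not the proposed method itself.

```python
import math


def boltzmann_probabilities(values: dict[str, float], temperature: float = 0.1) -> dict[str, float]:
    """Softmax (Boltzmann) action-selection probabilities from action values."""
    # Subtract the maximum value before exponentiating for numerical stability.
    m = max(values.values())
    weights = {a: math.exp((v - m) / temperature) for a, v in values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


before = {"EAST": 0.2, "WEST": 0.8}  # hypothetical values before the change
after = {"EAST": 0.8, "WEST": 0.2}   # hypothetical values after adapting
print(boltzmann_probabilities(before)["EAST"], boltzmann_probabilities(after)["EAST"])
```

A lower temperature makes the policy closer to greedy; a higher one spreads probability more evenly across actions.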

ROBOCUP RESCUE

Contents:
・ Overview of RoboCup Rescue
・ Demonstration

Leagues in RoboCup

・ Soccer: Robot leagues; Simulation leagues (2D, 3D)
・ Rescue: Robot leagues; Simulation leagues

Ultimate goal of RoboCup: "By the middle of the 21st century, a team of fully autonomous humanoid robot soccer players shall win a soccer game, complying with the official rules of FIFA, against the winner of the most recent World Cup." (from the official site)

RoboCup Rescue
The purpose:
(1) to develop simulators that form the infrastructure of the simulation system and emulate realistic phenomena predominant in disasters;
(2) to develop intelligent agents and robots that are given the capabilities of the main actors in a disaster response scenario.
(from the official site)

Virtual Robots simulation (powered by USARSim)

Agent simulation

RoboCup Rescue: The agent simulation

Buildings: fire, collapse

Roads: traffic movement, roads blocked by rubble, etc.

Emergency services: fire brigades, ambulance teams, police forces

Agent’s observation and action

Demonstration / movie

RoboCup Rescue + RL (Team MRL)
Reinforcement learning is employed to control the agents. The details are not given in the paper.

Team MRL was the champion of RoboCup 2007 (8 teams in total).

Omid Aghazadeh et al., "Implementing Parametric Reinforcement Learning in Robocup Rescue Simulation," in RoboCup 2007: Robot Soccer World Cup XI, Lecture Notes in Computer Science, vol. 5001, pp. 409-416, 2008. DOI: 10.1007/978-3-540-68847-1_42

Summary
The following topics were overviewed:
・ Reinforcement learning: the framework and some research themes
・ RoboCup Rescue: the aims of some leagues, and demonstrations

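[Figure: the reward as the sum of an unknown constant amount and a known amount of change; the unknown constant amount is the target of learning.]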

Reinforcement learning as an engineering approach
[Figure: the agent (learner) receives a state and a reward from the environment and returns an action to it.]

Dividing the reward into two parts:
・ Time-dependent part: to be designed.
・ Time-independent part: to be learnt.

Research Theme 1: learning in partially observable environments

[Figure: a jointed arm driven by a torque, with the angular velocity indicated.]

If the agent can observe all four state variables (the angle and angular velocity of each joint), it can control the system.

If the agent cannot use velocity information, it cannot determine the direction in which to apply torque.

Complex-valued reinforcement learning enables the agent to overcome this problem by using the context of its behavior.
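The idea of telling identical observations apart by behavioral context can be illustrated with complex-valued action values. In this sketch, loosely modeled on complex-valued Q-learning, each action value is a complex number and the agent selects the action maximizing Re[Q(o, a) · conj(I)], where I is an internal reference whose phase reflects the preceding behavior. How I is computed and how Q is updated are omitted; the action names, values, and reference phases are hypothetical and hand-set.

```python
import cmath

# Hypothetical complex action values for ONE observation (the joint angle
# alone), shared by two underlying states that differ only in the unobserved
# angular velocity.
Q_OBS = {
    "torque_plus": complex(1.0, 0.0),   # large real part
    "torque_minus": complex(0.0, 1.0),  # large imaginary part
}


def select_with_context(q: dict[str, complex], reference: complex) -> str:
    """Pick the action maximizing Re[Q(o, a) * conj(I)].

    The reference I carries phase information from the preceding behavior,
    so the same observation can lead to different actions.
    """
    return max(q, key=lambda a: (q[a] * reference.conjugate()).real)


# Two different behavioral contexts, encoded as reference phases (hypothetical).
context_a = cmath.exp(1j * 0.0)           # phase 0
context_b = cmath.exp(1j * cmath.pi / 2)  # phase pi/2
print(select_with_context(Q_OBS, context_a), select_with_context(Q_OBS, context_b))
```

The same observation yields different actions under the two contexts, which is what a velocity-blind agent needs in order to choose the direction of torque.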

Research Theme 1: learning in partially observable environments
[Figure: reward function for the swing-up task, on a scale from -100 % to 100 %.]