Takeshi Shibuya University of Tsukuba [email protected]
Reinforcement learning & RoboCup Rescue
Takeshi Shibuya, University of Tsukuba
A fundamental study on representation of reward for reinforcement learning in dynamic environments + an introduction to rescue simulation

Today, I'd like to give a presentation entitled "Reinforcement Learning & RoboCup Rescue." My name is Takeshi Shibuya, and I am an assistant professor at the University of Tsukuba.
Outline
Reinforcement learning
  An interactive learning framework in soft computing
  A method to learn in dynamic environments
RoboCup Rescue: overview
  An application of soft computing
Reinforcement learning
  (theoretical side) Learning in dynamic environments
  (application side) Rescue simulation

First of all, let me outline my presentation.
The presentation consists of two parts: reinforcement learning and RoboCup Rescue. Reinforcement learning is a learning framework in the machine learning research field. In reinforcement learning, a controller, called the agent, interacts with a plant, called the environment, and learns suitable behavior through trial and error.
I submitted a paper on a theoretical technique that improves reinforcement learning performance in a changing environment. After submission, however, following discussions with Yasunobu-sensei and Nobuhara-sensei, we decided to focus more on the application side of artificial intelligence in this presentation.
Reinforcement learning
Contents:
  Reinforcement learning in psychology
  Learning in dynamic environments

Reinforcement learning in psychology
If he finishes pushing the numbers in order, he gets a peanut as a reward.
(Kyoto University)

The term "reinforcement learning" in A.I. comes from psychology.
This is Ayumu-kun, a famous chimpanzee at Kyoto University. A panel in front of Ayumu-kun displays numbers. At first, the panel displays some numbers from 1 to 8. When he pushes the numbers in order, each number vanishes. When he has made all the numbers vanish, he gets a peanut as a reward.

Notable things in reinforcement learning
The learner acquires suitable behavior from the reward alone.
The trainer does not have to tell the learner how to behave step by step.

What is reinforcement learning (RL)?
The agent enhances values that bring rewards.
The agent selects the action whose value is highest.
[Diagram labels: Agent, reward, State, Value, Actions]

Reinforcement learning provides a learning framework that works through interactions between an agent and an environment. Let me briefly show you the RL algorithm.
In RL, the agent aims to acquire suitable behavior in the environment.
First, the agent observes the state s. Then, the agent selects an action a according to the state s. Lastly, the agent moves to the next state according to the selected action a.
If the state s is in a suitable area of the environment, the environment gives the agent a reward r.
The agent uses action values to select an action. An action value is the expected reward for a pair of state s and action a.
The agent enhances the corresponding action value when it gets a reward, and it selects actions according to the magnitudes of their action values.
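The observe, select, move, and reward steps described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the talk: it uses standard tabular Q-learning, and the corridor environment, the function name `run_q_learning`, and the parameters `alpha`, `gamma`, and `epsilon` are all hypothetical choices made for the example.

```python
import random


def run_q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9,
                   epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor (an assumption).

    States are 0..n_states-1, actions are 0 (left) and 1 (right).
    The environment gives reward 1 when the agent reaches the rightmost state.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # action values Q(s, a)
    goal = n_states - 1
    for _ in range(episodes):
        s = 0  # the agent observes the initial state
        for _ in range(100):  # cap the episode length
            # Select an action: usually the highest-valued one,
            # sometimes (or on ties) a random one to keep exploring.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # Move to the next state according to the selected action.
            s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
            # The environment gives a reward in the suitable state.
            r = 1.0 if s_next == goal else 0.0
            # Enhance the action value toward reward plus discounted future value.
            target = r if s_next == goal else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if s == goal:
                break
    return q


q = run_q_learning()
# Greedy action in each non-goal state (0 = left, 1 = right).
greedy = [0 if left > right else 1 for left, right in q[:-1]]
print(greedy)
```

After training, the greedy action in every non-goal state should point toward the rewarded end of the corridor, which is the "select the action whose value is highest" rule applied to the learned values.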
This design has achieved many successes.

Research theme: learning in dynamic environments
How can the agent learn behavior when the suitable action changes?
[Figure labels: time, Action 1, Action 2, Great reward]
Dividing the reward into two parts:
  Time-dependent part: to be designed.
  Time-independent part: to be learnt.
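To make the research question concrete (how to learn when the suitable action changes over time), here is a small sketch of a two-action problem in which the rewarded action switches partway through. It does not implement the proposed reward division; it only shows a generic, well-known alternative, a constant step-size value update, which weights recent rewards more heavily. The function name `track_switch` and all parameter values are assumptions made for this example.

```python
import random


def track_switch(steps=400, switch_at=200, alpha=0.1, epsilon=0.1, seed=1):
    """Two-action problem whose rewarded action changes at `switch_at`.

    Hypothetical setup: action 0 is rewarded before the switch,
    action 1 is rewarded afterwards.
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]  # one value per action
    choices = []
    for t in range(steps):
        best = 0 if t < switch_at else 1  # the suitable action changes here
        if rng.random() < epsilon:
            a = rng.randrange(2)  # occasional exploration
        else:
            a = 0 if q[0] >= q[1] else 1  # otherwise pick the higher value
        r = 1.0 if a == best else 0.0
        # Constant step size: old rewards fade, so the values can follow the change.
        q[a] += alpha * (r - q[a])
        choices.append(a)
    return q, choices


q, choices = track_switch()
# Fraction of steps on which the currently suitable action was chosen.
before = sum(1 for a in choices[150:200] if a == 0) / 50
after = sum(1 for a in choices[350:400] if a == 1) / 50
print(before, after)
```

Both fractions should be high: the agent mostly picks Action 1's counterpart (action 0) before the switch and moves to the other action some time after it. The delay after the switch comes from the old value having to decay first, which is the kind of cost a designed time-dependent reward part is meant to address.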
Research theme: learning in dynamic environments
The probability of selecting the EAST action increases.
The proposed method enables the agent to adapt to the change in the environment.
The action-selection probabilities switch after the environment changes.

RoboCup Rescue
Contents:
  Overview of RoboCup Rescue
  Demonstration

Leagues in RoboCup
Soccer: robot leagues, simulation leagues (2D, 3D)
Rescue: robot leagues, simulation leagues
Ultimate goal of the RoboCup: "By mid-21st century, a team of fully autonomous humanoid robot soccer players shall win the soccer game, comply with the official rule of the FIFA, against the winner of the most recent World Cup." (from the official site)

RoboCup Rescue
The purpose: "(1) to develop simulators that form the infrastructure of the simulation system and emulate realistic phenomena predominant in disasters. (2) to develop intelligent agents and robots that are given the capabilities of the main actors in a disaster response scenario." (from the official site)
Virtual Robots simulation (powered by USARSim)
Agent simulation

RoboCup Rescue: the agent simulation
Buildings: fire, collapse
Roads: traffic movement, roads blocked by rubble, etc.
Emergency services: fire brigades, ambulance teams, police forces

Agents' observation and action
Demonstration / movie

RoboCup Rescue + RL (Team MRL)
Reinforcement learning is employed to control the agents. The details are not given in the paper.
Team MRL is the champion of RoboCup 2007. (total: 8 teams)
Omid Aghazadeh et al., "Implementing Parametric Reinforcement Learning in Robocup Rescue Simulation," RoboCup 2007: Robot Soccer World Cup XI, Lecture Notes in Computer Science, vol. 5001, 2008, pp. 409-416. DOI: 10.1007/978-3-540-68847-1_42
Summary
The following topics were overviewed:
  Reinforcement learning: the framework and some research themes
  RoboCup Rescue: aims of some leagues and demonstrations
Reinforcement learning as an engineering approach
[Diagram labels: Agent (learner), reward, State, Environment]

What you see here is the framework of reinforcement learning as an engineering approach.
The agent is the learner; it corresponds to Ayumu-kun. This is the environment; it corresponds to the display in the previous example.
First, the agent observes the state s. In the previous example, the state is the set of numbers remaining on the display.
Then, the agent selects an action a according to the state s. An action is what the agent does; in the previous example, it corresponds to which number is pushed.
Lastly, the agent moves to the next state according to the selected action a.
If the state s is a suitable one in the environment, the environment gives the agent a reward r. In the mouse's case, when he selects the door leading to the cheese, he gets the reward. In Ayumu-kun's case, when he finishes pushing all the numbers, he gets the reward.
Dividing the reward into two parts:
  Time-dependent part: to be designed.
  Time-independent part: to be learnt.
Research Theme 1: learning in a partially observable environment
[Figure labels: Torque, Angular velocity]
If the agent can observe four state variables (the angle and angular velocity of each joint), the agent can control the system.
If the agent cannot use the velocity information, it cannot determine the direction in which to apply torque.
Complex-valued reinforcement learning enables the agent to overcome this problem by using the context of its behavior.
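The problem of missing velocity information can be illustrated with a short sketch. It shows only the ambiguity itself: two different full states of a two-joint system look identical when only the angles are observed. The names `FullState`, `observe_angles_only`, and `estimate_velocity`, and all numeric values, are hypothetical. The finite-difference estimate at the end is one generic way of using history; it is not the complex-valued method mentioned in the slide.

```python
from typing import NamedTuple, Tuple


class FullState(NamedTuple):
    """Full state of a hypothetical two-joint system (four variables)."""
    angle1: float
    velocity1: float
    angle2: float
    velocity2: float


def observe_angles_only(state: FullState) -> Tuple[float, float]:
    """Assumed sensor that reports joint angles but no angular velocities."""
    return (state.angle1, state.angle2)


# Same joint angles, opposite directions of motion.
swinging_forward = FullState(0.3, +1.0, -0.2, +0.5)
swinging_backward = FullState(0.3, -1.0, -0.2, -0.5)

# The two states need different torques, but the agent sees the same observation.
same_observation = (observe_angles_only(swinging_forward)
                    == observe_angles_only(swinging_backward))
print(same_observation)  # True


def estimate_velocity(prev_angle: float, angle: float, dt: float) -> float:
    """Generic finite-difference estimate from two consecutive observations."""
    return (angle - prev_angle) / dt
```

Because a single observation cannot separate the two cases, the agent needs some information from its recent history to decide which way to apply torque.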
Research Theme 1: learning in a partially observable environment
[Figure label: reward function]