PFN Spring Internship Final Report: Autonomous Drive by Deep RL


Transcript of PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Page 1: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Driving in TORCS with Deep Deterministic Policy Gradient

Final Report

Naoto Yoshida

Page 2: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

About Me

● Ph.D. student at Tohoku University
● My hobbies:
  ○ Reading books
  ○ TBA
● News:
  ○ My conference paper on the reward function was accepted!
    ■ SCIS&ISIS 2016 @ Hokkaido

Page 3: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Outline
● TORCS and Deep Reinforcement Learning
● DDPG: An Overview
● In Toy Domains
● In TORCS Domain
● Conclusion / Impressions

Page 4: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

TORCS and Deep Reinforcement Learning

Page 5: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

TORCS: The Open Racing Car Simulator
● Open source
● Realistic (?) dynamics simulation of the car environment

Page 6: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Deep Reinforcement Learning
● Reinforcement Learning + Deep Learning
  ○ From pixels to actions
    ■ General game play in the Atari domain
    ■ Car driver
    ■ (Go expert)
● Deep reinforcement learning in the continuous action domain: DDPG
  ○ Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." ICLR 2016

Vision-based car agent in TORCS: Steering + Accel/Brake = 2-dim continuous actions

Page 7: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

DDPG: An overview

Page 8: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Reinforcement Learning

GOAL: Maximization of the return \sum_{t} \gamma^{t} r_{t} in expectation

[Diagram: Agent and Environment loop; the agent sends action a to the environment, which returns state s and reward r]

Page 9: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Reinforcement Learning

GOAL: Maximization of the return \sum_{t} \gamma^{t} r_{t} in expectation

[Diagram: the same Agent and Environment loop, with an Interface between them; the agent's action a is converted into the raw output u sent to the environment, and the environment's raw input x is converted into the state s and reward r given to the agent]
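A minimal sketch (in Python, with purely illustrative names) of the loop above, including the interface layer that maps the environment's raw input x to the agent's state s and the agent's action a to the raw output u:

```python
import numpy as np

class DummyEnv:
    """Stand-in environment: consumes a raw control u, returns a raw observation x and a reward r."""
    def step(self, u):
        return np.random.randn(3), float(-np.sum(u ** 2))

def interface_obs(x):
    return np.asarray(x, dtype=np.float32)   # raw input x -> state s (e.g. scaling / feature extraction)

def interface_act(a):
    return np.clip(a, -1.0, 1.0)             # action a -> raw output u (e.g. clipping to actuator limits)

env = DummyEnv()
s, total_return = np.zeros(3, dtype=np.float32), 0.0
for t in range(100):
    a = np.tanh(s[:1])                        # placeholder policy producing a 1-dim action
    x, r = env.step(interface_act(a))         # the environment only sees the raw output u
    s = interface_obs(x)                      # the agent only sees the processed state s
    total_return += r                         # the RL goal: maximize this quantity in expectation
print(total_return)
```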

Page 10: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Deterministic Policy Gradient
● Formal objective function: maximization of the true action value
    J(\theta) = \mathbb{E}_{s}\left[ Q^{\mu}(s, \mu_{\theta}(s)) \right]
● Policy evaluation: approximation of the objective function
  ○ Bellman equation w.r.t. the deterministic policy:
    Q^{\mu}(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma\, Q^{\mu}(s_{t+1}, \mu(s_{t+1})) \right]
  ○ Loss for the critic:
    L(w) = \mathbb{E}\left[ \left( r_t + \gamma\, Q_{w}(s_{t+1}, \mu_{\theta}(s_{t+1})) - Q_{w}(s_t, a_t) \right)^{2} \right]
● Policy improvement: improvement of the objective function
  ○ Update direction of the actor:
    \nabla_{\theta} J \approx \mathbb{E}\left[ \nabla_{a} Q_{w}(s, a)\big|_{a=\mu_{\theta}(s)}\, \nabla_{\theta} \mu_{\theta}(s) \right]

Silver, David, et al. "Deterministic policy gradient algorithms." ICML 2014.

Page 11: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Deep Deterministic Policy Gradient

[Diagram: the DDPG training loop. Initialization; then repeated sampling / interaction (the RL agent (DDPG) sends action a to TORCS and receives s, r); update of the critic with a minibatch; update of the actor with a minibatch; update of the target networks. A sketch of the update steps follows below.]

Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." ICLR 2016.
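A compact sketch of these update steps, written in PyTorch purely for illustration (network sizes, learning rates, and τ are assumptions, not the values used in the internship code):

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 1e-3  # illustrative sizes and constants

# Initialization: actor, critic, and their target copies.
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done):
    """One minibatch update; the arguments are tensors sampled from the replay buffer."""
    # Update of the critic: regress Q(s, a) toward the bootstrapped target.
    with torch.no_grad():
        q_next = target_critic(torch.cat([s_next, target_actor(s_next)], dim=1))
        y = r + gamma * (1.0 - done) * q_next
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Update of the actor: ascend Q along the deterministic policy gradient.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Update of the target networks (slow, soft tracking of the online networks).
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```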

Page 12: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Deep Architecture of DDPG

Three-step observation

Simultaneous training of two deep convolutional networks

Page 13: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Exploration: Ornstein–Uhlenbeck process
● Gaussian noise with the following moments/parameters
  ○ θ, σ: parameters
  ○ dt: time difference
  ○ μ: mean (= 0)
● Stochastic differential equation (SDE, driven by a Wiener process W_t):
    dx_t = \theta (\mu - x_t)\, dt + \sigma\, dW_t
● Exact solution for a discrete time step (the increment is Gaussian):
    x_{t+dt} = \mu + (x_t - \mu)\, e^{-\theta\, dt} + \sigma \sqrt{\tfrac{1 - e^{-2\theta\, dt}}{2\theta}}\; \mathcal{N}(0, 1)
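A small sketch of OU exploration noise using the exact discrete-time solution above (θ, σ, μ follow the next slide; dt = 1 is an assumption):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck noise via the exact solution for a fixed time step dt."""
    def __init__(self, dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.x = np.full(dim, mu)

    def sample(self):
        decay = np.exp(-self.theta * self.dt)                        # e^{-theta dt}
        std = self.sigma * np.sqrt((1.0 - decay ** 2) / (2.0 * self.theta))
        self.x = self.mu + (self.x - self.mu) * decay + std * np.random.randn(*self.x.shape)
        return self.x

noise = OUNoise(dim=1)
samples = [noise.sample() for _ in range(5)]   # temporally correlated, zero-mean noise
```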

Page 14: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

OU process: Example

[Plot comparing Gaussian noise and the OU process over time]

θ = 0.15, σ = 0.2, μ = 0

Page 15: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

DDPG in Toy Domains

https://gym.openai.com/

Page 16: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Toy Problem: Pendulum Swing-up
● Classical RL benchmark task
  ○ Nonlinear control
  ○ Action: torque (1 dim)
  ○ State: pendulum angle and angular velocity (θ, θ̇)
  ○ Reward: related to the height of the pendulum tip (cos θ)

From "Reinforcement Learning in Continuous Time and Space", Kenji Doya, 2000
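A minimal sketch of running the swing-up task through the OpenAI Gym interface used for these toy experiments ("Pendulum-v0" is an assumption; note that Gym's built-in reward penalizes angle, angular velocity, and torque rather than using cos θ directly):

```python
import gym

env = gym.make("Pendulum-v0")
obs = env.reset()                                 # observation: (cos θ, sin θ, θ̇)
episode_return = 0.0
for t in range(200):
    action = env.action_space.sample()            # 1-dim torque, here chosen at random
    obs, reward, done, info = env.step(action)
    episode_return += reward
    if done:
        break
env.close()
print(episode_return)
```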

Page 17: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Results

[Plot: learning curve; x-axis: # of episodes]

Page 18: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Results: SVG(0)

[Plot: learning curve; x-axis: # of episodes]

Page 19: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Toy Problem 2: Cart-pole Balancing
● Another classical benchmark task
  ○ Action: horizontal force on the cart
  ○ State: cart position / velocity and pole angle / angular velocity
  ○ Reward (other definitions are possible; a tiny sketch follows below):
    ■ +1 (while the pole angle is inside the allowed angle area)
    ■ 0 (episode terminal)

[Figure: the allowed angle area for the pole]
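A tiny sketch of the balancing reward described above (the angle threshold is an assumed value; the slide does not state it):

```python
import numpy as np

ANGLE_LIMIT = np.deg2rad(12.0)   # assumed size of the allowed angle area

def cartpole_reward(pole_angle, episode_terminal):
    """+1 while the pole stays inside the angle area, 0 at the episode terminal."""
    if episode_terminal:
        return 0.0
    return 1.0 if abs(pole_angle) < ANGLE_LIMIT else 0.0
```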

Page 20: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Results: non-convergent behavior :(

The RLLAB implementation worked well

[Plot: score vs. total steps, with the successful-score level marked]

https://rllab.readthedocs.io

Page 21: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

DDPG in TORCS Domain

Note: red = parameters with author confirmation / from the DDPG paper; blue = estimated / hand-tuned parameters

Page 22: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Track: Michigan Speedway
● Used in the DDPG paper
● This track actually exists!

www.mispeedway.com

Page 23: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

TORCS: Low-dimensional observation
● TORCS supports low-dimensional sensor outputs for the AI agent
  ○ "Track" sensor
  ○ "Opponent" sensor
  ○ Speed sensor
  ○ Fuel, damage, wheel spin speed, etc.
● Observation: track + speed sensors
● Network: shallow network
● Action: steering (1 dim)
● Reward (a small sketch follows below):
  ○ If the car crashes or goes off course, it gets a -1 penalty
  ○ Otherwise, the car's speed projected along the track axis (as in the DDPG paper)

[Figure: the "track" laser sensor, with the track axis and the car axis]
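A minimal sketch of this reward, assuming the DDPG paper's choice of projecting the car's speed onto the track axis (the names here are illustrative, not the TORCS sensor API):

```python
import numpy as np

def torcs_reward(speed, angle_car_to_track, crashed_or_off_course):
    """-1 penalty on crash / going off course, otherwise speed projected on the track axis."""
    if crashed_or_off_course:
        return -1.0
    return speed * np.cos(angle_car_to_track)
```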

Page 24: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Result: Reasonable behavior

Page 25: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

TORCS: Vision inputs
● Two deep convolutional neural networks (actor and critic)
  ○ Convolution (a sketch follows below):
    ■ 1st layer: 32 filters, 5x5 kernel, stride 2, padding 2
    ■ 2nd, 3rd layers: 32 filters, 3x3 kernel, stride 2, padding 1
  ○ Fully connected layer: 200 hidden nodes
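A sketch of this convolutional trunk, written in PyTorch for illustration (the 9 input channels, i.e. 3 stacked color frames, and the 64x64 image size are assumptions):

```python
import torch
import torch.nn as nn

class ConvTrunk(nn.Module):
    """Three convolutions as specified on the slide, followed by a 200-unit hidden layer."""
    def __init__(self, in_channels=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 8 * 8, 200)      # assumes 64x64 inputs -> 8x8 feature maps

    def forward(self, x):
        h = self.features(x).flatten(start_dim=1)
        return torch.relu(self.fc(h))

trunk = ConvTrunk()
out = trunk(torch.zeros(1, 9, 64, 64))            # -> shape (1, 200)
```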

Page 26: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

vtorcs-RL-color

● Visual TORCS
  ○ TORCS for vision-based AI agents
    ■ The original TORCS does not have a vision API!
    ■ vtorcs:
      ● Koutník et al., "Evolving deep unsupervised convolutional networks for vision-based reinforcement learning." ACM, 2014.
  ○ vtorcs sends monochrome images from the TORCS server
    ■ Modified for color vision → vtorcs-RL-color
  ○ Restart bug
    ■ Solved with the help of my mentors' substantial suggestions!

Page 27: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Result: Still not a good result...

Page 28: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

What was the cause of the failure?
● The DDPG implementation?
  ○ It worked correctly, at least in the toy domains.
    ■ Approximation of the value functions → OK
    ■ However, policy improvement failed in the end.
  ○ The default exploration strategy is problematic in the TORCS environment
    ■ That setting may be intended for general tasks
    ■ Higher-order exploration in POMDPs is required
● The TORCS environment?
  ○ There are still several unknown environment parameters
    ■ Reward → OK (confirmed with the DDPG authors)
    ■ Episode terminal condition → still various possibilities (from the DDPG paper)

Page 29: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

gym-torcs: a TORCS environment with an OpenAI Gym-like interface
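A minimal usage sketch, assuming the interface of the public gym_torcs package (the constructor flags and the end() call are my recollection of that repository and should be treated as assumptions):

```python
from gym_torcs import TorcsEnv

env = TorcsEnv(vision=True, throttle=False)   # vision observations, steering-only actions
obs = env.reset()
for t in range(1000):
    action = [0.0]                            # 1-dim steering command
    obs, reward, done, info = env.step(action)
    if done:
        break
env.end()                                     # shut down the TORCS process
```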

Page 30: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Impressions
● On DDPG
  ○ Learning continuous control is a tough problem :(
    ■ Difficulty of the policy update in DDPG
    ■ The DDPG author recommended the asynchronous method "twice" (^ ^;)
● Throughout this PFN internship:
  ○ Weakness: coding
    ■ Thank you, Fujita-san and Kusumoto-san!
    ■ I learned about many weaknesses in my coding style
  ○ Strength: reinforcement learning theory
    ■ ...and its derived algorithms, related topics, and the relationships between RL and inference
    ■ For deep RL, Fujita-san is an authority in Japan :)

Page 31: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Update after the PFI seminar

Page 32: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Cart-pole Balancing
● DDPG could learn a successful policy
  ○ Still unstable after several successful trials

Page 33: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Success in Half-Cheetah Experiment
● We could run a successful experiment with the same hyperparameters as in cart-pole.

[Plot: 300-step total reward vs. episode]

Page 34: PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Keys in DDPG / deep RL
● Normalization of the environment
  ○ Preprocessing is known to be very important for deep learning.
    ■ This is also true in deep RL.
    ■ Scaling the inputs (and possibly the actions and rewards) helps the agent to learn.
● Possible normalization (a small sketch follows below):
  ○ Simple normalization helps: x_norm = (x - mean_x) / std_x
  ○ The mean and standard deviation are obtained during the initial exploration.
  ○ Other normalizations such as ZCA/PCA whitening may also help.
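A small sketch of this simple normalization (the placeholder data stands in for observations collected during the initial exploration):

```python
import numpy as np

initial_observations = 5.0 * np.random.randn(1000, 29) + 2.0   # placeholder exploration data

mean_x = initial_observations.mean(axis=0)
std_x = initial_observations.std(axis=0) + 1e-8                # avoid division by zero

def normalize(x):
    """x_norm = (x - mean_x) / std_x, per observation dimension."""
    return (x - mean_x) / std_x
```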

● The epsilon parameter in Adam / RMSprop can be a large value (see the example below)
  ○ 0.1, 0.01, 0.001... We still need hand-tuning / grid search...
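For example, in PyTorch (the framework choice here is only for illustration), a larger epsilon is a single constructor argument:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.Adam(params, lr=1e-3, eps=1e-2)        # default eps is 1e-8
# optimizer = torch.optim.RMSprop(params, lr=1e-3, eps=1e-2)   # same idea for RMSprop
```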