
A Control Strategy of Autonomous Vehicles based on Deep Reinforcement Learning

Wei Xia1, Huiyun Li1, Baopu Li2,1 1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

2. Department of Biomedical Engineering, Shenzhen University Shenzhen, China

e-mail: [email protected], [email protected]

Abstract—Deep reinforcement learning has received considerable attention after the outstanding performance of AlphaGo. In this paper, we propose a new control strategy for self-driving vehicles based on a deep reinforcement learning model, in which learning from the experience of a professional driver and a Q-learning algorithm with filtered experience replay are proposed. Experimental results demonstrate that the proposed model reduces the learning time by 71.2% and increases the stability by about 32%, compared with the existing neural fitted Q-iteration algorithm.

Keywords—autonomous vehicles; neural network; deep reinforcement learning

I. INTRODUCTION

Autonomous vehicles were named one of the top ten emerging technologies of 2016 by the World Economic Forum [1]. Google, Tesla, Baidu, and other technology companies have invested considerable effort in autonomous vehicle prototypes, and some autonomous vehicles are already under test on public roads. However, it remains a great challenge to develop a self-driving control strategy that can deal with the vast number of environment variables and conditions.

Q-learning is one of the most famous model-free reinforcement learning (RL) techniques; it uses a Q-table to store action values [2]. Riedmiller [3] introduced a neural network to approximate the Q-value function for simple control tasks by storing and reusing all transition experiences. However, the resource consumption of keeping all empirical data is unavoidable.
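For readers less familiar with tabular Q-learning, the following minimal Python sketch illustrates the Q-table update mentioned above; the state and action sizes and the hyper-parameters are illustrative placeholders, not values from this paper.

import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))       # the Q-table storing one action value per (s, a)
rng = np.random.default_rng(0)

def select_action(s):
    """Epsilon-greedy selection over the Q-table row of state s."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])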

In 2006, Hinton's work on deep belief networks (DBNs) opened a new era of deep learning (DL) [4]. Since then, deep learning has been widely used in vision, speech, and text recognition [5-6]. Recently, researchers have combined RL and DL to solve problems in control and decision-making applications [7-8]. Inspired by these works, we use a deep network as the value function to estimate states and actions in reinforcement learning for autonomous vehicles.

In this paper, we propose a control strategy for self-driving vehicles, the so-called deep Q-learning with filtered experiences (DQFE) model, which combines the classic neural fitted Q-iteration with deep learning techniques to obtain a control strategy with less resource consumption. We then conduct experiments on self-driving cars in The Open Racing Car Simulator (TORCS), an open source simulation platform for racing cars [9].

The rest of this paper is structured as follows. Section II reviews existing work on deep reinforcement learning (DRL) in the field of control and decision making. Our model is presented in Section III. Section IV describes the experiments on TORCS, followed by a brief conclusion in Section V.

II. RELATED WORK

The application of reinforcement learning to control and decision making has been investigated in several works. Pyeatt and Howe [10] applied reinforcement learning to learning racing behaviors in the Robot Auto Racing Simulator, a precursor of the TORCS platform; both are open source racing simulators. Loiacono et al. [11] used a tabular Q-learning model to learn overtaking strategies on TORCS. Riedmiller [3] proposed a neural reinforcement learning method, neural fitted Q-iteration (NFQ), to generate control strategies for the pole balancing and mountain car tasks with few interactions. In the above works, both the state and action spaces are low-dimensional. NFQ uses a multi-layer perceptron (MLP) as the value function for Q-iteration, typically a traditional three-layer neural network. It may fail to find a feasible solution when the state and action spaces grow.

In recent years, deep reinforcement learning has seen exciting developments. One representative work was published in Nature in 2015 by researchers at Google DeepMind, who developed a novel artificial agent, the deep Q-network, based on a convolutional neural network (CNN) and Q-learning. The agent learned policies directly from high-dimensional sensory inputs and achieved human-level control on the challenging domain of classic Atari 2600 games [8]. Another exciting work by Google DeepMind is even more convincing: AlphaGo defeated the world Go champion Lee Sedol 4:1 [12]. The technology behind it combines reinforcement learning with Monte Carlo tree search, supplemented by an extensive training set.

III. THE PROPOSED APPROACH

After an in-depth analysis of the relationship between professional driving and self-driving technology, we propose a scheme based on the combination of DL and RL. We utilize a historical dataset of professional drivers on the TORCS platform and filter the experience replay at the learning stage.



In this section, our approach is described in three parts: the modeling, the pre-training, and the Q-learning with filtered experience replay. Figure 1 shows the overall approach.

A. Modeling

We consider the control task in which the agent, mimicking a driver, interacts with the environment through a sequence of situations, actions, and rewards. Our model framework is shown in Figure 2.

At each time step t, the output value Q(s_t, a_t) of the deep neural network is determined by the input state s_t and the action a_t. More formally, we use the deep neural network to approximate the optimal action-value function:

Q(s_t, a_t) = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}    (1)

where \gamma (0 < \gamma \le 1) is the discount rate. The agent tries to select actions so that the sum of the discounted rewards Q(s_t, a_t) is maximized [2]. The value of the state s_t is collected from the TORCS server. The action a_t is chosen from our action space, which is manually discretized into two parts: the steering action space and the space of the other actions. The steering action is determined by the \epsilon-greedy algorithm at each step t, while the other actions follow the driving rules extracted from the driver experience in [13]. The reward r_t is defined by the effective distance traveled between two states, which is discounted by \gamma and accumulated to form the output value Q(s_t, a_t) of the deep neural network. Finally, we use the Rprop algorithm [14] to update the Q-iteration, with the loss function

L(\theta) = \sum_{t=1}^{n_t} \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)^2    (2)

where t = 1, ..., n_t indicates the round of Q-iteration and \theta refers to the weights of the network.
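As a rough illustration only, the sketch below shows how the epsilon-greedy steering choice and the squared TD error of Eq. (2) could be computed. The discretized steering set, the placeholder q_net function, and all hyper-parameters are assumptions; the actual weight update described above uses the Rprop algorithm [14] and is not shown.

import numpy as np

gamma, epsilon = 0.9, 0.1
steer_actions = np.linspace(-1.0, 1.0, 11)      # hypothetical discretized steering space
rng = np.random.default_rng(0)

def q_net(state, action, theta):
    """Placeholder for the deep Q-network Q(s, a; theta); returns one scalar value."""
    x = np.concatenate([np.asarray(state, dtype=float), [float(action)]])
    return float(np.tanh(x @ theta))            # stand-in for the real MLP forward pass

def epsilon_greedy_steer(state, theta):
    """Pick the steering action with the largest Q-value, exploring with probability epsilon."""
    if rng.random() < epsilon:
        return float(rng.choice(steer_actions))
    values = [q_net(state, a, theta) for a in steer_actions]
    return float(steer_actions[int(np.argmax(values))])

def q_iteration_loss(batch, theta):
    """Sum of squared TD errors over a batch of (s, a, r_next, s_next, a_next) transitions."""
    loss = 0.0
    for s, a, r_next, s_next, a_next in batch:
        target = r_next + gamma * q_net(s_next, a_next, theta)
        loss += (target - q_net(s, a, theta)) ** 2
    return loss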

B. Pre-training

After l_pre rounds of training with the reinforcement learning iterations described above [15], we selected the 21-11-10-1 neural network structure with which the agent completed the maximum number of steps in its test round. We then utilize this network model to build our DQFE algorithm in the further experiments.

First, we let a professional driver run one lap on the TORCS platform and collect the observed state, the action, and the next state every 20 ms during driving. A dataset of state-action sequence pairs is obtained when the professional driver finishes the track. Next, the current state s_t and action a_t in the dataset are used as the input of the pre-training network in Figure 2, and the change between two consecutive states, s_{t+1} - s_t, is used as the network target value. We then apply the Rprop algorithm to train the network for l_pre times.
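The following short sketch indicates how such input/target pairs could be assembled from the recorded driver log; the file names and array shapes are assumptions for illustration, and the subsequent Rprop training [14] is not shown.

import numpy as np

# Hypothetical log recorded from the professional driver every 20 ms:
# states[i] is the observation at step i, actions[i] the action taken in that state.
states  = np.load("driver_states.npy")    # assumed shape: (T, state_dim)
actions = np.load("driver_actions.npy")   # assumed shape: (T, action_dim)

# Network input: current state and action; target: change between consecutive states.
inputs  = np.concatenate([states[:-1], actions[:-1]], axis=1)
targets = states[1:] - states[:-1]
# (inputs, targets) are then used to train the pre-training network of Fig. 2 for l_pre rounds.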

In the last phase of pre-training, we only utilize the historical data of the experienced driver. To improve the convergence efficiency of our deep Q-learning with filtered experiences (DQFE) method, the relationship between the reward r of the professional driver's strategy and the Q value is given by Eq. (1). Table I summarizes the pre-training procedure with professional drivers.

TABLE I. PRE-TRAINING NETWORK WITH PROFESSIONAL DRIVERS

    Initialize dataset DS_pre
    Repeat:
        The driver takes action a in state s
        Calculate the reward r
        For each a_c in the action space omega_a:
            Calculate the distance between a and a_c
        Select the a_c with the shortest distance as the stored action a
        Save (s, a, r, s') into dataset DS_pre
    Until the driving task is terminated
    Train the network on DS_pre for d_exp times with the Rprop algorithm
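A possible Python rendering of the collection loop in Table I is given below; the discrete action space, the episode interface, and the scalar distance measure are illustrative assumptions.

import numpy as np

steer_actions = np.linspace(-1.0, 1.0, 11)   # hypothetical discrete action space omega_a

def nearest_discrete_action(a):
    """Snap the driver's continuous action to the closest entry of the discrete action space."""
    return float(steer_actions[int(np.argmin(np.abs(steer_actions - a)))])

def collect_pretraining_data(drive_episode):
    """Build DS_pre from an iterable of (state, driver_action, reward, next_state) tuples."""
    ds_pre = []
    for s, a, r, s_next in drive_episode:
        ds_pre.append((s, nearest_discrete_action(a), r, s_next))
    return ds_pre
# DS_pre is then used to train the network d_exp times with the Rprop algorithm.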

Fig. 1 The general process flow for our approach

Fig. 2 Model framework


C. Training and Testing

Storing all historical data from every experimental round to tune the network weights, as in the NFQ algorithm, can be problematic, since bad experimental data make it difficult to converge to a reliable control strategy.

A bounded container is constructed to form the experience replay [16], with two parameters: the maximum capacity of historical data \mu_{rms} and the maximum number of preserved rounds num. A poor experimental round is thus eliminated when it exceeds the restrictions of the container. The incremental change of the experimental data can be expressed as:

\Delta d = \begin{cases} 1, & \mathrm{len}(DS_h) < \mu_{rms} \ \text{and} \ \mathrm{num}(DS_h) < num \\ -1, & \text{otherwise} \end{cases}    (3)

where DS_h denotes the stored historical data.

A mini-batch obtained by random sampling is then fed to the Rprop algorithm for Q-iteration, with the number of iterations limited to N_train to avoid over-fitting [16].
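A minimal sketch of such a filtered replay container is given below, assuming the capacity values of Table II; evicting the oldest rounds first is our own simplification of the elimination rule described above.

import random

class FilteredReplay:
    """Bounded experience replay: at most `num` rounds and `capacity` transitions in total."""

    def __init__(self, capacity=8000, num=20):      # mu_rms and num from Table II
        self.capacity = capacity
        self.num = num
        self.rounds = []                            # each entry is the transition list of one round

    def add_round(self, transitions):
        self.rounds.append(list(transitions))
        # Evict whole rounds (oldest first) while either restriction is exceeded.
        while len(self.rounds) > self.num or self._size() > self.capacity:
            self.rounds.pop(0)

    def _size(self):
        return sum(len(r) for r in self.rounds)

    def sample(self, batch_size):
        """Random mini-batch over all stored transitions for one Rprop Q-iteration."""
        flat = [t for r in self.rounds for t in r]
        return random.sample(flat, min(batch_size, len(flat)))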

IV. EXPERIMENT

In this section, we conduct a comprehensive set of experiments to test the performance of our DQFE algorithm on the TORCS platform. We first introduce the environment setup and parameter configuration.

A. Environment Setup and Parameter Configuration

We use the client-server architecture extended from the original TORCS platform, in which the controller interacts with the server through UDP connections in each game [17]. A Python interface is used to obtain observations and send control commands to the server every 20 ms of simulated time [13, 18]. The observations used in our scheme include the angle, anglePos, seven track sensors, and the speed along the longitudinal axis of the car; the actions include the gas, brake, clutch, gear, and steering values [17]. The main parameters are listed in Table II.

TABLE II. MAIN PARAMETERS OF THE EXPERIMENT

    Parameter   Value   Description
    l_pre       80      Pre-training times with the historical dataset
    d_exp       10      Training times with the professional driver data
    Mbs         0.5     Mini-batch size
    μ_rms       8000    Replay memory size
    num         20      Maximum number of historical rounds preserved
    N_train     20      Maximum training steps in each round
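For illustration, the sketch below shows one way such a UDP control loop could look in Python; the server address, handshake, and message formats are placeholders and do not follow the actual SCR/SnakeOil protocol [13, 17].

import socket

SERVER = ("127.0.0.1", 3001)   # hypothetical TORCS server address and port

def parse_sensors(msg):
    """Placeholder parser; the real sensor string format is described in [17]."""
    return msg

def format_control(action):
    """Placeholder formatter; the real control string format is described in [17]."""
    return str(action)

def control_loop(policy, steps=1000):
    """Receive one observation per simulation step and reply with a control command."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    sock.sendto(b"(init)", SERVER)                 # placeholder handshake message
    for _ in range(steps):
        try:
            data, _ = sock.recvfrom(4096)          # sensor message from the server
        except socket.timeout:
            break
        observation = parse_sensors(data.decode())
        action = policy(observation)               # e.g. the epsilon-greedy steering choice
        sock.sendto(format_control(action).encode(), SERVER)
    sock.close()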

Fig. 3 shows the control strategy in different action tests. The top two sub-figures show the vehicle driving through a left turn and a right turn, respectively. The cross at the bottom right of each sub-figure indicates the forward direction of the vehicle. Sub-figure (c) shows a sharp left turn and sub-figure (d) shows a straight road.

B. Experimental Results

We run 300 iterations with three methods: 1) without any learning method, 2) the naive NFQ without pre-training by an experienced driver, and 3) our DQFE algorithm.

Each strategy was run for 300 rounds. We take the number of times the competition road is finished successfully as a measure of the effectiveness of each strategy, and the time consumption of each strategy as a measure of its efficiency. The driver without learning cannot accomplish any subtask in this experiment, so we only compare our method with the naive NFQ method in Figure 4 and Figure 5.

Our DQFE algorithm completed the whole journey in 48 rounds, while the driver with the NFQ algorithm recorded 59 completed rounds. Owing to the pre-training, we obtain control strategies about 60 rounds faster than NFQ. Although the overall results are almost on a par over the 300 experiments, the difference in time consumption between the two methods is obvious. The time record of each round for each strategy is shown in Figure 5.

Fig. 4 Number of finished tasks for different strategies

Fig. 3 Examples of testing our DQFE model: (a) turn left, (b) turn right, (c) sharp left turn, (d) straight road


In terms of time consumption, the complexity of the NFQ algorithm is O(N), and its consumption rises to 2948 s by round 297, whereas the consumption of our DQFE algorithm stabilizes at about 320 s after 128 rounds. On average, our method spends 236 s on learning in each trial, about a quarter of the 820 s required by the NFQ algorithm (1 - 236/820 ≈ 71.2%). Overall, our algorithm saves 71.2% of the learning time over 300 trials.

After learning, the models obtained with the two strategies were each tested 50 times. We calculate the distance between the car and the track axis to assess the control performance. Table III compares the control performance of NFQ and DQFE. Over all tests, the average distance is 0.2513 for NFQ and -0.0884 for our algorithm, with a slightly larger standard deviation. However, our method finished the competition road 49 times out of the 50 tests, compared with 33 times for NFQ.

TABLE III. COMPARISON OF CONTROL PERFORMANCE

    Model   Average Distance   Standard Deviation   Ratio (%)
    NFQ     0.2513             0.125                66
    DQFE    -0.0884            0.1537               98

As can be seen from the table, the model trained with our DQFE algorithm performs significantly better. Compared with the NFQ model, our algorithm increases the stability by 32% in the tests (98% - 66% = 32 percentage points in completion ratio).

V. CONCLUSION

Developing a control strategy for self-driving vehicles is difficult because an enormous number of circumstances must be taken into account. Deep reinforcement learning, however, provides a promising direction. In this paper, we propose a learning algorithm for self-driving vehicles featuring deep Q-learning with filtered experience replay. Experimental results on the TORCS platform verify that, after proper training, the proposed model obtains effective control strategies. Compared with the existing neural fitted Q-iteration algorithm, our model reduces the learning time by 71.2% over 300 trials. In addition, our algorithm increases the stability by 32% in 50 tests.

ACKNOWLEDGMENT

This work was supported by Shenzhen S&T Funding under Grants CXZZ 20140527172356968, JSGG 20150511104613104, and JCYJ20160510154531467, and by Guangdong S&T Funding under Grants 2013B050800003 and 2015B010106004.

REFERENCES

[1] D. Oliver Cann, Media Relations, World Economic Forum, "These are the top 10 emerging technologies of 2016," 2016.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1. Cambridge, MA: MIT Press, 1998.
[3] M. Riedmiller, "Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method," in European Conference on Machine Learning, 2005, pp. 317-328.
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[5] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 427-436.
[6] Y. Qian, Y. Fan, W. Hu, and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3829-3833.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, et al., "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, 2015.
[9] B. Wymann, E. Espié, C. Dimitrakakis, A. Sumner, and C. Guionneau, "TORCS: The Open Racing Car Simulator," 2015.
[10] L. D. Pyeatt and A. E. Howe, "Learning to race: Experiments with a simulated race car," in FLAIRS Conference, 1998, pp. 357-361.
[11] D. Loiacono, A. Prete, P. L. Lanzi, and L. Cardamone, "Learning to overtake in TORCS using simple reinforcement learning," in IEEE Congress on Evolutionary Computation, 2010, pp. 1-8.
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484-489, 2016.
[13] X. E. Chris, "SnakeOil," 2013.
[14] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in IEEE International Conference on Neural Networks, 1993, pp. 586-591.
[15] C. W. Anderson, M. Lee, and D. L. Elliott, "Faster reinforcement learning after pretraining deep networks to predict state dynamics," in 2015 International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1-7.
[16] P. Wawrzyński and A. K. Tanwani, "Autonomous reinforcement learning with experience replay," Neural Networks, vol. 41, pp. 156-167, 2013.
[17] D. Loiacono, L. Cardamone, and P. L. Lanzi, "Simulated car racing championship: Competition software manual," arXiv preprint arXiv:1304.1672, 2013.
[18] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, et al., "PyBrain," Journal of Machine Learning Research, vol. 11, pp. 743-746, 2010.

Fig. 5 Learning time comparison between NFQ and our method
