
Efficiently Learning a Robust Self-Driving Model with Neuron Coverage Aware Adaptive Filter Reuse

Chunpeng Wu, Ang Li, Bing Li, and Yiran Chen
Electrical and Computer Engineering Department, Duke University

Durham, NC 27708, USA
{chunpeng.wu, ang.li630, bing.li.ece, yiran.chen}@duke.edu

Abstract—Human drivers learn driving skills from both regular (non-accidental) and accidental driving experiences, while most current self-driving research focuses on regular driving only. We argue that learning from accidental driving data is necessary for robustly modeling driving behavior. A main challenge, however, is how accident data can be used effectively together with regular data to learn vehicle motion, since manually labeling accident data without expertise is significantly difficult. Our proposed solution for robust vehicle motion learning, in this paper, is to integrate layer-level discriminability and neuron coverage (neuron-level robustness) regularizers into an unsupervised generative network for video prediction. Layer-level discriminability increases the divergence of feature distributions between the regular data and the accident data at network layers. The neuron coverage regularizer enlarges the interval span of neuron activations adopted by training samples, to reduce the probability that a sample falls into untested interval regions. To accelerate the training process, we propose adaptive filter reuse based on neuron coverage. Our filter reuse strategies reduce structural network parameters, encourage memory reuse, and guarantee the effectiveness of robust vehicle motion learning. Experimental results show that our model improves inference accuracy by 1.1% compared to FCN-LSTM, and cuts training time by 10.2% over the traditional method with negligible accuracy loss.

Index Terms—Self-driving, robust vehicle motion learning, layer-level discriminability, neuron coverage, efficient training, adaptive filter reuse

I. INTRODUCTION

Computer vision based perception and decision making play an important role in self-driving systems. Early vision-based autonomous navigation using neural networks dates back at least to Pomerleau [1]. DNN-based vision algorithms have been extensively adopted in recently launched self-driving projects. Such prevalent use of vision algorithms in a safety-critical system is a great test of these algorithms' robustness: any subtle visual recognition error may incur a fatal traffic accident. However, most current self-driving research focuses on regular (non-accidental) driving only: publishing regular driving datasets [2]–[4] and learning to predict vehicle ego-motion or steering angle using the regular data [5]–[7]. Little research [8]–[11] tackles driving issues related to traffic accidents or potential safety risks. Figure 1 compares regular driving

This work was supported in part by NSF-1717657, NSF-1822085, and the memberships of NSF IUCRC for ASIC from Quantil, Ergomotion, JD.com, etc. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of grant agencies or their contractors.

samples from BDD100K [4] (left two) and accidental driving data samples from AADV [8] (right three).

We argue that accident data is necessary for robustly modeling driving behavior, as human drivers learn driving skills from both regular and accidental driving experiences. A main challenge, however, is how accident data can be used effectively together with regular data to learn vehicle motion, as manually labeling “correct” motion is easy for the regular data but significantly difficult for the accident data. Analyzing the “correct” vehicle motion in the accident data requires expert knowledge that is beyond the scope of machine learning research. Practically, traffic accidents are a sub-category of general anomaly events but are more safety-critical. Theoretically, using labeled regular data and unlabeled accident data together is typically a problem of multiple instance learning.

Most previous research on traffic accident issues [8]–[10] does not tackle this challenge, but instead works on supervised accident/obstacle/failure detection or anticipation. Further, the methods of [8], [9], and another recent method on general accidents [12], explicitly describe accidents using global features combined with local object-centric features in videos, which raises two problems: it requires a large number of manually labeled object bounding boxes, and it suppresses implicit risk factors that are not object-centric. Therefore, a better accident description should be labor-free and should preserve both explicit and implicit risk factors.

Our proposed solution for robust vehicle motion learning is inspired by recent software engineering research on testing the robustness of machine learning software [13]. For a neuron in a trained DNN, Ma et al. [13] calculate the lower boundary b_l and upper boundary b_u of the neuron response using all training data. The interval [b_l, b_u] is defined as the major function region, while (−∞, b_l) ∪ (b_u, +∞) is the corner-case region. Then, for each test sample, Ma et al. [13] check which region the neuron response of this sample falls into. In our method, to apply this neuron coverage criterion to training, we model vehicle motion learning as multiple instance learning and integrate the criterion into the learning model.
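To make the criterion concrete, the sketch below illustrates the region check described above. It is a minimal illustration under our own assumptions (the tensor shapes and function names such as major_region and check_region are hypothetical), not the implementation from [13].

```python
import numpy as np

def major_region(train_activations):
    """Per-neuron major function region [b_l, b_u] over all training data.

    train_activations: array of shape (num_samples, num_neurons).
    Returns (b_l, b_u), each of shape (num_neurons,).
    """
    return train_activations.min(axis=0), train_activations.max(axis=0)

def check_region(test_activation, b_l, b_u):
    """Classify each neuron response of one test sample as
    'major' (inside [b_l, b_u]) or 'corner' (outside)."""
    inside = (test_activation >= b_l) & (test_activation <= b_u)
    return np.where(inside, "major", "corner")

# Toy usage: 1000 training samples, 8 neurons.
train_acts = np.random.randn(1000, 8)
b_l, b_u = major_region(train_acts)
print(check_region(np.random.randn(8) * 3.0, b_l, b_u))
```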

Our efficient training strategy is inspired by a key observation on the relationship between DNN filters and visual tasks in recent research on DNN interpretability and visualization [14]–[16]. These studies show that the neuron activation of DNN filters is significantly sparse during inference. Among those sparsely activated filters, only a few filters of a DNN model


Fig. 1. Comparison of regular driving samples in BDD100K [4] (left two) and accidental driving data samples in AADV [8] (right three).

make a significant contribution to inference, and these filters most likely extract features common to all classes of input images. For the other activated filters, which contribute less to inference, their distributions are more dispersed across different classes, indicating that these filters are learned to retrieve class-dependent features for each class. Since the number of “useful” filters is small compared to the total number of filters, we reuse filters at a DNN layer in our method. Further, we link filter reuse to the neuron coverage criterion.

Our main contributions are as follows. First, we integrate layer-level discriminability and neuron coverage (neuron-level robustness) regularizers into an unsupervised generative network for video prediction. Layer-level discriminability increases the divergence of feature distributions between the regular data and the accident data at network layers. The neuron coverage regularizer reduces the probability that a sample falls into untested interval regions. Second, we accelerate the training process by adaptive filter reuse based on neuron coverage. Our filter reuse strategies reduce structural network parameters, encourage memory reuse, and adaptively update with neuron coverage changes to guarantee the robustness of vehicle motion learning.

II. RELATED WORK

A. Vehicle motion prediction

1) Most current self-driving research focuses on regular (non-accidental) driving only: learning to predict vehicle ego-motion or steering angle using only regular driving data [5], [7], [17]–[20]. Early works applied Convolutional Neural Networks (CNNs) to map raw vision data to driving actions or decisions [5], [17]. Although these works achieve a high success rate in making driving decisions and predictions, their capability is limited to the scenes provided by human drivers [19]. Some works turn to more powerful learning models to increase the prediction capability, such as Generative Adversarial Networks [21], memory-based networks [18], and reinforcement learning [7], [20].

2) Little research [8]–[10] tackles driving issues related to traffic accidents or potential safety risks. [10] proposed models to detect off-road obstacles under human supervision. [8], [9] proposed anticipation models to predict the occurrence of an accident in advance. [9] constructed a self-annotated incident database that contains a large number of near-miss traffic incidents.

B. DNN robustness

DNN robustness has been challenged by imperceptible perturbations of the input, which can result in misclassification. Many approaches have been investigated to improve DNN robustness. Several works [22], [23] presented modified training methods in which the classification of known noise/perturbations is added to the total training loss. The robustness of those models against certain types of perturbations is therefore significantly improved. However, one critical limitation is that the resulting models do not guarantee performance against other kinds of input perturbations [24]. Data augmentation is another widely studied strategy for DNN robustness enhancement. Various strategies for generating additional samples have been investigated, such as transformations of the original data [25]–[27] and the selection of certain types of training data [28], [29]. For example, [28] selected data that are not classified correctly with high confidence but are visually similar to easy positives.

C. Efficient DNN training

For accelerating DNN inference, pruning and quantization have been widely studied, such as [30]–[32]. However, training acceleration is more challenging because the difficulty of training convergence is exacerbated when these techniques are applied. DNN training is mostly executed on large-scale distributed systems, where the massive data transmission among computing nodes bottlenecks training acceleration. To address this problem, gradient quantization and sparsification have been studied. Different bit-widths of gradients [33], [34] have been proposed to speed up CNN or RNN training. As for gradient sparsification, the gradients that meet a predefined standard are regarded as useful and are transmitted. [35] merged these two techniques and significantly reduced the gradient communication overhead.
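As an illustration of the gradient sparsification idea mentioned above (not the exact schemes of [33]–[35]), the following sketch transmits only the largest-magnitude gradient entries and accumulates the rest locally; the threshold ratio and function names are our own assumptions.

```python
import numpy as np

def sparsify_gradient(grad, residual, ratio=0.01):
    """Keep only the top `ratio` fraction of gradient entries by magnitude.

    grad:     current gradient (1-D array).
    residual: locally accumulated gradient not yet transmitted.
    Returns (sparse_grad_to_send, new_residual).
    """
    acc = grad + residual                     # accumulate untransmitted gradient
    k = max(1, int(ratio * acc.size))
    threshold = np.sort(np.abs(acc))[-k]      # magnitude of the k-th largest entry
    mask = np.abs(acc) >= threshold
    send = np.where(mask, acc, 0.0)           # entries worth transmitting
    return send, acc - send                   # the rest stays in the residual

# Toy usage: a 10k-parameter layer, roughly 1% of gradients sent per step.
residual = np.zeros(10_000)
grad = np.random.randn(10_000)
send, residual = sparsify_gradient(grad, residual)
print(f"non-zeros sent: {np.count_nonzero(send)}")
```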

D. Unsupervised video prediction

LSTM models have been used to predict videos or sequences of video frames in an unsupervised setting [36], [37]. For example, Srivastava et al. [36] used several LSTM models to perform unsupervised video prediction: their system first uses an LSTM to extract representations of video frames and then feeds the information into LSTM-based decoders to predict future frames. [38] developed a predictive model which not only predicts long-term pixel motion from previous frames but also generalizes to previously unseen objects; the model in [38] has an encoder-decoder structure. [39] proposed a novel GAN-based architecture to predict future video frames.

III. PROPOSED METHOD

Fig. 2. Our proposed method for robust vehicle motion learning. The discriminator D is not used for inference.

Our proposed method for robust vehicle motion learning is shown in Figure 2. It consists of three models: an unsupervised generative model G with two output branches, for visual feature extraction and video sequence prediction respectively; a discriminator D for adversarially distinguishing G's generated frames from real ones; and a model Q for vehicle motion prediction based on the features extracted by G. The discriminator D is not used for inference.

Training data is sampled from two sets of video clips: regular driving and traffic accident. All video clips in these two sets are preprocessed to have the same length T.

The generative model G's input batch at time t (t = 1, 2, ..., T) is denoted as X^t (|X^t| = N), and X^t = X^t_r ∪ X^t_a, where X^t_r (|X^t_r| = N/2) is sampled from the regular driving video set at time t and X^t_a (|X^t_a| = N/2) is sampled from the traffic accident video set at time t. The model G's output at the video prediction branch (lower branch in Figure 2) is the synthesized sequences X̃^{t+1} = X̃^{t+1}_r ∪ X̃^{t+1}_a, where |X̃^{t+1}_r| = |X̃^{t+1}_a| = N/2. The sequences X̃^{t+1} are fed into the discriminator D. The model G's output at the feature extraction branch (upper branch in Figure 2) is the predicted features of the regular driving sequences at time t+1, and is fed into the model Q.

Discriminator D has two types of input at time t: real video sequences X^{t+1} = X^{t+1}_r ∪ X^{t+1}_a, where |X^{t+1}_r| = |X^{t+1}_a| = N/2, and the synthesized sequences X̃^{t+1} from the model G. The discriminator D's output is a two-class true/false classification.

Vehicle motion predictor Q adopts the model G's output in a U-Net [40] manner as input (i.e., copying layer-wise feature maps from model G to model Q), and predicts the vehicle motion of regular driving at time t+1. The predictor Q's output can be discrete (left-turn, right-turn, straight, and stop) or continuous (driving trajectory).
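For clarity, the sketch below shows how a mixed batch X^t could be assembled at each time step under the setup above; the sampler, function names, and tensor shapes are illustrative assumptions, not part of the original framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_clips(count, frame_shape=(128, 128, 3)):
    """Placeholder sampler standing in for frames drawn from preprocessed
    clips of length T; a real loader would read video frames at time t."""
    return rng.random((count, *frame_shape), dtype=np.float32)

def build_batch(batch_size=8):
    """X^t = X^t_r ∪ X^t_a: half regular driving, half accident samples."""
    half = batch_size // 2
    x_r = sample_clips(half)   # regular driving frames at time t
    x_a = sample_clips(half)   # traffic accident frames at time t
    return np.concatenate([x_r, x_a], axis=0), half

x_t, half = build_batch()
print(x_t.shape)  # (8, 128, 128, 3): first `half` samples are regular, rest accident
```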

A. Robust vehicle motion learning

Our overall learning objective at time t is defined as:

min_{G,Q} max_D { L_0^t(G,D) + λ1·L_1^t(G,Q) + λ2·L_2^t(G) + λ3·L_3^t(G) },   (1)

where four losses, L_0^t(G,D), L_1^t(G,Q), L_2^t(G), and L_3^t(G), are optimized, and λ1, λ2, and λ3 are three scale factors. The first loss L_0^t(G,D) enforces the synthesized sequences X̃^{t+1} to approach the real ones X^{t+1}. The second loss L_1^t(G,Q) improves the prediction accuracy of vehicle motion. The third loss L_2^t(G) learns more discriminable features between the regular driving sequences X_r and traffic accident sequences X_a. The fourth loss L_3^t(G) increases neuron coverage to enhance learning robustness.

The loss L_0^t(G,D) for video sequence prediction in Equation 1 is implemented as a min-max game between the generative model G and the discriminator D, as in popular generative adversarial networks:

L_0^t(G,D) = E_{X^t}[1 − D(G(X^t))] + E_{X^{t+1}}[log D(X^{t+1})].   (2)

The loss L_1^t(G,Q) for vehicle motion prediction in Equation 1 is defined as:

L_1^t(G,Q) = − (1 / |X_r^{t+1}|) · ‖Q(G(X_r^t)) − Y_r^{t+1}‖_2^2,   (3)

where Y_r^{t+1} denotes the vehicle motion labels corresponding to the regular driving sequences X_r^{t+1}.

The loss L_2^t(G) for discriminability in Equation 1 is implemented as averaged layer-level discriminability using maximum mean discrepancy:

L_2^t(G) = − (1 / |S_G|) Σ_{l∈S_G} ‖ E_{X_r^t}[Φ_G^l(X_r^t)] − E_{X_a^t}[Φ_G^l(X_a^t)] ‖_H^2,   (4)

where S_G is a set of layers in the model G, the function Φ_G^l(·) is the neuron activation of layer l in the model G, and H is a universal reproducing kernel Hilbert space. Equation 4 captures the statistical divergence between samples of regular driving and traffic accidents, and this divergence is trained to be enlarged in Equation 1.
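As a rough illustration of the layer-level discriminability term in Equation 4, the sketch below computes the squared distance between the mean layer activations of the regular and accident halves of a batch (a linear-kernel simplification of MMD); it is a sketch under our own assumptions, not the authors' exact implementation.

```python
import torch

def layer_discriminability_loss(layer_acts_r, layer_acts_a):
    """Negative averaged squared distance between mean activations of the
    regular (X_r) and accident (X_a) samples over a set of layers S_G.

    layer_acts_r, layer_acts_a: lists of tensors, one per layer in S_G,
    each of shape (batch/2, num_features).
    """
    total = 0.0
    for phi_r, phi_a in zip(layer_acts_r, layer_acts_a):
        mean_r = phi_r.mean(dim=0)            # E_{X_r}[Φ^l_G(X_r)]
        mean_a = phi_a.mean(dim=0)            # E_{X_a}[Φ^l_G(X_a)]
        total = total + (mean_r - mean_a).pow(2).sum()
    # Negative sign: minimizing this loss enlarges the divergence (Eq. 4).
    return -total / len(layer_acts_r)

# Toy usage: two layers with 64 features each, batch halves of size 4.
acts_r = [torch.randn(4, 64), torch.randn(4, 64)]
acts_a = [torch.randn(4, 64), torch.randn(4, 64)]
print(layer_discriminability_loss(acts_r, acts_a))
```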

Fig. 3. Comparison of a layer's original structure without filter reuse (a) and examples of the layer's re-designed structure based on our filter reuse with filter repeated times M = 4 (b) and M = 2 (c).

The loss L_3^t(G) for neuron-level robustness in Equation 1 is implemented as averaged neuron coverage:

L_3^t(G) = − (1 / |S_G|) Σ_{l∈S_G} [ (1 / |T_G^l|) Σ_{n∈T_G^l} ( R(X_r^t, l, n) + R(X_a^t, l, n) ) ],   (5)

where T_G^l is the set of all neurons at layer l in the model G, and the function R(·) computes neuron coverage. Specifically, for the nth neuron at layer l in the model G, R(X_r^t, l, n) computes the neuron coverage for the input regular driving sequences X_r^t, while R(X_a^t, l, n) computes the neuron coverage for the input traffic accident sequences X_a^t. The function R(·) is implemented as the distance between the maximum response and the minimum response given a neuron and input samples:

R(X_*^t, l, n) = | max(Φ_G^{l,n}(X_*^t)) − min(Φ_G^{l,n}(X_*^t)) |,   (6)

where X_*^t is X_r^t or X_a^t according to Equation 5, and Φ_G^{l,n}(X_*^t) is the neuron activation of the nth neuron at layer l in the model G given input X_*^t. The physical meaning is that the largest distance between samples' neuron activations at each epoch is trained to be enlarged, encouraging each neuron's activation interval to be more adequately used.
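A minimal sketch of the neuron coverage term in Equations 5 and 6 follows; it assumes per-layer activation tensors are already available and uses our own function names, so it is an illustration rather than the authors' implementation.

```python
import torch

def neuron_coverage(acts):
    """R(X, l, n) for every neuron n of one layer: the span between the
    maximum and minimum activation over the samples in the batch half.

    acts: tensor of shape (num_samples, num_neurons).
    Returns a tensor of shape (num_neurons,).
    """
    return (acts.max(dim=0).values - acts.min(dim=0).values).abs()

def neuron_coverage_loss(layer_acts_r, layer_acts_a):
    """Negative averaged neuron coverage over layers S_G (Eq. 5); minimizing
    it widens the activation intervals exercised by training samples."""
    per_layer = []
    for phi_r, phi_a in zip(layer_acts_r, layer_acts_a):
        cov = neuron_coverage(phi_r) + neuron_coverage(phi_a)
        per_layer.append(cov.mean())          # average over neurons T_G^l
    return -torch.stack(per_layer).mean()     # average over layers, negated

# Toy usage: two layers, 64 neurons each, batch halves of size 4.
acts_r = [torch.randn(4, 64), torch.randn(4, 64)]
acts_a = [torch.randn(4, 64), torch.randn(4, 64)]
print(neuron_coverage_loss(acts_r, acts_a))
```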

B. Efficient training strategy

Our efficient training strategy is to enforce filter reuse at the layers S_G in the generative model G, while G's architecture is automatically and locally re-designed for accuracy compensation. Our accuracy compensation method combines dropout [41] and shuffling [42], since these two operations are both time efficient and memory efficient. Figure 3 compares a layer's original structure without filter reuse (a) and examples of the layer's re-designed structure based on our filter reuse with filter repeated times M = 4 (b) and M = 2 (c). In Figure 3, weight connections (filters) with the same color between input and output feature maps are identical. All layers in S_G adopt the same filter repeated times M in our method. In Figure 3 (b) and (c), the upper and bottom output feature maps, F and G, share the same filters but use different dropout ratios, so the outputs of F and G differ. The feature maps F and G are concatenated and then shuffled. In addition, filter reuse can be applied to the discriminator D and the vehicle motion predictor Q, while we focus on the model G only in this paper.
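The sketch below gives one possible reading of the filter reuse scheme in Figure 3: a convolution whose filters are repeated M times, with a different dropout mask per repeat, followed by concatenation and a channel shuffle. The module name and specific layer choices are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class FilterReuseConv(nn.Module):
    """Conv layer whose `out_channels // M` distinct filters are reused M times.
    Each repeat gets its own dropout mask; outputs are concatenated and the
    channels are shuffled (as in ShuffleNet) to mix the repeated groups."""

    def __init__(self, in_ch, out_ch, m=4, drop_base=0.1):
        super().__init__()
        assert out_ch % m == 0
        self.m = m
        self.conv = nn.Conv2d(in_ch, out_ch // m, kernel_size=3, padding=1)
        # A different dropout ratio per repeat so the M copies are not identical.
        self.drops = nn.ModuleList(nn.Dropout2d(drop_base * (i + 1)) for i in range(m))

    def forward(self, x):
        base = self.conv(x)                                  # shared filters
        groups = [drop(base) for drop in self.drops]         # M dropped copies
        out = torch.cat(groups, dim=1)                       # concat along channels
        # Channel shuffle: (N, M, C/M, H, W) -> swap group dims -> flatten back.
        n, c, h, w = out.shape
        out = out.view(n, self.m, c // self.m, h, w).transpose(1, 2).reshape(n, c, h, w)
        return out

# Toy usage: 16 -> 32 channels with filter repeated times M = 4.
layer = FilterReuseConv(16, 32, m=4)
print(layer(torch.randn(2, 16, 64, 64)).shape)  # torch.Size([2, 32, 64, 64])
```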

Filter reuse with a higher filter repeated times M removes more structural network parameters and encourages more memory reuse, hence the training process can be further accelerated. However, two problems are incurred. First, the prediction accuracy of vehicle motion may be more severely sacrificed. Second, the averaged neuron coverage, −L_3^t(G) in Equation 5, is more easily increased, which does not reflect an actual improvement of neuron-level robustness. Therefore, to guarantee the effectiveness of our robust vehicle motion learning, the filter repeated times M is adaptively decreased as G's averaged neuron coverage −L_3^t(G) in Equation 5 increases. Specifically, M is initially set to 2^k (k ≥ 1) and is halved each time −L_3^t(G) keeps increasing for C training iterations. Please note that the averaged neuron coverage −L_3^t(G) will not keep increasing throughout the whole training process; the reason is that training data always have a biased distribution [43], hence the whole interval of a neuron's activations cannot be fully covered. When the filter repeated times M decreases from 4 to 2, the filters do not change except for each filter's group index (as shown in Figure 3(c)).
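A compact sketch of the adaptive schedule described above: M starts at 2^k and is halved whenever the averaged neuron coverage −L_3^t(G) has kept increasing for C consecutive iterations. The class and variable names are illustrative assumptions.

```python
class AdaptiveFilterReuse:
    """Halve the filter repeated times M after the averaged neuron coverage
    (-L3) has increased for C consecutive training iterations."""

    def __init__(self, m_init=8, c_iters=100):
        self.m = m_init            # filter repeated times, a power of two
        self.c_iters = c_iters     # patience C
        self.streak = 0            # consecutive iterations with rising coverage
        self.prev_coverage = float("-inf")

    def step(self, avg_coverage):
        """Call once per iteration with the current averaged neuron coverage."""
        self.streak = self.streak + 1 if avg_coverage > self.prev_coverage else 0
        self.prev_coverage = avg_coverage
        if self.streak >= self.c_iters and self.m > 1:
            self.m //= 2           # e.g. 8 -> 4 -> 2
            self.streak = 0
        return self.m

# Toy usage: coverage rises steadily, so M is halved every 100 iterations.
sched = AdaptiveFilterReuse(m_init=8, c_iters=100)
for it in range(300):
    m = sched.step(avg_coverage=0.01 * it)
print(m)
```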

IV. EXPERIMENTS

Regular driving data X_r^t (t = 1, 2, ..., T) is sampled from the BDD100K [4] dataset. This dataset consists of 100K driving videos collected from diverse driving scenarios, including city streets, residential areas, and highways. In addition, those videos cover various weather conditions and different times of day. Each video is about 40 seconds long and contains not only high-resolution (720p), high-frame-rate (30fps) images, but also GPS/IMU information recorded by cell phones showing rough driving trajectories.

Accidental driving data X_a^t (t = 1, 2, ..., T) is sampled from the AADV [8] dataset. This dataset consists of 678 videos captured in six major cities in Taiwan, covering diverse accidents: 42.6% motorbike hits car, 19.7% car hits car, 15.6% motorbike hits motorbike, and 20% other types. 1750 clips are sampled and annotated from those videos, where each clip consists of 100 frames. Among those 1750 clips, 620 clips are labeled as positive because they contain the moment of an accident in the last 10 frames, and the remaining 1130 clips with no accidents are labeled as negative.

Popular network architectures are directly used for our generative model G, discriminator D, and vehicle motion predictor Q, since this paper's focus is not on designing new DNNs but on filter reuse. For the model G, we adopt a neural stylizer from [44] with a 128x128 input resolution. The neural stylizer consists of two convolutions with stride two, five residual blocks [18], and two fractionally strided convolutions with stride one. For the model D, we adopt a 70x70 PatchGAN [45]. PatchGAN is fully convolutional and has fewer parameters than a traditional whole-image discriminator; it distinguishes whether 70x70 overlapping patches are real or fake. For the model Q, we adopt a two-layer fully convolutional network.

TABLE I
ACCURACY COMPARISON OF FCN-LSTM AND OUR METHOD IN DISCRETE VEHICLE MOTION PREDICTION.

Model            Accuracy
FCN-LSTM [48]    84.1%
Our method       85.2%

TABLE II
COMPARISON OF THE REGULAR METHOD AND OUR METHOD IN DISCRETE VEHICLE MOTION PREDICTION.

Method                       Time (s)    Accuracy
Regular (no filter reuse)    316,512     85.4%
Ours (filter reuse)          287,227     85.2%

Our filter repeated times M is initially set to 8. Since we use a U-Net connection between model G and model Q, |S_G| is set to 2, which is the number of layers of model Q, and S_G includes model G's last two layers, i.e., the two fractionally strided convolutions. The scale factors λ1, λ2, and λ3 in Equation 1 are all fixed to 1.0. The Adam solver [46] is used for the whole model in Figure 2, with parameters β1 = 0.5, β2 = 0.999, and an initial learning rate of 0.0001. The batch size is set to 8 in all of our experiments. To reduce mode collapse during GAN training, we follow the strategy proposed in [47], which updates the discriminator using a history of generated images rather than only the most recent ones.

A. Discrete vehicle motion prediction

To evaluate the performance of our proposed model, we first compare the accuracy of discrete vehicle motion prediction of our model with that of FCN-LSTM [48]. In the discrete case, we aim to predict a feasible action from a discrete action space, including straight, stop, left turn, and right turn. We choose FCN-LSTM as the baseline model for comparison; it fuses an LSTM temporal encoder with a fully convolutional visual encoder. As Table I shows, our proposed model achieves 85.2% accuracy, which is 1.1% higher than that of the FCN-LSTM model.

B. Our training time & GPU memory use

Table II compares the regular method and our filter reuse method in discrete vehicle motion prediction. As we can see, our method improves training performance over the traditional approach with marginal accuracy loss: the training time is reduced by 10.2% with only 0.2% accuracy loss. Figure 4 shows the impact of different values of M on GPU memory use. As the filter repeated times M decreases from 8 to 2, the measured peak and regular GPU memory use increase by 1.3x and 1.2x, respectively.

Fig. 4. The change of GPU memory use along with filter repeated times (M).

V. CONCLUSION

In this paper, we propose an efficient framework for learning robust self-driving behaviors, which are effectively learned from both regular driving samples and accidental driving data. To enhance the robustness of the proposed model, both layer-level discriminability and neuron coverage (neuron-level robustness) regularizers are integrated into an unsupervised generative network. To speed up the training process, a neuron-coverage-based adaptive filter reuse strategy is designed to reduce structural network parameters and facilitate memory reuse. A set of comprehensive experiments is conducted, and the evaluation results demonstrate the effectiveness and efficiency of our proposed framework.

REFERENCES

[1] D. A. Pomerleau, “ALVINN: An Autonomous Land Vehicle in a Neural Network,” Advances in Neural Information Processing Systems (NIPS), 1989.
[2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision Meets Robotics: The KITTI Dataset,” International Journal of Robotics Research (IJRR), 2013.
[3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, “BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling,” arXiv:1805.04687v1, 2018.
[5] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving,” International Conference on Computer Vision (ICCV), 2015.
[6] L. Yang, X. Liang, T. Wang, and E. Xing, “Real-to-Virtual Domain Unification for End-to-End Autonomous Driving,” European Conference on Computer Vision (ECCV), 2018.
[7] X. Liang, T. Wang, L. Yang, and E. P. Xing, “CIRL: Controllable Imitative Reinforcement Learning for Vision-based Self-driving,” European Conference on Computer Vision (ECCV), 2018.
[8] F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, “Anticipating Accidents in Dashcam Videos,” Asian Conference on Computer Vision (ACCV), 2016.
[9] H. Kataoka, T. Suzuki, Y. Aoki, and Y. Satoh, “Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DB,” International Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[10] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, “Learning Long-Range Vision for Autonomous Off-Road Driving,” Journal of Field Robotics (JFR), 2009.
[11] J. Zhang and K. Cho, “Query-Efficient Imitation Learning for End-to-End Simulated Driving,” AAAI Conference on Artificial Intelligence, 2017.
[12] K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. C. Niebles, and M. Sun, “Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization,” International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.


[13] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang, “DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems,” IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018.
[14] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[15] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene CNNs,” arXiv preprint arXiv:1412.6856, 2014.
[16] A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018.
[17] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to End Learning for Self-Driving Cars,” arXiv:1604.07316v1, 2016.
[18] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Going Deeper: Autonomous Steering with Neural Memory Networks,” International Conference on Computer Vision (ICCV) Workshops, 2017.
[19] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end Learning of Driving Models from Large-scale Video Datasets,” International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[20] X. Pan, Y. You, Z. Wang, and C. Lu, “Virtual to Real Reinforcement Learning for Autonomous Driving,” The British Machine Vision Conference (BMVC), 2017.
[21] A. Ghosh, B. Bhattacharya, and S. B. R. Chowdhury, “SAD-GAN: Synthetic Autonomous Driving Using Generative Adversarial Networks,” Advances in Neural Information Processing Systems (NIPS) Workshops, 2016.
[22] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, “Improving the robustness of deep neural networks via stability training,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[23] Z. Yan, Y. Guo, and C. Zhang, “Deep defense: Training DNNs with improved adversarial robustness,” in Advances in Neural Information Processing Systems, 2018, pp. 419–428.
[24] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[25] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri, “Transformation invariance in pattern recognition - tangent distance and tangent propagation,” in Neural networks: tricks of the trade. Springer, 1998, pp. 239–274.
[26] O. Chapelle, J. Weston, L. Bottou, and V. Vapnik, “Vicinal risk minimization,” in Advances in Neural Information Processing Systems, 2001, pp. 416–422.
[27] H. Yang, J. Zhang, H.-P. Cheng, W. Wang, Y. Chen, and H. Li, “Bamboo: Ball-shape data augmentation against adversarial attacks from all directions,” in CEUR Workshop Proceedings, 2019.
[28] A. Kuznetsova, S. Ju Hwang, B. Rosenhahn, and L. Sigal, “Expanding object detector’s horizon: Incremental learning framework for object detection in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 28–36.
[29] I. Misra, A. Shrivastava, and M. Hebert, “Watch and learn: Semi-supervised learning for object detectors from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3593–3602.
[30] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
[31] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[32] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[33] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[34] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning,” Conference on Neural Information Processing Systems (NeurIPS), 2017.
[35] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[36] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using LSTMs,” in International Conference on Machine Learning, 2015, pp. 843–852.
[37] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in Atari games,” in Advances in Neural Information Processing Systems, 2015, pp. 2863–2871.
[38] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in Neural Information Processing Systems, 2016, pp. 64–72.
[39] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion GAN for future-flow embedded video prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2017.
[40] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” Springer Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[41] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” International Conference on Machine Learning (ICML), 2015.
[42] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[43] A. Torralba and A. A. Efros, “Unbiased Look at Dataset Bias,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[44] J. Johnson, A. Alahi, and F. F. Li, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” European Conference on Computer Vision (ECCV), 2016.
[45] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[46] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations (ICLR), 2015.
[47] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[48] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2174–2182.