Robust Semantic Segmentation in Adverse Weather Conditions ... · their methods in adverse weather...

Robust Semantic Segmentation in Adverse WeatherConditions by means of Sensor Data Fusion

Andreas Pfeuffer, and Klaus DietmayerInstitute of Measurement, Control, and Microtechnology, Ulm University, 89081 Ulm, Germany

Email: {andreas.pfeuffer, klaus.dietmayer}@uni-ulm.de

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this versionmay no longer be accessible.

Abstract—A robust and reliable semantic segmentation inadverse weather conditions is very important for autonomouscars, but most state-of-the-art approaches only achieve highaccuracy rates in optimal weather conditions. The reason is thatthey are only optimized for good weather conditions and givennoise models. However, most of them fail, if data with unknowndisturbances occur, and their performance decrease enormously.One possibility to still obtain reliable results is to observe theenvironment with different sensor types, such as camera andlidar, and to fuse the sensor data by means of neural networks,since different sensors behave differently in diverse weatherconditions. Hence, the sensors can complement each other bymeans of an appropriate sensor data fusion. Nevertheless, thefusion-based approaches are still susceptible to disturbances andfail to classify disturbed image areas correctly. This problemcan be solved by means of a special training method, the socalled Robust Learning Method (RLM), a method by whichthe neural network learns to handle unknown noise. In thiswork, two different sensor fusion architectures for semanticsegmentation are compared and evaluated on several datasets.Furthermore, it is shown that the RLM increases the robustnessin adverse weather conditions enormously, and achieve goodresults although no disturbance model has been learned by theneural network.

I. INTRODUCTION

A reliable environment recognition in adverse weather con-ditions is a quite challenging task for autonomous driving.While many applications, such as the semantic segmentationof the camera streams, achieve very good results in goodweather conditions, they fail in non optimal weather conditionswhen the sensors are disturbed by rain, snow, fog or evenby dazzling sun, and their accuracy decreases enormously.Usually, the sensors of autonomous cars fail asymmetrically inbad weather conditions. For instance, in blinding sunlight, thecamera images are disturbed by large white areas so that thereis no information available about the environment. In contrast,the corresponding environment can still be observed by thelidar sensor, which is not disrupted by blinding sun. Thisobservation motivates us to use a multiple sensor-setup, e.gcamera image and lidar data, or the image of a stereo cameraand the corresponding depth image, to increase the robustnessof the semantic labeling approaches. However, the robustnessof these approaches is not necessarily improved by the fusionof multiple sensor data, as the example in Fig. 1 shows. Inthis example, the camera image is artificially disturbed byfog so that the cars in the image are less visible, while thecorresponding depth image is not disturbed. It is expected thatthe cars are still recognized completely, since the cars can be

Fig. 1. Qualitative comparison of the state-of-the-art training method withproposed learning technique by means of the Foggy Cityscapes Dataset.

clearly seen in the depth image. However, large parts of thecars are not classified correctly as car but as building and road.The reason is that state-of-the-art approaches are particularlyvulnerable if one sensor stream is corrupted by noise theneural network has not seen during training. By simulating aspecial noise pattern, the performance can be increased for thistype of noise. For example, Sakaridis et al. [20] have createdsynthetic fog data, and have shown that they achieve a greateraccuracy on real foggy data. However, it is almost impossibleto record or simulate training data covering all adverse weatherscenarios and all possible sensor disturbances, since the noisein diverse weather conditions is rather complex. Therefore,a suitable training method is necessary so that the networklearns to focus on the undisturbed sensor data and ignores thedisrupted sensor data of unknown noise types.

In this work, a robust semantic segmentation approach ispresented, which can deal with different adverse weather situ-ations and unknown noise pattern due to an advanced trainingmethod, the so called Robust Learning Method (RLM). Theproposed approach is evaluated on different datasets and bymeans of various noise types, which have not been used fortraining. It turns out that the proposed approach outperformsother state-of-the-art approaches. For instance, the proposedapproach can classify the cars in the introductory example(see Fig. 1) completely, while the state-of-the-art approachonly classifies a tiny fractions of the cars correctly. Moreover,two different fusion architectures (Early and Late Fusion)are investigated to find out which fusion strategy is moreappropriate for a robust semantic segmentation.

arX

iv:1

905.

1011

7v1

[cs

.CV

] 2

4 M

ay 2

019

II. RELATED WORK

In recent years, convolution neural networks (CNNs) arewidely used for computer vision tasks such as semanticsegmentation approaches, which assign every pixel of theimage a predefined class. Early CNN-based approaches areoriginally derived from the image classification task and oftenconsist of an encoder-decoder architecture [2, 22]. Theirperformance has continuously improved in several ways, e.g.by multi-scale architectures [5, 6, 27], or the usage of temporalinformation [23, 25]. Additionally, some approaches focus onthe reduction of the inference time [15, 26], so that theyare practicable for real-time applications such as autonomousdriving or mobile robots. Despite pure camera-based semanticsegmentation approaches, there are also semantic segmentationapproaches for other sensors, e.g. for lidar [4, 9, 18], or forradar [21].

The segmentation accuracy can be increased further byusing information of several sensors. For instance, the fusionof RGB and depth image of a stereo camera is very popularin the literature. Generally, there are two possibilities to fuseimage and depth information by means of neural networks. Onthe one hand, 2D semantic image segmentation approachesare extended by an additional channel for the depth image[14, 24], which is also known as early fusion in the literature.On the other hand, the fusion is performed at feature level,which is denoted as late fusion. For example, FuseNet [10]determines two separate feature maps for camera and depthimage by means of two VGG16-encoders, which are combinedbefore they are fed into a common decoder. In [13], cameraand depth image are processed by two independent neuralnetworks, and the two resulting predictions are combined inthe end. Although fusing camera and lidar data is very popularin object detection tasks [7, 11, 16], there are hardly anysimilar approaches for semantic segmentation, which is mainlycaused by the fact that a public available dataset for semanticsegmentation containing camera and lidar data does not yetexist. Nevertheless, a fusion approach for semantic labeling isintroduced in this work, which is based on camera and lidardata, and is evaluated on our in-house dataset.

Most semantic segmentation and computer vision ap-proaches focus on achieving high scores at well-known bench-marks, but they do not care much about the robustness oftheir methods in adverse weather conditions. In fact, thereare only a few approaches which can deal with disturbancescaused by diverse weather often focusing on one specialnoise type. For instance, Sakaridis et al. [20] increase thesegmentation accuracy in fog by adding simulated fog to real-world images, and by training with these synthetic data. Poravet al. [19] derain rainy images by means of a neural networkand achieve better performance, when a further segmentationnetwork is applied on the derained images. Bijelic et al. [3]introduces the Robust Learning Technique, by which an imageclassifier becomes more robust against unknown disturbancesnot occurring in the training set. The authors randomly choseone of the input channels (camera image of depth image)

and replace the corresponding input data by an arbitrarilyselected sensor data of another, different scene. In [16], atraining strategy is described which increases the robustnessof an object detector in diverse weather conditions by fittingwhite polygons into the training data. Based on this trainingstrategy, a robust semantic segmentation approach is proposedhereinafter, which achieves good results on disturbed adverse-weather data.

III. NETWORK ARCHITECTURES

In this section, the proposed sensor fusion architecturesare described, which are based on the ICNet [26], a purecamera-based semantic segmentation approach. The ICNet isa real-time capable semantic segmentation method, which stilldelivers high performance by progressively processing theinput image in multiple resolutions. More precisely, the inputimage is downsampled twice so that three different brancheswith input images of scale 1, 1/2 and 1/4 are yielded. Eachbranch is processed independently and combined at the end ofthe network by means of Cascade Feature Fusion (CFF) layers[26]. First, the determined feature maps of the low resolutionbranch and the medium resolution branch are fused using aCFF layer. Then, the output of the CFF layer is concatenatedwith the feature map of the high resolution branch by meansof an additional CFF layer. The fused resolution branches areupsampled to the size of the input image using interpolationand further convolution layers. Finally, a softmax classifier isapplied to predict the class of each image pixel. The describedorigin ICNet is extended so that different sensor data, e.g.camera and lider, can be fused by means of the neural network.First of all, the camera image and the lidar data have to bepreprocessed so that they are in an appropriate input format.The camera image is first resized to an appropriate size andthe image mean is subtracted from the image, so that it isindependent from the image illumination. The lidar data isused to determine a dense depth image similar to the disparitymap of a stereo camera, as described in [16]. For this, thelidar data is first filtered so that only those points located inthe camera field of view are left. The remaining 3D points arethen projected onto the image plane so that a sparse depth mapresults. In a next step, the pixels without depth informationare filled by interpolating between neighboring points. Note,if there is no point within a predefined neighborhood, thecorresponding pixel is set to infinity. In this work, two differentnetwork architectures for an appropriate sensor fusion areconsidered, the so called Early Fusion and the Late Fusion. Inthe Early Fusion, the preprocessed sensor data are fused rightat the beginning, and in the Late Fusion, the sensor data arefused at the end of the neural network. Both fusion strategiesare described in more detail in the next sections.

A. Early Fusion

The idea of the Early Fusion is, that the camera imageand the corresponding depth image are concatenated to acommon input tensor and are processed together throughoutthe network, so that the neural network can learn by itself

Fig. 2. Early Fusion Approach: RGB image and the depth image determined from the lidar data are fit into a common 4D tensor and progressed together inthe entire neural network

Fig. 3. Late Fusion Approach: For each sensor and for each resolution, an independent feature map is determined by distinct feature encoders. The cameraand lidar feature maps of one scale are then concatenated before the different scales are fused by means of CFF layers.

how to fuse the sensor data. For this, the input dimension ofthe origin ICNet is extended from 3D to 4D. The first threedimensions contain the three color channels of the cameraimage and the 4th dimension contains the determined densedepth image. The common 4D input tensor is fed into the threebranches with different scales to yield the semantic labels ofthe corresponding input data. The net architecture of the EarlyFusion is illustrated in Fig. 2.

B. Late Fusion

Another fusion strategy is the Late Fusion, where thedifferent sensor data streams are first processed separatelybefore they are fused at the end of the neural network. Themotivation is that each sensor has sensor specific properties,

which might get lost in the Early Fusion. If each sensor is han-dled independently instead, the senor specific properties canbe considered better. Another advantage is that disturbancesof one sensor do not affect the other sensors until the fusionof their feature maps. Generally, the architecture of the LateFusion is similar to the original ICNet, but the amount ofresolution branches is doubled, where two branches have gotthe same resolution s (s ∈ {1; 1/2; 1/4}). The camera imageis fed into one branch of resolution s, while the preprocessedlidar data is fed into the other branch of resolution s. Thefeature maps of camera and lidar data of resolution s areconcatenated, before they are fused with the other featuremaps of different resolutions by means of the CFF layers.The described Late Fusion architecture is shown in Fig 3.

IV. ROBUST LEARNING METHOD

The problem of most state-of-the-art approaches fusingdifferent sensor data by means of neural networks is thatthey are quite vulnerable to sensor disturbances, which do notoccur during training, and hence, the performance of theseapproaches usually decreases in these scenarios enormously.An intuitive explanation is that the neural network has learneda joint representation of the individual sensor streams, e.g. ifthe object can be clearly seen in all sensor data, it can beclassified correctly. However, if one sensor stream is disturbedby unknown noise, the object cannot be classified correctly.Therefore, the goal is that the neural network should learn an“OR”-connection instead of an “AND”-connection. In otherwords, the neural network should still function correctly, if onesensor is disrupted by neglecting the disturbed data and usingthe undisturbed data instead. Therefore, the training strategydescribed in [16] is used, which overcomes this problem. Forthis, the clear, undisturbed data recorded in optimal weatherconditions are modified for training to increase the robustnessof the neural network. Analogous to [16], one of the sensorstreams is randomly chosen and set to infinity to simulatea sensor failure. Furthermore, white polygons are randomlyfitted in the camera or the depth image, which representspartially disturbed data. In total, the good weather data, partlydisturbed data and completely impaired data are used at theratio of 1:2:4 for training the proposed semantic segmentationapproaches, and is also called Robust Learning Method (RLM)in the following. In the evaluation part, the traditional state-of-the-art learning method (SLM) and the RLM are compared,and their robustness is analyzed, if one of the sensor streamsis disturbed by unknown noise.

V. EVALUATION

In this section, the proposed approaches are evaluated quali-tatively on two datasets, the AtUlm-Dataset and the Cityscapesdataset [8]. The AtUlm dataset is a non public availabledataset, which consists of 1446 fine-annotated grayscalesimages of a wideangle camera (size 850×1920), and contains12 classes (road, sidewalk, pole, traffic light, traffic sign,pedestrian, bicyclist, car, truck, bus, motorcyclist, and back-ground), which are defined in accordance to the Cityscapesdataset. The data consists of urban and rural scenarios and wasrecorded in good weather conditions by our test-vehicle [12]in the surroundings of Ulm (Germany). Furthermore, for eachcamera image, the corresponding lidar data of four Velodyne16 are available, which are mounted on four different positionsin the car roof (front, left, right, and rear). The dataset issplit into two subsets: 1054 images are used for training, andthe remaining 392 images are used for evaluation. Note, thatthe training and validation set consists of different sequencesrecorded at different locations.

For reproducibility, the different approaches are also eval-uated on the Cityscapes dataset [8], which consists of severalRGB images of 50 German cities and the corresponding depthimages and semantic labels. Due to the lack of lidar data,the depth image of the stereo camera is used instead of the

depth image determined form the lidar data, assuming thatcamera and stereo data are delivered by two separate sensorsand are disturbed independently from each other (knowingthat this does not correspond to reality). Similar to otherworks [17, 26, 27], 19 of the available 30 classes are usedfor training. The proposed approaches are trained on the 2975fine-annotated camera images and evaluated by means of the500 validation images, since the ground-truth of the test-set isnot publicly available.

For a fair comparison of the proposed approaches, thetraining hyper-parameters are chosen identical for each train-ing. Each approach is implemented in Tensorflow [1] andis trained on a single Nivida Titan X on the training setby means of the Stochastic Gradient Descent (SGD) with amomentum equal to 0.9 and a weight decay of 0.0001. Abatch size of two is chosen due to memory reasons and theinitial learning rate is set to 0.001, which is decreased bymeans of the poly-learning rate policy. The training loss isidentical to the original ICNet implementation [26], which isthe weighted sum of the cross-entropy losses of each resolutionbranch. The models are trained for 60k iterations in caseof the AtUlm dataset and for 100k iterations in case of theCityscapes dataset. During training, the images are randomlyflipped and randomly resized between 0.5 and 2.0 of the originimage size. For compensating the training fluctuations, eachapproach is trained for three times and the average is taken.The training weights of the network are initialized with apretrained ImageNet network. However, the kernel weights ofthe first convolutional layer are initialized randomly using aGaussian distribution, since the input dimension of the Earlyand Late Fusion varies. Note, that our ICNet implementationdiffers from the origin implementation described in [26], sincewe do not apply the model compression afterwards, but trainthe model with half of its feature size. Due to this and dueto the small batch size, our ICNet implementation achievesa mIoU of 64.7% (batch size 2) on the Cityscapes dataset,while the mIoU of the origin ICNet implementation proposedin [26] yields 67.7% (batch size 16).

In the next sections, the proposed methods are evaluatedby means of pixel-wise accuracy (acc.) and mean Intersec-tion of Union (mIoU). For this, two cases are considered.First, the fusion-based approaches, a pure camera-based ap-proach (ICNet img) and a pure depth-image-based approach(ICNet depth) are compared in good-weather conditions andin scenarios, in which the sensors are disturbed by knownnoise. Moreover, the robustness of the described methods istested by means of noise and data artifacts, the neural networkhas not seen during the training process. Hereinafter, the LateFusion approach trained by SLM is called LateFusion SLMand the Late Fusion approach trained by RLM is denoted asLateFusion RLM.

A. Evaluation at Good Weather Conditions

The various, proposed approaches are now evaluated ingood weather conditions on the AtUlm and Cityscapes datasetusing SLM and RLM for training. Furthermore, they are also

Fig. 4. Qualitative comparison of the state-of-the-art training method (SLM)with proposed learning technique (RLM) by means of the adverse AtUlmDataset.

evaluated on a simulated adverse weather dataset, the so-called adverse AtUlm dataset and adverse Cityscapes dataset,to analyses, how the learned models behave towards knowndisturbances, such as white polygons randomly fitted in eitherthe camera or the depth image. Adverse AtUlm and adverseCityscapes consist of good weather data of the correspondingdataset (≈ 14.3%), partially disturbed data in which whitepolygons are randomly fitted in either camera or depth image(≈ 57.1%), and complete impaired data where one of thesensor streams is set to infinity (≈ 28.6%). Each approachis trained on AtUlm, adverse AtUlm, Cityscapes and adverseCityscapes by means of SLM and RLM. The correspondingresults are presented in Table I and Table II. Generally, theperformance increases the later the sensor data are fused,as results in good and bad weather conditions show. Thereason for this is that the sensor data influences each otherin an early fusion. If one sensor is disrupted, the completefeature map is disturbed, and hence, the performance of thesegmentation approach declines. The Late Fusion approachesdeliver the best results for each dataset while ICNet depthperforms worst. This is caused by the fact that the depth imagebased approaches can detect the objects at close range, but theycannot classify them correctly. For instance, buses and truckslook similar in the depth data, so that the network cannotdistinguish between these two classes. Furthermore, the depthimage does only deliver little far-range information, and hence,the correct class cannot be predicted correctly.

The comparison of both learning methods indicates that theRLM-based approaches outperforms the SLM on the adversedatasets, however at the expense of accuracy and mIoU inoptimal weather conditions. The reason is that the RLM-basedapproaches have learned to detect coarse structures in theimage, especially at disturbed sensor data, and small imagedetails often get lost, such as the traffic signs in the backgroundof Fig. 4. Hence, the RLM-based approaches have a lowerclass-wise accuracy for small objects than the SLM-basedmethods in good weather conditions (traffic sign: 56.81%(RLM) versus 67.88% (SLM); traffic light: 56.51% (RLM)versus 58.52% (SLM)), while it is similar or even greater forlarger structures (road: 97.66% (RLM) versus 97.55% (SLM);

TABLE IEVALUATION ON ATULM DATASET

adverse AtUlm AtUlmApproach acc. (%) mIoU (%) acc. (%) mIoU (%)

ICNet (img only) – – 96.09 51.66ICNet (lidar only) – – 92.50 32.02

Early Fusion (SLM) 93.41 42.54 96.17 51.43Early Fusion (RLM) 94.81 45.27 96.08 50.30Late Fusion (SLM) 93.66 44.30 96.25 54.15Late Fusion (RLM) 95.28 46.72 96.20 52.04

TABLE IIEVALUATION ON CITYSCAPES DATASET

adverse Cityscapes CityscapesApproach acc. (%) mIoU (%) acc. (%) mIoU (%)

ICNet (img only) – – 93.25 64.73ICNet (depth only) – – 86.65 43.65


trucks: 82.96% (RLM) versus 76.95% (SLM)). Nevertheless,the RLM-based approaches are more robust against sensordisturbances and achieve much better results in this case.Additionally, they are less vulnerable to unknown noise asit will be shown in the next section.

B. Evaluation Using Unknown Disturbances

In this section, the proposed fusion approaches are evaluatedhow they behave if an unknown disturbance occurs in thesensor data and three different noise types (fog, blinding sun,and rain) are considered in more detail. First, the trainedmodels are evaluated in fog by means of the Foggy Cityscapesdataset [20]. Foggy Cityscapes is a synthetic fog dataset,which is derived from the Cityscapes Dataset. The real-world images of the Cityscapes dataset are modified byadding synthetic fog of varying density. The fog densityand the corresponding visibility range depends on a constantattenuation coefficient β. In the dataset, foggy images withβ ∈ {0.005; 0.010; 0.020} are provided, which correspondto visibility ranges of 800m, 400m and 200m respectively.The proposed fusion approaches, ICNet img and ICNet depthare trained on the Cityscapes dataset by means of SLM andRLM, and are evaluated on the Foggy Cityscapes for eachβ. In Fig. 5, their mIoU-values are shown as a function ofβ. Their performances decrease the denser the fog is exceptfor ICNet depth, whose performance is constant, since it onlydepends on the depth image, and hence is not disturbed by thefog in the camera image. However, the performances decreasedifferently. For approaches trained by SLM the performancedecreases much more, and at dense fog (β = 0.020), theirmIoU-values are even lower than the pure depth based ap-proach. In contrast, the RLM-trained methods perform muchbetter, e.g. LateFusion RLM outperforms the other approachesby at least 15% and ICNet depth by about 5% (see Table IIIfor more details). Furthermore, the proposed methods are also

Fig. 5. Performance course (mIoU) as function of β evaluated on the FoggyCityscapes dataset

TABLE IIIEVALUATION ON FOGGY CITYSCAPES

without fog dense fog (β = 0.02)Approach acc. (%) mIoU (%) acc. (%) mIoU (%)

ICNet (img) 93.25 64.73 74.09 31.40ICNet (depth) 86.65 43.62 86.65 43.62


evaluated qualitatively on Foggy Cityscapes. Fig. 1 and Fig. 7show some examples of the Late Fusion approach trained withSLM and RLM using the densest fog (β = 0.020). It turns outthat both learning techniques perform similarly at close range,while RLM outperforms the other one in the long-distancerange. For instance, the RLM still detects vehicles slightlycovered by fog. In contrast, both approaches fail to predictsmall objects such as traffic signs occurring in the distance.

The RLM-based approaches are also more robust, if thecamera is blinded by the sun, as it will be shown on sunnyimages of the AtUlm dataset. In case of dazzling sun, partsof the image are disturbed by white spots. The state-of-the-art methods do not assign the disturbed pixels as car,while the RLM-based methods predict most disturbed pixelscorrectly, as shown in Fig 8. Nevertheless, the RLM-basedapproaches also have problems to detect vehicles, which arefar away, such as the car in the last row of Fig 8. However,this is mainly caused by sensor limitations, since the usedlidars (Velodyne 16) deliver only one or two points for thesevehicles, and hence, a precise classification is not possible anymore. A more accurate lidar will increase the classification ratein far range further. Due to the lack of real blinding sun data,a qualitative evaluation is not possible. Instead, a second usecase is considered, in which synthetic rain is added to the real-

Fig. 6. Proposed approaches evaluated at different rain intensities (light,moderate, and heavy rain)

TABLE IVEVALUATION IN RAIN (ATULM DATASET)

sunny heavy rainApproach acc. (%) mIoU (%) acc. (%) mIoU (%)

ICNet (img) 96.09 51.66 84.77 16.38ICNet (depth) 92.50 32.02 92.50 32.02


world images of the AtUlm dataset. The rain is simulated byrandomly drawing N small lines of pixel length l in the cameraimage, which randomly chosen slant remains constant in oneimage. Furthermore, the image brightness is reduced by 30%,since rainy days are usually darker. For evaluation, differentrain intensities are simulated, namely light (N = 500, l = 10),moderate (N = 1500, l = 30), and heavy (N = 2500, l = 60)rain. In Fig. 6, the mIoU course of the considered approachesis shown in dependence of the rain intensity, and in Table IV,the results of heavy rain are shown. The graph shows thatthe performance of all approaches decreases the heavier therain is. LateFusion RLM again performs best, but in heavyrain it is outperformed by ICNet depth. The reason is thatICNet depth only uses the lidar data, which is not disturbedby the rain drops in this work. In Fig. 9, LateFusion SLMand LateFusion RLM are compared qualitatively in the caseof heavy rain. Generally, the qualitative analysis shows thatthe LateFusion RLM is more robust to the rain. For instance,the RLM-based approach detects the cars more reliably (seefirst and second row of Fig. 9) and there are much less areas,which are misclassified (e.g. see second and third row ofFig. 9). However, small details such as poles or bicycles arenot detected in heavy rain any more, which is mainly causedthat these objects are difficult to observe.

Fig. 7. Qualitative results on the Foggy Cityscapes dataset (β = 0.020). First column: input image; second column: results of the Late Fusion approachusing the standard training methods (SLM); third column: results of the Late Fusion approach using the Robust Learning Method (RLM); fourth column:corresponding ground-truth.

Fig. 8. Qualitative results on the on the AtUlm dataset (blinding sun). First column: input image; second column: results of the Late Fusion approach using thestandard training method (SLM); third column: results of the Late Fusion approach using the Robust Learning Method (RLM); fourth column: correspondingground-truth

VI. CONCLUSION

In this paper, both performance and robustness of semanticsegmentation approaches were investigated in adverse weatherconditions. It turns out that they decrease enormously in thiscase as compared to good weather conditions. This problemcan be solved in two steps. First, the segmentation accuracycan be increased by fusing different sensor data, e.g cameraand lidar data, and so two different sensor data fusion architec-tures were presented. It showed that the Late Fusion approachoutperforms the Early Fusion approach, especially in diverseweather conditions. However, these approaches do not deliverreliable results in diverse weather conditions, since they arevulnerable to unknown noise. Hence, in a second step, theproposed network architectures were trained by means of the

Robust Learning Method, which makes the approaches morerobust in adverse weather conditions. More concretely, bothrobustness and performance could be increased enormouslyby this learning method, as demonstrated on several examples.Furthermore, it was shown, that the proposed approaches alsoperform well, if unknown disturbances occur that were notpresented during training.

VII. ACKNOWLEDGMENT

The research leading to these results has received fundingfrom the European Union under the H2020 ECSEL Pro-gramme as part of the DENSE project, contract number692449.

Fig. 9. Qualitative results on the on the AtUlm dataset (rain). First column: input image; second column: results of the Late Fusion approach using thestandard training method (SLM); third column: results of the Late Fusion approach using the Robust Learning Method (RLM); fourth column: correspondingground-truth

REFERENCES

[1] Martın Abadi, Ashish Agarwal, Paul Barham, and Eugene Brevdo et al.TensorFlow: Large-scale machine learning on heterogeneous systems,2015. Software available from tensorflow.org.

[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: Adeep convolutional encoder-decoder architecture for image segmenta-tion. CoRR, abs/1511.00561, 2015.

[3] M. Bijelic, C. Muench, W. Ritter, Y. Kalnishkan, and K. Dietmayer.Robustness against unknown noise for raw data fusing neural networks.In 2018 21st International Conference on Intelligent TransportationSystems (ITSC), pages 2177–2184, Nov 2018.

[4] Luca Caltagirone, Samuel Scheidegger, Lennart Svensson, and MattiasWahde. Fast lidar-based road detection using fully convolutional neuralnetworks. CoRR, abs/1703.03613, 2017.

[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Mur-phy, and Alan L. Yuille. Deeplab: Semantic image segmentation withdeep convolutional nets, atrous convolution, and fully connected crfs.CoRR, abs/1606.00915, 2016.

[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff,and Hartwig Adam. Encoder-decoder with atrous separable convolutionfor semantic image segmentation. In ECCV, 2018.

[7] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. CoRR,abs/1611.07759, 2016.

[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld,Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, andBernt Schiele. The cityscapes dataset for semantic urban scene under-standing. CoRR, abs/1604.01685, 2016.

[9] Ayush Dewan, Gabriel L. Oliveira, and Wolfram Burgard. Deep semanticclassification for 3d lidar data. In Proc. of the IEEE Int. Conf. onIntelligent Robots and Systems (IROS), 2017.

[10] Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers.Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In ACCV, 2016.

[11] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, andSteven Lake Waslander. Joint 3d proposal generation and objectdetection from view aggregation. CoRR, abs/1712.02294, 2017.

[12] F. Kunz, D. Nuss, J. Wiest, H. Deusch, S. Reuter, F. Gritschneder,A. Scheel, M. Stbler, M. Bach, P. Hatzelmann, C. Wild, and K. Di-etmayer. Autonomous driving at ulm university: A modular, robust, andsensor-independent fusion approach. In 2015 IEEE Intelligent VehiclesSymposium (IV), pages 666–673, June 2015.

[13] Seungyong Lee, Seong-Jin Park, and Ki-Sang Hong. Rdfnet: Rgb-dmulti-level residual feature fusion for indoor semantic segmentation.2017 IEEE International Conference on Computer Vision (ICCV), pages4990–4999, 2017.

[14] Lingni Ma, Jorg Stuckler, Christian Kerl, and Daniel Cremers. Multi-view deep learning for consistent semantic mapping with RGB-Dcameras. CoRR, abs/1703.08866, 2017.

[15] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culur-ciello. Enet: A deep neural network architecture for real-time semanticsegmentation. CoRR, abs/1606.02147, 2016.

[16] Andreas Pfeuffer and Klaus Dietmayer. Optimal sensor data fusionarchitecture for object detection in adverse weather conditions. In2018 21st International Conference on Information Fusion (FUSION)(FUSION 2018), pages 2592–2599, Cambridge, United Kingdom (GreatBritain), July 2018.

[17] Andreas Pfeuffer, Karina Schulz, and Klaus Dietmayer. Semanticsegmentation of video sequences with convolutional lstms. In 2019IEEE Intelligent Vehicles Symposium (IV), 2019.

[18] Florian Piewak, Peter Pinggera, Manuel Schafer, David Peter, BeateSchwarz, Nick Schneider, David Pfeiffer, Markus Enzweiler, and J. Mar-ius Zollner. Boosting lidar-based semantic labeling by cross-modaltraining data generation. CoRR, abs/1804.09915, 2018.

[19] Horia Porav, Tom Bruls, and Paul Newman. I can see clearly now :Image restoration via de-raining. CoRR, abs/1901.00893, 2019.

[20] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggyscene understanding with synthetic data. International Journal ofComputer Vision, 126(9):973–992, Sep 2018.

[21] Ole Schumann, Markus Hahn, Jurgen Dickmann, and Christian Wohler.Semantic segmentation on radar point clouds. 2018 21st InternationalConference on Information Fusion (FUSION), pages 2179–2186, 2018.

[22] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutionalnetworks for semantic segmentation. CoRR, abs/1605.06211, 2016.

[23] Sepehr Valipour, Mennatullah Siam, Martin Jagersand, and NilanjanRay. Recurrent fully convolutional networks for video segmentation.2017 IEEE Winter Conference on Applications of Computer Vision(WACV), pages 29–36, 2017.

[24] Jinghua Wang, Zhenhua Wang, Dacheng Tao, Simon See, and GangWang. Learning common and specific features for RGB-D semanticsegmentation with deconvolutional networks. CoRR, abs/1608.01082,2016.

[25] E. E. Yurdakul and Y. Yemez. Semantic segmentation of rgbd videoswith recurrent fully convolutional neural networks. In 2017 IEEEInternational Conference on Computer Vision Workshops (ICCVW),pages 367–374, Oct 2017.

[26] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and JiayaJia. Icnet for real-time semantic segmentation on high-resolution images.CoRR, abs/1704.08545, 2017.

[27] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and JiayaJia. Pyramid scene parsing network. CoRR, abs/1612.01105, 2016.

Robust Semantic Segmentation in Adverse Weather Conditions ... · their methods in adverse weather...

Documents

Transcript of Robust Semantic Segmentation in Adverse Weather Conditions ... · their methods in adverse weather...