Transcript of: RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN
Jason Bolito, Research School of Computer Science, ANU
RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN
Supervisors: Yiran Zhong & Hongdong Li
Outline
1. Motivation and Background
2. Proposed Method
3. Implementation, Experiment and Results
4. Conclusion and Future Work
Motivation – Semantic Segmentation
• Understanding road scenes.
• Useful for autonomous cars and drones.
Source: Cityscapes dataset.
Semantic Segmentation vs. Object Recognition
Object Recognition: "Person"
Semantic Segmentation: "Road", "Person", "Vegetation", "Motorcycle"
Source: cityscapes-dataset.com
What we want from our method
• Leverage both 3D and colour information.
• Attain more accurate and robust semantic segmentation.
Background – RGB Semantic Labelling
• Earlier days: CRFs (low level vision cues).
• Recently: Deep Neural Nets.
Background – Fully Convolutional Networks
• Pixels-to-pixels approach.
• Builds on VGG16 (encoder).
• Upsampling using deconvolution to get the label map (decoder).
Source: FCNs for semantic segmentation by J. Long et al.
Background – Deconvolution Networks
• Expands VGG16 (encoder).
• Uses unpooling + deconvolution to get the label map (decoder).
Source: Learning Deconvolution Network for Semantic Segmentation by H. Noh et al.
Background – SegNet
• Similar encoder-decoder structure.
• Removes fully connected layers.
• Prioritises memory efficiency.
Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation by V. Badrinarayanan et al.
Background – RGBd Semantic Labelling
• HHA representation (Gupta et al., 2014).
• Hard mutex constraints (Deng et al., 2015).
• LSTM-F (Li et al, 2016).
• FuseNet (Hazirbas et al., 2016).
Background (cont’d)
• The methods above treat depth as just another input channel.
• Depth is used as generic extra information.
• The 3D structure it encodes is not explicitly considered or learned.
Proposed Method – Ideas
• Use depth to partially reconstruct 3D scene.
• Use 3D convolution to capture structure.
• Apply encoder-decoder design to achieve rich segmentation maps.
Proposed Method – S3D
[Architecture figure: an encoder of Conv3D + ReLU layers, with 2x-strided Conv3D for downsampling, followed by a decoder of Deconv3D + ReLU layers and a final softmax; feature map widths 32, 64, 64, 64, 128.]
S3D building blocks – Input Layer
• Input RGB image I is voxelised via disparity map D:
• 2.5D reconstruction of environment.
• Points at infinity have disparity 0.
I_3D(z, x, y, c) := I(x, y, c) if z = ⌊D(x, y)⌋, and 0 otherwise.
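As a concrete illustration, here is a minimal NumPy sketch of this voxelisation (the function name, the axis order (depth, row, col, channel), and the clipping of out-of-range disparities are my own choices, not from the slides):

```python
import numpy as np

def voxelise(image, disparity, depth_bins):
    """Build the I_3D input volume: copy each RGB pixel into the depth
    slice given by its floored disparity; all other voxels stay zero.
    Points at infinity (disparity 0) land in slice 0."""
    h, w, c = image.shape
    vol = np.zeros((depth_bins, h, w, c), dtype=image.dtype)
    z = np.floor(disparity).astype(int)
    z = np.clip(z, 0, depth_bins - 1)        # guard against out-of-range disparities
    ys, xs = np.mgrid[0:h, 0:w]
    vol[z, ys, xs] = image[ys, xs]           # scatter each pixel to its depth slice
    return vol
```

The result is the sparse "2.5D reconstruction" the slide describes: one occupied voxel per pixel.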
S3D building blocks – Encoder
• Feature extraction via 3D convolution:
• Each 3x3x3 filter is a learnable template.
• High response = input matches template.
• 3D structure = 3D input + 3D templates
F_out(z, x, y, c_out) = Σ_{k,i,j,c_in} F_in(z+k, x+i, y+j, c_in) · K_{c_out}(k, i, j, c_in)
S3D building blocks – Encoder (cont’d)
• Non-linear activation function:
• Good gradients for backprop.
• Learnable downsampling = strided 3D convolution.
ReLU(x) = max(0, x)
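The two encoder ingredients on this slide, ReLU and strided 3D convolution, can be sketched together in NumPy (function names are my own; a real implementation would use a framework op):

```python
import numpy as np

def relu(x):
    """max(0, x): cheap non-linearity with well-behaved gradients."""
    return np.maximum(0.0, x)

def strided_conv3d(f_in, kernel, stride=2):
    """'Valid' 3D conv evaluated only every `stride` voxels, i.e. learnable
    downsampling. f_in: (Z, X, Y, Cin), kernel: (k, k, k, Cin, Cout)."""
    Z, X, Y, _ = f_in.shape
    k = kernel.shape[0]
    zs = range(0, Z - k + 1, stride)
    xs = range(0, X - k + 1, stride)
    ys = range(0, Y - k + 1, stride)
    out = np.zeros((len(zs), len(xs), len(ys), kernel.shape[-1]))
    for a, z in enumerate(zs):
        for b, x in enumerate(xs):
            for c, y in enumerate(ys):
                patch = f_in[z:z + k, x:x + k, y:y + k, :]
                out[a, b, c] = np.tensordot(patch, kernel, axes=4)
    return relu(out)
```

Stride 2 roughly halves each spatial dimension, and unlike pooling the downsampling weights are learned.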
S3D building blocks – Decoder
• 3D deconvolution = “inverse” of 3D convolution.
• Already implemented as backwards Conv3D pass.
• Learnable upsampling = strided 3D deconvolution.
F_out(z+k, x+i, y+j, c_out) += Σ_{c_in} F_in(z, x, y, c_in) · K_{c_out}(k, i, j, c_in)
S3D building blocks – Decoder (cont’d)
• Skip layers (top-down modulation):
[Diagram: shallow Conv3D features (low-level features) are fed into the matching Deconv3D layers (high-level knowledge); this helps with convergence and refines features.]
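The fusion itself is a one-liner; the slide does not say how S3D merges the two streams, so this sketch assumes element-wise addition (concatenation along channels is the other common choice):

```python
import numpy as np

def skip_merge(decoder_feat, encoder_feat):
    """Fuse a deep decoder map with the matching shallow encoder map.
    Assumes element-wise addition; shapes must agree."""
    assert decoder_feat.shape == encoder_feat.shape
    return decoder_feat + encoder_feat
```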
S3D building blocks – Inference
• Use softmax to get probability cube:
• Argmax over classes to get 3D labels:
• Project using D to get 2D labels:
P̂(z, x, y, c) := exp(F(z, x, y, c)) / Σ_{c′ ∈ Classes} exp(F(z, x, y, c′))

L̂_3D(z, x, y) := argmax_{c ∈ Classes} P̂(z, x, y, c)

L̂(x, y) := L̂_3D(⌊D(x, y)⌋, x, y)
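The three inference steps, softmax, argmax, and projection through the disparity map, fit in one small NumPy sketch (function name and the clipping of disparities are my own choices):

```python
import numpy as np

def labels_2d(logits, disparity):
    """logits: (Z, H, W, C) class scores; disparity: (H, W).
    Softmax over classes -> probability cube, argmax -> 3D labels,
    then read off the label at each pixel's depth slice."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    l3d = probs.argmax(axis=-1)                              # (Z, H, W) labels
    z = np.clip(np.floor(disparity).astype(int), 0, logits.shape[0] - 1)
    ys, xs = np.mgrid[0:disparity.shape[0], 0:disparity.shape[1]]
    return l3d[z, ys, xs]                                    # (H, W) label map
```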
Implementation
• Implemented using a deep-learning façade API and TensorFlow.
Experiment and Results
• Dataset: Cityscapes (urban scene dataset)
• Splits: 2975 training / 500 test images over 50 cities.
• GPU: Nvidia GeForce Titan X Pascal.
Experiment and Results (cont’d)
• Image size: 128x64x128
• Iterations: around 30k
• Results (state of the art has mIoU = 80.1%):
• Learning feature extraction takes a while.
• Can we do better?
G      mIoU   C      G_test  mIoU_test  C_test
0.908  0.533  0.399  0.833   0.444      0.295
Experiment and Results (cont’d)
• Trick: Let pre-trained 2D DCNN do feature extraction.
• Use S3D on extracted features.
• Not state of the art, but matches DeepLab (71.4%)!
• Depth accuracy/efficiency trade-off!
Method             G      mIoU   C      G_test  mIoU_test  C_test  time/it (s)
S3D-ResNet-38-128  0.958  0.717  0.606  0.949   0.691      0.560   1.350
S3D-ResNet-38-48   0.960  0.748  0.622  0.942   0.715      0.570   0.513
S3D-ResNet-38-16   0.954  0.722  0.597  0.943   0.690      0.548   0.169
Conclusion and Future Work
• Presented a DNN solution for semantic segmentation.
• Solution fully utilises 3D structure.
• Achieves good results especially when used on pre-extracted features.
• Good results achieved without any extra goodies! (CRFs, data augmentation, …)
• There is plenty of room for improvement!
Conclusion and Future Work (cont’d)
• Need to push S3D to the limit.
• Can be done with post-processing, balancing, upsampling, …
• What happens when we generalise one of the other architectures to 3D?
Questions?
Thank You!