Transcript of: RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN
Jason Bolito, Research School of Computer Science, ANU
RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN
Supervisors: Yiran Zhong & Hongdong Li
Outline
1. Motivation and Background
2. Proposed Method
3. Implementation, Experiment and Results
4. Conclusion and Future Work
Motivation – Semantic Segmentation
• Understanding road scenes.
• Useful for autonomous cars and drones.
Source: Cityscapes dataset.
Semantic Segmentation vs. Object Recognition
Object Recognition: "Person"
Semantic Segmentation: "Road", "Person", "Vegetation", "Motorcycle"
Source: cityscapes-dataset.com
What we want from our method
• Leverage both 3D and colour information.
• Attain more accurate and robust semantic segmentation.
Background – RGB Semantic Labelling
• Earlier days: CRFs (low level vision cues).
• Recently: Deep Neural Nets.
Background – Fully Convolutional Networks
• Pixels-to-pixels approach.
• Builds on VGG16 (encoder).
• Upsampling using deconvolution to get the label map (decoder).
Source: FCNs for semantic segmentation by J. Long et al.
Background – Deconvolution Networks
• Expands VGG16 (encoder).
• Uses unpooling + deconvolution to get the label map (decoder).
Source: Learning Deconvolution Network for Semantic Segmentation by H. Noh et al.
Background – SegNet
• Similar encoder-decoder structure.
• Removes fully connected layers.
• Prioritises memory efficiency.
Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation by V. Badrinarayanan et al.
Background – RGBd Semantic Labelling
• HHA representation (Gupta et al., 2014).
• Hard mutex constraints (Deng et al., 2015).
• LSTM-F (Li et al, 2016).
• FuseNet (Hazirbas et al., 2016).
Background (cont’d)
• The methods above treat depth as just another input channel.
• Depth is used as generic extra information.
• The 3D structure it encodes is not explicitly considered or learned.
Proposed Method – Ideas
• Use depth to partially reconstruct 3D scene.
• Use 3D convolution to capture structure.
• Apply encoder-decoder design to achieve rich segmentation maps.
Proposed Method – S3D
[Architecture figure: an encoder of Conv3D + ReLU layers, with 2x-strided Conv3D for downsampling, followed by a decoder of Deconv3D + ReLU layers and a final softmax; feature map widths 32, 64, 64, 64, 128.]
S3D building blocks – Input Layer
• Input RGB image I is voxelised via disparity map D:
• 2.5D reconstruction of environment.
• Points at infinity have disparity 0.
I_3D(z, x, y, c) := I(x, y, c) if z = ⌊D(x, y)⌋, and 0 otherwise.
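As a concrete illustration, here is a minimal NumPy sketch of this voxelisation (the function name, the axis order (depth, row, col, channel), and the clipping of out-of-range disparities are my own choices, not from the slides):

```python
import numpy as np

def voxelise(image, disparity, depth_bins):
    """Build the I_3D input volume: copy each RGB pixel into the depth
    slice given by its floored disparity; all other voxels stay zero.
    Points at infinity (disparity 0) land in slice 0."""
    h, w, c = image.shape
    vol = np.zeros((depth_bins, h, w, c), dtype=image.dtype)
    z = np.floor(disparity).astype(int)
    z = np.clip(z, 0, depth_bins - 1)        # guard against out-of-range disparities
    ys, xs = np.mgrid[0:h, 0:w]
    vol[z, ys, xs] = image[ys, xs]           # scatter each pixel to its depth slice
    return vol
```

The result is the sparse "2.5D reconstruction" the slide describes: one occupied voxel per pixel.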
S3D building blocks – Encoder
• Feature extraction via 3D convolution:
• Each 3x3x3 filter is a learnable template.
• High response = input matches template.
• 3D structure = 3D input + 3D templates
F_out(z, x, y, c_out) = Σ_{k,i,j,c_in} F_in(z+k, x+i, y+j, c_in) · K_{c_out}(k, i, j, c_in)
S3D building blocks – Encoder (cont’d)
• Non-linear activation function:
• Good gradients for backprop.
• Learnable downsampling = strided 3D convolution.
ReLU(x) = max(0, x)
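The two encoder ingredients on this slide, ReLU and strided 3D convolution, can be sketched together in NumPy (function names are my own; a real implementation would use a framework op):

```python
import numpy as np

def relu(x):
    """max(0, x): cheap non-linearity with well-behaved gradients."""
    return np.maximum(0.0, x)

def strided_conv3d(f_in, kernel, stride=2):
    """'Valid' 3D conv evaluated only every `stride` voxels, i.e. learnable
    downsampling. f_in: (Z, X, Y, Cin), kernel: (k, k, k, Cin, Cout)."""
    Z, X, Y, _ = f_in.shape
    k = kernel.shape[0]
    zs = range(0, Z - k + 1, stride)
    xs = range(0, X - k + 1, stride)
    ys = range(0, Y - k + 1, stride)
    out = np.zeros((len(zs), len(xs), len(ys), kernel.shape[-1]))
    for a, z in enumerate(zs):
        for b, x in enumerate(xs):
            for c, y in enumerate(ys):
                patch = f_in[z:z + k, x:x + k, y:y + k, :]
                out[a, b, c] = np.tensordot(patch, kernel, axes=4)
    return relu(out)
```

Stride 2 roughly halves each spatial dimension, and unlike pooling the downsampling weights are learned.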
S3D building blocks – Decoder
• 3D deconvolution = “inverse” of 3D convolution.
• Already implemented as backwards Conv3D pass.
• Learnable upsampling = strided 3D deconvolution.
F_out(z+k, x+i, y+j, c_out) += Σ_{c_in} F_in(z, x, y, c_in) · K_{c_out}(k, i, j, c_in)
S3D building blocks – Decoder (cont’d)
• Skip layers (top-down modulation):
[Diagram: shallow Conv3D features (low-level features) are fed into the matching Deconv3D layers (high-level knowledge); this helps with convergence and refines features.]
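The fusion itself is a one-liner; the slide does not say how S3D merges the two streams, so this sketch assumes element-wise addition (concatenation along channels is the other common choice):

```python
import numpy as np

def skip_merge(decoder_feat, encoder_feat):
    """Fuse a deep decoder map with the matching shallow encoder map.
    Assumes element-wise addition; shapes must agree."""
    assert decoder_feat.shape == encoder_feat.shape
    return decoder_feat + encoder_feat
```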
S3D building blocks – Inference
• Use softmax to get probability cube:
• Argmax over classes to get 3D labels:
• Project using D to get 2D labels:
P̂(z, x, y, c) := exp(F(z, x, y, c)) / Σ_{c′ ∈ Classes} exp(F(z, x, y, c′))

L̂_3D(z, x, y) := argmax_{c ∈ Classes} P̂(z, x, y, c)

L̂(x, y) := L̂_3D(⌊D(x, y)⌋, x, y)
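The three inference steps, softmax, argmax, and projection through the disparity map, fit in one small NumPy sketch (function name and the clipping of disparities are my own choices):

```python
import numpy as np

def labels_2d(logits, disparity):
    """logits: (Z, H, W, C) class scores; disparity: (H, W).
    Softmax over classes -> probability cube, argmax -> 3D labels,
    then read off the label at each pixel's depth slice."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    l3d = probs.argmax(axis=-1)                              # (Z, H, W) labels
    z = np.clip(np.floor(disparity).astype(int), 0, logits.shape[0] - 1)
    ys, xs = np.mgrid[0:disparity.shape[0], 0:disparity.shape[1]]
    return l3d[z, ys, xs]                                    # (H, W) label map
```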
Implementation
• Implemented using a deep-learning façade API and TensorFlow.
Experiment and Results
• Dataset: Cityscapes (urban scene dataset)
• Splits: 2975 training / 500 test images over 50 cities.
• GPU: Nvidia GeForce Titan X Pascal.
Experiment and Results (cont’d)
• Image size: 128x64x128
• Iterations: around 30k
• Results (state of the art has mIoU = 80.1%):
• Learning feature extraction takes a while.
• Can we do better?
G      mIoU   C      G_test  mIoU_test  C_test
0.908  0.533  0.399  0.833   0.444      0.295
Experiment and Results (cont’d)
• Trick: Let pre-trained 2D DCNN do feature extraction.
• Use S3D on extracted features.
• Not state of the art, but matches DeepLab (71.4%)!
• Depth accuracy/efficiency trade-off!
Method             G      mIoU   C      G_test  mIoU_test  C_test  time/it (s)
S3D-ResNet-38-128  0.958  0.717  0.606  0.949   0.691      0.560   1.350
S3D-ResNet-38-48   0.960  0.748  0.622  0.942   0.715      0.570   0.513
S3D-ResNet-38-16   0.954  0.722  0.597  0.943   0.690      0.548   0.169
Conclusion and Future Work
• Presented a DNN solution for semantic segmentation.
• Solution fully utilises 3D structure.
• Achieves good results especially when used on pre-extracted features.
• Good results achieved without any extra goodies! (CRFs, data augmentation, …)
• There is plenty of room for improvement!
Conclusion and Future Work (cont’d)
• Need to push S3D to the limit.
• Can be done with post-processing, balancing, upsampling, …
• What happens when we generalise one of the other architectures to 3D?
Questions?
Thank You!