Page 1: Semantic Segmentation - NVIDIAon-demand.gputechconf.com/gtc-il/2017/presentation/sil7145-eyal-gr… · This talk: Semantic Segmentation aka: scene labeling / scene parsing / dense

Semantic Segmentation

Dr. Eyal Gruss

Director of AI, Flatspace

Talpiyot

PhD Physics

Machine Learning • Researcher • Consultant • Entrepreneur

Digital Artist

Eyal Gruss

For photorealistic VR experience

3D Model

Using deep neural networks

Architectural Interpretation

Bitmap Floorplan

An AI-powered service that creates a VR model from a simple floorplan.

Flatspace

Demo video: http://flatspace.xyz

Image Recognition (ImageNet ILSVRC)

1.2M train images, 100k test images, 1000 categories

[Chart: top-5 classification error by year]

2010: 28.19%
2011: 25.77%
2012: 16.42% (AlexNet; move to deep neural networks)
2013: 11.74%
2014: 6.66% (GoogLeNet)
2015: 3.57% (Microsoft ResidualNet)
2016: 2.99% (Trimps-Soushen, Ministry of Public Security, China)
2017: 2.25% (Momenta/Oxford)
Human level: 5.10% (Karpathy)

Object Detection and Recognition (ImageNet)

googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html (Szegedy et al., GoogLeNet)

Live: • VGG • YOLO • YOLO v2 • LeCun

Concurrence, Localization

Occlusion

Out of context

Counting

Tracking

Multi Instance Semantic Segmentation

Li et al., arxiv.org/abs/1611.07709

Won the COCO 2016 Detection Challenge (for segmentation)

Adversarial Perturbations Against Semantic Segmentation

Fischer et al., arxiv.org/abs/1703.01101

Xie et al., arxiv.org/abs/1703.08603

Metzen et al., arxiv.org/abs/1704.05712

Cisse et al., arxiv.org/abs/1707.05373

Other related tasks

• Edge detection

• Semantic edge detection

• Surface normals

• Matting / objectness (foreground/background)

• Saliency / memorability

• Pose estimation

• Depth estimation

• Optical flow interpolation and estimation

• Motion prediction

• E.g. Eigen and Fergus, UberNet, PixelNet combine several of the above

This talk: Semantic Segmentation

aka: scene labeling / scene parsing / dense prediction / dense labeling / pixel-level classification

[Figure panels: (d) input, (e) semantic segmentation, (f) naive instance segmentation, (g) instance segmentation]

Datasets and use cases

• General
  • Pascal VOC 2012
  • MS COCO (evaluation only for instance segmentation)
  • ADE20K / SceneParse150K (all pixels annotated)
  • DAVIS 2017 (video; review)

• Urban (e.g. for autonomous vehicles)
  • Cityscapes (all pixels annotated)
  • CMP Facades (strong priors)
  • KITTI road/lane
  • CamVid (all pixels annotated, video)

• Aerial / Satellite
  • ISPRS Potsdam and Vaihingen
  • DSTL Kaggle (multi-modal)

• Human parsing (LIP, MHP)

• Medical imaging (can be 2.5D/multi-view)

• More: riemenschneider.hayko.at/vision/dataset

Pascal VOC 2012 (Train+Validation): 11,530 images, 6,929 segmentations, 20 classes + background

github.com/nightrome/really-awesome-semantic-segmentation

Evaluation metrics

• Pixel accuracy (dominated by background class)

• Mean accuracy over classes (per-class recall does not penalize false positives; must include the background class)

• Jaccard index = Intersection over Union (IoU) = |GT ∩ Pred| / |GT ∪ Pred| = TP / (TP + FN + FP)
  • IoU <= Recall = TP / GT, and IoU <= Precision = TP / Pred
  • Usually: mean over classes, on the whole dataset
  • Can include or exclude the background class
  • Can be mean over images instead of whole dataset
  • Can be frequency weighted (unbalanced, similar to pixel accuracy)
  • Can be weighted by inverse instance size (Cityscapes, important in traffic use cases)
  • Can be averaged with e.g. pixel accuracy (ADE20K)

• Dice index = F1 score = 2|GT ∩ Pred| / (|GT| + |Pred|) = 2TP / (2TP + FN + FP)
  • = Harmonic mean of Recall and Precision
  • = 2IoU / (1 + IoU), monotonic with IoU
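
The pixel-level metrics above can all be read off a class confusion matrix. The following is a minimal NumPy sketch, not a reference implementation; the function names and the toy 2x3 label maps are made up for illustration, and classes absent from both GT and prediction would need extra handling to avoid division by zero.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp   # GT pixels of the class that were missed
    fp = cm.sum(axis=0) - tp   # pixels wrongly predicted as the class
    iou = tp / (tp + fn + fp)
    dice = 2 * tp / (2 * tp + fn + fp)
    return {
        "pixel_acc": tp.sum() / cm.sum(),
        "mean_acc": np.mean(tp / (tp + fn)),  # mean per-class recall
        "iou": iou,
        "dice": dice,
        "mean_iou": iou.mean(),
    }

# Toy example with two classes (illustrative values only).
gt = np.array([[0, 0, 1], [0, 1, 1]])
pred = np.array([[0, 1, 1], [0, 1, 0]])
m = segmentation_metrics(confusion_matrix(gt, pred, num_classes=2))
print(m["iou"], m["dice"], m["pixel_acc"])
```

Accumulating one confusion matrix over the whole dataset gives the "mean over classes, on the whole dataset" variant; computing it per image and averaging gives the per-image variant.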

Evaluation metrics

• Trimap IoU around boundaries 4/8px (Krähenbühl and Koltun, Kohli et al.)

• Boundary F1 (BF) - Nearest boundary pixel distance (Csurka et al.)
  • For some distance error tolerance = e.g. 0.75% of the image diagonal
  • Can be averaged with IoU (DAVIS)

• Average precision (AP) = Area under the precision-recall curve (MS COCO)
  • Here precision, recall are instance-level given some IoU threshold e.g. 0.5
  • Can be additionally averaged over different thresholds (e.g. 0.5 - 0.95 in steps of 0.05)
  • Multiple detections (instance fragmentation) are counted as false positives beyond the best
  • Primary metric for instance segmentation (pixel-level metrics can be ambiguous)
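
As a rough illustration of instance-level AP at a single IoU threshold, here is a simplified NumPy sketch. It is an assumption-laden toy: the matching rule and the step-wise integration are simplified compared with the official MS COCO evaluation, and all names and masks are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def average_precision(gt_masks, pred_masks, scores, iou_thr=0.5):
    """Simplified AP: each GT instance can be matched by one prediction only."""
    order = np.argsort(scores)[::-1]          # highest score first
    matched, hits = set(), []
    for i in order:
        ious = [mask_iou(pred_masks[i], g) for g in gt_masks]
        j = int(np.argmax(ious))
        ok = bool(ious[j] >= iou_thr) and j not in matched
        if ok:
            matched.add(j)                    # later duplicates become false positives
        hits.append(ok)
    hits = np.array(hits, dtype=float)
    tp, fp = np.cumsum(hits), np.cumsum(1 - hits)
    recall = tp / len(gt_masks)
    precision = tp / (tp + fp)
    prev_recall = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - prev_recall) * precision))  # area under the PR curve

# Two GT instances; the second prediction duplicates the first one.
gt0 = np.zeros((4, 4), dtype=bool); gt0[:2, :2] = True
gt1 = np.zeros((4, 4), dtype=bool); gt1[2:, 2:] = True
ap = average_precision([gt0, gt1], [gt0.copy(), gt0.copy(), gt1.copy()], [0.9, 0.8, 0.7])
print(ap)
```

Averaging this value over thresholds 0.5, 0.55, ..., 0.95 gives the threshold-averaged variant mentioned above.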

Loss

• Cross entropy = - Σ_classes Σ_pixels p * log(q)
  • p = targets; q = output probabilities
  • Can be weighted by inverse class size
  • Can be weighted to emphasize areas around edges (U-Net, Meyer)

• IoU approximated with probabilities = Σ_classes [(Σ_pixels p * q) / Σ_pixels (p + q - p * q)]
  • Approximation is needed since IoU is discrete
  • Makes sense since this is our evaluation metric
  • Multiclass formulation is balanced over class sizes
  • Rediscovered in literature from time to time [1-16]
  • Visualead reported mixed results
  • Loss = one of:
    • -IoU [1 2 3 4 5 6]
    • -Dice [7 8 9 10]
    • -Tversky generalization [11]
    • 0.1 * CE + 0.9 * (1 - Dice) [12]
    • CE - log(IoU) [13]
    • Other approximations [14 15 16 (TBD in TF)]

• Total variation smoothing = Σ_classes Σ_x,y (|q(x+1,y) - q(x,y)| + |q(x,y+1) - q(x,y)|)

• Adversarial (later)
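
The loss terms on this slide can be written down directly in NumPy. This is a sketch under assumptions: `p` and `q` are arrays of shape (H, W, classes), the small `eps` is added only for numerical safety, and in practice these would be written in an autodiff framework rather than plain NumPy.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-7):
    """- sum over classes and pixels of p * log(q)."""
    return -np.sum(p * np.log(q + eps))

def soft_iou(p, q, eps=1e-7):
    """IoU with probabilities in place of hard labels, summed over classes."""
    inter = np.sum(p * q, axis=(0, 1))
    union = np.sum(p + q - p * q, axis=(0, 1))
    return np.sum(inter / (union + eps))

def soft_dice(p, q, eps=1e-7):
    """Dice with probabilities, averaged over classes."""
    inter = np.sum(p * q, axis=(0, 1))
    return np.mean(2 * inter / (np.sum(p + q, axis=(0, 1)) + eps))

def total_variation(q):
    """Sum of absolute differences between horizontal and vertical neighbors."""
    dx = np.abs(q[:, 1:] - q[:, :-1]).sum()
    dy = np.abs(q[1:, :] - q[:-1, :]).sum()
    return dx + dy

labels = np.array([[0, 0], [1, 1]])
p = np.eye(2)[labels]            # one-hot targets, shape (2, 2, 2)
q = np.full_like(p, 0.5)         # maximally uncertain prediction

loss_iou = -soft_iou(p, q)                                        # "-IoU" variant
loss_mix = 0.1 * cross_entropy(p, q) + 0.9 * (1 - soft_dice(p, q))  # CE + Dice variant
print(loss_iou, loss_mix, total_variation(p))
```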

Architectures

1. Patchwise CNN

2. FCN

3. DeepLab

4. DeconvNet

5. U-Net

6. SegNet

7. Dilated Convolutions (Yu and Koltun)

8. 100-Layer Tiramisu (DenseNets)

9. Wide ResNet

10. PSPNet

11. Adversarial

12. PolygonRNN

13. Mask R-CNN

14. Semi-supervised with unsupervised loss

Patchwise CNN

• Ning et al., http://yann.lecun.com/exdb/publis/pdf/ning-05.pdf

• Ciresan et al., people.idsia.ch/%7Ejuergen/nips2012.pdf

• A sliding window CNN classifies each pixel in turn

Fully Convolutional NN

• cs231n.github.io/convolutional-networks/#converting-fc-layers-to-conv-layers

• Start from a CNN classifier

• Convert fully connected to conv (with filter size = input volume, no padding):
  • CNN -> 7*7*512 -> fc(4096) -> 4096 -> fc(1000) -> 1000
  • CNN -> 7*7*512 -> conv(7*7*4096) -> 1*1*4096 -> conv(1*1*1000) -> 1*1*1000

• Can take arbitrarily larger input:
  • 224*224 -> 7*7*512 -> 1*1*1000
  • 384*384 -> 12*12*512 -> 6*6*1000

• Equivalent to sliding a patchwise CNN, but with a single pass that is much more efficient due to convolution sharing
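
A small NumPy check of the fully-connected-to-conv conversion, as a sketch only. Toy sizes are assumed (4 channels and 10 units instead of 512 and 4096), and `fc_as_conv` is a hypothetical helper written with explicit loops for clarity rather than speed.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, N = 4, 7, 10                     # toy channels, kernel size, output units
W_fc = rng.normal(size=(N, K * K * C)) # weights of a fully connected layer

def fc_as_conv(features, W_fc, K):
    """Apply the FC weights as a KxK convolution (no padding) over an HxWxC map."""
    H, Wd, _ = features.shape
    kernel = W_fc.reshape(-1, K, K, features.shape[2])  # same weights viewed as filters
    out = np.empty((H - K + 1, Wd - K + 1, kernel.shape[0]))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = features[y:y + K, x:x + K, :]
            out[y, x] = np.tensordot(kernel, patch, axes=3)
    return out

small = rng.normal(size=(7, 7, C))     # feature map of a 224*224 input
large = rng.normal(size=(12, 12, C))   # feature map of a 384*384 input

out_small = fc_as_conv(small, W_fc, K)  # shape (1, 1, N): identical to the FC layer
out_large = fc_as_conv(large, W_fc, K)  # shape (6, 6, N): one output per 7x7 window
print(out_small.shape, out_large.shape)
```

Each of the 6*6 outputs on the larger map equals what the original classifier would produce on the corresponding 7*7 window, which is the sliding-window equivalence stated above.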

Deconvolution/Upconvolution Layers

• FC convolution transposed

• cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

• Fractionally strided convolution

• github.com/vdumoulin/conv_arithmetic

(Resolution Increasing Convolutions)

[Figure: convolution with stride = 2 vs. stride = 1/2 on an input]
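
To make the "convolution transposed" view concrete, here is a 1D NumPy sketch (function names are illustrative). A strided convolution shrinks the signal; its transpose scatters each input value through the kernel and grows the signal, which is what a stride of 1/2 means.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid (no padding) strided 1D convolution, correlation form."""
    n_out = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(w)], w) for i in range(n_out)])

def conv1d_transpose(y, w, stride=1):
    """Transpose of conv1d: each input value adds a scaled copy of the kernel."""
    out = np.zeros((len(y) - 1) * stride + len(w))
    for i, v in enumerate(y):
        out[i * stride:i * stride + len(w)] += v * w
    return out

w = np.array([1.0, 1.0, 1.0])
down = conv1d(np.arange(7.0), w, stride=2)                     # length 7 -> 3
up = conv1d_transpose(np.array([1.0, 2.0, 3.0]), w, stride=2)  # length 3 -> 7
print(down, up)
```

The two operations are adjoint to each other (the same weight matrix, transposed), not inverses, so "deconvolution" is a loose name.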

Fully Convolutional Network (FCN; 2014-11)

• Long et al., arxiv.org/abs/1411.4038

• Shelhamer et al., arxiv.org/abs/1605.06211

• Start from classification CNN pre-trained on ImageNet (AlexNet/VGG-16/GoogLeNet) and convert fully connected to conv (conv7)

• Replace final layer to 1*1*21 and add bilinear upsampling to get full spatial output (FCN-32s)

• Add x2 deconv (initialized as bilinear) on conv7 and sum with conv prediction added to pool4

• Add bilinear upsampling to get full spatial output (FCN-16s). Fine tune from FCN-32s

• Do similarly for above fuse and pool3 (FCN-8s)

• Pascal VOC 2012 IoU=62.2%-67.2% (up from 51.6%)

• 100-175 ms (vs. 50 s)

• 134M params
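
"Initialized as bilinear" can be sketched as follows: a deconv kernel whose weights form a triangle profile reproduces bilinear interpolation when applied with the matching stride. This is a common construction shown here under assumptions; the helper names are made up and the 1D upsampling is only a check of the idea.

```python
import numpy as np

def bilinear_kernel_1d(factor):
    """Triangle-shaped weights for upsampling by an integer factor."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return 1 - np.abs(np.arange(size) - center) / factor

def bilinear_kernel_2d(factor):
    k = bilinear_kernel_1d(factor)
    return np.outer(k, k)              # separable 2D kernel for a deconv layer

def upsample_1d(y, factor):
    """Transposed convolution with the bilinear kernel and stride = factor."""
    k = bilinear_kernel_1d(factor)
    out = np.zeros((len(y) - 1) * factor + len(k))
    for i, v in enumerate(y):
        out[i * factor:i * factor + len(k)] += v * k
    return out

k2 = bilinear_kernel_2d(2)             # 4x4 kernel for the x2 deconv
up = upsample_1d(np.ones(3), 2)        # constant input stays constant away from the borders
print(k2)
print(up)
```

Starting from this kernel means the network begins as plain bilinear upsampling and can then learn a better upsampling filter during fine-tuning.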

DeepLab (2014-12)

• Chen et al. (Google), arxiv.org/abs/1412.7062

• VGG-16 pre-trained on ImageNet -> fully conv

• Cancel last two max-pool

• Change conv after above to x2/x4 dilated convolutions

• Train with x8 subsampled targets (IoU<90.7%). Infer with bilinear upsampling.

• Fully connected CRF (raw image dependent potential) post-processing in inference (+ 3%-5%)

• Add multi-scale layers fine tuned separately (similar to FCN-8s but with concats and convs)

• Increase dilation for first FC layer to x12 (large field of view) + change FC kernel, filters

• 20.5M params

• Pascal VOC 2012 IoU = 71.6%

• V2: arxiv.org/abs/1606.00915 with ResNet-101 + “atrous spatial pyramid pooling”
  • Pascal VOC 2012 IoU = 79.7%, Cityscapes IoU = 70.4%

• V3: arxiv.org/abs/1706.05587
  • Pascal VOC 2012 IoU = 86.9%, Cityscapes IoU = 81.3% (SOTA 2017)

[Figure panels: Before softmax, After softmax]

hole = atrous = dilated convolutions increase field of view without decreasing resolution, or adding parameters
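
A 1D NumPy sketch of a dilated (atrous) convolution, with an illustrative function name. The same three weights are applied with gaps between the taps, so the field of view grows while the parameter count stays fixed.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Valid 1D convolution (correlation form) with gaps of `dilation` between taps."""
    span = (len(w) - 1) * dilation + 1            # field of view of one filter
    return np.array([
        np.dot(x[i:i + span:dilation], w) for i in range(len(x) - span + 1)
    ])

x = np.arange(10.0)
w = np.array([1.0, 0.0, -1.0])                    # 3 parameters in both cases

narrow = dilated_conv1d(x, w, dilation=1)         # each output sees 3 samples
wide = dilated_conv1d(x, w, dilation=4)           # each output sees a span of 9 samples
print(narrow, wide)
```

This sketch uses no padding, so the output shrinks at the borders; with suitable zero padding the output keeps the input resolution, which is the property the slide refers to.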

DeconvNet (2015-05)

• Noh et al., arxiv.org/abs/1505.04366

• VGG-16 pre-trained on ImageNet

• Unpooling layers use saved max pooling indices

• Symmetric encoder-decoder: multiple deconvolutions + BatchNorm + ReLU (no dropout)

• Relies on region proposals. Training with two-stage curriculum learning:
  • 1. Instances cropped to GT bounding boxes * 1.2, all non-class pixels labeled as background
  • 2. Object proposals from edge-box * 1.2

• Inference:
  • Top 50 objectness score of 2000 edge-box object proposals, max per pixel/class before softmax
  • Fully connected CRF post-processing (+ ~1%)

• 252M params

• Pascal VOC 2012 IoU = 70.5%

• Ensemble with FCN-8s = 72.5%
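
Unpooling with saved max-pooling indices can be sketched in NumPy as below. This is a single-channel toy with hypothetical helper names, assuming the height and width are divisible by the pooling size.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also returns the argmax inside each window."""
    H, W = x.shape
    blocks = x.reshape(H // size, size, W // size, size).transpose(0, 2, 1, 3)
    flat = blocks.reshape(H // size, W // size, size * size)
    idx = flat.argmax(axis=-1)
    pooled = np.take_along_axis(flat, idx[..., None], axis=-1)[..., 0]
    return pooled, idx

def max_unpool(pooled, idx, size=2):
    """Place each pooled value back at its saved position; everything else is zero."""
    h, w = pooled.shape
    flat = np.zeros((h, w, size * size), dtype=pooled.dtype)
    np.put_along_axis(flat, idx[..., None], pooled[..., None], axis=-1)
    blocks = flat.reshape(h, w, size, size).transpose(0, 2, 1, 3)
    return blocks.reshape(h * size, w * size)

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 7, 0],
              [6, 0, 0, 0]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx)
print(pooled)
print(restored)
```

The unpooled map is sparse, which is why the decoder follows each unpooling with learned deconvolutions that densify it.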

U-Net (2015-05)

• Ronneberger et al., arxiv.org/abs/1505.04597

• No VGG! Not pre-trained!

• Skip connections to keep resolution!

• Separate deconv into: learned 2x2 upconv + (3x3 regular conv + ReLU) * 2

• Weighting to emphasize areas around morphological edges

• Implementations I’ve seen use half the filters and padding

dropout

SegNet (2015-11)

• Badrinarayanan et al., arxiv.org/abs/1505.07293, arxiv.org/abs/1511.00561

• VGG-16 pre-trained on ImageNet (without fully connected layers)

• Unpooling layers use saved max pooling indices like in DeconvNet

• Deconvolutions + BatchNorm + ReLU (some dropout)

• They compare various decoders, and dropouts (arxiv.org/abs/1511.02680)

• Pascal VOC 2012 IoU = 59.9%

Dilated Convolutions (2015-11)

• Yu and Koltun, arxiv.org/abs/1511.07122

• Front-end network + Context aggregation network

• Front-end is a truncated VGG-16 like DeepLab + dilated convs, pre-trained on Pascal VOC 2012

• Context aggregation is a 7-layer uniform resolution dilated convs + ReLUs, with increasing dilations and initialized to unit filters

• Train with x8 subsampled targets. Front-end is trained first. Then context is added and trained with fixed front-end

• Possible post-processing with fully connected CRF / CRF-RNN

• Front-end alone: Pascal VOC 2012 IoU = 71.3%

• Front-end + Context + CRF-RNN: Pascal VOC 2012 IoU = 75.3%

• Dilation10: Cityscapes IoU = 67.1%
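
Two details of the context module can be checked with a few lines of Python. The dilation schedule below is an assumption taken from the Yu and Koltun paper rather than from this slide, and the helper names are illustrative.

```python
import numpy as np

def receptive_field(dilations, kernel=3):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

def identity_filter(channels, kernel=3):
    """'Unit filter' initialization: each output channel copies its input channel."""
    w = np.zeros((kernel, kernel, channels, channels))
    for c in range(channels):
        w[kernel // 2, kernel // 2, c, c] = 1.0
    return w

dilations = [1, 1, 2, 4, 8, 16, 1]      # assumed schedule for the seven 3x3 layers
rf = receptive_field(dilations)
w0 = identity_filter(channels=3)
print(rf, w0.shape)
```

With unit filters the context module starts out as an identity mapping on the front-end prediction, so adding it cannot hurt at initialization; training then learns how to use the growing field of view.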

The One Hundred Layers Tiramisu (2016-11)

• Jegou et al., arxiv.org/abs/1611.09326

• DenseNets (few params, easy training)

• Encoder-Decoder with skip connections

• 56 – 103 layers

• 1.5M – 9.4M params

• No pre-training

• No / negative results on large benchmarks

Wide ResNet (2016-11)

• Wu et al., arxiv.org/abs/1611.10080

• Wider or Deeper Resnets? Wider!
  • See also Littwin and Wolf, arxiv.org/abs/1611.02525

• Wide 7-block ResNet pre-trained for classification, adapted to dilated a la DeepLab

• Pascal VOC 2012 IoU = 82.5%

• Cityscapes IoU = 78.4%

• ADE20K avg(pixel acc., IoU) = 56.74%

Pyramid Scene Parsing (PSPNet; 2016-12)

• Zhao et al., arxiv.org/abs/1612.01105 (trained models: Caffe, Keras)

• Pre-trained dilated 101-269 ResNet + deep supervision auxiliary loss + pyramid pooling module

• Pascal VOC 2012 IoU = 85.4% (1st place 2016)

• Cityscapes IoU = 80.2% (1st place 2016). Video

• ADE20K avg(pixel acc., IoU) = 57.21% (1st place 2016)

SOTA! (2016)
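
A rough NumPy sketch of the idea behind a pyramid pooling module: pool the feature map to a few coarse grids, upsample each back, and concatenate with the original features. The bin sizes (1, 2, 3, 6) are an assumption taken from the PSPNet paper, not from this slide, and this toy omits the per-level 1x1 convolutions and uses nearest-neighbor instead of bilinear upsampling.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool an HxWxC map down to bins x bins x C (assumes H, W >= bins)."""
    H, W, C = x.shape
    ys = np.linspace(0, H, bins + 1).astype(int)
    xs = np.linspace(0, W, bins + 1).astype(int)
    out = np.empty((bins, bins, C))
    for i in range(bins):
        for j in range(bins):
            out[i, j] = x[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean(axis=(0, 1))
    return out

def upsample_nearest(x, H, W):
    rows = np.arange(H) * x.shape[0] // H
    cols = np.arange(W) * x.shape[1] // W
    return x[rows][:, cols]

def pyramid_pooling(x, bin_sizes=(1, 2, 3, 6)):
    """Concatenate the input with context pooled at several scales."""
    H, W, _ = x.shape
    levels = [upsample_nearest(adaptive_avg_pool(x, b), H, W) for b in bin_sizes]
    return np.concatenate([x] + levels, axis=-1)

features = np.random.default_rng(0).normal(size=(6, 6, 2))   # toy feature map
out = pyramid_pooling(features)
print(out.shape)
```

The coarsest level (one bin) gives every pixel the global average of the scene, which is the kind of context that helps with the failure cases named on the next slide.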

Mismatched Relationship

Confusion Categories

Inconspicuous Classes

Generative Adversarial Nets

Goodfellow et al., arxiv.org/abs/1406.2661

Generator (Creator)

Discriminator (Curator)

Fake or Real?

Fake

Real

Image to Image Translation With Conditional Adversarial Networks (PatchGAN)

Isola et al., phillipi.github.io/pix2pix
Interactive: affinelayer.com/pixsrv
Guide: ml4a.github.io/guides/Pix2Pix
fotogenerator.npocloud.nl

Adversarial (2016-09)

• Idea is to regulate naturalness (strong and smooth classes, sharp boundaries, denoising, global structure)

• David Golan et al. (2016-09, first one AFAIK)

• Pix2pix, Isola et al., arxiv.org/abs/1611.07004
  • Generator is U-Net style (with skip connections)
  • 4x4 Conv with stride 2 - BatchNorm - ReLU (+ some dropout). No max-pooling.
  • L1 loss for generator
  • “PatchGAN” discriminator takes both image and segmentation, averages over 70x70 patches
  • Adversarial loss hurts! Cityscapes IoU = 29%
  • L1 only: Cityscapes IoU = 35%

• FAIR, Luc et al., arxiv.org/abs/1611.08408
  • Generator is Yu and Koltun’s Dilated8
  • Cross-entropy loss for generator
  • Discriminator issue: we feed it continuous probabilities (cannot do SGD with discrete labels), but GT are discrete
  • Tested product with image and scaling GT, as alternative input to discriminator, but results were the same
  • Pascal VOC 2012 IoU = 73.3% (compare to Yu and Koltun’s 71.3%). Adversarial ~ 2%
  • Several citations using this

PolygonRNN (2017-04)

• Castrejon et al. CSC2523_Project_Report, arxiv.org/abs/1704.05548

• Sparse representation using polygons

• Cityscapes IoU = 61.4% per instance, assuming given bounding boxes

• Can speed up manual annotation

• CVPR 2017 Best Paper Honorable Mention Award (video)

(ConvLSTM)

Mask R-CNN (2017-03)

• He et al., arxiv.org/abs/1703.06870 (tutorial)

• Instance segmentation SOTA

Semi-Supervised Semantic Segmentation with Unsupervised Total Variation Loss

• Javanmard et al., arxiv.org/abs/1605.01368

[Figure columns: Supervised (10 pix/image), Proposed (10 pix/image), Full labels, GT]

Meta references

• Janai et al., arxiv.org/abs/1704.05519 (chapter 6)

• Garcia-Garcia et al., arxiv.org/abs/1704.06857

• meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years

• blog.qure.ai/notes/semantic-segmentation-deep-learning-review

• handong1587.github.io/deep_learning/2015/10/09/segmentation

• github.com/kjw0612/awesome-deep-vision#semantic-segmentation

• github.com/mrgloom/Semantic-Segmentation-Evaluation

• github.com/fchollet/keras/issues/6538

Thanks!

• Slides: bit.ly/semantic-segmentation

• Contact: [email protected]