Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf ·...

104
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018 Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018 1 Lecture 11: Detection and Segmentation

Transcript of Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf ·...

Page 1: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20181

Lecture 11:Detection and Segmentation

Page 2: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20182

Administrative

1. Midterms being graded- Please don’t discuss midterms until next week - some students not yet

taken

2. A2 being graded

3. Project milestones due Wed 5/16

Page 3: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20183

Last Time: Recurrent Networks

Page 4: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20184

Last Time: Recurrent Networks

Page 5: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20185

Figure from Karpathy et a, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015.Reproduced for educational purposes.

Last Time: Recurrent Networks

A cat sitting on a suitcase on the floor

A cat is sitting on a tree branch

Two people walking on the beach with surfboards

A tennis player in action on the court

A woman is holding a cat in her hand

A person holding a computer mouse on a desk

Page 6: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20186

Class ScoresCat: 0.9Dog: 0.05Car: 0.01...

So far: Image Classification

This image is CC0 public domain Vector:4096

Fully-Connected:4096 to 1000

Page 7: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20187

Today: Detection, Segmentation

Page 8: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20188

Other Computer Vision TasksClassification + Localization

SemanticSegmentation

Object Detection

Instance Segmentation

CATGRASS, CAT, TREE, SKY

DOG, DOG, CAT DOG, DOG, CAT

Single Object Multiple ObjectNo objects, just pixels This image is CC0 public domain

Page 9: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 20189

Other Computer Vision TasksSemantic

Segmentation2D Object Detection

DOG, DOG, CAT

Object categories + 2D bounding boxes

This image is CC0 public domain

GRASS, CAT, TREE, SKY

No objects, just pixels

3D Object Detection

Car

Object categories + 3D bounding boxes

Page 10: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201810

Semantic SegmentationSemantic

Segmentation2D Object Detection

DOG, DOG, CAT

Object categories + 2D bounding boxes

GRASS, CAT, TREE, SKY

No objects, just pixelsThis image is CC0 public domain

3D Object Detection

Car

Object categories + 3D bounding boxes

Page 11: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201811

Semantic Segmentation

Cow

Grass

SkyTrees

Label each pixel in the image with a category label

Don’t differentiate instances, only care about pixels

This image is CC0 public domain

Grass

Cat

Sky Trees

Page 12: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201812

Semantic Segmentation Idea: Sliding Window

Full image

Extract patchClassify center pixel with CNN

Cow

Cow

Grass

Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Page 13: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201813

Semantic Segmentation Idea: Sliding Window

Full image

Extract patchClassify center pixel with CNN

Cow

Cow

GrassProblem: Very inefficient! Not reusing shared features between overlapping patches Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Page 14: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201814

Semantic Segmentation Idea: Fully Convolutional

Input:3 x H x W

Convolutions:D x H x W

Conv Conv Conv Conv

Scores:C x H x W

argmax

Predictions:H x W

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Page 15: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201815

Semantic Segmentation Idea: Fully Convolutional

Input:3 x H x W

Convolutions:D x H x W

Conv Conv Conv Conv

Scores:C x H x W

argmax

Predictions:H x W

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Problem: convolutions at original image resolution will be very expensive ...

Page 16: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201816

Semantic Segmentation Idea: Fully Convolutional

Input:3 x H x W Predictions:

H x W

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

High-res:D1 x H/2 x W/2

High-res:D1 x H/2 x W/2

Med-res:D2 x H/4 x W/4

Med-res:D2 x H/4 x W/4

Low-res:D3 x H/4 x W/4

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Page 17: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201817

Semantic Segmentation Idea: Fully Convolutional

Input:3 x H x W Predictions:

H x W

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

High-res:D1 x H/2 x W/2

High-res:D1 x H/2 x W/2

Med-res:D2 x H/4 x W/4

Med-res:D2 x H/4 x W/4

Low-res:D3 x H/4 x W/4

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Downsampling:Pooling, strided convolution

Upsampling:???

Page 18: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201818

In-Network upsampling: “Unpooling”

1 2

3 4

Input: 2 x 2 Output: 4 x 4

1 1 2 2

1 1 2 2

3 3 4 4

3 3 4 4

Nearest Neighbor

1 2

3 4

Input: 2 x 2 Output: 4 x 4

1 0 2 0

0 0 0 0

3 0 4 0

0 0 0 0

“Bed of Nails”

Page 19: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201819

In-Network upsampling: “Max Unpooling”

Input: 4 x 4

1 2 6 3

3 5 2 1

1 2 2 1

7 3 4 8

1 2

3 4

Input: 2 x 2 Output: 4 x 4

0 0 2 0

0 1 0 0

0 0 0 0

3 0 0 4

Max UnpoolingUse positions from pooling layer

5 6

7 8

Max PoolingRemember which element was max!

… Rest of the network

Output: 2 x 2

Corresponding pairs of downsampling and upsampling layers

Page 20: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201820

Learnable Upsampling: Transpose Convolution

Recall:Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Page 21: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201821

Recall: Normal 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Dot product between filter and input

Learnable Upsampling: Transpose Convolution

Page 22: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201822

Input: 4 x 4 Output: 4 x 4

Dot product between filter and input

Recall: Normal 3 x 3 convolution, stride 1 pad 1

Learnable Upsampling: Transpose Convolution

Page 23: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201823

Input: 4 x 4 Output: 2 x 2

Recall: Normal 3 x 3 convolution, stride 2 pad 1

Learnable Upsampling: Transpose Convolution

Page 24: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201824

Input: 4 x 4 Output: 2 x 2

Dot product between filter and input

Recall: Normal 3 x 3 convolution, stride 2 pad 1

Learnable Upsampling: Transpose Convolution

Page 25: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201825

Input: 4 x 4 Output: 2 x 2

Dot product between filter and input

Filter moves 2 pixels in the input for every one pixel in the output

Stride gives ratio between movement in input and output

Recall: Normal 3 x 3 convolution, stride 2 pad 1

Learnable Upsampling: Transpose Convolution

Page 26: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201826

3 x 3 transpose convolution, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Learnable Upsampling: Transpose Convolution

Page 27: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201827

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

3 x 3 transpose convolution, stride 2 pad 1

Learnable Upsampling: Transpose Convolution

Page 28: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201828

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Sum where output overlaps3 x 3 transpose convolution, stride 2 pad 1

Filter moves 2 pixels in the output for every one pixel in the input

Stride gives ratio between movement in output and input

Learnable Upsampling: Transpose Convolution

Page 29: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201829

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Sum where output overlaps3 x 3 transpose convolution, stride 2 pad 1

Filter moves 2 pixels in the output for every one pixel in the input

Stride gives ratio between movement in output and input

Learnable Upsampling: Transpose Convolution

Page 30: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201830

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Sum where output overlaps3 x 3 transpose convolution, stride 2 pad 1

Filter moves 2 pixels in the output for every one pixel in the input

Stride gives ratio between movement in output and input

Other names:-Deconvolution (bad)-Upconvolution-Fractionally strided convolution-Backward strided convolution

Learnable Upsampling: Transpose Convolution

Page 31: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201831

Learnable Upsampling: 1D Example

a

b

x

y

z

ax

ay

az + bx

by

bz

Input FilterOutput

Output contains copies of the filter weighted by the input, summing at where at overlaps in the output

Need to crop one pixel from output to make output exactly 2x input

Page 32: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201832

Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication

Example: 1D conv, kernel size=3, stride=1, padding=1

Page 33: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201833

Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication

Example: 1D conv, kernel size=3, stride=1, padding=1

Convolution transpose multiplies by the transpose of the same matrix:

When stride=1, convolution transpose is just a regular convolution (with different padding rules)

Page 34: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201834

Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication

Example: 1D conv, kernel size=3, stride=2, padding=1

Page 35: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201835

Convolution as Matrix Multiplication (1D Example)We can express convolution in terms of a matrix multiplication

Example: 1D conv, kernel size=3, stride=2, padding=1

Convolution transpose multiplies by the transpose of the same matrix:

When stride>1, convolution transpose is no longer a normal convolution!

Page 36: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201836

Semantic Segmentation Idea: Fully Convolutional

Input:3 x H x W Predictions:

H x W

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

High-res:D1 x H/2 x W/2

High-res:D1 x H/2 x W/2

Med-res:D2 x H/4 x W/4

Med-res:D2 x H/4 x W/4

Low-res:D3 x H/4 x W/4

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Downsampling:Pooling, strided convolution

Upsampling:Unpooling or strided transpose convolution

Page 37: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Aside: Multi-view 3D Reconstruction

37

Choy, C. B., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016, October). 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision (pp. 628-644). Springer, Cham.

Page 38: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Aside: Multi-view 3D Reconstruction

38

Choy, C. B., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016, October). 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision (pp. 628-644). Springer, Cham.

Page 39: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Aside: Multi-view 3D Reconstruction

39

2D image sequence 3D Voxelized ReconstructionChoy, C. B., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016, October). 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision (pp. 628-644). Springer, Cham.

Page 40: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201840

2D Object DetectionSemantic

Segmentation2D Object Detection

GRASS, CAT, TREE, SKY

DOG, DOG, CAT

Object categories + 2D bounding boxesNo objects, just pixels

This image is CC0 public domain

3D Object Detection

Car

Object categories + 3D bounding boxes

Page 41: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201841

Class ScoresCat: 0.9Dog: 0.05Car: 0.01...

Classification + Localization

This image is CC0 public domain Vector:4096

FullyConnected:4096 to 1000

Box Coordinates(x, y, w, h)

FullyConnected:4096 to 4

Treat localization as a regression problem!

Page 42: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201842

Class ScoresCat: 0.9Dog: 0.05Car: 0.01...

Classification + Localization

Vector:4096

FullyConnected:4096 to 1000

Box Coordinates(x, y, w, h)

FullyConnected:4096 to 4

Softmax Loss

L2 Loss

Correct label:Cat

Correct box:(x’, y’, w’, h’)

This image is CC0 public domain

Treat localization as a regression problem!

Page 43: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201843

Class ScoresCat: 0.9Dog: 0.05Car: 0.01...

Classification + Localization

Vector:4096

FullyConnected:4096 to 1000

Box Coordinates(x, y, w, h)

FullyConnected:4096 to 4

Softmax Loss

L2 Loss

Loss

Correct label:Cat

Correct box:(x’, y’, w’, h’)

+This image is CC0 public domain

Treat localization as a regression problem!

Multitask Loss

Page 44: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201844

Class ScoresCat: 0.9Dog: 0.05Car: 0.01...

Classification + Localization

Vector:4096

FullyConnected:4096 to 1000

Box Coordinates(x, y, w, h)

FullyConnected:4096 to 4

Softmax Loss

L2 Loss

Loss

Correct label:Cat

Correct box:(x’, y’, w’, h’)

+This image is CC0 public domain Often pretrained on ImageNet

(Transfer learning)

Treat localization as a regression problem!

Page 45: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201845

Object Detection as Regression?

CAT: (x, y, w, h)

DOG: (x, y, w, h)DOG: (x, y, w, h)CAT: (x, y, w, h)

DUCK: (x, y, w, h)DUCK: (x, y, w, h)….

Page 46: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201846

Object Detection as Regression?

CAT: (x, y, w, h)

DOG: (x, y, w, h)DOG: (x, y, w, h)CAT: (x, y, w, h)

DUCK: (x, y, w, h)DUCK: (x, y, w, h)….

4 numbers

16 numbers

Many numbers!

Each image needs a different number of outputs!

Page 47: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201847

Object Detection as Classification: Sliding Window

Dog? NOCat? NOBackground? YES

Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Page 48: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201848

Object Detection as Classification: Sliding Window

Dog? YESCat? NOBackground? NO

Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Page 49: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201849

Object Detection as Classification: Sliding Window

Dog? YESCat? NOBackground? NO

Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Page 50: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201850

Object Detection as Classification: Sliding Window

Dog? NOCat? YESBackground? NO

Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Page 51: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201851

Object Detection as Classification: Sliding Window

Dog? NOCat? YESBackground? NO

Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Problem: Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive!

Page 52: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201852

Region Proposals / Selective Search● Find “blobby” image regions that are likely to contain objects● Relatively fast to run; e.g. Selective Search gives 2000 region

proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014

Page 53: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201853

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 54: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201854

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 55: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201855

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 56: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201856

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 57: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201857

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 58: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201858

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 59: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201859

R-CNN: Problems

• Ad hoc training objectives• Fine-tune network with softmax classifier (log loss)• Train post-hoc linear SVMs (hinge loss)• Train post-hoc bounding-box regressions (least squares)

• Training is slow (84h), takes a lot of disk space

• Inference (detection) is slow• 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]• Fixed by SPP-net [He et al. ECCV14]

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Slide copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 60: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201860

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 61: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201861

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 62: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201862

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 63: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201863

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 64: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201864

Fast R-CNN: RoI Pooling

Hi-res input image:3 x 640 x 480

with region proposal

Hi-res conv features:512 x 20 x 15;

Projected region proposal is e.g.

512 x 18 x 8(varies per proposal)

Fully-connected layers

Divide projected proposal into 7x7 grid, max-pool within each cell

RoI conv features:512 x 7 x 7

for region proposal

Fully-connected layers expect low-res conv features:

512 x 7 x 7

CNN

Project proposal onto features

Girshick, “Fast R-CNN”, ICCV 2015.

Page 65: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201865

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 66: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201866

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 67: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201867

Fast R-CNN(Training)

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 68: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201868

Fast R-CNN(Training)

Girshick, “Fast R-CNN”, ICCV 2015.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.

Page 69: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201869

R-CNN vs SPP vs Fast R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014Girshick, “Fast R-CNN”, ICCV 2015

Page 70: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201870

R-CNN vs SPP vs Fast R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014Girshick, “Fast R-CNN”, ICCV 2015

Problem:Runtime dominated by region proposals!

Page 71: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201871

Faster R-CNN: Make CNN do proposals!

Insert Region Proposal Network (RPN) to predict proposals from features

Jointly train with 4 losses:1. RPN classify object / not object2. RPN regress box coordinates3. Final classification score (object

classes)4. Final box coordinates

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015Figure copyright 2015, Ross Girshick; reproduced with permission

Page 72: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201872

Faster R-CNN: Make CNN do proposals!

Page 73: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201873

Mask R-CNN

He et al, “Mask R-CNN”, arXiv 2017

RoI Align Conv

Classification Scores: C Box coordinates (per class): 4 * C

CNN Conv

Predict a mask for each of C classes

C x 14 x 14

256 x 14 x 14 256 x 14 x 14

Page 74: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201874

Mask R-CNN: Very Good Results!

He et al, “Mask R-CNN”, arXiv 2017Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.

Page 75: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201875

Mask R-CNNAlso does pose

He et al, “Mask R-CNN”, arXiv 2017Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.

Page 76: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201876

Detection without Proposals: YOLO / SSD

Divide image into grid7 x 7

Image a set of base boxes centered at each grid cell

Here B = 3

Input image3 x H x W

Within each grid cell:- Regress from each of the B

base boxes to a final box with 5 numbers:(dx, dy, dh, dw, confidence)

- Predict scores for each of C classes (including background as a class)

Output:7 x 7 x (5 * B + C)

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016

Page 77: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201877

Detection without Proposals: YOLO / SSD

Divide image into grid7 x 7

Image a set of base boxes centered at each grid cell

Here B = 3

Input image3 x H x W

Within each grid cell:- Regress from each of the B

base boxes to a final box with 5 numbers:(dx, dy, dh, dw, confidence)

- Predict scores for each of C classes (including background as a class)

Output:7 x 7 x (5 * B + C)

Go from input image to tensor of scores with one big convolutional network!

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016

Page 78: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201878

Object Detection: Lots of variables ...

Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017

Base NetworkVGG16ResNet-101Inception V2Inception V3Inception ResNetMobileNet

R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017

Object Detection architectureFaster R-CNNR-FCNSSD

Image Size# Region Proposals…

TakeawaysFaster R-CNN is slower but more accurate

SSD is much faster but not as accurate

Page 79: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201879

Object Detection: Impact of Deep Learning

Figure copyright Ross Girshick, 2015. Reproduced with permission.

Page 80: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Open Source Frameworks

Lots of good implementations on GitHub!

TensorFlow Detection API: https://github.com/tensorflow/models/tree/master/research/object_detection Faster RCNN, SSD, RFCN, Mask R-CNN

Caffe2 Detectron: https://github.com/facebookresearch/Detectron Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN

Finetune on your own dataset with pre-trained models

80

Page 81: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201881

Aside: Object Detection + Captioning= Dense Captioning

Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016Figure copyright IEEE, 2016. Reproduced for educational purposes.

Page 82: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201882

Aside: Object Detection + Captioning= Dense Captioning

Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016Figure copyright IEEE, 2016. Reproduced for educational purposes.

Page 83: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201883

Page 84: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Aside: VisualGenome

84

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123, no. 1 (2017): 32-73.

Page 85: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201885

Aside: Scene Graph Generation

Xu, Zhu, Choy, and Fei-Fei, “Scene Graph Generation by Iterative Message Passing”, CVPR 2017Figure copyright IEEE, 2018. Reproduced for educational purposes.

Page 86: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201886

3D Object DetectionSemantic

Segmentation2D Object Detection

GRASS, CAT, TREE, SKY

DOG, DOG, CAT

Object categories + 2D bounding boxesNo objects, just pixels

This image is CC0 public domain

3D Object Detection

Car

Object categories + 3D bounding boxes

Page 87: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Simplified Camera Model

87Image source: https://www.pcmag.com/encyclopedia_images/_FRUSTUM.GIF

image plane

camera viewing frustrum

camera

Page 88: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Simplified Camera Model

88Image source: https://www.pcmag.com/encyclopedia_images/_FRUSTUM.GIF

2D point

3D ray

image plane

camera viewing frustrum

camera

Page 89: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Simplified Camera Model

89Image source: https://www.pcmag.com/encyclopedia_images/_FRUSTUM.GIF

2D point

3D ray

image plane

camera viewing frustrum

camera

Page 90: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Scale & Distance Ambiguity

90

This image is CC0 public domain

viewing frustrum

Image plane

distance

Page 91: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201891Image source: https://www.pcmag.com/encyclopedia_images/_FRUSTUM.GIF

2D point

3D ray

Simplified Camera Model

image plane

camera viewing frustrum

camera

Page 92: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

3D Object Detection

92

This image is CC0 public domain

Page 93: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201893

3D Object Detection

Page 94: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201894

3D Object Detection: Monocular Camera

Chen, Xiaozhi, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. "Monocular 3d object detection for autonomous driving." CVPR 2016.

Faster R-CNN

Page 95: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201895

3D Object Detection: Monocular View

Chen, Xiaozhi, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. "Monocular 3d object detection for autonomous driving." CVPR 2016.

Page 96: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201896

3D Object Detection: Camera + LiDARThis image is CC0 public domain

Velodyne (HDL-64e) 3D Point Cloud

Page 97: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201897

3D Object Detection: Camera + LiDAR

Chen, Xiaozhi, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. "Multi-view 3d object detection network for autonomous driving." CVPR 2017

Page 98: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 201898

3D Object Detection: Camera + LiDAR

Chen, Xiaozhi, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. "Multi-view 3d object detection network for autonomous driving." CVPR 2017

Page 99: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

RGB-Depth Camera

99

This image is CC0 public domain

Page 100: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

RGB-Depth Camera

100

Registered RGB + depth point cloud

Page 101: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

Point Cloud Voxelization

101

(x, y, z, r, g, b)

Page 102: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018

3D Object Detection: RGB-Depth Camera

102

“Faster RCNN in 3D”

RPN

3D Conv

More layers

3D NMS

3D region proposalsVoxelized RGB-D point cloud Object categories + 3D bounding boxes

S. Song, and J. Xiao. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images. CVPR 2016

Page 103: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018103

RecapSemantic

Segmentation2D Object Detection

GRASS, CAT, TREE, SKY

DOG, DOG, CAT

Object categories + 2D bounding boxesNo objects, just pixels

This image is CC0 public domain

3D Object Detection

Car

Object categories + 3D bounding boxes

Page 104: Lecture 11: Detection and Segmentationcs231n.stanford.edu/slides/2018/cs231n_2018_lecture11.pdf · Semantic Segmentation Object Detection Instance Segmentation GRASS, CAT, CAT TREE,

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - May 10, 2018104

Next time:Visualizing CNN featuresDeepDream + Style Transfer