Convolutional Neural Networks IIcs194-26/sp20/Lectures/...Lecture 7 - 6 27 Jan 2016 Case Study:...

Convolutional Neural Networks II

CS194: Image Manipulation, Comp. Vision, and Comp. PhotoAlexei Efros, UC Berkeley, Spring 2020

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 20162

Case Study: LeNet-5[LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1Subsampling (Pooling) layers were 2x2 applied at stride 2i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]

Andrew NgImageNet Challenge (1000 object classes), Fei-Fei et al.


Case Study: AlexNet[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4=>Q: what is the output volume size? Hint: (227-11)/4+1 = 55




First layer (CONV1): 96 11x11 filters applied at stride 4=>Output volume [55x55x96]

Q: What is the total number of parameters in this layer?




First layer (CONV1): 96 11x11 filters applied at stride 4=>Output volume [55x55x96]Parameters: (11*11*3)*96 = 35K



Input: 227x227x3 imagesAfter CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27




Second layer (POOL1): 3x3 filters applied at stride 2Output volume: 27x27x96

Q: what is the number of parameters in this layer?




Second layer (POOL1): 3x3 filters applied at stride 2Output volume: 27x27x96Parameters: 0!



Input: 227x227x3 imagesAfter CONV1: 55x55x96After POOL1: 27x27x96...



Full (simplified) AlexNet architecture:[227x227x3] INPUT[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0[27x27x96] MAX POOL1: 3x3 filters at stride 2[27x27x96] NORM1: Normalization layer[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2[13x13x256] MAX POOL2: 3x3 filters at stride 2[13x13x256] NORM2: Normalization layer[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1[6x6x256] MAX POOL3: 3x3 filters at stride 2[4096] FC6: 4096 neurons[4096] FC7: 4096 neurons[1000] FC8: 1000 neurons (class scores)



Full (simplified) AlexNet architecture:[227x227x3] INPUT[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0[27x27x96] MAX POOL1: 3x3 filters at stride 2[27x27x96] NORM1: Normalization layer[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2[13x13x256] MAX POOL2: 3x3 filters at stride 2[13x13x256] NORM2: Normalization layer[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1[6x6x256] MAX POOL3: 3x3 filters at stride 2[4096] FC6: 4096 neurons[4096] FC7: 4096 neurons[1000] FC8: 1000 neurons (class scores)

Details/Retrospectives: - first use of ReLU- used Norm layers (not common anymore)- heavy data augmentation- dropout 0.5- batch size 128- SGD Momentum 0.9- Learning rate 1e-2, reduced by 10manually when val accuracy plateaus- L2 weight decay 5e-4- 7 CNN ensemble: 18.2% -> 15.4%

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201613

(slide from Kaiming He’s recent presentation)


Case Study: ResNet[He et al., 2015]

224x224x3

spatial dimension only 56x56!


Case Study: ResNet [He et al., 2015]


“You need a lot of a data if you want to train/use CNNs”


Transfer Learning

“You need a lot of a data if you want to train/use CNNs”

Deep Features & their Embeddings

The Unreasonable Effectiveness of Deep Features

Classes separate in the deep representations and transfer to many tasks.[DeCAF] [Zeiler-Fergus]

Can be used as a generic feature (“CNN code” = 4096-D vector before classifier)

query image nearest neighbors in the “code” space

ImageNet + Deep Learning

Beagle

- Image Retrieval- Detection (RCNN)- Segmentation (FCN)- Depth Estimation- …

ImageNet + Deep Learning

Beagle

Pose?

Boundaries?Geometry?Parts?

Materials?


Transfer Learning with CNNs

1. Train on Imagenet




2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier

i.e. swap the Softmax layer at the end






3. If you have medium sized dataset, “finetune”instead: use the old weights as initialization, train the full network or only some of the higher layers

retrain bigger portion of the network, or even all of it.






3. If you have medium sized dataset, “finetune”instead: use the old weights as initialization, train the full network or only some of the higher layers

retrain bigger portion of the network, or even all of it.

tip: use only ~1/10th of the original learning rate in finetuning to player, and ~1/100th on intermediate layers

27

Learning an Embedding

CNN

Embeddingshared representation

CNNShared W

Image 1 Image 2

28

CNN

Matchingshared representation

CNNShared W

Image 1 Image 2

Learning an Embedding

Siamese Network w/ Contrastive Loss

Siamese Architecture[Chopra 2005, Hadsell 2006]

L E A R N I N G V I S U A L S I M I L A R I T YF O R P R O D U C T D E S I G N W I T HC O N V O L U T I O N A L N E U R A L N E T W O R K SS E A N B E L L A N D K A V I T A B A L AC O R N E L L U N I V E R S I T Y

T H E P R O B L E M

Name: ”Great Bowl O’Fire Sculptural Fire Bowl”

(1) “What is this?” (2) “Where is it used?”

Category: Fire pitSold by: John T. Unger, LLC

T H E P R O B L E M

(1) “What is this?” (2) “Where is it used?”

Challenge: determine whether these are the same product(different resolution, viewpoint, color, lighting, occlusions)

Name: ”Great Bowl O’Fire Sculptural Fire Bowl”Category: Fire pitSold by: John T. Unger, LLC

T W O K I N D S O F I M A G E SIconic In context

(From a product website) (Cropped from a scene photo)

P R O J E C T I N G I N T O A J O I N T E M B E D D I N G

Embedding

Iconic In context

S E A R C H U S I N G T H E E M B E D D I N G

Embedding

“What is it?”

Name: Hemel RingCategory: Hanging lightSold by: Holly Hunt

S E A R C H U S I N G T H E E M B E D D I N G

Embedding

“Where is it used?”

Embedding

C O N T R A S T I V E L O S S : P O S I T I V E E X A M P L E

Parameters θ

Iconic (same)

In context

CNN

CNN

Loss Lp

xp

xq

Embedding

C O N T R A S T I V E L O S S : N E G A T I V E E X A M P L E

In context

Loss Ln

Margin m

Iconic (different)

CNN

CNN

Parameters θxq

xn

C O N T R A S T I V E L O S S : A L L T O G E T H E R

[Chopra 2005, Hadsell 2006]

Minimize L(θ) with stochastic gradient descent and momentum

Margin

T R A I N I N G P I P E L I N E

StochasticGradientDescent

CNN Embedding

Image pairsCNN Parameters

θ

Image database

R E S U L T S : “ W H A T I S I T ? ”

In context


In context Iconic Top 4 results:


In context

C O M P A R I S O N : T R A I N E D O N L Y O N C A T E G O R I E S

IconicIn context Top 4 results:

IconicIn context Top 4 results:

C O M P A R I S O N : T R A I N E D O N L Y O N I M A G E N E T

R E S U L T S : F A I L U R E C A S E

In context

R E S U L T S : F A I L U R E C A S E


“Maskros Pendant Lamp”

R E S U L T S : “ W H E R E I S I T U S E D ? ”

R E S U L T S : “ W H E R E I S I T U S E D ? ”

"LEM Piston Stool | Design Within Reach”

S E A R C H I N G A C R O S S C A T E G O R I E S

Color distribution cross-entropy loss with colorfulness enhancing term.

Zhang et al. 2016

[Zhang, Isola, Efros, ECCV 2016]

Designing loss functionsInput Ground truth

Image colorization

Cross entropy loss, with colorfulness term

“semantic feature loss” (VGG feature covariance matching objective)

[Johnson et al. 2016]

Super-resolution[Zhang et al. 2016]

Designing loss functions

Universal loss?

… …

…

Generated vs Real(classifier)

[Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio 2014]

Generative Adversarial Network(GANs)

Real photos

Generated images

…

…

Generator

[Goodfellow et al., 2014]

G tries to synthesize fake images that fool D

D tries to identify the fakes

Generator Discriminator

real or fake?


fake (0.9)

real (0.1)


G tries to synthesize fake images that foolD:

real or fake?


G tries to synthesize fake images that fool the bestD:

real or fake?


Loss Function

G’s perspective: D is a loss function.

Rather than being hand-designed, it is learned.

[Isola et al., 2017][Goodfellow et al., 2014]

+ L1

real or fake?


real!(“Aquarius”)


real or fake pair ?

[Goodfellow et al., 2014][Isola et al., 2017]

fake pair


real pair


real or fake pair ?


BW → Color

Data from [Russakovsky et al. 2015]

Input Output Groundtruth

Data from[maps.google.com]

http://maps.google.com

Input Output Groundtruth

Data from [maps.google

http://maps.google.com

Labels → FacadesInput Output

Data from [Tylecek, 2013]

Labels → FacadesInput Output Input Output

Data from [Tylecek, 2013]

Day → NightInput Output Input Output Input Output

Data from [Laffont et al., 2014]

Thermal → RGB

Edges → ImagesInput Output Input Output Input Output

Edges from [Xie & Tu, 2015]

Sketches → ImagesInput Output Input Output Input Output

Trained on Edges → ImagesData from [Eitz, Hays, Alexa, 2012]

#edges2cats [Christopher Hesse]

Ivy Tasi @ivymyt

Vitaly Vidmirov @vvid

@gods_tail

@ka92

Twitter-driven research: #pix2pix

Bertrand Gondouin @bgondouin

Brannon Dorsey @brannondorsey

Mario Klingemann @quasimondo

Scott Eaton (http://www.scott-eaton.com/)

http://www.scott-eaton.com/

“Do as I Do”

OpenPose

pix2pix

Everybody Dance NowCaroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. EfrosUC Berkeley

Source Subject Target Subject

Results

https://www.youtube.com/watch?v=PCBTZh41Ris&feature=youtu.be

https://www.youtube.com/watch?v=PCBTZh41Ris&feature=youtu.be

CEO: our own Dr. Tinghui Zhou

…

Paired training examples Unpaired training examples

… …

…

CycleGAN, or “there and back aGAN”

[Zhu*, Park*, Isola, Efros. ICCV 2017]

……

Cycle-Consistency LossG(x) F(G x )x

F G x − x 1

G(x) F(G x )x F(y) G(F x )𝑦𝑦

Cycle-Consistency Loss

F G x − x 1 G F y − 𝑦𝑦 1

CG to RealGrand Theft Auto

Real to CG

Shallower depth of field

Failure case

A Neural Algorithmof Artistic Style

Gatys, Ecker, Bethge (arXiv 2015)

Van Gogh (1889)

Picasso (1910)

Munch (1893)

Turner (1805)

Kandinsky (1913)

Early Vision Texture Models

2.37.13.8

Heeger & Bergen (1995) Portilla & Simoncelli (2000)

Linear filter bank

Heeger & Bergen, SIGGRAPH‘95Start with a noise image as output Main loop:

• Match pixel histogram of output image to input

• Decompose input and output images using multi-scale filter bank (Steerable Pyramid)

• Match subband histograms of input and output pyramids

• Reconstruct input and output images (collapse the pyramids)

Heeger, Bergen, Pyramid-based texture analysis/synthesis, SIGGRAPH 1995

Multi-scale filter decomposition

Filter bank

Input image

Filter response histograms

Simoncelli & Portilla ’98+

Match joint histograms of pairs of filter responses at adjacent spatial locations, orientations, and scales.

Optimize using repeated projections onto statistical constraint sufraces

Texture SynthesisImage Space Model Space

Images with equalmodel response

Portilla & Simoncelli (2000)

Convolutional Neural Network Texture Model

2.37.13.8

Convolutional Neural Network

Gatys et al. (NIPS 2015)

CNN - Multiscale Filter Bank

conv1_1

pool4pool3

pool2

pool1

64

# features

64

128

256

512

CNN - Texture Features

CNN - Texture Features

64

# features

64

128

256

512

Gram Matrices

Texture Synthesis

Test Julesz’ Conjecture

CNN - Texture Synthesis

Gatys et al. (NIPS 2015)

Artistic Style Transfer

Relative Weighting of Content and Style1e-4

1e-2 1e-1

1e-3

Different Reconstruction Layers

Conv2_2 Conv4_2

Different Reconstruction Layers

Conv2_2Original Conv4_2

General Style Transfer

Convolutional Neural Networks IIcs194-26/sp20/Lectures/...Lecture 7 - 6 27 Jan 2016 Case Study:...

Documents

Transcript of Convolutional Neural Networks IIcs194-26/sp20/Lectures/...Lecture 7 - 6 27 Jan 2016 Case Study:...