Convolutional Neural Networkscs194-26/sp20/Lectures/... · 2020. 3. 14. · The brain/neuron view...

Convolutional Neural Networks

CS194: Image Manipulation, Comp. Vision, and Comp. PhotoAlexei Efros, UC Berkeley, Spring 2020

image X

Convolutional Neural

Network“Penguin”

label Y

Neural Nets: a particularly useful Black Box

image X

“Penguin”

label Y

Classic Object Recognition

Edges

Texture

Colors

Segments

Parts

Slide by Philip Isola

Feature extractors

Classifier

image X

“Penguin”

label Y

Classic Object Recognition

Edges

Texture

Colors

Segments

Parts


Feature extractors

Classifier

Learned

image X label Y

Learning Features

Edges

Texture

Colors

Feature extractorsLearned

image X

“Penguin”

label Y

Neural Network:algorithm + feature + data!


Learned

Vanilla (fully-connected) Neural Networks

8

Example: 200x200 image40K hidden units

~2B parameters!!!

- Spatial correlation is local- Waste of resources + we have not enough

training samples anyway..

Fully Connected Layer

Ranzato

9

Locally Connected Layer

Example: 200x200 image40K hidden unitsFilter size: 10x10

4M parameters

Ranzato

Note: This parameterization is good when input image is registered (e.g., face recognition).

10

STATIONARITY? Statistics is similar at different locations

Ranzato

Locally Connected Layer

Example: 200x200 image40K hidden unitsFilter size: 10x10

4M parameters

11

Convolutional Layer

Share the same parameters across different locations (assuming input is stationary):Convolutions with learned kernels

Ranzato

Convolutional Layer

Ranzato

28

Learn multiple filters.

E.g.: 200x200 image100 FiltersFilter size: 10x10

10K parameters

Ranzato

Convolutional Layer

Fei-Fei Li & Andrej Karpathy Lecture 7 - 21 Jan 201529

before:

now:

input layer hidden layer

output layer

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201630

32

32

3

Convolution Layer32x32x3 image

width

height

depth


32

32

3

Convolution Layer

5x5x3 filter

32x32x3 image

Convolve the filter with the imagei.e. “slide over the image spatially, computing dot products”


32

32

3

Convolution Layer

5x5x3 filter

32x32x3 image

Convolve the filter with the imagei.e. “slide over the image spatially, computing dot products”

Filters always extend the full depth of the input volume


32

32

3

Convolution Layer32x32x3 image5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image(i.e. 5*5*3 = 75-dimensional dot product + bias)


32

32

3


convolve (slide) over all spatial locations

activation map

1

28

28


32

32

3



activation maps

1

28

28

consider a second, green filter


32

32

3

Convolution Layer

activation maps

6

28

28

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

We stack these up to get a “new image” of size 28x28x6!


convolving the first filter in the input gives the first slice of depth in output volume


Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions

32

32

3

28

28

6

CONV,ReLUe.g. 6 5x5x3 filters


Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32

32

3

CONV,ReLUe.g. 6 5x5x3 filters 28

28

6


CONV,ReLU

….

10

24

24


A closer look at spatial dimensions:

32

32

3

32x32x3 image5x5x3 filter


activation map

1

28

28


7x7 input (spatially)assume 3x3 filter

7

7




7

7




=> 5x5 output

7

7



7x7 input (spatially)assume 3x3 filterapplied with stride 2

7

7



7x7 input (spatially)assume 3x3 filterapplied with stride 2=> 3x3 output!

7

7



7x7 input (spatially)assume 3x3 filterapplied with stride 3?

7

7



7x7 input (spatially)assume 3x3 filterapplied with stride 3?

7

7


doesn’t fit! cannot apply 3x3 filter on 7x7 input with stride 3.


N

NF

F

Output size:(N - F) / stride + 1

e.g. N = 7, F = 3:stride 1 => (7 - 3)/1 + 1 = 5stride 2 => (7 - 3)/2 + 1 = 3stride 3 => (7 - 3)/3 + 1 = 2.33 :\


In practice: Common to zero pad the border0 0 0 0 0 0

0

0

0

0

e.g. input 7x73x3 filter, applied with stride 1 pad with 1 pixel border => what is the output?

(recall:)(N - F) / stride + 1


In practice: Common to zero pad the border


7x7 output!

0 0 0 0 0 0

0

0

0

0


In practice: Common to zero pad the border


7x7 output!in general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2. (will preserve size spatially)e.g. F = 3 => zero pad with 1

F = 5 => zero pad with 2F = 7 => zero pad with 3

0 0 0 0 0 0

0

0

0

0


Remember back to… E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!(32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn’t work well.

32

32

3

CONV,ReLUe.g. 6 5x5x3 filters 28

28

6


CONV,ReLU

….

10

24

24


(btw, 1x1 convolution layers make perfect sense)

6456

561x1 CONVwith 32 filters

3256

56(each filter has size 1x1x64, and performs a 64-dimensional dot product)


The brain/neuron view of CONV Layer

32

32

3


1 number: the result of taking a dot product between the filter and this part of the image(i.e. 5*5*3 = 75-dimensional dot product)



32

32

3


1 number: the result of taking a dot product between the filter and this part of the image(i.e. 5*5*3 = 75-dimensional dot product)

It’s just a neuron with local connectivity...



32

32

3

An activation map is a 28x28 sheet of neuron outputs:1. Each is connected to a small region in the input2. All of them share parameters

“5x5 filter” -> “5x5 receptive field for each neuron”28

28



32

32

3

28

28

E.g. with 5 filters,CONV layer consists of neurons arranged in a 3D grid(28x28x5)

There will be 5 different neurons all looking at the same region in the input volume5

61

Let us assume filter is an “eye” detector.

Q.: how can we make the detection robust to the exact location of the eye?

Pooling Layer

Ranzato

62

By “pooling” (e.g., taking max) filterresponses at different locations we gainrobustness to the exact spatial locationof features.

Ranzato

Pooling Layer


Pooling layer- makes the representations smaller and more manageable - operates over each activation map independently:


1 1 2 4

5 6 7 8

3 2 1 0

1 2 3 4

Single depth slice

x

y

max pool with 2x2 filters and stride 2 6 8

3 4

MAX POOLING

65

ConvNets: Typical Stage

Convol. Pooling

One stage (zoom)

courtesy ofK. Kavukcuoglu Ranzato

Neural network

“clown fish”

Learned

Deep neural network

“clown fish”

Learned

Computation in a neural net

Input representation

Output representation


Rectified linear unit (ReLU)


Last layer

…

…

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

“clown fish”argmax

Learning with deep nets

“clown fish”

Learned


“clown fish”

“grizzly bear”

“chameleon”

Train network to associate the right

label with each image


“clown fish”

Loss

Learned

Loss function

“clown fish”

Loss error

Network output

…

…

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

Ground truth label

Loss function

“clown fish”

Loss small

Network output

…

…

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

Ground truth label

Loss function

“grizzly bear”

Loss large

Network output

…

…

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

Ground truth label

Loss function for classificationNetwork output

…

…

dolphin

cat

grizzly bear

angel fish

chameleon

clown fish

Ground truth label

…

iguana

elephant

Probability of the observed data under

the model

Results in learning a probability model !


“clown fish”

Loss

Learned


“grizzly bear”

Loss

Learned


“chameleon”

Loss

Learned

Gradient descent

One iteration of gradient descent:

learning rate

Gradient descent

Texture synthesis by non-parametric sampling

p

Synthesizing a pixel

non-parametricsampling

Input image

[Efros & Leung 1999]

Models

Input partial image

Predicted color of next

pixel

Texture synthesis with a deep net

“white”

PixelCNN [van der Oord et al. 2016]

Texture synthesis with a deep net

“white”

[van der Oord et al. 2016]

…

Sampling

Network output

…

…

turquoise

blue

green

red

orange

gray

black

white

p

Sampling

Network output

p

…

…

turquoise

blue

green

red

orange

gray

black

white

prob

abilit

y

Sampling

Network output

prob

abilit

y

…

…

turquoise

blue

green

red

orange

gray

black

white

Sampling

Network output

…

…

turquoise

blue

green

red

orange

gray

black

white

prob

abilit

y

“yellow”

…

Discriminative Deep Networks

96

“Rockfish”

Discriminative Deep Networks

97

Raw, Unlabeled Pixels

Generative Deep Networks

98

Raw, Unlabeled Pixels

Ansel Adams. Yosemite Valley Bridge.99

Grayscale image: L channel Color information: ab channels

abL

Zhang, Isola, Efros. Colorful Image Colorization. In ECCV, 2016.100

Concatenate (L,ab) channels

101

Grayscale image: L channel

Zhang, Isola, Efros. Colorful Image Colorization. In ECCV, 2016.

abL

Input Output Ground truth

Simple L2 regression doesn’t work

Better Loss Function Colors in ab space(discrete)

104

• Regression with L2 loss inadequate

• Use per-pixel multinomial classification

Color distribution cross-entropy loss with colorfulness enhancing term.

Zhang et al. 2016

[Zhang, Isola, Efros, ECCV 2016]

Designing loss functionsInput Ground truth

Convolutional Neural Networkscs194-26/sp20/Lectures/... · 2020. 3. 14. · The brain/neuron view...

Documents

Transcript of Convolutional Neural Networkscs194-26/sp20/Lectures/... · 2020. 3. 14. · The brain/neuron view...