Convolutional Neural Networkscs194-26/sp20/Lectures/... · 2020. 3. 14. · The brain/neuron view...

Post on 22-Aug-2021

1 views 0 download

Transcript of Convolutional Neural Networkscs194-26/sp20/Lectures/... · 2020. 3. 14. · The brain/neuron view...

Convolutional Neural Networks

CS194: Image Manipulation, Comp. Vision, and Comp. PhotoAlexei Efros, UC Berkeley, Spring 2020

image X

Convolutional Neural

Network“Penguin”

label Y

Neural Nets: a particularly useful Black Box

image X

“Penguin”

label Y

Classic Object Recognition

Edges

Texture

Colors

Segments

Parts

Slide by Philip Isola

Feature extractors

Classifier

image X

“Penguin”

label Y

Classic Object Recognition

Edges

Texture

Colors

Segments

Parts

Slide by Philip Isola

Feature extractors

Classifier

Learned

image X label Y

Learning Features

Edges

Texture

Colors

Feature extractorsLearned

image X

“Penguin”

label Y

Neural Network:algorithm + feature + data!

Slide by Philip Isola

Learned

Vanilla (fully-connected) Neural Networks

8

Example: 200x200 image40K hidden units

~2B parameters!!!

- Spatial correlation is local- Waste of resources + we have not enough

training samples anyway..

Fully Connected Layer

Ranzato

9

Locally Connected Layer

Example: 200x200 image40K hidden unitsFilter size: 10x10

4M parameters

Ranzato

Note: This parameterization is good when input image is registered (e.g., face recognition).

10

STATIONARITY? Statistics is similar at different locations

Ranzato

Locally Connected Layer

Example: 200x200 image40K hidden unitsFilter size: 10x10

4M parameters

11

Convolutional Layer

Share the same parameters across different locations (assuming input is stationary):Convolutions with learned kernels

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

Convolutional Layer

Ranzato

28

Learn multiple filters.

E.g.: 200x200 image100 FiltersFilter size: 10x10

10K parameters

Ranzato

Convolutional Layer

Fei-Fei Li & Andrej Karpathy Lecture 7 - 21 Jan 201529

before:

now:

input layer hidden layer

output layer

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201630

32

32

3

Convolution Layer32x32x3 image

width

height

depth

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201631

32

32

3

Convolution Layer

5x5x3 filter

32x32x3 image

Convolve the filter with the imagei.e. “slide over the image spatially, computing dot products”

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201632

32

32

3

Convolution Layer

5x5x3 filter

32x32x3 image

Convolve the filter with the imagei.e. “slide over the image spatially, computing dot products”

Filters always extend the full depth of the input volume

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201633

32

32

3

Convolution Layer32x32x3 image5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image(i.e. 5*5*3 = 75-dimensional dot product + bias)

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201634

32

32

3

Convolution Layer32x32x3 image5x5x3 filter

convolve (slide) over all spatial locations

activation map

1

28

28

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201635

32

32

3

Convolution Layer32x32x3 image5x5x3 filter

convolve (slide) over all spatial locations

activation maps

1

28

28

consider a second, green filter

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201636

32

32

3

Convolution Layer

activation maps

6

28

28

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

We stack these up to get a “new image” of size 28x28x6!

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201637

convolving the first filter in the input gives the first slice of depth in output volume

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201638

Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions

32

32

3

28

28

6

CONV,ReLUe.g. 6 5x5x3 filters

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201639

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32

32

3

CONV,ReLUe.g. 6 5x5x3 filters 28

28

6

CONV,ReLUe.g. 10 5x5x6 filters

CONV,ReLU

….

10

24

24

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201640

A closer look at spatial dimensions:

32

32

3

32x32x3 image5x5x3 filter

convolve (slide) over all spatial locations

activation map

1

28

28

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201641

7x7 input (spatially)assume 3x3 filter

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201642

7x7 input (spatially)assume 3x3 filter

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201643

7x7 input (spatially)assume 3x3 filter

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201644

7x7 input (spatially)assume 3x3 filter

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201645

7x7 input (spatially)assume 3x3 filter

=> 5x5 output

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201646

7x7 input (spatially)assume 3x3 filterapplied with stride 2

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201647

7x7 input (spatially)assume 3x3 filterapplied with stride 2

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201648

7x7 input (spatially)assume 3x3 filterapplied with stride 2=> 3x3 output!

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201649

7x7 input (spatially)assume 3x3 filterapplied with stride 3?

7

7

A closer look at spatial dimensions:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201650

7x7 input (spatially)assume 3x3 filterapplied with stride 3?

7

7

A closer look at spatial dimensions:

doesn’t fit! cannot apply 3x3 filter on 7x7 input with stride 3.

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201651

N

NF

F

Output size:(N - F) / stride + 1

e.g. N = 7, F = 3:stride 1 => (7 - 3)/1 + 1 = 5stride 2 => (7 - 3)/2 + 1 = 3stride 3 => (7 - 3)/3 + 1 = 2.33 :\

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201652

In practice: Common to zero pad the border0 0 0 0 0 0

0

0

0

0

e.g. input 7x73x3 filter, applied with stride 1 pad with 1 pixel border => what is the output?

(recall:)(N - F) / stride + 1

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201653

In practice: Common to zero pad the border

e.g. input 7x73x3 filter, applied with stride 1 pad with 1 pixel border => what is the output?

7x7 output!

0 0 0 0 0 0

0

0

0

0

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201654

In practice: Common to zero pad the border

e.g. input 7x73x3 filter, applied with stride 1 pad with 1 pixel border => what is the output?

7x7 output!in general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2. (will preserve size spatially)e.g. F = 3 => zero pad with 1

F = 5 => zero pad with 2F = 7 => zero pad with 3

0 0 0 0 0 0

0

0

0

0

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201655

Remember back to… E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!(32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn’t work well.

32

32

3

CONV,ReLUe.g. 6 5x5x3 filters 28

28

6

CONV,ReLUe.g. 10 5x5x6 filters

CONV,ReLU

….

10

24

24

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201656

(btw, 1x1 convolution layers make perfect sense)

6456

561x1 CONVwith 32 filters

3256

56(each filter has size 1x1x64, and performs a 64-dimensional dot product)

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201657

The brain/neuron view of CONV Layer

32

32

3

32x32x3 image5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image(i.e. 5*5*3 = 75-dimensional dot product)

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201658

The brain/neuron view of CONV Layer

32

32

3

32x32x3 image5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image(i.e. 5*5*3 = 75-dimensional dot product)

It’s just a neuron with local connectivity...

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201659

The brain/neuron view of CONV Layer

32

32

3

An activation map is a 28x28 sheet of neuron outputs:1. Each is connected to a small region in the input2. All of them share parameters

“5x5 filter” -> “5x5 receptive field for each neuron”28

28

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201660

The brain/neuron view of CONV Layer

32

32

3

28

28

E.g. with 5 filters,CONV layer consists of neurons arranged in a 3D grid(28x28x5)

There will be 5 different neurons all looking at the same region in the input volume5

61

Let us assume filter is an “eye” detector.

Q.: how can we make the detection robust to the exact location of the eye?

Pooling Layer

Ranzato

62

By “pooling” (e.g., taking max) filterresponses at different locations we gainrobustness to the exact spatial locationof features.

Ranzato

Pooling Layer

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201663

Pooling layer- makes the representations smaller and more manageable - operates over each activation map independently:

Lecture 7 - 27 Jan 2016Fei-Fei Li & Andrej Karpathy & Justin JohnsonFei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 201664

1 1 2 4

5 6 7 8

3 2 1 0

1 2 3 4

Single depth slice

x

y

max pool with 2x2 filters and stride 2 6 8

3 4

MAX POOLING

65

ConvNets: Typical Stage

Convol. Pooling

One stage (zoom)

courtesy ofK. Kavukcuoglu Ranzato

Neural network

“clown fish”

Learned

Deep neural network

“clown fish”

Learned

Computation in a neural net

Input representation

Output representation

Computation in a neural net

Input representation

Output representation

Computation in a neural net

Rectified linear unit (ReLU)

Computation in a neural net

Last layer

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

“clown fish”argmax

Learning with deep nets

“clown fish”

Learned

Learning with deep nets

“clown fish”

“grizzly bear”

“chameleon”

Train network to associate the right

label with each image

Learning with deep nets

“clown fish”

Loss

Learned

Loss function

“clown fish”

Loss error

Network output

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

Ground truth label

Loss function

“clown fish”

Loss small

Network output

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

Ground truth label

Loss function

“grizzly bear”

Loss large

Network output

dolphin

cat

grizzly bear

angel fish

chameleon

iguana

elephant

clown fish

Ground truth label

Loss function for classificationNetwork output

dolphin

cat

grizzly bear

angel fish

chameleon

clown fish

Ground truth label

iguana

elephant

Probability of the observed data under

the model

Results in learning a probability model !

Learning with deep nets

“clown fish”

Loss

Learned

Learning with deep nets

“grizzly bear”

Loss

Learned

Learning with deep nets

“chameleon”

Loss

Learned

Gradient descent

One iteration of gradient descent:

learning rate

Gradient descent

Texture synthesis by non-parametric sampling

p

Synthesizing a pixel

non-parametricsampling

Input image

[Efros & Leung 1999]

Models

Input partial image

Predicted color of next

pixel

Texture synthesis with a deep net

“white”

PixelCNN [van der Oord et al. 2016]

Texture synthesis with a deep net

“white”

[van der Oord et al. 2016]

Sampling

Network output

turquoise

blue

green

red

orange

gray

black

white

p

Sampling

Network output

p

turquoise

blue

green

red

orange

gray

black

white

prob

abilit

y

Sampling

Network output

prob

abilit

y

turquoise

blue

green

red

orange

gray

black

white

Sampling

Network output

prob

abilit

y

turquoise

blue

green

red

orange

gray

black

white

Sampling

Network output

turquoise

blue

green

red

orange

gray

black

white

prob

abilit

y

Sampling

Network output

turquoise

blue

green

red

orange

gray

black

white

prob

abilit

y

“yellow”

Discriminative Deep Networks

96

“Rockfish”

Discriminative Deep Networks

97

Raw, Unlabeled Pixels

Generative Deep Networks

98

Raw, Unlabeled Pixels

Ansel Adams. Yosemite Valley Bridge.99

Grayscale image: L channel Color information: ab channels

abL

Zhang, Isola, Efros. Colorful Image Colorization. In ECCV, 2016.100

Concatenate (L,ab) channels

101

Grayscale image: L channel

Zhang, Isola, Efros. Colorful Image Colorization. In ECCV, 2016.

abL

Input Output Ground truth

Simple L2 regression doesn’t work

Better Loss Function Colors in ab space(discrete)

104

• Regression with L2 loss inadequate

• Use per-pixel multinomial classification

105

Color distribution cross-entropy loss with colorfulness enhancing term.

Zhang et al. 2016

[Zhang, Isola, Efros, ECCV 2016]

Designing loss functionsInput Ground truth