
Convolution and Pooling

Benjamin Roth, Nina Poerner

CIS LMU München

November 28, 2018


Outline

1 Convolutional Neural Networks
  Convolution with one filter
  Convolution with N filters
  What does convolution do?

2 Pooling

3 Application to NLP

4 Comparison: RNN vs. CNN


Convolutional Neural Networks (CNNs)

Technique from Computer Vision (e.g., object recognition in images)

Alternative to RNNs for many (not all) NLP tasks

General idea: Filter bank with N learnable filters


Convolution with one filter

For now, assume we have only one filter

Move filter over input with step size (stride) s (here: 1)

At every position, multiply filter and input entries together (elementwise), and sum the results into a single value

Input (3 × 4):
 -1  -2   1   0
 -3   0  -1   0
  1   0   0  -1

Filter (2 × 2):
  0   1
 -1   2

Output (2 × 3):
  1  -1   1
 -1  -1  -2
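
As a sanity check of the example above, here is a minimal NumPy sketch (not from the original slides; the matrices are reconstructed from the figure) that slides the 2 × 2 filter over the 3 × 4 input with stride 1 and reproduces the output:

import numpy as np

# input and filter as reconstructed from the figure above
x = np.array([[-1, -2,  1,  0],
              [-3,  0, -1,  0],
              [ 1,  0,  0, -1]])
f = np.array([[ 0,  1],
              [-1,  2]])

fh, fw = f.shape
out = np.zeros((x.shape[0] - fh + 1, x.shape[1] - fw + 1), dtype=int)
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # elementwise product of the current window with the filter, summed to one value
        out[i, j] = np.sum(x[i:i + fh, j:j + fw] * f)

print(out)   # [[ 1 -1  1]
             #  [-1 -1 -2]]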


Building an edge detector filter

Assume that -1 means black and +1 means white

We want to build a filter that can detect diagonal edges where the upper left side is dark and the lower right side is bright

= a filter that calculates a high positive number on windows that look like this:

[Figure: example window with a diagonal edge, dark on the upper left and bright on the lower right]

How would the filter react to this window?

In CNNs, the filters are not manually chosen, but learned with gradient descent
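
A toy illustration of this idea (my own example values, not from the slides): if the filter weights themselves follow the dark-upper-left / bright-lower-right pattern, the elementwise-multiply-and-sum is large on a matching window and zero on a uniform one:

import numpy as np

# hypothetical 2 x 2 "diagonal edge" filter:
# negative where we expect dark, positive where we expect bright
edge_filter = np.array([[-1.0, 0.0],
                        [ 0.0, 1.0]])

# window with a dark upper left and bright lower right (values in [-1, 1])
edge_window = np.array([[-1.0, 0.0],
                        [ 0.0, 1.0]])
# uniformly gray window
flat_window = np.zeros((2, 2))

print(np.sum(edge_filter * edge_window))  # 2.0 -> the filter "fires"
print(np.sum(edge_filter * flat_window))  # 0.0 -> no response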


Convolution with one filter: Tensor sizes

Most images are not 2D but 3D
  3rd dimension is # channels, e.g., RGB values
  image height × image width × # channels

As a consequence, each filter is also 3D
  filter height × filter width × # channels

The operation stays the same, with an additional summation over the channel dimension
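
A minimal sketch of the 3D case, assuming NumPy and a channels-last layout (height × width × channels); the only change is that the sum now also runs over the channel axis:

import numpy as np

H, W, C = 5, 5, 3        # e.g., a tiny RGB image
fh, fw = 2, 2            # filter height and width

x = np.random.randn(H, W, C)     # 3D input: height x width x channels
f = np.random.randn(fh, fw, C)   # 3D filter: height x width x channels

out = np.zeros((H - fh + 1, W - fw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # elementwise product over height, width AND channels, then sum
        out[i, j] = np.sum(x[i:i + fh, j:j + fw, :] * f)

print(out.shape)   # (4, 4): one number per position, channels summed out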


Convolution with N filters

Apply N different filters of the same size → N matrices with the same size

Stack the N matrices on top of each other → 3D tensor, where the last dimension is N

Also known as a feature map

Feature map is slightly smaller than input (why?)

Because a filter of size k fits into an input of size h only h − k + 1 times

... unless we pad the input with zeros
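
(In the example above this matches the observed sizes: a 2 × 2 filter fits a 3 × 4 input (3 − 2 + 1) × (4 − 2 + 1) = 2 × 3 times, so each filter produces a 2 × 3 output.)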


Convolution with N filters: Tensor sizes

Tensor sizes:
  Input 3D: input height × input width × # channels (if this is the first layer, otherwise # filters of previous layer)
  Parameter tensor 4D: filter height × filter width × # channels × # filters
  Output 3D: input height* × input width* × # filters
  *height and width are slightly reduced by convolution unless we do padding
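
These sizes can be checked with a small Keras sketch (the concrete numbers below are made up for illustration and are not from the slides):

from keras.layers import Input, Conv2D
from keras.models import Model

inp = Input(shape=(32, 32, 3))                    # input: 32 x 32 image, 3 channels
out = Conv2D(filters=8, kernel_size=(5, 5))(inp)  # 8 filters of size 5 x 5, no padding
model = Model(inp, out)

print(model.output_shape)                       # (None, 28, 28, 8): reduced height/width, # filters last
print(model.layers[-1].get_weights()[0].shape)  # (5, 5, 3, 8): filter h x w x # channels x # filters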


What does convolution do?

Contextualization: Feature vector computed for position (i, j) contains info from (i − k, j − k) to (i + k, j + k) (where k is the filter size).

Locality-preserving: In one convolution layer, info can travel no further than k positions

Computer Vision: Many convolutional layers applied one after another

Typical nonlinearity between convolution layers: ReLU

With every layer, feature maps become more complex

Pixels → edges → shapes → small objects → bigger, compositional objects


Convolution

Source: Computer science: The learning machines. Nature (2014).


Pooling

Often applied between convolution steps

Divide feature map into “grid”

Combine vectors inside the same grid cell with some operator

Most popular: Average pooling, Max pooling

Max pooling: only select maximum value for each dimension

“Feature detector”, “Cat neuron fires”

Feature map (eight positions, each a 2-dimensional vector), divided into two cells:

  Cell 1: (2, 1)  (1, 0)  (0, 4)  (0, 5)
  Cell 2: (2, 0)  (2, 0)  (2, 3)  (1, 0)

Max pooled (per dimension): (2, 5)  (2, 3)
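
A minimal NumPy sketch of this per-dimension max pooling, using the cell values reconstructed above:

import numpy as np

# feature map: 8 positions, each a 2-dimensional feature vector,
# split into two pooling cells of 4 positions each
cell_1 = np.array([[2, 1], [1, 0], [0, 4], [0, 5]])
cell_2 = np.array([[2, 0], [2, 0], [2, 3], [1, 0]])

# max pooling: take the maximum of each dimension within a cell
print(cell_1.max(axis=0))   # [2 5]
print(cell_2.max(axis=0))   # [2 3]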


Convolution and Pooling: LeNet

LeCun et al. (1998). Gradient-based learning applied to document recognition.


Convolution for NLP

Images have width and height, but text only has “width” (length)

→ We can discard the “height” dimension from our filters

Tensor sizes (in NLP):
  Input 2D: sentence length × # channels (word embedding size, or # filters of previous convolution)
  Parameter tensor 3D: filter length × # channels × # filters
  Output 2D: sentence length* × # filters
  *length slightly reduced unless we do padding

Computer vision: 2D convolution (over height and width)

NLP: 1D convolution (over length)

Typically fewer convolutional layers than in Computer Vision


Pooling for NLP

Pooling between convolutional layers less frequently used than in Computer Vision

After last convolutional layer: “global” pooling step

Calculate max/average over the entire sequence (“pooling over time”)
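
A minimal sketch of pooling over time (NumPy, made-up numbers): the sequence dimension is reduced away entirely, leaving one value per filter:

import numpy as np

# output of a convolution layer: sequence length 6, 3 filters
feature_map = np.random.randn(6, 3)

# global max pooling over time: one maximum per filter, the length dimension disappears
pooled = feature_map.max(axis=0)
print(pooled.shape)   # (3,)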


Convolution and Pooling for NLP

What is the unpadded input size (= length)? 6
What is the padded input size? 8
How many filters? 3
How many input channels (= word vector dimensions)? 4
What is the filter size (= filter width)? 3
What stride (= step size)? 1
What is the output size of the convolution operation? 6 × 3
What is the output size of the pooling operation? 3
How many parameters have to be learned? 3 × 3 × 4 = 36
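
These answers can be verified with a small Keras sketch (my reconstruction of the example: a 6-word input with 4-dimensional word vectors, three filters of width 3, "same" padding so the length-6 input is padded to 8, and no bias term so that only the 3 × 3 × 4 filter weights are counted):

from keras.layers import Input, Conv1D, GlobalMaxPooling1D
from keras.models import Model

inp = Input(shape=(6, 4))                           # length 6, 4 channels (word vector dims)
conv = Conv1D(filters=3, kernel_size=3,
              padding="same", use_bias=False)(inp)  # "same" padding: input padded from 6 to 8
pool = GlobalMaxPooling1D()(conv)

print(Model(inp, conv).output_shape)    # (None, 6, 3): convolution output
print(Model(inp, pool).output_shape)    # (None, 3): pooling output
print(Model(inp, conv).count_params())  # 36 = 3 x 3 x 4 parameters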


Convolution and Pooling for NLP

Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of ConvNets for Sentence Classification.


Convolution and Pooling for NLP

Source: Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification.


# binary classifier, e.g., sentiment polarity

from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from keras.models import Sequential

# turn word indices into EMB_DIM-dimensional word vectors
emb_layer = Embedding(input_dim = VOCAB_SIZE, output_dim = EMB_DIM)

# 1D convolution over the sequence, with ReLU nonlinearity
conv_layer = Conv1D(filters = NUM_FILTERS, kernel_size = FILTER_WIDTH,
                    activation = "relu")

# "pooling over time": maximum over the entire sequence, per filter
pool_layer = GlobalMaxPooling1D()

# binary output (e.g., positive vs. negative polarity)
dense_layer = Dense(units = 1, activation = "sigmoid")

model = Sequential(layers = [emb_layer, conv_layer, pool_layer, dense_layer])
model.compile(loss = "binary_crossentropy", optimizer = "sgd")

X, Y = load_data()  # X: padded word-index matrix, Y: 0/1 labels
model.fit(X, Y)


RNN vs. CNN

Range
  CNN: Cannot capture dependencies with range above k × L (where k is the filter width and L is the number of layers)
  RNN: Can capture long-range dependencies

Information transport
  RNN: Must learn to "transport" salient information across many time steps.
  CNN: No information transport across time, salient information "fast-tracked" by global max pooling

Efficiency
  RNN: Sequential data processing → not parallelizable over time
  CNN: Input windows are independent from one another → highly parallelizable over time


RNN vs. CNN: Quiz

Given a task description, choose the appropriate architecture!

Task: predict the number of the main verb (sleep or sleeps)
  The cats, who were sitting on the map inside the house, [sleep/sleeps?]
  Which architecture should we use? RNN

Task: predict the polarity of the review:
  [... many useless sentences ...] best book ever [... many useless sentences ...]
  Which architecture should we use? CNN

Task: Machine Translation
  Which architecture should we use?
  Intuitively RNN (because MT is all about long-range dependencies), but ...
  Attention gives CNNs the ability to capture long-range dependencies, while maintaining parallel processing (Gehring et al.)
  More about attention: next week


Next week?

Attention

Keras advanced features

Adversarial training

Generative pre-training (a.k.a. ELMo & BERT)
