
Monte Carlo Methods and Neural Networks

Noah Gamboa and Alexander Keller

Neural Networks

Fully connected layers

■ neurons compute max{0, Σ_j w_j a_{i,j}}

[Figure: fully connected network, layer by layer, with activations a_{0,0}, a_{0,1}, a_{0,2}, …, a_{0,n_0−1} in the input layer through a_{L,0}, a_{L,1}, a_{L,2}, …, a_{L,n_L−1} in the output layer]

NVIDIA Confidential 2
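The ReLU computation above can be sketched in a few lines of NumPy; the layer widths and weights here are hypothetical stand-ins, not values from the slides:

```python
import numpy as np

def fc_relu_layer(a_prev, W):
    """One fully connected layer: neuron i computes max{0, sum_j W[i, j] * a_prev[j]}."""
    return np.maximum(0.0, W @ a_prev)

# Toy network with layer widths 4 -> 3 -> 2 and random illustrative weights.
rng = np.random.default_rng(0)
a0 = rng.standard_normal(4)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))

out = fc_relu_layer(fc_relu_layer(a0, W1), W2)
assert (out >= 0).all()  # ReLU outputs are non-negative by construction
```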

Monte Carlo Methods all over Neural Networks

Examples

■ drop out
■ drop connect
■ stochastic binarization
■ stochastic gradient descent
■ fixed pseudo-random matrices for direct feedback alignment
■ ...

NVIDIA Confidential 3

Monte Carlo Methods all over Neural Networks

Observations

■ the brain
– about 10^11 nerve cells with up to 10^4 connections to others
– much more energy efficient than a GPU
■ artificial neural networks
– rigid layer structure
– expensive to scale in depth
– partially trained fully connected
■ goal: explore algorithms linear in time and space

NVIDIA Confidential 4


Partition instead of Dropout

Partition instead of Dropout

Guaranteeing coverage of neural units

■ so far: drop out a neuron if threshold t > ξ
– ξ by a linear feedback shift register generator (for example)
■ now: assign each neuron to partition p = ⌊ξ · P⌋ out of P
– fewer random number generator calls
– all neurons guaranteed to be considered

LeNet on MNIST     Average of t = 1/2 to 1/9 dropout   Average of P = 2 to 9 partitions
Mean accuracy      0.6062                              0.6057
StdDev accuracy    0.0106                              0.009

NVIDIA Confidential 6
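A minimal NumPy sketch of the two schemes, under one plausible reading of the slide: classic dropout draws a fresh uniform per neuron per step, while the partition variant draws a single uniform per neuron to assign it to one of P partitions, and cycling through the partitions drops every neuron exactly once:

```python
import numpy as np

rng = np.random.default_rng(42)
n_neurons, P = 8, 4

# Classic dropout: one uniform per neuron per training step; drop when t > xi.
t = 1.0 / P
xi = rng.random(n_neurons)
dropout_mask = xi >= t  # True = neuron kept this step; coverage is not guaranteed

# Partition instead: a single uniform per neuron assigns partition p = floor(xi * P).
xi = rng.random(n_neurons)
partition = np.floor(xi * P).astype(int)

# Cycling through the P partitions, each neuron is dropped exactly once,
# so all neurons are guaranteed to be considered.
dropped_count = np.zeros(n_neurons, dtype=int)
for p in range(P):
    step_mask = partition != p  # drop the neurons assigned to partition p
    dropped_count += ~step_mask
assert (dropped_count == 1).all()
```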

Partition instead of Dropout

Training accuracy with LeNet on MNIST

[Plot: training accuracy (0.2 to 0.6) over 150 epochs of training, 2 dropout partitions vs 1/2 dropout]

[Plot: training accuracy (0.2 to 0.6) over 150 epochs of training, 3 dropout partitions vs 1/3 dropout]

NVIDIA Confidential 7


Simulating Discrete Densities

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ discrete density approximation of the weights
– remember to flip the sign accordingly
– transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights

[Figure: discrete density of the absolute weights, normalized on the unit interval]

NVIDIA Confidential 9

Simulating Discrete Densities

Stochastic evaluation of scalar product

■ partition of the unit interval by the sums P_k := Σ_{j=1}^{k} |w_j| of the normalized absolute weights

0 = P_0 < P_1 < ··· < P_m = 1

■ using a uniform random variable ξ ∈ [0,1) we find
– select neuron i ⇔ P_{i−1} ≤ ξ < P_i, satisfying Prob({P_{i−1} ≤ ξ < P_i}) = |w_i|
– transform jittered equidistant samples using the cumulative distribution function of the absolute values of the weights
■ in fact a derivation of the quantization of weights to {−1, 0, +1}
– integer weights if a neuron is referenced more than once
– explains why ternary and binary weights did not work in some articles
– related to drop connect and drop out, too

NVIDIA Confidential 10
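The selection rule above can be sketched in NumPy: jittered equidistant samples are mapped through the cumulative distribution of the normalized absolute weights, and each selected index contributes sign(w_j) · a_j. The function name and test vectors are illustrative, not from the slides:

```python
import numpy as np

def sampled_dot(w, a, n_samples, rng):
    """Estimate sum_j w[j] * a[j] by selecting index j with probability |w_j| / sum|w|.

    Jittered equidistant samples xi_k = (k + u_k) / n are transformed through the
    CDF of the normalized absolute weights; signs are flipped according to sign(w_j).
    """
    s = np.abs(w).sum()
    cdf = np.cumsum(np.abs(w) / s)                    # P_1 < P_2 < ... < P_m = 1
    xi = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
    j = np.searchsorted(cdf, xi, side="right")        # P_{j-1} <= xi < P_j
    j = np.minimum(j, len(w) - 1)                     # guard against float round-off
    # E[sign(w_j) * a_j] = (1/s) * sum_j w_j a_j, so rescale by s.
    return s * np.mean(np.sign(w[j]) * a[j])

rng = np.random.default_rng(1)
w = rng.standard_normal(1000)
a = rng.standard_normal(1000)
approx = sampled_dot(w, a, 200, rng)                  # noisy estimate of w @ a

# With equal-magnitude weights and one jittered sample per stratum, each
# index is selected exactly once and the estimate is exact.
w2 = np.where(np.arange(8) % 2 == 0, 1.0, -1.0)
a2 = np.arange(8, dtype=float)
est2 = sampled_dot(w2, a2, 8, rng)
assert abs(est2 - float(w2 @ a2)) < 1e-9
```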

Simulating Discrete Densities

Test accuracy for two layer ReLU feedforward network on MNIST

■ able to achieve 97% of the accuracy of the full model by sampling the most important 12% of the weights!

[Plot: test accuracy (0.94 to 1.0) over the number of samples per neuron (0 to 600)]

NVIDIA Confidential 11

Simulating Discrete Densities

Application to convolutional layers

■ sample from the distribution of a filter (for example, 128×5×5 = 3200 weights)
– less redundant than fully connected layers
■ LeNet architecture on CIFAR-10, best accuracy is 0.6912
■ able to get 88% of the accuracy of the full model at 50% sampled

NVIDIA Confidential 12

Simulating Discrete Densities

Test accuracy for LeNet on CIFAR-10

[Plot: test accuracy (0.4 to 0.6) over the number of samples per filter (0 to 3000)]

NVIDIA Confidential 13


Neural Networks linear in Time and Space

Neural Networks linear in Time and Space

Number n of neural units

■ for L fully connected layers

n = Σ_{l=1}^{L} n_l

where n_l is the number of neurons in layer l
■ number of weights

n_w = Σ_{l=1}^{L} n_{l−1} · n_l

■ choose the number of weights per neuron such that n is proportional to n_w
– for example, a constant number of weights per neuron

NVIDIA Confidential 15
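The two counts can be checked numerically, here with the 784/300/300/10 architecture used later in the talk; the constant c of weights per neuron is a hypothetical choice:

```python
# Layer widths n_0, ..., n_L (n_0 = 784 is the input layer, L = 3).
widths = [784, 300, 300, 10]

# n = sum_{l=1}^{L} n_l neural units.
n = sum(widths[1:])

# Fully connected: n_w = sum_{l=1}^{L} n_{l-1} * n_l, quadratic in the layer widths.
n_w_full = sum(widths[l - 1] * widths[l] for l in range(1, len(widths)))
assert n == 610
assert n_w_full == 784 * 300 + 300 * 300 + 300 * 10  # 328200 weights

# Sparse from scratch: a constant number c of weights per neuron makes the
# total number of weights proportional to n, i.e. linear in time and space.
c = 32
n_w_sparse = c * n
assert n_w_sparse == 19520
```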

Neural Networks linear in Time and Space

Results

[Plot: test accuracy (0 to 1) over the fraction of FC layers sampled (0 to 1), LeNet on MNIST vs LeNet on CIFAR-10]

NVIDIA Confidential 16

Neural Networks linear in Time and Space

Test accuracy for AlexNet on CIFAR-10

[Plot: test accuracy (0 to 1) over the fraction of FC layers sampled (0 to 1)]

NVIDIA Confidential 17

Neural Networks linear in Time and Space

Test accuracy for AlexNet on ILSVRC12

[Plot: test accuracy (0 to 1) over the fraction of FC layers sampled (0 to 1), Top-5 accuracy vs Top-1 accuracy]

NVIDIA Confidential 18

Neural Networks linear in Time and Space

Sampling paths through networks

■ complexity bounded by the number of paths times the depth
■ strong indication of a relation to Markov chains
■ importance sampling by weights

NVIDIA Confidential 19
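A minimal sketch of such a path sampler, under the stated reading: at each layer the next neuron is chosen with probability proportional to the absolute weights leaving the current one, forming a Markov chain over neurons. The widths and weights are hypothetical stand-ins:

```python
import numpy as np

def sample_path(weights, j0, rng):
    """Sample one path through the network, starting at input neuron j0.

    At each layer the next neuron is chosen with probability proportional to
    the absolute outgoing weights -- importance sampling by weights, and a
    Markov chain whose transition probabilities are the normalized |w|."""
    path = [j0]
    j = j0
    for W in weights:                   # W[i, j]: weight from neuron j to neuron i
        col = np.abs(W[:, j])
        p = col / col.sum()
        j = int(rng.choice(len(p), p=p))
        path.append(j)
    return path

rng = np.random.default_rng(7)
# Toy widths 4 -> 3 -> 2 with random illustrative weights.
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
path = sample_path(weights, j0=0, rng=rng)

# One path touches one neuron per layer, so the total cost of evaluating
# the network is bounded by (number of paths) x depth.
assert len(path) == len(weights) + 1
```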

Neural Networks linear in Time and Space

Sampling paths through networks

■ sparse from scratch

[Figure: paths sampled through the fully connected network, from the input activations a_{0,0}, …, a_{0,n_0−1} through the hidden layers to the outputs a_{L,0}, …, a_{L,n_L−1}]

■ guaranteed connectivity

NVIDIA Confidential 20

Neural Networks linear in Time and Space

Test accuracy for 4 layer feedforward network (784/300/300/10)

[Plot: test accuracy (0 to 1) over the number of per-pixel paths through the network (0 to 300), MNIST vs Fashion MNIST]

NVIDIA Confidential 21


Monte Carlo Methods and Neural Networks

Monte Carlo Methods and Neural Networks

Summary

■ dropout partitions reduce variance
– using far fewer random numbers
■ simulating discrete densities explains {−1, 0, +1} and integer weights
– compression and quantization without retraining
■ neural networks with linear complexity for both inference and training
– sparse from scratch
– sampling paths through neural networks instead of drop connect and drop out

NVIDIA Confidential 23