Hardware-efficient Machine Learning
Marton Havasi, Robert Peharz
Machine Learning Reading Group
Cambridge, 30th of November 2017
Motivation for Hardware-efficient Machine Learning

- current philosophy: stronger, higher, faster
- watching Formula-1 might be fun, but who is going to build the machine learning bicycle?
- applications often come with stringent constraints: embedded systems, autonomous navigation
- insight: what is the "practical complexity" of ML tasks?
Example: Deep Compression¹

- AlexNet: 240 MB → 6.9 MB
- VGG-16: 552 MB → 11.3 MB
- 3–5× more energy efficient

¹ S. Han, H. Mao, W. J. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016.
NIPS'16
Binarized Neural Networks

- feed-forward neural network
- use binary weights: {−1, +1}
- use threshold units for hidden units:

    sign(x) = +1 if x ≥ 0, −1 if x < 0

- thus, hidden units reduce to XNOR operations, integer accumulation, and thresholding
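Why the hidden units reduce to XNOR, integer accumulation, and thresholding: on {−1, +1} vectors, each elementwise product is an XNOR of the sign bits, and the dot product is recovered from a popcount. A small sketch (NumPy comparisons stand in for the bit-level operations):

```python
import numpy as np

def binarize(x):
    """sign() with sign(0) = +1, mapping reals to {-1, +1}."""
    return np.where(x >= 0, 1, -1).astype(np.int32)

def dot_pm1(a, b):
    """Reference dot product of two {-1, +1} vectors (integer accumulation)."""
    return int(a @ b)

def dot_xnor(a, b):
    """Same dot product via XNOR + popcount: encode -1 as bit 0 and +1 as bit 1;
    matching bits (XNOR = 1) contribute +1, mismatches contribute -1,
    so dot = 2 * popcount(XNOR) - n."""
    bits_a = (a > 0)
    bits_b = (b > 0)
    matches = int(np.sum(bits_a == bits_b))   # popcount of the XNOR word
    return 2 * matches - len(a)

rng = np.random.default_rng(0)
a = binarize(rng.standard_normal(16))
b = binarize(rng.standard_normal(16))
print(dot_pm1(a, b), dot_xnor(a, b))  # the two agree
```

Thresholding the accumulated integer with sign() then yields the next layer's binary activation.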
Training

- real-valued weights during training, restricted to [−1, 1]
- binarized (using sign(x)) in the forward pass
- "straight-through estimator": replace sign(x) with the identity f(x) = x in the backprop pass
- "shift-based batch normalization" (test time?)
- ADAM or "shift-based AdaMax" for optimization
- first layer: 8-bit fixed-point multiplication
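The straight-through estimator on a single threshold unit can be sketched as follows; the squared-error target and learning rate here are illustrative, not from the paper:

```python
import numpy as np

def binarize(w):
    """Deterministic binarization used in the forward pass."""
    return np.where(w >= 0, 1.0, -1.0)

def train_step(w_real, x, target, lr=0.01):
    """One SGD step with the straight-through estimator (STE):
    forward with sign(w), backward as if sign were the identity,
    then clip the real-valued weights back to [-1, 1]."""
    w_bin = binarize(w_real)
    out = w_bin @ x                      # forward pass uses binary weights
    grad_out = out - target              # gradient of 0.5 * (out - target)^2
    grad_w = grad_out * x                # STE: d sign(w)/dw treated as 1
    return np.clip(w_real - lr * grad_w, -1.0, 1.0)

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, 4)
x = np.array([1.0, -1.0, 1.0, 1.0])
for _ in range(100):
    w = train_step(w, x, target=2.0)
print(binarize(w) @ x)  # prints 2.0: the binarized net hits the target
```

The real-valued weights act as an accumulator for the small gradients; the binarized weights only flip once the accumulator crosses zero.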
Classification Results
Energy Considerations (45 nm technology)
Runtime using SWAR²

² SWAR: SIMD within a register; SIMD: single instruction, multiple data.
submitted to ICLR’18
Overview and Core Idea

cf. Soudry et al., NIPS'14; Hernandez-Lobato and Adams, ICML'15
Model and Approach

- inputs: x^0
- ternary weights {−1, 0, 1}
- L-layer neural net: a^l = W^l x^{l−1}
- sign as non-linearity: x^l = sign(a^l)
- softmax output: exp(a^L_i) / Σ_{i′} exp(a^L_{i′}) (not needed in test phase)
- assume prior p(W), where W = {W^l}_{l=1}^{L}
- interpret softmax as likelihood p(D | W)
- infer posterior p(W | D) ∝ p(D | W) p(W)
Variational Approach

minimize   KL(q(W) || p(W | D)) =
           KL(q(W) || p(W)) − E_{q(W)}[log p(D | W)] + log p(D)
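The decomposition above is Bayes' rule rearranged; written out with the same symbols:

```latex
\begin{align*}
\mathrm{KL}(q(W)\,\|\,p(W \mid D))
  &= \mathbb{E}_{q(W)}\!\left[\log \frac{q(W)}{p(W \mid D)}\right] \\
  &= \mathbb{E}_{q(W)}\!\left[\log \frac{q(W)\, p(D)}{p(D \mid W)\, p(W)}\right] \\
  &= \mathrm{KL}(q(W)\,\|\,p(W)) - \mathbb{E}_{q(W)}[\log p(D \mid W)] + \log p(D).
\end{align*}
```

Since log p(D) is constant in q, minimizing the left-hand side is the same as maximizing the ELBO.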
Further Details

- 3-bit input weights
- activation normalization (divide by √d_{l−1}) during the probabilistic forward pass (PFP)
- re-weighting of the variational objective:

    λ KL(q(W) || p(W)) − (1 − λ) E_{q(W)}[log p(D | W)]

  → corresponds to λ/(1 − λ) copies of the training data
- dropout
- 50 iterations of Spearmint (Bayesian optimization) for hyper-parameters
Results on MNIST Variants
Ensembles of Discrete NNs
NIPS’17
Overview

- Relation between variational inference and compression.
- Use a Bayesian approach to prune weights and lower precision in neural networks.
- Experimental results.
Optimal encoding

- What is the mean number of bits required to encode samples from a distribution?
- Discretize using buckets of length t and use Huffman coding:

    lim_{t→0} Σ_i P(it ≤ z < (i+1)t) [−log₂ P(it ≤ z < (i+1)t)]

- Shannon's source coding theorem:

    H(p) = ∫ p(z) [−log₂ p(z)] dz
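A quick numerical check of the bucketed code length, assuming a standard Gaussian: for small t, the discretized entropy splits into the differential entropy ∫ p(z)[−log₂ p(z)] dz plus log₂(1/t) bits paid for the bucket resolution.

```python
import numpy as np

def discretized_entropy(pdf, t, lo=-10.0, hi=10.0):
    """Entropy in bits of a density discretized into buckets of length t,
    with bucket probabilities P(it <= z < (i+1)t) ~ pdf(midpoint) * t."""
    mids = np.arange(lo, hi, t) + t / 2
    probs = pdf(mids) * t
    probs = probs[probs > 1e-300]
    probs = probs / probs.sum()          # renormalize the truncated tails away
    return float(-(probs * np.log2(probs)).sum())

gauss = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
h = 0.5 * np.log2(2 * np.pi * np.e)      # differential entropy of N(0, 1)

for t in (0.5, 0.1, 0.01):
    # code length ≈ h + log2(1/t): each halving of t costs one extra bit
    print(t, discretized_entropy(gauss, t), h + np.log2(1 / t))
```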
Bayesian Compression

- What if we only care about the distribution?
- Naive approach: E_q[−log p(z)]
- The information cost is the additional information over the prior:
  E_q[−log p(z)] − H(q) = KL(q || p)
![Page 25: Hardware-efficient Machine Learningcbl.eng.cam.ac.uk/...Havasi_Hardware_Efficient_ML.pdf · Marton Havasi, Robert Peharz Machine Learning Reading Group Cambridge, 30th of November](https://reader035.fdocuments.in/reader035/viewer/2022071410/61051cc1bb5a841e370c4d08/html5/thumbnails/25.jpg)
Variational Inference

- Approximate the posterior distribution p(z | x) by maximizing the ELBO:

    ELBO(φ) = E_{qφ(z)}[log p(x | z)] − KL(qφ(z) || p(z))
Variational Inference as Occam's Razor

- Occam's Razor: what is the simplest explanation of the data?
- What is the information cost of describing the data?
Variational Inference as Occam's Razor

- Complexity cost: L_c = KL(q || p)
- Error cost: L_e = E_q[−log p(x | z)]
- Overall information cost: L_e + L_c = −ELBO(φ)
Hardware-efficient neural networks

- Weight pruning
  - Pruning individual weights
  - Pruning nodes
- Quantization
  - Binary, ternary weights
  - k-means quantization
  - Precision quantization
Advantages of the Bayesian approach

- Sparsity-inducing priors for weight pruning.
- Noisy weights allow for reduced-precision binary encoding.
Training

- Data: D
- Parameters: w
- Variational distribution: qφ(w)

    ELBO(φ) = E_{qφ(w)}[log p(D | w)] − KL(qφ(w) || p(w))

- Reparameterize w = f(φ, ε):

    ELBO(φ) = E_{p(ε)}[log p(D | f(φ, ε))] − KL(qφ(w) || p(w))
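A minimal sketch of the reparameterization trick for a single Gaussian weight; the one-observation likelihood, q_φ(w) = N(µ, σ²), and prior p(w) = N(0, 1) are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_w(mu, log_sigma, eps):
    """Reparameterization w = f(phi, eps) for Gaussian q_phi(w):
    w = mu + sigma * eps with eps ~ N(0, 1), so gradients w.r.t.
    phi = (mu, log_sigma) can flow through the sample."""
    return mu + np.exp(log_sigma) * eps

def kl_gauss(mu, log_sigma):
    """Closed-form KL(q_phi(w) || p(w)) for q = N(mu, sigma^2), p = N(0, 1), in nats."""
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * (sigma2 + mu**2 - 1.0) - log_sigma

def elbo_estimate(mu, log_sigma, log_lik, n_samples=1000):
    """Monte Carlo ELBO: E_p(eps)[log p(D | f(phi, eps))] - KL term."""
    eps = rng.standard_normal(n_samples)
    w = sample_w(mu, log_sigma, eps)
    return log_lik(w).mean() - kl_gauss(mu, log_sigma)

# hypothetical likelihood: one observation x = 1 with w as its mean
log_lik = lambda w: -0.5 * (1.0 - w)**2 - 0.5 * np.log(2 * np.pi)
print(elbo_estimate(0.5, np.log(0.3), log_lik))
```

The KL term is available in closed form here, so only the likelihood expectation needs sampling.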
Choice of prior

z ∼ p(z)
w ∼ N(w; 0, z²)
Improper log-uniform prior

p(z) ∝ |z|⁻¹

p(w) ∝ ∫ (1/|z|) N(w; 0, z²) dz = 1/|w|

p(W, z) ∝ ∏_{i}^{A} (1/|z_i|) · ∏_{i,j}^{A,B} N(w_ij; 0, z_i²)

qφ(W, z) ∝ ∏_{i}^{A} N(z_i; µ_{z_i}, µ²_{z_i} α_i) · ∏_{i,j}^{A,B} N(w_ij; µ_ij z_i, z_i² σ²_ij)
Improper log-uniform prior

- Test time! Fix the weights at their means.
- Pruning: prune group i if log α_i > t
- Precision: set by the posterior variance of each weight,

    Var(w_ij) = Var(z_i · (w_ij / z_i))
              = Var(z_i) (E(w_ij / z_i)² + Var(w_ij / z_i)) + Var(w_ij / z_i) E(z_i)²
              = σ²_{z_i} (σ²_ij + µ²_ij) + σ²_ij µ²_{z_i}
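A Monte Carlo sanity check of the variance identity above; the posterior parameters (µ_z, σ_z, µ_w, σ_w) are made-up illustrative values, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical posterior parameters for one weight group
mu_z, sigma_z = 1.5, 0.4      # q(z_i)        = N(mu_z, sigma_z^2)
mu_w, sigma_w = 0.7, 0.2      # q(w_ij / z_i) = N(mu_w, sigma_w^2)

n = 1_000_000
z = rng.normal(mu_z, sigma_z, n)
w_tilde = rng.normal(mu_w, sigma_w, n)
w = z * w_tilde               # w_ij = z_i * (w_ij / z_i)

# closed form from the slide: Var(w) = sigma_z^2 (sigma_w^2 + mu_w^2) + sigma_w^2 mu_z^2
closed = sigma_z**2 * (sigma_w**2 + mu_w**2) + sigma_w**2 * mu_z**2
print(w.var(), closed)        # the two agree up to Monte Carlo error
```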
Half-Cauchy scale prior

- Half-Cauchy density: C⁺(z; 0, s) = 2 (s π (1 + z²/s²))⁻¹
- s ∼ C⁺(0, τ)
- z̃_i ∼ C⁺(0, 1)
- w̃_ij ∼ N(0, 1)
- w_ij = w̃_ij z̃_i s
- In the limit, this construction recovers the improper log-uniform prior.
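The generative construction above can be sampled directly; the shapes A, B and scale τ here are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def half_cauchy(scale, size):
    """Sample C+(0, scale) as |scale * standard Cauchy|."""
    return np.abs(scale * rng.standard_cauchy(size))

tau, A, B = 1.0, 4, 8
s = half_cauchy(tau, 1)              # global scale s ~ C+(0, tau)
z = half_cauchy(1.0, (A, 1))         # per-group scales z~_i ~ C+(0, 1)
w_tilde = rng.standard_normal((A, B))
W = w_tilde * z * s                  # w_ij = w~_ij * z~_i * s
print(W.shape)                       # prints (4, 8)
```

Rows with small z̃_i are shrunk toward zero as a whole, which is what makes group-wise pruning natural under this prior.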
Experiments
Summary

- Used a Bayesian approach to determine the dropout and precision of the weights.
- Sparsity-inducing priors allow for weight pruning.
- Noisy weights allow for reduced precision.
- The performance is on par with existing, less principled approaches.
Thank you!