Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale...
Transcript of Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale...
![Page 1: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/1.jpg)
Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter
![Page 2: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/2.jpg)
2
![Page 3: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/3.jpg)
3
1Case study: CNN-based Image Classification (inference)
![Page 4: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/4.jpg)
4
+ Excellent maintainability in datacenter
+ Maximum flexibility for all workloads
- Performance of CNNs/DNNs vastly slower than specialized HW
![Page 5: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/5.jpg)
5
+ CNNs/DNNs that utilize GPUs or ASICs benefit significantly
- CNNs/DNNs cannot scale beyond limited pools
- Heterogeneity challenging for maintainability
![Page 6: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/6.jpg)
6
+ Homogeneous
- Increased cost and power per server (particularly GPUs)
- Not economical for all applications in the datacenter (GPUs and ASICs)
![Page 7: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/7.jpg)
7
+ Homogeneous
+ Low overhead in power and cost per server
+ Flexibility benefits many workloads?
- Lower peak performance than GPUs or ASICs on some workloads Overtake through scale?
???
![Page 8: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/8.jpg)
http://www.wired.com/2014/06/microsoft-fpga/
8
![Page 9: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/9.jpg)
http://www.wired.com/2014/06/microsoft-fpga/
9
![Page 10: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/10.jpg)
FPGA FPGA FPGA FPGA
Deep
Learning Compression
Service
Physics
Engine
Web Search Pipeline
10
![Page 11: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/11.jpg)
• Altera Stratix V D5
• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• PCIe Gen 3 x8
• 8GB DDR3-1333
• Powered by PCIe slot
• Torus Network
Stratix V
8GB DDR3
PCIe Gen3 x8
11
![Page 12: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/12.jpg)
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs @ 2 TB, 2 SSDs @ 512 GB
• 10 Gb Ethernet
• No cable attachments to server 68 ⁰C
12
![Page 13: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/13.jpg)
13
Host
NIC
ASIC
FPGA
CPU
ToR
![Page 14: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/14.jpg)
14
![Page 15: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/15.jpg)
Krizhevsky
3-D Convolution and Max Pooling Dense Layers
“Dog”
INPUT OUTPUT
15
11
11
224
224
3
Stride
of 4
5
5
55
55
96
Max
pooling
27
27
256 Max
pooling
3
3
13
13
384
3
3
3
3
13
13
384
13
13
256
Max
pooling 4096 4096
1000
dense
Input
Imag
e
(RGB)
dense dense
![Page 16: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/16.jpg)
Input Feature Map
N k
k
Convolution Output Max Pooled Output
(Optional)
H = # feature maps
S = kernel stride
* N, k, H, and p may vary across layers
N = input height and width
k = kernel height and width
D = input depth
N
D
Convolution between k x k x D kernel
and region of Input Feature Map
Max value over p x p region
p
p
H
16
![Page 17: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/17.jpg)
Input Kernel
Weights
Output
17
![Page 18: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/18.jpg)
Input Kernel
Weights
Output
18
![Page 19: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/19.jpg)
Input Kernel
Weights
Output
19
![Page 20: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/20.jpg)
Input Kernel
Weights
Output
20
![Page 21: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/21.jpg)
Input Kernel
Weights
Output
21
![Page 22: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/22.jpg)
Input Kernel
Weights
Output
22
![Page 23: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/23.jpg)
Input Kernel
Weights
Output
23
![Page 24: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/24.jpg)
Input Kernel
Weights
Output
24
![Page 25: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/25.jpg)
Input Kernel
Weights
Output
25
![Page 26: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/26.jpg)
Input Kernel
Weights
Output
26
![Page 27: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/27.jpg)
Input Kernel
Weights
Output
27
![Page 28: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/28.jpg)
Input Kernel
Weights
Output
28
![Page 29: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/29.jpg)
29
PE PE PE PE
PE PE PE PE
PE PE PE PE
PE PE PE PE
Input Volume
Segment 0
Output feature
mapInput kernel weight
0
Input Volume
Segment 1
Input Volume
Segment N-2
Input Volume
Segment N-1
Output feature
mapInput kernel weight
1
Output feature
mapInput kernel weight
M-2
Output feature
mapInput kernel weight
M-1
Layer Controller
Input Volume
x
z
y
Broad-cast
Scan chain
Address generation
TopController
Output Volume
x
Layer Config.
Data re-distribution
y
z
![Page 30: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/30.jpg)
30
x
AC
C
Wijk
Dijk
FF(ACC)
MU
X0
MU
X FF(SR)
From south
Termination To north
![Page 31: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/31.jpg)
31
![Page 32: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/32.jpg)
Platform Library/OS
ImageNet 1K
Inference
Throughput
Peak TFLOPs Effective
TFLOPs
Estimated
Peak Power with
Server
Estimated
GOPs/J
(assuming
peak power)
16-core, 2-socket Xeon
E5-2450, 2.1GHz
Caffe + Intel MKL
Ubuntu 14.04.1* 53 images/s 0.27T 0.074T (27%) ~225W ~0.3
Arria 10 GX1150 Windows Server 2012 369 images/s1 1.366T 0.51T (38%) ~265W ~1.9
32 2https://github.com/soumith/convnet-benchmarks
1Dense layer time estimated
![Page 33: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/33.jpg)
Platform Library/OS
ImageNet 1K
Inference
Throughput
Peak TFLOPs Effective
TFLOPs
Estimated
Peak Power with
Server
Estimated
GOPs/J
(assuming
peak power)
16-core, 2-socket Xeon
E5-2450, 2.1GHz
Caffe + Intel MKL
Ubuntu 14.04.1* 53 images/s 0.27T 0.074T (27%) ~225W ~0.3
Arria 10 GX1150 Windows Server 2012 369 images/s1 1.366T 0.51T (38%) ~265W ~1.9
NervanaSys-32 on
NVIDIA Titan X
NervanaSys-32 on
Ubuntu 14.0.4 4129 images/s2 6.1T 5.75T (94%) ~475W ~12.1
33 2https://github.com/soumith/convnet-benchmarks
Includes server power; however, CPUs available to other jobs in the datacenter 1Dense layer time estimated
![Page 34: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/34.jpg)
Platform Library/OS
ImageNet 1K
Inference
Throughput
Peak TFLOPs Effective
TFLOPs
Estimated
Peak Power for
CNN
Computation
Estimated
GOPs/J
(assuming
peak power)
16-core, 2-socket Xeon
E5-2450, 2.1GHz
Caffe + Intel MKL
Ubuntu 14.04.1* 53 images/s 0.27T 0.074T (27%) ~225W ~0.3
Arria 10 GX1150 Windows Server 2012 369 images/s1 1.366T 0.51T (38%) ~40W ~12.8
NervanaSys-32 on
NVIDIA Titan X
NervanaSys-32 on
Ubuntu 14.0.4 4129 images/s2 6.1T 5.75T (94%) ~250W ~23.0
34 2https://github.com/soumith/convnet-benchmarks
1Dense layer time estimated
Under-utilized FPGA vs. highly tuned GPU-friendly workload
![Page 35: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/35.jpg)
Platform Library/OS
ImageNet 1K
Inference
Throughput
Peak TFLOPs Effective
TFLOPs
Estimated
Peak Power for
CNN
Computation
Estimated
GOPs/J
(assuming
peak power)
16-core, 2-socket Xeon
E5-2450, 2.1GHz
Caffe + Intel MKL
Ubuntu 14.04.1* 53 images/s 0.27T 0.074T (27%) ~225W ~0.3
Arria 10 GX1150 Windows Server 2012 369 images/s1
~880 images/s 1.366T
0.51T (38%)
~1.2T (89%) ~40W
20.6
~30.6
NervanaSys-32 on
NVIDIA Titan X
NervanaSys-32 on
Ubuntu 14.0.4 4129 images/s2 6.1T 5.75T (94%) ~250W ~23.0
35 2https://github.com/soumith/convnet-benchmarks
1Dense layer time estimated
Projected Results Assuming Floorplanning and Scaling Up PEs
![Page 36: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/36.jpg)
36
Yes.
![Page 37: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/37.jpg)
37
![Page 38: Toward Accelerating Deep Learning at Scale Using ... · Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter . 2 . 3 1Case study: CNN-based Image](https://reader030.fdocuments.in/reader030/viewer/2022040608/5ec66bc20ec61c79724ff600/html5/thumbnails/38.jpg)
38