synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites...
Transcript of synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites...
![Page 1: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/1.jpg)
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects
for DNN Training
ERIC QIN*, ANANDA SAMAJDAR*, HYOUKJUN KWON*, VINEET NADELLA*, SUDARSHAN SRINIVASAN#, DIPANKAR DAS#,
BHARAT KAUL#, TUSHAR KRISHNA*
http://synergy.ece.gatech.edu
* GEORGIA TECH# INTEL
1
![Page 2: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/2.jpg)
Outline• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
2
![Page 3: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/3.jpg)
Outline
3
• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
![Page 4: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/4.jpg)
Deep Learning Applications
Speech Recognition Recommender SystemsLanguage Understanding
4
![Page 5: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/5.jpg)
Deep Learning Applications
Speech Recognition Recommender SystemsLanguage Understanding
5
What is the key computation for these Deep Learning applications?
![Page 6: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/6.jpg)
Runtime breakdown on V100 GPU
6
77 % 64 %
Transformer (Language Understanding)
GNMT (Machine Translation)
![Page 7: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/7.jpg)
Runtime breakdown on V100 GPU
7
77 % 64 %
Transformer (Language Understanding)
GNMT (Machine Translation)
Matrix multiplications (GEMMs) consume around 70% of the total runtime when training modern deep learning workloads.
![Page 8: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/8.jpg)
GEMMs in Deep Learning
M
K
K
N N
MForward Pass(Inference and Training)
Input Feature Map
Output Feature MapWeights
GEMM MNK Dimension
Representation
M dim: batch size
N dim: number of channels in the next layer
K dim: [H * W * C]
8Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.
![Page 9: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/9.jpg)
GEMMs in Deep Learning
Backward Pass(Training)
M
K
K
N N
MForward Pass(Inference and Training)
Input Feature Map
Output Feature MapWeights
N
M
K
NWeightsT Output Feature Error
M
K
Input Feature Error
GEMM MNK Dimension
Representation
M dim: batch size
N dim: number of channels in the next layer
K dim: [H * W * C]
9Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.
![Page 10: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/10.jpg)
GEMMs in Deep Learning
Backward Pass(Training)
M
K
K
N N
MForward Pass(Inference and Training)
Input Feature Map
Output Feature MapWeights
N
M
K
NWeightsT
Gradient Computation(Training)
Output Feature Error
N
M Output Feature Error
M
K
InputFeature
MapT
M
K
Input Feature Error
K
N
𝚫Weights
GEMM MNK Dimension
Representation
M dim: batch size
N dim: number of channels in the next layer
K dim: [H * W * C]
Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al. 10
![Page 11: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/11.jpg)
GEMMs in Deep Learning
Backward Pass(Training)
M
K
K
N N
MForward Pass(Inference and Training)
Input Feature Map
Output Feature MapWeights
N
M
K
NWeightsT
Gradient Computation(Training)
Output Feature Error
N
M Output Feature Error
M
K
InputFeature
MapT
M
K
Input Feature Error
K
N
𝚫Weights
GEMM MNK Dimension
Representation
M dim: batch size
N dim: number of channels in the next layer
K dim: [H * W * C]
Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al. 11
GEMM is a key compute primitive to accelerate in hardware to speed up training.
![Page 12: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/12.jpg)
Hardware for Accelerating GEMMs
12
SIMT Architectures
Nvidia GTX GPUs
![Page 13: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/13.jpg)
13
SIMD ArchitecturesSIMT Architectures
Microsoft Brainwave ARM Trillium
Tesla FSDC
Nvidia GTX GPUs
Hardware for Accelerating GEMMs
![Page 14: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/14.jpg)
14
SIMD ArchitecturesSIMT Architectures Systolic Architectures
Microsoft Brainwave ARM Trillium
Tesla FSDC
Nvidia GTX GPUs
Google TPU
Xilinx xDNN Nvidia Tensor Cores
Hardware for Accelerating GEMMs
![Page 15: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/15.jpg)
15
SIMD ArchitecturesSIMT Architectures Systolic Architectures
Recently, systolic array based architectures are popular for accelerating GEMMs.
Microsoft Brainwave ARM Trillium
Tesla FSDC
Nvidia GTX GPUs
Google TPU
Xilinx xDNN Nvidia Tensor Cores
Hardware for Accelerating GEMMs
![Page 16: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/16.jpg)
Target comparison: Google TPU
TPU figure from https://cloud.google.com/tpu/docs/system-architecture
Our target comparison is with the Google TPU, which uses 128 x 128 systolic arrays.
16
![Page 17: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/17.jpg)
Outline• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
17
![Page 18: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/18.jpg)
TPU (Systolic Array)
Systolic Array Architectures
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127
18
![Page 19: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/19.jpg)
TPU (Systolic Array) MAC PE
Systolic Array Architectures
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127
19
+✕
Weight Buffer
Top input
Streaming input
Streaming output
Load (0) / Accum. (1)
Load (0) / Accum. (1)
0
1
Bottom output
![Page 20: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/20.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127
20
Workload Application Example DimensionsM N K
GNMT Machine Translation
128 2048 4096320 3072 40961632 36548 10242048 4096 32
DeepBench General Workload
1024 16 50000035 8457 2560
Transformer Language Understanding
31999 1024 8484 1024 4096
NCF Collaborative Filtering
2048 1 128256 256 2048
GEMMs used for evaluation.
![Page 21: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/21.jpg)
Workload Application Example DimensionsM N K
GNMT Machine Translation
128 2048 4096320 3072 40961632 36548 10242048 4096 32
DeepBench General Workload
1024 16 50000035 8457 2560
Transformer Language Understanding
31999 1024 8484 1024 4096
NCF Collaborative Filtering
2048 1 128256 256 2048
TPU (Systolic Array)
Mapping GEMMs onto TPUs
21
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127
Let’s map this GEMM!
GEMMs used for evaluation.
![Page 22: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/22.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
StationarymatrixK
= 2
048
N = 256
22
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
![Page 23: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/23.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
StationarymatrixK
= 2
048
N = 256
23
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
![Page 24: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/24.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
StationarymatrixK
= 2
048
N = 256
24
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
![Page 25: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/25.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
StationarymatrixK
= 2
048
N = 256
25
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
![Page 26: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/26.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
26
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
![Page 27: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/27.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
27
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
The streaming elements get multiplied with the stationary elements. The partial sums
get accumulated down each column.
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
![Page 28: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/28.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
28
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
The streaming elements get multiplied with the stationary elements. The partial sums
get accumulated down each column.
![Page 29: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/29.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
29
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
The streaming elements get multiplied with the stationary elements. The partial sums
get accumulated down each column.
![Page 30: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/30.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
30
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
The streaming elements get multiplied with the stationary elements. The partial sums
get accumulated down each column.
![Page 31: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/31.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
31
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
Reduce partial sum down each column.
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
The streaming elements get multiplied with the stationary elements. The partial sums
get accumulated down each column.
![Page 32: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/32.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs
32
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
Reduce partial sum down each column.
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
The streaming elements get multiplied with the stationary elements. The partial sums
get accumulated down a column.
Systolic Arrays are popular because they enable efficient data reuse and are very simple to implement.
![Page 33: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/33.jpg)
Workload Application Example DimensionsM N K
GNMT Machine Translation
128 2048 4096320 3072 40961632 36548 10242048 4096 32
DeepBench General Workload
1024 16 50000035 8457 2560
Transformer Language Understanding
31999 1024 8484 1024 4096
NCF Collaborative Filtering
2048 1 128256 256 2048
Mapping GEMMs onto TPUs - Irregularity
33
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
Let’s map another GEMM!** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
GEMMs used for evaluation.
![Page 34: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/34.jpg)
34
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
K = 32
N = 4096
K =
32
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 35: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/35.jpg)
35
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
K = 32
N = 4096
K =
32
128
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 36: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/36.jpg)
36
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
75% of the PEs are not utilized for this GEMM.
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
K = 32
N = 4096
K =
32
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 37: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/37.jpg)
37
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
K = 32
N = 4096
K =
32
128
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
75% of the PEs are not utilized for this GEMM.
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 38: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/38.jpg)
38
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
75% of the PEs are not utilized for this GEMM.
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 39: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/39.jpg)
39
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
75% of the PEs are not utilized for this GEMM.
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 40: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/40.jpg)
40
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
75% of the PEs are not utilized for this GEMM.
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 41: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/41.jpg)
Mapping GEMMs onto TPUs
41
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
75% of the PEs are not utilized for this GEMM.
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
![Page 42: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/42.jpg)
Mapping GEMMs onto TPUs
42
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
75% of the PEs are not utilized for this GEMM.
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Mapping GEMMs onto TPUs - Irregularity
The rigid structure of Systolic Arrays cause PE underutilization. How can we enable the remaining PEs?
![Page 43: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/43.jpg)
43
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Can we map another section of the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
![Page 44: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/44.jpg)
44
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Can we map another section of the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
128
![Page 45: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/45.jpg)
45
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
Can we map another section of the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
128
![Page 46: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/46.jpg)
46
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
duplicate streaming data
Can we map another section of the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
128
![Page 47: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/47.jpg)
47
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
duplicate streaming data
Can we map another section of the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
128
![Page 48: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/48.jpg)
48
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Can we map another section of the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
128
![Page 49: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/49.jpg)
49
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Can we map another section of the stationary matrix onto the TPU to increase utilization?
128
Mapping GEMMs onto TPUs - Irregularity
![Page 50: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/50.jpg)
50
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
This is incorrect functionally, because the systolic reduction will accumulate dark blue and light purple partial product clusters together. This is due to a rigid aspect ratio of a systolic array. Can we map another section of
the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
128
![Page 51: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/51.jpg)
51
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
No, systolic reduction will accumulate dark blue and light purple partial product clusters together, which is incorrect functionally. This is due to a rigid aspect ratio of a systolic array.
Can we map another section of the stationary matrix onto the TPU to increase utilization?
Mapping GEMMs onto TPUs - Irregularity
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs.
Observation 3: Large systolic arrays have large load and reduction latency.
Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both
sparse and irregular.
![Page 52: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/52.jpg)
Sparsity in DNN Models
GNMT Pruning - Temporal Sparsity(https://www.intel.ai/compressing-gnmt-models)
52
![Page 53: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/53.jpg)
Sparsity in DNN Models
GNMT Pruning - Temporal Sparsity(https://www.intel.ai/compressing-gnmt-models)
Transformer Sparsity - Impact on BLEU(The State of Sparsity in Deep Neural Networks, Gale et al., arXiv)
53
![Page 54: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/54.jpg)
Sparsity in DNN Models
GNMT Pruning - Temporal Sparsity(https://www.intel.ai/compressing-gnmt-models)
Transformer Sparsity - Impact on BLEU(The State of Sparsity in Deep Neural Networks, Gale et al., arXiv)
54
Weight sparsity ranges from 40% to 90%. Activation sparsity is approximately
30% to 70% from ReLU, dropout, etc.
![Page 55: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/55.jpg)
Workload Application Example DimensionsM N K
GNMT Machine Translation
128 2048 4096320 3072 40961632 36548 10242048 4096 32
DeepBench General Workload
1024 16 50000035 8457 2560
Transformer Language Understanding
31999 1024 8484 1024 4096
NCF Collaborative Filtering
2048 1 128256 256 2048
Mapping GEMMs onto TPUs - Sparsity
55
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
Usually these GEMMs are sparse!
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
GEMMs used for evaluation.
![Page 56: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/56.jpg)
56
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
What happens if half of the stationary matrix are zeros?
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950
Mapping GEMMs onto TPUs - Sparsity
![Page 57: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/57.jpg)
57
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
What happens if half of the stationary matrix are zeros?
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Mapping GEMMs onto TPUs - Sparsity
![Page 58: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/58.jpg)
58
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
What happens if half of the stationary matrix are zeros?
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Mapping GEMMs onto TPUs - Sparsity
![Page 59: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/59.jpg)
59
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
What happens if half of the stationary matrix are zeros?
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Multiplication with an operand that is zero is considered underutilized.
Mapping GEMMs onto TPUs - Sparsity
![Page 60: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/60.jpg)
60
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
87.5% of the PEs are not utilized for this GEMM.
Multiplication with an operand that is zero is considered underutilized.
Mapping GEMMs onto TPUs - Sparsity
![Page 61: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/61.jpg)
61
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
87.5% of the PEs are not utilized for this GEMM.
Multiplication with an operand that is zero is considered underutilized.
Mapping GEMMs onto TPUs - Sparsity
Weight stationary systolic arrays are underutilized for sparse GEMMs because they have to map zeros. How
can we map only nonzeros stationary?
![Page 62: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/62.jpg)
62
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Can we map other nonzero elements where the idle PEs used to be?
Mapping GEMMs onto TPUs - Sparsity
![Page 63: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/63.jpg)
63
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Can we map other nonzero elements where the idle PEs used to be?
Mapping GEMMs onto TPUs - Sparsity
128
![Page 64: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/64.jpg)
64
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Can we map other nonzero elements where the idle PEs used to be?
Mapping GEMMs onto TPUs - Sparsity
128
![Page 65: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/65.jpg)
65
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Can we map other nonzero elements where the idle PEs used to be?
Mapping GEMMs onto TPUs - Sparsity
128
![Page 66: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/66.jpg)
66
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Can we map other nonzero elements where the idle PEs used to be?
Mapping GEMMs onto TPUs - Sparsity
128
![Page 67: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/67.jpg)
67
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Dark blue and light purple clusters accumulate, which is incorrect. Systolic reduction only allows fixed size dot products, which is the size of a column.
Can we map other nonzero elements where the idle PEs used to be?
Mapping GEMMs onto TPUs - Sparsity
128
![Page 68: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/68.jpg)
68
Stationary matrix
M =
204
8
Stre
amin
g m
atri
x
N = 4096
K =
32
128
K = 32
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
TPU (Systolic Array)
15
31
47
63
79
95
111
127
15 31 47 63 79 111 127950M = 2048
Reduce partial sum down each column.
Dark blue and light purple clusters accumulate, which is incorrect. Systolic reduction only allows fixed size dot products, which is the size of a column.
Can we map other nonzero elements where the idle PEs used to be?
Mapping GEMMs onto TPUs - Sparsity
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.
Observation 3: Large systolic arrays have large load and reduction latency.
Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both
sparse and irregular.
![Page 69: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/69.jpg)
TPU (Systolic Array)
Mapping GEMMs onto TPUs - Scalability
StationarymatrixK
= 2
048
N = 256
69
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
![Page 70: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/70.jpg)
TPU (Systolic Array)
StationarymatrixK
= 2
048
N = 256
70
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Cycle 1
Mapping GEMMs onto TPUs - Scalability
![Page 71: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/71.jpg)
TPU (Systolic Array)
StationarymatrixK
= 2
048
N = 256
71
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Cycle 2
Mapping GEMMs onto TPUs - Scalability
![Page 72: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/72.jpg)
TPU (Systolic Array)
StationarymatrixK
= 2
048
N = 256
72
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Cycle 3
Mapping GEMMs onto TPUs - Scalability
![Page 73: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/73.jpg)
TPU (Systolic Array)
StationarymatrixK
= 2
048
N = 256
73
K = 2048
M =
256
Streaming matrix
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127 128
128
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
Cycle 128Loading takes 128 cycles, which
scales at O(sqrtN).
Mapping GEMMs onto TPUs - Scalability
![Page 74: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/74.jpg)
TPU (Systolic Array)
74
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
Reduce partial sum down each column.
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
Reduction will take 128 cycles for the last accumulate to finish before loading a new
portion of the stationary matrix.
Mapping GEMMs onto TPUs - Scalability
![Page 75: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/75.jpg)
TPU (Systolic Array)
75
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
Reduce partial sum down each column.
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
Reduction will take 128 cycles for the last accumulate to finish before loading a new
portion of the stationary matrix.
Mapping GEMMs onto TPUs - Scalability
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.
Observation 3: Large systolic arrays have significant load and reduction latency.
Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both
sparse and irregular.
![Page 76: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/76.jpg)
TPU (Systolic Array)
76
0
1
2
3
4
.
.
127
0 1 2 3 4 . . 127M = 256
Reduce partial sum down each column.
** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)
StationarymatrixK
= 2
048
N = 256
K = 2048
M =
256
Streaming matrix
128
128
128
Reduction will take 128 cycles for the last accumulate to finish before loading a new
portion of the stationary matrix.
Mapping GEMMs onto TPUs - Scalability
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.
Observation 3: Large systolic arrays have significant load and reduction latency.
Takeaway: Systolic Arrays are underutilized on emerging GEMM workloads that are
both sparse and irregular.
![Page 77: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/77.jpg)
Outline• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
77
![Page 78: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/78.jpg)
Key Accelerator Requirements
Requirement Systolic Array Limitation SIGMA Desired Traits
Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio
Sparsity Support ● Data forwarding every cycle in horizontal / vertical direction requires systolic array to map zeros.
● Support sparsity by only mapping nonzeros
● ability to create simultaneous variable sized dot product
Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction
● O(1) distribution● O(log2N) reduction
78
![Page 79: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/79.jpg)
Key Accelerator Requirements
Requirement Systolic Array Limitation SIGMA Desired Traits
Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio
Sparsity Support ● data forwarding every cycle requires systolic array to map zeros
● sparsity support by mapping only nonzeros stationary
● ability to create simultaneous variable sized dot product
Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction
● O(1) distribution● O(log2N) reduction
79
![Page 80: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/80.jpg)
Key Accelerator Requirements
Requirement Systolic Array Limitation SIGMA Desired Traits
Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio
Sparsity Support ● data forwarding every cycle requires systolic array to map zeros
● sparsity support by mapping only nonzeros stationary
● ability to create simultaneous variable sized dot product
Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction
● O(1) distribution● O(logN) reduction
80
![Page 81: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/81.jpg)
Key Accelerator Requirements
Requirement Systolic Array Limitation SIGMA Desired Traits
Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio
Sparsity Support ● data forwarding every cycle requires systolic array to map zeros
● sparsity support by mapping only nonzeros
● ability to create simultaneous variable sized dot product
Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction
● O(1) distribution● O(logN) reduction
81
With flexible and scalable interconnects between all PEs, SIGMA can solve the three requirements.
![Page 82: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/82.jpg)
Systolic vs SIGMA High Level Interconnects
A B C D
I J K L
4 x 4 Systolic Array
82
A B C D
I J K L
E F G H
M N O P
16 PE SIGMA
● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and
reduction
![Page 83: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/83.jpg)
Systolic vs SIGMA High Level Interconnects
83
16 PE SIGMA
A B C D
I J K L
E F G H
Distribution Network
M N O P
** Microarchitecture details on the networks will be discussed later
A B C D
I J K L
4 x 4 Systolic Array
● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and
reduction
![Page 84: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/84.jpg)
Systolic vs SIGMA High Level Interconnects
84
16 PE SIGMA
A B C D
I J K L
E F G H
Distribution Network
M N O P
** Microarchitecture details on the networks will be discussed later
A B C D
I J K L
4 x 4 Systolic Array
● Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs
● Ability to send any streaming element to any PE
● O(1) distribution
● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and
reduction
![Page 85: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/85.jpg)
Systolic vs SIGMA High Level Interconnects
85
** Microarchitecture details on the networks will be discussed later
A B C D
I J K L
4 x 4 Systolic Array 16 PE SIGMA
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
● Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs
● Ability to send any streaming element to any PE
● O(1) distribution
● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and
reduction
![Page 86: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/86.jpg)
Systolic vs SIGMA High Level Interconnects
86
** Microarchitecture details on the networks will be discussed later
A B C D
I J K L
4 x 4 Systolic Array 16 PE SIGMA
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
● Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs
● Ability to send any streaming element to any PE
● O(1) distribution
● Reduction network allows SIGMA to reduce variable sized dot products
● Addresses sparsity and irregularity
● O(logN) reduction
● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and
reduction
![Page 87: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/87.jpg)
A B C D
I J K L
4 x 4 Systolic Array
a
b
c
d
e
f
g
h
87
Irregular GEMMs on SIGMA
Set up the stationary values.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 88: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/88.jpg)
A B C D
I J K L
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
c
d
a
b
c
d
e
f
g
h
e
f
g
h
88
16 PE SIGMA
Irregular GEMMs on SIGMA
Set up the stationary values.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 89: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/89.jpg)
A B C D
I J K L
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
c
d
a
b
c
d
e
f
g
h
e
f
g
h
a
b
a
b
a
b
a
b
a
b
a
b
a
b
89
16 PE SIGMA
Irregular GEMMs on SIGMA
Next cycle: Multicast first row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 90: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/90.jpg)
A B C D
I J K L
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
c
d
c
d
e
f
g
h
e
f
g
h
c
d
c
d
c
d
c
d
d
c
c
d
c
d
90
16 PE SIGMA
Irregular GEMMs on SIGMA
Next cycle: Multicast second row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 91: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/91.jpg)
A B C D
I J K L
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
c
d
e
f
e
f
g
h
g
h
e
f
e
f
e
f
e
f
f
e
e
f
e
f
91
16 PE SIGMA
Irregular GEMMs on SIGMA
Next cycle: Multicast third row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 92: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/92.jpg)
A B C D
I J K L
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
c
d
g
h
e
f
g
h
g
h
g
h
g
h
g
h
h
g
g
h
g
h
92
16 PE SIGMA
Irregular GEMMs on SIGMA
Next cycle: Multicast fourth row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 93: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/93.jpg)
E F G H
M N O P
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
g
h
g
h
g
h
g
h
g
h
h
g
g
h
g
h
a
b
c
d
e
f
g
h
93
16 PE SIGMA
Irregular GEMMs on SIGMA
After accumulation, SIGMA is done. However, the systolic array has to map the other side of the stationary matrix and stream in the MK matrix again (referred to as folding).
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 94: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/94.jpg)
E F G H
M N O P
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
g
h
g
h
g
h
g
h
g
h
h
g
g
h
g
h
a
b
c
d
e
f
g
h
94
16 PE SIGMA
Irregular GEMMs on SIGMA
After accumulation, SIGMA is done. However, the systolic array has to map the other side of the stationary matrix and stream in the MK matrix again.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
SIGMA reduces the number of folds, which then reduces the number of memory references on the streaming matrix.
![Page 95: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/95.jpg)
E F G H
M N O P
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
95
16 PE SIGMA
Irregular GEMMs on SIGMA
Final cycle count.
Systolic Array total runtime:
24 cycles
SIGMA total runtime:
13 cycles
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 96: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/96.jpg)
E F G H
M N O P
A B C D
I J K L
E F G H
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
96
16 PE SIGMA
Irregular GEMMs on SIGMA
Final cycle count.
Systolic Array total runtime:
24 cycles
SIGMA total runtime:
13 cycles
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
SIGMA maximizes PE utilization with its flexible interconnects for irregular GEMMs.
![Page 97: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/97.jpg)
Sparse Irregular GEMMs on SIGMA
A B C D
H I J
4 x 4 Systolic Array
a
b
cd
f
g
e
97
Set up the stationary values.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 98: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/98.jpg)
A B C D
H I J
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
cd
f
g
e
a
b
cd
f
g
e
98
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Set up the stationary values.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 99: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/99.jpg)
A B C D
H I J
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
a
b
cd
f
g
e
cd
f
g
e
a
b
a
b
b
b
a
b
a
b
a
b
a
b
99
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Next cycle: Multicast first row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 100: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/100.jpg)
A B C D
H I J
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
ccd
f
g
e
d
f
g
e
c c
c c c c
100
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Next cycle: Multicast second row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 101: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/101.jpg)
A B C D
H I J
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b
dcd
f
g
e
g
e
d d
d d d d
101
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Next cycle: Multicast third row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 102: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/102.jpg)
A B C D
H I J
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
a
b e
cd
f
g
e e e
e
e
e e e e
102
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Next cycle: Multicast fourth row of MK to the corresponding stationary elements.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 103: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/103.jpg)
C D E
K L M N
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
4 x 4 Systolic Array
e e e
e
e
e e e e
a
b
cd
f
g
e
103
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
After accumulation, SIGMA is done. However, the systolic array has to map the other part of the stationary matrix and stream in the MK matrix again (referred to as folding).
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 104: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/104.jpg)
F G
O P
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
16 PE SIGMA Flex-DPE4 x 4 Systolic Array
e e e
e
e
e e e e
a
b
cd
f
g
e
104
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Again, the systolic array has to map another part of the stationary matrix and stream MK again.
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 105: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/105.jpg)
F G
O P
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
16 PE SIGMA Flex-DPE4 x 4 Systolic Array
105
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Final cycle count.
Systolic Array total runtime:
34 cycles
SIGMA total runtime:
13 cycles
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
![Page 106: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/106.jpg)
F G
O P
A B J C
H I L K
D E F G
Distribution Network
Reduction NetworkM N O P
16 PE SIGMA Flex-DPE4 x 4 Systolic Array
106
16 PE SIGMA
Sparse Irregular GEMMs on SIGMA
Final cycle count.
Systolic Array total runtime:
34 cycles
SIGMA total runtime:
13 cycles
** Assuming MK matrix is
streaming and KN matrix is
stationary. (aka weight
stationary)
SIGMA maps only nonzeros stationary; therefore, reduces the number of folds needed.
![Page 107: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/107.jpg)
Outline• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
107
![Page 108: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/108.jpg)
O(1) Distribution TopologyUnicast
(Loading Stationary Matrix)
Crossbar
X X X XX X X X X X X X X X X X
Benes Crossbar Benes
108
Source
Destination Destination Destination Destination
Source Source Source
Multicast (Sending Streaming Matrix)
![Page 109: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/109.jpg)
O(1) Distribution Topology
Crossbar Benes
109
Source Source
Crossbar BenesSource Source
X X X XX X X X X X X X X X X X
Destination Destination Destination Destination
Crossbar Benes Crossbar Benes
Unicast (Loading Stationary Matrix)
Multicast (Sending Streaming Matrix)
![Page 110: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/110.jpg)
O(1) Distribution Topology
Crossbar Benes
110
Source Source
Crossbar BenesSource Source
X X X XX X X X X X X X X X X X
Destination Destination Destination Destination
Crossbar Benes Crossbar Benes
Unicast (Loading Stationary Matrix)
Multicast (Sending Streaming Matrix)
![Page 111: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/111.jpg)
O(1) Distribution Topology
Crossbar Benes
Unicast Multicast
111
Source Source
Crossbar BenesSource Source
X X X XX X X X X X X X X X X X
Destination Destination Destination Destination
SIGMA’s distribution can be either a Crossbar or Benes network. We chose Benes because the number of
switches scale by O(N logN).
![Page 112: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/112.jpg)
O(logN) Reduction (Limitation of Adder Tree)
Log2(N) reduction time 112
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 17 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15 2-input BF16 Mult Switch2-input BF32 Adder Switch
11
25 13 21
Different clusters of partial sums.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
![Page 113: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/113.jpg)
Log2(N) reduction time 113
1 5 9 17 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15 2-input BF16 Mult Switch2-input BF32 Adder Switch
11
25 13 21
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
Different clusters of partial sums.
Regular adder tree will accumulate a + b, which is incorrect functionality.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
O(logN) Reduction (Limitation of Adder Tree)
![Page 114: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/114.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 17 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
Forwarding Adder Network (FAN)
114
Different clusters of partial sums.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
![Page 115: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/115.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 17 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
Forwarding Adder Network (FAN)
115
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
![Page 116: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/116.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 17 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
Pipelined at 1 cycle per adder level.
Forwarding Adder Network (FAN)
116
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
![Page 117: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/117.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
Cycle 1: Bypass conflicting partial sums
117
Forwarding Adder Network (FAN)
Bypass adder and forward partial sum to next level!
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
![Page 118: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/118.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
Cycle 2: Partial sum red complete
118
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
sent to output buffer
Forwarding Adder Network (FAN)
![Page 119: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/119.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 1 21
119
Forwarding Adder Network (FAN)
Cycle 3: Partial sums green and orange complete
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
sent to output buffer sent to output buffer
![Page 120: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/120.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
Cycle 4: Partial sum blue complete
120
Forwarding Adder Network (FAN)
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
sent to output buffer
![Page 121: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/121.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
121
Forwarding Adder Network (FAN)
Cycle 5: Partial sum purple complete
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
sent to output buffer
![Page 122: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/122.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
122
Forwarding Adder Network (FAN)
Cycle 5: Partial sum purple complete
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
+
x x xx+
+
+x x xx
+
+
+
+
x x xx+
+
+xx
+
+
+
+
+ +2-input BF16 Mult Switch2-input BF32 Adder Switch
x
x x+
send to output buffer
Note: The output buffer has FFs to maintain correct timing since different clusters may complete at different time.
![Page 123: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/123.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
123
Forwarding Adder Network (FAN)
Cycle 4: Partial sum purple complete
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
Our novel FAN topology is both lightweight and flexible.
It can easily replace any other accelerators that use adder trees.
More details such as the routing algorithm and overhead analysis can be found in the paper.
![Page 124: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/124.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
124
Forwarding Adder Network (FAN)
Cycle 4: Partial sum purple complete
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
Our novel FAN topology is both lightweight and flexible.
It can replace regular adder trees in other hardware accelerators.
More details such as the routing algorithm and overhead analysis can be found in the paper.
![Page 125: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/125.jpg)
Log2(N) reduction time
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
1 5 9 13 17 21 25 29
19 27
23
0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e
3
7
15
N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF
11
25 13 21
125
Forwarding Adder Network (FAN)
Cycle 4: Partial sum purple complete
Different clusters of partial sums.
FAN is optimized for floating point reductions, commonly used during DNN training.
Our novel FAN topology is both lightweight and flexible.
It can replace regular adder trees in other hardware accelerators.
More details such as the routing algorithm and overhead analysis can be found in the paper.
![Page 126: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/126.jpg)
Outline• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
126
![Page 127: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/127.jpg)
Log2(N) reduction time 127
SIGMA High Level Diagram
Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
![Page 128: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/128.jpg)
Log2(N) reduction time 128
SIGMA High Level Diagram
Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM
matrices.
![Page 129: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/129.jpg)
Log2(N) reduction time 129
SIGMA High Level Diagram
Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM
matrices.
Global Controller● Logic comparisons on bitmaps to determine
what nonzero stationary elements are required
![Page 130: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/130.jpg)
Log2(N) reduction time 130
SIGMA High Level Diagram
Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM
matrices.
Global Controller● Logic comparisons on bitmaps to determine
what nonzero stationary elements are required
Sparsity Filter && Input Data Arbiter● Reorganizes data for loading stationary
elements and sending streaming elements
![Page 131: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/131.jpg)
Log2(N) reduction time 131
SIGMA High Level Diagram
Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM
matrices.
Global Controller● Logic comparisons on bitmaps to determine
what nonzero stationary elements are required
Sparsity Filter && Input Data Arbiter● Reorganizes data for loading stationary
elements and sending streaming elements
Accumulation SRAM ● Buffer for partial sum accumulations
![Page 132: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/132.jpg)
Log2(N) reduction time 132
SIGMA High Level Diagram
Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM
matrices.
Global Controller● Logic comparisons on bitmaps to determine
what nonzero stationary elements are required
Sparsity Filter && Input Data Arbiter● Reorganizes data for loading stationary
elements and sending streaming elements
Accumulation SRAM ● Buffer for partial sum accumulations
SIGMA Engine● Compute engine
![Page 133: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/133.jpg)
Log2(N) reduction time 133
SIGMA High Level Diagram
X
x x x x
- - - -
+ +
+
- - - -
X
x x x x
- - - -
+ +
+
- - - -
SIGMA Flex-DPE 0
Distribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
SIGMAFlex-DPE N
Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
+
SIGMA Engine● Compute engine
![Page 134: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/134.jpg)
Outline• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
134
![Page 135: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/135.jpg)
Methodology
135
● Hardware components are written in Verilog
● Post layout area and power numbers are on a 28nm process
● Analytical model for cycle counts assumes uniform random sparsity
Workload Application Example DimensionsM N K
GNMT Machine Translation
128 2048 4096320 3072 40961632 36548 10242048 4096 32
DeepBench General Workload
1024 16 50000035 8457 2560
Transformer Language Understanding
31999 1024 8484 1024 4096
NCF Collaborative Filtering
2048 1 128256 256 2048
GEMMs used for evaluation.
![Page 136: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/136.jpg)
SIGMA vs TPU - Dense GEMMs
136
![Page 137: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/137.jpg)
SIGMA vs TPU - Dense GEMMs
For GEMMs with dimensions that fit perfectly on the TPU, SIGMA offers slight speedup from its O(1) distribution and O(logN) reduction.
137
![Page 138: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/138.jpg)
SIGMA vs TPU - Dense GEMMs
SIGMA offers 8x speedup for this GEMM. This is because the N dimension of 16 cannot fit perfectly on TPU. Also, O(logN) reduction is more effective since K = 500000.
138
![Page 139: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/139.jpg)
SIGMA vs TPU - Dense GEMMs
SIGMA offers large speedup from its scalable FAN reduction (more effective since K = 500000). Additionally, this GEMM cannot fit perfectly on TPU because of the N dimension.
139
SIGMA performs on average 1.8x better than systolic array architectures for irregular GEMMs.
![Page 140: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/140.jpg)
SIGMA vs TPU - Sparse GEMMs
140
![Page 141: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/141.jpg)
SIGMA vs TPU - Sparse GEMMs
141
SIGMA enables speedup by only mapping nonzero elements stationary.
![Page 142: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/142.jpg)
SIGMA vs TPU - Sparse GEMMs
142
SIGMA enables speedup by only mapping nonzero elements stationary.
SIGMA performs on average 5.7x better than systolic array architectures for sparse and irregular GEMMs.
![Page 143: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/143.jpg)
SIGMA vs Sparse Accelerators
143
![Page 144: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/144.jpg)
SIGMA Qualitative Analysis
144
![Page 145: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/145.jpg)
SIGMA Qualitative Analysis
145
SIGMA performs on average 3x better than state-of-the-art sparse accelerators. In depth analysis can
be found in the paper.
![Page 146: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/146.jpg)
Systolic Array vs SIGMA Comparison
146
** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.
![Page 147: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/147.jpg)
Systolic Array vs SIGMA Comparison
147
** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.
SIGMA consumes 38% more area and 82% more power than Systolic Array.
![Page 148: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/148.jpg)
Systolic Array vs SIGMA Comparison
148
** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.
SIGMA achieves 5.7x higher effective TFLOPS for a 3.2x higher effective TFLOPS/W.
![Page 149: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/149.jpg)
Systolic Array vs SIGMA Comparison
149
** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.
SIGMA achieves 5.7x higher effective TFLOPS for a 3.2x higher effective TFLOPS/W.
SIGMA consumes more resources but achieves higher effective TFLOPS/W.
![Page 150: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/150.jpg)
Outline• Motivation
○ GEMMs in Deep Learning
○ Utilization on TPU
• Accelerator Requirements
• SIGMA
○ Interconnects Implementations
○ Full System Design
• Evaluation
• Conclusion
150
![Page 151: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/151.jpg)
Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
151
![Page 152: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/152.jpg)
Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
● High utilization from systolic arrays is challenging because of their rigid structure.
152
![Page 153: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/153.jpg)
Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
● High utilization from systolic arrays is challenging because of their rigid structure.
● SIGMA enables high compute utilization on sparse irregular GEMMs.
153
![Page 154: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/154.jpg)
Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
● High utilization from systolic arrays is challenging because of their rigid structure.
● SIGMA enables high compute utilization on sparse irregular GEMMs.
● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse
accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN
reduction interconnects.
154
![Page 155: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/155.jpg)
Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
● High utilization from systolic arrays is challenging because of their rigid structure.
● SIGMA enables high compute utilization on sparse irregular GEMMs.
● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse
accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN
reduction interconnects.
● SIGMA achieves 3.2x higher Effective TFLOPS/W than Systolic Arrays.
155
![Page 156: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/156.jpg)
Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.
● High utilization from systolic arrays is challenging because of their rigid structure.
● SIGMA enables high compute utilization on sparse irregular GEMMs.
● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse
accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN
reduction interconnects.
● SIGMA achieves 3.2x higher Effective TFLOPS/W than Systolic Arrays.
● Future work: Optimizations such as power gating and software stack design.
Thank you! 😊
156
![Page 157: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/157.jpg)
Material Backup
157
![Page 158: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/158.jpg)
Walkthrough
0 1 1 1
1 1 1 0
1 1 1 0
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Dim. Registers
M-dim 3
N-dim 4
K-dim 4
Compressed bitmap equivalent
Stationary Bitmap
Streaming Bitmap
A B C D E F G H I
Stationary Nonzero data
a b c d e f g
Streaming Nonzero data
158
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 159: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/159.jpg)
Walkthrough
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
159
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 160: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/160.jpg)
Walkthrough
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
1
1
1
0
160
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 161: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/161.jpg)
Walkthrough
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
1
1
1
0
0 1 1 1
1 1 1 0
1 1 1 0
Stationary Bitmap
161
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 162: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/162.jpg)
Walkthrough
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
1
1
1
0
0 1 1 1
1 1 1 0
1 1 1 0
Stationary Bitmap
162
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 163: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/163.jpg)
Walkthrough
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
1
1
1
0
0 1 1 1
1 1 1 0
1 1 1 0
Stationary Bitmap
0 1 1 1
1 1 1 0
1 1 1 0
Stationary’ Bitmap
163
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 164: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/164.jpg)
Walkthrough
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
1
1
1
0
0 1 1 1
1 1 1 0
1 1 1 0
Stationary Bitmap
A B C D E F G H I
0 1 1 0
1 1 1 0
1 1 1 0
Stationary’ Bitmap
164
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 165: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/165.jpg)
Walkthrough
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
1
1
1
0
0 1 1 1
1 1 1 0
1 1 1 0
Stationary Bitmap
A B C D E F G H I
0 1 1 0
1 1 1 0
1 1 1 0
Stationary’ Bitmap
165
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
The main idea is to find the necessary nonzero elements to map stationary.
![Page 166: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/166.jpg)
Walkthrough
X
x x x x
A B D E
+ +
+
- - - -
X
x x x x
- - - -
+ +
+
- - - -
Flex-DPE 0 Flex-DPE 1
Distribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I
166
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
Cycle 1: unicast elements stationary to Flex-DPE 0
+
![Page 167: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/167.jpg)
Walkthrough
X
A B D E- - - -
X
- - - -
Flex-DPE 0 Flex-DPE 1
Distribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I
167
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
Cycle 2: unicast elements stationary to Flex-DPE 1
x x x x
+ +
+
x x x x
+ +
+
+
![Page 168: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/168.jpg)
Walkthrough
0 1 1 0
1 1 1 0
1 1 1 0
Stationary’ Bitmap
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
0 1
2 3 0
1 2 3
1 ... ... ...
1 ... ... ...
1 ... ... ...
Output BitmapRow ID0
1
2
0
1
168
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
![Page 169: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/169.jpg)
Walkthrough
0 1 1 0
1 1 1 0
1 1 1 0
Stationary’ Bitmap
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
0 1
2 3 0
1 2 3
1 ... ... ...
1 ... ... ...
1 ... ... ...
Output Bitmap0
1
2
0
1
169
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
Row ID
![Page 170: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/170.jpg)
Walkthrough
0 1 1 0
1 1 1 0
1 1 1 0
Stationary’ Bitmap
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming Bitmap
0 1
2 3 0
1 2 3
1 ... ... ...
1 ... ... ...
1 ... ... ...
Output Bitmap0
1
2
0
1
170
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
Row ID
The main idea is to find the matching source and destination indices.
![Page 171: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/171.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I
a f - - a f - -
171
Cycle 3: multicast 1st column of streaming matrix and reduce
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
- f a - f a - f
b d
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
x x x x
+ +
+
x x x x
+ +
+
+
![Page 172: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/172.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I
a f - - a f - -
172
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
- f a - f a - f
b d
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
x x x x
+ +
+
x x x x
+ +
+
+
Cycle 3: multicast 1st column of streaming matrix and reduce
![Page 173: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/173.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I
b d - - b d - -
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
173
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
e
d - b d - b d -
Bf Da-- -- Ga --Ff Ifx x x x
+ +
+
x x x x
+ +
+
+
Cycle 4: multicast 2nd column of streaming matrix and reduce
![Page 174: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/174.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I
e - - - e - - -
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
174
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
e - - e - - e -
c g
-- DbAd Ed Gb Hd-- --
Bf Ga If
x x x x
+ +
+
x x x x
+ +
+
+
Da Ff
Cycle 5: multicast 3rd column of streaming matrix and reduce
![Page 175: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/175.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I - g c - g c - g
c g - - c g - -
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
175
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
-- -- Ee -- He-- --
Ga + If
Db + Ed
x x x x
+ +
+
x x x x
+ +
+
+
Ae
Ad --
Da FfGb Hd
Cycle 6: multicast last column of streaming matrix and reduce
![Page 176: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/176.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I - - - - - - - -
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
176
Cycle 7: FAN Reduction
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
-- Dc -- Gc --Fg Ig
Ee
x x x x
+ +
+
x x x x
+ +
+
+
Bg--
Ae --
Da + Ff
Db + Ed -- Gb + Hd
-- He
![Page 177: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/177.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I - - - - - - - -
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
177
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
-- -- -- -- ---- --
Db + Ed
x x x x
+ +
+
x x x x
+ +
+
+
----
Bg Dc Fg
Ee -- He
Gc Ig
Cycle 8: FAN Reduction
![Page 178: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/178.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I - - - - - - - -
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
178
Cycle 9: FAN Reduction
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
Ee
x x x x
+ +
+
x x x x
+ +
+
+
Dc Fg Gc + Ig
![Page 179: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/179.jpg)
Walkthrough
X
A B D E
XDistribution busSwitch
Distribution Network - Benes
Buffers
Multipliers
Reduction Network - FAN
F G H I - - - - - - - -
SIGMA Flex-DPE 0
SIGMA Flex-DPE 1
179
Cycle 10: FAN Reduction
i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce
Walkthrough Section
Example GEMM
Dc + Fg
x x x x
+ +
+
x x x x
+ +
+
+
![Page 180: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity](https://reader035.fdocuments.in/reader035/viewer/2022062603/5f0beb387e708231d432dd19/html5/thumbnails/180.jpg)
180