
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training

ERIC QIN*, ANANDA SAMAJDAR*, HYOUKJUN KWON*, VINEET NADELLA*, SUDARSHAN SRINIVASAN#, DIPANKAR DAS#, BHARAT KAUL#, TUSHAR KRISHNA*

* GEORGIA TECH   # INTEL

http://synergy.ece.gatech.edu


Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion



Deep Learning Applications

Speech Recognition, Recommender Systems, Language Understanding

What is the key computation for these Deep Learning applications?

Runtime breakdown on V100 GPU

[Chart: fraction of training runtime spent in GEMMs. Transformer (Language Understanding): 77%; GNMT (Machine Translation): 64%.]

Matrix multiplications (GEMMs) consume around 70% of the total runtime when training modern deep learning workloads.

GEMMs in Deep Learning

[Figure: the three training GEMMs, derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.]

• Forward pass (inference and training): Input Feature Map [M x K] x Weights [K x N] = Output Feature Map [M x N]
• Backward pass (training): Output Feature Error [M x N] x WeightsT [N x K] = Input Feature Error [M x K]
• Gradient computation (training): Input Feature MapT [K x M] x Output Feature Error [M x N] = ΔWeights [K x N]

GEMM MNK dimension representation:
○ M dim: batch size
○ N dim: number of channels in the next layer
○ K dim: H * W * C

GEMM is a key compute primitive to accelerate in hardware to speed up training.
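
To make the three GEMM shapes concrete, here is a minimal NumPy sketch of one fully connected layer; the layer sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical fully connected layer (illustrative shapes):
# M = batch size, K = input features (H * W * C), N = output channels
M, K, N = 128, 4096, 2048

x  = np.random.randn(M, K).astype(np.float32)   # input feature map      [M x K]
w  = np.random.randn(K, N).astype(np.float32)   # weights                [K x N]
dy = np.random.randn(M, N).astype(np.float32)   # output feature error   [M x N]

y  = x @ w      # forward pass (inference and training): [M x K] @ [K x N] -> [M x N]
dx = dy @ w.T   # backward pass (training):              [M x N] @ [N x K] -> [M x K]
dw = x.T @ dy   # gradient computation (training):       [K x M] @ [M x N] -> [K x N]

print(y.shape, dx.shape, dw.shape)   # (128, 2048) (128, 4096) (4096, 2048)
```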

Hardware for Accelerating GEMMs

• SIMT architectures: Nvidia GTX GPUs
• SIMD architectures: Microsoft Brainwave, ARM Trillium, Tesla FSDC
• Systolic architectures: Google TPU, Xilinx xDNN, Nvidia Tensor Cores

Recently, systolic array based architectures have become popular for accelerating GEMMs.

Target comparison: Google TPU

TPU figure from https://cloud.google.com/tpu/docs/system-architecture

Our target comparison is with the Google TPU, which uses 128 x 128 systolic arrays.

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


TPU (Systolic Array)

Systolic Array Architectures

[Figure: a 128 x 128 systolic array of MAC PEs, with rows and columns indexed 0 through 127.]

[Figure: MAC PE microarchitecture, showing a multiplier and adder with a weight buffer, a top input and a streaming input, a streaming output and a bottom output, and a Load (0) / Accumulate (1) select.]

TPU (Systolic Array)

Mapping GEMMs onto TPUs

GEMMs used for evaluation:

Workload      Application               Example Dimensions (M, N, K)
GNMT          Machine Translation       (128, 2048, 4096), (320, 3072, 4096), (1632, 36548, 1024), (2048, 4096, 32)
DeepBench     General Workload          (1024, 16, 500000), (35, 8457, 2560)
Transformer   Language Understanding    (31999, 1024, 84), (84, 1024, 4096)
NCF           Collaborative Filtering   (2048, 1, 128), (256, 256, 2048)

Let's map this GEMM: the NCF GEMM with (M, N, K) = (256, 256, 2048).

TPU (Systolic Array)

Mapping GEMMs onto TPUs

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: a 128 x 128 tile of the stationary KN matrix (K = 2048, N = 256) is loaded into the array, and the matching 256 x 128 slice of the streaming MK matrix (M = 256, K = 2048) is streamed through it.]

The streaming elements get multiplied with the stationary elements. The partial sums get accumulated down each column.

Systolic Arrays are popular because they enable efficient data reuse and are very simple to implement.
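
As a rough illustration of the weight-stationary schedule above (not the TPU's actual scheduler), the sketch below counts how many 128 x 128 stationary tiles ("folds") a dense GEMM needs and how much of the MK matrix is re-streamed per fold; the tile size and the fold formula are simplifying assumptions for illustration only.

```python
import math

def weight_stationary_folds(M, N, K, rows=128, cols=128):
    """Number of stationary-tile loads ('folds') needed to cover a dense K x N stationary
    matrix on a rows x cols weight-stationary array, plus the MK elements streamed per fold."""
    folds = math.ceil(K / rows) * math.ceil(N / cols)   # one fold per rows x cols stationary tile
    streamed_per_fold = M * rows                        # each fold streams an M x rows slice of MK
    return folds, streamed_per_fold

# The dense example above: M = 256, N = 256, K = 2048 on a 128 x 128 array
print(weight_stationary_folds(256, 256, 2048))   # (32, 32768)
```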

Mapping GEMMs onto TPUs - Irregularity

Let's map another GEMM from the evaluation table: the GNMT GEMM with M = 2048, N = 4096, K = 32.

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: the stationary KN matrix (K = 32, N = 4096) fills only the top 32 rows of the 128 x 128 array while the MK matrix (M = 2048, K = 32) streams through; partial sums are reduced down each column.]

75% of the PEs are not utilized for this GEMM, because K = 32 occupies only 32 of the 128 rows.

The rigid structure of Systolic Arrays causes PE underutilization. How can we enable the remaining PEs?

Can we map another section of the stationary matrix onto the TPU, and duplicate the streaming data, to increase utilization? No: the systolic reduction would accumulate the dark blue and light purple partial product clusters together, which is functionally incorrect. This is due to the rigid aspect ratio of a systolic array.

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs.

Observation 3: Large systolic arrays have large load and reduction latency.

Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
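
A back-of-the-envelope utilization check under the weight-stationary assumption above; this is a simplification that looks at a single stationary tile and ignores pipeline fill and drain.

```python
def mapping_utilization(K, N, rows=128, cols=128):
    """Fraction of PEs holding useful stationary elements for one K x N tile mapping
    on a rows x cols weight-stationary array (single tile, simplified)."""
    used = min(K, rows) * min(N, cols)
    return used / (rows * cols)

print(mapping_utilization(K=2048, N=256))   # 1.0  -> the dense (256, 256, 2048) example fills the array
print(mapping_utilization(K=32,   N=4096))  # 0.25 -> the irregular GNMT example leaves 75% of PEs idle
```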

Sparsity in DNN Models

[Figure: GNMT pruning, temporal sparsity (https://www.intel.ai/compressing-gnmt-models); Transformer sparsity, impact on BLEU (The State of Sparsity in Deep Neural Networks, Gale et al., arXiv).]

Weight sparsity ranges from 40% to 90%. Activation sparsity is approximately 30% to 70% from ReLU, dropout, etc.

Mapping GEMMs onto TPUs - Sparsity

Usually these GEMMs are sparse! What happens if half of the stationary matrix is zeros?

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: the same irregular GNMT mapping (M = 2048, N = 4096, K = 32), but with half of the stationary elements being zeros; partial sums are still reduced down each column.]

Multiplication with an operand that is zero is considered underutilized, so 87.5% of the PEs are not utilized for this GEMM.

Weight stationary systolic arrays are underutilized for sparse GEMMs because they have to map zeros. How can we map only nonzeros stationary?

Can we map other nonzero elements where the idle PEs used to be? No: the dark blue and light purple clusters accumulate together, which is incorrect. Systolic reduction only allows fixed size dot products, whose size is the height of a column.

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have large load and reduction latency.

Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
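
Extending the earlier utilization sketch with a stationary-matrix density factor reproduces the 87.5% figure; treating sparsity as a simple multiplicative scaling is an assumption of this toy model, not a statement about real hardware behavior.

```python
def sparse_mapping_utilization(K, N, density, rows=128, cols=128):
    """Useful-MAC fraction when zeros must also be mapped stationary (toy model):
    the irregular-mapping utilization scaled by the stationary matrix's nonzero density."""
    used_slots = min(K, rows) * min(N, cols)
    return (used_slots / (rows * cols)) * density

# Irregular GNMT GEMM (K = 32, N = 4096) with a half-zero stationary matrix
print(sparse_mapping_utilization(K=32, N=4096, density=0.5))   # 0.125 -> 87.5% of PEs idle
```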

Mapping GEMMs onto TPUs - Scalability

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: loading a 128 x 128 stationary tile of the dense (M = 256, N = 256, K = 2048) GEMM into the array one row per cycle: cycle 1, cycle 2, cycle 3, ..., cycle 128.]

Loading takes 128 cycles, which scales at O(sqrtN) in the number of PEs N.

Reduction will also take 128 cycles for the last accumulate to finish before loading a new portion of the stationary matrix.

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have significant load and reduction latency.

Takeaway: Systolic Arrays are underutilized on emerging GEMM workloads that are both sparse and irregular.
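
The sketch below compares the latency scalings quoted on these slides, with N being the number of PEs: O(sqrtN) load and drain for the systolic array versus O(1) distribution and O(log2N) reduction claimed for SIGMA. Constant factors are ignored and the PE counts are illustrative.

```python
import math

def systolic_load_and_drain_cycles(num_pes):
    """Systolic array: loading a stationary tile and draining the last column each take
    one array dimension's worth of cycles, i.e. O(sqrt(N)) in the PE count."""
    dim = int(math.sqrt(num_pes))
    return dim, dim                                  # (load cycles, reduction/drain cycles)

def sigma_load_and_reduce_cycles(num_pes):
    """SIGMA, as claimed on the slides: O(1) distribution and O(log2(N)) tree reduction."""
    return 1, math.ceil(math.log2(num_pes))

for n in (16 * 1024, 128 * 1024):
    print(n, systolic_load_and_drain_cycles(n), sigma_load_and_reduce_cycles(n))
# 16384  -> (128, 128) vs (1, 14)
# 131072 -> (362, 362) vs (1, 17)
```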

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


Key Accelerator Requirements

Flexibility
○ Systolic array limitation: rigid aspect ratio
○ SIGMA desired trait: ability to mimic any 2D aspect ratio

Sparsity Support
○ Systolic array limitation: data forwarding every cycle in the horizontal / vertical direction requires the systolic array to map zeros
○ SIGMA desired traits: support sparsity by mapping only nonzeros stationary; ability to create simultaneous variable sized dot products

Scalability
○ Systolic array limitation: O(sqrtN) distribution and O(sqrtN) reduction
○ SIGMA desired traits: O(1) distribution and O(logN) reduction

With flexible and scalable interconnects between all PEs, SIGMA can address all three requirements.

Systolic vs SIGMA High Level Interconnects

[Figure: a 4 x 4 systolic array of PEs next to a 16-PE SIGMA unit, whose PEs are connected by a distribution network on top and a reduction network below.]

** Microarchitecture details on the networks will be discussed later.

4 x 4 Systolic Array:
● rigid aspect ratio
● fixed size dot product
● O(sqrtN) distribution and reduction

16 PE SIGMA, distribution network:
● allows SIGMA to mimic any aspect ratio to address irregular GEMMs
● ability to send any streaming element to any PE
● O(1) distribution

16 PE SIGMA, reduction network:
● allows SIGMA to reduce variable sized dot products
● addresses sparsity and irregularity
● O(logN) reduction

Irregular GEMMs on SIGMA

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: an irregular GEMM whose 16 stationary elements (A through P) all fit in the 16-PE SIGMA at once, while only half of them fit in the 4 x 4 systolic array at a time because of its rigid aspect ratio.]

Set up the stationary values. Then, on each following cycle, multicast the first, second, third, and fourth rows of MK to the corresponding stationary elements.

After accumulation, SIGMA is done. However, the systolic array has to map the other side of the stationary matrix and stream in the MK matrix again (referred to as folding). SIGMA reduces the number of folds, which in turn reduces the number of memory references on the streaming matrix.

Final cycle count: Systolic Array total runtime is 24 cycles; SIGMA total runtime is 13 cycles.

SIGMA maximizes PE utilization with its flexible interconnects for irregular GEMMs.
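
The fold-count comparison can be sketched in a few lines. The stationary shape below (K = 2, N = 8, i.e. 16 elements) is inferred from the figure and is an assumption; the SIGMA fold formula is a simplification that only counts stationary elements, ignoring distribution and reduction details.

```python
import math

def systolic_folds(K, N, rows=4, cols=4):
    """Weight-stationary folds for a rows x cols systolic array: one fold per K x N tile,
    regardless of how many PEs in the tile hold useful elements."""
    return math.ceil(K / rows) * math.ceil(N / cols)

def sigma_folds(K, N, num_pes=16):
    """SIGMA folds under the flexible mapping above: stationary elements can be packed
    into any shape, so only the total element count matters (simplified)."""
    return math.ceil((K * N) / num_pes)

# Assumed toy stationary matrix from the walkthrough: K = 2, N = 8 (16 elements)
print(systolic_folds(2, 8), sigma_folds(2, 8))   # 2 folds vs 1 fold
```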

Sparse Irregular GEMMs on SIGMA

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: the same walkthrough with a sparse stationary matrix. The 16-PE SIGMA Flex-DPE packs only the nonzero stationary elements into its PEs, while the 4 x 4 systolic array must also map the zeros.]

Set up the stationary values. Then, on each following cycle, multicast the first, second, third, and fourth rows of MK to the corresponding stationary elements.

After accumulation, SIGMA is done. However, the systolic array has to map another part of the stationary matrix and stream in the MK matrix again (referred to as folding), and then yet another part after that.

Final cycle count: Systolic Array total runtime is 34 cycles; SIGMA total runtime is 13 cycles.

SIGMA maps only nonzeros stationary and therefore reduces the number of folds needed.
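
Under the same simplified fold model, mapping only nonzeros makes SIGMA's fold count scale with the nonzero count rather than the full matrix size; the 4 x 12 stationary shape and 16 nonzeros below are hypothetical numbers chosen to mirror the walkthrough.

```python
import math

def systolic_folds(K, N, rows=4, cols=4):
    # A weight-stationary systolic array must map zeros too, so sparsity does not reduce folds.
    return math.ceil(K / rows) * math.ceil(N / cols)

def sigma_folds_sparse(nnz, num_pes=16):
    # SIGMA keeps only nonzero stationary elements, so folds scale with nnz (simplified).
    return math.ceil(nnz / num_pes)

# Hypothetical 4 x 12 stationary matrix with only 16 nonzeros
K, N, nnz = 4, 12, 16
print(systolic_folds(K, N), sigma_folds_sparse(nnz))   # 3 folds vs 1 fold
```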

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


O(1) Distribution Topology

[Figure: crossbar and Benes networks routing from sources to destinations, shown for unicast (loading the stationary matrix) and multicast (sending the streaming matrix).]

SIGMA's distribution network can be either a Crossbar or a Benes network. We chose Benes because its number of switches scales as O(N logN), versus O(N^2) for a crossbar.
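
A quick switch-count comparison using the standard formulas for a full crossbar (N^2 crosspoints) and an N-input Benes network ((2 log2 N - 1) stages of N/2 two-by-two switches); the N values are illustrative.

```python
import math

def crossbar_switches(n):
    # A full crossbar needs one crosspoint switch per (source, destination) pair.
    return n * n

def benes_switches(n):
    # An n-input Benes network has (2*log2(n) - 1) stages of n/2 two-by-two switches.
    stages = 2 * int(math.log2(n)) - 1
    return stages * (n // 2)

for n in (128, 16384):
    print(n, crossbar_switches(n), benes_switches(n))
# 128   -> 16384 crosspoints vs 832 2x2 switches
# 16384 -> 268435456 crosspoints vs 221184 2x2 switches
```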

O(logN) Reduction (Limitation of Adder Tree)

[Figure: a 32-input reduction tree over multiplier outputs 0 through 31, whose products belong to different clusters of partial sums (a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e). Legend: N-to-2 mux, 2-input BF16 mult switch, 2-input BF32 adder switch, forwarding FF. Reduction time is Log2(N) levels.]

A regular adder tree will accumulate across cluster boundaries (e.g. a + b), which is functionally incorrect.

Forwarding Adder Network (FAN)

FAN is optimized for floating point reductions, which are commonly used during DNN training. It is pipelined at 1 cycle per adder level.

Cycle 1: bypass conflicting partial sums. The adder is bypassed and the partial sum is forwarded to the next level.
Cycle 2: partial sum red completes and is sent to the output buffer.
Cycle 3: partial sums green and orange complete.
Cycle 4: partial sum blue completes.
Cycle 5: partial sum purple completes.

Note: The output buffer has FFs to maintain correct timing, since different clusters may complete at different times.

Our novel FAN topology is both lightweight and flexible. It can replace regular adder trees in other hardware accelerators.

More details such as the routing algorithm and overhead analysis can be found in the paper.
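
FAN itself is a hardware structure, but its functional behavior (reducing variable sized dot products in one pass) can be illustrated with a software segmented-sum sketch. The cluster string matches the figure; the product values are placeholders, and no forwarding links or pipelining are modeled.

```python
from itertools import groupby

def segmented_sums(values, cluster_ids):
    """Functional analogue of FAN's variable-sized reduction: sum each contiguous cluster
    of multiplier outputs independently (forwarding and pipelining are not modeled)."""
    pairs = zip(cluster_ids, values)
    return {k: sum(v for _, v in grp) for k, grp in groupby(pairs, key=lambda p: p[0])}

# The 32 multiplier outputs from the figure, grouped into clusters a..e
ids  = list("aaa" + "b" * 6 + "c" * 4 + "d" * 11 + "e" * 8)
vals = [1.0] * len(ids)                  # illustrative products
print(segmented_sums(vals, ids))         # {'a': 3.0, 'b': 6.0, 'c': 4.0, 'd': 11.0, 'e': 8.0}
```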

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


SIGMA High Level Diagram

Note: the SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks
● Contain the bitmap compression format of the GEMM matrices.

Global Controller
● Logic comparisons on bitmaps to determine which nonzero stationary elements are required.

Sparsity Filter && Input Data Arbiter
● Reorganizes data for loading stationary elements and sending streaming elements.

Accumulation SRAM
● Buffer for partial sum accumulations.

SIGMA Engine
● Compute engine.

[Figure: inside a Flex-DPE, a distribution bus and switches form the Benes distribution network, buffers feed the multipliers, and the FAN reduction network accumulates their outputs; Flex-DPE 0 through Flex-DPE N make up the SIGMA Engine.]
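
The slides mention a bitmap compression format but do not spell out its layout. As a generic illustration (not necessarily SIGMA's exact on-chip encoding), a bitmap-compressed matrix stores one presence bit per element plus the nonzero values packed in order:

```python
import numpy as np

def bitmap_compress(dense):
    """Generic bitmap compression: a 0/1 presence bitmap plus the nonzero values in
    row-major order (illustrative; the slides do not specify SIGMA's exact layout)."""
    bitmap = (dense != 0).astype(np.uint8)
    values = dense[dense != 0]
    return bitmap, values

def bitmap_decompress(bitmap, values, dtype=np.float32):
    dense = np.zeros(bitmap.shape, dtype=dtype)
    dense[bitmap == 1] = values     # refill nonzeros at the positions flagged in the bitmap
    return dense

a = np.array([[0, 3, 0, 0],
              [7, 0, 0, 5]], dtype=np.float32)
bm, vals = bitmap_compress(a)
assert np.array_equal(bitmap_decompress(bm, vals), a)
print(bm, vals)   # [[0 1 0 0] [1 0 0 1]] [3. 7. 5.]
```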

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


Methodology

● Hardware components are written in Verilog.
● Post-layout area and power numbers are on a 28nm process.
● An analytical model for cycle counts assumes uniform random sparsity.
● GEMMs used for evaluation: the GNMT, DeepBench, Transformer, and NCF GEMMs listed in the earlier table.

SIGMA vs TPU - Dense GEMMs

● For GEMMs with dimensions that fit perfectly on the TPU, SIGMA offers a slight speedup from its O(1) distribution and O(log N) reduction.

● For the 1024 x 16 x 500000 DeepBench GEMM, SIGMA offers an 8x speedup: the N dimension of 16 cannot fill the TPU array, and the O(log N) reduction is more effective since K = 500000.

● More broadly, SIGMA offers large speedups from its scalable FAN reduction (most effective when K is large, e.g. K = 500000), and such GEMMs cannot fit perfectly on the TPU because of their N dimension.

SIGMA performs on average 1.8x better than systolic array architectures for irregular GEMMs.

139
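Why the O(log N) reduction matters for large K can be seen with a small sketch (illustrative Python, not tied to the SIGMA RTL): it compares the number of dependent add steps needed to combine k partial products through a linear chain versus a binary adder tree.

import math

def linear_reduction_steps(k):
    # each partial product hops through one adder per step (store and forward)
    return max(k - 1, 0)

def tree_reduction_steps(k):
    # a binary adder tree needs ceil(log2(k)) dependent add levels
    return math.ceil(math.log2(k)) if k > 1 else 0

for k in (16, 4096, 500000):
    print(k, linear_reduction_steps(k), tree_reduction_steps(k))
# 500000 partial products: 499999 dependent adds in a chain vs 19 tree levels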

SIGMA vs TPU - Sparse GEMMs

SIGMA enables speedup by mapping only nonzero elements stationary.

SIGMA performs on average 5.7x better than systolic array architectures for sparse and irregular GEMMs.

142

SIGMA vs Sparse Accelerators

143

SIGMA Qualitative Analysis

SIGMA performs on average 3x better than state-of-the-art sparse accelerators. In-depth analysis can be found in the paper.

145

Systolic Array vs SIGMA Comparison

** Effective TFLOPS is calculated by multiplying the base dense TFLOPS by the average efficiency computed across GEMMs.

SIGMA consumes 38% more area and 82% more power than the systolic array.

SIGMA achieves 5.7x higher effective TFLOPS, for a 3.2x higher effective TFLOPS/W.

SIGMA consumes more resources but achieves higher effective TFLOPS/W.

149
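As an illustration of the effective-TFLOPS formula (hypothetical numbers, not from the evaluation): a dense engine with a 10 TFLOPS peak that averages 57% efficiency across the GEMMs delivers 10 x 0.57 = 5.7 effective TFLOPS.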

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

150

Conclusion

● GEMMs are a key component of Deep Learning workloads, but they are often irregular and sparse.

● Achieving high utilization on systolic arrays is challenging because of their rigid structure.

● SIGMA enables high compute utilization on sparse, irregular GEMMs.

● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse accelerators, at the cost of extra hardware, specifically the O(1) distribution and the novel FAN reduction interconnects.

● SIGMA achieves 3.2x higher effective TFLOPS/W than systolic arrays.

● Future work: optimizations such as power gating and software stack design.

Thank you! 😊

156

Material Backup

157

Walkthrough: Example GEMM

Steps:
i) Get bitmap from memory
ii) Row-wise OR on streaming matrix
iii) Element-wise AND on stationary matrix
iv) Generate stationary' bitmap
vi) Unicast stationary values
vii) Get source - destination pairs
viii) Multicast streaming values and reduce

Compressed bitmap equivalent:

Stationary Bitmap (M x K):
0 1 1 1
1 1 1 0
1 1 1 0

Streaming Bitmap (K x N):
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0

Dim. Registers: M-dim 3, N-dim 4, K-dim 4

Stationary nonzero data: A B C D E F G H I
Streaming nonzero data: a b c d e f g

158

Walkthrough

ii) Row-wise OR on streaming matrix, iii) Element-wise AND on stationary matrix, iv) Generate stationary' bitmap:

Streaming Bitmap:        Row-wise OR:
1 1 0 1                  1
0 1 1 0                  1
1 0 0 1                  1
0 0 0 0                  0

Stationary Bitmap:       Stationary' Bitmap (each column ANDed with the corresponding OR bit):
0 1 1 1                  0 1 1 0
1 1 1 0                  1 1 1 0
1 1 1 0                  1 1 1 0

Stationary nonzero data: A B C D E F G H I

The main idea is to find the necessary nonzero elements to map stationary: C is dropped because the streaming row it would multiply (the all-zero 4th row) contributes nothing.

165
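The same bitmap logic can be written as a short functional sketch (illustrative Python, not the controller's RTL; the names are ours). It reproduces the stationary' bitmap above and the set of stationary values that actually get mapped:

def row_or(strm_bitmap):
    # ii) 1 for every streaming row that contains at least one nonzero
    return [1 if any(row) else 0 for row in strm_bitmap]

def stationary_prime(stat_bitmap, strm_or):
    # iii)-iv) AND each stationary column k with the OR bit of streaming row k
    return [[b & strm_or[k] for k, b in enumerate(row)] for row in stat_bitmap]

def mapped_values(stat_bitmap, stat_prime, stat_vals):
    # keep only the compressed nonzero values whose bit survives in stationary'
    vals, it = [], iter(stat_vals)
    for row, row_p in zip(stat_bitmap, stat_prime):
        for b, bp in zip(row, row_p):
            if b:
                v = next(it)
                if bp:
                    vals.append(v)
    return vals

stat_bitmap = [[0, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 0]]
strm_bitmap = [[1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
stat_p = stationary_prime(stat_bitmap, row_or(strm_bitmap))
# stat_p == [[0, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]]
print(mapped_values(stat_bitmap, stat_p, list("ABCDEFGHI")))
# ['A', 'B', 'D', 'E', 'F', 'G', 'H', 'I']   (C is dropped)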

Walkthrough

vi) Unicast stationary values:

Cycle 1: unicast stationary elements A B D E to Flex-DPE 0
Cycle 2: unicast stationary elements F G H I to Flex-DPE 1

[Figure: Flex-DPE 0 and Flex-DPE 1, each with a Benes distribution network (distribution bus and switches), buffers, multipliers, and a FAN reduction network; the buffers of Flex-DPE 0 hold A B D E and those of Flex-DPE 1 hold F G H I.]

167

Walkthrough

vii) Get source - destination pairs:

Stationary' Bitmap:
0 1 1 0
1 1 1 0
1 1 1 0

Streaming Bitmap:
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0

[Figure: the controller walks the stationary' and streaming bitmaps to build the output bitmap and the Row IDs (0, 1, 2) that tag each mapped multiplier with the output row its products accumulate into.]

The main idea is to find the matching source and destination indices.

170
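One way to read step vii (a hedged interpretation in illustrative Python, not the exact controller logic): every surviving stationary nonzero becomes one multiplier, tagged with the output row its products reduce into and the streaming row it must receive from the distribution network.

def source_destination_pairs(stat_prime):
    pairs = []          # (multiplier_index, output_row_m, streaming_row_k)
    idx = 0
    for m, row in enumerate(stat_prime):
        for k, bit in enumerate(row):
            if bit:
                pairs.append((idx, m, k))
                idx += 1
    return pairs

stat_prime = [[0, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]]
print(source_destination_pairs(stat_prime))
# A, B reduce into output row 0; D, E, F into row 1; G, H, I into row 2.
# Their streaming sources are rows k = 1, 2 / 0, 1, 2 / 0, 1, 2 respectively.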

Walkthrough

viii) Multicast streaming values and reduce. Flex-DPE 0 holds stationary values A B D E and Flex-DPE 1 holds F G H I. One column of the streaming matrix is multicast per cycle, and the FAN network reduces the products that belong to the same output row:

Cycle 3: multicast 1st column of the streaming matrix (a, f) and reduce
Cycle 4: multicast 2nd column of the streaming matrix (b, d) and reduce
Cycle 5: multicast 3rd column of the streaming matrix (e) and reduce
Cycle 6: multicast last column of the streaming matrix (c, g) and reduce
Cycles 7-10: remaining partial sums complete through the pipelined FAN reduction

[Figure: per-cycle snapshots of the Benes distribution network, buffers, multipliers, and FAN reduction of Flex-DPE 0 and Flex-DPE 1. The completed outputs are: row 0: Bf, Ad, Ae, Bg; row 1: Da + Ff, Db + Ed, Ee, Dc + Fg; row 2: Ga + If, Gb + Hd, He, Gc + Ig.]

180
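Putting the whole walkthrough together, the following functional sketch (illustrative Python operating on symbolic values; it is not the hardware, and the helper names are ours) reproduces the outputs shown in the cycle-by-cycle figures, e.g. Da + Ff and Gc + Ig:

def expand(bitmap, nonzeros):
    # scatter the compressed nonzero list back onto the bitmap (row-major)
    it = iter(nonzeros)
    return [[next(it) if b else "" for b in row] for row in bitmap]

def sigma_walkthrough(stat_bitmap, stat_vals, strm_bitmap, strm_vals):
    strm_or = [1 if any(r) else 0 for r in strm_bitmap]                        # step ii
    stat_p = [[b & strm_or[k] for k, b in enumerate(r)] for r in stat_bitmap]  # steps iii-iv
    stat, strm = expand(stat_bitmap, stat_vals), expand(strm_bitmap, strm_vals)
    M, K, N = len(stat_bitmap), len(strm_bitmap), len(strm_bitmap[0])
    out = [["--"] * N for _ in range(M)]
    for n in range(N):                 # viii) multicast one streaming column per cycle
        for m in range(M):             # FAN reduces products that share output row m
            terms = [stat[m][k] + strm[k][n]          # symbolic product, e.g. "Da"
                     for k in range(K) if stat_p[m][k] and strm[k][n]]
            if terms:
                out[m][n] = " + ".join(terms)
    return out

stat_bitmap = [[0, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 0]]
strm_bitmap = [[1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
for row in sigma_walkthrough(stat_bitmap, list("ABCDEFGHI"),
                             strm_bitmap, list("abcdefg")):
    print(row)
# ['Bf', 'Ad', 'Ae', 'Bg']
# ['Da + Ff', 'Db + Ed', 'Ee', 'Dc + Fg']
# ['Ga + If', 'Gb + Hd', 'He', 'Gc + Ig']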