
SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training

ERIC QIN*, ANANDA SAMAJDAR*, HYOUKJUN KWON*, VINEET NADELLA*, SUDARSHAN SRINIVASAN#, DIPANKAR DAS#, BHARAT KAUL#, TUSHAR KRISHNA*

* GEORGIA TECH   # INTEL

http://synergy.ece.gatech.edu


Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion



Deep Learning Applications

Speech Recognition, Recommender Systems, Language Understanding

What is the key computation for these Deep Learning applications?

Runtime breakdown on V100 GPU

[Chart: fraction of training runtime spent in GEMMs. Transformer (Language Understanding): 77%; GNMT (Machine Translation): 64%.]

Matrix multiplications (GEMMs) consume around 70% of the total runtime when training modern deep learning workloads.

GEMMs in Deep Learning

[Figure: the three training GEMMs, derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.]

• Forward pass (inference and training): Input Feature Map [M x K] x Weights [K x N] = Output Feature Map [M x N]
• Backward pass (training): Output Feature Error [M x N] x WeightsT [N x K] = Input Feature Error [M x K]
• Gradient computation (training): Input Feature MapT [K x M] x Output Feature Error [M x N] = ΔWeights [K x N]

GEMM MNK dimension representation:
○ M dim: batch size
○ N dim: number of channels in the next layer
○ K dim: H * W * C

GEMM is a key compute primitive to accelerate in hardware to speed up training.
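
To make the three GEMM shapes concrete, here is a minimal NumPy sketch of one fully connected layer; the layer sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical fully connected layer (illustrative shapes):
# M = batch size, K = input features (H * W * C), N = output channels
M, K, N = 128, 4096, 2048

x  = np.random.randn(M, K).astype(np.float32)   # input feature map      [M x K]
w  = np.random.randn(K, N).astype(np.float32)   # weights                [K x N]
dy = np.random.randn(M, N).astype(np.float32)   # output feature error   [M x N]

y  = x @ w      # forward pass (inference and training): [M x K] @ [K x N] -> [M x N]
dx = dy @ w.T   # backward pass (training):              [M x N] @ [N x K] -> [M x K]
dw = x.T @ dy   # gradient computation (training):       [K x M] @ [M x N] -> [K x N]

print(y.shape, dx.shape, dw.shape)   # (128, 2048) (128, 4096) (4096, 2048)
```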

Hardware for Accelerating GEMMs

• SIMT architectures: Nvidia GTX GPUs
• SIMD architectures: Microsoft Brainwave, ARM Trillium, Tesla FSDC
• Systolic architectures: Google TPU, Xilinx xDNN, Nvidia Tensor Cores

Recently, systolic array based architectures have become popular for accelerating GEMMs.

Target comparison: Google TPU

TPU figure from https://cloud.google.com/tpu/docs/system-architecture

Our target comparison is with the Google TPU, which uses 128 x 128 systolic arrays.

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


TPU (Systolic Array)

Systolic Array Architectures

[Figure: a 128 x 128 systolic array of MAC PEs, with rows and columns indexed 0 through 127.]

[Figure: MAC PE microarchitecture, showing a multiplier and adder with a weight buffer, a top input and a streaming input, a streaming output and a bottom output, and a Load (0) / Accumulate (1) select.]

TPU (Systolic Array)

Mapping GEMMs onto TPUs

GEMMs used for evaluation:

Workload      Application               Example Dimensions (M, N, K)
GNMT          Machine Translation       (128, 2048, 4096), (320, 3072, 4096), (1632, 36548, 1024), (2048, 4096, 32)
DeepBench     General Workload          (1024, 16, 500000), (35, 8457, 2560)
Transformer   Language Understanding    (31999, 1024, 84), (84, 1024, 4096)
NCF           Collaborative Filtering   (2048, 1, 128), (256, 256, 2048)

Let's map this GEMM: the NCF GEMM with (M, N, K) = (256, 256, 2048).

TPU (Systolic Array)

Mapping GEMMs onto TPUs

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: a 128 x 128 tile of the stationary KN matrix (K = 2048, N = 256) is loaded into the array, and the matching 256 x 128 slice of the streaming MK matrix (M = 256, K = 2048) is streamed through it.]

The streaming elements get multiplied with the stationary elements. The partial sums get accumulated down each column.

Systolic Arrays are popular because they enable efficient data reuse and are very simple to implement.
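
As a rough illustration of the weight-stationary schedule above (not the TPU's actual scheduler), the sketch below counts how many 128 x 128 stationary tiles ("folds") a dense GEMM needs and how much of the MK matrix is re-streamed per fold; the tile size and the fold formula are simplifying assumptions for illustration only.

```python
import math

def weight_stationary_folds(M, N, K, rows=128, cols=128):
    """Number of stationary-tile loads ('folds') needed to cover a dense K x N stationary
    matrix on a rows x cols weight-stationary array, plus the MK elements streamed per fold."""
    folds = math.ceil(K / rows) * math.ceil(N / cols)   # one fold per rows x cols stationary tile
    streamed_per_fold = M * rows                        # each fold streams an M x rows slice of MK
    return folds, streamed_per_fold

# The dense example above: M = 256, N = 256, K = 2048 on a 128 x 128 array
print(weight_stationary_folds(256, 256, 2048))   # (32, 32768)
```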

Mapping GEMMs onto TPUs - Irregularity

Let's map another GEMM from the evaluation table: the GNMT GEMM with M = 2048, N = 4096, K = 32.

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: the stationary KN matrix (K = 32, N = 4096) fills only the top 32 rows of the 128 x 128 array while the MK matrix (M = 2048, K = 32) streams through; partial sums are reduced down each column.]

75% of the PEs are not utilized for this GEMM, because K = 32 occupies only 32 of the 128 rows.

The rigid structure of Systolic Arrays causes PE underutilization. How can we enable the remaining PEs?

Can we map another section of the stationary matrix onto the TPU, and duplicate the streaming data, to increase utilization? No: the systolic reduction would accumulate the dark blue and light purple partial product clusters together, which is functionally incorrect. This is due to the rigid aspect ratio of a systolic array.

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs.

Observation 3: Large systolic arrays have large load and reduction latency.

Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
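
A back-of-the-envelope utilization check under the weight-stationary assumption above; this is a simplification that looks at a single stationary tile and ignores pipeline fill and drain.

```python
def mapping_utilization(K, N, rows=128, cols=128):
    """Fraction of PEs holding useful stationary elements for one K x N tile mapping
    on a rows x cols weight-stationary array (single tile, simplified)."""
    used = min(K, rows) * min(N, cols)
    return used / (rows * cols)

print(mapping_utilization(K=2048, N=256))   # 1.0  -> the dense (256, 256, 2048) example fills the array
print(mapping_utilization(K=32,   N=4096))  # 0.25 -> the irregular GNMT example leaves 75% of PEs idle
```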

Sparsity in DNN Models

[Figure: GNMT pruning, temporal sparsity (https://www.intel.ai/compressing-gnmt-models); Transformer sparsity, impact on BLEU (The State of Sparsity in Deep Neural Networks, Gale et al., arXiv).]

Weight sparsity ranges from 40% to 90%. Activation sparsity is approximately 30% to 70% from ReLU, dropout, etc.

Mapping GEMMs onto TPUs - Sparsity

Usually these GEMMs are sparse! What happens if half of the stationary matrix is zeros?

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: the same irregular GNMT mapping (M = 2048, N = 4096, K = 32), but with half of the stationary elements being zeros; partial sums are still reduced down each column.]

Multiplication with an operand that is zero is considered underutilized, so 87.5% of the PEs are not utilized for this GEMM.

Weight stationary systolic arrays are underutilized for sparse GEMMs because they have to map zeros. How can we map only nonzeros stationary?

Can we map other nonzero elements where the idle PEs used to be? No: the dark blue and light purple clusters accumulate together, which is incorrect. Systolic reduction only allows fixed size dot products, whose size is the height of a column.

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have large load and reduction latency.

Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
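
Extending the earlier utilization sketch with a stationary-matrix density factor reproduces the 87.5% figure; treating sparsity as a simple multiplicative scaling is an assumption of this toy model, not a statement about real hardware behavior.

```python
def sparse_mapping_utilization(K, N, density, rows=128, cols=128):
    """Useful-MAC fraction when zeros must also be mapped stationary (toy model):
    the irregular-mapping utilization scaled by the stationary matrix's nonzero density."""
    used_slots = min(K, rows) * min(N, cols)
    return (used_slots / (rows * cols)) * density

# Irregular GNMT GEMM (K = 32, N = 4096) with a half-zero stationary matrix
print(sparse_mapping_utilization(K=32, N=4096, density=0.5))   # 0.125 -> 87.5% of PEs idle
```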

Mapping GEMMs onto TPUs - Scalability

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: loading a 128 x 128 stationary tile of the dense (M = 256, N = 256, K = 2048) GEMM into the array one row per cycle: cycle 1, cycle 2, cycle 3, ..., cycle 128.]

Loading takes 128 cycles, which scales at O(sqrtN) in the number of PEs N.

Reduction will also take 128 cycles for the last accumulate to finish before loading a new portion of the stationary matrix.

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have significant load and reduction latency.

Takeaway: Systolic Arrays are underutilized on emerging GEMM workloads that are both sparse and irregular.
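
The sketch below compares the latency scalings quoted on these slides, with N being the number of PEs: O(sqrtN) load and drain for the systolic array versus O(1) distribution and O(log2N) reduction claimed for SIGMA. Constant factors are ignored and the PE counts are illustrative.

```python
import math

def systolic_load_and_drain_cycles(num_pes):
    """Systolic array: loading a stationary tile and draining the last column each take
    one array dimension's worth of cycles, i.e. O(sqrt(N)) in the PE count."""
    dim = int(math.sqrt(num_pes))
    return dim, dim                                  # (load cycles, reduction/drain cycles)

def sigma_load_and_reduce_cycles(num_pes):
    """SIGMA, as claimed on the slides: O(1) distribution and O(log2(N)) tree reduction."""
    return 1, math.ceil(math.log2(num_pes))

for n in (16 * 1024, 128 * 1024):
    print(n, systolic_load_and_drain_cycles(n), sigma_load_and_reduce_cycles(n))
# 16384  -> (128, 128) vs (1, 14)
# 131072 -> (362, 362) vs (1, 17)
```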

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


Key Accelerator Requirements

Flexibility
○ Systolic array limitation: rigid aspect ratio
○ SIGMA desired trait: ability to mimic any 2D aspect ratio

Sparsity Support
○ Systolic array limitation: data forwarding every cycle in the horizontal / vertical direction requires the systolic array to map zeros
○ SIGMA desired traits: support sparsity by mapping only nonzeros stationary; ability to create simultaneous variable sized dot products

Scalability
○ Systolic array limitation: O(sqrtN) distribution and O(sqrtN) reduction
○ SIGMA desired traits: O(1) distribution and O(logN) reduction

With flexible and scalable interconnects between all PEs, SIGMA can address all three requirements.

Systolic vs SIGMA High Level Interconnects

[Figure: a 4 x 4 systolic array of PEs next to a 16-PE SIGMA unit, whose PEs are connected by a distribution network on top and a reduction network below.]

** Microarchitecture details on the networks will be discussed later.

4 x 4 Systolic Array:
● rigid aspect ratio
● fixed size dot product
● O(sqrtN) distribution and reduction

16 PE SIGMA, distribution network:
● allows SIGMA to mimic any aspect ratio to address irregular GEMMs
● ability to send any streaming element to any PE
● O(1) distribution

16 PE SIGMA, reduction network:
● allows SIGMA to reduce variable sized dot products
● addresses sparsity and irregularity
● O(logN) reduction

Irregular GEMMs on SIGMA

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: an irregular GEMM whose 16 stationary elements (A through P) all fit in the 16-PE SIGMA at once, while only half of them fit in the 4 x 4 systolic array at a time because of its rigid aspect ratio.]

Set up the stationary values. Then, on each following cycle, multicast the first, second, third, and fourth rows of MK to the corresponding stationary elements.

After accumulation, SIGMA is done. However, the systolic array has to map the other side of the stationary matrix and stream in the MK matrix again (referred to as folding). SIGMA reduces the number of folds, which in turn reduces the number of memory references on the streaming matrix.

Final cycle count: Systolic Array total runtime is 24 cycles; SIGMA total runtime is 13 cycles.

SIGMA maximizes PE utilization with its flexible interconnects for irregular GEMMs.
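
The fold-count comparison can be sketched in a few lines. The stationary shape below (K = 2, N = 8, i.e. 16 elements) is inferred from the figure and is an assumption; the SIGMA fold formula is a simplification that only counts stationary elements, ignoring distribution and reduction details.

```python
import math

def systolic_folds(K, N, rows=4, cols=4):
    """Weight-stationary folds for a rows x cols systolic array: one fold per K x N tile,
    regardless of how many PEs in the tile hold useful elements."""
    return math.ceil(K / rows) * math.ceil(N / cols)

def sigma_folds(K, N, num_pes=16):
    """SIGMA folds under the flexible mapping above: stationary elements can be packed
    into any shape, so only the total element count matters (simplified)."""
    return math.ceil((K * N) / num_pes)

# Assumed toy stationary matrix from the walkthrough: K = 2, N = 8 (16 elements)
print(systolic_folds(2, 8), sigma_folds(2, 8))   # 2 folds vs 1 fold
```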

Sparse Irregular GEMMs on SIGMA

** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).

[Figure: the same walkthrough with a sparse stationary matrix. The 16-PE SIGMA Flex-DPE packs only the nonzero stationary elements into its PEs, while the 4 x 4 systolic array must also map the zeros.]

Set up the stationary values. Then, on each following cycle, multicast the first, second, third, and fourth rows of MK to the corresponding stationary elements.

After accumulation, SIGMA is done. However, the systolic array has to map another part of the stationary matrix and stream in the MK matrix again (referred to as folding), and then yet another part after that.

Final cycle count: Systolic Array total runtime is 34 cycles; SIGMA total runtime is 13 cycles.

SIGMA maps only nonzeros stationary and therefore reduces the number of folds needed.
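
Under the same simplified fold model, mapping only nonzeros makes SIGMA's fold count scale with the nonzero count rather than the full matrix size; the 4 x 12 stationary shape and 16 nonzeros below are hypothetical numbers chosen to mirror the walkthrough.

```python
import math

def systolic_folds(K, N, rows=4, cols=4):
    # A weight-stationary systolic array must map zeros too, so sparsity does not reduce folds.
    return math.ceil(K / rows) * math.ceil(N / cols)

def sigma_folds_sparse(nnz, num_pes=16):
    # SIGMA keeps only nonzero stationary elements, so folds scale with nnz (simplified).
    return math.ceil(nnz / num_pes)

# Hypothetical 4 x 12 stationary matrix with only 16 nonzeros
K, N, nnz = 4, 12, 16
print(systolic_folds(K, N), sigma_folds_sparse(nnz))   # 3 folds vs 1 fold
```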

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


O(1) Distribution Topology

[Figure: crossbar and Benes networks routing from sources to destinations, shown for unicast (loading the stationary matrix) and multicast (sending the streaming matrix).]

SIGMA's distribution network can be either a Crossbar or a Benes network. We chose Benes because its number of switches scales as O(N logN), versus O(N^2) for a crossbar.
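
A quick switch-count comparison using the standard formulas for a full crossbar (N^2 crosspoints) and an N-input Benes network ((2 log2 N - 1) stages of N/2 two-by-two switches); the N values are illustrative.

```python
import math

def crossbar_switches(n):
    # A full crossbar needs one crosspoint switch per (source, destination) pair.
    return n * n

def benes_switches(n):
    # An n-input Benes network has (2*log2(n) - 1) stages of n/2 two-by-two switches.
    stages = 2 * int(math.log2(n)) - 1
    return stages * (n // 2)

for n in (128, 16384):
    print(n, crossbar_switches(n), benes_switches(n))
# 128   -> 16384 crosspoints vs 832 2x2 switches
# 16384 -> 268435456 crosspoints vs 221184 2x2 switches
```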

O(logN) Reduction (Limitation of Adder Tree)

[Figure: a 32-input reduction tree over multiplier outputs 0 through 31, whose products belong to different clusters of partial sums (a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e). Legend: N-to-2 mux, 2-input BF16 mult switch, 2-input BF32 adder switch, forwarding FF. Reduction time is Log2(N) levels.]

A regular adder tree will accumulate across cluster boundaries (e.g. a + b), which is functionally incorrect.

Forwarding Adder Network (FAN)

FAN is optimized for floating point reductions, which are commonly used during DNN training. It is pipelined at 1 cycle per adder level.

Cycle 1: bypass conflicting partial sums. The adder is bypassed and the partial sum is forwarded to the next level.
Cycle 2: partial sum red completes and is sent to the output buffer.
Cycle 3: partial sums green and orange complete.
Cycle 4: partial sum blue completes.
Cycle 5: partial sum purple completes.

Note: The output buffer has FFs to maintain correct timing, since different clusters may complete at different times.

Our novel FAN topology is both lightweight and flexible. It can replace regular adder trees in other hardware accelerators.

More details such as the routing algorithm and overhead analysis can be found in the paper.
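
FAN itself is a hardware structure, but its functional behavior (reducing variable sized dot products in one pass) can be illustrated with a software segmented-sum sketch. The cluster string matches the figure; the product values are placeholders, and no forwarding links or pipelining are modeled.

```python
from itertools import groupby

def segmented_sums(values, cluster_ids):
    """Functional analogue of FAN's variable-sized reduction: sum each contiguous cluster
    of multiplier outputs independently (forwarding and pipelining are not modeled)."""
    pairs = zip(cluster_ids, values)
    return {k: sum(v for _, v in grp) for k, grp in groupby(pairs, key=lambda p: p[0])}

# The 32 multiplier outputs from the figure, grouped into clusters a..e
ids  = list("aaa" + "b" * 6 + "c" * 4 + "d" * 11 + "e" * 8)
vals = [1.0] * len(ids)                  # illustrative products
print(segmented_sums(vals, ids))         # {'a': 3.0, 'b': 6.0, 'c': 4.0, 'd': 11.0, 'e': 8.0}
```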

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


SIGMA High Level Diagram

Note: the SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks
● Contain the bitmap compression format of the GEMM matrices.

Global Controller
● Logic comparisons on bitmaps to determine which nonzero stationary elements are required.

Sparsity Filter && Input Data Arbiter
● Reorganizes data for loading stationary elements and sending streaming elements.

Accumulation SRAM
● Buffer for partial sum accumulations.

SIGMA Engine
● Compute engine.

[Figure: inside a Flex-DPE, a distribution bus and switches form the Benes distribution network, buffers feed the multipliers, and the FAN reduction network accumulates their outputs; Flex-DPE 0 through Flex-DPE N make up the SIGMA Engine.]
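
The slides mention a bitmap compression format but do not spell out its layout. As a generic illustration (not necessarily SIGMA's exact on-chip encoding), a bitmap-compressed matrix stores one presence bit per element plus the nonzero values packed in order:

```python
import numpy as np

def bitmap_compress(dense):
    """Generic bitmap compression: a 0/1 presence bitmap plus the nonzero values in
    row-major order (illustrative; the slides do not specify SIGMA's exact layout)."""
    bitmap = (dense != 0).astype(np.uint8)
    values = dense[dense != 0]
    return bitmap, values

def bitmap_decompress(bitmap, values, dtype=np.float32):
    dense = np.zeros(bitmap.shape, dtype=dtype)
    dense[bitmap == 1] = values     # refill nonzeros at the positions flagged in the bitmap
    return dense

a = np.array([[0, 3, 0, 0],
              [7, 0, 0, 5]], dtype=np.float32)
bm, vals = bitmap_compress(a)
assert np.array_equal(bitmap_decompress(bm, vals), a)
print(bm, vals)   # [[0 1 0 0] [1 0 0 1]] [3. 7. 5.]
```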

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion


Methodology

● Hardware components are written in Verilog.
● Post-layout area and power numbers are on a 28nm process.
● An analytical model for cycle counts assumes uniform random sparsity.
● GEMMs used for evaluation: the GNMT, DeepBench, Transformer, and NCF GEMMs listed in the earlier table.

SIGMA vs TPU - Dense GEMMs

● For GEMMs with dimensions that fit perfectly on the TPU, SIGMA offers a slight speedup from its O(1) distribution and O(log N) reduction.

● For the 1024 x 16 x 500000 DeepBench GEMM, SIGMA offers an 8x speedup: the N dimension of 16 cannot fill the TPU array, and the O(log N) reduction is more effective since K = 500000.

● More broadly, SIGMA offers large speedups from its scalable FAN reduction (most effective when K is large, e.g. K = 500000), and such GEMMs cannot fit perfectly on the TPU because of their N dimension.

SIGMA performs on average 1.8x better than systolic array architectures for irregular GEMMs.

139
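Why the O(log N) reduction matters for large K can be seen with a small sketch (illustrative Python, not tied to the SIGMA RTL): it compares the number of dependent add steps needed to combine k partial products through a linear chain versus a binary adder tree.

import math

def linear_reduction_steps(k):
    # each partial product hops through one adder per step (store and forward)
    return max(k - 1, 0)

def tree_reduction_steps(k):
    # a binary adder tree needs ceil(log2(k)) dependent add levels
    return math.ceil(math.log2(k)) if k > 1 else 0

for k in (16, 4096, 500000):
    print(k, linear_reduction_steps(k), tree_reduction_steps(k))
# 500000 partial products: 499999 dependent adds in a chain vs 19 tree levels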

SIGMA vs TPU - Sparse GEMMs

SIGMA enables speedup by mapping only nonzero elements stationary.

SIGMA performs on average 5.7x better than systolic array architectures for sparse and irregular GEMMs.

142

SIGMA vs Sparse Accelerators

143

SIGMA Qualitative Analysis

SIGMA performs on average 3x better than state-of-the-art sparse accelerators. In-depth analysis can be found in the paper.

145

Systolic Array vs SIGMA Comparison

** Effective TFLOPS is calculated by multiplying the base dense TFLOPS by the average efficiency computed across GEMMs.

SIGMA consumes 38% more area and 82% more power than the systolic array.

SIGMA achieves 5.7x higher effective TFLOPS, for a 3.2x higher effective TFLOPS/W.

SIGMA consumes more resources but achieves higher effective TFLOPS/W.

149
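As an illustration of the effective-TFLOPS formula (hypothetical numbers, not from the evaluation): a dense engine with a 10 TFLOPS peak that averages 57% efficiency across the GEMMs delivers 10 x 0.57 = 5.7 effective TFLOPS.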

Outline

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

150

Conclusion

● GEMMs are a key component of Deep Learning workloads, but they are often irregular and sparse.

● Achieving high utilization on systolic arrays is challenging because of their rigid structure.

● SIGMA enables high compute utilization on sparse, irregular GEMMs.

● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse accelerators, at the cost of extra hardware, specifically the O(1) distribution and the novel FAN reduction interconnects.

● SIGMA achieves 3.2x higher effective TFLOPS/W than systolic arrays.

● Future work: optimizations such as power gating and software stack design.

Thank you! 😊

156

Material Backup

157

Walkthrough: Example GEMM

Steps:
i) Get bitmap from memory
ii) Row-wise OR on streaming matrix
iii) Element-wise AND on stationary matrix
iv) Generate stationary' bitmap
vi) Unicast stationary values
vii) Get source - destination pairs
viii) Multicast streaming values and reduce

Compressed bitmap equivalent:

Stationary Bitmap (M x K):
0 1 1 1
1 1 1 0
1 1 1 0

Streaming Bitmap (K x N):
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0

Dim. Registers: M-dim 3, N-dim 4, K-dim 4

Stationary nonzero data: A B C D E F G H I
Streaming nonzero data: a b c d e f g

158

Walkthrough

ii) Row-wise OR on streaming matrix, iii) Element-wise AND on stationary matrix, iv) Generate stationary' bitmap:

Streaming Bitmap:        Row-wise OR:
1 1 0 1                  1
0 1 1 0                  1
1 0 0 1                  1
0 0 0 0                  0

Stationary Bitmap:       Stationary' Bitmap (each column ANDed with the corresponding OR bit):
0 1 1 1                  0 1 1 0
1 1 1 0                  1 1 1 0
1 1 1 0                  1 1 1 0

Stationary nonzero data: A B C D E F G H I

The main idea is to find the necessary nonzero elements to map stationary: C is dropped because the streaming row it would multiply (the all-zero 4th row) contributes nothing.

165
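The same bitmap logic can be written as a short functional sketch (illustrative Python, not the controller's RTL; the names are ours). It reproduces the stationary' bitmap above and the set of stationary values that actually get mapped:

def row_or(strm_bitmap):
    # ii) 1 for every streaming row that contains at least one nonzero
    return [1 if any(row) else 0 for row in strm_bitmap]

def stationary_prime(stat_bitmap, strm_or):
    # iii)-iv) AND each stationary column k with the OR bit of streaming row k
    return [[b & strm_or[k] for k, b in enumerate(row)] for row in stat_bitmap]

def mapped_values(stat_bitmap, stat_prime, stat_vals):
    # keep only the compressed nonzero values whose bit survives in stationary'
    vals, it = [], iter(stat_vals)
    for row, row_p in zip(stat_bitmap, stat_prime):
        for b, bp in zip(row, row_p):
            if b:
                v = next(it)
                if bp:
                    vals.append(v)
    return vals

stat_bitmap = [[0, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 0]]
strm_bitmap = [[1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
stat_p = stationary_prime(stat_bitmap, row_or(strm_bitmap))
# stat_p == [[0, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]]
print(mapped_values(stat_bitmap, stat_p, list("ABCDEFGHI")))
# ['A', 'B', 'D', 'E', 'F', 'G', 'H', 'I']   (C is dropped)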

Walkthrough

vi) Unicast stationary values:

Cycle 1: unicast stationary elements A B D E to Flex-DPE 0
Cycle 2: unicast stationary elements F G H I to Flex-DPE 1

[Figure: Flex-DPE 0 and Flex-DPE 1, each with a Benes distribution network (distribution bus and switches), buffers, multipliers, and a FAN reduction network; the buffers of Flex-DPE 0 hold A B D E and those of Flex-DPE 1 hold F G H I.]

167

Walkthrough

vii) Get source - destination pairs:

Stationary' Bitmap:
0 1 1 0
1 1 1 0
1 1 1 0

Streaming Bitmap:
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0

[Figure: the controller walks the stationary' and streaming bitmaps to build the output bitmap and the Row IDs (0, 1, 2) that tag each mapped multiplier with the output row its products accumulate into.]

The main idea is to find the matching source and destination indices.

170
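One way to read step vii (a hedged interpretation in illustrative Python, not the exact controller logic): every surviving stationary nonzero becomes one multiplier, tagged with the output row its products reduce into and the streaming row it must receive from the distribution network.

def source_destination_pairs(stat_prime):
    pairs = []          # (multiplier_index, output_row_m, streaming_row_k)
    idx = 0
    for m, row in enumerate(stat_prime):
        for k, bit in enumerate(row):
            if bit:
                pairs.append((idx, m, k))
                idx += 1
    return pairs

stat_prime = [[0, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]]
print(source_destination_pairs(stat_prime))
# A, B reduce into output row 0; D, E, F into row 1; G, H, I into row 2.
# Their streaming sources are rows k = 1, 2 / 0, 1, 2 / 0, 1, 2 respectively.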

Walkthrough

viii) Multicast streaming values and reduce. Flex-DPE 0 holds stationary values A B D E and Flex-DPE 1 holds F G H I. One column of the streaming matrix is multicast per cycle, and the FAN network reduces the products that belong to the same output row:

Cycle 3: multicast 1st column of the streaming matrix (a, f) and reduce
Cycle 4: multicast 2nd column of the streaming matrix (b, d) and reduce
Cycle 5: multicast 3rd column of the streaming matrix (e) and reduce
Cycle 6: multicast last column of the streaming matrix (c, g) and reduce
Cycles 7-10: remaining partial sums complete through the pipelined FAN reduction

[Figure: per-cycle snapshots of the Benes distribution network, buffers, multipliers, and FAN reduction of Flex-DPE 0 and Flex-DPE 1. The completed outputs are: row 0: Bf, Ad, Ae, Bg; row 1: Da + Ff, Db + Ed, Ee, Dc + Fg; row 2: Ga + If, Gb + Hd, He, Gc + Ig.]

180
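Putting the whole walkthrough together, the following functional sketch (illustrative Python operating on symbolic values; it is not the hardware, and the helper names are ours) reproduces the outputs shown in the cycle-by-cycle figures, e.g. Da + Ff and Gc + Ig:

def expand(bitmap, nonzeros):
    # scatter the compressed nonzero list back onto the bitmap (row-major)
    it = iter(nonzeros)
    return [[next(it) if b else "" for b in row] for row in bitmap]

def sigma_walkthrough(stat_bitmap, stat_vals, strm_bitmap, strm_vals):
    strm_or = [1 if any(r) else 0 for r in strm_bitmap]                        # step ii
    stat_p = [[b & strm_or[k] for k, b in enumerate(r)] for r in stat_bitmap]  # steps iii-iv
    stat, strm = expand(stat_bitmap, stat_vals), expand(strm_bitmap, strm_vals)
    M, K, N = len(stat_bitmap), len(strm_bitmap), len(strm_bitmap[0])
    out = [["--"] * N for _ in range(M)]
    for n in range(N):                 # viii) multicast one streaming column per cycle
        for m in range(M):             # FAN reduces products that share output row m
            terms = [stat[m][k] + strm[k][n]          # symbolic product, e.g. "Da"
                     for k in range(K) if stat_p[m][k] and strm[k][n]]
            if terms:
                out[m][n] = " + ".join(terms)
    return out

stat_bitmap = [[0, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 0]]
strm_bitmap = [[1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
for row in sigma_walkthrough(stat_bitmap, list("ABCDEFGHI"),
                             strm_bitmap, list("abcdefg")):
    print(row)
# ['Bf', 'Ad', 'Ae', 'Bg']
# ['Da + Ff', 'Db + Ed', 'Ee', 'Dc + Fg']
# ['Ga + If', 'Gb + Hd', 'He', 'Gc + Ig']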